Autoheal torture test - Initial success, then a terminal state

Matt Pietrek
Because we're sometimes just mean to our software, I wrote a torture test to see how RabbitMQ's Autoheal deals with repeated partitions.

In a nutshell, we start with two brokers (3.2.4) in a cluster. I run my test which uses "iptables" to knock out the link between the two brokers and then restore things.

It does this break/fix continuously in a loop. The time between partitions and the time spent inside each partition are configurable.

With 60 seconds between partitions and 60 seconds in a partitioned state, I expect this to be messy: the brokers try to autoheal, and then everything falls apart again. However, I'd expect that once I stop the torture and return things to "normal", an autoheal will eventually succeed and the brokers will be happily clustered again.

This isn't what happens. Instead, the two brokers essentially ignore each other, even after I wait 10+ minutes. I can see each broker, but each thinks the other is missing.

Here's a filtered view of the logs, grepping for "Autoheal|Starting|Stopping|Partitions|Winner|Loser":

[hidden email]: Autoheal request sent to rabbit@mq1

[hidden email]: Autoheal: I am the winner, waiting for [rabbit@mq1] to stop

[hidden email]: Autoheal: I am the winner, waiting additionally for [rabbit@mq1] to stop

[hidden email]: Autoheal request sent to rabbit@mq1

[hidden email]: Autoheal request received from rabbit@mq1

[hidden email]: Autoheal decision

[hidden email]:  * Partitions: [[rabbit@mq1],[rabbit@mq2]]

[hidden email]:  * Winner:     rabbit@mq2

[hidden email]:  * Losers:     [rabbit@mq1]

[hidden email]: Autoheal request received from rabbit@mq2

[hidden email]: Autoheal decision

[hidden email]:  * Partitions: [[rabbit@mq1],[rabbit@mq2]]

[hidden email]:  * Winner:     rabbit@mq2

[hidden email]:  * Losers:     [rabbit@mq1]

[hidden email]: Autoheal: we were selected to restart; winner is rabbit@mq2

[hidden email]: Stopping RabbitMQ

[hidden email]: Autoheal: aborting - rabbit@mq1 went down

[hidden email]: Autoheal request sent to rabbit@mq1

[hidden email]: Autoheal: we were selected to restart; winner is rabbit@mq1

[hidden email]: Stopping RabbitMQ

[hidden email]: Autoheal: aborting - rabbit@mq2 went down

[hidden email]: Autoheal request sent to rabbit@mq1

[hidden email]: Autoheal request received from rabbit@mq2

[hidden email]: Autoheal decision

[hidden email]:  * Partitions: [[rabbit@mq1],[rabbit@mq2]]

[hidden email]:  * Winner:     rabbit@mq1

[hidden email]:  * Losers:     [rabbit@mq2]

[hidden email]: Autoheal request received from rabbit@mq1

[hidden email]: Autoheal decision

[hidden email]:  * Partitions: [[rabbit@mq1],[rabbit@mq2]]

[hidden email]:  * Winner:     rabbit@mq1

[hidden email]:  * Losers:     [rabbit@mq2]

[hidden email]: Autoheal: I am the winner, waiting for [rabbit@mq2] to stop

[hidden email]: Autoheal: I am the winner, waiting additionally for [rabbit@mq2] to stop

[hidden email]: Autoheal: aborting - rabbit@mq1 went down

[hidden email]: Autoheal request sent to rabbit@mq1

[hidden email]: Autoheal: we were selected to restart; winner is rabbit@mq1

[hidden email]: Autoheal: aborting - rabbit@mq2 went down

[hidden email]: Autoheal request sent to rabbit@mq1

[hidden email]: Autoheal request received from rabbit@mq2

[hidden email]: Autoheal decision

[hidden email]:  * Partitions: [[rabbit@mq1],[rabbit@mq2]]

[hidden email]:  * Winner:     rabbit@mq1

[hidden email]:  * Losers:     [rabbit@mq2]

[hidden email]: Autoheal request received from rabbit@mq1

[hidden email]: Autoheal decision

[hidden email]:  * Partitions: [[rabbit@mq1],[rabbit@mq2]]

[hidden email]:  * Winner:     rabbit@mq1

[hidden email]:  * Losers:     [rabbit@mq2]

[hidden email]: Autoheal: I am the winner, waiting for [rabbit@mq2] to stop

[hidden email]: Autoheal: I am the winner, waiting additionally for [rabbit@mq2] to stop

# And nothing else beyond this, even after waiting for 10+ minutes.

I don't ever see the "Stopping RabbitMQ" that I've seen in other Autoheal circumstances.

I can send more complete logs, but wanted to see if this is a known issue or expected behavior first.


Matt


_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Re: Autoheal torture test - Initial success, then a terminal state

Tim Watson-6
I don't suppose you can post the code that you're using to trigger this, can you?

Cheers,
Tim

On 27 Mar 2014, at 22:15, Matt Pietrek wrote:

> Because we're sometimes just mean to our software, I wrote a torture test to see how RabbitMQ's Autoheal deals with repeated partitions.

Re: Autoheal torture test - Initial success, then a terminal state

Simon MacMullen-2
In reply to this post by Matt Pietrek
On 27/03/14 22:15, Matt Pietrek wrote:
> In a nutshell, we start with two brokers (3.2.4) in a cluster. I run my
> test which uses "iptables" to knock out the link between the two brokers
> and then restore things.

There are a couple more autoheal bugs which have been fixed but have not
made it into a release yet. Could you try the same test with a nightly build?

Cheers, Simon

--
Simon MacMullen
RabbitMQ, Pivotal

Re: Autoheal torture test - Initial success, then a terminal state

James
In reply to this post by Matt Pietrek
Hi Matt,

I've experienced some issues with autoheal as well (see thread 28127). When
you tested with iptables, did your test use REJECT or DROP? I've always had
the most issues with DROP (simulating an abrupt network outage). With REJECT
I haven't seen any issues.

Thanks,
James Eddy



Re: Autoheal torture test - Initial success, then a terminal state

Matt Pietrek
In reply to this post by Simon MacMullen-2
Thanks Simon.

Assuming the nightly build has a .deb package, I can give it a go.

Is there a particular build I should grab?

Matt


On Fri, Mar 28, 2014 at 3:19 AM, Simon MacMullen <[hidden email]> wrote:
> On 27/03/14 22:15, Matt Pietrek wrote:
>> In a nutshell, we start with two brokers (3.2.4) in a cluster. I run my
>> test which uses "iptables" to knock out the link between the two brokers
>> and then restore things.
>
> There are a couple more autoheal bugs which have been fixed but not made it into a release yet. Could you try the same test with a nightly build?
>
> Cheers, Simon
>
> --
> Simon MacMullen
> RabbitMQ, Pivotal



Re: Autoheal torture test - Initial success, then a terminal state

Matt Pietrek
In reply to this post by James
James,

I'm using DROP, like this:

# 192.168.78.16 is the IP address of a target node (not the current node) that we want to drop connectivity to
sudo iptables -I INPUT 1 -i eth0 -p tcp -s 192.168.78.16 -j DROP
sudo iptables -I OUTPUT 1 -o eth0 -p tcp -d 192.168.78.16 -j DROP
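
For readers who want to reproduce something similar: the test script itself was not posted in this thread, so the following is only a rough sketch of a break/fix loop built around the two rules above. The function names (partition, heal, torture), the variables (TARGET_IP, IFACE, ACTION, RUN), and the dry-run default are all hypothetical and chosen for illustration. Only the two "iptables -I" rules come from the thread; the matching "iptables -D" rules are assumed as the way to restore the link.

```shell
#!/bin/sh
# Hypothetical break/fix loop. All names and defaults here are illustrative,
# not taken from the actual torture test.
TARGET_IP="${TARGET_IP:-192.168.78.16}"  # peer node to cut off (not this node)
IFACE="${IFACE:-eth0}"                   # interface the cluster traffic uses
ACTION="${ACTION:-DROP}"                 # DROP as in the thread; REJECT to compare
RUN="${RUN:-echo}"                       # dry run by default: commands are only printed.
                                         # Set RUN=sudo to really apply them.

# Break the link: block TCP to and from the peer (same rules as in the thread).
partition() {
    $RUN iptables -I INPUT 1 -i "$IFACE" -p tcp -s "$TARGET_IP" -j "$ACTION"
    $RUN iptables -I OUTPUT 1 -o "$IFACE" -p tcp -d "$TARGET_IP" -j "$ACTION"
}

# Restore the link: delete the two rules by their rule specification.
heal() {
    $RUN iptables -D INPUT -i "$IFACE" -p tcp -s "$TARGET_IP" -j "$ACTION"
    $RUN iptables -D OUTPUT -o "$IFACE" -p tcp -d "$TARGET_IP" -j "$ACTION"
}

# torture CYCLES BETWEEN INSIDE
#   CYCLES   number of break/fix rounds
#   BETWEEN  seconds to wait before inducing each partition
#   INSIDE   seconds to stay partitioned
torture() {
    cycles="$1"
    between="$2"
    inside="$3"
    i=0
    while [ "$i" -lt "$cycles" ]; do
        sleep "$between"
        partition
        sleep "$inside"
        heal
        i=$((i + 1))
    done
}
```

With the file sourced, "torture 5 60 60" would walk through five rounds using the 60-second timings described earlier in the thread, printing the commands rather than running them. Setting RUN=sudo applies the rules for real, and ACTION=REJECT gives the variant James asks about.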


On Fri, Mar 28, 2014 at 11:34 AM, James Eddy <[hidden email]> wrote:
> Hi Matt,
>
> I've experienced some issues with autoheal as well (see thread 28127). When
> you tested with iptables, did your test use REJECT or DROP? I've always had
> the most issues with DROP (simulating an abrupt network outage). With REJECT
> I haven't seen any issues.
>
> Thanks,
> James Eddy



Re: Autoheal torture test - Initial success, then a terminal state

Matthias Radestock-3
In reply to this post by Matt Pietrek
On 28/03/14 21:00, Matt Pietrek wrote:
> Assuming the nightly build has a .deb package, I can give it a go.

Sure does: http://www.rabbitmq.com/nightly-builds.html
There's even a Debian repo for it.

> Is there a particular build I should grab?

The most recent.

Matthias.
