Cluster recovery due to network outages

Cluster recovery due to network outages

Aaron Westendorf
We had an outage in the internal network at our datacenter this
weekend and our rabbit cluster did not fully recover.

We have 4 hosts, all running 1.7.2.  When failures started, we saw
messages like the following (which we've seen before):


=ERROR REPORT==== 1-Aug-2010::06:13:24 ===
** Node rabbit@caerbannog not responding **
** Removing (timedout) connection **

=INFO REPORT==== 1-Aug-2010::06:13:24 ===
node rabbit@caerbannog down


A short time later the hosts recovered, also as we've seen before:

=INFO REPORT==== 1-Aug-2010::06:26:48 ===
node rabbit@caerbannog up
=ERROR REPORT==== 1-Aug-2010::06:26:48 ===
Mnesia(rabbit@bigwig): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network,
 rabbit@caerbannog}

=ERROR REPORT==== 1-Aug-2010::06:26:48 ===
Mnesia(rabbit@bigwig): ** ERROR ** mnesia_event got
{inconsistent_database, starting_partitioned_network
, rabbit@caerbannog}


Another 15 minutes later, the timed-out errors were logged again and
that was the end of the cluster; I think two of the nodes managed to
reconnect to each other, but the other two remained on their own.
The hosts and nodes themselves never shut down, and when I restarted
just one of the nodes later in the day, the whole cluster rediscovered
itself and all appeared to be well (`rabbitmqctl status` was
consistent with expectations).

So our first problem is that the nodes did not re-cluster after the
second outage.  Once we corrected the cluster, though, our applications
still did not respond and we had to restart all of our clients.

Our clients all have a lot of handling for connection drops and
channel closures, but most of them did not see any TCP disconnects from
their respective nodes.  When the cluster was fixed, we found a lot of
our queues missing (they weren't durable), and so we had to restart
all of the apps to redeclare the queues.  This still didn't fix our
installation, though, as our apps were receiving and processing data,
but responses were not being sent back out of our HTTP translators.

We have a single exchange, "response", that any application expecting a
response can bind to.  Our HTTP translators handle traffic from our
public endpoints, publish to various exchanges for the services we
offer, and those services in turn write back to the response exchange.
We have a monitoring tool that confirmed that each translator could
write a response to its own Rabbit host and immediately receive it (a
ping, more or less).  However, none of the responses from services
which were connected to other Rabbit nodes were received by the
translators.

In short, it appeared that even though the cluster was healed and all
our services had re-declared their queues, the bindings between the
response exchange and the queues which our translators use did not
appear to propagate to the rest of the nodes in the cluster.

So in summary,

* Rabbit didn't re-connect to the other nodes after the second TCP disconnect
* After fixing the cluster (manually or automatically), Rabbit appears
to have lost its non-durable queues even though the nodes never
stopped
* Although we had every indication that exchanges and queues were
still alive and functional, bindings appear to have been lost between
Rabbit nodes

What we'd like to know is,

* Does any of this make sense and can we add more detail to help fix any bugs?
* Have there been fixes for these issues since 1.7.2 that we should deploy?
* Is there anything we should add/change about our applications to
deal with these types of situations?


Thanks in advance for any help.
-Aaron


--
Aaron Westendorf
Senior Software Engineer
Agora Games
359 Broadway
Troy, NY 12180
Phone: 518.268.1000
[hidden email]
www.agoragames.com
_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Re: Cluster recovery due to network outages

Alexandru Scvorţov
Hi Aaron,

> =ERROR REPORT==== 1-Aug-2010::06:13:24 ===
> ** Node rabbit@caerbannog not responding **
> ** Removing (timedout) connection **
>
> =INFO REPORT==== 1-Aug-2010::06:13:24 ===
> node rabbit@caerbannog down

As the error message suggests, mnesia timed out its connection to
another node.
 
There was a discussion about this a while ago:
http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2010-March/006508.html

If you're expecting frequent short outages, you might consider
tweaking the timeout parameters as described above.
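For illustration, here is one way that tweak was commonly applied; this
is a hedged sketch, not a tested recommendation, and the exact hook your
startup script reads may differ by version:

```shell
# Sketch: raise the Erlang distribution tick time from its 60s default
# so short network blips are less likely to get a node declared down.
# SERVER_START_ARGS is the extra-arguments hook the rabbitmq-server
# wrapper script of this era read; the 120s value is illustrative only.
export SERVER_START_ARGS="-kernel net_ticktime 120"
```

Note that net_ticktime should be set identically on every node in the
cluster; mixed values cause nodes to disconnect each other.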

> A short time later the hosts recovered, also as we've seen before:
>
> =INFO REPORT==== 1-Aug-2010::06:26:48 ===
> node rabbit@caerbannog up
> =ERROR REPORT==== 1-Aug-2010::06:26:48 ===
> Mnesia(rabbit@bigwig): ** ERROR ** mnesia_event got
> {inconsistent_database, running_partitioned_network,
>  rabbit@caerbannog}
>
> =ERROR REPORT==== 1-Aug-2010::06:26:48 ===
> Mnesia(rabbit@bigwig): ** ERROR ** mnesia_event got
> {inconsistent_database, starting_partitioned_network
> , rabbit@caerbannog}
>

During the outage, the nodes were out of contact with each other for
so long that mnesia became worried about possible inconsistencies.

The simplest solution would be to take down 3 of the nodes and
restart them.  This should allow them to sync with the fourth.
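A dry-run sketch of that sequence (it only prints the commands;
rabbit@NODE2 and rabbit@NODE3 are placeholders for your remaining host
names):

```shell
# Print, rather than execute, the restart sequence for every node
# except the survivor (rabbit@bigwig here); on start_app the restarted
# nodes should resync their metadata from it.
restart_cmds() {
  for node in "$@"; do
    echo "rabbitmqctl -n $node stop_app && rabbitmqctl -n $node start_app"
  done
}
restart_cmds rabbit@caerbannog rabbit@NODE2 rabbit@NODE3
```

Once you've confirmed the node names, pipe the output to sh (or run the
printed commands by hand).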

There's a longer explanation available here:

http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id2277661

> Another 15 minutes later and the timedout errors were logged and that
> was the end of the cluster; I think two of the nodes figured out how
> to connect back to each other, but two others remained on their own.
> The hosts and nodes themselves never shutdown, and when I restarted
> just one of the nodes later in the day, the whole cluster rediscovered
> itself and all appeared to be well (`rabbitmqctl status` was
> consistent with expectations).
>
> So our first problem is that the nodes did not re-cluster after the
> second outage.

If this was caused by the inconsistent_database errors, there's not
much you can do apart from a restart of some of the nodes.

> Once we corrected the cluster though, our applications
> still did not respond and we had to restart all of our clients.
>
> Our clients all have a lot of handling for connection drops and
> channel closures, but most of them did not see any TCP disconnects to
> their respective nodes.  When the cluster was fixed, we found a lot of
> our queues missing (they weren't durable), and so we had to restart
> all of the apps to redeclare the queues.  This still didn't fix our
> installation though, as our apps were receiving and processing data,
> but responses were not being sent back out of our HTTP translators.
>
> We have a single exchange, "response" that any application expecting a
> response can bind to.  Our HTTP translators handle traffic from our
> public endpoints, publish to various exchanges for the services we
> offer, and those services in turn write back to the response exchange.
>  We have a monitoring tool that confirmed that these translators could
> write a response to its own Rabbit host and immediately receive it (a
> ping, more or less).  However, none of the responses from services
> which were connected to other Rabbit nodes were received by the
> translators.
>
> In short, it appeared that even though the cluster was healed and all
> our services had re-declared their queues, the bindings between the
> response exchange and the queues which our translators use did not
> appear to propagate to the rest of the nodes in the cluster.

That doesn't sound right.  As you say, if the cluster was indeed
running, the queues/exchanges/bindings should have appeared on all of
the nodes.

It's possible that the rabbit nodes reconnected successfully, but the
mnesia ones didn't.  When a rabbitmq node detects another has gone
down, it automatically removes the queues declared on it from the
cluster.  If the rabbit nodes think everything is fine, this removal
wouldn't happen.  As a result, rabbitmqctl might report
queues/exchanges/bindings that are actually unusable.
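One way to check for that state is to compare the metadata each node
reports; a sketch (node names are the two from this thread, file paths
arbitrary):

```shell
# Sketch: snapshot the queue and binding listings as seen from two
# nodes and diff them.  If the cluster is genuinely healed the listings
# should match; a difference suggests metadata did not propagate.
snapshot() {
  rabbitmqctl -n "$1" list_queues name
  rabbitmqctl -n "$1" list_bindings
}
snapshot rabbit@bigwig     > /tmp/meta.bigwig     2>/dev/null
snapshot rabbit@caerbannog > /tmp/meta.caerbannog 2>/dev/null
diff /tmp/meta.bigwig /tmp/meta.caerbannog && echo "metadata agrees"
```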

> So in summary,
>
> * Rabbit didn't re-connect to the other nodes after the second TCP disconnect

We don't have any logic in the broker to recover from
inconsistent_database errors.  Your best bet is probably to restart
all but one of the nodes.

> * After fixing the cluster (manually or automatically), Rabbit appears
> to have lost its non-durable queues even though the nodes never
> stopped
> * Although we had every indication that exchanges and queues were
> still alive and functional, bindings appear to have been lost between
> Rabbit nodes

See above.  The cluster may not have been completely repaired.  Try
restarting.

> What we'd like to know is,
>
> * Does any of this make sense and can we add more detail to help fix any bugs?

It makes some sense.  Thanks for pointing this problem out.

> * Have there been fixes for these issues since 1.7.2 that we should deploy?

Not to this, sorry.

> * Is there anything we should add/change about our applications to
> deal with these types of situations?

I'm not sure what you could do to prevent this.  This is more of a
mnesia problem.

Cheers,
Alex
Re: Cluster recovery due to network outages

Aaron Westendorf
Alex,

Thank you for your reply.  We're building up our Erlang, Rabbit and
Mnesia knowledge and I'll pass along your reply to the rest of the
team.

cheers,
Aaron

On Wed, Aug 4, 2010 at 9:45 AM, Alexandru Scvortov
<[hidden email]> wrote:



