cluster meltdown

carlhoerberg
We have a two-node cluster. One AWS node (node2) froze, and we could neither start nor stop it. That made the first node unresponsive to mgmt db actions, and all API requests timed out. We restarted the first node, but a lot of queues were then inaccessible:

=ERROR REPORT==== 1-Apr-2014::02:58:03 ===
connection <0.5559.279>, channel 1 - soft error:
{amqp_error,not_found,
            "home node 'rabbit@node2' of durable queue 'celery' in vhost 'vhost1' is down or inaccessible",
            'queue.declare'}

We issued rabbitmqctl forget_cluster_node rabbit@node2, as we still couldn't access node2.

Node1 continues to report a lot of "home node of queue is down" errors.

Node2 has now restarted, but can't join the cluster. Is there a way to rejoin the cluster without resetting?

We reset node2 and tried join_cluster again, but with the following result:
Clustering node 'rabbit@node2' with 'rabbit@node1' ...
...done (already_member).

node2# rabbitmqctl cluster_status

Cluster status of node 'rabbit@node2' ...
[{nodes,[{disc,['rabbit@node2']}]},
 {running_nodes,['rabbit@node2']},
 {partitions,[]}]
...done.

But start_app on node2 doesn't make it join node1.
 
node1# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node1' ...
[{nodes,[{disc,['rabbit@node1','rabbit@node2']}]},
 {running_nodes,['rabbit@node1']},
 {partitions,[]}]
...done.
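
The mismatch between the two status reports (node1 lists node2 as a member, but not as running) can be checked from a script. This is only a minimal sketch: node_is_running is a hypothetical helper, not part of rabbitmqctl, and it assumes the 3.2.x output format shown above, with the running_nodes entry on a single line.

```shell
# Hypothetical helper (not a rabbitmqctl feature): reads the output of
# `rabbitmqctl cluster_status` on stdin and succeeds if the given node
# name appears on the running_nodes line.
# Assumes the 3.2.x Erlang-term output, with running_nodes on one line.
node_is_running() {
    node="$1"
    awk -v n="'$node'" '
        /running_nodes/ && index($0, n) { found = 1 }
        END { exit !found }
    '
}
```

Usage on node1 might look like: rabbitmqctl cluster_status | node_is_running rabbit@node2 || echo "node2 is not running in the cluster"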

node2# rabbitmqctl update_cluster_nodes rabbit@node1

Now node2 understands that it's clustered with node1 and with start_app it starts and joins node1.
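
For reference, the sequence that worked for us can be sketched as a small function to run on the node that fails to rejoin. It is an assumption on my part that stop_app is needed first (update_cluster_nodes expects the RabbitMQ application to be stopped). resync_with_cluster and the RABBITMQCTL override are hypothetical names for illustration, and this was only tried on the versions listed below.

```shell
# Sketch only: re-sync a node's view of the cluster with a running peer.
# RABBITMQCTL is a hypothetical override so the steps can be dry-run,
# e.g. RABBITMQCTL="echo rabbitmqctl".
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

resync_with_cluster() {
    # $1: a running cluster member to consult, e.g. rabbit@node1
    peer="$1"
    # Assumption: the RabbitMQ application must be stopped before
    # update_cluster_nodes is run.
    $RABBITMQCTL stop_app || return 1
    # Ask the peer for the current cluster membership.
    $RABBITMQCTL update_cluster_nodes "$peer" || return 1
    # Start the application again; the node should then join the peer.
    $RABBITMQCTL start_app
}
```

On node2 this would be called as: resync_with_cluster rabbit@node1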

RabbitMQ 3.2.3, Erlang R16B03-1