Cluster problem after network bouncing

Cluster problem after network bouncing

Leonardo N. S. Pereira
Hi all.
I have three nodes in a cluster deployed across different AWS AZs: A, B and C.
The queues are persistent and fully mirrored across all nodes, and cluster partition handling is set to pause_minority. The RabbitMQ version I'm using is 3.3.1.
The queue policy is set to: {"ha-mode":"all","ha-sync-mode":"automatic"}.

I'm bouncing the network connection between two nodes, for instance C and B, so that:
- A can reach C and B
- C can reach A
- B can reach A
In this case, the cluster gets stuck and only recovers when all nodes are restarted.
Is this expected behavior?


If I kill one node instead of just bouncing the connection as before, the cluster keeps working, which is what I would expect.

Thanks in advance for your time and help

Best Regards,
Leo

 
Leonardo Nogueira de Sá Pereira
Tel.: +55 19 3307-5589
Cel.: +55 19 9122-5943
Skype: leonardo_pereira_77

_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Re: Cluster problem after network bouncing

Simon MacMullen-2
On 09/05/2014 20:18, Leonardo N. S. Pereira wrote:
> I'm bouncing the network connection between two nodes, for instance C
> and B, in the way that:
> - A: can reach C and B
> - C: can reach A
> - B can reach A
> In this case, the cluster get stuck and only recovers when all nodes are
> restarted.
> Is it an expected behavior?

We don't deal very well with partial partitions in general, and in this
specific case it's unclear what pause_minority even *could* do. In this
partial partition all three nodes think that they are in the majority -
so pause_minority mode has nothing to do. So I'm afraid this is
expected, yes.

Cheers, Simon
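
To make the point above concrete, here is a minimal sketch of the decision each node faces. It is a simplified model, not RabbitMQ's actual implementation: it assumes a node pauses itself when the nodes it can see (itself included) number half of the cluster or fewer. The names `pause_decisions`, `CLUSTER`, `PARTIAL` and `FULL` are made up for the illustration.

```python
from typing import Dict, FrozenSet, Set

Link = FrozenSet[str]


def visible_nodes(node: str, cluster: Set[str], links: Set[Link]) -> Set[str]:
    """Nodes that `node` can reach directly, including itself."""
    return {node} | {
        other for other in cluster
        if other != node and frozenset((node, other)) in links
    }


def pause_decisions(cluster: Set[str], links: Set[Link]) -> Dict[str, bool]:
    """Map each node to True if it would pause under the simplified rule.

    Assumed rule: pause when the visible nodes are half of the cluster
    or fewer, i.e. the node believes it is in a minority.
    """
    return {
        node: len(visible_nodes(node, cluster, links)) * 2 <= len(cluster)
        for node in sorted(cluster)
    }


CLUSTER = {"A", "B", "C"}

# Partial partition from the thread: only the B-C link is down.
PARTIAL = {frozenset(("A", "B")), frozenset(("A", "C"))}

# Full partition for comparison: C is cut off from both A and B.
FULL = {frozenset(("A", "B"))}

print("partial:", pause_decisions(CLUSTER, PARTIAL))
print("full:   ", pause_decisions(CLUSTER, FULL))
```

In the partial case B and C each still see two of three nodes, so under this model no node pauses. In the full partition C sees only itself and pauses.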
Re: Cluster problem after network bouncing

Leonardo N. S. Pereira
Hi Simon, thanks very much for your answer.
What is the recommended setup for HA running in AWS?
Is there a way to work around the partition problem?


Re: Cluster problem after network bouncing

Simon MacMullen-2
On 13/05/2014 18:04, Leonardo N. S. Pereira wrote:
> Hi Simon, thanks very much for your answer.
> What is the recommended set up for HA running in AWS?
> Is there a way to workaround the partition problem?

Don't cluster across more than two AZs.

Unless service availability is more important to you than avoiding data
loss, don't cluster across AZs at all.

Cheers, Simon

Re: Cluster problem after network bouncing

Matthias Radestock-3
On 14/05/14 08:58, Simon MacMullen wrote:
> On 13/05/2014 18:04, Leonardo N. S. Pereira wrote:
>> Hi Simon, thanks very much for your answer.
>> What is the recommended set up for HA running in AWS?
>> Is there a way to workaround the partition problem?
>
> Don't cluster across more than two AZs.
>
> Unless service availability is more important to you than avoiding data
> loss, don't cluster across AZs at all.

Also note that the situation you created in your tests, which causes
the odd behaviour - a partial partition (where communication between
nodes is severed in just one direction) - is less likely to occur in
practice than a full partition.

Matthias.
Re: Cluster problem after network bouncing

Laing, Michael P.
Actually we have many clusters running across 3 zones in AWS :)

But we are prepared to lose entire regions, wholly or partially.

And we never persist messages in our rabbits - instead we use a multi-region Cassandra cluster. Oh and S3 for large message bodies.

Plus important messages (anything not individually addressed) are replicated for processing multiple times across multiple regions, racing to resolution.

It is a 'rabbits everywhere' strategy: a global mesh of redundant cooperating clusters that replicate, route, and resolve messages and use Cassandra and S3 for persistence.

The key to keeping a cluster up across zones in AWS is to never, ever overload it so there is no interruption of inter-cluster communications. The key statistic to monitor is IO wait. 

We over-provision our cluster members to be sure they have enough instantaneous resource at all times. And, as I said, we never persist messages on the cluster.

ml
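
As a rough illustration of watching the IO wait statistic mentioned above, the sketch below computes the IO wait percentage between two samples of the aggregate `cpu` line in Linux's `/proc/stat`. It assumes a Linux host and the standard field order (user, nice, system, idle, iowait, irq, softirq, steal). The function names and the sample values are invented for the example, and nothing here comes from the setup described in this message.

```python
from typing import List


def parse_cpu_line(stat_text: str) -> List[int]:
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    prev = parse_cpu_line(before)
    curr = parse_cpu_line(after)
    # Use only the first eight counters: guest time is already counted in user/nice.
    deltas = [c - p for p, c in zip(prev[:8], curr[:8])]
    if len(deltas) < 5:
        raise ValueError("cpu line has no iowait field")
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


# Invented sample data standing in for two reads of /proc/stat.
BEFORE = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0\n"
AFTER = "cpu  150 0 70 1650 130 0 0 0 0 0\ncpu0 150 0 70 1650 130 0 0 0 0 0\n"

print(f"iowait: {iowait_percent(BEFORE, AFTER):.1f}%")
```

On a real node the two samples would come from reading `/proc/stat` a few seconds apart. What counts as too much IO wait is not stated in this thread and would have to be decided per deployment.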

