Fully reliable setup impossible?

Fully reliable setup impossible?

Steffen Daniel Jensen
We have two data centers connected closely by LAN.
 
We are interested in a *reliable cluster* setup. It must be a cluster because we want clients to be able to connect to each node transparently. Federation is not an option.
1. It happens that the firewall/switch is restarted, and maybe a few ping messages are lost.
2. The setup should survive a data center crash.
3. All queues are durable and mirrored, all messages are persisted, and all publishes are confirmed.
 
There are three cluster-recovery (partition handling) settings; a config sketch follows below:
a) ignore: A cross-data-center network breakdown would cause message loss on the node that is restarted in order to rejoin.
b) pause_minority: If we choose the same number of nodes in each data center, the whole cluster will pause. If we don't, only the data center with the most nodes can survive.
c) autoheal: If the cluster detects a network partition, there is a potential for message loss when the nodes rejoin.
[I would really like a resync setting similar to the one described below]
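 
For reference, these modes are chosen via cluster_partition_handling in rabbitmq.config; a minimal sketch in the classic Erlang-term config format (all other settings omitted):

    %% rabbitmq.config -- illustrative only; pick exactly one of
    %% ignore | pause_minority | autoheal
    [
      {rabbit, [
        {cluster_partition_handling, pause_minority}
      ]}
    ].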
 
Question 1: Is it even possible to have a fully reliable setup in such a setting?
 
In reality we probably won't have actual network partitions; most likely it will only be a very short network outage.
 
Question 2: Is it possible to adjust how long it takes RabbitMQ to decide "node down"?
 
It is much better to have RabbitMQ halted for some seconds than to have message loss.
 
 
Question 3: Assume that we are using the ignore setting, and that we have only two nodes in the cluster. Would the following be a full recovery with zero message loss?
 
0. Decide which node survives, Ns, and which should be restarted, Nr.
1. Refuse all connections to Nr except from a special recovery application. (One could change the IP so that running services can't connect, or similar.)
2. Consume and republish all messages from Nr to Ns (a sketch follows below).
3. Restart Nr.
Then the cluster should be up-and-running again.
 
Since all queues are mirrored, all messages published during the partition are preserved. If a certain service lives in only one data center, messages will pile up in the other (if there are any publishes).
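 
A minimal sketch of step 2 using pika (Python). The node addresses and queue names are made-up placeholders, and this is an illustration of the idea rather than a tested recovery tool:

    # Drain messages from the node to be restarted (Nr) and republish
    # them to the surviving node (Ns). Hostnames/queues are assumptions.
    import pika

    NR = pika.ConnectionParameters(host="nr.example.local")  # node to restart
    NS = pika.ConnectionParameters(host="ns.example.local")  # surviving node
    QUEUES = ["orders", "invoices"]                           # hypothetical queues

    src_conn = pika.BlockingConnection(NR)
    dst_conn = pika.BlockingConnection(NS)
    src, dst = src_conn.channel(), dst_conn.channel()
    dst.confirm_delivery()  # publisher confirms on the surviving node

    for queue in QUEUES:
        while True:
            method, props, body = src.basic_get(queue)
            if method is None:          # queue drained
                break
            # Republish to the same queue on Ns (default exchange), and only
            # ack on Nr after Ns has confirmed the publish.
            dst.basic_publish(exchange="", routing_key=queue,
                              body=body, properties=props)
            src.basic_ack(method.delivery_tag)

    src_conn.close()
    dst_conn.close()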
 
If you have any other suggestions, I would be very interested to hear them.
 
I would be really sad to find it necessary to choose TIBCO ESB over RabbitMQ for this reason.
 
Thank you,
-- Steffen


Re: Fully reliable setup impossible?

mc717990
On clustering and clients connecting transparently - I highly recommend a load balancer in front of your Rabbit servers. With TCP half-open monitoring on the Rabbit port, you can tell pretty quickly when a node/site goes down, and then fail over to one of the other nodes/clusters. With clustering, mirrored queues, and publisher confirms you'll avoid data loss this way. You CAN get data duplication though. But I'd only recommend clustering over a really reliable link. If you're going across a WAN, use shovel/federation to replicate messages to Rabbit clusters on the other side rather than trying to do cross-WAN clusters. You could, for example, run a cluster in each site and use federation to send messages to the other cluster as needed. If any given site goes down, your load balancer could switch traffic to the other cluster. There's still a chance of downtime, but it's pretty minimal. We use this to redirect traffic to any given node in the cluster right now, so if a single node fails, the load balancers pull that node out of service automatically.

Regarding question 2 - if you design it right, using confirms (the default in most clients as I understand it) and persistent messages, you'll never get message loss with a mirrored queue unless ALL servers completely crash and their hard drives die. At least this is my understanding :)
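
For what it's worth, a minimal pika (Python) sketch of publishing a persistent message with confirms turned on (host and queue name are made up):

    import pika

    conn = pika.BlockingConnection(
        pika.ConnectionParameters(host="rabbit.example.local"))
    ch = conn.channel()
    ch.confirm_delivery()                         # broker confirms each publish
    ch.queue_declare(queue="work", durable=True)  # durable queue (mirroring comes from policy)

    # delivery_mode=2 marks the message persistent; with a mirrored durable
    # queue plus a confirm, the message should survive a single node failure.
    ch.basic_publish(
        exchange="",
        routing_key="work",
        body=b"hello",
        properties=pika.BasicProperties(delivery_mode=2),
    )
    conn.close()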

Last point - you may want to handle this situation manually if it's that much of a concern. E.g. let the nodes remain partitioned, let all the messages drain (again, remember duplicates are possible), then restrict access to the bad nodes so that only your consumer processes can reach them, shut down the "bad" nodes and bring them back up. They'd not have any messages, and they'd get their queues/exchanges from your "master" node that was good when they come back up. With a load balancer in front, you could use it to control this very effectively.
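
If it helps, the usual way to restart just the broker application on a "bad" node (run locally on that node; illustrative only):

    # after the bad node's queues have been drained
    rabbitmqctl stop_app
    # ...later, to rejoin the cluster and resync from the good node
    rabbitmqctl start_app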

Definitely read through the partitioning and reliability documentation and actually try these scenarios.

Jason




--
Jason McIntosh
https://github.com/jasonmcintosh/
573-424-7612


Re: Fully reliable setup impossible?

Simon MacMullen-2
In reply to this post by Steffen Daniel Jensen
On 22/05/14 14:04, Steffen Daniel Jensen wrote:
> We have two data centers connected closely by LAN.
> We are interested in a *reliable cluster* setup. It must be a cluster
> because we want clients to be able to connect to each node
> transparently. Federation is not an option.

I hope you realise that you are asking for a lot here! You should read
up on the CAP theorem if you have not already done so.

> 1. It happens that the firewall/switch is restarted, and maybe a few
> ping messages are lost.
> 2. The setup should survive data center crash
> 3. All queues are durable and mirrored, all messages are persisted, all
> publishes are confirmed
> There are 3 cluster-recovery settings
> a) ignore: A cross data center network break-down would cause message
> loss on the node that is restarted In order to rejoin.
> b) pause_minority: If we choose the same number of nodes in each data
> center, the whole cluster will pause. If we don't, only the data center
> with the most nodes can survive.
> c) auto_heal: If the cluster decides network partitioning, there is a
> potential of message loss, when joining.
> [I would really like a resync-setting similar to the one described below]
> Question 1: Is it even possible to have a fully reliable setup in such a
> setting?

Depends how you define "fully reliable". If you want Consistency (i.e.
mirrored queues), Availability (i.e. neither data centre pauses) and
Partition tolerance (no loss of data from either side if the network
goes down between them) then I'm afraid you can't.

> In reality we probably won't have actual network partitions, and it will
> most probably only be a very short network downtime.
> Question 2: Is it possible to adjust how long it takes rabbitmq to
> decide "node down"?

Yes, see http://www.rabbitmq.com/nettick.html
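
For example, in the classic config format (value in seconds; 60 is the default), something like:

    %% rabbitmq.config -- raise the Erlang net tick time
    [
      {kernel, [{net_ticktime, 120}]},
      {rabbit, []}
    ].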

> It is much better to have a halted rabbitmq for some seconds than to
> have message loss.
> Question 3: Assume that we are using the ignore setting, and that we
> have only two nodes in the cluster. Would the following be a full
> recovery with zero message loss?
> 0. Decide which node survives, Ns, and which should be restarted, Nr.
> 1. Refuse all connections to Nr except from a special recovery
> application. (One could change the ip, so all running services can't
> connect or similar)
> 2. Consume and republish all message from Nr to Ns.
> 3. Restart Nr
> Then the cluster should be up-and-running again.

That sounds like it would work. You're losing some availability and
consistency, and your message ordering will change. You have a pretty
good chance of duplicating lots of messages too (any that were in the
queues when the partition happened). Assuming you're happy with that it
sounds reasonable.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, Pivotal

Re: Fully reliable setup impossible?

Steffen Daniel Jensen
Hi Simon,

Thank you for your reply.

We have two data centers connected closely by LAN.
We are interested in a *reliable cluster* setup. It must be a cluster
because we want clients to be able to connect to each node
transparently. Federation is not an option.

I hope you realise that you are asking for a lot here! You should read up on the CAP theorem if you have not already done so.

Yes, I know. But I am not asking for an unreasonable amount, IMO :-)
I am aware of the CAP theorem, but I don't see how it is violated. I am willing to live with eventual consistency.

1. It happens that the firewall/switch is restarted, and maybe a few
ping messages are lost.
2. The setup should survive data center crash
3. All queues are durable and mirrored, all messages are persisted, all
publishes are confirmed
There are 3 cluster-recovery settings
a) ignore: A cross data center network break-down would cause message
loss on the node that is restarted In order to rejoin.
b) pause_minority: If we choose the same number of nodes in each data
center, the whole cluster will pause. If we don't, only the data center
with the most nodes can survive.
c) auto_heal: If the cluster decides network partitioning, there is a
potential of message loss, when joining.
[I would really like a resync-setting similar to the one described below]
Question 1: Is it even possible to have a fully reliable setup in such a
setting?

Depends how you define "fully reliable". If you want Consistency (i.e. mirrored queues), Availability (i.e. neither data centre pauses) and Partition tolerance (no loss of data from either side if the network goes down between them) then I'm afraid you can't.

What I mean when I say "reliable" is: All subscribers at the time of publish will eventually get the message.

That should be possible, assuming that all live inconsistent nodes will eventually rejoin (without dumping messages). I know this is not the case in RabbitMQ, but it is definitely theoretically possible. I guess this is what is usually referred to as eventual consistency.
 
In reality we probably won't have actual network partitions, and it will
most probably only be a very short network downtime.
Question 2: Is it possible to adjust how long it takes rabbitmq to
decide "node down"?

Yes, see http://www.rabbitmq.com/nettick.html

Thank you! (!)
I have been looking for that one. But I am surprised to see that the default is actually 60 seconds. Then I really don't understand how I could have seen so many clusters end up partitioned.

Do you know what the consequence of doubling it might be?

RabbitMQ writes:
Increasing the net_ticktime across all nodes in a cluster will make the cluster more resilient to short network outages, but it will take longer for remaining nodes to detect crashed nodes.

More specifically, I wonder what happens during the time when a node is actually alone on its own network but before it finds out. In our setup all publishes go to all-HA subscriber queues, with publisher confirms. So I would expect a distributed agreement that the message has been persisted. Will a publisher confirm then block until the node decides that the other nodes are down, and then succeed?

It is much better to have a halted rabbitmq for some seconds than to
have message loss.
Question 3: Assume that we are using the ignore setting, and that we
have only two nodes in the cluster. Would the following be a full
recovery with zero message loss?
0. Decide which node survives, Ns, and which should be restarted, Nr.
1. Refuse all connections to Nr except from a special recovery
application. (One could change the ip, so all running services can't
connect or similar)
2. Consume and republish all message from Nr to Ns.
3. Restart Nr
Then the cluster should be up-and-running again.

That sounds like it would work. You're losing some availability and consistency, and your message ordering will change. You have a pretty good chance of duplicating lots of messages too (any that were in the queues when the partition happened). Assuming you're happy with that it sounds reasonable.

The duplication is OK -- but assuming that Rabbit is usually empty, it won't really happen, I think.
But -- I am sure that Rabbit does not guarantee exactly-once delivery anyway.
For that reason, we will build in idempotency for critical messages.

Ordering can always get scrambled when consumers nack messages, so we are not assuming ordering either.


About the CAP theorem in relation to Rabbit:
Reliable messaging (zero message loss) is often preferred in SOA settings. I wonder why vmware/pivotal/... chose not to prioritize this guarantee. The federation setup aims at it, but it is a little too weak in its synchronization. It would be better if it had a way of communicating consumption of messages. Then one could mirror queues between upstream/downstream exchanges and have even more "availability". One would definitely give up consistency a little further, but it would make the setup above possible, I think. I know it definitely doesn't come out of the box, and it is not part of AMQP, AFAIK, but it seems possible.

Thank you, Simon!

-- S


Re: Fully reliable setup impossible?

Ron Cordell
It seems that one of your assumptions is that a cluster would operate across data centers. This is not recommended for RabbitMQ and we don't use it that way - we use shovels between clusters since we, like you, can deal with eventual consistency.

Our clusters are similar to what you describe. For us, (almost) all queues are persistent and mirrored because we can't tolerate message loss. We have seen significant sensitivity to partitioning in Windows OS-based clusters under very heavy load; we do not see this on Linux and can run at least twice the load as well. 

There is an F5 in front of the cluster, but it doesn't do load balancing; it just acts as a persistent router. We've found that by directing all traffic to one node and letting it replicate to the other nodes we don't have to deal with issues if a network partition occurs, and they *do* occur (earlier this week one of the virtual NICs stopped on one of the Rabbit nodes, for example). The F5 will detect this very quickly and divert traffic to another node if necessary. (Note to others - we've found that this arrangement scales significantly better than round-robin load balancing for mirrored persistent queues.)

We have clusters in each data center; for clusters that need to replicate to a different data center there is the shovel. However, there is a chance of message loss if the application sends the message to the cluster in one DC and that entire DC is hit by a meteor before the message can be delivered to the other data center. For most scenarios, however, the messages will be eventually delivered when the DC comes back up.
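
As a rough illustration of the shovel approach (a dynamic shovel, with made-up names and URIs):

    rabbitmqctl set_parameter shovel replicate-orders \
      '{"src-uri": "amqp://", "src-queue": "orders",
        "dest-uri": "amqp://rabbit.other-dc.example", "dest-queue": "orders"}'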

Hope that helps a little...

Cheers,

-ronc



Re: Fully reliable setup impossible?

Laing, Michael P.
In addition, to protect against meteors, we replicate messages and process the replicas independently in multiple data centers (actually AWS regions). Our processing is idempotent, so this works. We resolve replicas in the output stages.

ml



Re: Fully reliable setup impossible?

mc717990
In reply to this post by Ron Cordell
Interesting piece of info on the F5, round robin, and performance. We've got an F5 as well, with round-robin handling and mirrored persistent queues. Our biggest challenge was the number of persistent connections. We have over 4k connections to our main production environment right now, so we try to even out the connections among our servers. I may want to go back and see if this was really a good idea.... But we're pure physical hardware, and I don't think I've seen a network partition in the last 6 months. We maintain clusters in each data center (well, most of our stuff is actually local nodes that act as a temporary cache and shovel to another data center for processing).

Jason



Re: Fully reliable setup impossible?

Simon MacMullen-2
In reply to this post by Steffen Daniel Jensen
On 22/05/14 20:04, Steffen Daniel Jensen wrote:
> Yes, I know. But I am not asking an unreasonably lot, IMO :-)
> I am aware of the CAP theorem, but I don't see how it is in violation. I
> am willing to live with eventual consistency.

Ah, right.

> What I mean when I say "reliable" is: All subscribers at the time of
> publish will eventually get the message.
>
> That should be possible, assuming that all live inconsistent nodes will
> eventually rejoin (without dumping messages). I know this is not the
> case in rabbitmq, but it is definitely theoretically possible. I guess
> this is what is usually referred to as eventual consistency.

So why did you originally say:

> It must be a cluster because we want clients to be able to connect to
> each node transparently. Federation is not an option.

...because it really sounds like federation is what you want :-)

>     Yes, see http://www.rabbitmq.com/nettick.html
>
>
> Thank you! (!)
> I have been looking for that one. But I am surprised to see that it is
> actually 60sec. Then I really don't understand how I could have seen so
> many clusters ending up partitioned.
>
> Do you know what the consequence of doubling it might be?

It will take twice as long to conclude that a remote node that is no
longer connected has actually gone away. Until then, things can block.

> RabbitMq writes:
> Increasing the net_ticktime across all nodes in a cluster will make the
> cluster more resilient to short network outtages, but it will take
> longer for remaing nodes to detect crashed nodes.
>
> More specifically I wonder what happens in the time a node is actually
> in its own network, but before it finds out. In our setup all publishes
> have all-HA subscriber queues, with publisher confirm. So I will expect
> a distributed agreement that the msg has been persisted.

Yes.

> Will a
> publisher confirm then block until the node decides that other nodes are
> down, and then succeed?

Yes.

> The duplication is ok -- but assuming that rabbit is usually empty, it
> won't really happen, I think.
> But -- I am sure that rabbit does not guarantee exactly once delivery
> anyway.
> For that reason, we will build in idempotency for critical messages.
>
> Ordering can always get scrambled when nacking consuming messages, so we
> are not assuming ordering either.

OK.

> About the CAP theorem in relation to rabbit.
> Reliable messaging (zero message loss), is often preferred in
> SOA-settings. I wonder why vmware/pivotal/... chose not to prioritize
> this guarantee. It is aimed by the federation setup, but it is a little
> to weak in its synchronization. It would be preferred if it had a
> possibility of communicating consumption of messages. Then one could
> mirror queues between up/down-stream exchanges, and have even more
> "availability".

I'm not sure what you're talking about here. Federation certainly should
be able to ensure zero message loss! (Assuming you leave it in
"on-confirm" ack-mode).

So when you say "if it had a possibility of communicating consumption of
messages" you're talking about a sort of eventually consistent federated
mirrored queue? I have wondered about producing such a thing. But as
usual the list of things we could do is large, and the resources small.
And I suspect the cost of producing it would be quite large, mostly due
to the need to somehow reunify the queues after a partition.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, Pivotal

Re: Fully reliable setup impossible?

Steffen Daniel Jensen
What I mean when I say "reliable" is: All subscribers at the time of
publish will eventually get the message.

That should be possible, assuming that all live inconsistent nodes will
eventually rejoin (without dumping messages). I know this is not the
case in rabbitmq, but it is definitely theoretically possible. I guess
this is what is usually referred to as eventual consistency.
So why did you originally say:
It must be a cluster because we want clients to be able to connect to
each node transparently. Federation is not an option.
...because it really sounds like federation is what you want :-)
 
Services in our setup must be ignorant of which of the two data centers they are connected to. Since federation does not mirror queues across clusters, this is not possible with federation.
 
Do you know what the consequence of doubling it might be?

It will take twice as long to conclude that a remote node that is no longer connected has actually gone away. Until then, things can block.
 
Thanks. Perfect.
 
About the CAP theorem in relation to rabbit.
Reliable messaging (zero message loss), is often preferred in
SOA-settings. I wonder why vmware/pivotal/... chose not to prioritize
this guarantee. It is aimed by the federation setup, but it is a little
to weak in its synchronization. It would be preferred if it had a
possibility of communicating consumption of messages. Then one could
mirror queues between up/down-stream exchanges, and have even more
"availability".

I'm not sure what you're talking about here.
 
Sorry for being unclear.
 
Federation certainly should be able to ensure zero message loss! (Assuming you leave it in "on-confirm" ack-mode).
 
Yes, assuming that there actually is a queue (with a binding) on the right node.
 
So when you say "if it had a possibility of communicating consumption of messages" you're talking about a sort of eventually consistent federated mirrored queue? I have wondered about producing such a thing. But as usual the list of things we could do is large, and the resources small. And I suspect the cost of producing it would be quite large, mostly due to the need to somehow reunify the queues after a partition.
 
This is really exactly what is needed IMO. It sounds very much like the setup TIBCO uses when guaranteeing zero message loss.
I definitely appreciate that the reunification procedure is not simple. However, mirrored queues are usually somewhat static (in that they are created once and live forever). Maybe a simple queue union, with some sort of message-consumption union as well, so node 1 can delete messages consumed on node 2. I suppose it could be implemented efficiently by adding a sequence number to the messages in each queue and then keeping a list of intervals of consumed message sequence numbers, or similar.
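 
A toy illustration of that bookkeeping, with consumed sequence numbers tracked as merged intervals (nothing RabbitMQ-specific, just the data structure I have in mind):

    # Merge two lists of [start, end] intervals of consumed sequence numbers,
    # so either side of a healed partition can drop messages the other side
    # already consumed. Toy sketch only.
    def merge_consumed(a, b):
        merged = []
        for start, end in sorted(a + b):
            if merged and start <= merged[-1][1] + 1:  # overlapping or adjacent
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        return merged

    # e.g. node 1 consumed 1-10 and 15-20, node 2 consumed 8-16:
    print(merge_consumed([[1, 10], [15, 20]], [[8, 16]]))  # -> [[1, 20]]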
 
It would be a great (!) addition to rabbitmq, and I am very interested in hearing of any progress.
 
Thank you big time,
-- S :-)
