Instable HA cluster

Instable HA cluster

ben.west
Hi,


Since deploying RabbitMQ a few months ago, we have experienced ongoing instability with a cluster I have set up.

Our setup:

  • RabbitMQ 3.0.1 running on Windows Azure VMs (Server 2012)
  • Two VMs, each running a RabbitMQ node, clustered (with an HA policy); both are disk nodes
  • Using EasyNetQ for producers and consumers

When I first set up the cluster everything seems to run well, and I can see through the RabbitMQ console that the HA mirroring is all set up.

However, after a couple of days I will log into the console and find the cluster is broken, usually with one node saying "Node not running" and a number of the consumers having dropped off the queues. I then have to tear down the cluster and build it all up again. Not a massive job, but I'm sure I shouldn't need to do this every week!

At first I thought it might be due to Windows updates automatically restarting the server and possibly breaking the cluster, but I have now turned these off and still get the same problem. I have also tried rebooting the servers one at a time to test this, but the cluster seems to pick itself back up once the reboot has finished.

I've done lots of reading online but can't seem to find many suggestions, so I hope someone here may be able to help.

I'm more than happy to provide logs if these may be useful.

Any help will be greatly appreciated!

Thanks,

Ben



_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Re: Instable HA cluster

Simon MacMullen-2
On 25/01/13 09:52, [hidden email] wrote:
> I'm more than happy to provide logs if these may be useful?

That would be useful. Preferably if you can also tell us about the time
at which the cluster broke (even if it's just "some time on Wednesday"
or similar).

Cheers, Simon
--
Simon MacMullen
RabbitMQ, VMware

Re: Instable HA cluster

Simon MacMullen-2
Hi, thanks. So it looks like the cluster is experiencing network partitions:

http://www.rabbitmq.com/partitions.html

since we can see (for example) that around 18-Jan-2013 03:30:21 both
nodes saw the other one go down. RabbitMQ clusters do not tolerate
partitions well, so this needs to be fixed I'm afraid.
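One way to notice this condition without waiting for the console to look wrong is to poll the management plugin's HTTP API. The following is a minimal sketch, not a definitive check: it assumes a RabbitMQ version whose `GET /api/nodes` response includes a `partitions` list for each node (worth confirming on your version), and the sample payload and node names are made up for illustration.

```python
import base64
import json
import urllib.request


def fetch_nodes(base_url, user, password):
    """Fetch the node list from the management API (GET /api/nodes)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"{base_url}/api/nodes",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def find_problems(nodes):
    """Return human-readable warnings for stopped or partitioned nodes."""
    problems = []
    for node in nodes:
        name = node.get("name", "<unknown>")
        if not node.get("running", False):
            problems.append(f"{name}: not running")
        # Assumption: 'partitions' lists the peers this node was cut off from.
        for peer in node.get("partitions", []):
            problems.append(f"{name}: partitioned from {peer}")
    return problems


# Made-up sample shaped like an /api/nodes response (hypothetical node names).
sample = [
    {"name": "rabbit@nodeA", "running": True, "partitions": ["rabbit@nodeB"]},
    {"name": "rabbit@nodeB", "running": True, "partitions": ["rabbit@nodeA"]},
]
for line in find_problems(sample):
    print(line)
```

Against a live broker you would pass the result of `fetch_nodes("http://localhost:15672", user, password)` to `find_problems`; the URL and credentials are placeholders.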

You should not need to rebuild the cluster when this happens however,
just stopping and starting nodes should be enough to recover.
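As a rough illustration of "stopping and starting nodes", the sketch below builds the `rabbitmqctl` commands for the nodes you choose to restart and only prints them unless told otherwise. Which node to restart is your decision, the node name is hypothetical, and whether `stop_app` followed by `start_app` is sufficient (rather than a full service restart) is an assumption to verify in your own environment.

```python
import subprocess


def restart_plan(nodes_to_restart):
    """Build rabbitmqctl commands that stop, then start, each listed node."""
    plan = []
    for node in nodes_to_restart:
        plan.append(["rabbitmqctl", "-n", node, "stop_app"])
    for node in nodes_to_restart:
        plan.append(["rabbitmqctl", "-n", node, "start_app"])
    return plan


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# 'rabbit@nodeB' is a hypothetical node name; substitute your own.
run_plan(restart_plan(["rabbit@nodeB"]))
```

The dry-run default is deliberate, so the plan can be reviewed before anything is stopped. On Windows the command is `rabbitmqctl.bat`, so the executable name would need adjusting there.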

Cheers, Simon

On 25/01/13 17:04, Ben West wrote:

> Hi Simon,
>
> Please find attached.
>
> Thanks again,
>
> Ben
>
>
> On 25 January 2013 16:41, Simon MacMullen <[hidden email]> wrote:
>
>     Could you send the log for MOPAYPL2 as well? It would be useful to
>     correlate what's happening with it too.
>
>     Cheers, Simon
>
>
>     On 25/01/13 16:28, Ben West wrote:
>
>
>         Hi Simon,
>
>         Thanks for coming back to me.
>
>         Attached is a RabbitMQ log from one of the nodes.
>
>         As I mentioned before, this seems to happen once every week or
>         two. I can't be too specific about when it happened because I
>         only noticed when I checked the console, but I believe the last
>         occurrence would have been yesterday (24th Jan).
>
>         If you need further logs / info let me know and I'll dig it out
>         for you.
>
>         Kind regards,
>
>         Ben

--
Simon MacMullen
RabbitMQ, VMware

Re: Instable HA cluster

ben.west

Hi Simon,

Thanks for taking the time to look into this.

I had a feeling this was an environmental issue because it only occurs on our QA and PreLive environments; we do not experience the same issue on our LIVE environment (hosted on another service, not Azure).

I will get in touch with Microsoft to find out what our options are with Azure VMs and establish whether there is anything we can do to eliminate network partitions. We do not currently use their Virtual Networks, so we may try placing the two VMs in a Virtual Network to improve the stability of the connection between the two nodes. Failing that, we may look at the Federation / Shovel solutions mentioned in the link.

Cheers,

Ben


Ben West
Product Owner
 
Mobile: +44(0)7824 617 813




On 28 January 2013 10:55, Simon MacMullen <[hidden email]> wrote:
Hi, thanks. So it looks like the cluster is experiencing network partitions:

http://www.rabbitmq.com/partitions.html

since we can see (for example) that around 18-Jan-2013 03:30:21 both nodes saw the other one go down. RabbitMQ clusters do not tolerate partitions well, so this needs to be fixed I'm afraid.

You should not need to rebuild the cluster when this happens however, just stopping and starting nodes should be enough to recover.

Cheers, Simon



Re: Instable HA cluster

James Lewis
In reply to this post by ben.west
Hi, did you have any luck with this?  I'm just setting up a rabbit cluster for my team and I'm running into exactly the same problem.  What worries me is that the docs on clustering say that the network connection between nodes must be reliable - I was hoping that a 3 node cluster would continue running if the connection between 2 nodes dropped and then allow the dropped node to join back into the cluster when the connection came back up...?

Any help appreciated,
James






--
This email, including attachments, is private and confidential. If you have
received this email in error please notify the sender and delete it from
your system. Emails are not secure and may contain viruses. No liability
can be accepted for viruses that might be transferred by this email or any
attachment. Any unauthorised copying of this message or unauthorised
distribution and publication of the information contained herein are
prohibited. 7digital Limited. Registered office: 69 Wilson Street, London EC2A 2BB. Registered in
England and Wales. Registered No. 04843573.


Re: Instable HA cluster

James Lewis
Whoops, answered my own question. I'm using cfengine for configuration management, and I'd done something funny to one of my nodes that triggered an infinite loop in cfengine: it kept trying to reinstall Erlang, which in turn kept restarting the rabbit node. It was my mistake and it's all fixed now.