When rabbitmq is clustered with one other node we see a very slow dequeue of messages

GENTLING Gregory

Classification: Open

When RabbitMQ is clustered with one other node, we see very slow dequeuing of messages. The scenario is simple: Node A and Node B form a cluster, configured with the autoheal option and the default net_ticktime. Steps to reproduce:

(These are all local connections.)

(1)    Connect client A1 to Node A

a.       Client A1 creates a topic exchange

b.      Client A1 publishes at 1 msg/sec

(2)    Connect client A2 to Node A

a.       Client A2 listens for the messages in the exchange

(3)    Connect client B to Node B (this is important: the issue does not occur without this remote client)

a.       Client B listens for the messages in the exchange

(4)    Pull the plug on Node B (you will not see the issue with a graceful shutdown). Alternatively, use "route" to make Node B unroutable from Node A

a.       If you only kill RabbitMQ, you will not see the issue

(5)    Wait for net_ticktime to expire (or until Node A's log shows Node B being removed from the cluster)

(6)    Client A2 no longer receives messages at 1 msg/sec; it falls considerably behind but recovers in about 10 minutes.

 

We have two environments with slightly different network configurations (two pairs of Node A and Node B). We see this issue on one but not on the other, so it cannot always be reproduced.

 

Other issues observed in this state:

·         rabbitmqctl cluster_status, list_queues, list_connections and list_exchanges all hang; rabbitmqctl status does not hang

·         declareQueue, declareExchange and declareExchangePassive all hang

·         disabling autoheal does not help

·         tested with both Erlang 5.9 and 5.10.3

·         tested with both RabbitMQ 3.1.5 and 3.1.3, same issue in both

·         we do not see this issue with a direct exchange

·         nothing out of the ordinary in vmstat; CPU is not pegged and the system is not thrashing

 

Things we have ruled out:

·         iptables (tested with no rules)

·         SELinux (tested in permissive mode)

·         Java drivers

 

Same thing as described here:

 

http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2013-June/027674.html

 

Thank you,

 

Greg Gentling

Principal Software Architecture - Avant CommonApps

Thales Avionics, Inc.

In-Flight Entertainment and Connectivity

Irvine, CA 92618

949-595-4943

 

[@@OPEN@@]

This email was classified by
GENTLING Gregory on Tuesday, December 03, 2013 5:52:25 PM.


_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

Simon MacMullen-2
The issue is that you are getting into a situation where node B is down,
but node A is not aware of this (probably because from a TCP level it's
not aware that the connection has been closed). Node A therefore has to
wait a considerable time (the net_ticktime) trying to send packets to
node B before giving up and treating the node as down.

If node A can tell at the TCP level that the connection to node B has
gone down, then you won't have this wait, it'll just mark the node as
down immediately and carry on.

To some extent you can tweak this behaviour by reducing net_ticktime -
but a short net_ticktime makes it plausible that a node will be
considered down when it isn't.

See http://www.rabbitmq.com/partitions.html for more.

Cheers, Simon
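
For reference, a minimal sketch of the timing involved. It assumes the behaviour described in the Erlang kernel documentation: peers are ticked every net_ticktime/4 seconds, and an unresponsive peer is declared down somewhere between 0.75x and 1.25x net_ticktime. The helper name is hypothetical; 60 seconds is the Erlang default.

```python
from typing import Tuple


def detection_window(net_ticktime_s: float) -> Tuple[float, float]:
    """Return (min, max) seconds before an unresponsive peer is declared down.

    Assumes the rule from the Erlang kernel docs:
    MinT = T - T/4 and MaxT = T + T/4, where T is net_ticktime.
    """
    quarter = net_ticktime_s / 4
    return (net_ticktime_s - quarter, net_ticktime_s + quarter)


if __name__ == "__main__":
    # 20 s, the 60 s default, and 10 minutes.
    for t in (20, 60, 600):
        low, high = detection_window(t)
        print(f"net_ticktime={t}s: peer declared down after {low:.0f} to {high:.0f}s")
```

A shorter net_ticktime narrows this window but, as noted above, makes false "node down" detections more plausible.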

--
Simon MacMullen
RabbitMQ, Pivotal

Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

James

Hi Simon,

The issue is not that RabbitMQ fails to detect a down node in a timely
fashion; it does what I expect there. The behavior in question is what
happens after RabbitMQ removes the node when net_ticktime expires. If I set
net_ticktime to 20 seconds, 20 seconds go by, Node B is removed, and then
the slow message delivery begins. Likewise, if I set it to 10 minutes, Node B
is removed after 10 minutes and the slowness begins. Five to ten minutes
after Node B is removed, the server catches up. So we are seeing degraded
performance for up to 10 minutes *after* Node B is removed from the cluster,
so much so that even with a light load of 1 msg/sec the consumer falls
behind by over 100 messages after about 5 minutes. net_ticktime only affects
when the server becomes degraded, not for how long.
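
To put the numbers above in perspective, here is a small worked calculation (the function name is hypothetical, and the figures are the approximate ones quoted in this message). A backlog of 100 messages after 5 minutes at 1 msg/sec implies an average delivery rate of about 0.67 msg/sec over that period; since the backlog was "over 100", the real rate was lower still.

```python
def average_delivery_rate(publish_rate: float, elapsed_s: float, backlog: float) -> float:
    """Average consumer rate implied by a backlog that built up over elapsed_s."""
    published = publish_rate * elapsed_s   # messages sent in the period
    delivered = published - backlog        # messages that actually reached the consumer
    return delivered / elapsed_s


if __name__ == "__main__":
    rate = average_delivery_rate(publish_rate=1.0, elapsed_s=5 * 60, backlog=100)
    print(f"average delivery rate: {rate:.2f} msg/sec")
```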

Thanks,
James


Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

GENTLING Gregory
In reply to this post by Simon MacMullen-2
Classification: Thales Group Internal

Hi Simon,

The issue is not that RabbitMQ fails to detect a down node in a timely fashion; it does what I expect there. The behavior in question is what happens after RabbitMQ removes the node when net_ticktime expires. If I set net_ticktime to 20 seconds, 20 seconds go by, Node B is removed, and then the slow message delivery begins. Likewise, if I set it to 10 minutes, Node B is removed after 10 minutes and the slowness begins. Five to ten minutes after Node B is removed, the server catches up. So we are seeing degraded performance for up to 10 minutes *after* Node B is removed from the cluster,
so much so that even with a light load of 1 msg/sec the consumer falls behind by over 100 messages after about 5 minutes. net_ticktime only affects when the server becomes degraded, not for how long.

Thanks,

Greg & James

[@@THALES GROUP INTERNAL@@]
 
This email was classified by GENTLING Gregory on Thursday, December 05, 2013 11:00:11 AM.

Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

James
Here's an example of the slowness with Mnesia debug logging on.
The "Got mnesia_down" line appears when net_ticktime expires.


 [x][DefaultConsumer] Received 'hi-155' ..
 [x][DefaultConsumer] Received 'hi-156' ..
 [x][DefaultConsumer] Received 'hi-157' ..
 [x][DefaultConsumer] Received 'hi-158' ..
Mnesia(rabbit@NODEA): Logging mnesia_down rabbit@NODEB
Mnesia(rabbit@NODEA): Got mnesia_down from rabbit@NODEB, reconfiguring...

# At this point the message delivery to the consumer becomes very slow.

 [x][DefaultConsumer] Received 'hi-159' ..
 [x][DefaultConsumer] Received 'hi-160' ..


Thanks,
James


Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

James
James <james.eddy@...> writes:

> We have two setups with slightly different network setups (two pairs of
> Node A and B). One we see this issue on, the other we do not, so this is not
> an issue that can be always reproduced.

I see now why this is not always reproducible. On one of our setups there
was no queueBind to the exchange on the failed server (Node B).

So this issue is always reproducible (in our environment) when the
following conditions are met:

(1) NODE_A and NODE_B are in a cluster.
(2) CLIENT_A0 connects to NODE_A, creates topic exchange EXCHANGE_X, and
publishes 1 msg/sec.
(3) CLIENT_A1 connects to NODE_A, then creates and binds a queue to
EXCHANGE_X. CLIENT_A1 is now dequeuing messages at ~1 msg/sec.
(4) CLIENT_B connects to NODE_B and binds a queue to EXCHANGE_X. CLIENT_B is
also now dequeuing messages at ~1 msg/sec.
(5) Pull the cable between NODE_A and NODE_B (or do anything else that
prevents the TCP connection drop from being detected, so that net_ticktime
comes into play; making NODE_B unroutable from NODE_A works, as does an
iptables DROP rule).

Result: CLIENT_A0 continues to publish at 1 msg/sec; after net_ticktime
expires, CLIENT_A1 dequeues at much less than 1 msg/sec and falls behind.


* Note: "connects" means a localhost connection for this test.
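
The publisher side of these steps could be sketched roughly as follows. This is an illustration, not the test client used in this thread: it assumes the Python pika 1.x client, a hypothetical routing key "test.key", and message bodies of the form 'hi-<n>'. The `lag` helper is likewise hypothetical and only shows how a backlog can be read off the sequence numbers.

```python
import time


def publish_sequence(channel, exchange, routing_key, count,
                     interval_s=1.0, sleep=time.sleep):
    """Publish 'hi-<n>' messages at a fixed interval; return the bodies sent.

    `channel` is any object with a pika-style
    basic_publish(exchange=..., routing_key=..., body=...) method.
    """
    sent = []
    for n in range(count):
        body = f"hi-{n}"
        channel.basic_publish(exchange=exchange, routing_key=routing_key,
                              body=body.encode("utf-8"))
        sent.append(body)
        sleep(interval_s)  # 1 msg/sec by default
    return sent


def lag(last_published, last_received):
    """Messages the consumer is behind, derived from 'hi-<n>' bodies."""
    return int(last_published.split("-")[1]) - int(last_received.split("-")[1])


def main():
    # Assumption: pika 1.x is installed and NODE_A listens on localhost.
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange="EXCHANGE_X", exchange_type="topic")
    publish_sequence(channel, "EXCHANGE_X", "test.key", count=600)
    connection.close()
```

Calling `main()` on NODE_A plays the role of CLIENT_A0; the consumers (CLIENT_A1 and CLIENT_B) would each bind their own queue to EXCHANGE_X.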

Thanks,
James



Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

Simon MacMullen-2
On 06/12/13 21:29, James wrote:
> Result: CLIENT_A0 continues to publish at 1MSG/sec, after net_ticktime
> CLIENT_A1 is dequeueing much less than 1MSG/sec and falls behind.

Thanks for a detailed description of the issue. Just to let you know
that I've reproduced this.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, Pivotal

Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

James

Hi Simon. Thanks for the update.

Were you also able to observe the other issues that occur in this state?
Such as:
* rabbitmqctl cluster_status/list_queues/list_connections/list_exchanges all
hang, while rabbitmqctl status does not hang
* declareQueue, declarePassiveQueue, declareExchange, declareExchangePassive
all hang
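
One way to check for these hangs without blocking a terminal is to run each command with a timeout. The sketch below is an illustration only: the helper names and the 10-second timeout are assumptions, and it presumes rabbitmqctl is on the PATH of the node being tested.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and classify it as 'ok', 'failed', 'hang' or 'not found'."""
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"       # still running when the timeout expired
    except FileNotFoundError:
        return "not found"  # executable is not on PATH
    return "ok" if result.returncode == 0 else "failed"


# Subcommands mentioned in this thread.
RABBITMQCTL_COMMANDS = ["status", "cluster_status", "list_queues",
                        "list_connections", "list_exchanges"]


def probe_rabbitmqctl(timeout_s=10.0):
    """Probe each rabbitmqctl subcommand; return {subcommand: outcome}."""
    return {sub: probe(["rabbitmqctl", sub], timeout_s)
            for sub in RABBITMQCTL_COMMANDS}
```

In the degraded state described here, one would expect `probe_rabbitmqctl()` to report "hang" for every subcommand except status.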


One last thing I wanted to add: when I run this on average hardware
(4-core 2 GHz x86, 8 GB RAM), the issue is still seen but is less noticeable.
When I run it on much more modest hardware (PPC, 2-core 1 GHz, 1 GB RAM), the
issue is several orders of magnitude worse, even though this lower-end
hardware is otherwise sufficient for the message load.

Thanks,
James




Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

James
In reply to this post by Simon MacMullen-2

Hi Simon,

Do you have a target release date/version for this issue?

Thanks,
James



Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

Simon MacMullen-2
On 27/12/2013 18:13, James wrote:
> Do you have a target release date/version for this issue?

I'm afraid not. I haven't even had time to look at it yet. It won't get
forgotten though.

Cheers, Simon


Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

James

Hi Simon,

Do you have a bug tracking number for this, so that I can keep an eye out
for it in Mercurial?

Thanks,
James




Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

Simon MacMullen-2
On 20/02/14 00:05, James Eddu wrote:
> Do you have a bug tracking number for this, so that I can keep an eye out
> for it in Mercurial?

It's 25921. Nothing has happened yet though.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, Pivotal

Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

James

Hi Simon,

I tested RabbitMQ built from this branch (bug25921):
http://hg.rabbitmq.com/rabbitmq-server/rev/0afd7955a109 . It did not seem
to improve the original issue. Is this the complete fix, or are there
possibly other bug reports that were spawned from this one?

Thanks,
James



Re: When rabbitmq is clustered with one other node we see a very slow dequeue of messages

Simon MacMullen-2
On 26/03/14 23:44, James Eddu wrote:
> I tested RabbitMQ built from this branch (bug25921):
> http://hg.rabbitmq.com/rabbitmq-server/rev/0afd7955a109 . It did not seem
> to improve the original issue. Is this the complete fix, or are there
> possibly other bug reports that were spawned from this one?

No, that's not considered finished yet.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, Pivotal

pika having 2 connections opened a consumer and a producer

Grenier,Michel [CMC]
Hi all,

In a Python pika script, is it possible to have two coexisting connections:
- one consuming from broker1 and doing something with each message, and
- a second connected to broker2, producing a different message once the
previous one is processed?

Is there any way to do so? Is there an easy way to break each required
ioloop into mixed event-module calls under a while loop, or something
better?

                Michel Grenier
                (514) 421-7204


Re: pika having 2 connections opened a consumer and a producer

michaelklishin
2014-03-27 21:36 GMT+04:00 Grenier,Michel [CMC] <[hidden email]>:
> In a python pika script is it possible to have 2 co-existing connections
> - one consuming from a broker1, doing something with each message and
> - a second connected to broker2, producing a different message once the
> previous is processed.
>
> Any means to do so ?

It should be possible (and quite straightforward) to create two connections
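
A minimal sketch of one way this might look, assuming the pika 1.x BlockingConnection API. The host names, queue name and routing key, and the `transform` and `relay` helpers, are all hypothetical. Using `consume()` as a generator with an inactivity timeout gives a single loop that drives both connections, so there is no need to interleave two ioloops by hand.

```python
def transform(body):
    """Placeholder for 'doing something with each message' (hypothetical)."""
    return body.upper()


def relay(consumer_channel, producer_channel, producer_connection,
          in_queue, out_exchange, out_routing_key, max_messages=None):
    """Consume from one broker and publish a derived message to another."""
    handled = 0
    for method, properties, body in consumer_channel.consume(
            in_queue, inactivity_timeout=1.0):
        # Service the otherwise idle producer connection (heartbeats etc.).
        producer_connection.process_data_events(time_limit=0)
        if method is None:
            continue  # inactivity timeout: nothing arrived this second
        producer_channel.basic_publish(exchange=out_exchange,
                                       routing_key=out_routing_key,
                                       body=transform(body))
        # Acknowledge only after the derived message has been handed off.
        consumer_channel.basic_ack(delivery_tag=method.delivery_tag)
        handled += 1
        if max_messages is not None and handled >= max_messages:
            break
    consumer_channel.cancel()
    return handled


def main():
    # Assumption: pika 1.x is installed; broker1 and broker2 are reachable.
    import pika

    consumer_conn = pika.BlockingConnection(pika.ConnectionParameters(host="broker1"))
    producer_conn = pika.BlockingConnection(pika.ConnectionParameters(host="broker2"))
    try:
        relay(consumer_conn.channel(), producer_conn.channel(), producer_conn,
              in_queue="input", out_exchange="", out_routing_key="output")
    finally:
        consumer_conn.close()
        producer_conn.close()
```

Calling `main()` runs the relay until interrupted. Publishing to the default exchange ("") with the queue name as routing key is just one option; a named exchange on broker2 would work the same way.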
