Excessive memory consumption of one server in cluster setup


Excessive memory consumption of one server in cluster setup

Matthias Reik-2
Hi,

Two days ago I upgraded our RabbitMQ cluster (2 machines running in HA mode) from 2.8.1 to 2.8.5, mainly to get the OOD fixes in, since we were experiencing exactly those issues.

The upgrade went very smoothly, but at some point one of the machines (server2) started to allocate more and more memory, even though all queues are more or less at 0 with almost no outstanding acks.

server1 uses ~200 MB
server2 (at the point where I took it down) used ~6 GB

I ran rabbitmqctl report, but it didn't give me any insights.
I ran rabbitmqctl eval 'erlang:memory().', but that didn't tell me much more either (output attached at the end).

I found people with similar problems:
http://grokbase.com/t/rabbitmq/rabbitmq-discuss/1223qcx3gg/rabbitmq-memory-usage-in-two-node-cluster
but that's a while back, so many things might have changed since then. Also, the memory difference there was rather minimal, whereas here the difference is _very_ significant, especially since it is the node with less load that has the increased memory footprint.

I can upgrade to 2.8.6 (unfortunately I upgraded just before it was released :-(), but I only want to do that if there is some hope that the problem is solved there.
I can bring server2 back online and try to investigate what is consuming that much memory, but my RabbitMQ/Erlang knowledge is not good enough, so I'm reaching out for some help.

So any help would be much appreciated.

Thx
Matthias

Our setup is something like the following:
    - 2 servers exclusively running RabbitMQ on CentOS 5.x (high watermark ~22 GB; see the config sketch below)
          - both with the web console enabled
          - both defined as disk nodes
          - both running RabbitMQ 2.8.5 on Erlang R15B01 after the upgrade (Erlang was already at R15 before)
    - 10 queues configured with mirroring
    - 3 queues configured (without mirroring) only on server1, with a TTL
    - most consumers connect to server1; server2 is used only in case of failover

We get about 1k messages/sec into the system (with peaks much higher than that), and each message is passed through several queues for processing.
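
For reference, that watermark corresponds to the vm_memory_high_watermark setting in rabbitmq.config, expressed as a fraction of installed RAM. A minimal sketch (the 0.4 fraction and the implied RAM size are assumptions, not something stated above):

%% /etc/rabbitmq/rabbitmq.config
%% 0.4 of installed RAM is the default fraction; a ~22 GB watermark would
%% then imply roughly 55 GB of RAM (assumed, not confirmed here)
[{rabbit, [{vm_memory_high_watermark, 0.4}]}].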

-bash-3.2$ sbin/rabbitmqctl eval 'erlang:memory().'
[{total,5445584424},
  {processes,2184155418},
  {processes_used,2184122352},
  {system,3261429006},
  {atom,703377},
  {atom_used,678425},
  {binary,3216386480},
  {code,17978535},
  {ets,4142048}]
...done.
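
Most of that total is in 'binary' (~3.2 GB of the ~5.4 GB), which as far as I understand usually means message payloads referenced from somewhere. Two quick checks, just as a sketch (I have not run these against the affected node; the "10" is an arbitrary cutoff):

-bash-3.2$ sbin/rabbitmqctl eval '
    lists:sublist(
        lists:reverse(lists:keysort(2,
            [{P, M} || P <- erlang:processes(),
                       {memory, M} <- [erlang:process_info(P, memory)]])),
        10).'

lists the ten Erlang processes with the largest memory footprint, and

-bash-3.2$ sbin/rabbitmqctl eval '[erlang:garbage_collect(P) || P <- erlang:processes()], erlang:memory().'

forces a full garbage collection and prints the totals again, which at least shows whether the binary memory is merely unreclaimed or still actively referenced.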

Re: Excessive memory consumption of one server in cluster setup

Matthias Reik
I just upgraded to 2.8.6, and I see the same effect with the latest version :-(
(not really unexpected, since nothing was fixed in that regard in 2.8.6).

Already after about 10 minutes server2 consumes 200% more memory than server1,
even though all clients are connected to server1, and there are no queues with TTL anymore.

(removing the other queues is unfortunately not possible, since it's a production
server :-(, and I haven't seen this effect on any of our other (development) servers)

Cheers
Matthias


Re: Excessive memory consumption of one server in cluster setup

Matthias Reik
Just to make sure, I stopped the application, reset it, and rejoined the cluster (on server2).
Thus, any "old" information in mnesia that could have caused that strange behavior should
be wiped out.
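
For the record, the reset/rejoin was roughly the following (a sketch only; on 2.8.x re-clustering uses the "cluster" command rather than the join_cluster of later releases, and the node names are just examples):

# on server2
sbin/rabbitmqctl stop_app
sbin/rabbitmqctl reset
# listing server2's own name as well keeps it a disk node
sbin/rabbitmqctl cluster rabbit@server1 rabbit@server2
sbin/rabbitmqctl start_app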

Cheers
Matthias


Re: Excessive memory consumption of one server in cluster setup

Matthias Radestock-3
In reply to this post by Matthias Reik
Matthias,

On 27/08/12 10:26, Matthias Reik wrote:
> I just upgraded to 2.8.6, and I see the same effect with the latest
> version :-(
> (not really unexpected, since nothing was fixed in that regard in 2.8.6).
>
> Already after about 10 minutes server2 consumes 200% more memory than
> server1,

Thanks for reporting this.

This is almost certainly the same bug as
http://rabbitmq.1065348.n5.nabble.com/Shovel-stops-receiving-acks-from-cluster-tp21384p21649.html

Regards,

Matthias.

Fwd: Excessive memory consumption of one server in cluster setup

Matthias Reik

Hi Matthias,

Even though our setup looks slightly different (we are not using the shovel plugin), the reason could be the same. We are explicitly ACKing the messages (i.e. no auto-ack), and the consumers are in the same data center (so we should have a reliable network), but if acks are being lost and that causes the memory increase in the server, then it could be the same bug.

Is there anything I could do to validate this assumption? I have provided initial logs to Francesco.

Is there anything I can do in the meantime to get back to a state where I have a working cluster? (Currently I have taken our second server out of the cluster, but that's of course a bit of a risky thing to do.)

Thx for your help, much appreciated.

Cheers
Matthias



Re: Fwd: Excessive memory consumption of one server in cluster setup

Matthias Radestock-3
Matthias,

On 27/08/12 16:02, Matthias Reik wrote:
> even though the setup looks slightly differently (since we are not
> using the shovel plugin), the reason could be the same. We are
> explicitly ACKing the messages (i.e. no auto-ack), even though the
> consumers are in the same data-centers (so we should have a reliable
> network), but if the acks are lost and that causes memory increase in
> the server then it could be the same bug.

As noted in my analysis, the bug has nothing to do with the shovel, or
consuming/acking - simply publishing to HA queues when (re)starting
slaves is sufficient to trigger it.

> Is there anything I could do to validate this assumption?

I don't think it's worth the hassle. I am quite certain that you are
suffering from the same bug.

> Is there anything I can do in the meantime to get into a state where I
> have a working cluster again

Pause all publishing before (re)starting any cluster nodes.

Regards,

Matthias.

Re: Fwd: Excessive memory consumption of one server in cluster setup

Laing, Michael P.
I have not yet run into this issue but I have a question:

Would it be appropriate to use 'rabbitmqctl set_vm_memory_high_watermark'
to set a low value and temporarily pause publishers, then restart the cluster
nodes, and then reset it to the normal value?

Thanks.

Michael


Re: Fwd: Excessive memory consumption of one server in cluster setup

Matthias Radestock-3
Michael,

On 27/08/12 16:53, Laing, Michael P. wrote:
> I have not yet run into this issue but I have a question:
>
> Would it be appropriate to use 'rabbitmqctl set_vm_memory_high_watermark'
> to a low value to temporarily pause publishers, then restart the cluster
> nodes, and then reset to the normal value?

Yes, that should work, i.e. you could lower the threshold on all nodes,
wait for a bit to let the broker work through any backlog, then bounce
individual nodes.

Note that restarting a node will reset the threshold, but since network
listeners only get enabled after queue slave initialisation has
completed, that should be fine.

The only time that doesn't work is when you bring down an entire cluster
and then start it back up. There you'd have to lower the threshold on
the first node that comes up before starting any subsequent nodes.
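
A rough sketch of that sequence (the node names and the 0.4 "normal" value are just examples; use whatever your configuration actually sets):

# on every node, block publishers by dropping the threshold to (almost) zero
sbin/rabbitmqctl -n rabbit@server1 set_vm_memory_high_watermark 0.001
sbin/rabbitmqctl -n rabbit@server2 set_vm_memory_high_watermark 0.001
# wait for the backlog to drain, bounce the nodes one at a time,
# then restore the normal fraction (0.4 is the default)
sbin/rabbitmqctl -n rabbit@server1 set_vm_memory_high_watermark 0.4
sbin/rabbitmqctl -n rabbit@server2 set_vm_memory_high_watermark 0.4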

Regards,

Matthias.

Re: Fwd: Excessive memory consumption of one server in cluster setup

Matthias Reik
In reply to this post by Matthias Radestock-3
See comments inline

Thanks
Matthias

On Mon, Aug 27, 2012 at 5:22 PM, Matthias Radestock <[hidden email]> wrote:
> Matthias,
>
> On 27/08/12 16:02, Matthias Reik wrote:
>> even though the setup looks slightly different (since we are not
>> using the shovel plugin), the reason could be the same. We are
>> explicitly ACKing the messages (i.e. no auto-ack), and the
>> consumers are in the same data center (so we should have a reliable
>> network), but if the acks are lost and that causes memory increase in
>> the server then it could be the same bug.
>
> As noted in my analysis, the bug has nothing to do with the shovel, or
> consuming/acking - simply publishing to HA queues when (re)starting
> slaves is sufficient to trigger it.

Wasn't sure I understood it 100% correctly (sorry, not too experienced with RabbitMQ yet). Thx for the confirmation.

>> Is there anything I could do to validate this assumption?
>
> I don't think it's worth the hassle. I am quite certain that you are
> suffering from the same bug.

OK, if you expect a fix for the issue to appear soon, then I could wait with "fixing" the cluster and try out any updated version. If it will take more time, then I will probably go for your suggested workaround (below).

>> Is there anything I can do in the meantime to get into a state where I
>> have a working cluster again
>
> Pause all publishing before (re)starting any cluster nodes.

Yes, that makes sense.

Thank you for your quick response.

> Regards,
>
> Matthias.



Re: Fwd: Excessive memory consumption of one server in cluster setup

Matthias Reik
> Pause all publishing before (re)starting any cluster nodes.
Just want to report back that the workaround did the trick :-) Of course the situation is not ideal, but we have a working cluster again.

Thx Matthias!

Cheers
Matthias
