Crash with RabbitMQ 3.1.5


David Harrison
Hi there,

Hoping someone can help me out.  We recently experienced 2 crashes with our RabbitMQ cluster.  After the first crash, we moved the Mnesia directories elsewhere, and started RabbitMQ again.  This got us up and running.  Second time it happened, we had the original nodes plus an additional 5 nodes we had added to the cluster that we were planning to leave in place while shutting the old nodes down.

During the crash symptoms were as follows:

- Escalating (and sudden) CPU utilisation on some (but not all) nodes
- Escalating memory usage (not necessarily aligned to the spiking CPU)
- Increasing time to publish on queues (and specifically on a test queue we have setup that exists only to test publishing and consuming from the cluster hosts)
- Running `rabbitmqctl cluster status` gets increasingly slow (some nodes eventually taking up to 10m to return with the response data - some were fast and took 5s)
- Management plugin stops responding, or responds so slowly it's no longer loading any data at all (probably the same thing that causes the preceding item)
- Can't force nodes to forget other nodes (calling `rabbitmqctl forget_cluster_node` doesn't return)

- When trying to shut down a node, running `rabbitmqctl stop_app` appears to block on epmd and doesn't return
--- When that doesn't return we eventually have to ctrl-c the command
--- We have to issue a kill signal to rabbit to stop it
--- Do the same to the epmd process
--- However the other nodes all still think that the killed node is active (based on `rabbitmqctl cluster status` -- both the nodes that were slow to run this and the ones that were fast saw the same view of the cluster, which included the dead node)


Config / details as follows (we use mirrored queues -- 5 hosts, all disc nodes, with a global policy that all queues are mirrored "ha-mode:all"), running on Linux

[
        {rabbit, [
                {cluster_nodes, {['[hidden email]', '[hidden email]','[hidden email]','[hidden email]','[hidden email]'], disc}},
                {cluster_partition_handling, pause_minority}
        ]}
]

And the env:

NODENAME="[hidden email]"
SERVER_ERL_ARGS="-kernel inet_dist_listen_min 27248 -kernel inet_dist_listen_max 27248"
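
(For completeness, the "all queues mirrored" policy mentioned above was applied cluster-wide with a rabbitmqctl policy along these lines -- the policy name "ha-all" here is illustrative rather than the exact name we used:)

        rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'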

The erl_crash.dump slogan error was : eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "old_heap").

System version : Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]

Compiled Fri Dec 16 03:22:15 2011
Taints (none)
Memory allocated 6821368760 bytes
Atoms 22440
Processes 4899
ETS tables 80
Timers 23
Funs 3994
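
(For anyone else who needs to dig into a dump like this: OTP's crashdump_viewer will load erl_crash.dump and show the same summary plus per-process information. Roughly, from an Erlang shell -- the UI it opens differs between OTP releases, webtool-based on older ones and wx-based later:)

        %% on any machine with Erlang/OTP installed
        1> crashdump_viewer:start().
        %% then load the erl_crash.dump file via the UI it brings up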

When I look at the Process Information it seems there's a small number of processes with a lot of messages queued, and the rest are an order of magnitude lower:

Pid Name/Spawned as State Reductions Stack+heap MsgQ Length
<0.400.0> proc_lib:init_p/5 Scheduled 146860259 59786060 37148
<0.373.0> proc_lib:init_p/5 Scheduled 734287949 1346269 23360
<0.366.0> proc_lib:init_p/5 Waiting 114695635 5135590 19744
<0.444.0> proc_lib:init_p/5 Waiting 154538610 832040 3326

when I view the second process (first one crashes erlang on me), I see a large number of sender_death events (not sure if these are common or highly unusual ?)

{'$gen_cast',{gm,{sender_death,<2710.20649.64>}}}

mixed in with other more regular events:

{'$gen_cast',
    {gm,{publish,<2708.20321.59>,
            {message_properties,undefined,false},
            {basic_message,
<.. snip..>

Re: Crash with RabbitMQ 3.1.5

Tim Watson-6
Hello David!

On 16 Oct 2013, at 15:14, David Harrison wrote:
> Hoping someone can help me out.  We recently experienced 2 crashes with our RabbitMQ cluster.  After the first crash, we moved the Mnesia directories elsewhere, and started RabbitMQ again.  This got us up and running.  Second time it happened, we had the original nodes plus an additional 5 nodes we had added to the cluster that we were planning to leave in place while shutting the old nodes down.
>

What version of rabbit are you running, and how was it installed?

> During the crash symptoms were as follows:
>
> - Escalating (and sudden) CPU utilisation on some (but not all) nodes

We've fixed at least one bug with that symptom in recent releases.

> - Increasing time to publish on queues (and specifically on a test queue we have setup that exists only to test publishing and consuming from the cluster hosts)

Are there multiple publishers on the same connection/channel when this happens? It wouldn't be unusual, if the server was struggling, to see flow control kick in and affect publishers in this fashion.

> - Running `rabbitmqctl cluster status` gets increasingly slow (some nodes eventually taking up to 10m to return with the response data - some were fast and took 5s)

Wow, 10m is amazingly slow. Can you provide log files for this period of activity and problems?

> - When trying to shut down a node, running `rabbitmqctl stop_app` appears to block on epmd and doesn't return

Again, we've fixed bugs in that area in recent releases.

> --- When that doesn't return we eventually have to ctrl-c the command
> --- We have to issue a kill signal to rabbit to stop it
> --- Do the same to the epmd process

Even if you have to `kill -9' a rabbit node, you shouldn't need to kill epmd. In theory at least. If that was necessary to fix the "state of the world", it would be indicative of a problem related to the erlang distribution mechanism, but I very much doubt that's the case here.
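
(If you do want to check the distribution layer without killing anything, epmd can be queried directly -- `epmd -names` from a shell, or the equivalent from an Erlang shell on that host; just a sketch:)

        %% lists the node names and distribution ports registered with the local epmd
        1> net_adm:names().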

> Config / details as follows (we use mirrored queues -- 5 hosts, all disc nodes, with a global policy that all queues are mirrored "ha-mode:all"), running on Linux
>

How many queues are we talking about here?

> [
>         {rabbit, [
>                 {cluster_nodes, {['[hidden email]', '[hidden email]','[hidden email]','[hidden email]','[hidden email]'], disc}},
>                 {cluster_partition_handling, pause_minority}

Are you sure that what you're seeing is not caused by a network partition? If it were, any nodes in a minority island would "pause", which would certainly lead to the kind of symptoms you've mentioned here, viz rabbitmqctl calls not returning and so on.

> The erl_crash.dump slogan error was : eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "old_heap").
>

That's a plain old OOM failure. Rabbit ought to start deliberately paging messages to disk well before that happens, which might also explain a lot of the slow/unresponsive-ness.
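
(For reference -- these are just the stock defaults, not something from your config -- the two settings that control that paging look roughly like this in rabbitmq.config; paging is meant to start at the paging ratio times the watermark, i.e. around 20% of RAM with the defaults:)

        [
          {rabbit, [
            {vm_memory_high_water_mark, 0.4},              %% publishers are blocked above 40% of RAM
            {vm_memory_high_water_mark_paging_ratio, 0.5}  %% paging to disk starts at half the watermark
          ]}
        ].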

> System version : Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]
>

I'd strongly suggest upgrading to R16B02 if you can. R14 is pretty ancient and a *lot* of bug fixes have appeared in erts + OTP since then.

> When I look at the Process Information it seems there's a small number of processes with a lot of messages queued, and the rest are an order of magnitude lower:
>

That's not unusual.

> when I view the second process (first one crashes erlang on me), I see a large number of sender_death events (not sure if these are common or highly unusual ?)
>
> {'$gen_cast',{gm,{sender_death,<2710.20649.64>}}}
>

Interesting - will take a look at that. If you could provide logs for the participating nodes during this whole time period, that would help a lot.

> mixed in with other more regular events:
>

Actually, sender_death messages are not "irregular" as such. They're just notifying the GM group members that another member (on another node) has died. This is quite normal with mirrored queues, when nodes get partitioned or stopped due to cluster recovery modes.

Cheers,
Tim


Re: Crash with RabbitMQ 3.1.5

Tim Watson-6
On 16 Oct 2013, at 15:29, Tim Watson wrote:
>> The erl_crash.dump slogan error was : eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "old_heap").
>>
>
> That's a plain old OOM failure. Rabbit ought to start deliberately paging messages to disk well before that happens, which might also explain a lot of the slow/unresponsive-ness.
>

Oh and BTW, you haven't changed the memory high watermark have you?

Re: Crash with RabbitMQ 3.1.5

David Harrison
Hey Tim,

No, still set at 40% as per the default -- we read up and found numerous posts indicating double that was possible during GC, so we stayed well clear ;-)

Cheers
Dave



Re: Crash with RabbitMQ 3.1.5

David Harrison
In reply to this post by Tim Watson-6
On 17 October 2013 01:29, Tim Watson <[hidden email]> wrote:
Hello David!


Hey Tim, thanks for replying so quickly!
 
On 16 Oct 2013, at 15:14, David Harrison wrote:
> Hoping someone can help me out.  We recently experienced 2 crashes with our RabbitMQ cluster.  After the first crash, we moved the Mnesia directories elsewhere, and started RabbitMQ again.  This got us up and running.  Second time it happened, we had the original nodes plus an additional 5 nodes we had added to the cluster that we were planning to leave in place while shutting the old nodes down.
>

What version of rabbit are you running, and how was it installed?

3.1.5, running on Ubuntu Precise, installed via deb package.
 

> During the crash symptoms were as follows:
>
> - Escalating (and sudden) CPU utilisation on some (but not all) nodes

We've fixed at least one bug with that symptom in recent releases.

I think 3.1.5 is the latest stable ??
 

> - Increasing time to publish on queues (and specifically on a test queue we have setup that exists only to test publishing and consuming from the cluster hosts)

Are there multiple publishers on the same connection/channel when this happens? It wouldn't be unusual, if the server was struggling, to see flow control kick in and affect publishers in this fashion.

Yes, in some cases there would be; for our test queue there wouldn't be -- we saw up to 10s on the test queue though
 

> - Running `rabbitmqctl cluster status` gets increasingly slow (some nodes eventually taking up to 10m to return with the response data - some were fast and took 5s)

Wow, 10m is amazingly slow. Can you provide log files for this period of activity and problems?

I'll take a look, we saw a few "too many processes" messages,

"Generic server net_kernel terminating" followed by :

** Reason for termination ==
** {system_limit,[{erlang,spawn_opt,
[inet_tcp_dist,do_setup,
[<0.19.0>,'[hidden email]',normal,
'[hidden email]',longnames,7000],
[link,{priority,max}]]},
{net_kernel,setup,4},
{net_kernel,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
 
=ERROR REPORT==== 15-Oct-2013::16:07:10 ===
** gen_event handler rabbit_error_logger crashed.
** Was installed in error_logger
** Last event was: {error,<0.8.0>,{emulator,"~s~n",["Too many processes\n"]}}
** When handler state == {resource,<<"/">>,exchange,<<"amq.rabbitmq.log">>}
** Reason == {aborted,
{no_exists,
[rabbit_topic_trie_edge,
{trie_edge,
{resource,<<"/">>,exchange,<<"amq.rabbitmq.log">>},
root,"error"}]}}


=ERROR REPORT==== 15-Oct-2013::16:07:10 === Mnesia(nonode@nohost): ** ERROR ** mnesia_controller got unexpected info: {'EXIT', <0.97.0>, shutdown}

=ERROR REPORT==== 15-Oct-2013::16:11:38 === Mnesia('[hidden email]'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, '[hidden email]'}
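
(For what it's worth, I believe a quick way to see how close a node is getting to the emulator's process limit is to run something like this from a remote shell attached to the node -- just a sketch:)

        %% current process count vs the configured maximum
        1> {erlang:system_info(process_count), erlang:system_info(process_limit)}.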



> - When trying to shut down a node, running `rabbitmqctl stop_app` appears to block on epmd and doesn't return

Again, we've fixed bugs in that area in recent releases.

> --- When that doesn't return we eventually have to ctrl-c the command
> --- We have to issue a kill signal to rabbit to stop it
> --- Do the same to the epmd process

Even if you have to `kill -9' a rabbit node, you shouldn't need to kill epmd. In theory at least. If that was necessary to fix the "state of the world", it would be indicative of a problem related to the erlang distribution mechanism, but I very much doubt that's the case here.

> Config / details as follows (we use mirrored queues -- 5 hosts, all disc nodes, with a global policy that all queues are mirrored "ha-mode:all"), running on Linux
>

How many queues are we talking about here?

~30
 

> [
>         {rabbit, [
>                 {cluster_nodes, {['[hidden email]', '[hidden email]','[hidden email]','[hidden email]','[hidden email]'], disc}},
>                 {cluster_partition_handling, pause_minority}

Are you sure that what you're seeing is not caused by a network partition? If it were, any nodes in a minority island would "pause", which would certainly lead to the kind of symptoms you've mentioned here, viz rabbitmqctl calls not returning and so on.

There was definitely a network partition, but the whole cluster nose dived during the crash
 

> The erl_crash.dump slogan error was : eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "old_heap").
>

That's a plain old OOM failure. Rabbit ought to start deliberately paging messages to disk well before that happens, which might also explain a lot of the slow/unresponsive-ness.

These hosts aren't running swap; we give them a fair bit of RAM (and gave them even more now as a possible stopgap)
 

> System version : Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]
>

I'd strongly suggest upgrading to R16B02 if you can. R14 is pretty ancient and a *lot* of bug fixes have appeared in erts + OTP since then.


ok good advice, we'll do that
 
> When I look at the Process Information it seems there's a small number of processes with a lot of messages queued, and the rest are an order of magnitude lower:
>

That's not unusual.

> when I view the second process (first one crashes erlang on me), I see a large number of sender_death events (not sure if these are common or highly unusual ?)
>
> {'$gen_cast',{gm,{sender_death,<2710.20649.64>}}}
>

Interesting - will take a look at that. If you could provide logs for the participating nodes during this whole time period, that would help a lot.

> mixed in with other more regular events:
>

Actually, sender_death messages are not "irregular" as such. They're just notifying the GM group members that another member (on another node) has died. This is quite normal with mirrored queues, when nodes get partitioned or stopped due to cluster recovery modes.

Cheers,
Tim



Re: Crash with RabbitMQ 3.1.5

David Harrison
Quick update on the queue count: 56


Re: Crash with RabbitMQ 3.1.5

Tim Watson-5
On 16 Oct 2013, at 16:34, David Harrison <[hidden email]> wrote:

Quick update on the queue count: 56

Hmm. That seems perfectly reasonable.

On 17 October 2013 02:29, David Harrison <[hidden email]> wrote:

What version of rabbit are you running, and how was it installed?

3.1.5, running on Ubuntu Precise, installed via deb package.

Of course - I missed that in the subject line. 

I think 3.1.5 is the latest stable ??

Yep.


I'll take a look, we saw a few "too many processes" messages,


That's not a good sign. I can't say we've run into that very frequently - it is possible to raise the limit (on the number of processes), but I suspect that's not the root of this anyway.
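
(If you did want to raise it, the emulator's +P flag is the relevant knob -- the number below is purely illustrative, and note that overriding SERVER_ERL_ARGS replaces the defaults, so keep any flags you already set there:)

        SERVER_ERL_ARGS="-kernel inet_dist_listen_min 27248 -kernel inet_dist_listen_max 27248 +P 1048576"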

"Generic server net_kernel terminating" followed by :

** Reason for termination ==
** {system_limit,[{erlang,spawn_opt,

Yeah - once that goes you're in trouble. That's an unrecoverable error, the equivalent of crashing the JVM.

There was definitely a network partition, but the whole cluster nose dived during the crash
 

Yeah, partitions are bad and can even become unrecoverable without restarts (which is why we warn against using clustering in some environments), but what you're experiencing shouldn't happen.


> The erl_crash.dump slogan error was : eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "old_heap").
>

That's a plain old OOM failure. Rabbit ought to start deliberately paging messages to disk well before that happens, which might also explain a lot of the slow/unresponsive-ness.

These hosts aren't running swap; we give them a fair bit of RAM (and gave them even more now as a possible stopgap)
 

This. I suspect the root of your problem is that you don't have any available swap and somehow ran out of memory. Rabbit should've been paging to disk (by hand, not via swap) once you got within a tolerance level of the high watermark, which is why I'd like to see logs if possible, since we might be able to identify what led to runaway process spawns and memory allocation during the partition. My money, for the memory use part, is on error_logger, which has been known to blow up in this way when flooded with large logging terms. During a partition, various things can go wrong leading to crashing processes such as queues, some of which can have massive state that gets logged, leading to potential OOM situations like this one. Replacing error_logger has been on our radar before, but we've not had strong enough reasons to warrant the expense. If what you've seen can be linked to that however...
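
(One mitigation worth knowing about, though I haven't verified how much it helps on a node of this vintage: the kernel application can be told to truncate the terms that error_logger formats, which limits the damage when something logs a huge state. A sketch of the extra section you'd add to rabbitmq.config alongside the existing rabbit section:)

        [
          {kernel, [
            {error_logger_format_depth, 50}   %% cap the depth of formatted log terms
          ]}
        ].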

To properly diagnose what you've seen though, I will need to get my hands on those logs. Can we arrange that somehow?

Cheers,
Tim


Re: Crash with RabbitMQ 3.1.5

Tim Watson-5
Also....

It is possible for a runaway application to create so many connections and channels that processes and/or memory become exhausted. Is it possible this happened in your case? It doesn't sound like it, since your troubles sound like they started with the partition, but it's good to confirm that. As well as the logs, can you post any rabbitmqctl status or report output from before/after the problems started?
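
(Roughly what I have in mind -- captured on one or two of the affected nodes; `report` bundles most of the other output anyway:)

        rabbitmqctl report > rabbit-report.txt
        rabbitmqctl list_connections
        rabbitmqctl list_channels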

BTW, you're not by any chance based in British Columbia, are you?

Cheers,
Tim Watson



Re: Crash with RabbitMQ 3.1.5

Nyah Pascal
Hi Tim,

We experienced a similar issue using RMQ 3.3.5 and Erlang R15B02.

My question is: if the error_logger is known to be an issue, do newer versions of Erlang (R16B02 and later) fix this problem?

If you take a look at the processes in the crash dump snapshot below, taken from our system, you will notice that error_logger accounts for a significant amount of the memory bloat.
[Snapshot of crash dump Processes view]

Thanks,

Pascal


 