Hi there,

Hoping someone can help me out. We recently experienced two crashes with our RabbitMQ cluster. After the first crash, we moved the Mnesia directories elsewhere and started RabbitMQ again, which got us up and running. The second time it happened, we had the original nodes plus an additional 5 nodes we had added to the cluster, which we were planning to leave in place while shutting the old nodes down.
During the crash, the symptoms were as follows:

- Escalating (and sudden) CPU utilisation on some (but not all) nodes
- Escalating memory usage (not necessarily aligned with the spiking CPU)
- Increasing time to publish on queues (and specifically on a test queue we have set up that exists only to test publishing and consuming from the cluster hosts)
- Running `rabbitmqctl cluster_status` gets increasingly slow (some nodes eventually taking up to 10m to return the response data; some were fast and took 5s)
- The management plugin stops responding, or responds so slowly it no longer loads any data at all (probably the same thing that causes the preceding item)
- We can't force nodes to forget other nodes (calling `rabbitmqctl forget_cluster_node` doesn't return)
- When trying to shut down a node, running `rabbitmqctl stop_app` appears to block on epmd and doesn't return
  - When it doesn't return, we eventually have to ctrl-c the command
  - We then have to issue a kill signal to rabbit to stop it
  - And do the same to the epmd process
  - However, the other nodes all still think the killed node is active (based on `rabbitmqctl cluster_status` -- both the nodes that were slow to run it and the ones that were fast saw the same view of the cluster, which included the dead node)
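To make the shutdown problem concrete, this is roughly the sequence we end up running by hand when a node gets wedged (the host names and pids below are placeholders):

    # On the wedged node -- stop_app blocks, so we eventually ctrl-c it:
    rabbitmqctl stop_app
    # Last resort: kill the beam process, and (reluctantly) epmd as well:
    kill -9 <rabbit beam pid>
    kill -9 <epmd pid>

    # From a surviving node -- neither of these behaves while the cluster is in this state:
    rabbitmqctl forget_cluster_node rabbit@deadhost   # never returns
    rabbitmqctl cluster_status                        # still lists the dead node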
Config / details as follows (we use mirrored queues -- 5 hosts, all disc nodes, with a global policy that all queues are mirrored, "ha-mode:all"), running on Linux:

    [
      {rabbit, [
        {cluster_nodes, {['[hidden email]', '[hidden email]', '[hidden email]', '[hidden email]', '[hidden email]'], disc}},
        {cluster_partition_handling, pause_minority}
      ]}
    ].

And the env:

    NODENAME="[hidden email]"
    SERVER_ERL_ARGS="-kernel inet_dist_listen_min 27248 -kernel inet_dist_listen_max 27248"
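For completeness, the mirroring policy was applied with something like the following (the policy name and pattern here are from memory, so treat them as illustrative):

    rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'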
The erl_crash.dump slogan error was:

    eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "old_heap").

    System version:   Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]
    Compiled:         Fri Dec 16 03:22:15 2011
    Taints:           (none)
    Memory allocated: 6821368760 bytes
    Atoms:            22440
    Processes:        4899
    ETS tables:       80
    Timers:           23
    Funs:             3994

When I look at the Process Information it seems there's a small number of processes with a LOT of messages queued, and the rest are an order of magnitude lower:
    Pid        Name/Spawned as    State      Reductions  Stack+heap  MsgQ Length
    <0.400.0>  proc_lib:init_p/5  Scheduled  146860259   59786060    37148
    <0.373.0>  proc_lib:init_p/5  Scheduled  734287949   1346269     23360
    <0.366.0>  proc_lib:init_p/5  Waiting    114695635   5135590     19744
    <0.444.0>  proc_lib:init_p/5  Waiting    154538610   832040      3326
When I view the second process (the first one crashes erlang on me), I see a large number of sender_death events (not sure if these are common or highly unusual?):

    {'$gen_cast',{gm,{sender_death,<2710.20649.64>}}}

mixed in with other more regular events:

    {'$gen_cast',
     {gm,{publish,<2708.20321.59>,
          {message_properties,undefined,false},
          {basic_message, <.. snip..>
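(Side note: on the nodes that are still up, something like the following should pull similar mailbox numbers live -- assuming `rabbitmqctl eval` is usable on our version; it should return the five processes with the longest message queues as {Pid, QueueLength} pairs:)

    rabbitmqctl eval '
      lists:sublist(
        lists:reverse(
          lists:keysort(2,
            [{P, N} || P <- erlang:processes(),
                       {message_queue_len, N} <- [erlang:process_info(P, message_queue_len)]])),
        5).'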
Hello David!
On 16 Oct 2013, at 15:14, David Harrison wrote:

> Hoping someone can help me out. We recently experienced 2 crashes with our RabbitMQ cluster. After the first crash, we moved the Mnesia directories elsewhere, and started RabbitMQ again. This got us up and running. Second time it happened, we had the original nodes plus an additional 5 nodes we had added to the cluster that we were planning to leave in place while shutting the old nodes down.

What version of rabbit are you running, and how was it installed?

> During the crash symptoms were as follows:
>
> - Escalating (and sudden) CPU utilisation on some (but not all) nodes

We've fixed at least one bug with that symptom in recent releases.

> - Increasing time to publish on queues (and specifically on a test queue we have set up that exists only to test publishing and consuming from the cluster hosts)

Are there multiple publishers on the same connection/channel when this happens? It wouldn't be unusual, if the server was struggling, to see flow control kick in and affect publishers in this fashion.

> - Running `rabbitmqctl cluster_status` gets increasingly slow (some nodes eventually taking up to 10m to return the response data; some were fast and took 5s)

Wow, 10m is amazingly slow. Can you provide log files for this period of activity and problems?

> - When trying to shut down a node, running `rabbitmqctl stop_app` appears to block on epmd and doesn't return

Again, we've fixed bugs in that area in recent releases.

> --- When that doesn't return we eventually have to ctrl-c the command
> --- We have to issue a kill signal to rabbit to stop it
> --- Do the same to the epmd process

Even if you have to `kill -9` a rabbit node, you shouldn't need to kill epmd. In theory at least. If that was necessary to fix the "state of the world", it would be indicative of a problem related to the erlang distribution mechanism, but I very much doubt that's the case here.

> Config / details as follows (we use mirrored queues -- 5 hosts, all disc nodes, with a global policy that all queues are mirrored "ha-mode:all"), running on Linux

How many queues are we talking about here?

> [
>   {rabbit, [
>     {cluster_nodes, {['[hidden email]', '[hidden email]', '[hidden email]', '[hidden email]', '[hidden email]'], disc}},
>     {cluster_partition_handling, pause_minority}

Are you sure that what you're seeing is not caused by a network partition? If it were, any nodes in a minority island would "pause", which would certainly lead to the kind of symptoms you've mentioned here, viz rabbitmqctl calls not returning and so on.

> The erl_crash.dump slogan error was: eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "old_heap").

That's a plain old OOM failure. Rabbit ought to start deliberately paging messages to disk well before that happens, which might also explain a lot of the slow/unresponsiveness.

> System version: Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]

I'd strongly suggest upgrading to R16B02 if you can. R14 is pretty ancient and a *lot* of bug fixes have appeared in erts + OTP since then.

> When I look at the Process Information it seems there's a small number of processes with a LOT of messages queued, and the rest are an order of magnitude lower:

That's not unusual.

> When I view the second process (the first one crashes erlang on me), I see a large number of sender_death events (not sure if these are common or highly unusual?):
>
> {'$gen_cast',{gm,{sender_death,<2710.20649.64>}}}

Interesting - will take a look at that. If you could provide logs for the participating nodes during this whole time period, that would help a lot.

> mixed in with other more regular events:

Actually, sender_death messages are not "irregular" as such. They're just notifying the GM group members that another member (on another node) has died. This is quite normal with mirrored queues, when nodes get partitioned or stopped due to cluster recovery modes.

Cheers,
Tim
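PS: to help rule the partition / flow-control theory in or out quickly, a couple of checks along these lines might be useful (if I remember right, 3.1.x reports partitions in cluster_status, and connections being throttled by flow control show a "flow" state):

    # Any connections sitting in the 'flow' state are being throttled
    rabbitmqctl list_connections name state

    # The cluster_status output should include a partitions section listing any
    # nodes that each node believes it has been partitioned from
    rabbitmqctl cluster_status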
On 16 Oct 2013, at 15:29, Tim Watson wrote:
>> The erl_crash.dump slogan error was: eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "old_heap").
>
> That's a plain old OOM failure. Rabbit ought to start deliberately paging messages to disk well before that happens, which might also explain a lot of the slow/unresponsiveness.

Oh, and BTW, you haven't changed the memory high watermark, have you?
Hey Tim,
No, still set at 40% as per the default -- we read up and found numerous posts indicating that double that is possible during GC, so we stayed well clear ;-)
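(For reference, these are the knobs as I understand them -- a sketch of how they would look in rabbitmq.config if we ever did change them; 0.4 and 0.5 are the defaults as far as I know:)

    [
      {rabbit, [
        %% fraction of system RAM at which rabbit blocks publishers (default 0.4)
        {vm_memory_high_watermark, 0.4},
        %% fraction of the watermark at which queues start paging messages to disk
        %% (default 0.5, i.e. paging would begin at roughly 20% of system RAM here)
        {vm_memory_high_watermark_paging_ratio, 0.5}
      ]}
    ].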
Cheers,
Dave

On 17 October 2013 01:34, Tim Watson <[hidden email]> wrote:
On 17 October 2013 01:29, Tim Watson <[hidden email]> wrote:
> Hello David!

Hey Tim, thanks for replying so quickly!
> What version of rabbit are you running, and how was it installed?

3.1.5, running on Ubuntu Precise, installed via deb package. I think 3.1.5 is the latest stable?
> Are there multiple publishers on the same connection/channel when this happens?

Yes, in some cases there would be; for our test queue there wouldn't be -- we saw up to 10s on the test queue though.
> Can you provide log files for this period of activity and problems?

I'll take a look. We saw a few "too many processes" messages, and a "Generic server net_kernel terminating" followed by:

    ** Reason for termination ==
    ** {system_limit,[{erlang,spawn_opt,
                         [inet_tcp_dist,do_setup,
                          [<0.19.0>,'[hidden email]',normal,
                           '[hidden email]',longnames,7000],
                          [link,{priority,max}]]},
                      {net_kernel,setup,4},
                      {net_kernel,handle_call,3},
                      {gen_server,handle_msg,5},
                      {proc_lib,init_p_do_apply,3}]}
> How many queues are we talking about here?

~30
> Are you sure that what you're seeing is not caused by a network partition?

There was definitely a network partition, but the whole cluster nose-dived during the crash.
> That's a plain old OOM failure. Rabbit ought to start deliberately paging messages to disk well before that happens.

These hosts aren't running swap; we give them a fair bit of RAM (and have given them even more now as part of a possible stop-gap).
> I'd strongly suggest upgrading to R16B02 if you can.

OK, good advice -- we'll do that.
Quick update on the queue count: 56
On 17 October 2013 02:29, David Harrison <[hidden email]> wrote:
On 16 Oct 2013, at 16:34, David Harrison <[hidden email]> wrote:
Hmm. That seems perfectly reasonable.
Of course - I missed that in the subject line.
Yep.
That's not a good sign. I can't say we've run into that very frequently - it is possible to raise the limit (on the number of processes), but I suspect that's not the root of this anyway.
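For reference, raising it would look roughly like this in rabbitmq-env.conf -- the +P value below is purely illustrative, and note that overriding SERVER_ERL_ARGS replaces the default emulator arguments, so keep any flags you already rely on:

    # Example only: raise the Erlang emulator's process limit (+P) alongside the
    # existing distribution port settings
    SERVER_ERL_ARGS="+P 1048576 -kernel inet_dist_listen_min 27248 -kernel inet_dist_listen_max 27248"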
Yeah - once that goes you're in trouble. That's an unrecoverable error, the equivalent of crashing the jvm.
Yeah, partitions are bad and can even become unrecoverable without restarts (which is why we warn against using clustering in some environments), but what you're experiencing shouldn't happen.
This. I suspect the root of your problem is that you don't have any available swap and somehow ran out of memory. Rabbit should've been paging to disk (by hand, not via swap) once you got within a tolerance level of the high watermark, which is why I'd like to see logs if possible, since we might be able to identify what led to runaway process spawns and memory allocation during the partition.

My money, for the memory use part, is on error_logger, which has been known to blow up in this way when flooded with large logging terms. During a partition, various things can go wrong, leading to crashing processes such as queues, some of which can have massive state that gets logged, leading to potential OOM situations like this one. Replacing error_logger has been on our radar before, but we've not had strong enough reasons to warrant the expense. If what you've seen can be linked to that, however...

To properly diagnose what you've seen, though, I will need to get my hands on those logs. Can we arrange that somehow?

Cheers,
Tim
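PS: if you want a quick data point on the error_logger theory in the meantime, something like this (again assuming `rabbitmqctl eval` works for you) will show error_logger's mailbox length and memory usage on a running node:

    rabbitmqctl eval 'erlang:process_info(whereis(error_logger), [message_queue_len, memory, total_heap_size]).'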
Also... it is possible for a runaway application to create so many connections and channels that processes and/or memory become exhausted. Is it possible this happened in your case? It doesn't sound like it, since your troubles sound like they started with the partition, but it's good to confirm that. As well as the logs, can you post any rabbitmqctl status or report output from before/after the problems started?

BTW, you're not by any chance based in British Columbia, are you?

Cheers,
Tim Watson
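PS: roughly what I mean, for concreteness (the output file name is just an example):

    # Full snapshot: status, connections, channels, queues, policies, etc.
    rabbitmqctl report > rabbitmq-report-$(hostname).txt

    # Quick check for runaway connection/channel counts
    # (the counts include a couple of header/footer lines printed by rabbitmqctl)
    rabbitmqctl list_connections | wc -l
    rabbitmqctl list_channels | wc -l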
Hi Tim,
We experienced a similar issue using RMQ 3.3.5 and Erlang R15B02. My question is: if the error_logger is known to be an issue, do newer versions of Erlang (R16B02 and later) fix this problem? If you take a look at the processes in the crash dump from our system (below), you will notice that error_logger accounts for a significant amount of the memory bloat.

[image: crash dump process listing showing error_logger memory usage]

Thanks,
Pascal