beam blocking

carlhoerberg
I have a cluster where the nodes sometimes crash under high load. Nothing is shown in the RabbitMQ logs, but this shows up in syslog:

kernel: [582840.748073] INFO: task beam:9794 blocked for more than 120 seconds.
kernel: [582840.748082] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: [582840.748088] beam            D ffff8800efc13700     0  9794   9718 0x00000000
kernel: [582840.748092]  ffff8800e7bc3cb8 0000000000000282 0000000000000000 ffffffffffffffe0
kernel: [582840.748095]  ffff8800e7bc3fd8 ffff8800e7bc3fd8 ffff8800e7bc3fd8 0000000000013700
kernel: [582840.748098]  ffff88000249c4d0 ffff88000249ade0 00007f7c2aa289e0 ffff8800024b1180
kernel: [582840.748101] Call Trace:
kernel: [582840.748109]  [<ffffffff81659bbf>] schedule+0x3f/0x60
kernel: [582840.748113]  [<ffffffff8106c535>] exit_mm+0x85/0x130
kernel: [582840.748116]  [<ffffffff8106c74e>] do_exit+0x16e/0x450
kernel: [582840.748120]  [<ffffffff8109e4d9>] ? futex_wait_queue_me+0xc9/0x100
kernel: [582840.748122]  [<ffffffff8109e14f>] ? __unqueue_futex+0x3f/0x80
kernel: [582840.748126]  [<ffffffff8107ad4a>] ? __dequeue_signal+0x6a/0xb0
kernel: [582840.748128]  [<ffffffff8106cbd4>] do_group_exit+0x44/0xa0
kernel: [582840.748131]  [<ffffffff8107d8cc>] get_signal_to_deliver+0x21c/0x420
kernel: [582840.748135]  [<ffffffff81014825>] do_signal+0x45/0x130
kernel: [582840.748137]  [<ffffffff810a126c>] ? do_futex+0x7c/0x1b0
kernel: [582840.748139]  [<ffffffff810a14e2>] ? sys_futex+0x142/0x1a0
kernel: [582840.748142]  [<ffffffff81091d7f>] ? __put_cred+0x3f/0x50
kernel: [582840.748144]  [<ffffffff81014ad5>] do_notify_resume+0x65/0x80
kernel: [582840.748147]  [<ffffffff81664350>] int_signal+0x12/0x17

RabbitMQ 3.3.1, Erlang 17, Ubuntu 12.04
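
For anyone who wants to spot these warnings automatically, here is a minimal Python sketch that pulls hung-task messages out of syslog-style lines. The function name find_hung_tasks is made up for illustration, and the pattern is based only on the message format shown above; other kernel versions may word the message differently.

```python
import re

# Pattern based on the warning quoted above, e.g.
# "INFO: task beam:9794 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a list of (command, pid, seconds) for each hung-task warning.

    `lines` is any iterable of log lines (a list, or an open file object).
    Lines that do not contain the warning are ignored.
    """
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


# Example using two lines from the log above.
sample = [
    "kernel: [582840.748073] INFO: task beam:9794 blocked for more than 120 seconds.",
    "kernel: [582840.748101] Call Trace:",
]
for comm, pid, secs in find_hung_tasks(sample):
    print(f"{comm} (pid {pid}) was blocked for more than {secs} seconds")
```

To scan a real log, pass an open file instead of the sample list; the log path depends on the distribution.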

Re: beam blocking

Brett Cameron
Carl,

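Possibly an I/O problem. When the operating system flushes cached file system data to disk, writes can become blocked/synchronous while the flush is in progress, and the kernel logs this warning once a task has been blocked for longer than the hung-task timeout (120s by default). If your server has a lot of RAM and slow storage, and you're doing a lot of writes, this could happen. The kernel param vm.dirty_ratio (the percentage of memory that may hold dirty, not-yet-written file system data before writers are forced to flush) defaults to 40% on some older kernels and is lower on newer ones; check the current value with "sysctl vm.dirty_ratio" (it can be set persistently in /etc/sysctl.conf). You could try lowering this value to increase the frequency of I/O flushes but reduce their size...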

Brett
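
To get a feel for how much unwritten data a given vm.dirty_ratio allows, a rough Python sketch along these lines may help. The helper names (parse_meminfo, dirty_limit_kb, live_report) are hypothetical. Note that the kernel applies the ratio to free plus reclaimable memory, not to total RAM, so using MemTotal as done here gives only an upper bound.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts and are returned as-is.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_limit_kb(memory_kb, dirty_ratio_percent):
    """Approximate dirty-data limit in kB for a given ratio.

    Assumption: memory_kb stands in for the kernel's "dirtyable" memory;
    passing MemTotal overestimates the real limit.
    """
    return memory_kb * dirty_ratio_percent // 100


def live_report():
    """Print current values on a Linux host (reads from /proc)."""
    with open("/proc/sys/vm/dirty_ratio") as f:
        ratio = int(f.read().strip())
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    limit = dirty_limit_kb(info["MemTotal"], ratio)
    print(f"vm.dirty_ratio = {ratio}%")
    print(f"upper bound on dirty data: {limit} kB")
    print(f"currently dirty: {info.get('Dirty', 0)} kB")
```

Calling live_report() on the affected node before and after changing the setting shows whether the amount of dirty data gets anywhere near the limit under load.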



On Thu, May 8, 2014 at 1:36 PM, carlhoerberg wrote:
Have a cluster where the nodes sometimes crash on high load [...]

--
View this message in context: http://rabbitmq.1065348.n5.nabble.com/beam-blocking-tp35412.html
Sent from the RabbitMQ mailing list archive at Nabble.com.
_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss



Re: beam blocking

carlhoerberg
Ah, makes sense, I'm trying that, thanks!

Re: beam blocking

mc717990
Carl:

Also watch out for some nasty kernel bugs. What OS are you running? We saw the same error messages, and the cause wasn't disk I/O but a kernel bug on RHEL 6.2 (Oracle Enterprise Linux, to be precise): there's an uptime bug on 6.2 that bit us, and it looked like a disk I/O error.

Jason





--
Jason McIntosh
https://github.com/jasonmcintosh/
573-424-7612


Re: beam blocking

carlhoerberg
Ubuntu 12.04, with all upgrades applied. It might have been an EC2 EBS problem. We changed the instance type and enabled EBS optimization.