Odd behavior where server stops responding


Odd behavior where server stops responding

mc717990
This may be a 3.0.4-1 issue (or Erlang esl-erlang-R15B03-2.x86_64) and I just may need to upgrade rabbit, but I thought I'd see if anyone had seen this before.  This is on Oracle Enterprise Linux 6.2.

A few days after a software raid filesystem check (which may or may not be related - it's the only thing I can see that's common), on three separate servers, rabbit just completely hung.  I couldn't even do an lsof on any file system.  The port was still responsive, but the rabbit process itself was completely hung.  CPU use jumped up to 25% on the beam process and just stayed there...  I killed all the rabbit processes and tried to restart them.  There was nothing in the logs and the startup failed.  I killed everything again, including the EPMD process, and then rabbit was finally able to start.  Within a few moments, though, the beam hung again - I see a few connections show up in the logs and then the process is non-responsive.

I'm GUESSING this is an OS-level issue, and I'd swear I've seen this before.  I've had to do a full restart of the server to get things back to a decent state.  Anyone have any advice/ideas?

Jason

--
Jason McIntosh
https://github.com/jasonmcintosh/
573-424-7612


Re: Odd behavior where server stops responding

Tim Watson-6

On 11 Mar 2014, at 16:41, Jason McIntosh wrote:

> This may be a 3.0.4-1 issue (or erlang esl-erlang-R15B03-2.x86_64) and I just may need to upgrade rabbit but I thought I'd see if anyone had seen this before.  This is on oracle enterprise linux 6.2.  
>
> A few days after a software raid filesystem check (may or may not be related, only thing I can see that's common), on three separate servers, rabbit just completely hung.  I couldn't even do an lsof on any file system.  

That sounds like the OS hung, not rabbit?

> The port was still responsive, but the rabbit process itself was completely hung.  CPU use jumped up to 25% on the beam process and just stayed there...

25% isn't exactly massive, though if the spike in CPU wasn't associated with an increase in messaging traffic then something could be wrong.

>  I killed all the rabbit processes and tried to restart them.  There was nothing in the logs and the startup failed

How did it fail? Non-zero return code(s) or more?

> I killed everything again, including the EPMD process and then rabbit was finally able to start.  Within a few moments though the beam hung again - I see a few connections show in the logs and then the process is non-responsive.

How did the beam process hang and in what way did the broker become unresponsive? How are you trying to interact with the broker? Are we talking about rabbitmqctl commands not responding, or something else?

>
> I'm GUESSING this is an OS level issue, and I'd swear I've seen this before.

I've not heard of this before either. If you can get a rabbit node up and running then we may be able to run some diagnostic commands, depending on whether or not rabbitmqctl commands are working. There are other ways to access a console on a running node too, though they're a bit more involved.
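
For example (just a sketch, assuming rabbitmqctl can actually reach the node; the report path is arbitrary):

rabbitmqctl status
rabbitmqctl cluster_status
rabbitmqctl report > /tmp/rabbit-report.txt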

>  I've had to do a full restart of the server to get things back to a decent state.  Anyone have any advice/ideas?
>

Please clarify what you mean by hung and non-responsive. Also, do you now have the broker running without any of these issues? Can you fire up a standalone (test) node in a VM or some such, run the broker (successfully) on it, and re-run the raid check to see if it causes the same problem?
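
(If it helps, a raid data-check can be kicked off by hand - just a sketch, assuming md0 is one of the arrays involved and you're running as root:)

echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat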

It would be good to get to the bottom of this, though clearly without any logs it's going to be tricky.

Cheers,
Tim


Re: Odd behavior where server stops responding

mc717990
I never could get any node up - nothing showed up in the startup, shutdown, regular, or sasl logs.  One interesting thing: on the startup after killing all the processes (including epmd), it appears to have started multiple beams instead of the typical one.

By non-responsive I mean rabbitmqctl shows the node as being down.  I could telnet to both the management port and the rabbitmq port (which I'm guessing is the epmd process), but nothing shows in the log files for rabbit itself, nothing in the sasl logs, no content from the management port, etc.  The OS was completely responsive - I could get to the shell, do most commands, though lsof (I think I mentioned this) wouldn't respond in any timely manner while the rabbit server was running.  All the ways I know of to talk to rabbit and all the OS things I know to do were failing, and all the things I know to try to restart it (killing the processes, waiting for network connections in TIME_WAIT to drain, killing EPMD as well) all failed.

At that point, I recycled each of the servers and they're back to a running state.  I don't know that I have a box handy, but I'll see what I can do to replicate this.  For some reason I think I've seen this before, and it's something with the OEL 6.2 kernel that Oracle put together, dealing with disk I/O flushing on a journaled file system.  I'm pretty sure I've seen the same thing about 6 months ago on a completely different set of servers.

Jason







--
Jason McIntosh
https://github.com/jasonmcintosh/
573-424-7612


Re: Odd behavior where server stops responding

Tim Watson-6
Hi Jason,

On 12 Mar 2014, at 14:32, Jason McIntosh wrote:

> I never could get any node up - nothing showed up in the logs for startup, shutdown, regular and sasl logs.  One interesting thing is on the startup after killing all the processes (including epmd) it appears to have started multiple beams instead of the typical one.  

Well, if there are rabbit (i.e., beam.smp) processes running, then you _do_ have a node up, though not necessarily responding properly. That's not the same as the program refusing to start though.

>
> By non-responsive, rabbitmqctl shows the node as being down,

Could this be a file system corruption issue? Have you checked all the usual suspects for when rabbitmqctl won't connect to a node that you know is running, like checking the erlang cookies match?
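
(A quick way to check the cookies, just as a sketch - the paths assume the stock RPM layout, with the server running as the rabbitmq user and rabbitmqctl being run as root:)

sudo md5sum /var/lib/rabbitmq/.erlang.cookie /root/.erlang.cookie

The two sums should be identical.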

> but I could telnet to both the management port and the rabbitmq port (which I'm guessing is the epmd process)

I'm not really sure what you mean by "the rabbitmq port" - we could be talking about several things here: (a) the port on which the broker accepts AMQP connections, (b) the port on which the broker accepts distributed Erlang connections (which is assigned by epmd), or (c) something else entirely.

> , but nothing shows in the log files for rabbit itself, nothing in the sasl logs, no content from the management port, etc.

Hmm, so you're sure (via ps and/or top) that there are beam.smp processes running, and you can see (via netstat) that the management HTTP port is in use, but there's no response from the HTTP (management) server?
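
Something like the following would confirm each of those (a sketch - the ports assume the 3.0.x defaults of 5672 for AMQP and 15672 for management):

ps aux | grep '[b]eam.smp'
netstat -tlnp | grep -E ':5672|:15672'
curl -i http://localhost:15672/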

>  The OS was completely responsive - I could get to the shell, do most commands, though lsof (I think I mentioned this) wouldn't respond in any timely manner while rabbit server was running.

That _is_ very strange.

>  All the ways I know of to talk to rabbit and all the os things i know to do were failing and all the things I know to try and restart it (killing the processes, waiting for network connections in TIME_WAIT to drain, killing EPMD as well) all failed.
>
> At that point, I recycled each of the servers and they're back to a running state.  I don't know that I have a box handy but I'll see what I can do to replicate this.  For some reason I think I've seen this before and it's something with the OEL 6.2 kernel that Oracle put together, dealing with disk io flush on a journaled file system.  I'm pretty sure I've seen the same thing about 6 months ago or so on a completely different set of servers.

Urgh, that sounds horrible. The more info you can provide us with the better. If you can replicate, that would be amazing since we can do the same thing and investigate.

Cheers,
Tim


Re: Odd behavior where server stops responding

mc717990
The last time I saw this, I found an article talking about how the OEL 6.2 kernels, on servers with large amounts of RAM, could build up a large buffer cache on the journal before flushing to disk, which would cause long I/O times and potentially crash the OS or other processes.  We just saw this on a completely different server the other morning, one not running rabbitmq (all kinds of stuff in the logs about 120-second wait times, ssh stops responding, you have to get into the iLO or VMware console to see them).  In those cases the OS just stopped responding outright, though, so it might not be related to the issues I saw on these three servers.  But this is why I suspect an issue with the kernel and disk I/O, particularly since lsof wasn't working, and given the software raid checks in the logs.  Note that when I DID kill rabbit and epmd, I checked lsof on both the /data partition where I have the mnesia database and on /var/log, looking for any sign of a problem, and it did work when rabbit wasn't running.

When I say "refusing to start", I mean the init script never completed and exited.  It started the process, and that process never responded (I'm guessing), or the init script would have finished whatever it does.  I wish I still had the ps aufx results I took when I saw this - it looked fairly normal, with some things missing.  Here's what we have now:
root      2484  0.0  0.0 108488  1116 ?        S    Mar11   0:00 /bin/sh /etc/rc3.d/S80rabbitmq-server start
root      2490  0.0  0.0 108216  1444 ?        S    Mar11   0:00  \_ /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/sbin/rabbitmq-server
root      2493  0.0  0.0 108216  1488 ?        S    Mar11   0:00      \_ /bin/sh /usr/sbin/rabbitmq-server
root      2506  0.0  0.0 163228  2160 ?        S    Mar11   0:00          \_ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq  2511  5.9  0.0 3881272 79464 ?       Ssl  Mar11  87:11              \_ /usr/lib/erlang/erts-5.9.3.1/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -h
rabbitmq  2676  0.0  0.0  10844   540 ?        Ss   Mar11   0:00                  \_ inet_gethost 4
rabbitmq  2677  0.0  0.0  12948   712 ?        S    Mar11   0:00                      \_ inet_gethost 4

When I was having the issues, I saw duplicates of the beam.smp processes running, and what appeared to be duplicate start commands as well, but no inet_gethost 4 running.  A netstat showed nothing on 5672 or 15672 - neither listening nor connected, nor in any other state.

By "telnetting to the port" I mean literally I did these:
telnet <server> 5672
telnet <server> 15672
I entered a bunch of garbage in both cases, and nothing happened.  At the very least on the management port, typing "GET /" should have produced a response with a request for authorization or similar.  And yes, before doing this, beam and the epmd process were running.
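
(For what it's worth, something like this might be quicker than raw telnet next time - just a sketch, with <server> and the default ports as placeholders.  An unauthenticated request to the management API should at least come back with a 401 if the plugin is alive, and epmd -names shows which distribution port the node has registered:)

curl -i http://<server>:15672/api/overview
epmd -names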

Please note these servers are NOT clustered together - they're completely independent.  They all just failed around the same time (within about a four-hour window).  They'd all been started around the same time back in August.

Next time I see this I'll try to get an strace on the beam process and see if I can get some more information.  As I said, there's definitely SOMETHING screwy here, and the way it behaved felt very much like an OS lock of some sort (a semaphore lock or something similar), and I'd guess rabbit (well, Erlang) didn't handle it very well.  The only thing I found even remotely suspicious was the md check in the syslog:

Mar  9 03:22:02 mongobdc14p kernel: imklog 4.6.2, log source = /proc/kmsg started.
Mar  9 03:22:02 mongobdc14p rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="5815" x-info="http://www.rsyslog.com"] (re)start
Mar  9 04:48:19 mongobdc14p kernel: md: md6: data-check done.
Mar  9 04:48:19 mongobdc14p kernel: md: data-check of RAID array md5
Mar  9 04:48:19 mongobdc14p kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Mar  9 04:48:19 mongobdc14p kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Mar  9 04:48:19 mongobdc14p kernel: md: using 128k window, over a total of 104856508 blocks.
Mar  9 04:48:19 mongobdc14p kernel: md: delaying data-check of md0 until md5 has finished (they share one or more physical units)
Mar  9 04:48:19 mongobdc14p kernel: md: delaying data-check of md1 until md5 has finished (they share one or more physical units)
Mar  9 05:03:45 mongobdc14p kernel: md: md5: data-check done.
Mar  9 05:03:45 mongobdc14p kernel: md: data-check of RAID array md0
Mar  9 05:03:45 mongobdc14p kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Mar  9 05:03:45 mongobdc14p kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Mar  9 05:03:45 mongobdc14p kernel: md: using 128k window, over a total of 524276 blocks.
Mar  9 05:03:45 mongobdc14p kernel: md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Mar  9 05:03:50 mongobdc14p kernel: md: md0: data-check done.
Mar  9 05:03:50 mongobdc14p kernel: md: data-check of RAID array md1
Mar  9 05:03:50 mongobdc14p kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Mar  9 05:03:50 mongobdc14p kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Mar  9 05:03:50 mongobdc14p kernel: md: using 128k window, over a total of 31456188 blocks.
Mar  9 05:08:53 mongobdc14p kernel: md: md1: data-check done.
Mar 10 14:40:01 mongobdc14p auditd[21812]: Audit daemon rotating log files
Mar 11 11:46:46 mongobdc14p init: Failed to spawn splash-manager main process: unable to open console: Input/output error

(11:46 was when I finally gave up trying to debug and just restarted the whole server, so you can ignore the I/O error on the splash-manager.)  Cron jobs were all running fine, and there were no problems anyplace that I could actually see.  Before things stopped responding on the 11th, rabbit used a few more file descriptors and then stopped responding about 10am.  Memory was a bit spiky but nothing out of the ordinary.  Disk free hadn't changed at all to speak of (a few MB up and down).

9.1 rabbitmq /usr/lib/erlang/erts-5.9.3.1/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.0.4/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@mongobdc14p -boot start_sasl -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/data/rabbitmq/[hidden email]"} -rabbit sasl_error_logger {file,"/data/rabbitmq/[hidden email]"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.0.4/sbin/../plugins" -rabbit plugins_expand_dir "/data/rabbitmq/rabbitmq/mnesia/rabbit@mongobdc14p-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/data/rabbitmq/rabbitmq/mnesia/rabbit@mongobdc14p"

If you'd like, I've got more graphs and data I can share (I had the zabbix agent running the zabbix-rabbitmq monitoring stuff I wrote on that box), but I really haven't been able to find anything wrong.  The management API just stopped responding at 10:32, context switches on the kernel suddenly dropped - it's like everything just suddenly stopped doing anything.

At this point I'd chalk this up as an "anomaly that shouldn't require a whole lot of investigation" kind of situation, so don't worry about it too heavily.  Or blame Oracle - I like blaming Oracle :)
Jason




--
Jason McIntosh
https://github.com/jasonmcintosh/
573-424-7612


Re: Odd behavior where server stops responding

Michael Klishin-2

On 12 Mar 2014, at 21:23, Jason McIntosh <[hidden email]> wrote:

> rabbitmq  2676  0.0  0.0  10844   540 ?        Ss   Mar11   0:00                  \_ inet_gethost 4
> rabbitmq  2677  0.0  0.0  12948   712 ?        S    Mar11   0:00                      \_ inet_gethost 4

Jason,

Can you please post your /etc/hosts file (or send it to me off list).
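
(Alongside that, a quick sketch of what the box actually resolves for its own short and full hostnames might help:)

cat /etc/hosts
getent hosts $(hostname -s)
getent hosts $(hostname -f)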

MK

Software Engineer, Pivotal/RabbitMQ



Re: Odd behavior where server stops responding

mc717990
So we finally figured out the issue.  Apparently there's a rather nasty bug in the 2.6.32 kernels; there are a number of sites listing the problem and upgrade notes.

The behavior on this for us was really kind of random.  We've had some systems just completely freeze, with the OS totally hung.  Other systems seem to manage to make it past the limit by a day or two and then die.  In the case of the servers I mentioned, only the rabbit process itself hung.  But the one consistent thing we discovered was that all the servers having an issue had an uptime right around 208 days - we missed this in our initial investigation.  So if anyone is running CentOS 6.2, upgrade!

Jason


--
Jason McIntosh
https://github.com/jasonmcintosh/
573-424-7612
