rabbitmq nodedown - nodedown - Generic server rabbit_disk_monitor terminating

Michael Sander
Hi,

rabbitmq runs fine most of the time, but every few days it crashes. Restarting works, but then it crashes again after a few more days. Has anyone experienced the following before?

Here is the result of rabbitmqctl status:
$ sudo rabbitmqctl status
Status of node 'rabbit@ocr-proc-1' ...
Error: unable to connect to node 'rabbit@ocr-proc-1': nodedown

DIAGNOSTICS
===========

nodes in question: ['rabbit@ocr-proc-1']

hosts, their running nodes and ports:
- ocr-proc-1: [{rabbitmqctl28222,41934}]

current node details:
- node name: 'rabbitmqctl28222@ocr-proc-1'
- home dir: /var/lib/rabbitmq
- cookie hash: d0gr0DxCd08BG2w9+0Yy6Q==


And the following are the logs. Let me know if there are any other files/configs I should send along.
=ERROR REPORT==== 28-Oct-2013::19:48:36 ===
** Generic server rabbit_disk_monitor terminating
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@ocr-proc-1",
                               1000000000,3090534400,10000,
                               {interval,#Ref<0.0.47.196889>},
                               false}
** Reason for termination ==
** {{badmatch,[]},
    [{rabbit_disk_monitor,parse_free_unix,1,[]},
     {rabbit_disk_monitor,internal_update,1,[]},
     {rabbit_disk_monitor,handle_info,2,[]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,607}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
=INFO REPORT==== 28-Oct-2013::19:48:36 ===
Disabling disk free space monitoring on unsupported platform: {{'EXIT',
                                                                {{badmatch,[]},
                                                                 [{rabbit_disk_monitor,
                                                                   parse_free_unix,
                                                                   1,[]},
                                                                  {rabbit_disk_monitor,
                                                                   init,1,[]},
                                                                  {gen_server,
                                                                   init_it,6,
                                                                   [{file,
                                                                     "gen_server.erl"},
                                                                    {line,
                                                                     304}]},
                                                                  {proc_lib,
                                                                   init_p_do_apply,
                                                                   3,
                                                                   [{file,
                                                                     "proc_lib.erl"},
                                                                    {line,
                                                                     227}]}]}},
                                                               1887428608}

_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Re: rabbitmq nodedown - nodedown - Generic server rabbit_disk_monitor terminating

Michael Klishin-2
On 29 Oct 2013, at 08:56, Michael Sander <[hidden email]> wrote:

> Error: unable to connect to node 'rabbit@ocr-proc-1': nodedown

rabbitmqctl cannot reach rabbit@ocr-proc-1. I assume that happened after
the node went down.

> And the following are the logs. Let me know if there are any other files/configs I should send along.

What OS do you run? The log message means that RabbitMQ cannot detect available disk space
on your platform and disables the disk monitor. It does not mean that the entire RabbitMQ
process is terminating.

One reason for obscure node termination may be that the node simply runs out of disk space
without monitoring in place.


MK

Software Engineer, Pivotal/RabbitMQ

Re: rabbitmq nodedown - nodedown - Generic server rabbit_disk_monitor terminating

Michael Klishin-2
In reply to this post by Michael Sander
On 29 Oct 2013, at 08:56, Michael Sander <[hidden email]> wrote:

> ** Reason for termination ==
> ** {{badmatch,[]},
>     [{rabbit_disk_monitor,parse_free_unix,1,[]},
>      {rabbit_disk_monitor,internal_update,1,[]},
>      {rabbit_disk_monitor,handle_info,2,[]},
>      {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,607}]},
>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
> =INFO REPORT==== 28-Oct-2013::19:48:36 ===
> Disabling disk free space monitoring on unsupported platform: {{'EXIT',
>                                                                 {{badmatch,[]},
>                                                                  [{rabbit_disk_monitor,
>                                                                    parse_free_unix,
>                                                                    1,[]},
>                                                                   {rabbit_disk_monitor,
>                                                                    init,1,[]},
>                                                                   {gen_server,
>                                                                    init_it,6,
>                                                                    [{file,
>                                                                      "gen_server.erl"},
>                                                                     {line,
>                                                                      304}]},
>                                                                   {proc_lib,
>                                                                    init_p_do_apply,
>                                                                    3,
>                                                                    [{file,
>                                                                      "proc_lib.erl"},
>                                                                     {line,
>                                                                      227}]}]}},
>                                                                1887428608}

Michael,

Is there a more or less reliable way to reproduce the issue? E.g. what OS, RabbitMQ
version and RabbitMQ configuration can we try? What is your workload like?

Thank you.

MK

Software Engineer, Pivotal/RabbitMQ




Re: rabbitmq nodedown - nodedown - Generic server rabbit_disk_monitor terminating

Michael Sander
Hi Michael,

Version: I'm running Debian Linux on a Google Compute Engine instance with RabbitMQ 3.1.3 (more version info below).

Workload: I'm using rabbitmq as part of a process to OCR many PDFs. I publish many URLs pointing to PDFs into rabbitmq. Then a consumer pulls a URL out of rabbitmq, downloads the file, attempts to OCR it, and sends the result to another server.

Diskspace: During OCR, my app writes, reads, and deletes a lot of temporary files, so it is possible that I briefly ran out of free disk space. But I ran df afterwards and it looks like I have enough space (see below). Even if I did run out of disk space, shouldn't rabbitmq be somewhat graceful about it? Perhaps it should refuse any new jobs while there is no space but then come back online once it detects there is space available.

Reproducing: Unfortunately nothing reliably reproduces it. This is one of those annoying situations where everything works fine for a week and then all of a sudden it goes down. I'm going to set up a script that will email me when my disk usage goes over 90%, so hopefully that will help identify the issue.
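
A monitoring script along those lines could look roughly like the sketch below. It is only an illustration, not something posted in this thread: the function names and the 90% default are assumptions, and the alert is a plain print where a real script would send mail. Note that `shutil.disk_usage` computes used/total, which can differ slightly from the Use% column that df reports.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(percent_used: float, threshold: float = 90.0) -> bool:
    """True when usage exceeds the alert threshold (90% is an assumed default)."""
    return percent_used > threshold


if __name__ == "__main__":
    pct = disk_usage_percent("/")
    if over_threshold(pct):
        # Placeholder for the email alert; sending mail is left out here.
        print(f"WARNING: disk usage on / is at {pct:.1f}%")
```

Run from cron every minute or so, a check like this has a chance of catching a short-lived spike that a manual df after the crash would miss.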

Here is some additional version and disk usage information that may be useful
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
rootfs           10G  6.9G  2.7G  73% /
/dev/root        10G  6.9G  2.7G  73% /
none            899M     0  899M   0% /dev
tmpfs           180M  4.1M  176M   3% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           360M     0  360M   0% /run/shm
$ cat /proc/version
Linux version 3.3.8-gcg-201305291443 ([hidden email]) (gcc version 4.6.x-google 20111101 (prerelease) (Google_crosstoolv15-gcc-4.6.x-glibc-2.11.1-grte) ) #1 SMP Wed May 29 14:49:59 PDT 2013

Appreciate the help.

Best,

Michael Sander



Re: rabbitmq nodedown - nodedown - Generic server rabbit_disk_monitor terminating

Michael Klishin-2
On 29 Oct 2013, at 16:43, Michael Sander <[hidden email]> wrote:

> Diskspace: During OCR, my app writes, reads, and deletes a lot of temporary files, so it is possible that I briefly ran out of free disk space. But I ran df afterwards and it looks like I have enough space (see below).

Another reason may be running out of file descriptors. What’s the ulimit -n value for user rabbitmq?
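
As a rough way to answer that question, the sketch below reads the open-file limit from Python's standard `resource` module. It is an assumption-laden illustration: it reports the limit of the process that runs it, so it would have to be run as the rabbitmq user to be meaningful, and the limit of an already running broker may differ (on Linux that one can be read from /proc/<pid>/limits). The function name is hypothetical.

```python
import resource
from typing import Tuple


def nofile_limits() -> Tuple[int, int]:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = nofile_limits()
# The soft limit is the value that `ulimit -n` prints in a shell.
print(f"open files: soft={soft} hard={hard}")
```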

> Even if I did run out of disk space, shouldn't rabbitmq be somewhat graceful about it? Perhaps it should refuse any new jobs while there is no space but then come back online once it detects there is space available.

See http://www.rabbitmq.com/memory.html.

RabbitMQ will normally block publishers until there’s enough disk space available, but for some reason
the disk monitor does not work in your environment, so RabbitMQ cannot know when it runs out of disk space
or file descriptors.
--
MK

Software Engineer, Pivotal/RabbitMQ

Re: rabbitmq nodedown - nodedown - Generic server rabbit_disk_monitor terminating

Michael Klishin-2
In reply to this post by Michael Sander
On 29 Oct 2013, at 16:43, Michael Sander <[hidden email]> wrote:

> Here is some additional version and disk usage information that may be useful

Michael,

Can you please run /bin/df -kP [rabbitmq database directory location]
and post the output?

This is the command the RabbitMQ disk monitor runs internally. Presumably it gets back
non-standard output that it fails to parse.

Thank you.
--
MK

Software Engineer, Pivotal/RabbitMQ


Re: rabbitmq nodedown - nodedown - Generic server rabbit_disk_monitor terminating

Michael Klishin-2

On 31 Oct 2013, at 17:49, Michael Sander <[hidden email]> wrote:

> $ sudo /bin/df -kP /var/lib/rabbitmq/mnesia/
> Filesystem     1024-blocks    Used Available Capacity Mounted on
> /dev/root         10451916 6331368   3596312      64% /
>
> I'm having a problem sudoing into rabbitmq, this is ulimit for myself though:
> $ ulimit -n
> 1024
>
> Does this explain anything?

This output looks pretty standard. We will investigate whether a subtle difference causes
the parser to break.

I doubt the ulimit -n value can affect df output.
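
To make the parsing discussion concrete, here is a minimal Python sketch of reading the "Available" column from `df -kP` output. It is not RabbitMQ's Erlang parser, and the function name is hypothetical. The `{badmatch,[]}` in the log is consistent with the parser receiving empty output rather than the two lines shown above, which is the case the sketch rejects explicitly; that reading is an assumption, not something confirmed in this thread.

```python
def parse_df_kp(output: str) -> int:
    """Return available bytes from `df -kP <dir>` output (header plus one data line)."""
    lines = [line for line in output.splitlines() if line.strip()]
    if len(lines) < 2:
        # Empty or truncated output, e.g. if the df command itself failed to run.
        raise ValueError(f"unexpected df output: {output!r}")
    fields = lines[1].split()
    if len(fields) < 6:
        raise ValueError(f"unexpected df data line: {lines[1]!r}")
    # Columns: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on
    return int(fields[3]) * 1024


sample = """Filesystem     1024-blocks    Used Available Capacity Mounted on
/dev/root         10451916 6331368   3596312      64% /
"""
print(parse_df_kp(sample))  # 3596312 KiB expressed in bytes
```

The quoted output parses cleanly here, which matches the observation that it looks standard; the interesting case is whatever the broker process itself receives when it runs the command.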
--
MK

Software Engineer, Pivotal/RabbitMQ

Re: rabbitmq nodedown - nodedown - Generic server rabbit_disk_monitor terminating

Michael Klishin-2
In reply to this post by Michael Klishin-2

On 31 Oct 2013, at 17:49, Michael Sander <[hidden email]> wrote:

> Here is df:
> $ sudo /bin/df -kP /var/lib/rabbitmq/mnesia/
> Filesystem     1024-blocks    Used Available Capacity Mounted on
> /dev/root         10451916 6331368   3596312      64% /

Michael,

Can you please run the following?

rabbitmqctl eval 'rabbit_misc:os_cmd("/bin/df -kP /var/lib/rabbitmq/mnesia/").'

Have you managed to su to the rabbitmq user to see whether the /bin/df output is different?
Does the problem persist?

Thank you.
--
MK

Software Engineer, Pivotal/RabbitMQ
