RabbitMQ waits forever for PID file during startup

Cesar Munoz
Hi all,

I am seeing this intermittent issue in RabbitMQ during its startup, right after installation. As you can see in this gist:

https://gist.github.com/xcu/9509f1d285dd9556667c

The startup was attempted at 0:51, and the rabbitmqctl wait /var/run/rabbitmq/pid command is still hanging 9 hours later, which is blocking the rest of the installation. I believe the message in startup_err is probably the cause, but googling it turned up nothing useful.

Any ideas?

Thanks,
Cesar.

This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail.


_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Re: RabbitMQ waits forever for PID file during startup

Simon MacMullen-2
On 06/06/2014 9:48AM, Cesar wrote:
> Hi all,
>
> I am seeing this intermittent issue in RabbitMQ during its startup,
> right after installation. As you can see in this gist:
>
> https://gist.github.com/xcu/9509f1d285dd9556667c

Firstly, thanks for all the detail there.

> The startup was attempted at 0:51, and the wait /var/run/rabbitmq/pid is
> still hanging 9 hours later, which is blocking the rest of the
> installation. I believe the message in the startup_err is probably the
> cause of this, but after googling it I have found nothing useful.

So it looks like the startup shell script is failing to write its pid
(which will become RabbitMQ's pid) to a file. The process that waits
for startup to complete then hangs forever, since the file is missing.

So there are two issues here. One is that failing to write the pid file
should lead to immediate failure - probably the script just needs to set
"-e". I will make sure this gets some attention.

The second, and more baffling, issue is that on your system(s) the shell
command:

     echo $$ > ${RABBITMQ_PID_FILE}

is actually failing due to "Cannot allocate memory". Yet in your startup
log the machine seems to have 256GB, which ought to be enough to run
even the most bloated version of echo.

So is it something about redirecting to a file? Are you hitting some
ulimit or other? Is there anything unusual on this machine in terms of
resource limits?
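
As a rough illustration of the "-e" suggestion above (a sketch only, not the actual rabbitmq-server script): with set -e in effect, a failed write of the pid file aborts the script at once instead of letting it carry on without one. The helper name write_pid_demo and the paths passed to it are made up for this example.

```shell
#!/bin/sh
# Hypothetical demo, not taken from the RabbitMQ scripts.
# Runs a child shell that tries to write its own pid to the file named in $1.
write_pid_demo() {
    pid_file=$1
    sh -c '
        set -e                      # abort on the first failing command
        echo $$ > "$1"              # fails if the target cannot be created
        echo "pid file written"     # only reached when the write succeeded
    ' demo "$pid_file"
}
```

Calling write_pid_demo with a path inside a missing directory should return a non-zero status and print nothing on stdout, which is the fail-fast behaviour the startup script currently lacks.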

Cheers, Simon

--
Simon MacMullen
RabbitMQ, Pivotal

Re: RabbitMQ waits forever for PID file during startup

Cesar Munoz
Hi Simon,

The set -e looks like a very good idea; at least the process will report the failure straight away!

These are the ulimits:

[root@ms1 ~]# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2066207
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2066207
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


No error log showed up in /var/log/messages though...


Thanks!
Cesar.



Re: RabbitMQ waits forever for PID file during startup

Simon MacMullen-2
On 06/06/2014 10:49AM, Cesar Munoz wrote:
> Hi Simon,
>
> the set -e looks like a very good idea, at least the process will return
> the failure straight away!

Sure!

> These are the ulimits:
>
> [root@ms1 ~]# ulimit -a

<snip>

Those are the ulimits which apply to root - maybe they are different for
the "rabbitmq" user?
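
One way to make that comparison, sketched under the assumption of a Linux system with /proc mounted, is to read the limits of a process that is actually running rather than those of a root login shell. The function name limits_of is invented here; "rabbitmq" is the service user mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel's view of the resource limits
# of a running process (Linux only, reads /proc/<pid>/limits).
limits_of() {
    cat "/proc/$1/limits"
}

# Example: limits of the current shell.
#   limits_of $$
#
# Limits seen by a fresh shell started as the rabbitmq user
# (needs root and an existing "rabbitmq" account, so left commented out):
#   su -s /bin/sh rabbitmq -c 'ulimit -a'
```

Limits picked up through su may still differ from those the init script gets, so reading /proc/<pid>/limits for the running server process is the more direct check.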

But more to the point: we're failing to do something very, very simple
here. There has to be something weird about this system if echo or shell
redirection can fail with an error message about memory allocation.

So have you configured anything unusual about this system?

Cheers, Simon

--
Simon MacMullen
RabbitMQ, Pivotal

Re: RabbitMQ waits forever for PID file during startup

Cesar Munoz
Hi Simon,

The ulimits for the rabbitmq user are pretty much the same; the only difference is that max user processes is set to 1024 instead of 2066207.

About the system itself, it is true that there has to be something strange going on if a shell redirection can fail, but I'm checking the configuration and I don't see anything especially unusual.

We are using Red Hat 6.4, and these are the parameters that we set in the sysctl.conf:
http://pastebin.com/SfJBwrna

The rest of the parameters in the kickstart file are pretty much the standard ones.
This is an intermittent issue (we are measuring how often it happens; so far we have seen 3 failures in 13 installations), which makes it harder to track down!
Either way, restarting the service works, so it looks like whatever causes the problem disappears after a while. I've been trying to find what could make this non-deterministic, but so far I haven't noticed anything unusual.

Thanks again!
Cesar.



Re: RabbitMQ waits forever for PID file during startup

Cesar Munoz
Hi Simon,

We have tried to find the problem with the initial installation, but no luck yet. It is very difficult to track down, as it is totally non-deterministic!
In the meantime, we installed the latest version of RabbitMQ, which includes the set -e fix, but the same issue still happened. Given the output of ps auxf

https://gist.github.com/anonymous/62239513b154179a8a4e

it looks like
/bin/sh /etc/init.d/rabbitmq-server start
and
/bin/sh /usr/sbin/rabbitmqctl wait /var/run/rabbitmq/pid
were running concurrently. Is there any chance that this fact created some sort of race condition between these 2 processes that would make the set -e fix not work?

Cheers,
Cesar.



_______________________________________________
rabbitmq-discuss mailing list has moved to https://groups.google.com/forum/#!forum/rabbitmq-users,
please subscribe to the new list!


Re: RabbitMQ waits forever for PID file during startup

Simon MacMullen-2
On 10/07/2014 3:02PM, Cesar Munoz wrote:

> Hi Simon,
>
> so we have tried to find the problem with the initial installation, but
> no luck yet. It is very difficult to track it, as it is totally
> non-deterministic!
> In the meantime, we installed the latest version of RabbitMQ, which
> includes the set -e fix, but the same issue still happened. Given the
> output of ps auxf
>
> https://gist.github.com/anonymous/62239513b154179a8a4e
>
> it looks like
>
> /bin/sh /etc/init.d/rabbitmq-server start
>
> and
>
> /bin/sh /usr/sbin/rabbitmqctl wait /var/run/rabbitmq/pid
>
> were running concurrently. Is there any chance that this fact created
> some sort of race condition between these 2 processes that would make
> the set -e fix not work?

The "set -e" should cause a failure in the case where the script was not
able to write the pid file for whatever reason. That's all. Looking at
the ps output posted in the latest case, the startup has got past that
point as it's started the beam process for the server.

"rabbitmqctl wait" should wait indefinitely for the server to start up,
as long as the server has not actually died.
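
For an installer that must not block forever, one possible workaround (not something proposed in this thread, and not a feature of rabbitmqctl itself) is to bound the wait with the timeout command from GNU coreutils. The wrapper name run_with_deadline and the 300-second figure are invented for this sketch.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, but give up after $1 seconds.
# Relies on timeout(1) from GNU coreutils, which exits with status 124
# when the deadline is reached.
run_with_deadline() {
    seconds=$1
    shift
    timeout "$seconds" "$@"
}

# Intended use (not run here):
#   run_with_deadline 300 rabbitmqctl wait /var/run/rabbitmq/pid \
#       || echo "broker did not come up in time" >&2
```

This does not fix whatever is stuck; it only turns an indefinite hang into a failure the installer can report.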

But it looks like something is getting stuck? Is there anything in the
server logs at this point? Bear in mind that the machine in question
has claimed to run out of memory writing a 5-byte file, so I don't
necessarily trust it.

Cheers, Simon



--
Simon MacMullen
RabbitMQ, Pivotal

Re: RabbitMQ waits forever for PID file during startup

Cesar Munoz
Hi Simon,

The thing is, /var/run/rabbitmq/pid still contains the "Cannot allocate memory" error, which is probably why rabbitmqctl wait is still blocked. The system logs are not saying anything new, but we ran sos after reproducing the issue and are looking through the report to see if there is anything interesting. I'll let you know!
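
Since the pid file here holds an error message rather than a number, a caller could sanity-check the file before waiting on it. The following is only a sketch of that idea; the function name pid_file_ok is hypothetical and nothing like it is claimed to exist in RabbitMQ.

```shell
#!/bin/sh
# Hypothetical check: succeed only if the file named in $1 holds a bare
# decimal pid that refers to a process which currently exists.
pid_file_ok() {
    [ -r "$1" ] || return 1
    pid=$(cat "$1")
    case $pid in
        ''|*[!0-9]*) return 1 ;;    # empty, or contains a non-digit (e.g. an error message)
    esac
    # kill -0 sends no signal; it only tests that the pid can be signalled.
    kill -0 "$pid" 2>/dev/null
}
```

With a check like this, a file containing "Cannot allocate memory" would be rejected immediately instead of being waited on.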

Thanks,
Cesar.

