Really bizarre startup issue...

Robert Nickel
First, thank you.

I have two hosts (sdcloudsh01 and sdcloudsh02) with the following specifications:
  CentOS 5.1
  erlang R13B03
  rabbitmq-server 1.7.2
  selinux is disabled

Both are identically configured using puppet. 02 works fine but 01 has
interesting startup issues.

On sdcloudsh01, contents of /etc/rabbitmq files:
  rabbitmq.conf:
    NODENAME=regsvc@sdcloudsh01
  rabbitmq.config:
    [
      {rabbit, []}
    ].
  rabbitmq_cluster.config:
    [ 'regsvc@sdcloudsh01','regsvc@sdcloudsh02' ].

When starting the rabbitmq server using /sbin/service rabbitmq-server start,
the service fails and the following outputs are in
/var/log/rabbitmq/startup_err and log:

  _log:
    Starting all nodes...
    Starting node regsvc@sdcloudsh01...
  _err:
    Error: {node_start_failed,normal}

When the node fails, there is no erl_crash.dump file to be found and the epmd
process is running.

After a bunch of troubleshooting, I noticed that if I strace the above
command, everything works fine:

  strace -f /sbin/service rabbitmq-server start

Terminating the strace leaves the rabbit server running happily.

I have no idea what this could be.

Any pointers are greatly appreciated.

Thank you!
--Robert

_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Re: Really bizarre startup issue...

Robert Nickel
On 2010.04.30 16:01:43 -0700, Robert Nickel wrote:
> First, thank you.
>
> I have two hosts (sdcloudsh01 and sdcloudsh02) with the following specifications:
>   CentOS 5.1
>   erlang R13B03
>   rabbitmq-server 1.7.2
>   selinux is disabled

Still no luck figuring out what is going on with this machine.

Can someone advise me on how to get erlang to do a debug trace of some sort?
I have zero experience with erlang programming and would rather not delve into
it at this time.

That said, the lack of any kind of useful error is driving me nuts.

Some additional information surrounding this issue:

* rabbitmq-server was downloaded from
  http://www.rabbitmq.com/releases/rabbitmq-server/v1.7.2/rabbitmq-server-1.7.2-1.i386.rpm
  $ rpm -qi rabbitmq-server
  Name        : rabbitmq-server              Relocations: (not relocatable)
  Version     : 1.7.2                             Vendor: (none)
  Release     : 1                             Build Date: Mon 15 Feb 2010 08:02:10 AM PST
  Install Date: Wed 05 May 2010 02:30:56 PM PDT      Build Host: debian
  Group       : Development/Libraries         Source RPM: rabbitmq-server-1.7.2-1.src.rpm
  Size        : 706344                           License: MPLv1.1
  Signature   : DSA/SHA1, Mon 15 Feb 2010 08:07:35 AM PST, Key ID f7b8cea6056e8e56
  URL         : http://www.rabbitmq.com/
  Summary     : The RabbitMQ server
  Description :
  RabbitMQ is an implementation of AMQP, the emerging standard for high
  performance enterprise messaging. The RabbitMQ server is a robust and
  scalable implementation of an AMQP broker.

Adding set -x at the top of /usr/lib/rabbitmq/bin/rabbitmq-multi yields the
following output in /var/log/rabbitmq/startup_err:

    + for arg in '"$@"'
    ++ sed -e 's/"/\\"/g'
    + arg=start_all
    + CMDLINE=' "start_all"'
    + for arg in '"$@"'
    ++ sed -e 's/"/\\"/g'
    + arg=1
    + CMDLINE=' "start_all" "1"'
    + cd /var/lib/rabbitmq
    ++ basename /usr/sbin/rabbitmq-multi
    + SCRIPT=rabbitmq-multi
    ++ id -u
    + '[' 0 = 0 ']'
    + su rabbitmq -s /bin/sh -c '/usr/lib/rabbitmq/bin/rabbitmq-multi  "start_all" "1"'
    + NODENAME=rabbit
    ++ dirname /usr/lib/rabbitmq/bin/rabbitmq-multi
    + SCRIPT_HOME=/usr/lib/rabbitmq/bin
    + PIDS_FILE=/var/lib/rabbitmq/pids
    + MULTI_ERL_ARGS=
    + MULTI_START_ARGS=
    + CONFIG_FILE=/etc/rabbitmq/rabbitmq
    ++ dirname /usr/lib/rabbitmq/bin/rabbitmq-multi
    + . /usr/lib/rabbitmq/bin/rabbitmq-env
    ++ SCRIPT_PATH=/usr/lib/rabbitmq/bin/rabbitmq-multi
    ++ '[' -h /usr/lib/rabbitmq/bin/rabbitmq-multi ']'
    +++ readlink -f /usr/lib/rabbitmq/bin/rabbitmq-multi
    ++ FULL_PATH=/usr/lib/rabbitmq/lib/rabbitmq_server-1.7.2/sbin/rabbitmq-multi
    ++ '[' 0 '!=' 0 ']'
    ++ SCRIPT_PATH=/usr/lib/rabbitmq/lib/rabbitmq_server-1.7.2/sbin/rabbitmq-multi
    ++ '[' -h /usr/lib/rabbitmq/lib/rabbitmq_server-1.7.2/sbin/rabbitmq-multi ']'
    +++ dirname /usr/lib/rabbitmq/lib/rabbitmq_server-1.7.2/sbin/rabbitmq-multi
    ++ SCRIPT_DIR=/usr/lib/rabbitmq/lib/rabbitmq_server-1.7.2/sbin
    ++ RABBITMQ_HOME=/usr/lib/rabbitmq/lib/rabbitmq_server-1.7.2/sbin/..
    ++ '[' -f /etc/rabbitmq/rabbitmq.conf ']'
    ++ . /etc/rabbitmq/rabbitmq.conf
    +++ NODENAME=regsvc@sdcloudsh01
    + DEFAULT_NODE_IP_ADDRESS=0.0.0.0
    + DEFAULT_NODE_PORT=5672
    + '[' x = x ']'
    + '[' x '!=' x ']'
    + '[' x = x ']'
    + '[' x '!=' x ']'
    + '[' x = x ']'
    + '[' x '!=' x ']'
    + '[' x = x ']'
    + RABBITMQ_NODENAME=regsvc@sdcloudsh01
    + '[' x = x ']'
    + RABBITMQ_SCRIPT_HOME=/usr/lib/rabbitmq/bin
    + '[' x = x ']'
    + RABBITMQ_PIDS_FILE=/var/lib/rabbitmq/pids
    + '[' x = x ']'
    + RABBITMQ_MULTI_ERL_ARGS=
    + '[' x = x ']'
    + RABBITMQ_MULTI_START_ARGS=
    + '[' x = x ']'
    + RABBITMQ_CONFIG_FILE=/etc/rabbitmq/rabbitmq
    + export RABBITMQ_NODENAME RABBITMQ_NODE_IP_ADDRESS RABBITMQ_NODE_PORT RABBITMQ_SCRIPT_HOME RABBITMQ_PIDS_FILE RABBITMQ_CONFIG_FILE
    + env
    + RABBITMQ_CONFIG_ARG=
    + '[' -f /etc/rabbitmq/rabbitmq.config ']'
    + RABBITMQ_CONFIG_ARG='-config /etc/rabbitmq/rabbitmq'
    + set -f
    + exec erl -pa /usr/lib/rabbitmq/lib/rabbitmq_server-1.7.2/sbin/../ebin -noinput -hidden -sname rabbitmq_multi32076 -config /etc/rabbitmq/rabbitmq -s rabbit_multi -extra start_all 1
    Error: {node_start_failed,normal}

Thanks in advance for any help.

--Robert


Re: Really bizarre startup issue...

Matthias Radestock-3
Robert,

Robert Nickel wrote:

> On sdcloudsh01, contents of /etc/rabbitmq files:
>   rabbitmq.conf:
>     NODENAME=regsvc@sdcloudsh01
>   rabbitmq.config:
>     [
>       {rabbit, []}
>     ].
>   rabbitmq_cluster.config:
>     [ 'regsvc@sdcloudsh01','regsvc@sdcloudsh02' ].
>
> When starting the rabbitmq server using /sbin/service rabbitmq-server start,
> the service fails

Does rabbit start up fine if you a) remove all the above configuration
files, and b) delete the database directory (usually
/var/lib/rabbitmq/mnesia)?

> the following outputs are in /var/log/rabbitmq/startup_err and log:
>
>   _log:
>     Starting all nodes...
>     Starting node regsvc@sdcloudsh01...
>   _err:
>     Error: {node_start_failed,normal}

Are there any other non-empty log files in /var/log/rabbitmq?

> After a bunch of troubleshooting, I noticed that if I strace the above
> command, everything works fine:
>
>   strace -f /sbin/service rabbitmq-server start
>
> Terminating the strace leaves the rabbit server running happily.

That would suggest some sort of race / timing issue. Strange.


Regards,

Matthias.


Re: Really bizarre startup issue...

Robert Nickel
On 2010.05.06 05:56:56 +0100, Matthias Radestock wrote:

> Robert,
>
> Robert Nickel wrote:
>> On sdcloudsh01, contents of /etc/rabbitmq files:
>>   rabbitmq.conf:
>>     NODENAME=regsvc@sdcloudsh01
>>   rabbitmq.config:
>>     [
>>       {rabbit, []}
>>     ].
>>   rabbitmq_cluster.config:
>>     [ 'regsvc@sdcloudsh01','regsvc@sdcloudsh02' ].
>>
>> When starting the rabbitmq server using /sbin/service rabbitmq-server start,
>> the service fails
>
> Does rabbit start up fine if you a) remove all the above configuration  
> files, and b) delete the database directory (usually  
> /var/lib/rabbitmq/mnesia)?

Cleaned out the files and ran the test:

    [root@sdcloudsh01 ~]# ls /etc/rabbitmq/
    [root@sdcloudsh01 ~]# ls /var/lib/rabbitmq/
    [root@sdcloudsh01 ~]# /etc/init.d/rabbitmq-server start
    Starting rabbitmq-server: FAILED - check /var/log/rabbitmq/startup_log, _err
    rabbitmq-server.
    [root@sdcloudsh01 ~]# ls -l /var/log/rabbitmq/*
    -rw-r--r-- 1 root root 34 May  6 09:40 /var/log/rabbitmq/startup_err
    -rw-r--r-- 1 root root 58 May  6 09:40 /var/log/rabbitmq/startup_log
    [root@sdcloudsh01 ~]# cat /var/log/rabbitmq/*
    Error: {node_start_failed,normal}
    Starting all nodes...
    Starting node rabbit@sdcloudsh01...

Same results, unfortunately.

>> the following outputs are in /var/log/rabbitmq/startup_err and log:
>>
>>   _log:
>>     Starting all nodes...
>>     Starting node regsvc@sdcloudsh01...
>>   _err:
>>     Error: {node_start_failed,normal}
>
> Are there any other non-empty log files in /var/log/rabbitmq?

None.  See listing above.

>> After a bunch of troubleshooting, I noticed that if I strace the above
>> command, everything works fine:
>>
>>   strace -f /sbin/service rabbitmq-server start
>>
>> Terminating the strace leaves the rabbit server running happily.
>
> That would suggest some sort of race / timing issue. Strange.

Agreed.  My initial thought is that the parent process is terminating before
the fork somehow.  I don't have a good way to validate this as I have zero
experience with erlang and how to debug it.

FWIW:
[root@sdcloudsh01 ~]# uname -rvmi
2.6.18-53.el5PAE #1 SMP Mon Nov 12 02:55:09 EST 2007 i686 i386

The rabbitmq user has no .bash* files other than .bash_history.

Thank you,
  --Robert


Re: Really bizarre startup issue...

Matthias Radestock-3
Robert,

Robert Nickel wrote:
> Cleaned out the files and ran the test: [...]
> Same results, unfortunately.

*sigh*. It might be worth adding some debugging output to
/usr/lib/rabbitmq/bin/rabbitmq-server, just to see whether starting
rabbit actually gets as far as running that script (and if so, how far
into it).
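
A minimal sketch of what such debugging output might look like, assuming a
POSIX shell. The log path and the debug_log helper are hypothetical names, not
part of the RabbitMQ scripts. It writes to its own file so that the markers
survive even if the caller redirects or discards stdout/stderr:

```shell
#!/bin/sh
# Hypothetical debug helper: DEBUG_LOG and debug_log are illustrative names
# and do not exist in the shipped rabbitmq-server script.
DEBUG_LOG="${DEBUG_LOG:-/tmp/rabbitmq-server-debug.log}"

debug_log() {
    # Append one line per call: wall-clock time, shell PID, effective user,
    # then the caller's message. Appending keeps earlier attempts visible.
    printf '%s pid=%s user=%s: %s\n' \
        "$(date '+%H:%M:%S')" "$$" "$(id -un)" "$*" >> "$DEBUG_LOG"
}

# Placed as the first statement of the script: proves the script was invoked.
debug_log "script entered with args: $*"

# ... existing script statements would go here ...

# Placed just before the final exec: shows how far the script got.
debug_log "about to exec erl"
```

If the file never appears after a start attempt, the script was most likely
never invoked; if only the first line appears, it failed somewhere in between.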

> FWIW:
> [root@sdcloudsh01 ~]# uname -rvmi
> 2.6.18-53.el5PAE #1 SMP Mon Nov 12 02:55:09 EST 2007 i686 i386

Is there another machine on which you could try to get rabbit installed?
That way we could rule out any peculiarities with that particular machine.

And/or could you give one of the rabbit team access to the above machine?


Regards,

Matthias.


Re: Really bizarre startup issue...

Matthias Radestock-3
Robert,

Robert Nickel wrote:

> On 2010.05.06 19:35:20 +0100, Matthias Radestock wrote:
>> *sigh*. It might be worth adding some debugging output to  
>> /usr/lib/rabbitmq/bin/rabbitmq-server, just to see whether starting  
>> rabbit actually gets as far as running that script (and if so, how far  
>> into it).
>
> I don't think that script is ever invoked as the init script invokes
> /usr/sbin/rabbitmq-multi which then invokes
> /usr/lib/rabbitmq/bin/rabbitmq-multi which has the call to erl.
>
> Is this incorrect?

The Erlang code executed by rabbitmq-multi in turn calls the
rabbitmq-server script. If things are working as they should.


Matthias.


Re: Really bizarre startup issue...

Robert Nickel
On 2010.05.06 21:35:39 +0100, Matthias Radestock wrote:

> Robert,
>
> Robert Nickel wrote:
>> On 2010.05.06 19:35:20 +0100, Matthias Radestock wrote:
>>> *sigh*. It might be worth adding some debugging output to  
>>> /usr/lib/rabbitmq/bin/rabbitmq-server, just to see whether starting  
>>> rabbit actually gets as far as running that script (and if so, how
>>> far  into it).
>>
>> I don't think that script is ever invoked as the init script invokes
>> /usr/sbin/rabbitmq-multi which then invokes
>> /usr/lib/rabbitmq/bin/rabbitmq-multi which has the call to erl.
>>
>> Is this incorrect?
>
> The Erlang code executed by rabbitmq-multi in turn calls the  
> rabbitmq-server script. If things are working as they should.

Excellent.  I will add my set -x into that.  Is there a way to get more
debugging information out of erlang?

--Robert


Re: Really bizarre startup issue...

Robert Nickel
On 2010.05.06 13:41:36 -0700, Robert Nickel wrote:

> On 2010.05.06 21:35:39 +0100, Matthias Radestock wrote:
> > Robert,
> >
> > Robert Nickel wrote:
> >> On 2010.05.06 19:35:20 +0100, Matthias Radestock wrote:
> >>> *sigh*. It might be worth adding some debugging output to  
> >>> /usr/lib/rabbitmq/bin/rabbitmq-server, just to see whether starting  
> >>> rabbit actually gets as far as running that script (and if so, how
> >>> far  into it).
> >>
> >> I don't think that script is ever invoked as the init script invokes
> >> /usr/sbin/rabbitmq-multi which then invokes
> >> /usr/lib/rabbitmq/bin/rabbitmq-multi which has the call to erl.
> >>
> >> Is this incorrect?
> >
> > The Erlang code executed by rabbitmq-multi in turn calls the  
> > rabbitmq-server script. If things are working as they should.
>
> Excellent.  I will add my set -x into that.  Is there a way to get more
> debugging information out of erlang?

rabbitmq-server is not getting invoked according to the output in
/var/log/rabbitmq/startup*.

I added an echo and set -x to the script and got nothing.

--
-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-
Robert Nickel
Senior Systems Administrator, Systems Engineering
WWS Global Platform
Sony Computer Entertainment America
10075 Barnes Canyon Rd
San Diego Ca, 92121
(858)824-4802 Phone
[hidden email]


Re: Really bizarre startup issue...

Matthias Radestock-3
Robert,

Robert Nickel wrote:
> rabbitmq-server is not getting invoked according to the output in
> /var/log/rabbitmq/startup*.
>
> I added an echo and set -x to the script and got nothing.

The logging could possibly go astray; I suggest you add something like a
'touch /tmp/foo' right at the beginning to serve as a "got here" marker.
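
One way to check such a marker from the outside, sketched under the assumption
that 'touch /tmp/foo' has been added as the first line of rabbitmq-server. The
helper names reset_marker and report_marker are hypothetical:

```shell
#!/bin/sh
# Sketch only. MARKER defaults to the /tmp/foo path suggested above;
# reset_marker and report_marker are hypothetical helper names.
MARKER="${MARKER:-/tmp/foo}"

reset_marker() {
    # Remove any stale marker so a later hit can only come from the
    # new start attempt.
    rm -f "$MARKER"
}

report_marker() {
    # -e is true if the file exists, i.e. the 'touch' line in the script ran.
    if [ -e "$MARKER" ]; then
        echo "reached"
    else
        echo "not reached"
    fi
}

# Intended sequence (commented out so that sourcing this file has no
# side effects):
#   reset_marker
#   /sbin/service rabbitmq-server start
#   report_marker
```

Because the marker is a file rather than a line of output, it does not depend
on where the startup logs end up.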

The {node_start_failed,normal} error you are seeing in the logs
indicates that the rabbit_multi erlang code is attempting to invoke the
rabbitmq-server script but that either somehow fails or at least doesn't
result in a running rabbit server. It's very unusual for that to happen
and not result in some errors in the startup logs. I cannot think of an
easy way of tracking down the cause w/o hacking on the rabbitmq_multi
erlang code.


Matthias.


Re: Really bizarre startup issue...

Robert Nickel
On 2010.05.06 22:48:08 +0100, Matthias Radestock wrote:

> Robert,
>
> Robert Nickel wrote:
>> rabbitmq-server is not getting invoked according to the output in
>> /var/log/rabbitmq/startup*.
>>
>> I added an echo and set -x to the script and got nothing.
>
> The logging could possibly go astray; I suggest you add something like a  
> 'touch /tmp/foo' right at the beginning to serve as a "got here" marker.
>
> The {node_start_failed,normal} error you are seeing in the logs  
> indicates that the rabbit_multi erlang code is attempting to invoke the  
> rabbitmq-server script but that either somehow fails or at least doesn't  
> result in a running rabbit server. It's very unusual for that to happen  
> and not result in some errors in the startup logs. I cannot think of an  
> easy way of tracking down the cause w/o hacking on the rabbitmq_multi  
> erlang code.

Closing the loop on this issue.

I have, unfortunately, run out of time that I can spend chasing this issue.
The startup using rabbitmq-multi is still unresolved. I have resorted to
altering the init script to call rabbitmq-server and use rabbitmqctl to stop
and gather status information.
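
For anyone who lands here with the same problem, a rough sketch of the shape
of that workaround follows. It is not my actual init script. Assumptions: the
rabbitmqctl path (guessed by analogy with the rabbitmq-server path mentioned
earlier in this thread), running as the rabbitmq user via su as the stock
wrapper does, and that this rabbitmq-server version accepts -detached to run
in the background. The rabbit_init and run_as_rabbit names are made up:

```shell
#!/bin/sh
# Sketch only, not the script actually deployed. Paths and the -detached
# flag are assumptions to verify against the installed version.
RABBIT_SERVER="${RABBIT_SERVER:-/usr/lib/rabbitmq/bin/rabbitmq-server}"
RABBIT_CTL="${RABBIT_CTL:-/usr/lib/rabbitmq/bin/rabbitmqctl}"
# Command prefix used to drop privileges; it receives one command string.
RUN_AS="${RUN_AS:-su rabbitmq -s /bin/sh -c}"

run_as_rabbit() {
    # $RUN_AS is intentionally unquoted so it splits into command + options.
    $RUN_AS "$1"
}

rabbit_init() {
    case "$1" in
        start)  run_as_rabbit "$RABBIT_SERVER -detached" ;;  # bypasses rabbitmq-multi
        stop)   run_as_rabbit "$RABBIT_CTL stop" ;;
        status) run_as_rabbit "$RABBIT_CTL status" ;;
        *)      echo "Usage: $0 {start|stop|status}" >&2
                return 1 ;;
    esac
}

# A real init script would end with:
#   rabbit_init "$1"
```

The point of the design is simply to take rabbitmq-multi out of the startup
path, since that is the step that fails on this host.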

If you see this come up again, please let me know.

Thanks for all your help.

--Robert
