cluster node "stuck" during start


cluster node "stuck" during start

Not Drew Stevens
When a RabbitMQ cluster node starts back up after a server reboot, we have experienced (more than a few) cases where the RabbitMQ server on the node does not completely start.

This condition persisted even if the rabbit processes were killed and rabbit manually restarted.

The only way to get the server to start was a node reset (or explicit deletion of the Mnesia database).

Are there any suggestions about how to handle this without losing the state of the node?

The system process list looked like this:

# ps aux | grep rabbit
rabbitmq  1005  0.0  0.0   9888  2788 ?        S    Jun13   1:01 /usr/lib/erlang/erts-5.10.2/bin/epmd -daemon
root     15746  0.0  0.0  11232  1708 pts/3    S+   23:26   0:00 /bin/sh /etc/init.d/rabbitmq-server start
root     15797  0.0  0.0  11036  1468 pts/3    S+   23:26   0:00 /bin/sh /usr/sbin/rabbitmqctl wait /var/run/rabbitmq/pid
rabbitmq 15799  0.0  0.0  11036  1424 ?        S    23:26   0:00 /bin/sh /usr/sbin/rabbitmq-server
rabbitmq 15807  3.1  1.2 599876 47728 ?        Sl   23:26   0:03 /usr/lib/erlang/erts-5.10.2/bin/beam -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.2.1/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@my-rmq-server -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,[hidden email]} -rabbit sasl_error_logger {file,[hidden email]} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.2.1/sbin/../plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@my-rmq-server-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@my-rmq-server"
rabbitmq 15814  0.0  0.0  94432  2636 pts/3    S+   23:26   0:00 su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl  "wait" "/var/run/rabbitmq/pid"
rabbitmq 15819  0.2  0.3 106624 14008 pts/3    Sl+  23:26   0:00 /usr/lib/erlang/erts-5.10.2/bin/beam -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.2.1/sbin/../ebin -noshell -noinput -hidden -sname rabbitmqctl15819 -boot start_clean -s rabbit_control_main -nodename rabbit@my-rmq-server -extra wait /var/run/rabbitmq/pid

This RabbitMQ node showed as an "up" node in the Nodes list in the management console of another node in the cluster.

Also, rabbitmqctl returned some results:


# rabbitmqctl status

Status of node 'rabbit@my-rmq-server' ...
[{pid,1114},
 {running_applications,
     [{os_mon,"CPO  CXC 138 46","2.2.12"},
      {inets,"INETS  CXC 138 49","5.9.5"},
      {mnesia,"MNESIA  CXC 138 12","4.9"},
      {amqp_client,"RabbitMQ AMQP Client","3.2.1"},
      {rabbitmq_auth_mechanism_ssl,
          "RabbitMQ SSL authentication (SASL EXTERNAL)","3.2.1"},
      {xmerl,"XML parser","1.3.3"},
      {eldap,"Ldap api","1.0.1"},
      {rfc4627_jsonrpc,"JSON RPC Service","3.2.1-git5e67120"},
      {sasl,"SASL  CXC 138 11","2.3.2"},
      {stdlib,"ERTS  CXC 138 10","1.19.2"},
      {kernel,"ERTS  CXC 138 10","2.16.2"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang R16B01 (erts-5.10.2) [source-bdf5300] [64-bit] [smp:2:2] [async-threads:30] [hipe] [kernel-poll:true]\n"},
 {memory,
     [{total,44596672},
      {connection_procs,2808},
      {queue_procs,0},
      {plugins,8464},
      {other_proc,15751480},
      {mnesia,1191152},
      {mgmt_db,0},
      {msg_index,0},
      {other_ets,1235896},
      {binary,716136},
      {code,20445199},
      {atom,711569},
      {other_system,4533968}]},
 {file_descriptors,
     [{total_limit,924},{total_used,0},{sockets_limit,829},{sockets_used,0}]},
 {processes,[{limit,1048576},{used,105}]},
 {run_queue,0},
 {uptime,271}]
...done.
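
One detail in the status output above is that the `rabbit` application itself does not appear under running_applications, which is consistent with a boot that has not finished even though the Erlang VM answers rabbitmqctl. A minimal shell sketch of that check follows. The function name is made up for illustration, and it assumes a fully started broker of this version prints an entry of the form {rabbit,"RabbitMQ","3.2.1"}.

```shell
#!/bin/sh
# Illustrative helper (not part of RabbitMQ): reads `rabbitmqctl status`
# output on stdin and succeeds only if the rabbit application is listed.
# Assumption: a fully booted broker prints an entry such as
# {rabbit,"RabbitMQ","3.2.1"} under running_applications.
rabbit_app_running() {
    grep -q '{rabbit,"RabbitMQ"'
}

# Example use on the affected node:
#   rabbitmqctl status | rabbit_app_running || echo "rabbit app not started yet"
```

The pattern requires a comma right after `rabbit`, so entries such as amqp_client or rabbitmq_auth_mechanism_ssl do not count as a match.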

The startup log and the RabbitMQ log indicated that the node had begun starting up:

# cat startup_log

              RabbitMQ 3.2.1. Copyright (C) 2007-2013 GoPivotal, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/[hidden email]
  ######  ##        /var/log/rabbitmq/[hidden email]
  ##########
              Starting broker...


# cat [hidden email]

=INFO REPORT==== 25-Jul-2014::17:18:21 ===
Starting RabbitMQ 3.2.1 on Erlang R16B01
Copyright (C) 2007-2013 GoPivotal, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 25-Jul-2014::17:18:21 ===
node           : rabbit@my-rmq-server
home dir       : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.config
cookie hash    : WmWI9mzuXn9u47LQDipY3g==
log            : /var/log/rabbitmq/[hidden email]
sasl log       : /var/log/rabbitmq/[hidden email]
database dir   : /var/lib/rabbitmq/mnesia/rabbit@my-rmq-server

=INFO REPORT==== 25-Jul-2014::17:18:23 ===
Limiting to approx 924 file handles (829 sockets)
root@my-rmq-server:/var/log/rabbitmq#


Several minutes passed without any further writes to the logs or to the files in the Mnesia database directory:

# date
Fri Jul 25 17:23:56 UTC 2014


# ls -lt /var/lib/rabbitmq/mnesia/rabbit@my-rmq-server
total 148
-rw-r--r--   1 rabbitmq rabbitmq   271 Jul 25 17:21 DECISION_TAB.LOG
-rw-r--r--   1 rabbitmq rabbitmq   102 Jul 25 17:21 LATEST.LOG
-rw-r--r--   1 rabbitmq rabbitmq   171 Jul 25 17:18 nodes_running_at_shutdown
-rw-r--r--   1 rabbitmq rabbitmq   317 Jul 25 17:18 cluster_nodes.config
-rw-r--r--   1 rabbitmq rabbitmq   137 Jul 25 17:18 rabbit_vhost.DCD
-rw-r--r--   1 rabbitmq rabbitmq   640 Jul 25 17:18 rabbit_user.DCD
-rw-r--r--   1 rabbitmq rabbitmq 10207 Jul 25 17:18 rabbit_runtime_parameters.DCD
-rw-r--r--   1 rabbitmq rabbitmq 20423 Jul 25 17:18 rabbit_durable_route.DCD
-rw-r--r--   1 rabbitmq rabbitmq 21020 Jul 25 17:18 rabbit_durable_queue.DCD
-rw-r--r--   1 rabbitmq rabbitmq  2724 Jul 25 17:18 rabbit_durable_exchange.DCD
-rw-r--r--   1 rabbitmq rabbitmq   850 Jul 25 17:18 rabbit_user_permission.DCD
drwxr-xr-x   2 rabbitmq rabbitmq  4096 Jul 25 17:16 msg_store_transient
drwxr-xr-x   2 rabbitmq rabbitmq  4096 Jul 25 17:16 msg_store_persistent
drwxr-xr-x 170 rabbitmq rabbitmq 12288 Jul 25 17:16 queues
-rw-r--r--   1 rabbitmq rabbitmq 28983 Jul 24 23:35 schema.DAT
-rw-r--r--   1 rabbitmq rabbitmq     3 Jun 13 09:41 rabbit_serial
-rw-r--r--   1 rabbitmq rabbitmq   238 Jun 13 09:41 schema_version


_______________________________________________
rabbitmq-discuss mailing list has moved to https://groups.google.com/forum/#!forum/rabbitmq-users,
please subscribe to the new list!

[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Re: cluster node "stuck" during start

Michael Klishin-2
On 25 July 2014 at 23:37:17, Not Drew Stevens ([hidden email]) wrote:
> When a RabbitMQ cluster node starts back up after a server reboot,
> we have experienced (more than a few) cases where the RabbitMQ
> server on the node does not completely start.
>
> This condition persisted even if the rabbit processes were killed
> and rabbit manually restarted.

What do you mean by "does not completely start"?

> {file_descriptors,
> [{total_limit,924},{total_used,0},{sockets_limit,829},{sockets_used,0}]},  

This is a really low limit. I can think of one scenario:

 * ulimit -n was set to a high value manually but not via /etc
 * You have over 1000 queues
 * Node rebooted, ulimit -n was reset
 * RabbitMQ tried to recover durable queues and persistent messages and ran out of file descriptors
   in the process

Please bump ulimit -n for the rabbitmq user to 50K and try reproducing the issue.
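
To see whether this scenario applies, the limit the running node actually received can be compared with the number of queue directories on disk. The sketch below is one way to do that on Linux. The function names are hypothetical, the paths are the ones from the original post, and it assumes roughly one subdirectory per queue under the queues directory.

```shell
#!/bin/sh
# Illustrative helpers (names are hypothetical, not RabbitMQ tooling).

# Print the soft "Max open files" value from a Linux /proc/<pid>/limits file.
max_open_files() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Count the immediate subdirectories of a directory, e.g. the queues
# directory under the node's database dir. Uses find's -mindepth and
# -maxdepth, which are GNU/BSD extensions.
count_subdirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Example use on the affected node (paths taken from the post above):
#   max_open_files "/proc/$(cat /var/run/rabbitmq/pid)/limits"
#   count_subdirs /var/lib/rabbitmq/mnesia/rabbit@my-rmq-server/queues
```

For the limit to survive a reboot it has to be set where the init script picks it up. On Debian-style packages that is commonly a `ulimit -n` line in /etc/default/rabbitmq-server, though this should be checked against the packaging in use.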
--  
MK  

Staff Software Engineer, Pivotal/RabbitMQ

Re: cluster node "stuck" during start

Michael Klishin-2
In reply to this post by Not Drew Stevens
On 25 July 2014 at 23:37:17, Not Drew Stevens ([hidden email]) wrote:
> # cat startup_log
>
> RabbitMQ 3.2.1. Copyright (C) 2007-2013

So, this is with 3.2.1. 

> # cat [hidden email]
>
> =INFO REPORT==== 25-Jul-2014::17:18:21 ===
> Starting RabbitMQ 3.2.1 on Erlang R16B01
> Copyright (C) 2007-2013 GoPivotal, Inc.
> Licensed under the MPL. See http://www.rabbitmq.com/
>
> =INFO REPORT==== 25-Jul-2014::17:18:21 ===
> node : rabbit@my-rmq-server
> home dir : /var/lib/rabbitmq
> config file(s) : /etc/rabbitmq/rabbitmq.config
> cookie hash : WmWI9mzuXn9u47LQDipY3g==
> log : /var/log/rabbitmq/[hidden email]
> sasl log : /var/log/rabbitmq/[hidden email]
> database dir : /var/lib/rabbitmq/mnesia/rabbit@my-rmq-server
>
> =INFO REPORT==== 25-Jul-2014::17:18:23 ===
> Limiting to approx 924 file handles (829 sockets)
> root@my-rmq-server:/var/log/rabbitmq#
>
>
> Some time had passed without any activity to either the logs,
> or files in the mnesia database

This looks like bug 25873, fixed in 3.2.2:

http://www.rabbitmq.com/release-notes/README-3.2.2.txt

I also recall an issue that led to RabbitMQ taking an unreasonably long time
to start.

Can you please try 3.3.4?
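
Whether a given node predates the 3.2.2 fix comes down to a version comparison. A minimal shell sketch, assuming plain three-part numeric versions; the function name is made up for illustration and the version strings are passed in by hand.

```shell
#!/bin/sh
# Illustrative helper: succeeds if version $1 is strictly older than $2.
# Assumes plain three-part numeric versions such as 3.2.1.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" |
         sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)" = "$1" ]
}

# Example: the version from the startup log versus the fix release.
#   version_lt 3.2.1 3.2.2 && echo "older than 3.2.2, upgrade recommended"
```

Sorting each component numerically keeps a version like 3.10.0 from being treated as older than 3.2.2, which a plain string comparison would get wrong.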
--  
MK  

Staff Software Engineer, Pivotal/RabbitMQ