Bring cluster up after node crash


carlhoerberg
We have a 3-node cluster. Nodes 2 and 3 went down due to OOM, but node 1 survived. Clients could publish new messages but none were delivered. Node 1 had plenty of memory left, so no memory-based blocking was (or at least shouldn't have been) in effect.

I then tried to bring node 2 and 3 back online by simply restarting them, this is what happened:  

Node1 floods the logs for a while at a rate of 20-100/sec:
=ERROR REPORT==== 18-Mar-2013::07:10:40 ===
Discarding message {'$gen_call',{<0.17965.1>,#Ref<0.0.1.90282>},stat} from <0.17965.1> to <0.5037.1> in an old incarnation (1) of this node (2)

Start up node 3
Floods
=ERROR REPORT==== 18-Mar-2013::08:23:15 ===
Discarding message {'$gen_call',{<0.7609.0>,#Ref<0.0.1.142489>},stat} from <0.7609.0> to <0.25515.26> in an old incarnation (1) of this node (3)
and is stuck at  
"starting exchange, queue and binding recovery ..."

rabbitmqctl status hangs for ever on node 1

Start up node 2, starts fast, says "Broker started" in startup_log, but doesn't list the plugins; "service rabbitmq-server start" never returns and rabbitmqctl status never returns either

node 2 then runs out of memory again, without client connections this time:  
=INFO REPORT==== 18-Mar-2013::09:09:35 ===
vm_memory_high_watermark set. Memory used:7336394640 allowed:7031336140
=WARNING REPORT==== 18-Mar-2013::09:09:35 ===
memory resource limit alarm set on node rabbit@tiger02

Querying /api/overview at node1 gives:
{error,{error,{badmatch,false},
[{rabbit_mgmt_wm_overview,version,1},
{rabbit_mgmt_wm_overview,to_json,2},
{webmachine_resource,resource_call,3},
{webmachine_resource,do,3},
{webmachine_decision_core,resource_call,1},
{webmachine_decision_core,decision,1},
{webmachine_decision_core,handle_request,2},
{rabbit_webmachine,'-makeloop/1-fun-0-',2}]}}

node 3 starts eventually.  
kills node 2, starts again, stops at "starting database …"
nothing in the log or startup_err, cpu usage 0%
kills after 30min and starts again, same thing.  

node 3 can now output rabbitmqctl status, node 1 still cannot.
node 1 can't be shutdown, force kills
with node1 down, node 2 now gets past "starting database" and starts
neither node 2 nor node 3 responds to rabbitmqctl status
shutting down node 2, but doesn't respond, have to do kill -9
node 3 still doesn't respond to rabbitmqctl status
shuts down node 3, doesn't respond, killing it instead, now all nodes are down.

note: When rabbitmqctl status doesn't work other stuff like list_users, cluster_status etc. works.

Starting up node3, log now gets flooded with:
=ERROR REPORT==== 18-Mar-2013::11:09:04 ===
** Generic server <0.629.0> terminating
** Last message in was {init,<0.182.0>}
** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.9934227485209703">>},
true,true,<0.21310.24>,[],<0.629.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6868.7071.0>,<6868.7070.0>},
{<6867.19845.80>,<6867.19844.80>},
{<0.21601.24>,<0.21548.24>}]},
none,false,undefined,undefined,
{[],[]},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
1,
{{0,nil},{0,nil}},
undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
undefined,undefined}
** Reason for termination ==  
** {'module could not be loaded',
[{undefined,init,
[{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.9934227485209703">>},
true,true,<0.21310.24>,[],<0.629.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6868.7071.0>,<6868.7070.0>},
{<6867.19845.80>,<6867.19844.80>},
{<0.21601.24>,<0.21548.24>}]},
true,#Fun<rabbit_amqqueue_process.5.64830354>]},
{rabbit_amqqueue_process,handle_call,3},
{gen_server2,handle_msg,2},
{proc_lib,wake_up,3}]}

but comes online eventually and can do "rabbitmqctl status"

starts up node2, also reports a lot of:
=ERROR REPORT==== 18-Mar-2013::11:11:06 ===
** Generic server <0.640.0> terminating
** Last message in was {init,<0.152.0>}
** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.1019297200255096">>},
true,true,<0.977.11>,[],<0.640.0>,[],[],
undefined,[]},
none,false,undefined,undefined,
{[],[]},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
1,
{{0,nil},{0,nil}},
undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
undefined,undefined}
** Reason for termination ==  
** {'module could not be loaded',
[{undefined,init,
[{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.1019297200255096">>},
true,true,<0.977.11>,[],<0.640.0>,[],[],undefined,[]},
true,#Fun<rabbit_amqqueue_process.5.64830354>]},
{rabbit_amqqueue_process,handle_call,3},
{gen_server2,handle_msg,2},
{proc_lib,wake_up,3}]}
=ERROR REPORT==== 18-Mar-2013::11:11:06 ===
** Generic server <0.645.0> terminating
** Last message in was {init,<0.152.0>}
** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.8794151877518743">>},
true,true,<0.30538.0>,[],<0.645.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6872.28270.5>,<6872.28269.5>},
{<0.32304.1>,<0.30804.0>}]},
none,false,undefined,undefined,
{[],[]},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
1,
{{0,nil},{0,nil}},
undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
undefined,undefined}
** Reason for termination ==  
** {'module could not be loaded',
[{undefined,init,
[{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.8794151877518743">>},
true,true,<0.30538.0>,[],<0.645.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6872.28270.5>,<6872.28269.5>},{<0.32304.1>,<0.30804.0>}]},
true,#Fun<rabbit_amqqueue_process.5.64830354>]},
{rabbit_amqqueue_process,handle_call,3},
{gen_server2,handle_msg,2},
{proc_lib,wake_up,3}]}

node 2 comes online, I can now query rabbitmqctl status
starting up node 1, comes online
the cluster is now working again but several durable queues are gone(!)




_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Re: Bring cluster up after node crash

Tim Watson-6
Hi Carl,

What version of rabbit are you running? A number of bugs pertaining to the 'Discarding message ... in an old incarnation .. of this node' were fixed in recent(ish) releases.

Cheers,
Tim


Re: Bring cluster up after node crash

Tim Watson-6
Hi Carl,

On 20 Mar 2013, at 16:25, Tim Watson wrote:
> What version of rabbit are you running? A number of bugs pertaining to the 'Discarding message ... in an old incarnation .. of this node' were fixed in recent(ish) releases.
>

And another couple of questions if that's ok. Firstly - how did you install RabbitMQ on each of these nodes? It's possible one or more installs is corrupted somehow - have you made any modifications to the installs? What does the config look like for each of the nodes?  

> On 19 Mar 2013, at 03:41, Carl Hörberg wrote:
>> Node1 floods the logs for a while at a rate of 20-100/sec:
>> =ERROR REPORT==== 18-Mar-2013::07:10:40 ===
>> Discarding message {'$gen_call',{<0.17965.1>,#Ref<0.0.1.90282>},stat} from <0.17965.1> to <0.5037.1> in an old incarnation (1) of this node (2)
>>
>> Start up node 3
>> Floods
>> =ERROR REPORT==== 18-Mar-2013::08:23:15 ===
>> Discarding message {'$gen_call',{<0.7609.0>,#Ref<0.0.1.142489>},stat} from <0.7609.0> to <0.25515.26> in an old incarnation (1) of this node (3)
>> and is stuck at  
>> "starting exchange, queue and binding recovery ..."
>>

This 'old incarnation of ...' stuff indicates that we have a process id for a queue that is no longer valid. In theory, the only way (that I can see) for this to happen is if a queue master restarts faster than any of the slaves can detect its death. We have an outstanding bug to look at that, but it may not be relevant since recent releases have included several HA bug fixes. Regardless, that kind of problem ought to present far earlier than the 'stat' request that's failing...
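
To put a number on the "20-100/sec" flood, the discarded messages can be counted per timestamp. The following is only a rough sketch: `count_discards` is a hypothetical helper name, and it assumes the log keeps the two-line layout quoted above (a `=ERROR REPORT====` header carrying the timestamp, followed by a line starting with `Discarding message`).

```shell
# Count 'Discarding message' reports per second in a RabbitMQ log.
# Reads the log on stdin and prints "<timestamp> <count>" lines.
count_discards() {
  awk '
    # Headers look like: =ERROR REPORT==== 18-Mar-2013::07:10:40 ===
    # The third whitespace-separated field is the timestamp.
    /^=[A-Z]+ REPORT====/ { ts = $3 }
    # Attribute each discard to the most recent header seen.
    /^Discarding message/ { n[ts]++ }
    END { for (t in n) print t, n[t] }
  ' | sort
}
```

It would be fed the node's log on stdin, for example `count_discards < /var/log/rabbitmq/rabbit@tiger02.log` (that path is an assumption based on the node name in the thread, so adjust it to wherever the log actually lives).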

>> Start up node 2, starts fast, says "Broker started" in startup_log, but doesn't list the plugins; "service rabbitmq-server start" never returns and rabbitmqctl status never returns either
>>

That sounds suspicious - are you sure the enabled-plugins file and configuration for that node are intact?

>> node 2 then runs out of memory again, without client connections this time:  
>> =INFO REPORT==== 18-Mar-2013::09:09:35 ===
>> vm_memory_high_watermark set. Memory used:7336394640 allowed:7031336140
>> =WARNING REPORT==== 18-Mar-2013::09:09:35 ===
>> memory resource limit alarm set on node rabbit@tiger02
>>

Is this happening whilst node 1 is still stuck? How long does it take (roughly) to reach this state?

>> Querying /api/overview at node1 gives:
>> {error,{error,{badmatch,false},
>> [{rabbit_mgmt_wm_overview,version,1},
>> {rabbit_mgmt_wm_overview,to_json,2},
>> {webmachine_resource,resource_call,3},
>> {webmachine_resource,do,3},
>> {webmachine_decision_core,resource_call,1},
>> {webmachine_decision_core,decision,1},
>> {webmachine_decision_core,handle_request,2},
>> {rabbit_webmachine,'-makeloop/1-fun-0-',2}]}}
>>

What version of Erlang are you running? Upgrading to a recent version of Erlang would be a good idea due to bug fixes and the fact that line numbers in exception stack traces would make it easier to identify where things are going wrong.

For that matter, what OS/Platform are you running on? How did you install Erlang?

>> node 3 starts eventually.  
>> kills node 2, starts again, stops at "starting database …"

What do you mean 'kills node 2' exactly? A node will never kill another node. Do you mean that 'you' killed node 2? If so, how did you do this?

>> nothing in the log or startup_err, cpu usage 0%
>> kills after 30min and starts again, same thing.  
>>

Again, what do you mean 'kills after 30min and starts again' - is this something you're doing? How are you 'killing' these nodes?

>> node 3 can now output rabbitmqctl status, node 1 still cannot.
>> node 1 can't be shutdown, force kills

Right - so at this point you've done something like `kill -9` right?

>> with node1 down, node 2 now gets past "starting database" and starts
>> neither node 2 nor node 3 responds to rabbitmqctl status

For how long do they not respond? I wonder if it could be that all these 'kill' signals you're issuing have left the mnesia database in an inconsistent state somehow.

>> shutting down node 2, but doesn't respond, have to do kill -9

'shutting down node 2' how - are you issuing `sudo rabbitmqctl stop` to do that?

>> node 3 still doesn't respond to rabbitmqctl status
>> shuts down node 3, doesn't respond, killing it instead, now all nodes are down.
>>

The same approach right?

>> note: When rabbitmqctl status doesn't work other stuff like list_users, cluster_status etc. works.
>>

Sounds like a process is stuck somewhere - the status call attempts to list all running erlang applications on the node, with the timeout set to 'infinity'. If an application has got stuck during startup (or shutdown!) that can be one of the symptoms. Again, please tell us which version of rabbit you're running. We've fixed bugs in (relatively) recent releases that presented as supervision trees getting stuck during shutdown/restart, which might (possibly) explain some of this.
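
Since the status call itself waits with no timeout, one way to keep a diagnostic session from hanging is to impose a deadline from the outside. This is a minimal sketch, assuming GNU coreutils `timeout` is available; `with_deadline` is a hypothetical wrapper name, not part of RabbitMQ.

```shell
# Run a command, but give up after a deadline instead of waiting forever.
# Usage: with_deadline SECONDS COMMAND [ARGS...]
with_deadline() {
  secs=$1
  shift
  timeout "$secs" "$@"
  rc=$?
  # GNU timeout exits with status 124 when the deadline was hit.
  if [ "$rc" -eq 124 ]; then
    echo "timed out after ${secs}s: $*" >&2
  fi
  return "$rc"
}

# Example (not run here): with_deadline 30 rabbitmqctl status
```

An exit status of 124 then separates "the node did not answer in time" from an ordinary failure, which is handy when checking several nodes in a row.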

>> Starting up node3, log now gets flooded with:
>> =ERROR REPORT==== 18-Mar-2013::11:09:04 ===
>> ** Generic server <0.629.0> terminating
>> ** Last message in was {init,<0.182.0>}
>> ** When Server state == {q,{amqqueue,
[snip]
>> ** Reason for termination ==  
>> ** {'module could not be loaded',
>> [{undefined,init,
[snip]

This error has occurred because the backing queue module for the queue process is set to 'undefined' - have you made any configuration changes, such as setting the name of the backing queue module by any chance?

Please let us know the answers to these queries and we'll try to figure out what's going on.

Cheers,
Tim

 

Re: Bring cluster up after node crash

carlhoerberg
RabbitMQ 3.0.4, Erlang R14B04. Ubuntu 12.04, installed with apt-get install from your ppa, nothing custom at all.

Nothing exciting in the config:
[
  {rabbit, [
    {log_levels, [{connection, error}]},
    {vm_memory_high_watermark, 0.8},
    {tcp_listeners, [{"0.0.0.0", 5672}]},
    {ssl_listeners, [{"0.0.0.0", 5671}]},
    {ssl_options, [{cacertfile, "/etc/rabbitmq/ca.pem"},
                   {certfile, "/etc/rabbitmq/key.pem"}]}
  ]},
  {rabbitmq_management, [
    {listener, [{port, 15672},
                {ip, "0.0.0.0"},
                {ssl, true}]}
  ]}
].

Only the mgmt plugin enabled.

I had no problems with file corruption as far as I know.

Yes, I was "killing" the nodes, with kill -9

No, haven't touched the backing queue or anything like that.
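
As a cross-check, the figures in the memory alarm from the first post can be lined up with the `vm_memory_high_watermark` of 0.8 in this config. The sketch below assumes the logged "allowed" value is simply 0.8 times the RAM the broker detected; the variable names are illustrative only.

```shell
# Figures taken from the memory alarm log lines in the first post.
used=7336394640      # "Memory used"
allowed=7031336140   # "allowed"

# With a watermark of 0.8 (assumed to apply to detected RAM):
#   total = allowed / 0.8 = allowed * 10 / 8
total=$(( allowed * 10 / 8 ))
over=$(( used - allowed ))

echo "detected RAM: ${total} bytes (~$(( total / 1048576 )) MiB)"
echo "over limit:   ${over} bytes (~$(( over / 1048576 )) MiB)"
```

Under that assumption the node sees roughly 8382 MiB of RAM and was about 290 MiB over its limit when the alarm fired.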

Re: Bring cluster up after node crash

Tim Watson-5
Hi

On 23 Mar 2013, at 03:01, carlhoerberg <[hidden email]> wrote:
> RabbitMQ 3.0.4, Erlang R14B04. Ubuntu 12.04, installed with apt-get install
> from your ppa, nothing custom at all.
>

I would strongly advise you to upgrade to the latest Erlang if possible. Many very important bug fixes have been incorporated since R14B.

> I had no problems with file corruption as far as i know.
>
[snip]
> Yes, I was "killing" the nodes, with kill -9
>

You shouldn't (have to) be doing that, obviously. It's possible (though somewhat unlikely) that a brutal kill might've left your file system in an inconsistent state.

> No, haven't touched the backing queue or anything like that.
>

Is there anything else in the logs you can give us to go on?


Re: Bring cluster up after node crash

carlhoerberg
What's the recommended way of using a recent Erlang version and RabbitMQ on Ubuntu?
The erlang-solutions packages and your Ubuntu package don't seem to play well together. I found this "hack" though: https://gist.github.com/RJ/2284940

We had to "kill" them, as rabbitmqctl stop didn't respond at all..

Re: Bring cluster up after node crash

Jean Paul Galea
I ran into the same problem when installing Erlang from erlang-solutions.

The problem is with the dependency "erlang-nox" which is declared in the
RabbitMQ package. This package is a meta-package which tries to install
erlang packages from the Ubuntu repository, conflicting with the ones
from erlang-solutions.

I also found the hack that you are linking to, but I think it can be
done more elegantly.

I wrote this simple bash script: it fetches the rabbitmq-server package using
apt-get, drops the erlang-nox dependency and installs it.

One thing that may bother you is that apt-get will always flag
rabbitmq-server as "upgradeable", since the installed package differs
from the one in the repo.

However, if you do something like `apt-get update && apt-get upgrade &&
apt-get dist-upgrade` it will __not__ automatically re-install the package.

Also note that the script does not actually remove the "erlang-nox"
dependency, rather it simply replaces the whole line with "adduser,
logrotate", hence if the RabbitMQ team declares a new dependency, this
script would need to be updated.

------------------------

cat > /tmp/rabbitmq-install.sh << "EOF"
#!/bin/bash

# Scratch directories: one for the download, one for the unpacked package.
TMPDIR1=$(mktemp -d) || exit 99
TMPDIR2=$(mktemp -d) || exit 99
trap 'rm -rf "$TMPDIR1" "$TMPDIR2"' 0 1 2 3 13 15

cd "$TMPDIR1" || exit 99

/usr/bin/apt-get download rabbitmq-server || exit 1

# The download directory holds exactly one file: the .deb just fetched.
PACKAGE=$(ls -1)

# Unpack the payload and the control files, rewrite Depends, then rebuild.
/usr/bin/dpkg-deb --extract "$PACKAGE" "$TMPDIR2"
/usr/bin/dpkg-deb --control "$PACKAGE" "${TMPDIR2}/DEBIAN"
sed --in-place 's/^Depends:.*$/Depends: adduser, logrotate/' \
    "${TMPDIR2}/DEBIAN/control"
/usr/bin/dpkg --build "$TMPDIR2" "${PACKAGE}.modified"
/usr/bin/dpkg --install "${PACKAGE}.modified"
EOF

/bin/bash /tmp/rabbitmq-install.sh

rm /tmp/rabbitmq-install.sh
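
To see what the `sed` substitution does before touching a real package, it can be tried on a stand-in control file. This is only an illustration: the control file contents below are made up (including the version string and the exact `Depends` line), and the real package's fields may differ.

```shell
# Hypothetical control file, standing in for DEBIAN/control from the package.
control=$(mktemp)
cat > "$control" << 'EOF'
Package: rabbitmq-server
Version: 3.0.4-1
Depends: erlang-nox (>= 1:12.b.3), adduser, logrotate
Description: AMQP server written in Erlang
EOF

# Same substitution as the script above: the whole Depends line is replaced,
# so every dependency not listed in the replacement is dropped.
depends=$(sed 's/^Depends:.*$/Depends: adduser, logrotate/' "$control" \
  | grep '^Depends:')
echo "$depends"

rm -f "$control"
```

On the rebuilt package itself, `dpkg-deb --field <file> Depends` should show the same rewritten line, which is a quick sanity check before installing.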

