|
Hi.
I'm new to RabbitMQ and to messaging platforms in general. As part of my sysadmin duties I look after a RabbitMQ setup that was already in place. Other people here know it much better than I do, but I'm the one who has to take care of the servers.

Today a rabbit instance went down. I restarted the service, only to be greeted by screams, as apparently many messages were lost in the process (as far as I understood, this couldn't happen once queues were marked as "Durable", but it happened).

The "Reason for termination" was:

{{badmatch,[{file_summary,2064936,4810835,2064935,2064937,16780759,true,1}]},
 [
  {rabbit_msg_store,combine_files,3},
  {rabbit_msg_store_gc,attempt_action,3},
  {rabbit_msg_store_gc,handle_cast,2},
  {gen_server2,handle_msg,2},
  {proc_lib,wake_up,3}
 ]
}

I'm having trouble even identifying what this means, let alone preventing it from happening again. It started just fine, so it was probably a transient error, but the fact that it took all the messages in the queue with it is troubling.

Can you please point me towards more resources for handling these kinds of problems in the future without losing data? What did I do wrong?

Also, do you see a hint of what went wrong there, or do I need to give more info?

Thanks for any help or hints.

_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss |
|
Hi
On 31 Aug 2012, at 10:35, Jaime Herazo B. wrote:

> Hi.
>
> Today a rabbit instance went down. I restarted the service only to be
> greeted by screams as apparently many messages were lost in the process
> (as far as i understood, once queues were marked as "Durable" this
> couldn't happen, but it happened).
>

This shouldn't happen, so let's try and figure out what went wrong.

> The "Reason for termination" was:
>
> {{badmatch,[{file_summary,2064936,4810835,2064935,2064937,16780759,true,1}]},
> [
> {rabbit_msg_store,combine_files,3},
> {rabbit_msg_store_gc,attempt_action,3},
> {rabbit_msg_store_gc,handle_cast,2},
> {gen_server2,handle_msg,2},
> {proc_lib,wake_up,3}
> ]
> }
>

The rabbit_msg_store module handles persisting the contents (i.e., data) for durable queues on disk. The badmatch (which means we didn't see the data we expected to see when assigning something) occurs because the 'readers' field for the file_summary is expected to be 0, not 1. This routine is called when compacting the data (e.g., during a garbage collection-esque process) and also when the message store is initialising, so my reading thus far is that we've somehow ended up with a process trying to read from the message store before it's properly initialised.

> I'm having trouble even identifying what does this mean, let alone
> preventing it from happening again. It started just fine, so it was
> probably a transient error, but the fact that it took with it all the
> messages in the queue is troubling.
>

Indeed. We must stop this from happening.

> Can you please point me towards more resources to handle these kinds of
> problems in the future that don't involve loss of data? What did i do
> wrong?
>
> Also, do you see a hint of what went wrong there, or do i need to give
> more info for this?
>

Someone better versed in the mechanics of the message store may chime in with a good explanation, but for my part I'd like to understand a few more things:

1. how did your rabbit go down (crashed, accidentally restarted, etc)?
2. exactly what steps did you take to restart it?
3. what kind of configuration do you have (is it clustered, any HA queues, etc)?

And if you could please send over the logs and sasl-logs (stripped of any private data if need be), that would be very helpful indeed. |
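To make the badmatch in the reply above easier to read, here is a minimal Python sketch that labels the values in the crashed file_summary tuple and repeats the one check the reply describes (readers must be 0). The field names and their order are an assumption taken from the file_summary record in rabbit_msg_store and should be verified against the source for your version; `FileSummary` and `readers_mismatch` are hypothetical names used only for illustration.

```python
from collections import namedtuple

# ASSUMPTION: field order mirrors the file_summary record in
# rabbit_msg_store; check the source for the version you run.
FileSummary = namedtuple(
    "FileSummary",
    ["file", "valid_total_size", "left", "right",
     "file_size", "locked", "readers"],
)


def readers_mismatch(summary):
    """Return True when 'readers' is not 0, the condition the reply
    identifies as the cause of the badmatch."""
    return summary.readers != 0


# Values copied from the "Reason for termination" in the thread.
crashed = FileSummary(2064936, 4810835, 2064935, 2064937, 16780759, True, 1)

print(crashed.readers)            # the value that failed to match
print(readers_mismatch(crashed))
```

Read this way, the last element of the tuple (1) is the value that did not match the expected 0.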
|
On 31/08/12 10:56, Tim Watson wrote:
> 1. how did your rabbit go down (crashed, accidentally restarted, etc)?
> 2. exactly what steps did you take to restart it
> 3. what kind of configuration do you have (is it clustered, any HA queues, etc)

The most important question of all is "What version of rabbit are you running?"

We fixed numerous bugs in the persister, but that was a long time ago and we have not seen any problem reports since.

Matthias. |
|
I just checked on the machine with "rabbitmqctl list_queues name durable", and all show "false". It seems someone here made a mistake and forgot to make the queues durable, which was something we had all assumed. Sorry for the alarm.

Is it possible to change the parameters of the queues and make them durable without having to make a new one with a different name, as suggested in the Python tutorial?

Also, the broker still crashed, so I'm posting the info asked for anyway:

* rabbitmq-server 2.8.4-1 on Debian Stable (6.0.3)
* The service crashed, so I did a "service rabbitmq-server restart". Naturally, since we now know that the queues are not durable as we expected, all was lost.
* No HA, no clustering, 2 machines sending information to each other, and a couple dozen clients consuming the messages put on the queues by the one that crashed.

And for the logs, here's the pastebin:
http://pastebin.com/FTHGQVPQ

The sasl log has far more information, but it was almost 1 MB for that crash alone, so initially I'm just pasting this, and if needed I'll add the other one.

On 31/08/12 12:13, Matthias Radestock wrote:
> On 31/08/12 10:56, Tim Watson wrote:
>> 1. how did your rabbit go down (crashed, accidentally restarted, etc)?
>> 2. exactly what steps did you take to restart it
>> 3. what kind of configuration do you have (is it clustered, any HA
>> queues, etc)
>
> The most important question of all is "What version of rabbit are you
> running?"
>
> We fixed numerous bugs in the persister, but that was a long time ago
> and we have not seen any problem reports since.
>
>
> Matthias. |
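The check described in the message above can be scripted so that non-durable queues are flagged before a restart rather than after. The following is a minimal sketch that parses the text printed by `rabbitmqctl list_queues name durable`. It assumes one queue per line with tab-separated columns, and it skips any other lines such as banners. The sample listing and queue names are made up, and `non_durable_queues` is a hypothetical helper, not part of any RabbitMQ tooling.

```python
def non_durable_queues(listing):
    """Return the names of queues whose 'durable' column is 'false'.

    ASSUMPTION: each queue line is "<name>\\t<true|false>"; any line
    that does not fit that shape (banners, blanks) is ignored.
    """
    names = []
    for line in listing.splitlines():
        fields = line.split("\t")
        if len(fields) != 2 or fields[1] not in ("true", "false"):
            continue  # not a queue row
        if fields[1] == "false":
            names.append(fields[0])
    return names


# Hypothetical sample output; the queue names are invented.
SAMPLE = (
    "Listing queues ...\n"
    "orders\tfalse\n"
    "audit\ttrue\n"
    "events\tfalse\n"
    "...done.\n"
)

print(non_durable_queues(SAMPLE))
```

On a real node the listing could be captured with `subprocess.check_output(["rabbitmqctl", "list_queues", "name", "durable"])` and decoded to text before being passed in.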
|
On 31 Aug 2012, at 12:32, Jaime Herazo B. wrote:

> I just checked on the machine with "rabbitmqctl list_queues name
> durable", and all show "false". Seems like someone here made a mistake
> and forgot to make the queues durable, which was something we all had
> assumed. Sorry for the alarm.
>

Well, at least the message loss is not unexpected then.

> Is it possible to change the parameters of the queues and make them
> durable without having to make a new one with a different name as
> suggested in the Python tutorial?

Not that I'm aware of.

>
> Also the broker still crashed, so i'm posting the info asked anyway:
>
> * rabbitmq-server 2.8.4-1 on Debian Stable (6.0.3)

That sounds new enough to me - Matthias?

> * The service crashed, so i did a "service rabbitmq-server restart".
> Naturally, since now we know that the queues are not durable as we
> expected, all was lost.
> * No HA, no clustering, 2 machines sending information to each other,
> and a couple dozen clients consuming the messages put on the queues by
> the one that crashed.
>
> And for the logs, here's the pastebin:
> http://pastebin.com/FTHGQVPQ
> The sasl log has far more information, but it was almost 1Mb of
> information on that crash alone, so initially i'm just pasting this and
> if needed i'll just add the other one.
>

Is that it? It seems like you've pasted only a subset of the log file, and I'd like to know a bit more about what was happening on both nodes when this occurred. Can we see the sasl logs for both nodes too please, and the output of `rabbitmqctl report` for both nodes? |
|
On 31 Aug 2012, at 13:18, Tim Watson wrote:
I meant 'the node', not 'both nodes'. Also, what Erlang version are you using (what do you see when running `erl`)?
|
|
Jaime,

I note that the Erlang release for Debian Stable is quite old (R14A) and that a couple of important bug fixes were made in ets since then:

- OTP-9181 ETS table type ordered_set could order large integer keys wrongly on pure 64bit platforms. This is now corrected.
- OTP-9281 ETS tables using the ordered_set option could potentially get into an internally inconsistent state.

So I would strongly suggest upgrading to the latest Erlang (R15). The squeeze-backports repo has a 15B version, which should do nicely to remove this from the line-up of potential causes.

Cheers,
Tim

On 31 Aug 2012, at 13:26, Tim Watson wrote:
|
