Help pinpointing an error

Jaime Herazo B.
Hi.

I'm new to RabbitMQ and to messaging platforms in general. I look after
an existing Rabbit setup as part of my sysadmin duties. Other people
here know it far better than I do, but in general I have to take care
of the servers.

Today a rabbit instance went down. I restarted the service only to be
greeted by screams as apparently many messages were lost in the process
(as far as I understood, once queues were marked as "Durable" this
couldn't happen, but it happened).

The "Reason for termination" was:

{{badmatch,[{file_summary,2064936,4810835,2064935,2064937,16780759,true,1}]},
 [
  {rabbit_msg_store,combine_files,3},
  {rabbit_msg_store_gc,attempt_action,3},
  {rabbit_msg_store_gc,handle_cast,2},
  {gen_server2,handle_msg,2},
  {proc_lib,wake_up,3}
 ]
}

I'm having trouble even identifying what this means, let alone
preventing it from happening again. It started just fine, so it was
probably a transient error, but the fact that it took with it all the
messages in the queue is troubling.

Can you please point me towards more resources to handle these kinds of
problems in the future that don't involve loss of data? What did I do
wrong?

Also, do you see a hint of what went wrong there, or do I need to give
more info for this?

Thanks for any help or hints.

_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Re: Help pinpointing an error

Tim Watson-6
Hi

On 31 Aug 2012, at 10:35, Jaime Herazo B. wrote:

> Hi.
>
> Today a rabbit instance went down. I restarted the service only to be
> greeted by screams as apparently many messages were lost in the process
> (as far as I understood, once queues were marked as "Durable" this
> couldn't happen, but it happened).
>

This shouldn't happen, so let's try and figure out what went wrong.


> The "Reason for termination" was:
>
> {{badmatch,[{file_summary,2064936,4810835,2064935,2064937,16780759,true,1}]},
> [
>  {rabbit_msg_store,combine_files,3},
>  {rabbit_msg_store_gc,attempt_action,3},
>  {rabbit_msg_store_gc,handle_cast,2},
>  {gen_server2,handle_msg,2},
>  {proc_lib,wake_up,3}
> ]
> }
>

The rabbit_msg_store module handles persisting the contents (i.e., data) for durable queues on disk. The badmatch (which means we didn't see the data we expected to see when assigning something) occurs because the 'readers' field for the file_summary is expected to be 0, not 1. This routine is called when compacting the data (e.g., during a garbage collection-esque process) and is called when the message store is initialising, so my reading thus far is that we've somehow ended up with a process trying to read from the message store before it's properly initialised.
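
To make the badmatch concrete, here is a rough Python sketch of the check described above. It is not the Erlang code from rabbit_msg_store: only the `readers` field is named in this thread, so the other field names, the `BadMatch` exception and `check_ready_to_combine` are assumptions made up for illustration.

```python
from collections import namedtuple

# Field order follows the tuple in the crash report. Only `readers` is named
# in this thread; every other field name here is an assumption.
FileSummary = namedtuple(
    "FileSummary",
    ["file", "valid_total_size", "left", "right", "file_size", "locked", "readers"],
)


class BadMatch(Exception):
    """Stands in for Erlang's badmatch error (hypothetical helper)."""


def check_ready_to_combine(summary):
    # In Erlang this would be a pattern match that only succeeds when
    # readers is 0; any other value fails with badmatch and carries the
    # offending term, which is what shows up in the crash report.
    if summary.readers != 0:
        raise BadMatch(summary)
    return summary


# The values from the "Reason for termination" above.
crashed = FileSummary(2064936, 4810835, 2064935, 2064937, 16780759, True, 1)

try:
    check_ready_to_combine(crashed)
except BadMatch as err:
    print("badmatch:", err.args[0])
```

The point of the sketch is only that the term printed after `badmatch` is the value that failed to match, so its last element (1 where 0 was expected) is the clue.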

> I'm having trouble even identifying what this means, let alone
> preventing it from happening again. It started just fine, so it was
> probably a transient error, but the fact that it took with it all the
> messages in the queue is troubling.
>

Indeed. We must stop this from happening.

> Can you please point me towards more resources to handle these kinds of
> problems in the future that don't involve loss of data? What did I do
> wrong?
>
> Also, do you see a hint of what went wrong there, or do I need to give
> more info for this?
>

Someone better versed in the mechanics of the message store may chime in with a good explanation, but for my part I'd like to understand a few more things:

1. How did your rabbit go down (crashed, accidentally restarted, etc.)?
2. Exactly what steps did you take to restart it?
3. What kind of configuration do you have (is it clustered, any HA queues, etc.)?

And if you could please send over the logs and sasl-logs (stripped of any private data if needs be) that would be very helpful indeed.



Re: Help pinpointing an error

Matthias Radestock-3
On 31/08/12 10:56, Tim Watson wrote:
> 1. How did your rabbit go down (crashed, accidentally restarted, etc.)?
> 2. Exactly what steps did you take to restart it?
> 3. What kind of configuration do you have (is it clustered, any HA queues, etc.)?

The most important question of all is "What version of rabbit are you
running?"

We fixed numerous bugs in the persister, but that was a long time ago
and we have not seen any problem reports since.


Matthias.

Re: Help pinpointing an error

Jaime Herazo B.
I just checked on the machine with "rabbitmqctl list_queues name
durable", and all show "false". Seems like someone here made a mistake
and forgot to make the queues durable, which was something we all had
assumed. Sorry for the alarm.
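
A check like this can be scripted. The Python sketch below picks the non-durable queues out of the text printed by `rabbitmqctl list_queues name durable`. It assumes the output is one tab-separated name/durable pair per line, surrounded by banner lines; the sample listing and the queue names in it are invented, and `non_durable_queues` is a hypothetical helper.

```python
def non_durable_queues(listing):
    """Return the names of queues whose durable column is "false"."""
    names = []
    for line in listing.splitlines():
        parts = line.split("\t")
        # Skip banner lines such as "Listing queues ..." and "...done."
        if len(parts) != 2 or parts[1] not in ("true", "false"):
            continue
        if parts[1] == "false":
            names.append(parts[0])
    return names


# Hypothetical sample output; these queue names are made up.
sample = (
    "Listing queues ...\n"
    "orders\tfalse\n"
    "audit\ttrue\n"
    "events\tfalse\n"
    "...done.\n"
)

print(non_durable_queues(sample))
```

On a real broker host the listing could come from `subprocess.check_output(["rabbitmqctl", "list_queues", "name", "durable"], text=True)` instead of the sample string.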

Is it possible to change the parameters of the queues and make them
durable without having to make a new one with a different name as
suggested in the Python tutorial?

Also the broker still crashed, so I'm posting the info you asked for anyway:

* rabbitmq-server 2.8.4-1 on Debian Stable (6.0.3)
* The service crashed, so i did a "service rabbitmq-server restart".
Naturally, since now we know that the queues are not durable as we
expected, all was lost.
* No HA, no clustering, 2 machines sending information to each other,
and a couple dozen clients consuming the messages put on the queues by
the one that crashed.

And for the logs, here's the pastebin:
http://pastebin.com/FTHGQVPQ
The sasl log has far more information, but it was almost 1 MB of
information on that crash alone, so initially I'm just pasting this and
if needed I'll add the other one.



Re: Help pinpointing an error

Tim Watson-6

On 31 Aug 2012, at 12:32, Jaime Herazo B. wrote:

> I just checked on the machine with "rabbitmqctl list_queues name
> durable", and all show "false". Seems like someone here made a mistake
> and forgot to make the queues durable, which was something we all had
> assumed. Sorry for the alarm.
>

Well, at least the message loss is not unexpected then.

> Is it possible to change the parameters of the queues and make them
> durable without having to make a new one with a different name as
> suggested in the Python tutorial?

Not that I'm aware of.
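
If keeping the same name matters, one possible workaround is to delete the queue and declare it again as durable, at the cost of whatever messages are still in it. The Python sketch below assumes a channel object with pika-style `queue_delete(queue=...)` and `queue_declare(queue=..., durable=...)` methods. `redeclare_as_durable` and `RecordingChannel` are hypothetical names, and the stand-in channel exists only so the sketch runs without a broker.

```python
def redeclare_as_durable(channel, queue_name):
    """Delete a queue and declare it again with durable=True.

    WARNING: deleting the queue discards any messages still in it.
    """
    channel.queue_delete(queue=queue_name)
    channel.queue_declare(queue=queue_name, durable=True)


class RecordingChannel:
    """Stand-in for a real channel; it only records the calls made."""

    def __init__(self):
        self.calls = []

    def queue_delete(self, queue):
        self.calls.append(("queue_delete", queue))

    def queue_declare(self, queue, durable=False):
        self.calls.append(("queue_declare", queue, durable))


channel = RecordingChannel()
redeclare_as_durable(channel, "task_queue")
print(channel.calls)
```

A durable queue on its own is not enough for messages to survive a restart: as the Python tutorial also notes, the publisher has to mark each message as persistent (delivery mode 2).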

>
> Also the broker still crashed, so I'm posting the info you asked for anyway:
>
> * rabbitmq-server 2.8.4-1 on Debian Stable (6.0.3)

That sounds new enough to me - Matthias?

> * The service crashed, so i did a "service rabbitmq-server restart".
> Naturally, since now we know that the queues are not durable as we
> expected, all was lost.
> * No HA, no clustering, 2 machines sending information to each other,
> and a couple dozen clients consuming the messages put on the queues by
> the one that crashed.
>
> And for the logs, here's the pastebin:
> http://pastebin.com/FTHGQVPQ
> The sasl log has far more information, but it was almost 1 MB of
> information on that crash alone, so initially I'm just pasting this and
> if needed I'll add the other one.
>

Is that it? It seems like you've pasted only a subset of the log file, and I'd like to know a bit more about what was happening on both nodes when this occurs. Can we see the sasl logs for both nodes too please, and the output of `rabbitmqctl report` for both nodes?

Re: Help pinpointing an error

Tim Watson-6

On 31 Aug 2012, at 13:18, Tim Watson wrote:


> Is that it? It seems like you've pasted only a subset of the log file, and I'd like to know a bit more about what was happening on both nodes when this occurs. Can we see the sasl logs for both nodes too please, and the output of `rabbitmqctl report` for both nodes?

I meant 'the node', not 'both nodes'. Also, what Erlang version are you using (what do you see when running `erl`)?


Re: Help pinpointing an error

Tim Watson-6
Jaime,

I note that the Erlang release for Debian Stable is quite old (R14A) and that a couple of important bug fixes were made in ets since then:

- OTP-9181  ETS table type ordered_set could order large integer keys wrongly on pure 64bit platforms. This is now corrected.
- OTP-9281  ETS tables using the ordered_set option could potentially get into an internally inconsistent state.

So I would strongly suggest upgrading to the latest Erlang (R15). The squeeze-backports repo has an R15B version, which should do nicely to remove this from the line-up of potential causes.

Cheers,
Tim
