Consumer crash, redelivery and prefetch

Consumer crash, redelivery and prefetch

Thomas Riccardi

Hi,

I'm having issues with consumer crash, redelivery and prefetch:

The message ack strategy I've chosen is to ack a message only after the
consumer has finished working on it. I need this because I cannot afford
to lose any message, even if the consumer crashes while working on it.

For performance reasons my consumer has a non-zero prefetch, so while it
is handling a message (doing some work related to it), other messages
are pushed to the client and are already available when it finishes the
previous one.

With this, when a message makes the consumer crash, the handled message
*and* the prefetched ones are re-queued, with the "redelivered" flag set
to true.

I currently have no choice but to reject all messages that have the
"redelivered" flag set, because I have no way to distinguish the message
that caused the crash from the prefetched ones that did nothing wrong.
Indeed, if I retry a redelivered message, the consumer will crash again
on the poison message, in an infinite loop. However, I would prefer not
to reject the previously prefetched messages, since they are probably
fine. The same reasoning applies to transient crashes that are not
systematically reproduced by the message.
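
Concretely, my consumer currently looks roughly like this - a simplified
sketch against a recent pika API, where the queue name and do_work are
placeholders rather than my real code:

import pika

# Rough sketch of the current behaviour: ack only after the work is done,
# and reject anything that comes back with redelivered=True, because the
# poison message cannot be told apart from the innocent prefetched ones.

def do_work(body):
    pass  # the real processing goes here; this is what may crash the consumer

def on_message(ch, method, properties, body):
    if method.redelivered:
        ch.basic_reject(delivery_tag=method.delivery_tag, requeue=False)
        return
    do_work(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.basic_qos(prefetch_count=10)   # non-zero prefetch for throughput
channel.basic_consume(queue='work', on_message_callback=on_message)
channel.start_consuming()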


Ideally the solution would be to be able to distinguish the first
unacked message from the others when the consumer crashes, or
alternatively to have two types of ack: one "I'm starting to work on
this message" and a second "I've finished working on this message".

Otherwise a redelivery count would do the trick (even if it means
working several times on a message that crashes the consumer, which is
not efficient when we know crashes are mostly systematic).

The redelivery count is a "planned" feature according to
http://www.rabbitmq.com/specification.html, but when will it actually be
implemented?
A per-queue parameter, like the TTL, setting a maximum redelivery count
before the message is dead-lettered would be very helpful in my case (an
RPC layer on top of RabbitMQ, which is probably not a highly specific or
rare use case).
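
For what it's worth, the dead-lettering part already exists as a
per-queue argument; it is only the automatic redelivery limit that is
missing. A rough sketch, with placeholder names:

import pika

# Sketch only: dead-lettering already exists as a queue argument; what is
# missing is a broker-side maximum redelivery count that would route to the
# DLX automatically. All names here are placeholders.
conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
ch = conn.channel()
ch.exchange_declare(exchange='work.dlx', exchange_type='fanout')
ch.queue_declare(queue='work.dead')
ch.queue_bind(queue='work.dead', exchange='work.dlx')
ch.queue_declare(queue='work',
                 arguments={'x-dead-letter-exchange': 'work.dlx'})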


Thanks,
Thomas


Re: Consumer crash, redelivery and prefetch

michaelklishin

2014-03-13 23:31 GMT+04:00 Thomas Riccardi <[hidden email]>:
> Ideally the solution would be to be able to distinguish the first
> unacked message from the others when the consumer crashes, or
> alternatively to have two types of ack: one "I'm starting to work on
> this message" and a second "I've finished working on this message".

As far as RabbitMQ goes, all messages were delivered and are unacknowledged.
 

> Otherwise a redelivery count would do the trick (even if it means
> working several times on a message that crashes the consumer, which is
> not efficient when we know crashes are mostly systematic).

This is a sensible feature, and I'm not sure why Rabbit still does not have it.
Definitely worth investigating.
--
MK

http://github.com/michaelklishin
http://twitter.com/michaelklishin


Re: Consumer crash, redelivery and prefetch

Karl Nilsson
In reply to this post by Thomas Riccardi
Hi Thomas,

It is a great shame that a mature message broker such as RabbitMQ is so lacking in sensible poison message handling (or any strategy regarding redelivery). I raised the issue here a few months ago and was advised that the priority of the planned feature [basic / deliver / 01] had been raised, but I wasn't given any indication of when it might be delivered.


After that I wrote a blog entry summarising my investigations, which includes some of the suggested workarounds (such as using multiple queues), none of which I consider adequate.


There is an argument for not designing systems where losing a message is a big deal, but I think that is a topic for a different discussion. :)

Cheers
Karl



--
Karl Nilsson
twitter: @kjnilsson


Re: Consumer crash, redelivery and prefetch

Simon MacMullen
On 14/03/2014 9:42AM, Karl Nilsson wrote:
> It is a great shame that a mature message broker such as RabbitMQ is so
> lacking in sensible poison message handling (or any strategies regarding
> redelivery).

Agreed.

But there are a great many things we want to do, and only limited time
to do them in.

I suspect it will happen one day. Sorry I can't be more specific than
that, but we tend not to plan out a long way in advance.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, Pivotal

Re: Consumer crash, redelivery and prefetch

Laing, Michael P.
It's a good topic. 

In our standard framework, based on Python pika, a service may fail to process a message because an exception is raised - something unanticipated. The service will have chosen a default action to take in that case when it was initialized, typically 'reject', and it will typically log a warning as well.

We gather rejected messages in a 'reject' exchange and process them enough (via their headers) to route them back to their originators as well as to our own 'triage' queue.

Our messages all carry their processing history in their headers: region, zone, instance, pid, service, timestamp, etc. - again part of the framework.
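
Very roughly, the plumbing looks something like this - a simplified
sketch rather than our actual framework code, with placeholder names and
a hypothetical 'origin-queue' header standing in for the history headers:

import pika

# Simplified sketch: the work queue dead-letters into a 'reject' exchange;
# a triage consumer reads the copies and routes them back using headers.
conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
ch = conn.channel()

ch.exchange_declare(exchange='reject', exchange_type='fanout')
ch.queue_declare(queue='triage')
ch.queue_bind(queue='triage', exchange='reject')
ch.queue_declare(queue='work',
                 arguments={'x-dead-letter-exchange': 'reject'})

def on_rejected(ch, method, properties, body):
    headers = properties.headers or {}
    origin = headers.get('origin-queue')   # written by the service framework
    if origin:
        # Route the rejected message back to its originator's queue.
        ch.basic_publish(exchange='', routing_key=origin,
                         properties=properties, body=body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

ch.basic_consume(queue='triage', on_message_callback=on_rejected)
ch.start_consuming()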

We also gather and coordinate the logs of all services on all instances.

Additionally we replicate messages and process them in parallel through our Core clusters in multiple regions.

A truly poison message will fail spectacularly everywhere. We have not actually encountered one yet in production. We do get them in staging, and bells go off everywhere.

A failure of infrastructure will be localized to a region, zone, instance, or supporting service like Cassandra or the AWS control plane. Anticipated failures are retried. Unanticipated failures result in rejection of that message replica but other replicas should succeed. We do get these in production and can immediately tell where failures occurred and take appropriate action, e.g. shifting load away from failure if it has not yet taken place automatically.

Of course it would be nice to get more info upon rejection. We compensate by creating context around rejection and coordinating the context in near real time across the nyt⨍aбrik.

ml



Re: Consumer crash, redelivery and prefetch

Thomas Riccardi
Thanks for the replies, everyone; it seems we are not alone with this
feature request.


On Fri, 2014-03-14 at 08:28 -0400, Laing, Michael wrote:

> It's a good topic.
>
>
> In our std framework, based on python pika, a service may fail in
> processing a message due to an exception being raised - something
> unanticipated - the service will have chosen a default action to take
> in that case when it was initialized, typically 'reject'. Typically it
> will log a warning as well.
>
>
> We gather rejected messages in a 'reject' exchange and process them
> enough (via their headers) to route them back to their originators as
> well as to our own 'triage' queue.
>

This dead-lettering back to the originator using a headers dead-letter
exchange is a really useful pattern for RPC over RabbitMQ; I was
surprised not to find any mention of it in the readings or on the
mailing list.

(Alas, it requires duplicating the "replyTo" information - once in the
standard AMQP message property and a second time in a non-standard
header, since there is no RabbitMQ exchange that can route on the
"replyTo" property - or not using the standard "replyTo" at all.)
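
Something along these lines, I suppose - a rough sketch assuming a
recent pika API, with placeholder names and a hypothetical 'x-reply-to'
header duplicating the standard property:

import pika

# Rough sketch of the pattern: the request queue dead-letters into a headers
# exchange, and each caller binds its reply queue on its own header value,
# because RabbitMQ cannot route on the standard reply_to property.
conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
ch = conn.channel()

ch.exchange_declare(exchange='rpc.dlx', exchange_type='headers')
ch.queue_declare(queue='rpc.requests',
                 arguments={'x-dead-letter-exchange': 'rpc.dlx'})

# Caller side: a private reply queue, bound so rejected requests come back.
reply_queue = ch.queue_declare(queue='', exclusive=True).method.queue
ch.queue_bind(queue=reply_queue, exchange='rpc.dlx',
              arguments={'x-match': 'all', 'x-reply-to': reply_queue})

ch.basic_publish(
    exchange='', routing_key='rpc.requests', body=b'request payload',
    properties=pika.BasicProperties(
        reply_to=reply_queue,                   # standard property (not routable)
        headers={'x-reply-to': reply_queue}))   # duplicate, for headers routing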

> [...]

I think we will go with manual re-queueing (in the same queue) of
redelivered messages, with a custom "redelivery-count" header that we
increment ourselves, instead of just rejecting them as we do now.
Then, upon receiving a non-redelivered message (the re-published copy
arrives as a fresh delivery), we reject it or not according to the
custom "redelivery-count" header.

It's a variation of the re-queueing to a "probably poison" second queue
technique mentioned earlier in this thread. Indeed, I don't see why we
need a second queue when we can just modify a header and re-queue.

The only drawback is that we modify the message, which is avoided as
much as possible on the broker; but we are not on the broker, so we can
do that without any issue.
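
In pika-ish terms, roughly this (a sketch only; the queue name, header
name and threshold are placeholders):

import pika

QUEUE = 'work'       # placeholder
MAX_RETRIES = 3      # placeholder threshold

def do_work(body):
    pass  # real processing; this is what may crash the consumer

def on_message(ch, method, properties, body):
    headers = dict(properties.headers or {})
    count = headers.get('x-redelivery-count', 0)
    if method.redelivered:
        # Re-publish a fresh copy to the same queue (via the default exchange)
        # with the counter incremented, then ack the original so the
        # prefetched messages behind it are not thrown away with it.
        # (Other message properties are omitted here for brevity.)
        headers['x-redelivery-count'] = count + 1
        ch.basic_publish(
            exchange='', routing_key=QUEUE, body=body,
            properties=pika.BasicProperties(
                headers=headers, delivery_mode=properties.delivery_mode))
        ch.basic_ack(delivery_tag=method.delivery_tag)
        return
    if count >= MAX_RETRIES:
        # Give up: reject without requeue (dead-letters if a DLX is configured).
        ch.basic_reject(delivery_tag=method.delivery_tag, requeue=False)
        return
    do_work(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.basic_qos(prefetch_count=10)
channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
channel.start_consuming()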


Cheers,
Thomas


Re: Consumer crash, redelivery and prefetch

Matthias Radestock
On 17/03/14 18:47, Thomas Riccardi wrote:
> I think we will go with manual re-queueing (in the same queue) of
> redelivered messages with a custom "redelivery-count" header manually
> incremented, instead of currently just rejecting them.

That's certainly a viable approach. There are two notable differences
compared to using reject:

- rejecting re-queues messages in place whereas re-publishing enqueues
at the back

- extra logic is required to prevent message loss, i.e. depending on the
guarantees required by the application you may want to re-publish in
confirm mode and wait for the confirmation before acknowledging the
original message (see the sketch below).
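
Something like the following - a minimal sketch assuming a recent pika
API, with a placeholder queue name:

import pika
from pika.exceptions import NackError, UnroutableError

# Sketch of the confirm-mode variant: the original delivery is only acked
# once the broker has confirmed the re-published copy.
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.confirm_delivery()   # enable publisher confirms on this channel

def republish_then_ack(ch, method, properties, body):
    try:
        # With confirms on, BlockingChannel.basic_publish blocks until the
        # broker responds, and raises if the copy is nacked or unroutable.
        ch.basic_publish(exchange='', routing_key='work', body=body,
                         properties=properties, mandatory=True)
    except (NackError, UnroutableError):
        # Not confirmed: requeue (or simply leave unacked) so it is not lost.
        ch.basic_reject(delivery_tag=method.delivery_tag, requeue=True)
        return
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue='work', on_message_callback=republish_then_ack)
channel.start_consuming()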

Matthias.

Re: Consumer crash, redelivery and prefetch

Thomas Riccardi

On Tue, 2014-03-18 at 08:15 +0000, Matthias Radestock wrote:

> On 17/03/14 18:47, Thomas Riccardi wrote:
> > I think we will go with manual re-queueing (in the same queue) of
> > redelivered messages with a custom "redelivery-count" header manually
> > incremented, instead of currently just rejecting them.
>
> That's certainly a viable approach. There are two notable differences
> compared to using reject:
>
> - rejecting re-queues messages in place whereas re-publishing enqueues
> at the back

What about re-queueing and TTL? Is the timer still running from the
original queueing time, or is it reset as it would be by a re-publish?
Is there a difference between per-message TTL and per-queue message TTL
in this case?

>
> - extra logic is required to prevent message loss, i.e. depending the
> guarantees required by the application you may want to re-publish in
> confirm mode and wait for confirmation before acknowledging the original
> message.

Indeed, but that question should already have been raised and answered
(with confirm mode if needed) when the original message was published.


Thanks for the additional details.



Re: Consumer crash, redelivery and prefetch

Matthias Radestock
On 18/03/14 13:52, Thomas Riccardi wrote:
> What about re-queueing and TTL? Is the timer still running from the
> original queueing time? Or is it reset like a re-publishing?

From http://www.rabbitmq.com/ttl.html#per-queue-message-ttl:
<quote>
The original expiry time of a message is preserved if it is requeued
</quote>

So yes, that's another difference.
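
For reference, the two mechanisms in question can be set like this (a
sketch; the queue name and values are placeholders):

import pika

# The two TTL flavours, both expressed in milliseconds.
conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
ch = conn.channel()
ch.queue_declare(queue='work',
                 arguments={'x-message-ttl': 60000})        # per-queue message TTL
ch.basic_publish(exchange='', routing_key='work', body=b'payload',
                 properties=pika.BasicProperties(expiration='60000'))  # per-message TTL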

> Is there a difference between per-message TTL and per-queue message TTL
> in this case?

No.

Matthias.