Exchange Feature request: Drop Duplicates


Exchange Feature request: Drop Duplicates

Laing, Michael P.
In our scenarios, messages are ultimately delivered to a 'retail' RabbitMQ instance for delivery to a client. The pipelines that process and deliver messages are purposefully redundant, hence there may be multiple replicas of each message 'racing' to the endpoint.

Usually, the replicas are resolved before getting to the retail rabbit. When components fail, however, duplicates can leak through during a small window of time. We eliminate those duplicates at the retail layer by looking at each message_id. Ultimately, our client contract allows duplicates as well in case one slips by.

It seems to me that this is a generic issue.

What would be useful in our case, and hopefully for many others, would be a 'Duplicate Message ID Window' in milliseconds, as an exchange attribute.

If non-zero, the exchange would drop any message with a duplicate message_id that appeared within the specified window of time, possibly routing it to the alternate exchange, if set.

In our case, a window of a few seconds, perhaps up to a minute would suffice.
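
To make that concrete, a declaration might look roughly like this (Python/pika shown purely as a sketch; 'x-dedup-window-ms' is an invented name for the proposed attribute, while 'alternate-exchange' is RabbitMQ's existing alternate exchange argument):

    # Sketch only: 'x-dedup-window-ms' does not exist today; it stands in for the
    # proposed 'Duplicate Message ID Window' exchange attribute.
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    ch = conn.channel()

    # Somewhere for dropped duplicates to go, if we want to keep them at all.
    ch.exchange_declare(exchange='retail.duplicates', exchange_type='fanout')

    ch.exchange_declare(
        exchange='retail',
        exchange_type='topic',
        arguments={
            'x-dedup-window-ms': 30000,                # drop repeats of a message_id seen within 30 s
            'alternate-exchange': 'retail.duplicates'  # optionally route the duplicates here instead
        })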

Thanks,

Michael


Re: Exchange Feature request: Drop Duplicates

Matthias Reik
Even though it sounds like a nice feature, it is probably difficult to implement properly anywhere other than on the client side. The duplicates might happen when delivering to the client, but on the client side it should be quite easy to do the filtering:
* get a message from the queue,
* check against memcached (Couchbase, or some other cache technology) whether the message ID already exists,
* add the new message ID to memcached (this can be done in the same operation as the check),
* set the timeout in memcached to your window size.

This should be straightforward, would scale up to quite a lot of messages, and should remove (depending on your window size) all duplicates.
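
In Python, for example, the whole filter is only a few lines; this is a rough sketch using pika and pymemcache, with the host, queue name and window size as placeholders:

    # Client-side duplicate filtering: pika consumer + memcached via pymemcache.
    import pika
    from pymemcache.client.base import Client

    WINDOW_SECONDS = 30  # your duplicate window

    cache = Client(('localhost', 11211), default_noreply=False)

    def handle(body):
        print(body)  # whatever "deliver to the client" means in your application

    def on_message(channel, method, properties, body):
        msg_id = properties.message_id
        # memcached's 'add' is atomic: it stores the key only if it is not already
        # present, so the existence check and the insert are a single step, and
        # 'expire' gives you the duplicate window for free.
        if msg_id is None or cache.add(msg_id, b'1', expire=WINDOW_SECONDS):
            handle(body)
        # else: a duplicate within the window, so drop it silently
        channel.basic_ack(delivery_tag=method.delivery_tag)

    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    ch = conn.channel()
    ch.basic_consume(queue='retail.client.queue', on_message_callback=on_message)
    ch.start_consuming()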

Is there a good reason why you wouldn't want to do this on the client side as described?

Cheers
Matthias

PS: as a caching technology you could of course roll your own in-memory solution, but that's probably more work than using an out-of-the-box solution.


Re: Exchange Feature request: Drop Duplicates

Laing, Michael P.
Yes - that's actually what we do currently, using Cassandra, and it scales well.

And we also do it in memory, at the retail level, and it is very fast as well.

I am just trying to shave a millisecond off at the retail level.
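
For what it's worth, the in-memory check is just a time-stamped set of message IDs, expired lazily; roughly this (a single-process Python sketch, names invented):

    # In-memory duplicate window as used at the retail level (sketch).
    import time
    from collections import OrderedDict

    class DedupWindow:
        def __init__(self, window_seconds):
            self.window = window_seconds
            self.seen = OrderedDict()  # message_id -> time first seen

        def is_duplicate(self, message_id):
            now = time.monotonic()
            # Drop entries that have fallen out of the window (oldest first).
            while self.seen:
                ts = next(iter(self.seen.values()))
                if now - ts <= self.window:
                    break
                self.seen.popitem(last=False)
            if message_id in self.seen:
                return True
            self.seen[message_id] = now
            return False

    # Usage inside the consumer callback:
    #   window = DedupWindow(30)
    #   if not window.is_duplicate(properties.message_id):
    #       deliver(body)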

Cheers,

Michael



Re: Exchange Feature request: Drop Duplicates

Simon MacMullen
The trouble is, exchanges are meant to be stateless. It's possible to
introduce some state into an exchange, but we have to choose between
having per-node state (in which case dedup only works per-node) and
having cluster-global state (where we either funnel all messages through
one node in the cluster before they get routed to queues, or distribute
the state around the cluster, turning every update into an expensive 2PC).

So this is doable but it's not obvious where compromises should be made.
And as Matthias sort of pointed out, duplication can still happen due to
redelivery, so this has to be an optimisation rather than something that
guarantees duplicates won't happen.

Having said all that, it wouldn't be hideously difficult to implement,
so I might give it a go. Depends on whether anybody else would find such
a feature useful...

Cheers, Simon


Re: Exchange Feature request: Drop Duplicates

Laing, Michael P.
Yes, I can see the point about statelessness.

It seems to me that in a messaging fabric, it is generally useful to have ways of dampening duplicates.

It occurred to me this morning that federation uses hop counts; in some topologies, especially ones with planned redundancy, that does not work so well, and perhaps a feature like this would help.

Michael





