Can't Bind After Upgrading from 3.1.1 to 3.1.5

Chris-3
Hello RabbitMQ gurus,

After upgrading a customer site from RabbitMQ 3.1.1 to 3.1.5 (on RHEL 6.2), we had a few durable queues that did not seem to be working correctly (they weren't receiving any messages).  It should be noted that this is a cluster of 5 servers with queue mirroring set to exactly 2 nodes.

During troubleshooting, we deleted and recreated the queues.  After creating the queues, we attempted to rebind them to the exchange (in the web management GUI), but this always failed. 

[In the following example, names have been changed to protect the innocent].  After attempting to bind durable mirrored 'my.queue' to durable direct exchange 'my.exchange' using routing key 'my.queue' (in vhost 'abc'), we get the following error:

NOT_FOUND - no binding my.queue between exchange 'my.exchange' in vhost 'abc' and queue 'my.queue' in vhost 'abc'

This behavior in the web console confirms the behavior we see when trying to bind programmatically in our app (we get an AMQP 404 error).

Here is the kicker-- if we change the routing key by just one character, it works flawlessly!  Or if we don't change anything, but add "foo=bar" to the binding arguments, it also works!  So it seems that because we did this binding in the past, and it was somehow corrupted, it won't allow us to re-bind with the same arguments now.  As noted above, even after deleting and re-creating the queue, it still won't let us do that one binding we need.
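
The failing bind and the "extra argument" workaround described above can be written as requests to the management plugin's HTTP API, which the web console calls when you click Bind. This is a minimal sketch, not a confirmed reproduction: the endpoint `http://localhost:15672` and the helper name `binding_request` are assumptions, and the vhost, exchange, queue and routing key are the placeholder names from this post. The helper only builds the request and sends nothing.

```python
import json
from urllib.parse import quote


def binding_request(base_url, vhost, exchange, queue, routing_key, arguments=None):
    """Return (url, body) for a POST that binds `queue` to `exchange`.

    Nothing is sent; the caller decides how to issue the request.
    """
    # Names are percent-encoded so that the default vhost "/" becomes "%2F".
    path = "/api/bindings/{}/e/{}/q/{}".format(
        quote(vhost, safe=""), quote(exchange, safe=""), quote(queue, safe="")
    )
    body = json.dumps(
        {"routing_key": routing_key, "arguments": arguments or {}},
        sort_keys=True,
    )
    return base_url.rstrip("/") + path, body


# Hypothetical management endpoint; adjust host, port and credentials.
BASE = "http://localhost:15672"

# The binding that failed with NOT_FOUND in this thread.
failing = binding_request(BASE, "abc", "my.exchange", "my.queue", "my.queue")

# Workaround: same routing key, but an extra (arbitrary) binding argument.
workaround = binding_request(
    BASE, "abc", "my.exchange", "my.queue", "my.queue", {"foo": "bar"}
)

print(failing[0])
print(workaround[1])
```

The two requests differ only in the `arguments` object, which is enough for the broker to treat them as distinct bindings.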

Is there any way we can fix this easily without interrupting the customer too much?  Or is it likely we will have to wipe mnesia on all the nodes and rebuild the cluster (my fear is that this may be the only way)?

Is there anything I can or should do on my end to further debug / investigate?

Thanks for your help!

-Chris

_______________________________________________
rabbitmq-discuss mailing list
[hidden email]
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

michaelklishin

2013/9/20 Chris <[hidden email]>:
> Is there anything I can or should do on my end to further debug / investigate?

Can you post actual routing keys you use (both that do not work and do work)
and create a script that can reproduce your issue?
--
MK

http://github.com/michaelklishin
http://twitter.com/michaelklishin


Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

Chris-3
Hi Michael,

I don't think it is easily reproducible.  We haven't run into this with any of our upgrade tests or at other customer sites.  So... I wouldn't know how to create a script that reproduces the problem.

If it is helpful to have the actual queue, routing key, and exchange names, then here you go (they actually aren't that special-- not sure why I censored them the first time):  
Queue: mail.global.v0.requests
RoutingKey: mail.global.v0.requests
Exchange: Local.Requests

This happened with two queues.  There are about 10 other durable queues with similar names/keys (and the same exchange) that do not have this problem.

-Chris


Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

Matthias Radestock-3
In reply to this post by Chris-3
Chris,

On 20/09/13 18:14, Chris wrote:

> After upgrading a customer site from RabbitMQ 3.1.1 to 3.1.5 (on RHEL
> 6.2), we had a few durable queues that did not seem to be working
> correctly (they weren't receiving any messages).  It should be noted
> that this is a cluster of 5 servers with queue mirroring set to exactly
> 2 nodes.
>
> During troubleshooting, we deleted and recreated the queues.  After
> creating the queues, we attempted to rebind them to the exchange (in the
> web management GUI), but this always failed.
>
> [In the following example, names have been changed to protect the
> innocent].  After attempting to bind durable mirrored 'my.queue' to
> durable direct exchange 'my.exchange' using routing key 'my.queue' (in
> vhost 'abc'), we get the following error:
>
>     NOT_FOUND - no binding my.queue between exchange 'my.exchange' in
>     vhost 'abc' and queue 'my.queue' in vhost 'abc'

That happens when the binding in question exists but the 'home' node of
the (durable) queue is not alive. In case of a mirrored queue that would
imply that all mirrors are down. Essentially both the queue and
associated bindings are in a limbo state at that point - they neither
exist nor do they not exist.

So when you see the above error, please check whether the queue you are
binding actually exists, i.e. shows up in 'rabbitmqctl list_queues'. If
it does not then you are seeing normal behaviour. Otherwise there's a bug.
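
A minimal sketch of that check, assuming `rabbitmqctl` is on the PATH of the node where it runs and that `name` is the only info item requested. The helper names `queue_is_listed` and `list_queues` are hypothetical, and the sample output is illustrative, not captured from the affected cluster.

```python
import subprocess


def queue_is_listed(list_queues_output, queue_name):
    """Return True if `queue_name` appears in the first column of the output."""
    for line in list_queues_output.splitlines():
        # rabbitmqctl separates columns with tabs; the first column is the
        # queue name when "name" is the first info item requested.
        if line.split("\t")[0].strip() == queue_name:
            return True
    return False


def list_queues(vhost):
    """Run `rabbitmqctl list_queues -p <vhost> name` and return its stdout."""
    result = subprocess.run(
        ["rabbitmqctl", "list_queues", "-p", vhost, "name"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Illustrative output only (not captured from the affected cluster).
sample = "Listing queues ...\nmy.queue\nother.queue\n...done.\n"
print(queue_is_listed(sample, "my.queue"))
```

On a live node the same check would be `queue_is_listed(list_queues("abc"), "my.queue")`.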

Regards,

Matthias.

Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

Chris-3
Hi Matthias,

According to "rabbitmqctl cluster_status" and the management web console, all 5 nodes were up in the cluster (with no partitions).  In addition, I could see the queue in the web management console-- and successfully deleted and re-created the queue too.  I also could successfully create other bindings against the same queue and exchange.

I did not try "rabbitmqctl list_queues" (and don't have access to the system right now to try it).  Still, I would think that successfully deleting and creating the queue in the management console would indicate the queue should be alive and on an online node.

If this is in fact a bug, is there anything you can recommend to get it into a better state?

-Chris





Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

Matthias Radestock-3
Chris,

On 20/09/13 21:01, Chris wrote:
> According to "rabbitmqctl cluster_status" and the management web
> console, all 5 nodes were up in the cluster (with no partitions).

But was that *always* the case? Would be good to see the logs all the
way from the time of the upgrade.

> Still, I would think that successfully deleting and creating the
> queue in the management console would indicate the queue should be
> alive and on an online node.

Yes, that should be the case.

> If this is in fact a bug, is there anything you can recommend to get
> it into a better state?

Hard to know without understanding what the problem is. Is this
reproducible at all? e.g. do you have a queue *now* that is suffering
from this behaviour?

Matthias.

Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

Chris-3
Hi Matthias,

I can try to get the logs.  Of course, during the upgrade nodes went up and down...  So there could have been a time when the nodes the queue was on were both down.  But this continued even after all nodes were back up and the queue was recreated.

It is very reproducible now on this specific customer system (we haven't yet resolved it there since I was hoping for an option that doesn't include clearing mnesia or resetting the nodes).  So, yes, I can make the binding fail all I want on that system.  I cannot, however, get the bug to reproduce on any other systems.

Thanks again,
Chris



Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

Matthias Radestock-3
Chris,

On 20/09/13 22:00, Chris wrote:
> I can try to get the logs.  Of course, during the upgrade nodes went up
> and down...  So there could have been a time when the nodes the queue
> was on were both down.  But this continued even after all nodes were
> back up and the queue was recreated.

Right. I want to check the logs for anything unusual.

> It is very reproducible now on this specific customer system (we haven't
> yet resolved it there since I was hoping for an option that doesn't
> include clearing mnesia or resetting the nodes).  So, yes, I can make
> the binding fail all I want on that system.

In which case please post the output of 'rabbitmqctl report' from one of
the nodes, and a screenshot of the management UI when you get the
NOT_FOUND error.

Regards,

Matthias.

Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

Chris-3
Hi Matthias,

Thank you for your help!  I will send you the logs, screenshot, and report privately in the next couple of hours.  If there are any other pivotal folks who want me to send it to them too, please let me know!

-Chris



Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

Matthias Radestock-3
Chris,

(putting list back on cc)

On 23/09/13 21:42, Chris wrote:
> Thank you very much for taking a look at these logs.  It is a strange
> bug for sure!  I guess I have two goals, really:
>
>   * To get the customer back up and running in the least disruptive way
>   * To help you guys understand what happened since I know it's no fun
>     to have mystery bugs in your product. ;-)
>
> Regarding #1, if there is not a minimally disruptive way, I am assuming
> I will need to reset all nodes and rebuild the cluster.

You should be able to recover the 'not found' bindings by a) recreating
the queues (as you already have) and then b) stopping the entire cluster
(i.e. no node must be left running) and restarting it.
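
The recovery procedure above (stop every node, then restart) can be sketched as an ordered command plan. This is an illustration under assumptions, not the procedure used on the affected cluster: the host names, SSH access, and the `service rabbitmq-server` init script are assumed, and starting nodes in the reverse of their stop order follows general RabbitMQ clustering guidance rather than anything stated in this thread.

```python
def full_restart_plan(hosts):
    """Return the ordered commands for a whole-cluster stop and restart.

    Every node is stopped before any node is started, and nodes are
    started in the reverse of the order in which they were stopped.
    """
    stop = [["ssh", h, "service", "rabbitmq-server", "stop"] for h in hosts]
    start = [
        ["ssh", h, "service", "rabbitmq-server", "start"]
        for h in reversed(hosts)
    ]
    return stop + start


# Hypothetical host names for the five-node cluster.
HOSTS = ["rabbit1", "rabbit2", "rabbit3", "rabbit4", "rabbit5"]

for command in full_restart_plan(HOSTS):
    print(" ".join(command))
```

The plan is only printed here; running it (for example with `subprocess.run`) should wait for each stop to finish so that no node is still running when the first start is issued.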

> Regarding #2, if you need anything else from me, please let me know!

There is no smoking gun in the logs, so the likely source of the problem
is some edge case error in the mirroring and/or recovery logic. That may
take us a while to track down. I don't think there's any other info we
need from your running cluster. Thanks for reporting this issue.


Regards,

Matthias.

Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

Chris-3
Matthias and Team,

Many, many thanks for your help in this matter.  We'll proceed with your recommended approach.  We will also probably stop doing upgrades on running clusters.  It seems it might be safer to do the full procedure (bringing down the whole cluster for the upgrade).

If you happen to find the cause of this bug at some point, of course I would be interested to know (especially if you fix it).

Thanks again,
Chris



Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

Simon MacMullen-2
On 24/09/13 13:54, Chris wrote:
> Many, many thanks for your help in this matter.  We'll proceed with your
> recommended approach.  We will also probably stop doing upgrades on
> running clusters.  It seems it might be safer to do the full procedure
> (bringing down the whole cluster for the upgrade).

At the moment that's not clear. It is very likely that nothing about this
bug is upgrade-specific: to the cluster, a minor version upgrade (i.e.
3.1.1 to 3.1.5) doesn't really look any different from stopping a node
and then starting it again.

> If you happen to find the cause of this bug at some point, of course I
> would be interested to know (especially if you fix it).

Sure, will do.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, Pivotal

Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

Chris-3
For those of you following along... We did as Matthias suggested: we stopped all the RabbitMQ nodes in the cluster, then restarted them.

After restarting, the bindings were there and working!  We did not need to rebind.  So it seems that the bindings were really there all along, but they were in some kind of weird latent state where they were not actually active.  Bringing down the whole cluster and restarting it seems to have resolved the issue.
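
One way to confirm after the restart that a binding is present is to look for it in the output of `rabbitmqctl list_bindings`. The sketch below assumes the three info items `source_name destination_name routing_key` are requested in that order; the helper name `binding_exists` is hypothetical and the sample output is illustrative, built from the names given earlier in the thread rather than captured from the cluster.

```python
def binding_exists(list_bindings_output, exchange, queue, routing_key):
    """Return True if the binding appears in tab-separated rabbitmqctl output.

    Expects the columns source_name, destination_name, routing_key, in that
    order, as produced by:
        rabbitmqctl list_bindings -p <vhost> source_name destination_name routing_key
    """
    wanted = [exchange, queue, routing_key]
    for line in list_bindings_output.splitlines():
        if line.split("\t") == wanted:
            return True
    return False


# Illustrative output only, using the names given earlier in the thread.
# The first data row is the default-exchange binding (empty source name).
sample = (
    "Listing bindings ...\n"
    "\tmail.global.v0.requests\tmail.global.v0.requests\n"
    "Local.Requests\tmail.global.v0.requests\tmail.global.v0.requests\n"
    "...done.\n"
)
print(binding_exists(sample, "Local.Requests",
                     "mail.global.v0.requests", "mail.global.v0.requests"))
```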

In case it is of any importance, I will also mention that when we stopped the cluster, nodes 2 and 5 did not stop gracefully-- they just hung.  So we had to kill beam on those nodes.

Thanks again for your help.

-Chris


Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

rofc
This post has NOT been accepted by the mailing list yet.
Hi Chris, I'm seeing the same behavior! It started after my clustered master node (non-HA) went down. It is now back and configured as an HA cluster, but one queue is showing the same error message:

AMQPQueueException Server channel error: 404, message: NOT_FOUND - no binding XXXXXX between exchange 'YYYYYY' in vhost 'ZZZZZZ' and queue 'FOOFOO' in vhost 'BARBAR'

Do you have the exact steps you took? I tried to follow your approach by stopping the nodes ("rabbitmqctl stop_app") and then restarting the service ("/etc/init.d/rabbitmq-server restart"), but it made no difference.

Please, let me know if you need any extra information.

Many thanks in advance.

Best,
@rofc

Re: Can't Bind After Upgrading from 3.1.1 to 3.1.5

skriza
This post has NOT been accepted by the mailing list yet.
Hi,

I had exactly the same issue, but it started when I ran out of disk space and RabbitMQ crashed. After clearing the disk space I tried to restart, but every queue had the same issue. I couldn't recreate a binding unless it was different in some way.

I restored recovery.dets from another server and it began to work, but then just one queue had no binding. This happens to be the dead letter queue.

When I checked the web tools, I found the following error when I clicked Bind:

main.js:1019 POST URL/api/bindings/%2F/e/amq.topic/q/ErroredMessage_Processing 404 (Object Not Found)