I'm running a series of performance tests against a clustered RabbitMQ 3.1.5 setup and would like to share and validate my results.
In the current test, single-threaded Java clients publish non-persistent messages of a fixed size (~200 bytes) as fast as they can, each over its own connection/channel, to a direct exchange on a target Rabbit node. The exchange has multiple queues bound to it, but exactly one queue matches the routing key used by each publisher. Queues are created on different nodes in the cluster, so I can choose to publish to the "master" or a "slave" node. Because I'm not interested in consumers (yet) and want to avoid memory-based flow control, all queues are size-bounded with "x-max-length: 100" (if this has any performance implications, please let me know!). All parameters not mentioned use default settings.
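For concreteness, here is a minimal sketch of the publisher loop (hostnames, exchange/queue names, and routing keys are made up; it uses the standard amqp-client Java library):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class PublisherSketch {

    // Queue arguments: bound the queue length so memory flow control never triggers.
    static Map<String, Object> queueArgs(int maxLength) {
        Map<String, Object> args = new HashMap<String, Object>();
        args.put("x-max-length", maxLength);
        return args;
    }

    // Fixed-size dummy payload (~200 bytes).
    static byte[] payload(int size) {
        byte[] body = new byte[size];
        Arrays.fill(body, (byte) 'x');
        return body;
    }

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbit-master"); // hypothetical master node hostname

        Connection conn = factory.newConnection();
        Channel ch = conn.createChannel();

        ch.exchangeDeclare("perf.direct", "direct");
        ch.queueDeclare("perf.q.1", false, false, false, queueArgs(100));
        ch.queueBind("perf.q.1", "perf.direct", "rk-1");

        byte[] body = payload(200);
        while (true) { // publish as fast as possible; null props = non-persistent
            ch.basicPublish("perf.direct", "rk-1", null, body);
        }
    }
}
```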
I have two Rabbit nodes in the current cluster; each box has 16 CPU cores. I can provide spec details if needed.
First I wanted to find the best publish TPS I could achieve while maxing out Rabbit's CPU and keeping the cluster stable. For that I set up non-mirrored queues and configured publishers to always use the master node (where the destination queue was created). Since each thread publishes with its own unique routing key (i.e. to its own queue), I expected flow control to throttle each thread at roughly the same rate, so that total throughput would scale linearly with thread count until Rabbit runs out of CPU. Tests confirmed this:
1 thread - ~18K/s
2 threads - ~40K/s
4 threads - ~80K/s
8 threads - ~130K/s
16 threads - ~160K/s
Note that throughput reached its ceiling around 8-9 threads, while the total number of CPU cores is 16. Rabbit's CPU (total across all cores, constantly fluctuating) was 80-95% busy at 8 threads and 97-99% busy at 16 publishing threads. In all tests the management UI showed "flow" for all publishing connections.
Running the same test in parallel against both nodes shows that (so far) cluster throughput scales almost linearly with the number of nodes:
4 + 4 threads - ~153K/s
8 + 8 threads - ~260K/s
16 + 16 threads - ~310K/s
To find the point of complete CPU saturation, I added publish rate throttling and tuned it so that flow control almost never kicks in (in the 1-thread test). With that, Rabbit's CPU was constantly 99% busy in the 9-thread test, at the same ~130K/s total publish rate as in the unthrottled test. Limiting the publish rate per thread let me load Rabbit's cores evenly, but didn't improve overall throughput, so I'm wondering: are there any other benefits to explicitly limiting the publish rate instead of letting per-connection flow control do it?
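To make the throttling concrete, here is a minimal per-thread pacer sketch (the target rate is an assumption; in my tests it was whatever rate kept flow control from engaging). Each publisher calls pace() before basicPublish():

```java
// Simple fixed-rate pacer: each publisher thread sleeps just enough between
// publishes to hold a target rate, instead of relying on credit-based flow control.
public class RatePacer {
    private final long intervalNanos;
    private long nextDeadline;

    public RatePacer(int msgsPerSecond) {
        this.intervalNanos = 1_000_000_000L / msgsPerSecond;
        this.nextDeadline = System.nanoTime();
    }

    public long intervalNanos() {
        return intervalNanos;
    }

    // Call before every publish; blocks until the next time slot opens.
    public void pace() throws InterruptedException {
        nextDeadline += intervalNanos;
        long sleep = nextDeadline - System.nanoTime();
        if (sleep > 0) {
            Thread.sleep(sleep / 1_000_000L, (int) (sleep % 1_000_000L));
        }
    }
}
```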
Next I tried publishing to a slave node (a member of the cluster that doesn't host the non-mirrored queue). In these tests throughput only scales up to ~20K/s and remains roughly flat from 2 threads and above (with the slave node at ~13-20% CPU and the master at ~4-7%). It looks like flow control in this case is shared between all threads publishing to the slave node (which share the slave-to-master connection used to deliver messages). Running parallel threads publishing to the master confirms they are throttled independently, at the same rate as in the baseline test. This suggests that for best performance, publishers must be aware of the queue's master node and use it at all times. That seems non-trivial, given that publishers are usually only aware of the exchange and routing key, while queues could be redeclared by clients at runtime on any node in the cluster (e.g. after a node outage). Is there any good reading on how to address this problem?
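One way I can think of to discover a queue's home node is the management HTTP API: the queue object it returns includes a "node" field. A rough sketch (hostname, port, and guest credentials are assumptions; the default "/" vhost must be URL-encoded as %2F):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Base64;

public class QueueMasterLookup {

    // Build the management API path for a queue; the "/" vhost encodes to %2F.
    static String queuePath(String vhost, String queue) throws Exception {
        return "/api/queues/" + URLEncoder.encode(vhost, "UTF-8")
                + "/" + URLEncoder.encode(queue, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical management endpoint and credentials.
        URL url = new URL("http://rabbit-node-1:15672" + queuePath("/", "perf.q.1"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
                .encodeToString("guest:guest".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        StringBuilder body = new StringBuilder();
        for (String line; (line = in.readLine()) != null; ) {
            body.append(line);
        }
        in.close();
        // The JSON response contains a "node" field naming the queue's master node;
        // a publisher could reconnect there before starting to publish.
        System.out.println(body);
    }
}
```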
Next I tried publishing to HA queues (set_policy ha-all). This limited throughput and horizontal scaling even further: max throughput dropped to ~8-9K/s, reached with 1 thread, and stayed there as more threads were added. Removing the HA policy from selected queues mid-test unblocks the affected publishers back to their baseline rate within ~3 seconds, while the others remain heavily throttled. This suggests that all HA queues share a flow control threshold between all their publishers; is this correct?
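For reference, the HA policy was applied and removed with commands along these lines (the policy name and ".*" pattern are just what I used; adjust vhost/pattern to your setup):

```shell
# Mirror all queues across every node in the cluster
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'

# Remove the policy mid-test to watch affected publishers recover
rabbitmqctl clear_policy ha-all
```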
Furthermore, I noticed that the degree to which HA publishers are impacted by non-HA publishers depends on which node they point to. This was consistent across multiple retries, but I'm not sure if it's intended behaviour or a bug. The best description of the process I have so far for this test case:
1) Running 2 publishers to HA queues and 2 publishers to non-HA queues (all publishing to the same node; all queues are distinct and 'owned' by this node):
- non-HA queue throughput matches the baseline (~16K/s per queue)
- HA queue throughput is throttled by what appears to be an HA-shared threshold (~4.5K/s per thread)
2) Running 2 publishers to HA queues on node 1 and 2 publishers to non-HA queues on node 2 (all queues are distinct and 'owned' by the node being published to):
- non-HA queue throughput matches the baseline (~16K/s per queue)
- HA queue throughput is throttled as if there were 4 threads publishing to HA queues (~2K/s)! A flow control bug?
It's obvious that HA queues are not good for high performance/scalability (currently we run all production queues under an HA policy). I'm going to add consumers next.
Today I tested the overhead that the "x-consistent-hash" exchange adds. In this setup I had 16 queues with equal weights behind a hash exchange (hashing on a custom header). All queues were declared on the same node, non-HA, with publisher threads connecting to that node as well. Throughput numbers:
1 thread - 9K/s
2 threads - 18K/s
4 threads - 40K/s
8 threads - 82K/s
16 threads - 95K/s
The publish rate is roughly halved; Rabbit's CPU at 16 threads was constantly 99%+ busy.
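The hash-exchange setup looked roughly like this (names and the "hash-on" header are made up; requires the rabbitmq_consistent_hash_exchange plugin, where the binding key is the queue's weight):

```java
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class HashExchangeSetup {

    static String queueName(int i) {
        return "perf.hash.q." + i;
    }

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbit-node-1"); // hypothetical queue-owner node

        Connection conn = factory.newConnection();
        Channel ch = conn.createChannel();

        // Hash on a custom header instead of the routing key.
        Map<String, Object> exArgs = new HashMap<String, Object>();
        exArgs.put("hash-header", "hash-on");
        ch.exchangeDeclare("perf.hash", "x-consistent-hash", false, false, exArgs);

        // 16 queues with equal weights: the binding key "1" is the weight.
        for (int i = 0; i < 16; i++) {
            ch.queueDeclare(queueName(i), false, false, false,
                    Collections.singletonMap("x-max-length", (Object) 100));
            ch.queueBind(queueName(i), "perf.hash", "1");
        }

        // Publish with the header set; the exchange picks the target queue.
        AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                .headers(Collections.singletonMap("hash-on", (Object) "key-42"))
                .build();
        ch.basicPublish("perf.hash", "", props, new byte[200]);

        ch.close();
        conn.close();
    }
}
```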
The more I think about scaling RabbitMQ, the more I conclude that clustering doesn't help. In the test case above, for example, a cluster would not help at all: publishers have no way to know which node a message will end up on, so they can't pick the most efficient node to connect to (to avoid cross-node synchronization overhead). It would be better to have multiple independent RabbitMQ brokers, each with multiple consumers and even load distribution (via a hash exchange), with publishers randomly choosing which broker to connect to.
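The publisher side of that idea is trivial; a sketch (broker URIs are placeholders):

```java
import java.util.List;
import java.util.Random;

// With independent (non-clustered) brokers, a publisher simply picks one at
// random at connect time; a hash exchange on each broker then spreads the
// load evenly across that broker's consumers.
public class BrokerPicker {
    private final List<String> brokers;
    private final Random rnd;

    public BrokerPicker(List<String> brokers, long seed) {
        this.brokers = brokers;
        this.rnd = new Random(seed);
    }

    // Returns the URI of the broker this publisher should connect to.
    public String pick() {
        return brokers.get(rnd.nextInt(brokers.size()));
    }
}
```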
Am I going crazy here, or is RabbitMQ's clustering feature best suited for reliability rather than scalability? I just don't see how it helps so far.
In reply to this post by Pavel
Correction: my test machines have 2 CPUs with 4 cores each; the additional 8 cores appeared as a result of hyperthreading. This explains why scaling pretty much stops at 8 concurrent publishers.
I should say that ~160K/s publishes per box is quite impressive, and I'd be happy to get 50-70K/s throughput for a simple publish/consume scenario. With that in mind, I wonder whether the performance improvements in the 3.3 release will improve cross-node synchronization and throughput for HA queues.
On 06/05/14 23:47, Pavel wrote:
> I wonder if performance improvements in 3.3 release
> will improve cross-node synchronization and throughput for HA queues.
The specific improvements listed there won't improve throughput for HA
queues, unless you are publishing with 'mandatory' or are consuming with
a prefetch limit. However, there are other performance improvements in
3.3.x. Generally, I recommend performing tests with the latest RabbitMQ
and latest Erlang.
I was benchmarking a RabbitMQ 3.5.3 mirrored cluster of 2 AWS c3.2xlarge nodes (15 GB RAM, 8 cores, magnetic disks, Ubuntu 14.04), and the figures were very disappointing. HiPE compilation was disabled. With 1 queue and 1 producer I got ~14K msg/s, testing with the Java PerfTest tool on a separate c3.2xlarge instance, and adding producer threads did not increase throughput. Please suggest some tuning options or guidelines; let me know if you need more information.