3.13.0: some HTTP API requests fail with 500 errors after a complete cluster restart #11303

daveofthedogs · 2024-05-22T20:25:47Z

Describe the bug

After all servers in a three-node cluster were rebooted, we were able to log into the console but received 500 errors from each host. The engineer who performed the reboot said he rebooted all three nodes at the same time. The errors logged on all three nodes were similar:

2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0> ** Generic server <0.60719.0> terminating
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0> ** Last message in was {submit,#Fun<rabbit_mgmt_db.23.2453518>,<0.60894.0>,
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0>                                reuse}
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0> ** When Server state == {from,<0.60894.0>,#Ref<0.4270829454.224133121.172576>}
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0> ** Reason for termination ==
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0> ** {{badkey,'rabbit@ip-10-157-224-126'},
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0>     [{erlang,map_get,
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0>              ['rabbit@ip-10-157-224-126',#{}],
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0>              [{error_info,#{module => erl_erts_errors}}]},
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0>      {rabbit_mgmt_db,'-node_stats/3-lc$^1/1-1-',4,
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0>                      [{file,"rabbit_mgmt_db.erl"},{line,652}]},
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0>      {worker_pool_worker,handle_call,3,
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0>                          [{file,"worker_pool_worker.erl"},{line,96}]},
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0>      {gen_server2,handle_msg,2,[{file,"gen_server2.erl"},{line,1035}]},
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0>      {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,251}]}]}
2024-05-22 19:48:43.940704+00:00 [error] <0.60719.0>
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>   crasher:
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     initial call: worker_pool_worker:init/1
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     pid: <0.60719.0>
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     registered_name: []
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     exception exit: {{badkey,'rabbit@ip-10-157-224-126'},
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                      [{erlang,map_get,
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                           ['rabbit@ip-10-157-224-126',#{}],
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                           [{error_info,#{module => erl_erts_errors}}]},
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                       {rabbit_mgmt_db,'-node_stats/3-lc$^1/1-1-',4,
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                           [{file,"rabbit_mgmt_db.erl"},{line,652}]},
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                       {worker_pool_worker,handle_call,3,
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                           [{file,"worker_pool_worker.erl"},{line,96}]},
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                       {gen_server2,handle_msg,2,
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                           [{file,"gen_server2.erl"},{line,1035}]},
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                       {proc_lib,wake_up,3,
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                           [{file,"proc_lib.erl"},{line,251}]}]}
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>       in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     ancestors: [management_worker_pool_sup,rabbit_mgmt_sup,
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                   rabbit_mgmt_sup_sup,<0.991.0>]
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     message_queue_len: 0
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     messages: []
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     links: [<0.1054.0>]
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     dictionary: [{worker_pool_worker,true},
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                   {worker_pool_name,management_worker_pool},
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                   {rand_seed,{#{max => 288230376151711743,type => exsplus,
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                                 next => #Fun<rand.5.65977474>,
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                                 jump => #Fun<rand.3.65977474>},
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>                               [37182325038934756|34872676376496283]}}]
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     trap_exit: false
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     status: running
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     heap_size: 4185
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     stack_size: 28
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>     reductions: 23644
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>   neighbours:
2024-05-22 19:48:43.941318+00:00 [error] <0.60719.0>
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>     supervisor: {local,management_worker_pool_sup}
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>     errorContext: child_terminated
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>     reason: {{badkey,'rabbit@ip-10-157-224-126'},
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>              [{erlang,map_get,
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>                       ['rabbit@ip-10-157-224-126',#{}],
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>                       [{error_info,#{module => erl_erts_errors}}]},
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>               {rabbit_mgmt_db,'-node_stats/3-lc$^1/1-1-',4,
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>                               [{file,"rabbit_mgmt_db.erl"},{line,652}]},
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>               {worker_pool_worker,handle_call,3,
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>                                   [{file,"worker_pool_worker.erl"},{line,96}]},
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>               {gen_server2,handle_msg,2,
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>                            [{file,"gen_server2.erl"},{line,1035}]},
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>               {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,251}]}]}
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>     offender: [{pid,<0.60719.0>},
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>                {id,3},
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>                {mfargs,
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>                    {worker_pool_worker,start_link,[management_worker_pool]}},
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>                {restart_type,transient},
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>                {significant,false},
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>                {shutdown,4294967295},
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>                {child_type,worker}]
2024-05-22 19:48:43.943204+00:00 [error] <0.1054.0>
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>   crasher:
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     initial call: cowboy_stream_h:request_process/3
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     pid: <0.60894.0>
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     registered_name: []
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     exception exit: {{{{badkey,'rabbit@ip-10-157-224-126'},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                        [{erlang,map_get,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                             ['rabbit@ip-10-157-224-126',#{}],
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                             [{error_info,#{module => erl_erts_errors}}]},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                         {rabbit_mgmt_db,'-node_stats/3-lc$^1/1-1-',4,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                             [{file,"rabbit_mgmt_db.erl"},{line,652}]},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                         {worker_pool_worker,handle_call,3,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                             [{file,"worker_pool_worker.erl"},{line,96}]},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                         {gen_server2,handle_msg,2,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                             [{file,"gen_server2.erl"},{line,1035}]},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                         {proc_lib,wake_up,3,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                             [{file,"proc_lib.erl"},{line,251}]}]},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                       {gen_server2,call,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                           [<0.60719.0>,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                            {submit,#Fun<rabbit_mgmt_db.23.2453518>,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                                <0.60894.0>,reuse},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                            infinity]}},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                      [{gen_server2,call,3,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                           [{file,"gen_server2.erl"},{line,346}]},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                       {rabbit_mgmt_wm_overview,web_contexts,1,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                           [{file,"rabbit_mgmt_wm_overview.erl"},{line,123}]},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                       {rabbit_mgmt_wm_overview,to_json,2,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                           [{file,"rabbit_mgmt_wm_overview.erl"},{line,68}]},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                       {cowboy_rest,call,3,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                           [{file,"src/cowboy_rest.erl"},{line,1590}]},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                       {cowboy_rest,set_resp_body,2,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                           [{file,"src/cowboy_rest.erl"},{line,1473}]},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                       {cowboy_rest,upgrade,4,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                           [{file,"src/cowboy_rest.erl"},{line,284}]},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                       {cowboy_stream_h,execute,3,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                           [{file,"src/cowboy_stream_h.erl"},{line,306}]},
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                       {cowboy_stream_h,request_process,3,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                           [{file,"src/cowboy_stream_h.erl"},{line,295}]}]}
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>       in function  gen_server2:call/3 (gen_server2.erl, line 346)
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>       in call from rabbit_mgmt_wm_overview:web_contexts/1 (rabbit_mgmt_wm_overview.erl, line 123)
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>       in call from rabbit_mgmt_wm_overview:to_json/2 (rabbit_mgmt_wm_overview.erl, line 68)
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>       in call from cowboy_rest:call/3 (src/cowboy_rest.erl, line 1590)
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>       in call from cowboy_rest:set_resp_body/2 (src/cowboy_rest.erl, line 1473)
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>       in call from cowboy_rest:upgrade/4 (src/cowboy_rest.erl, line 284)
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>       in call from cowboy_stream_h:execute/3 (src/cowboy_stream_h.erl, line 306)
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>       in call from cowboy_stream_h:request_process/3 (src/cowboy_stream_h.erl, line 295)
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     ancestors: [<0.49741.0>,<0.1030.0>,<0.1027.0>,<0.1026.0>,<0.1024.0>,
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>                   rabbit_web_dispatch_sup,<0.945.0>]
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     message_queue_len: 0
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     messages: []
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     links: [<0.49741.0>]
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     dictionary: []
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     trap_exit: false
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     status: running
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     heap_size: 2586
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     stack_size: 28
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>     reductions: 8642
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>   neighbours:
2024-05-22 19:48:43.943345+00:00 [error] <0.60894.0>
2024-05-22 19:48:43.945862+00:00 [error] <0.49741.0> Ranch listener {acceptor,{0,0,0,0,0,0,0,0},15671}, connection process <0.49741.0>, stream 142 had its request process <0.60894.0> exit with reason {{{badkey,'rabbit@ip-10-157-224-126'},[{erlang,map_get,['rabbit@ip-10-157-224-126',#{}],[{error_info,#{module => erl_erts_errors}}]},{rabbit_mgmt_db,'-node_stats/3-lc$^1/1-1-',4,[{file,"rabbit_mgmt_db.erl"},{line,652}]},{worker_pool_worker,handle_call,3,[{file,"worker_pool_worker.erl"},{line,96}]},{gen_server2,handle_msg,2,[{file,"gen_server2.erl"},{line,1035}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,251}]}]},{gen_server2,call,[<0.60719.0>,{submit,#Fun<rabbit_mgmt_db.23.2453518>,<0.60894.0>,reuse},infinity]}} and stacktrace [{gen_server2,call,3,[{file,"gen_server2.erl"},{line,346}]},{rabbit_mgmt_wm_overview,web_contexts,1,[{file,"rabbit_mgmt_wm_overview.erl"},{line,123}]},{rabbit_mgmt_wm_overview,to_json,2,[{file,"rabbit_mgmt_wm_overview.erl"},{line,68}]},{cowboy_rest,call,3,[{file,"src/cowboy_rest.erl"},{line,1590}]},{cowboy_rest,set_resp_body,2,[{file,"src/cowboy_rest.erl"},{line,1473}]},{cowboy_rest,upgrade,4,[{file,"src/cowboy_rest.erl"},{line,284}]},{cowboy_stream_h,execute,3,[{file,"src/cowboy_stream_h.erl"},{line,306}]},{cowboy_stream_h,request_process,3,[{file,"src/cowboy_stream_h.erl"},{line,295}]}]
2024-05-22 19:48:43.945862+00:00 [error] <0.49741.0>

Reproduction steps

Reboot all three nodes in the cluster at the same time
Log into the console and check for errors at the bottom of the page

...

Expected behavior

I would not reboot all nodes at once, but if there were some kind of outage, that _could_ happen. I would expect RMQ to recover without errors.

Additional context

No response

michaelklishin · 2024-05-22T21:22:09Z

michaelklishin
May 22, 2024
Maintainer

@daveofthedogs there are certain (completely unrelated to the management plugin) tests that form clusters in parallel from scratch and this does not happen. Our team has been running some of them many times a day against 3.13.2 and future 3.13.3 (the tip of v3.13.x), and then using the management UI on some of the nodes.

My best guess is that this is #10901 and that as soon as the underlying condition clears (for example, the virtual host is seeded), the HTTP API will work just as it always does.

According to the stack trace, a single HTTP request should have failed and that's it.

Anyhow, we need an executable way to reproduce with 3.13.2 and evidence that after such a parallel restart the HTTP API does not eventually "recover" (respond as expected). Some exceptions cannot be avoided when the entire cluster is being formed in parallel or is restarted.
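One way to gather the kind of evidence asked for above is a small probe that polls an HTTP API endpoint after the restart and records every status code until a 200 arrives. The following is a minimal sketch, not an official reproduction script: the function names, the attempt count, the delay, and the example URL and credentials are all assumptions made for illustration. Only the `GET /api/overview` endpoint comes from this thread.

```python
import base64
import time
import urllib.error
import urllib.request
from typing import Callable, List, Tuple


def http_status(url: str, user: str, password: str, timeout: float = 5.0) -> int:
    """GET `url` with HTTP basic auth and return the status code.

    Returns 0 when no HTTP response was received at all (for example,
    the node is still booting and refuses connections).
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is what we want.
        return err.code
    except OSError:
        # Covers URLError (connection refused, DNS failure) and timeouts.
        return 0


def probe_until_recovered(
    fetch_status: Callable[[], int],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, List[int]]:
    """Call `fetch_status` until it returns 200 or `attempts` is exhausted.

    Returns (recovered, history), where history holds every observed status.
    """
    history: List[int] = []
    for attempt in range(attempts):
        status = fetch_status()
        history.append(status)
        if status == 200:
            return True, history
        if attempt < attempts - 1:
            sleep(delay)
    return False, history


# Hypothetical usage (host, port and credentials are placeholders):
# recovered, history = probe_until_recovered(
#     lambda: http_status("http://rabbit-1.example:15672/api/overview", "user", "pass")
# )
# print(recovered, history)
```

A history that ends in 200 after a handful of 500s would suggest the transient condition described in this reply. A history made up only of 500s over a long window would be the evidence of non-recovery that is being requested.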

1 reply

michaelklishin May 22, 2024
Maintainer

Specifically, the exception means that stats for a certain node (rabbit@ip-10-157-224-126) were not available in a map of node stats. That's not at all surprising after a parallel restart of all nodes: after booting, those stats are emitted eventually, so they will be missing at first, and endpoints like GET /api/overview will run into exceptions because of that or at best could render an empty list of node stats.

After some 10-15 seconds the stats would be in place for subsequent requests to use them.
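The failure mode described here can be illustrated with a Python analogy. This is not the code in `rabbit_mgmt_db.erl`, and the function names below are hypothetical. The stack trace shows `erlang:map_get/2` being called on an empty map (`#{}`) inside a list comprehension, which behaves much like indexing an empty Python dict. A strict lookup raises, while a tolerant one yields the "empty list of node stats" outcome mentioned above.

```python
from typing import Dict, List

NodeStats = Dict[str, dict]


def node_stats_strict(nodes: List[str], stats: NodeStats) -> List[dict]:
    """Analogous to the crashing path: raises KeyError if a node has no stats yet."""
    return [stats[node] for node in nodes]


def node_stats_tolerant(nodes: List[str], stats: NodeStats) -> List[dict]:
    """Skips nodes whose stats have not been emitted yet instead of raising."""
    return [stats[node] for node in nodes if node in stats]


nodes = ["rabbit@ip-10-157-224-126"]
empty_stats: NodeStats = {}  # right after a full restart, nothing has been emitted

try:
    node_stats_strict(nodes, empty_stats)
except KeyError as missing:
    print("strict lookup failed for", missing)

print("tolerant lookup returned", node_stats_tolerant(nodes, empty_stats))
```

Once the stats map is populated, both variants return the same list, which matches the observation that later requests succeed.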

daveofthedogs · 2024-05-25T15:21:51Z

daveofthedogs
May 25, 2024
Author

Thanks MK


0 replies

daveofthedogs · 2025-07-15T15:20:01Z

daveofthedogs
Jul 15, 2025
Author

@michaelklishin resurrecting this thread. Karl or one of the other devs told me that the HTTP API was being deprecated. In the past, I had seen cowboy errors and some other errors. I saw that in the last version of 3.13, 3.13.7, you were still making updates to the HTTP API. Has the API become more stable (an in 14)? Right now, I have the console stats turned off in all my environments (200+ clusters) in favor of Prometheus. All of the support teams hate this, and I would love to turn console stats back on.

thanks,

Dave

1 reply

michaelklishin Jul 15, 2025
Maintainer

@daveofthedogs no, the HTTP API is not being deprecated, you have misunderstood something.

The HTTP API has not been "unstable". It has close to 100 endpoints, hitting a 500 in certain cases is not evidence of the entire API being "unstable".

Prometheus is the recommended option for monitoring. RabbitMQ 3.x has long been out of community support, so don't expect our team to respond to anything other than upgrading-related questions around 3.13.x.
