
Erlang: RabbitMQ (beam.smp) and high CPU/memory load


I have a Debian box that has been running tasks with Celery and RabbitMQ for about a year. Recently I noticed tasks were not being processed, so I logged into the system and saw that Celery could not connect to RabbitMQ. I restarted rabbitmq-server, and although Celery stopped complaining, it was no longer executing new tasks. The odd thing was that RabbitMQ was devouring CPU and memory like crazy, and rebooting the server did not fix it. After spending a couple of hours looking for a solution online to no avail, I decided to rebuild the server.

I rebuilt the server with Debian 7.5, RabbitMQ 2.8.4 and Celery 3.1.13 (Cipater). For about an hour everything worked beautifully again, until Celery once more started complaining that it could not connect to RabbitMQ:

[2014-08-06 05:17:21,036: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused. 
Trying again in 6.00 seconds... 
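A quick sanity check at this point (standard tools, not something from the original post) is to confirm whether the broker is actually up and listening on port 5672:

sudo rabbitmqctl status          # fails fast if the node is down
sudo netstat -plnt | grep 5672   # is anything listening on the AMQP port?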

I restarted the RabbitMQ service with rabbitmq-server start and got the same issue.

RabbitMQ started swelling up again, hammering the CPU and slowly taking over all of the RAM and swap:

PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND 
21823 rabbitmq  20   0  908m 488m 3900 S 731.2 49.4   9:44.74 beam.smp 

Here is the output of rabbitmqctl status:

Status of node 'rabbit@li370-61' ... 
[{pid,21823}, 
 {running_applications,[{rabbit,"RabbitMQ","2.8.4"}, 
                        {os_mon,"CPO  CXC 138 46","2.2.9"}, 
                        {sasl,"SASL  CXC 138 11","2.2.1"}, 
                        {mnesia,"MNESIA  CXC 138 12","4.7"}, 
                        {stdlib,"ERTS  CXC 138 10","1.18.1"}, 
                        {kernel,"ERTS  CXC 138 10","2.15.1"}]}, 
 {os,{unix,linux}}, 
 {erlang_version,"Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:8:8] [async-threads:30] [kernel-poll:true]\n"}, 
 {memory,[{total,489341272}, 
          {processes,462841967}, 
          {processes_used,462685207}, 
          {system,26499305}, 
          {atom,504409}, 
          {atom_used,473810}, 
          {binary,98752}, 
          {code,11874771}, 
          {ets,6695040}]}, 
 {vm_memory_high_watermark,0.3999999992280962}, 
 {vm_memory_limit,414559436}, 
 {disk_free_limit,1000000000}, 
 {disk_free,48346546176}, 
 {file_descriptors,[{total_limit,924}, 
                    {total_used,924}, 
                    {sockets_limit,829}, 
                    {sockets_used,3}]}, 
 {processes,[{limit,1048576},{used,1354}]}, 
 {run_queue,0}, 

Some entries from /var/log/rabbitmq:

=WARNING REPORT==== 8-Aug-2014::00:11:35 === 
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log, 
                                                                write_threshold} 
 
=WARNING REPORT==== 8-Aug-2014::00:11:35 === 
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log, 
                                                                write_threshold} 
 
=WARNING REPORT==== 8-Aug-2014::00:11:35 === 
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log, 
                                                                write_threshold} 
 
=WARNING REPORT==== 8-Aug-2014::00:11:35 === 
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log, 
                                                                write_threshold} 
 
=WARNING REPORT==== 8-Aug-2014::00:11:36 === 
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log, 
                                                                write_threshold} 
 
=INFO REPORT==== 8-Aug-2014::00:11:36 === 
vm_memory_high_watermark set. Memory used:422283840 allowed:414559436 
 
=WARNING REPORT==== 8-Aug-2014::00:11:36 === 
memory resource limit alarm set on node 'rabbit@li370-61'. 
 
********************************************************** 
*** Publishers will be blocked until this alarm clears *** 
********************************************************** 
 
=INFO REPORT==== 8-Aug-2014::00:11:43 === 
started TCP Listener on [::]:5672 
 
=INFO REPORT==== 8-Aug-2014::00:11:44 === 
vm_memory_high_watermark clear. Memory used:290424384 allowed:414559436 
 
=WARNING REPORT==== 8-Aug-2014::00:11:44 === 
memory resource limit alarm cleared on node 'rabbit@li370-61' 
 
=INFO REPORT==== 8-Aug-2014::00:11:59 === 
vm_memory_high_watermark set. Memory used:414584504 allowed:414559436 
 
=WARNING REPORT==== 8-Aug-2014::00:11:59 === 
memory resource limit alarm set on node 'rabbit@li370-61'. 
 
********************************************************** 
*** Publishers will be blocked until this alarm clears *** 
********************************************************** 
 
=INFO REPORT==== 8-Aug-2014::00:12:00 === 
vm_memory_high_watermark clear. Memory used:411143496 allowed:414559436 
 
=WARNING REPORT==== 8-Aug-2014::00:12:00 === 
memory resource limit alarm cleared on node 'rabbit@li370-61' 
 
=INFO REPORT==== 8-Aug-2014::00:12:01 === 
vm_memory_high_watermark set. Memory used:415563120 allowed:414559436 
 
=WARNING REPORT==== 8-Aug-2014::00:12:01 === 
memory resource limit alarm set on node 'rabbit@li370-61'. 
 
********************************************************** 
*** Publishers will be blocked until this alarm clears *** 
********************************************************** 
 
=INFO REPORT==== 8-Aug-2014::00:12:07 === 
Server startup complete; 0 plugins started. 
 
=ERROR REPORT==== 8-Aug-2014::00:15:32 === 
** Generic server rabbit_disk_monitor terminating  
** Last message in was update 
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@li370-61", 
                               50000000,46946492416,100,10000, 
                               #Ref<0.0.1.79456>,false} 
** Reason for termination ==  
** {unparseable,[]} 
 
=INFO REPORT==== 8-Aug-2014::00:15:37 === 
Disk free limit set to 50MB 
 
=ERROR REPORT==== 8-Aug-2014::00:16:03 === 
** Generic server rabbit_disk_monitor terminating  
** Last message in was update 
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@li370-61", 
                               50000000,46946426880,100,10000, 
                               #Ref<0.0.1.80930>,false} 
** Reason for termination ==  
** {unparseable,[]} 
 
=INFO REPORT==== 8-Aug-2014::00:16:05 === 
Disk free limit set to 50MB 

UPDATE: The problem appeared to be resolved after installing the latest version of RabbitMQ (3.3.4-1) from the rabbitmq.com repository; I had originally installed the 2.8.4 package from the Debian repositories. So far rabbitmq-server has been running smoothly. I will update this post if the problem comes back.
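For reference, installing from the rabbitmq.com repository on Debian looked roughly like this at the time (repository line and key URL as documented back then; verify against the current docs before using):

sudo sh -c "echo 'deb http://www.rabbitmq.com/debian/ testing main' > /etc/apt/sources.list.d/rabbitmq.list"
wget -O- https://www.rabbitmq.com/rabbitmq-signing-key-public.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get install rabbitmq-server   # pulled in 3.3.4-1 at the time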

UPDATE: Unfortunately, after about 24 hours the problem came back: rabbitmq shut down, and restarting the process just made it chew through resources until it shut down again within minutes.

Please refer to the following solution:

Finally found the fix. These posts helped me figure it out: "RabbitMQ on EC2 Consuming Tons of CPU" and https://serverfault.com/questions/337982/how-do-i-restart-rabbitmq-after-switching-machines

What was happening was that RabbitMQ was holding on to all the task results, which were never released, to the point that it became overloaded. I cleared all the stale data in /var/lib/rabbitmq/mnesia/rabbit/, restarted rabbit, and now it works fine.
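A minimal sketch of that cleanup, assuming the broker state is disposable (wiping the mnesia directory deletes all queues, exchanges, users and undelivered messages; the node subdirectory is named after the node, rabbit@li370-61 in the status output above):

sudo service rabbitmq-server stop                       # never touch mnesia while the broker runs
sudo rm -rf /var/lib/rabbitmq/mnesia/rabbit@li370-61/   # node directory name may differ
sudo service rabbitmq-server start                      # the broker recreates a clean database on boot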

My fix was to disable storing results altogether with CELERY_IGNORE_RESULT = True in the Celery configuration file, to make sure this does not happen again.
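In Celery 3.x that is a one-line change in the configuration module; a minimal sketch (the BROKER_URL is illustrative, matching the defaults visible in the error above):

# celeryconfig.py -- minimal sketch
BROKER_URL = 'amqp://guest:guest@127.0.0.1:5672//'

# Never create or store task results, so result queues
# cannot pile up in RabbitMQ again.
CELERY_IGNORE_RESULT = True

If the results are actually needed, an alternative is to keep a result backend but let entries expire via CELERY_TASK_RESULT_EXPIRES, or to move results off RabbitMQ entirely (e.g. to Redis).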