Nova compute service check-ins time out when registering to rabbitmq behind haproxy+keepalived

Bug #1386406 reported by Jerry Zhao
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Jerry Zhao
Milestone: —

Bug Description

After a heat stack-update, all nova-compute services are reported as down.
The problem exists in both the 3 controller + n compute and the 1 controller + n compute layouts.

The overcloud image was built off trunk on 10/20/2014

Nova conductor log:
Oct 27 20:10:48 ci-overcloud-controller0-kbteq7w5rqla nova-conductor: 2014-10-27 20:10:48.992 28817 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
Oct 27 20:10:49 ci-overcloud-controller0-kbteq7w5rqla nova-conductor: 2014-10-27 20:10:49.248 28813 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
Oct 27 20:10:49 ci-overcloud-controller0-kbteq7w5rqla nova-conductor: 2014-10-27 20:10:49.992 28817 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 162.3.121.200:5672
Oct 27 20:10:50 ci-overcloud-controller0-kbteq7w5rqla nova-conductor: 2014-10-27 20:10:50.248 28813 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 162.3.121.200:5672
Oct 27 20:11:00 ci-overcloud-controller0-kbteq7w5rqla nova-conductor: 2014-10-27 20:11:00.119 28816 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server 162.3.121.200:5672 closed the connection. Check login credentials: Socket closed
Oct 27 20:11:01 ci-overcloud-controller0-kbteq7w5rqla nova-conductor: 2014-10-27 20:11:01.143 28822 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server 162.3.121.200:5672 closed the connection. Check login credentials: Socket closed
Oct 27 20:11:03 ci-overcloud-controller0-kbteq7w5rqla nova-conductor: 2014-10-27 20:11:03.447 28817 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server 162.3.121.200:5672 closed the connection. Check login credentials: Socket closed

Nova compute log:
Oct 27 20:01:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:01:36.509 22247 TRACE nova.openstack.common.periodic_task
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.570 22247 WARNING nova.openstack.common.loopingcall [-] task <bound method DbDriver._report_state of <nova.servicegroup.drivers.db.DbDriver object at 0x7f454c8db710>> run outlasted interval by 110.06 sec
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager._run_pending_deletes: Timed out waiting for a reply to message ID 9c691f820eb648d6ac58d9107a4fcf27
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task Traceback (most recent call last):
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py", line 198, in run_periodic_tasks
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task task(self, context)
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 6205, in _run_pending_deletes
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task context, filters, expected_attrs=attrs, use_slave=True)
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/objects/base.py", line 153, in wrapper
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task args, kwargs)
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 346, in object_class_action
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task objver=objver, args=args, kwargs=kwargs)
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 152, in call
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task retry=self.retry)
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/oslo/messaging/transport.py", line 90, in _send
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task timeout=timeout, retry=retry)
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 408, in send
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task retry=retry)
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 397, in _send
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task result = self._waiter.wait(msg_id, timeout)
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 298, in wait
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task reply, ending, trylock = self._poll_queue(msg_id, timeout)
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 238, in _poll_queue
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task message = self.waiters.get(msg_id, timeout)
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 144, in get
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task 'to message ID %s' % msg_id)
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID 9c691f820eb648d6ac58d9107a4fcf27
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.571 22247 TRACE nova.openstack.common.periodic_task
Oct 27 20:02:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:02:36.573 22247 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 162.3.121.200:5672
Oct 27 20:03:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:03:36.639 22247 WARNING nova.openstack.common.loopingcall [-] task <bound method DbDriver._report_state of <nova.servicegroup.drivers.db.DbDriver object at 0x7f454c8db710>> run outlasted interval by 50.07 sec
Oct 27 20:04:36 ci-overcloud-novacompute0-cdfjqk62jmo3 nova-compute: 2014-10-27 20:04:36.702 22247 WARNING nova.openstack.common.loopingcall

After changing nova.conf so that the conductor connects to the rabbitmq server directly, without going through haproxy, everything comes back to normal.
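For reference, the workaround amounts to pointing the rabbit host option at a controller address instead of the haproxy VIP. A minimal sketch of the relevant nova.conf lines (Juno-era option names; the controller address is assumed from the rabbitmq accept log quoted below, not taken from the actual file):

    [DEFAULT]
    # Workaround: talk to rabbitmq directly instead of through the haproxy/keepalived VIP
    # rabbit_host = 162.3.121.200   # VIP fronted by haproxy (requests time out, see logs above)
    rabbit_host = 162.3.121.24      # controller address, assumed from the rabbitmq log
    rabbit_port = 5672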

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Hi Jerry, can you provide logs from the rabbitmq server? I wonder if haproxy is confusing rabbitmq-server somehow. I believe they'd be in /var/log/rabbitmq. Thanks.

Changed in tripleo:
status: New → Incomplete
importance: Undecided → High
Revision history for this message
Jerry Zhao (zhaoxinyu) wrote :

I didn't see much in /var/log/rabbit@overcloud-controller0-ghqtsmsgjgck.log except lines like "accepting AMQP connection <0.6062.122> (162.3.121.24:45231 -> 162.3.121.24:5672)".
Today, after pointing the rabbit host back at haproxy and restarting nova-compute, the nova-compute log only printed:
 Connecting to AMQP server on 162.3.121.200:5672
without ever saying "Connected", and the rabbitmq log didn't record any new connection.
It seems that haproxy simply held the connection once the socket from nova-compute had been established, and never passed it on to rabbitmq.

Revision history for this message
Jerry Zhao (zhaoxinyu) wrote :

Here is what my keepalived.conf looked like. https://review.openstack.org/#/c/104368/ seems to have changed the keepalived element to support HA for more than one virtual IP, so I will try rebuilding my overcloud image to see whether it fixes my problem:

vrrp_script chk_haproxy {
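      # "killall -0 haproxy" exits 0 only while a haproxy process exists;
      # weight 2 raises this node's VRRP priority by 2 while the check passes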
      script "killall -0 haproxy"
      interval 2
      weight 2
}

vrrp_instance VI_1 {
      interface br-ex
      state BACKUP
      nopreempt
      virtual_router_id 51
      priority 101 # 101 on master, 100 on backup
      virtual_ipaddress {
           162.3.121.200 dev br-ex
           162.3.121.74 dev br-ex
      }
      track_script {
           chk_haproxy
      }
}

Revision history for this message
Jerry Zhao (zhaoxinyu) wrote :

I finally found the cause of the issue: the session limit for rabbitmq was being reached. The default value is only 150, and it is controlled by haproxy.service_maxconn in the heat template; that key didn't exist in the template generated in my environment, so the default of 150 was used.
The number of sessions generated by my cluster of 14 nodes is a little over 200, exceeding the default of 150, according to the statistics on http://haproxy_host:1993.
I wonder how the tripleo CI clouds in the hp1 and hp2 regions set their session limit; I didn't see service_maxconn specified anywhere in the tripleo-cd elements or in tripleo-incubator.
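Rough arithmetic from the figures above (the per-node count is an inference from the totals, not something measured directly):

    ~200 sessions / 14 nodes ≈ 14-15 AMQP connections per node
    150 default limit / ~15 connections per node ≈ 10 nodes before the limit is hit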

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/137142

Changed in tripleo:
assignee: nobody → Jerry Zhao (zhaoxinyu)
status: Incomplete → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/137142
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=61172031135814f34f139c51453154af86e47e44
Submitter: Jenkins
Branch: master

commit 61172031135814f34f139c51453154af86e47e44
Author: Jerry Zhao <email address hidden>
Date: Tue Nov 25 09:02:13 2014 -0800

    bump up the number of sesson limit for rabbitmq

    the default maxconn is only 150, which maybe good for api services
    but not enough for the rabbitmq session in a cluster as small as 15
    nodes. so bump up the number to 1500 for rabbitmq to allow for 100
    nodes. this number should be calculated based on the scale numbers
    in the long run.
    Closes-bug: #1386406

    Change-Id: Ieb707b31022a6fc9ade32ed2a332b67bf4dc0311
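For illustration, here is roughly what the rendered haproxy section for rabbitmq looks like once the per-service limit is raised. The listen-block layout, mode and server line are assumptions for this sketch, not copied from the deployment or the merged template; only the VIP and the 150/1500 figures come from this report and the commit above:

    listen rabbitmq
        mode tcp                      # AMQP is plain TCP as far as haproxy is concerned
        bind 162.3.121.200:5672
        maxconn 1500                  # element default was 150; 1500 sized for ~100 nodes
        server overcloud-controller0 162.3.121.24:5672 check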

Changed in tripleo:
status: In Progress → Fix Committed
Derek Higgins (derekh)
Changed in tripleo:
status: Fix Committed → Fix Released
Revision history for this message
Mayank Kumar (mayank0107) wrote :

I have a 3-node active/active HA setup.
I bumped the rabbitmq max connection limit up to 30000, but the computes are still failing.
Logs: http://pastebin.com/6EhCRnYR
