nova compute failed to report health when nova conductor started

Bug #1393182 reported by Jerry Zhao
This bug affects 4 people
Affects                     Status      Importance   Assigned to     Milestone
Compass                     Confirmed   Undecided    Compass Team
OpenStack Compute (nova)    Invalid     Undecided    Unassigned
oslo.messaging              Invalid     Undecided    Unassigned

Bug Description

I have an Icehouse OpenStack deployment that includes one controller and 3 computes. The controller was the last to come up. When RabbitMQ started, nova-compute tried to connect to it; as the log shows, it seemed to finally get connected, but nova service-list still showed the nova-compute service as down. I later restarted nova-compute, and this time nova service-list showed that nova-compute was UP.
I still have two other nova-compute services in down status; if restarted, they would probably become UP as well. If any more info is needed, let me know.
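
For context, nova service-list decides up/down by comparing each service's last reported heartbeat (the Updated_at column) against the service_down_time option, 60 seconds by default. A simplified sketch of that check (not Nova's actual code):

    # Simplified sketch of how "nova service-list" decides up/down; not Nova's
    # actual implementation. service_down_time defaults to 60 seconds.
    from datetime import datetime, timedelta

    SERVICE_DOWN_TIME = timedelta(seconds=60)

    def service_is_up(last_heartbeat, now=None):
        """Return True if the service's last heartbeat is recent enough."""
        now = now or datetime.utcnow()
        return (now - last_heartbeat) < SERVICE_DOWN_TIME

    # Example: the heartbeat stopped at 13:34:26 (see the service-list excerpt
    # below), so a check made more than a minute later reports the service down.
    print(service_is_up(datetime(2014, 11, 14, 13, 34, 26),
                        now=datetime(2014, 11, 14, 14, 5, 0)))  # False -> "down"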

2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 622, in ensure
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit return method(*args, **kwargs)
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 702, in _consume
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit return self.connection.drain_events(timeout=timeout)
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/connection.py", line 139, in drain_events
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit return self.transport.drain_events(self.connection, **kwargs)
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/transport/pyamqplib.py", line 223, in drain_events
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit return connection.drain_events(**kwargs)
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/transport/pyamqplib.py", line 56, in drain_events
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit return self.wait_multi(self.channels.values(), timeout=timeout)
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/transport/pyamqplib.py", line 81, in wait_multi
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit return amqp_method(channel, args)
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 365, in _close
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit self._x_close_ok()
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 384, in _x_close_ok
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit self._send_method((10, 61))
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/amqplib/client_0_8/abstract_channel.py", line 70, in _send_method
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit method_sig, args, content)
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/amqplib/client_0_8/method_framing.py", line 233, in write_method
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit self.dest.write_frame(1, channel, payload)
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/amqplib/client_0_8/transport.py", line 125, in write_frame
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit frame_type, channel, size, payload, 0xce))
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 359, in sendall
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit tail = self.send(data, flags)
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 342, in send
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit total_sent += fd.send(data[total_sent:], flags)
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit error: [Errno 104] Connection reset by peer
2014-11-14 21:34:43.449 4314 TRACE oslo.messaging._drivers.impl_rabbit
2014-11-14 21:34:43.450 4314 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.1.0.144:5672
2014-11-14 21:34:43.450 4314 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-11-14 21:34:44.501 4314 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.1.0.144:5672
2014-11-14 21:34:44.933 4314 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.1.0.144:5672
2014-11-14 21:34:44.933 4314 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-11-14 21:34:45.982 4314 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.1.0.144:5672

Revision history for this message
Jerry Zhao (zhaoxinyu) wrote :

PS: the controller went up 30 minutes after the 3 nova-computes.

| nova-compute | n142 | nova | enabled | down | 2014-11-14T13:34:26.000000 | - |
| nova-compute | n141 | nova | enabled | down | 2014-11-14T13:34:27.000000 | - |

Revision history for this message
Jerry Zhao (zhaoxinyu) wrote :

Here is the log that shows the first couple of attempts to connect to RabbitMQ.

2014-11-14 21:29:59.616 4284 WARNING nova.conductor.api [req-4457f21d-dde3-40f7-8825-b8c25fbaa098 None None] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor?
2014-11-14 21:30:09.668 4284 WARNING nova.conductor.api [req-4457f21d-dde3-40f7-8825-b8c25fbaa098 None None] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor?
2014-11-14 21:30:19.720 4284 WARNING nova.conductor.api [req-4457f21d-dde3-40f7-8825-b8c25fbaa098 None None] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor?
2014-11-14 21:30:29.774 4284 WARNING nova.conductor.api [req-4457f21d-dde3-40f7-8825-b8c25fbaa098 None None] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor?
2014-11-14 21:30:39.827 4284 WARNING nova.conductor.api [req-4457f21d-dde3-40f7-8825-b8c25fbaa098 None None] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor?
2014-11-14 21:30:49.880 4284 WARNING nova.conductor.api [req-4457f21d-dde3-40f7-8825-b8c25fbaa098 None None] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor?
2014-11-14 21:30:59.933 4284 WARNING nova.conductor.api [req-4457f21d-dde3-40f7-8825-b8c25fbaa098 None None] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor?
2014-11-14 21:31:00.259 4284 WARNING nova.virt.libvirt.driver [req-de38aca0-7ac4-4b82-8465-ec97c66c6f8c None None] Cannot update service status on host: n141,since it is not registered.
2014-11-14 21:31:02.389 4284 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.1.0.144:5672
2014-11-14 21:31:13.887 4284 WARNING nova.compute.manager [req-c2d8ca05-307a-473c-ba4d-c6b4211cad2c None None] Bandwidth usage not supported by hypervisor.
2014-11-14 21:34:17.591 4284 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.1.0.144:5672
2014-11-14 21:34:32.917 4284 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: [Errno 104] Connection reset by peer
2014-11-14 21:34:32.917 4284 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2014-11-14 21:34:32.917 4284 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 622, in ensure
2014-11-14 21:34:32.917 4284 TRACE oslo.messaging._drivers.impl_rabbit return method(*args, **kwargs)
2014-11-14 21:34:32.917 4284 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 702, in _consume
2014-11-14 21:34:32.917 4284 TRACE oslo.messaging._drivers.impl_rabbit return self.connection.drain_events(timeout=timeout)
2014-11-14 21:34:32.917 4284 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/connection.py", line 139, in drain_events
2014-11-14 21:34:32.917 4284 TRACE oslo.messaging._drivers.impl_ra...


Jerry Zhao (zhaoxinyu)
Changed in compass:
assignee: nobody → Compass Team (syscompass)
syang1 (shuo-yang)
tags: added: customer-issues deploy-adapter
Changed in compass:
status: New → Confirmed
Revision history for this message
Joe Gordon (jogo) wrote :

Based on the log, this may be an oslo.messaging issue.

Revision history for this message
Joe Gordon (jogo) wrote :

In devstack trunk I was unable to reproduce this using the following steps (a scripted version is sketched after the list):

* turned off nova-conductor
* restart nova-compute
* wait for nova-compute to show up as down in 'nova service-list'
* start nova-conductor
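
Roughly scripted, those steps could look like the following. This is a hypothetical automation sketch, not part of the bug report; the systemctl unit names are assumptions that depend on how the environment was deployed:

    # Hypothetical automation of the repro steps above; not from the bug report.
    # The unit names "openstack-nova-conductor"/"openstack-nova-compute" are
    # assumptions -- adjust for the actual deployment (devstack, distro packages).
    import subprocess
    import time

    def sh(*cmd):
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    sh("systemctl", "stop", "openstack-nova-conductor")    # turn off nova-conductor
    sh("systemctl", "restart", "openstack-nova-compute")   # restart nova-compute

    # Wait long enough for the compute to be reported as down: the default
    # report interval is 10s and service_down_time is 60s, so ~2 minutes is
    # plenty; confirm with "nova service-list".
    time.sleep(120)

    sh("systemctl", "start", "openstack-nova-conductor")   # start nova-conductor again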

Changed in nova:
status: New → Incomplete
Changed in oslo.messaging:
status: New → Incomplete
Revision history for this message
Jerry Zhao (zhaoxinyu) wrote :

It was in Icehouse that I had that issue; oslo.messaging improved a lot after that,
but this bug is still valid for Icehouse.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to compass-core (dev/experimental)

Fix proposed to branch: dev/experimental
Review: https://review.openstack.org/144129

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to compass-core (dev/experimental)

Reviewed: https://review.openstack.org/144129
Committed: https://git.openstack.org/cgit/stackforge/compass-core/commit/?id=4bb0d0a14c7f0768c4f8d5bc305aa2beb75572e7
Submitter: Jenkins
Branch: dev/experimental

commit 4bb0d0a14c7f0768c4f8d5bc305aa2beb75572e7
Author: Jerry Zhao <email address hidden>
Date: Fri Dec 26 03:54:11 2014 -0800

    add service status check after openstack cluster is deployed

    check service status after openstack cluster deployment is complete
    to have a basic peace of mind.
    will add tempest tests later.
    also update openstack logs archive rule to use latest variable.
    fix fastest_url to follow http redirects.
    pin python-daemon version to 1.6.1 because 2.0 breaks.

    Change-Id: I0830138381f99e128fd4dfc72f146dd16530d72b
    Partial-bug: #1393182
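
The gist of the change is a post-deployment sanity check of the service list. A minimal sketch of that idea (not the actual compass-core code; the parsing of the nova service-list table is an assumption):

    # Minimal sketch of a post-deployment service status check; not the actual
    # compass-core change referenced above.
    import subprocess
    import sys

    def down_nova_services():
        out = subprocess.run(["nova", "service-list"], check=True,
                             capture_output=True, text=True).stdout
        down = []
        for line in out.splitlines():
            cols = [c.strip() for c in line.split("|")]
            # A service row has a "nova-*" binary; flag it if its State is "down".
            if any(c.startswith("nova-") for c in cols) and "down" in cols:
                down.append(line)
        return down

    if __name__ == "__main__":
        down = down_nova_services()
        if down:
            print("Services still down after deployment:")
            print("\n".join(down))
            sys.exit(1)
        print("All nova services report up.")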

Revision history for this message
Robert Collins (lifeless) wrote :

@Jogo I suspect this is still a problem if you stop rabbit: turn off the conductor *and rabbit*, start nova-compute, wait for a few minutes, then start rabbit and the conductor.

Changed in nova:
status: Incomplete → New
Revision history for this message
Joe Gordon (jogo) wrote :

Sounds like an oslo.messaging issue at this point.

Changed in oslo.messaging:
status: Incomplete → New
Changed in nova:
status: New → Invalid
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: New → Triaged
Mehdi Abaakouk (sileht)
tags: added: icehouse juno
Revision history for this message
Ken Giusti (kgiusti) wrote :

This bug is > 180 days without activity. We are removing the assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid please update this bug with the latest observations.

Changed in oslo.messaging:
status: Triaged → Incomplete
Changed in oslo.messaging:
status: Incomplete → Invalid