nova service-list shows nova-compute as down and the service must be restarted frequently in order to provision new VMs

Bug #1331537 reported by Leonardo Borda
This bug affects 11 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Nova compute services in OpenStack Havana go down frequently, as listed by "nova service-list", and require restarting very frequently, multiple times every day. All the compute nodes have their NTP time in sync.

When a node shows as down, it cannot be used for launching new VMs, and we quickly run out of compute resources. Hence our workaround is to restart nova-compute on those nodes hourly.

On the nova-compute node I found the following error; its timestamps match the "Updated_at" field from nova service-list.
2014-06-07 00:21:15.690 511340 ERROR nova.servicegroup.drivers.db [-] model server went away
2014-06-07 00:21:15.690 511340 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-06-07 00:21:15.690 511340 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/dist-packages/nova/servicegroup/drivers/db.py", line 92, in _report_state
2014-06-07 00:21:15.690 511340 TRACE nova.servicegroup.drivers.db report_count = service.service_ref['report_count'] + 1
2014-06-07 00:21:15.690 511340 TRACE nova.servicegroup.drivers.db TypeError: 'NoneType' object has no attribute '__getitem__'
2014-06-07 00:21:15.690 511340 TRACE nova.servicegroup.drivers.db

It looks like the hosts shown as down haven't been able to update the database with their latest status, and their timestamps match the traceback above (2014-06-07 00:21:15.690) on at least two compute nodes that I have checked.
+--------------+-------+--------+---------+-------+----------------------------+-----------------+
| Binary       | Host  | Zone   | Status  | State | Updated_at                 | Disabled Reason |
+--------------+-------+--------+---------+-------+----------------------------+-----------------+
| nova-compute | nova1 | blabla | enabled | up    | 2014-06-07T00:37:42.000000 | None            |
| nova-compute | nova2 | blabla | enabled | down  | 2014-06-07T00:21:05.000000 | None            |
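
For context, a rough sketch (illustrative only, not nova's actual code) of how the DB servicegroup driver decides up vs. down: _report_state refreshes the service's updated_at heartbeat, and the service is shown as down once that timestamp is older than service_down_time.

# Illustrative sketch: a service counts as "up" only if its last heartbeat
# is newer than service_down_time seconds ago.
import datetime

SERVICE_DOWN_TIME = 60  # seconds; nova's default

def is_up(updated_at, now=None):
    now = now or datetime.datetime.utcnow()
    return abs((now - updated_at).total_seconds()) <= SERVICE_DOWN_TIME

# If _report_state keeps failing (e.g. the TypeError above), updated_at stops
# advancing and is_up() eventually returns False, which "nova service-list"
# displays as State = down.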

Tracy Jones (tjones-i)
tags: added: compute
Revision history for this message
Zap Chang (zapchang) wrote :

We've seen this issue recently. Looking at the conductor's log, the "service_update" message is received and there are no DB-related exceptions in the log. The periodic task that performs service_update is not invoked again until nova-compute is restarted.

Revision history for this message
George Kissandrakis (gkissand) wrote :

Setting these options explicitly:
service_down_time=60
report_interval=10
which are the default values, in the nova.conf [DEFAULT] section seems to solve the problem.
Whereas I previously had to restart nova-compute 3-4 times, nova-compute has now stayed registered for 3-4 days without a problem.
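
For reference, a minimal nova.conf fragment with those settings (they match nova's compiled-in defaults, so adding them only makes the values explicit):

[DEFAULT]
# Maximum seconds since the last heartbeat before the service is reported as down
service_down_time = 60
# Seconds between each service reporting its state to the database
report_interval = 10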

Revision history for this message
melanie witt (melwitt) wrote :

Is anyone still experiencing this bug? The message "model server went away" indicates a connection problem with the database (message queue problem in your environment?).

Zap and George saw issues where there is no "model server went away" error message, but instead the periodic task doesn't fire when it's supposed to. George's observation is strange: how would setting the default values in nova.conf fix the problem? That doesn't make sense, as those values are already set that way in the code, without requiring external configuration:

https://github.com/openstack/nova/blob/65135e7/nova/service.py#L44
https://github.com/openstack/nova/blob/65135e7/nova/service.py#L103

Changed in nova:
status: New → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/122471

Changed in nova:
assignee: nobody → Roman Podoliaka (rpodolyaka)
status: Incomplete → In Progress
Sean Dague (sdague)
Changed in nova:
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

There are 1450 hits of this in 7 days in the check/gate queues:

message:"model server went away" AND tags:"screen-n-cpu.txt"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwibW9kZWwgc2VydmVyIHdlbnQgYXdheVwiIEFORCB0YWdzOlwic2NyZWVuLW4tY3B1LnR4dFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDExMjMwMjYxNjI2LCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

Only 5% of the jobs where this shows up are failing, so it's apparently not causing us huge issues in the gate. I'd bet that if we look at those failures, other things are failing due to messaging timeouts while waiting on some response, like bug 1371677.

Matt Riedemann (mriedem)
tags: added: conductor
Revision history for this message
Qin Zhao (zhaoqin) wrote :

I encountered this problem once. Because there was no reliable way to recreate it, I did not find out why conductor_api.service_update() returned None.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Which RPC backend is being used?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Is cells being used?

Revision history for this message
Matt Riedemann (mriedem) wrote :

This has 0 hits:

message:"model server went away" AND NOT message:"Timeout" AND tags:"screen-n-cpu.txt"

Which means every time we hit 'model server went away' in our Jenkins runs it's due to a messaging timeout.

I wouldn't be surprised if there was a bug in an old kombu or qpid client that was swallowing something on an RPC call and returning None, but we'd need an example of that (log output), and it'd need to be reported against the offending library.

Changed in nova:
status: In Progress → Incomplete
Revision history for this message
Matt Riedemann (mriedem) wrote :

Marking as Incomplete because we need more info here.

Revision history for this message
Leonardo Borda (lborda) wrote :

Hi Matt,

RabbitMQ is being used.

Revision history for this message
jichenjc (jichenjc) wrote :

@Zhao qin

I worked on another commit, and someone suggested thinking about using cast instead of call for the RPC here, because a cast is fine for this periodic update:
https://review.openstack.org/#/c/127876/5

I knew about this bug before and really don't know why it returns None; it doesn't make sense from code analysis.

Maybe a cast makes sense as a workaround? Thanks
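
A minimal sketch of the call-vs-cast difference being suggested, using oslo.messaging's RPCClient; the target, method signature and arguments below are illustrative assumptions, not nova's actual conductor API:

# Illustrative only: call() blocks waiting for a reply (and can time out or
# come back None if the reply path breaks), while cast() is fire-and-forget,
# which is arguably enough for a periodic heartbeat update.
import oslo_messaging as messaging
from oslo_config import cfg

transport = messaging.get_transport(cfg.CONF)                 # e.g. rabbit://...
target = messaging.Target(topic='conductor', version='1.0')   # illustrative
client = messaging.RPCClient(transport, target)

def report_state_call(ctxt, service_ref):
    # rpc.call: waits for the conductor's reply; this bug shows the reply
    # sometimes coming back as None.
    return client.call(ctxt, 'service_update', service=service_ref,
                       values={'report_count': service_ref['report_count'] + 1})

def report_state_cast(ctxt, service_ref):
    # rpc.cast: no reply is expected, so a lost or misordered reply cannot
    # break the heartbeat loop.
    client.cast(ctxt, 'service_update', service=service_ref,
                values={'report_count': service_ref['report_count'] + 1})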

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/122471
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Revision history for this message
jizhilong (zlji) wrote :

We also encountered this issue occasionally recently, and our code base is grizzly.

After digging into some RPC-related code, I noticed that at least 2 AMQP messages are sent to return the result of an rpc.call: one for the actual result, and the other to mark the end of the reply.
For anyone interested, the code is here: https://github.com/openstack/nova/blob/grizzly-eol/nova/openstack/common/rpc/amqp.py#L419

I wonder if this issue is caused by the loss of the actual result or by misordering of the two messages.

Revision history for this message
jichenjc (jichenjc) wrote :

https://bugs.launchpad.net/bugs/1407936
Vladimir Eremin also reported an occurrence of this issue.

Revision history for this message
jizhilong (zlji) wrote :

After some experiments with RabbitMQ and the nova-conductor rpcapi, I noticed that this issue is caused by misordering of reply messages.

According to the code here: https://github.com/openstack/nova/blob/grizzly-eol/nova/openstack/common/rpc/amqp.py#L419, a service has to send at least 2 RabbitMQ messages to compose an rpc.call response, and each message is sent by calling msg_reply. The problem is that the connections used for sending these two messages may be different, and in that circumstance message ordering is not guaranteed by the AMQP specification: 'Section 4.7 of the AMQP 0-9-1 core specification explains the conditions under which ordering is guaranteed: messages published in one channel, passing through one exchange and one queue and one outgoing channel will be received in the same order that they were sent' (quoted from https://www.rabbitmq.com/semantics.html).
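
To make that failure mode concrete, here is a small self-contained illustration (not the nova/oslo code) of a caller that reads reply messages until it sees the 'ending' marker; if the two messages arrive out of order, the call returns None:

# Illustrative model of the two-message rpc.call reply described above:
# one message carries the result, a second one marks the end of the reply.
def wait_for_reply(messages):
    result = None
    for msg in messages:
        if msg.get('ending'):        # end-of-reply marker
            break
        result = msg.get('result')   # actual return value
    return result

# In-order delivery: the result arrives before the ending marker.
print(wait_for_reply([{'result': {'report_count': 5}}, {'ending': True}]))   # -> {'report_count': 5}

# Misordered delivery (replies sent over different connections): the ending
# marker is consumed first, so the caller gets None -- exactly what
# _report_state then trips over.
print(wait_for_reply([{'ending': True}, {'result': {'report_count': 5}}]))   # -> None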

Revision history for this message
jizhilong (zlji) wrote :

Continuing from #16: in the latest oslo.messaging, reply messages are sent over the same connection.

Revision history for this message
jizhilong (zlji) wrote :

Since Icehouse, nova has used oslo.messaging, which should have fixed this issue. I made a patch restricting replies during an rpc.call to a single connection. The patch should apply to grizzly, and to havana with little modification.
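
A rough illustration of the idea behind such a patch (illustrative only, not the actual change): publish every part of one reply over the same connection, so the AMQP single-channel ordering guarantee applies.

import threading

class SingleConnectionReplier(object):
    """Illustrative: send all parts of one rpc.call reply over the same
    connection so their order is preserved end to end."""

    def __init__(self, connection):
        self._conn = connection      # one connection reused for the whole reply
        self._lock = threading.Lock()

    def send_reply(self, msg_id, result):
        with self._lock:
            # Both messages go out on the same connection, in order.
            self._conn.publish(msg_id, {'result': result})
            self._conn.publish(msg_id, {'ending': True})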

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

Cleanup
=======

This bug report has had the status "Incomplete" for more than 30 days
and it looks like there are no open reviews for it. To keep
the bug list sane, I am closing this bug as "Won't Fix". This does not
mean that it is not a valid bug report; it's more to acknowledge that
no progress can be expected here anymore. You are still free to push a
new patch for this bug. If you can reproduce it on the current master
code or on a maintained stable branch, please switch it to "Confirmed".

Changed in nova:
status: Incomplete → Won't Fix
assignee: Roman Podoliaka (rpodolyaka) → nobody
Revision history for this message
Shoham Peller (shoham-peller) wrote :

Happened to me in Liberty:

ERROR nova.servicegroup.drivers.db [-] Failed reporting state, will try next interval (part of a looping call)
ERROR nova.servicegroup.drivers.db Traceback (most recent call last):
ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 86, in _report_state
ERROR nova.servicegroup.drivers.db service.service_ref.report_count += 1
ERROR nova.servicegroup.drivers.db AttributeError: 'NoneType' object has no attribute 'report_count'

Changed in nova:
status: Won't Fix → New
Sean Dague (sdague)
Changed in nova:
status: New → Incomplete
Revision history for this message
John Garbutt (johngarbutt) wrote :

This sounds like it's a new bug; have you looked at bugs 1517926 and 1505471?

We might want to re-close this, and instead look at backporting the above fixes, if they have not already been backported.

Revision history for this message
Sean Dague (sdague) wrote :

@speller can you provide the information above?

What does your configuration look like?

Which hypervisor? Is this cells? What base?

The very narrow stack trace isn't enough to be actionable.

Revision history for this message
Sean Dague (sdague) wrote :

It would also be good to figure out what errors your conductor is having, because that seems to be the main trigger for this.

Revision history for this message
Shoham Peller (shoham-peller) wrote :

@sdague I don't know what questions you're asking about.

hypervisor is libvirt, no cells, and Liberty is the base
Also, attached is the full nova-compute log with the exception

Any other information I can provide?

Revision history for this message
Shoham Peller (shoham-peller) wrote :

Attached is the nova-conductor log as well.
There are errors at the end regarding message handling. Could they be causing this bug?

Revision history for this message
Shoham Peller (shoham-peller) wrote :

@johngarbutt Both those bug fixes were backported to nova 12.0.1; I'm using 12.0.0.
I'll try to test with those fixes, so it seems this can be closed for now.

Revision history for this message
Shoham Peller (shoham-peller) wrote :

This happened to me again after porting those 2 commits on top of version 12.0.0. Here is my git log showing that:

113750b servicegroup: stop zombie service due to exception
736fa6a Handle DB failures in servicegroup DB driver
6df6ad3 (tag: 12.0.0.0rc3, tag: 12.0.0) Omnibus stable/liberty fix
b5b9f54 update dependencies based on rdopkg reqcheck
c920b8f Update to upstream 12.0.0.0rc2
ba6fccc Merge "Fix InstanceV1 backports to use context" into stable/liberty
6062043 Revert "[libvirt] Move cleanup of imported files to imagebackend"

The traceback I see in nova-compute:

2016-04-10 19:54:52,014.014 25926 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 87, in _report_state
2016-04-10 19:54:52,014.014 25926 ERROR nova.servicegroup.drivers.db service.service_ref.report_count += 1
2016-04-10 19:54:52,014.014 25926 ERROR nova.servicegroup.drivers.db AttributeError: 'NoneType' object has no attribute 'report_count'

Attached is the nova-compute log with the traceback.

Revision history for this message
Shoham Peller (shoham-peller) wrote :

And attaching the nova-conductor log

Changed in nova:
status: Incomplete → New
Revision history for this message
jizhilong (zlji) wrote :

Hi Shoham Peller, which version of oslo.messaging are you using?

Revision history for this message
Shoham Peller (shoham-peller) wrote :

@zlji oslo.messaging's version is 2.5.0

Revision history for this message
Pushkar Umaranikar (pushkar-umaranikar) wrote :

Resolving inconsistency: the bug priority is already set, hence changing status from "New" to "Confirmed".

Changed in nova:
status: New → Confirmed
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which led to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (LIBERTY, MITAKA, OCATA, NEWTON).
  Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
importance: Medium → Undecided
status: Confirmed → Expired