nova service-list shows nova-compute as down and the service must be restarted frequently in order to provision new VMs

Bug #1331537 reported by Leonardo Borda
This bug affects 11 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Nova compute services in OpenStack Havana go down frequently, as listed by "nova service-list", and require restarting very frequently, multiple times every day. All the compute nodes have their NTP time in sync.

When a node shows as down, it cannot be used for launching new VMs, and we quickly run out of compute resources. Hence our workaround is to restart nova-compute on those nodes hourly.

On the nova-compute node I found the following error; its timestamps match the "Updated_at" field from nova service-list.
2014-06-07 00:21:15.690 511340 ERROR nova.servicegroup.drivers.db [-] model server went away
2014-06-07 00:21:15.690 511340 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-06-07 00:21:15.690 511340 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/dist-packages/nova/servicegroup/drivers/db.py", line 92, in _report_state
2014-06-07 00:21:15.690 511340 TRACE nova.servicegroup.drivers.db report_count = service.service_ref['report_count'] + 1
2014-06-07 00:21:15.690 511340 TRACE nova.servicegroup.drivers.db TypeError: 'NoneType' object has no attribute '__getitem__'
2014-06-07 00:21:15.690 511340 TRACE nova.servicegroup.drivers.db

It looks like the hosts shown as down haven't been able to update the database with their latest status, and their timestamps match the traceback above (2014-06-07 00:21:15.690) on at least two compute nodes that I have checked.
+--------------+-------+--------+---------+-------+----------------------------+-----------------+
| Binary       | Host  | Zone   | Status  | State | Updated_at                 | Disabled Reason |
+--------------+-------+--------+---------+-------+----------------------------+-----------------+
| nova-compute | nova1 | blabla | enabled | up    | 2014-06-07T00:37:42.000000 | None            |
| nova-compute | nova2 | blabla | enabled | down  | 2014-06-07T00:21:05.000000 | None            |
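
For context, a rough sketch (illustrative only, not nova's actual code) of how the DB servicegroup driver decides up vs. down: _report_state refreshes the service's updated_at heartbeat, and the service is shown as down once that timestamp is older than service_down_time.

# Illustrative sketch: a service counts as "up" only if its last heartbeat
# is newer than service_down_time seconds ago.
import datetime

SERVICE_DOWN_TIME = 60  # seconds; nova's default

def is_up(updated_at, now=None):
    now = now or datetime.datetime.utcnow()
    return abs((now - updated_at).total_seconds()) <= SERVICE_DOWN_TIME

# If _report_state keeps failing (e.g. the TypeError above), updated_at stops
# advancing and is_up() eventually returns False, which "nova service-list"
# displays as State = down.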

Tracy Jones (tjones-i)
tags: added: compute
Revision history for this message
Zap Chang (zapchang) wrote :

We've seen this issue recently. Looking at the conductor's log, the "service_update" message is received and there are no DB-related exceptions in the log. The periodic task that performs service_update is not invoked again until nova-compute is restarted.

Revision history for this message
George Kissandrakis (gkissand) wrote :

Setting these options explicitly:
service_down_time=60
report_interval=10
which are the default values, in the nova.conf [DEFAULT] section seems to solve the problem.
Whereas I previously had to restart nova-compute 3-4 times, nova-compute has now stayed registered for 3-4 days without a problem.
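
For reference, a minimal nova.conf fragment with those settings (they match nova's compiled-in defaults, so adding them only makes the values explicit):

[DEFAULT]
# Maximum seconds since the last heartbeat before the service is reported as down
service_down_time = 60
# Seconds between each service reporting its state to the database
report_interval = 10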

Revision history for this message
melanie witt (melwitt) wrote :

Is anyone still experiencing this bug? The message "model server went away" indicates a connection problem with the database (message queue problem in your environment?).

Zap and George saw issues where there is no "model server went away" error message, but instead the periodic task doesn't fire when it's supposed to. George's observation is strange: how would setting the default values in nova.conf fix the problem? That doesn't make sense, as those values are already set that way in the code, without requiring external configuration:

https://github.com/openstack/nova/blob/65135e7/nova/service.py#L44
https://github.com/openstack/nova/blob/65135e7/nova/service.py#L103

Changed in nova:
status: New → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/122471

Changed in nova:
assignee: nobody → Roman Podoliaka (rpodolyaka)
status: Incomplete → In Progress
Sean Dague (sdague)
Changed in nova:
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

There are 1450 hits of this in 7 days in the check/gate queues:

message:"model server went away" AND tags:"screen-n-cpu.txt"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwibW9kZWwgc2VydmVyIHdlbnQgYXdheVwiIEFORCB0YWdzOlwic2NyZWVuLW4tY3B1LnR4dFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDExMjMwMjYxNjI2LCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

Only 5% of the jobs where this shows up are failing, so it's apparently not causing us huge issues in the gate. I'd bet that if we look at those failures, other things are failing due to messaging timeouts while waiting on some response, like bug 1371677.

Matt Riedemann (mriedem)
tags: added: conductor
Revision history for this message
Qin Zhao (zhaoqin) wrote :

I encountered this problem once. Because there was no reliable way to recreate it, I did not find out why conductor_api.service_update() returned None.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Which RPC backend is being used?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Is cells being used?

Revision history for this message
Matt Riedemann (mriedem) wrote :

This has 0 hits:

message:"model server went away" AND NOT message:"Timeout" AND tags:"screen-n-cpu.txt"

Which means every time we hit 'model server went away' in our Jenkins runs it's due to a messaging timeout.

I wouldn't be surprised if there was a bug in an old kombu or qpid client that was swallowing something on an RPC call and returning None, but we'd need an example of that (log output), and it'd need to be reported against the offending library.

Changed in nova:
status: In Progress → Incomplete
Revision history for this message
Matt Riedemann (mriedem) wrote :

Marking as Incomplete because we need more info here.

Revision history for this message
Leonardo Borda (lborda) wrote :

Hi Matt,

RabbitMQ is being used.

Revision history for this message
jichenjc (jichenjc) wrote :

@Zhao qin

I worked on another commit, and someone suggested thinking about using cast instead of call for the RPC here, because a cast is fine for this periodic update:
https://review.openstack.org/#/c/127876/5

I knew about this bug before and really don't know why it returns None; it doesn't make sense from code analysis.

Maybe a cast makes sense as a workaround? Thanks
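
A minimal sketch of the call-vs-cast difference being suggested, using oslo.messaging's RPCClient; the target, method signature and arguments below are illustrative assumptions, not nova's actual conductor API:

# Illustrative only: call() blocks waiting for a reply (and can time out or
# come back None if the reply path breaks), while cast() is fire-and-forget,
# which is arguably enough for a periodic heartbeat update.
import oslo_messaging as messaging
from oslo_config import cfg

transport = messaging.get_transport(cfg.CONF)                 # e.g. rabbit://...
target = messaging.Target(topic='conductor', version='1.0')   # illustrative
client = messaging.RPCClient(transport, target)

def report_state_call(ctxt, service_ref):
    # rpc.call: waits for the conductor's reply; this bug shows the reply
    # sometimes coming back as None.
    return client.call(ctxt, 'service_update', service=service_ref,
                       values={'report_count': service_ref['report_count'] + 1})

def report_state_cast(ctxt, service_ref):
    # rpc.cast: no reply is expected, so a lost or misordered reply cannot
    # break the heartbeat loop.
    client.cast(ctxt, 'service_update', service=service_ref,
                values={'report_count': service_ref['report_count'] + 1})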

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/122471
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Revision history for this message
jizhilong (zlji) wrote :

We also encountered this issue occasionally recently, and our code base is grizzly.

After digging into some RPC-related code, I noticed that at least 2 AMQP messages are sent to return the result of an rpc.call: one for the actual result, and the other to mark the end of the reply.
For anyone interested, the code is here: https://github.com/openstack/nova/blob/grizzly-eol/nova/openstack/common/rpc/amqp.py#L419

I wonder if this issue is caused by the loss of the actual result or by misordering of the two messages.

Revision history for this message
jichenjc (jichenjc) wrote :

https://bugs.launchpad.net/bugs/1407936
Vladimir Eremin also reported an occurrence of this issue.

Revision history for this message
jizhilong (zlji) wrote :

After some experiments with RabbitMQ and the nova-conductor rpcapi, I noticed that this issue is caused by misordering of reply messages.

According to the code here: https://github.com/openstack/nova/blob/grizzly-eol/nova/openstack/common/rpc/amqp.py#L419, a service has to send at least 2 RabbitMQ messages to compose an rpc.call response, and each message is sent by calling msg_reply. The problem is that the connections used for sending these two messages may be different, and in that circumstance message ordering is not guaranteed by the AMQP specification: 'Section 4.7 of the AMQP 0-9-1 core specification explains the conditions under which ordering is guaranteed: messages published in one channel, passing through one exchange and one queue and one outgoing channel will be received in the same order that they were sent' (quoted from https://www.rabbitmq.com/semantics.html).
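
To make that failure mode concrete, here is a small self-contained illustration (not the nova/oslo code) of a caller that reads reply messages until it sees the 'ending' marker; if the two messages arrive out of order, the call returns None:

# Illustrative model of the two-message rpc.call reply described above:
# one message carries the result, a second one marks the end of the reply.
def wait_for_reply(messages):
    result = None
    for msg in messages:
        if msg.get('ending'):        # end-of-reply marker
            break
        result = msg.get('result')   # actual return value
    return result

# In-order delivery: the result arrives before the ending marker.
print(wait_for_reply([{'result': {'report_count': 5}}, {'ending': True}]))   # -> {'report_count': 5}

# Misordered delivery (replies sent over different connections): the ending
# marker is consumed first, so the caller gets None -- exactly what
# _report_state then trips over.
print(wait_for_reply([{'ending': True}, {'result': {'report_count': 5}}]))   # -> None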

Revision history for this message
jizhilong (zlji) wrote :

Continuing from #16: in the latest oslo.messaging, reply messages are sent over the same connection.

Revision history for this message
jizhilong (zlji) wrote :

Since Icehouse, nova has used oslo.messaging, which should have fixed this issue. I made a patch restricting replies during an rpc.call to a single connection. The patch should apply to grizzly, and to havana with little modification.
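
A rough illustration of the idea behind such a patch (illustrative only, not the actual change): publish every part of one reply over the same connection, so the AMQP single-channel ordering guarantee applies.

import threading

class SingleConnectionReplier(object):
    """Illustrative: send all parts of one rpc.call reply over the same
    connection so their order is preserved end to end."""

    def __init__(self, connection):
        self._conn = connection      # one connection reused for the whole reply
        self._lock = threading.Lock()

    def send_reply(self, msg_id, result):
        with self._lock:
            # Both messages go out on the same connection, in order.
            self._conn.publish(msg_id, {'result': result})
            self._conn.publish(msg_id, {'ending': True})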

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

Cleanup
=======

This bug report has had the status "Incomplete" for more than 30 days
and it looks like there are no open reviews for it. To keep
the bug list sane, I am closing this bug as "Won't Fix". This does not
mean that it is not a valid bug report; it's more to acknowledge that
no progress can be expected here anymore. You are still free to push a
new patch for this bug. If you can reproduce it on the current master
code or on a maintained stable branch, please switch it to "Confirmed".

Changed in nova:
status: Incomplete → Won't Fix
assignee: Roman Podoliaka (rpodolyaka) → nobody
Revision history for this message
Shoham Peller (shoham-peller) wrote :

Happened to me in Liberty:

ERROR nova.servicegroup.drivers.db [-] Failed reporting state, will try next interval (part of a looping call)
ERROR nova.servicegroup.drivers.db Traceback (most recent call last):
ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 86, in _report_state
ERROR nova.servicegroup.drivers.db service.service_ref.report_count += 1
ERROR nova.servicegroup.drivers.db AttributeError: 'NoneType' object has no attribute 'report_count'

Changed in nova:
status: Won't Fix → New
Sean Dague (sdague)
Changed in nova:
status: New → Incomplete
Revision history for this message
John Garbutt (johngarbutt) wrote :

This sounds like it's a new bug; have you looked at bugs 1517926 and 1505471?

We might want to re-close this, and instead look at backporting the above fixes, if they have not already been backported.

Revision history for this message
Sean Dague (sdague) wrote :

@speller can you provide the information above?

What does your configuration look like?

Which hypervisor? Is this cells? What base?

The very narrow stack trace isn't enough to be actionable.

Revision history for this message
Sean Dague (sdague) wrote :

It would also be good to figure out what errors your conductor is having, because that seems to be the main trigger for this.

Revision history for this message
Shoham Peller (shoham-peller) wrote :

@sdague I don't know what questions you're asking about.

hypervisor is libvirt, no cells, and Liberty is the base
Also, attached is the full nova-compute log with the exception

Any other information I can provide?

Revision history for this message
Shoham Peller (shoham-peller) wrote :

Attached is the nova-conductor log as well.
There are errors at the end regarding message handling. Could they be causing this bug?

Revision history for this message
Shoham Peller (shoham-peller) wrote :

@johngarbutt Both those bug fixes were backported to nova 12.0.1; I'm using 12.0.0.
I'll try to test with those fixes, so it seems this can be closed for now.

Revision history for this message
Shoham Peller (shoham-peller) wrote :

This happened to me again after porting those 2 commits on top of version 12.0.0. Here is my git log showing that:

113750b servicegroup: stop zombie service due to exception
736fa6a Handle DB failures in servicegroup DB driver
6df6ad3 (tag: 12.0.0.0rc3, tag: 12.0.0) Omnibus stable/liberty fix
b5b9f54 update dependencies based on rdopkg reqcheck
c920b8f Update to upstream 12.0.0.0rc2
ba6fccc Merge "Fix InstanceV1 backports to use context" into stable/liberty
6062043 Revert "[libvirt] Move cleanup of imported files to imagebackend"

The traceback I see in nova-compute:

2016-04-10 19:54:52,014.014 25926 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 87, in _report_state
2016-04-10 19:54:52,014.014 25926 ERROR nova.servicegroup.drivers.db service.service_ref.report_count += 1
2016-04-10 19:54:52,014.014 25926 ERROR nova.servicegroup.drivers.db AttributeError: 'NoneType' object has no attribute 'report_count'

Attached is the nova-compute log with the traceback.

Revision history for this message
Shoham Peller (shoham-peller) wrote :

And attaching the nova-conductor log

Changed in nova:
status: Incomplete → New
Revision history for this message
jizhilong (zlji) wrote :

Hi Shoham Peller, which version of oslo.messaging are you using?

Revision history for this message
Shoham Peller (shoham-peller) wrote :

@zlji oslo.messaging's version is 2.5.0

Revision history for this message
Pushkar Umaranikar (pushkar-umaranikar) wrote :

Resolving inconsistency: the bug priority is already set, hence changing status from "New" to "Confirmed".

Changed in nova:
status: New → Confirmed
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which led to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (LIBERTY, MITAKA, OCATA, NEWTON).
  Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
importance: Medium → Undecided
status: Confirmed → Expired