multi-node test causes nova-compute to lockup

Bug #1462305 reported by John Garbutt on 2015-06-05
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: High
Assigned to: Davanum Srinivas (DIMS)

Bug Description

It's not very clear what's going on here, but here is the symptom.

One of the nova-compute nodes appears to lock up:
http://logs.openstack.org/67/175067/2/check/check-tempest-dsvm-multinode-full/7a95fb0/logs/screen-n-cpu.txt.gz#_2015-05-29_23_27_48_296
It was just completing the termination of an instance:
http://logs.openstack.org/67/175067/2/check/check-tempest-dsvm-multinode-full/7a95fb0/logs/screen-n-cpu.txt.gz#_2015-05-29_23_27_48_153

This is also seen in the scheduler reporting the node as down:
http://logs.openstack.org/67/175067/2/check/check-tempest-dsvm-multinode-full/7a95fb0/logs/screen-n-sch.txt.gz#_2015-05-29_23_31_02_711

On further inspection it seems like the other nova compute node had just started a migration:
http://logs.openstack.org/67/175067/2/check/check-tempest-dsvm-multinode-full/7a95fb0/logs/subnode-2/screen-n-cpu.txt.gz#_2015-05-29_23_27_48_079

We have had issues in the past where oslo locks can lead to deadlocks; it's not totally clear if that's happening here. All the periodic tasks happen in the same greenlet, so you can stop them from running if you hold a lock in an RPC call that's being processed, etc. No idea if that's happening here, though.
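To make the suspected failure mode concrete, here is a minimal sketch of one handler holding a lock while a periodic task waits on it. It is an illustration only: it uses OS threads and `threading.Lock` as stand-ins for eventlet greenlets and oslo locks, and the names `rpc_handler`, `periodic_task` and `run_demo` are hypothetical, not nova code.

```python
import threading


def run_demo():
    """Show a periodic task starved by a handler that holds the same lock."""
    events = []
    lock = threading.Lock()        # stand-in for a per-instance named lock
    holding = threading.Event()    # set once the handler owns the lock
    finish = threading.Event()     # set to let the stalled handler return

    def rpc_handler():
        # Simulates an RPC call that takes the lock and then stalls.
        with lock:
            events.append("rpc: acquired")
            holding.set()
            finish.wait()
        events.append("rpc: released")

    def periodic_task(timeout):
        # Simulates a periodic task that needs the same lock.
        # A timeout is used only so the demo terminates; a waiter
        # without a timeout would block for as long as the lock is held.
        if not lock.acquire(timeout=timeout):
            events.append("periodic: blocked")
            return False
        try:
            events.append("periodic: ran")
            return True
        finally:
            lock.release()

    worker = threading.Thread(target=rpc_handler)
    worker.start()
    holding.wait()                       # handler now owns the lock
    first = periodic_task(timeout=0.1)   # cannot run while the lock is held
    finish.set()                         # let the handler complete
    worker.join()
    second = periodic_task(timeout=0.1)  # runs once the lock is free
    return first, second, events
```

If the handler never returned, every later attempt to take that lock would behave like the first `periodic_task` call here.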

Changed in nova:
status: New → Incomplete
assignee: nobody → Joe Gordon (jogo)
tags: added: testing
Joe Gordon (jogo) wrote :

It looks like the delete operation is coming from tempest. But the command never finishes, since the lock that do_terminate_instance uses is never released:

' Lock "e701630a-e0f0-4228-ac9b-475604ac3479" acquired by "do_terminate_instance"'

http://logs.openstack.org/67/175067/2/check/check-tempest-dsvm-multinode-full/7a95fb0/logs/screen-n-cpu.txt.gz#_2015-05-29_23_27_47_445
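One way to look for this pattern in a compute log is to pair each "acquired" line with a later "released" line and report what is left over. The sketch below assumes the acquire message has the shape quoted above; the matching `released by` pattern is an assumption about the log format, and every sample line except the first is made up for illustration.

```python
import re

# Shape taken from the log line quoted in this report.
ACQUIRED = re.compile(r'Lock "(?P<name>[^"]+)" acquired by "(?P<func>[^"]+)"')
# Assumed counterpart for the release message; adjust to the real log format.
RELEASED = re.compile(r'Lock "(?P<name>[^"]+)" released by "(?P<func>[^"]+)"')


def unreleased_locks(lines):
    """Return {(lock_name, function): line_number} for locks never released."""
    held = {}
    for lineno, line in enumerate(lines, start=1):
        match = ACQUIRED.search(line)
        if match:
            held[(match["name"], match["func"])] = lineno
            continue
        match = RELEASED.search(line)
        if match:
            # Drop the matching acquire; ignore releases we never saw acquired.
            held.pop((match["name"], match["func"]), None)
    return held


# First line is from this report; the other two are hypothetical examples.
SAMPLE = [
    'Lock "e701630a-e0f0-4228-ac9b-475604ac3479" acquired by "do_terminate_instance"',
    'Lock "example-lock" acquired by "example_function"',
    'Lock "example-lock" released by "example_function"',
]
```

Calling `unreleased_locks(SAMPLE)` leaves only the `do_terminate_instance` entry, keyed to the line where it was acquired.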

Joe Gordon (jogo) wrote :

Fingerprint: message:"has not been heard from in a while" AND tags:"screen-n-sch.txt" AND build_name:"check-tempest-dsvm-multinode-full"

Joe Gordon (jogo) wrote :

After looking into this further, it looks like this happens on either node in the multinode job, always ending in the same place (an error in delete causing nova-compute to hang).

John Garbutt (johngarbutt) wrote :

Making this High, because it is blocking making the multi-node job voting.

Changed in nova:
importance: Undecided → High
Markus Zoeller (markus_z) (mzoeller) wrote :

I'm setting it to "in progress" because "jogo" is set as assignee. Or does the combination "incomplete" + "assignee" have a special meaning?

Changed in nova:
status: Incomplete → In Progress
Joe Gordon (jogo) wrote :

Attempted to trigger a guru meditation report (by sending SIGUSR1) on the hung nova-compute, but it doesn't respond.
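For reference, sending the signal can be sketched as below. This is a self-contained illustration, not the report machinery itself: `report_handler` is a hypothetical stand-in that only records the signal, and the demo signals its own process. One relevant fact about CPython is that Python-level signal handlers run in the main thread only when it returns to the interpreter, so a process stuck inside a blocking native call may never run its handler. Whether that is what happened here is not established.

```python
import os
import signal
import time

received = []


def report_handler(signum, frame):
    # Hypothetical stand-in for a report generator: just record the signal.
    received.append(signum)


def request_report(pid, sig=signal.SIGUSR1):
    """Send the report-trigger signal to the given process."""
    os.kill(pid, sig)


def demo(timeout=1.0):
    """Signal our own process and wait briefly for the handler to run."""
    previous = signal.signal(signal.SIGUSR1, report_handler)
    try:
        request_report(os.getpid())
        deadline = time.monotonic() + timeout
        while not received and time.monotonic() < deadline:
            time.sleep(0.01)
        return bool(received)
    finally:
        # Restore whatever handler was installed before the demo.
        signal.signal(signal.SIGUSR1, previous)
```

A healthy process returns `True` from `demo()` almost immediately; a process that never re-enters the interpreter would hit the timeout instead.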

Joe Gordon (jogo) wrote :

Next step is to attach gdb and get a stack trace.

Joe Gordon (jogo) on 2015-08-26
Changed in nova:
assignee: Joe Gordon (jogo) → nobody
Changed in nova:
status: In Progress → Confirmed
assignee: nobody → Davanum Srinivas (DIMS) (dims-v)
status: Confirmed → Invalid