multinode neutron grenade job times out on Xenial

Bug #1647431 reported by Daniel Alvarez
Affects: neutron
Status: Confirmed
Importance: Critical
Assigned to: Unassigned

Bug Description

The gate-grenade-dsvm-neutron-multinode-ubuntu-xenial job is failing on the neutron gate.

I have checked some other patches and it looks like the job doesn't fail on them, so apparently the failure is not deterministic.

From the logs:

[1]
2016-12-05 09:07:46.832799 | ERROR: the main setup script run by this job failed - exit code: 124

[2]
2016-12-05 09:07:10.778 | + /opt/stack/new/grenade/projects/70_cinder/resources.sh:destroy:207 : timeout 30 sh -c 'while openstack server show cinder_server1 >/dev/null; do sleep 1; done'
2016-12-05 09:07:40.781 | + /opt/stack/new/grenade/projects/70_cinder/resources.sh:destroy:1 : exit_trap
2016-12-05 09:07:40.782 | + /opt/stack/new/grenade/functions:exit_trap:103 : local r=124

[1] http://logs.openstack.org/40/402140/7/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/ad0cf41/console.html
[2] http://logs.openstack.org/40/402140/7/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/ad0cf41/logs/grenade.sh.txt.gz
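
For reference, the destroy step in [2] is just a bounded poll waiting for the server to disappear; below is a minimal Python rendering of the same pattern (the resource name and CLI call come from the log excerpt, the 30-second bound and exit code 124 match GNU timeout's behaviour, the rest is purely illustrative):

# Rough Python equivalent of:
#   timeout 30 sh -c 'while openstack server show cinder_server1 >/dev/null; do sleep 1; done'
# Keep polling until the server is gone; give up after 30 seconds and
# return 124, the exit code GNU timeout uses and the one seen in [1].
import subprocess
import time

def wait_for_server_gone(name="cinder_server1", deadline=30):
    start = time.monotonic()
    while time.monotonic() - start < deadline:
        rc = subprocess.run(
            ["openstack", "server", "show", name],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode
        if rc != 0:        # "server show" fails once the server is deleted
            return 0
        time.sleep(1)
    return 124             # deletion never finished within the deadline

if __name__ == "__main__":
    raise SystemExit(wait_for_server_gone())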

Tags: gate-failure
John Schwarz (jschwarz) wrote :
Changed in neutron:
status: New → Confirmed
importance: Undecided → Critical
Brian Haley (brian-haley) wrote :

This might be due to a recent DHCP change that can sometimes cause the agent not to start correctly after an agent upgrade, which is exactly what the grenade job exercises.

I re-opened the original bug and have a change up for review:

https://bugs.launchpad.net/neutron/+bug/1627902

https://review.openstack.org/#/c/406428/

Once that fix merges we should recheck to verify.

Matt Riedemann (mriedem) wrote :
summary: - grenade job times out on Xenial
+ multinode neutron grenade job times out on Xenial
Matt Riedemann (mriedem) wrote :

Looks like it's mostly the multinode jobs that hit this failure.

Matt Riedemann (mriedem) wrote :

This looks pretty bad in the c-vol logs on the new side around the time that we're trying to delete the cinder server:

http://logs.openstack.org/40/402140/7/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/ad0cf41/logs/new/screen-c-vol.txt.gz?level=TRACE

2016-12-05 09:05:33.071 ERROR oslo_service.service [req-24b2e43c-5d5e-4d60-823c-7c7f73060e7c None] Error starting thread.
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service Traceback (most recent call last):
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service File "/usr/local/lib/python2.7/dist-packages/oslo_service/service.py", line 722, in run_service
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service service.start()
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service File "/opt/stack/new/cinder/cinder/service.py", line 239, in start
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service service_id=Service.service_id)
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service File "/opt/stack/new/cinder/cinder/volume/manager.py", line 461, in init_host
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service self.publish_service_capabilities(ctxt)
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service File "/opt/stack/new/cinder/cinder/volume/manager.py", line 2018, in publish_service_capabilities
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service self._publish_service_capabilities(context)
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service File "/opt/stack/new/cinder/cinder/manager.py", line 181, in _publish_service_capabilities
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service self.last_capabilities)
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service File "/opt/stack/new/cinder/cinder/scheduler/rpcapi.py", line 149, in notify_service_capabilities
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service capabilities=capabilities)
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 144, in cast
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service self._check_version_cap(msg.get('version'))
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 121, in _check_version_cap
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service version_cap=self.version_cap)
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service RPCVersionCapError: Requested message version, 3.1 is incompatible. It needs to be equal in major version and less than or equal in minor version as the specified version cap 3.0.
2016-12-05 09:05:33.071 10423 ERROR oslo_service.service
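
The RPCVersionCapError at the bottom is oslo.messaging refusing to send a message whose version is higher than the configured cap. The rule it enforces is roughly the following (a standalone sketch of the check for illustration, not the actual oslo.messaging code):

# A requested RPC version is only compatible if its major version equals
# the cap's major version and its minor version is <= the cap's minor
# version. With a cap of 3.0, a 3.1 cast is rejected, which is exactly
# the error in the traceback above.
def version_allowed(requested: str, cap: str) -> bool:
    req_major, req_minor = (int(x) for x in requested.split("."))
    cap_major, cap_minor = (int(x) for x in cap.split("."))
    return req_major == cap_major and req_minor <= cap_minor

assert version_allowed("3.0", "3.0")
assert not version_allowed("3.1", "3.0")   # the case hit by c-vol here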

Matt Riedemann (mriedem) wrote :

The cinder failure is a red herring; see bug 1647789.

Matt Riedemann (mriedem) wrote :

So I debugged from this failure:

http://logs.openstack.org/40/402140/7/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/ad0cf41/logs/grenade.sh.txt.gz#_2016-12-05_09_07_08_778

The server uuid there is 7e8aad9d-50e1-4b6c-8418-7395a035256f.

I traced that through to n-cpu where nova calls os-terminate_connection in cinder:

http://logs.openstack.org/40/402140/7/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/ad0cf41/logs/new/screen-n-cpu.txt.gz#_2016-12-05_09_07_13_263

Tracing that request to c-api gets me to the point where c-api does an RPC CALL to c-vol:

http://logs.openstack.org/40/402140/7/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/ad0cf41/logs/new/screen-c-api.txt.gz#_2016-12-05_09_07_13_391

At that point it hangs: the RPC call never comes back, so we time out and die. I also don't see the c-api request req-42b76e49-9bec-4ac5-b457-938bc82b3649 show up in c-vol, so it looks like the message is just dropped.
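
For anyone redoing this trace locally, the check is simply whether the c-api request ID shows up in each service log. A small sketch, assuming the screen-c-api and screen-c-vol logs linked above have been downloaded next to the script (the file names come from the log URLs, the local paths are illustrative):

# Count occurrences of the c-api request ID in each downloaded log.
# Zero hits in screen-c-vol.txt.gz matches the observation that the RPC
# call was dropped on the c-vol side.
import gzip

REQ_ID = "req-42b76e49-9bec-4ac5-b457-938bc82b3649"
LOGS = ["screen-c-api.txt.gz", "screen-c-vol.txt.gz"]

for path in LOGS:
    with gzip.open(path, "rt", errors="replace") as f:
        hits = [line.rstrip() for line in f if REQ_ID in line]
    print(f"{path}: {len(hits)} line(s) mention the request ID")
    for line in hits[:5]:   # show at most a few matches per file
        print("   ", line)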

Matt Riedemann (mriedem) wrote :

OK, so this is probably bug 1647789: the c-vol startup fails to notify the scheduler about service capabilities, which kills the service thread, so the RPC message never reaches the c-vol service.
