grenade failures on mitaka side with failure prepping block device due to RPC timeout

Bug #1503290 reported by Matt Riedemann
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

http://logs.openstack.org/10/230510/2/gate/gate-grenade-dsvm/e796b39/logs/new/screen-n-cpu.txt.gz?level=TRACE#_2015-10-06_05_39_36_682

2015-10-06 05:39:36.682 ERROR nova.compute.manager [req-03e13a69-3da6-4bc6-bcab-3b2a3fe4d236 tempest-ServersTestManualDisk-1771971010 tempest-ServersTestManualDisk-1239478904] [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] Build of instance 816e60bb-0085-4b0b-9a00-dbd24c339959 aborted: Failure prepping block device.
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] Traceback (most recent call last):
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] File "/opt/stack/new/nova/nova/compute/manager.py", line 1906, in _do_build_and_run_instance
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] filter_properties)
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] File "/opt/stack/new/nova/nova/compute/manager.py", line 2034, in _build_and_run_instance
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] 'create.error', fault=e)
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] six.reraise(self.type_, self.value, self.tb)
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] File "/opt/stack/new/nova/nova/compute/manager.py", line 1999, in _build_and_run_instance
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] block_device_mapping) as resources:
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] return self.gen.next()
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] File "/opt/stack/new/nova/nova/compute/manager.py", line 2165, in _build_resources
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] reason=msg)
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959] BuildAbortException: Build of instance 816e60bb-0085-4b0b-9a00-dbd24c339959 aborted: Failure prepping block device.
2015-10-06 05:39:36.682 5634 ERROR nova.compute.manager [instance: 816e60bb-0085-4b0b-9a00-dbd24c339959]

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiX2RvX2J1aWxkX2FuZF9ydW5faW5zdGFuY2VcIiBBTkQgbWVzc2FnZTpcImFib3J0ZWQ6IEZhaWx1cmUgcHJlcHBpbmcgYmxvY2sgZGV2aWNlXCIgQU5EIHRhZ3M6XCJzY3JlZW4tbi1jcHUudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0NDQxMzg1OTE3NjV9

7 hits in 7 days, check and gate, all failures, master branch only.

Tags: volumes
Matt Riedemann (mriedem)
tags: added: volumes
Changed in nova:
status: New → Confirmed
importance: Undecided → High
summary: - grenade failures on mitaka side with failure prepping block device
+ grenade failures on mitaka side with failure prepping block device due
+ to RPC timeout
Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm wondering if this is related to the recent devstack change on master that dropped conductor workers from $nproc (8) to 2.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looking at dstat around 05:39:36 when we had the messaging timeout, cinder is doing a lot of work, not conductor, so we're probably timing out waiting for a response from cinder-api for something like initializing a connection.

http://logs.openstack.org/10/230510/2/gate/gate-grenade-dsvm/e796b39/logs/new/screen-dstat.txt.gz

Note that right before this failure we're running test_volume_boot_pattern, which is a boot-from-volume test:

http://logs.openstack.org/10/230510/2/gate/gate-grenade-dsvm/e796b39/logs/new/screen-n-cpu.txt.gz?level=TRACE#_2015-10-06_05_39_33_329

So around the time that we timeout prepping block devices for one instance, we're attaching a newly created volume to another instance in a concurrently running test.
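
For reference, a minimal sketch of the kind of cinder-api call nova makes while attaching a volume; credentials, endpoint, connector values and volume id below are all made up, not taken from the failing job. If cinder is busy, calls like this are the ones that would answer slowly:

    # Illustrative only: hypothetical credentials, endpoint, connector and volume id.
    import time

    from cinderclient.v2 import client as cinder_client

    cinder = cinder_client.Client('demo', 'secret', 'demo',
                                  auth_url='http://127.0.0.1:5000/v2.0')
    connector = {'ip': '10.0.0.5', 'host': 'devstack',
                 'initiator': 'iqn.2015-10.org.example:node1'}

    start = time.time()
    # Ask cinder-api to prepare the volume for attachment to this host.
    conn_info = cinder.volumes.initialize_connection('VOLUME-UUID', connector)
    print('initialize_connection took %.1fs' % (time.time() - start))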

Revision history for this message
Matt Riedemann (mriedem) wrote :

There is a lock being held in c-vol that is taking a while, 16 seconds in one case; I'm not sure whether that's a lot or not:

http://logs.openstack.org/10/230510/2/gate/gate-grenade-dsvm/e796b39/logs/new/screen-c-vol.txt.gz#_2015-10-06_05_40_15_936

2015-10-06 05:40:15.936 DEBUG oslo_concurrency.lockutils [req-6b402bc1-3c24-47f8-be65-609bfb98c8fd tempest-TestVolumeBootPatternV2-836916434] Lock "35fbbb48-90a3-45e2-9275-d66c9b8e9b2b-delete_snapshot" released by "cinder.volume.manager._run_flow_locked" :: held 16.103s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:265
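
For context, the "held 16.103s" figure comes from oslo.concurrency, which logs the hold time when a lock is released. A minimal sketch of how a per-snapshot lock like this is typically taken; cinder builds the lock name dynamically, the literal string here just mirrors the log line:

    from oslo_concurrency import lockutils

    @lockutils.synchronized('35fbbb48-90a3-45e2-9275-d66c9b8e9b2b-delete_snapshot')
    def _run_flow_locked():
        # Everything in here runs with the lock held; when the lock is
        # released, oslo.concurrency emits the DEBUG line quoted above,
        # including how long it was held (16.103s in this case).
        pass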

Revision history for this message
Matt Riedemann (mriedem) wrote :

Cinder is not involved here; we're hitting the RPC timeout when calling instance.save() in nova.compute.manager's _build_resources here:

            LOG.debug('Start building block device mappings for instance.',
                      instance=instance)
            instance.vm_state = vm_states.BUILDING
            instance.task_state = task_states.BLOCK_DEVICE_MAPPING
            instance.save()
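
For context, instance.save() looks like a local object call, but with conductor in the picture it round-trips over RPC, and the except-Exception block in _build_resources turns any failure there, including a messaging timeout, into the "Failure prepping block device" abort in the trace above. A minimal sketch of that failure mode (the helper name is hypothetical, not nova code):

    import oslo_messaging

    def save_or_report_timeout(instance):
        """Hypothetical helper: make the conductor RPC timeout explicit."""
        try:
            # Round-trips through nova-conductor; with conductor not answering,
            # oslo.messaging raises after rpc_response_timeout (60s by default).
            instance.save()
        except oslo_messaging.MessagingTimeout:
            # This is the timeout that aborts the build.
            raise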

And conductor is in bad shape:

http://logs.openstack.org/10/230510/2/gate/gate-grenade-dsvm/e796b39/logs/new/screen-n-cond.txt.gz?level=TRACE

When we failed, we fall into the gap between:

http://logs.openstack.org/10/230510/2/gate/gate-grenade-dsvm/e796b39/logs/new/screen-n-cond.txt.gz#_2015-10-06_05_39_34_313

And

http://logs.openstack.org/10/230510/2/gate/gate-grenade-dsvm/e796b39/logs/new/screen-n-cond.txt.gz#_2015-10-06_05_39_43_085

during which nothing is happening.

no longer affects: cinder
Revision history for this message
Michal Dulko (michal-dulko-f) wrote :

The lock part is caused by the fact that we're locking source snapshot deletion while we're creating a volume from it. I don't think that's the root of the problem. I'd rather look at it from the other side: it's taking us 16 seconds to create a volume from a snapshot. Why? Is that the cause of the timeout?

From logs for this particular request (6b402bc1-3c24-47f8-be65-609bfb98c8fd) it looks like this dd is taking the most time: http://logs.openstack.org/10/230510/2/gate/gate-grenade-dsvm/e796b39/logs/new/screen-c-vol.txt.gz#_2015-10-06_05_40_00_759
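
For reference, a rough sketch of the dd-based copy behind that log line, assuming the usual oslo.concurrency execute helper cinder uses; the paths and size are made up and run_as_root/rootwrap details are omitted:

    from oslo_concurrency import processutils

    def copy_snapshot_to_volume(src_path, dest_path, size_in_mb, blocksize='1M'):
        """Hypothetical stand-in for the dd-based copy cinder performs."""
        processutils.execute('dd',
                             'if=%s' % src_path,
                             'of=%s' % dest_path,
                             'count=%d' % size_in_mb,
                             'bs=%s' % blocksize,
                             'conv=fdatasync')

    # e.g. copy_snapshot_to_volume('/dev/stack-volumes/_snapshot-xyz',
    #                              '/dev/stack-volumes/volume-abc', 1024)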

Revision history for this message
Matt Riedemann (mriedem) wrote :

Maybe this could be related to bug 1446583.

Revision history for this message
Chuck Carmack (chuckcarmack75) wrote :

It might be related to bug 1318721

Revision history for this message
Matt Riedemann (mriedem) wrote :

This isn't showing up anymore and bug 1446583 is fixed, so I'm assuming it was related to that.

Changed in nova:
status: Confirmed → Invalid