lvcreate sometimes hangs and does not return; subsequent lvcreate commands hang from then on

Bug #1912031 reported by melanie witt
Affects: Cinder
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

Seen in a recent gate job on the stable/ussuri branch [1] where requests to create a volume get stuck in status 'creating' and don't progress further. Example test output:

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/opt/stack/tempest/tempest/api/compute/admin/test_volumes_negative.py", line 100, in test_multiattach_rw_volume_update_failure
    vol1 = self.create_volume(multiattach=True)
  File "/opt/stack/tempest/tempest/api/compute/base.py", line 528, in create_volume
    volume['id'], 'available')
  File "/opt/stack/tempest/tempest/common/waiters.py", line 280, in wait_for_volume_resource_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: volume 75bbdc7b-8a79-45ed-8883-6389b6837af8 failed to reach available status (current creating) within the required time (196 s).

In this example, we see the create_volume action time out because the volume doesn't reach available status and stays in 'creating'.

Looking at the c-vol logs for volume 75bbdc7b-8a79-45ed-8883-6389b6837af8, we see the following messages for lvcreate (the last messages referencing this volume) [2]:

Jan 14 19:44:41.822811 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 sudo[9486]: stack : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/bin/cinder-rootwrap /etc/cinder/rootwrap.conf env LC_ALL=C lvcreate -T -V 1g -n volume-75bbdc7b-8a79-45ed-8883-6389b6837af8 stack-volumes-lvmdriver-1/stack-volumes-lvmdriver-1-pool
Jan 14 19:44:42.011173 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 cinder-volume[28292]: INFO cinder.volume.flows.manager.create_volume [None req-85d09fb4-356f-4333-9eb5-c79a1ac1a0ea tempest-UpdateMultiattachVolumeNegativeTest-2119626783 None] Volume 75bbdc7b-8a79-45ed-8883-6389b6837af8: being created as raw with specification: {'status': 'creating', 'volume_name': 'volume-75bbdc7b-8a79-45ed-8883-6389b6837af8', 'volume_size': 1}
Jan 14 19:44:42.011722 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 cinder-volume[28292]: DEBUG oslo_concurrency.processutils [None req-85d09fb4-356f-4333-9eb5-c79a1ac1a0ea tempest-UpdateMultiattachVolumeNegativeTest-2119626783 None] Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf env LC_ALL=C lvcreate -T -V 1g -n volume-75bbdc7b-8a79-45ed-8883-6389b6837af8 stack-volumes-lvmdriver-1/stack-volumes-lvmdriver-1-pool {{(pid=28393) execute /usr/local/lib/python3.6/dist-packages/oslo_concurrency/processutils.py:379}}

Note that there is no matching "returned <code>" message.
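
If the node is still accessible, one way to confirm the hang (a diagnostic sketch, not something captured in these job logs) is to check whether the lvcreate process is stuck in uninterruptible sleep and where it is blocked in the kernel:

# D in the STAT column means uninterruptible sleep, typically blocked on I/O
ps -eo pid,stat,wchan:30,args | grep '[l]vcreate'
# Inspect the kernel stack of the stuck process (the PID here is illustrative)
sudo cat /proc/9486/stack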

Here is what normal/good output looks like for lvcreate; this example is from earlier in the same job run [3]:

Jan 14 19:39:06.220923 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 cinder-volume[28292]: INFO cinder.volume.flows.manager.create_volume [None req-d37cf946-69e6-4e0a-bceb-900b6392c88e tempest-AttachSCSIVolumeTestJSON-236492775 None] Volume e24aa42f-be18-4f29-9696-1dcf9796c67e: being created as raw with specification: {'status': 'creating', 'volume_name': 'volume-e24aa42f-be18-4f29-9696-1dcf9796c67e', 'volume_size': 1}
Jan 14 19:39:06.221386 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 cinder-volume[28292]: DEBUG oslo_concurrency.processutils [None req-d37cf946-69e6-4e0a-bceb-900b6392c88e tempest-AttachSCSIVolumeTestJSON-236492775 None] Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf env LC_ALL=C lvcreate -T -V 1g -n volume-e24aa42f-be18-4f29-9696-1dcf9796c67e stack-volumes-lvmdriver-1/stack-volumes-lvmdriver-1-pool {{(pid=28393) execute /usr/local/lib/python3.6/dist-packages/oslo_concurrency/processutils.py:379}}
Jan 14 19:39:06.235119 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 sudo[5649]: stack : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/bin/cinder-rootwrap /etc/cinder/rootwrap.conf env LC_ALL=C lvcreate -T -V 1g -n volume-e24aa42f-be18-4f29-9696-1dcf9796c67e stack-volumes-lvmdriver-1/stack-volumes-lvmdriver-1-pool
Jan 14 19:39:11.441601 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 cinder-volume[28292]: DEBUG oslo_concurrency.processutils [None req-d37cf946-69e6-4e0a-bceb-900b6392c88e tempest-AttachSCSIVolumeTestJSON-236492775 None] CMD "sudo cinder-rootwrap /etc/cinder/rootwrap.conf env LC_ALL=C lvcreate -T -V 1g -n volume-e24aa42f-be18-4f29-9696-1dcf9796c67e stack-volumes-lvmdriver-1/stack-volumes-lvmdriver-1-pool" returned: 0 in 5.220s {{(pid=28393) execute /usr/local/lib/python3.6/dist-packages/oslo_concurrency/processutils.py:416}}
Jan 14 19:39:11.463393 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 cinder-volume[28292]: INFO cinder.volume.flows.manager.create_volume [None req-d37cf946-69e6-4e0a-bceb-900b6392c88e tempest-AttachSCSIVolumeTestJSON-236492775 None] Volume volume-e24aa42f-be18-4f29-9696-1dcf9796c67e (e24aa42f-be18-4f29-9696-1dcf9796c67e): created successfully

Here ^ we see the command run and the matching "returned <code>" message, followed by the volume being created successfully.

After an lvcreate hangs the way it did in the first example, no subsequent attempt to create a volume succeeds: each one hangs in its lvcreate command as well. In this example job run, after the first lvcreate command hangs, four more lvcreate commands hang while trying to create various other volumes for other tests. A rough way to count the hung invocations is sketched below.
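
Pairing the "Running cmd" lines with their "returned" counterparts in the c-vol log gives the count (a sketch against the oslo_concurrency log lines quoted above; every healthy run logs both lines):

# Each healthy invocation logs a "Running cmd" line and a matching
# "returned: <code>" line; the difference is the number of hung commands.
started=$(grep -c 'Running cmd (subprocess).*lvcreate -T' screen-c-vol.txt)
finished=$(grep -c 'lvcreate -T.*" returned:' screen-c-vol.txt)
echo "hung lvcreate commands: $((started - finished))"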

I found this old bug while googling about this lvcreate hang behavior:

https://bugs.launchpad.net/cinder/+bug/1366125

which appears to have been resolved by a later Ubuntu version (the affected version being 14.04 at the time).

This job on the stable/ussuri branch is running Ubuntu 18.04:

2021-01-14 18:34:23.176194 | localhost | # Node Information
2021-01-14 18:34:23.176292 | localhost | Inventory Hostname: controller
2021-01-14 18:34:23.176381 | localhost | Hostname: ubuntu-bionic-vexxhost-ca-ymq-1-0022542876
2021-01-14 18:34:23.176470 | localhost | Username: zuul
2021-01-14 18:34:23.176563 | localhost | Distro: Ubuntu 18.04
2021-01-14 18:34:23.176660 | localhost | Provider: vexxhost-ca-ymq-1
2021-01-14 18:34:23.176748 | localhost | Label: ubuntu-bionic
2021-01-14 18:34:23.176924 | localhost | Interface IP: 162.253.55.157

[1] https://zuul.opendev.org/t/openstack/build/20e8210fd0df40669c4b208df7f5de24
[2] https://zuul.opendev.org/t/openstack/build/20e8210fd0df40669c4b208df7f5de24/log/controller/logs/screen-c-vol.txt#1338
[3] https://zuul.opendev.org/t/openstack/build/20e8210fd0df40669c4b208df7f5de24/log/controller/logs/screen-c-vol.txt#950

Tags: gate-failure
melanie witt (melwitt) wrote:

Here's the matching good lvcreate command from syslog [1]:

Jan 14 19:39:06 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 sudo[5649]: stack : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/bin/cinder-rootwrap /etc/cinder/rootwrap.conf env LC_ALL=C lvcreate -T -V 1g -n volume-e24aa42f-be18-4f29-9696-1dcf9796c67e stack-volumes-lvmdriver-1/stack-volumes-lvmdriver-1-pool
Jan 14 19:39:06 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 sudo[5649]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jan 14 19:39:11 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 sudo[5649]: pam_unix(sudo:session): session closed for user root

and the matching first bad lvcreate command [2]; note that the session is opened but never closed:

Jan 14 19:44:41 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 sudo[9486]: stack : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/bin/cinder-rootwrap /etc/cinder/rootwrap.conf env LC_ALL=C lvcreate -T -V 1g -n volume-75bbdc7b-8a79-45ed-8883-6389b6837af8 stack-volumes-lvmdriver-1/stack-volumes-lvmdriver-1-pool
Jan 14 19:44:41 ubuntu-bionic-vexxhost-ca-ymq-1-0022542876 sudo[9486]: pam_unix(sudo:session): session opened for user root by (uid=0)

[1] https://zuul.opendev.org/t/openstack/build/20e8210fd0df40669c4b208df7f5de24/log/controller/logs/syslog.txt#4361-4363
[2] https://zuul.opendev.org/t/openstack/build/20e8210fd0df40669c4b208df7f5de24/log/controller/logs/syslog.txt#4522-4523

Changed in cinder:
status: New → Triaged
importance: Undecided → Medium
melanie witt (melwitt) wrote:

Are there any notes to paste here regarding the Triaged status? Has the team identified a root cause, or does it have ideas on how to fix it?

Gorka Eguileor (gorka) wrote:

In my experience this happens when LVM filtering is not configured to ignore all new volumes, so LVM constantly scans the volumes that os-brick attaches to the host.

Specifically, it happens when there are leftover volumes; in other words, there is a block device under /dev but the storage array is no longer exporting/mapping that volume to the host, so any read from the device blocks indefinitely.
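
A quick way to spot such a leftover device (a sketch; /dev/sdc is an illustrative device name) is that even a single direct read never completes:

# A healthy device returns immediately; a device whose backing volume was
# unmapped on the array blocks, so timeout kills dd and exits with status 124
sudo timeout 5 dd if=/dev/sdc of=/dev/null bs=4096 count=1 iflag=direct \
  || echo 'read hung or failed'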

Removing these leftover devices unblocks the hung LVM call, though the right solution is setting the global_filter in /etc/lvm/lvm.conf.
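
For example, with the devstack LVM backend the filter can accept only the device backing the stack-volumes-lvmdriver-1 volume group and reject everything else (a sketch; /dev/loop2 is an assumption, the actual backing device depends on the deployment):

# /etc/lvm/lvm.conf
devices {
    # Accept only the VG backing device (assumed /dev/loop2 here) and reject
    # all other block devices, so LVM never scans volumes that os-brick
    # attaches to the host
    global_filter = [ "a|^/dev/loop2$|", "r|.*|" ]
}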
