LVM/DM/UDEV out of sync inside cinder-volumes container

Bug #1436999 reported by Bjoern
Affects            Status        Importance  Assigned to  Milestone
OpenStack-Ansible  Fix Released  Critical    git-harry
  Icehouse         Fix Released  Critical    git-harry
  Juno             Fix Released  Critical    git-harry
  Kilo             Fix Released  Critical    git-harry
  Trunk            Fix Released  Critical    git-harry

Bug Description

Apparently we can get into a situation where DM and udev fall out of sync, causing all kinds of issues that leave cinder-volumes LVM devices inaccessible:

The /dev/mapper device is supposed to be a symlink, but we turned off udev synchronization inside LXC containers to prevent CPU storms during udev event floods. In this example, the device node points to minor number 4:

cinder_volumes_container-e8fcb28e:/lib/udev/rules.d# ll /dev/mapper/*1ecbab2b*
brw-rw---- 1 root disk 252, 4 Mar 12 17:03 /dev/mapper/cinder--volumes-volume--1ecbab2b--512a--4f7e--9889--bf663090138c

BUT dm actually accesses this volume as minor number 6:

cinder_volumes_container-e8fcb28e:/lib/udev/rules.d# dmsetup info /dev/cinder-volumes/volume-1ecbab2b-512a-4f7e-9889-bf663090138c
Name: cinder--volumes-volume--1ecbab2b--512a--4f7e--9889--bf663090138c
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 0
Event number: 0
Major, minor: 252, 6
Number of targets: 1
UUID: LVM-Ch2r8f3SdrgCZ6planIqCr6Zn77jCNAHtL19ICLAa4RdJUMiyfdlIy9NsJe4vKjQ

So it looks like we have to turn on udev sync (use_udev override) inside the cinder-volumes container.
Currently I do not know when this device reordering happens; I suspect that creating and deleting LVM volumes triggers a udev event which usually updates the DM/LVM state. I'll test this in a lab.
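The mismatch described above can be detected mechanically. A minimal shell sketch (helper names are hypothetical; assumes dmsetup and stat are available inside the container) that flags /dev/mapper nodes whose device numbers disagree with the live device-mapper table:

```shell
#!/bin/sh
# Sketch: flag /dev/mapper nodes whose device numbers disagree with the live
# device-mapper table. Helper names are hypothetical; assumes dmsetup and
# stat are available inside the container.

# Parse a "Major, minor: 252, 6" line from 'dmsetup info' into "252:6".
dm_majmin() {
    printf '%s\n' "$1" | sed -n 's/^Major, minor:[[:space:]]*\([0-9]*\),[[:space:]]*\([0-9]*\).*/\1:\2/p'
}

# Compare a node's own major:minor (stat reports them in hex) with what
# dmsetup says; print a MISMATCH line when they differ.
check_node() {
    node="$1"
    node_majmin="$(( 0x$(stat -c %t "$node") )):$(( 0x$(stat -c %T "$node") ))"
    dm_majmin_val="$(dm_majmin "$(dmsetup info "$node" | grep '^Major, minor:')")"
    if [ "$node_majmin" != "$dm_majmin_val" ]; then
        echo "MISMATCH $node: node=$node_majmin dm=$dm_majmin_val"
    fi
}

# Usage (on an affected cinder-volumes container):
#   for n in /dev/mapper/cinder--volumes-*; do check_node "$n"; done
```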

Revision history for this message
Bjoern (bjoern-t) wrote :

I did some testing and have not been able to reproduce this issue yet.
/dev is not managed by udev inside the container unless you set a parameter, but some parameters, like obtain_device_list_from_udev or udev_rules, rendered LVM useless even when udev manages /dev.
At some point I think not having /dev mounted over udev causes more issues than it fixes. Ultimately I want to see udev working inside the container. We just have to check the AppArmor profiles and configuration so that the physical volume for cinder-volumes gets added inside the container when udev is in use.

David Wilde (dave-wilde)
Changed in openstack-ansible:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Evan Callicoat (apsu-2)
Changed in openstack-ansible:
milestone: none → next
Revision history for this message
Bjoern (bjoern-t) wrote :

We just encountered data loss after a cinder volume reboot. This bug needs to be worked at the highest priority.

Revision history for this message
Christopher H. Laco (claco) wrote :

Sounds like this is an active issue in Juno deployments, so we'll likely want to target a backport.

tags: added: juno-backport-potential
Revision history for this message
Evan Callicoat (diopter) wrote :

I believe what needs to be done here is to enable the udev_sync option in the lvm.conf files, then test that operations which interact with udev (running 'reboot' and 'udevadm trigger' inside containers), as well as LVM Cinder operations (create volume, attach to instance, detach from instance, delete volume), don't cause any adverse reactions on the host. I don't believe changing this option could cause any issues, and I don't believe it would change the behavior of udev-affecting commands whether enabled or disabled, but it's worth testing.

According to what I can find from various bits and pieces of commentary on udev/LXC and Ubuntu specifically, the upstart/systemd LXC Apparmor profiles should prevent udev event propagation to the host, so anything LVM does via udev should, in theory, be safe and isolated.
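For reference, the option discussed above lives in the activation section of lvm.conf. A hedged fragment (exact option names per the lvm.conf man page; defaults vary by LVM version):

```ini
# /etc/lvm/lvm.conf (fragment) -- illustrative values only
activation {
    # Make LVM wait for udev to finish processing device nodes,
    # so /dev/mapper entries match the live DM table.
    udev_sync = 1
    udev_rules = 1
}

devices {
    # The thread reports that enabling this inside the container
    # rendered LVM unusable, so it stays off in this sketch.
    obtain_device_list_from_udev = 0
}
```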

Revision history for this message
Paul Halmos (paul-halmos) wrote :

This initially surfaced after a kernel panic on the cinder nodes when the client disabled HP monitoring software. It can be reproduced with a cold reboot of the cinder node.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-ansible-deployment (juno)

Fix proposed to branch: juno
Review: https://review.openstack.org/188797

Changed in openstack-ansible:
status: Confirmed → In Progress
Revision history for this message
Bjoern (bjoern-t) wrote :

Folks, I can confirm that udev inside the container works once we actually mount devtmpfs over /dev.
Using lxc.autodev = 1 does not work, presumably because the local udevd inside the container runs in a completely separate kernel namespace.
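A sketch of the container configuration change described above, using LXC's lxc.mount.entry syntax (field values are illustrative, not taken from the merged fix):

```ini
# LXC container config (fragment) -- illustrative
# Mount a fresh devtmpfs over /dev so the udevd running inside the
# container sees kernel-created device nodes.
# Fields: fsname mountpoint fstype options dump pass
lxc.mount.entry = udev dev devtmpfs defaults 0 0
```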

Revision history for this message
git-harry (git-harry) wrote :

Steps to reproduce

utility_container:

# for i in {1..5}; do cinder create --display-name test$i 1; done

# cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+
| ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+
| 165e09a7-7e2c-42ad-8cd8-e01ecaa6a8b6 | available | test3 | 1 | None | false | |
| 77bce6ce-70c5-4109-9133-31c7898e511f | available | test1 | 1 | None | false | |
| 8bc99d61-068c-4dcd-bb14-28561b76e333 | available | test4 | 1 | None | false | |
| f4551c7f-6d0b-4b56-afdf-4d67d4e07977 | available | test5 | 1 | None | false | |
| fade96b4-eb74-4ae2-bd64-8bc761cb4eee | available | test2 | 1 | None | false | |
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+

# cinder delete test3

# cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+
| ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+
| 77bce6ce-70c5-4109-9133-31c7898e511f | available | test1 | 1 | None | false | |
| 8bc99d61-068c-4dcd-bb14-28561b76e333 | available | test4 | 1 | None | false | |
| f4551c7f-6d0b-4b56-afdf-4d67d4e07977 | available | test5 | 1 | None | false | |
| fade96b4-eb74-4ae2-bd64-8bc761cb4eee | available | test2 | 1 | None | false | |
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+

# cinder create --display-name test6 1

# cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+
| ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+
| 77bce6ce-70c5-4109-9133-31c7898e511f | available | test1 | 1 | None | false | |
| 7e15a9b5-c002-4c3c-8eb5-298e4a91b90c | available | test6 | 1 | None | false | |
| 8bc99d61-068c-4dcd-bb14-28561b76e333 | available | test4 | 1 | None | false | |
| f4551c7f-6d0b-4b56-afdf-4d67d4e07977 | available | test5 | 1 | None | false | |
| fade96b4-eb74-4ae2-bd64-8bc761cb4eee | available | test2 | 1 | None | false | |
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+

volumes_container:
...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (juno)

Reviewed: https://review.openstack.org/188797
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=389b0f0dd886388fefae0cece2844c35da28e405
Submitter: Jenkins
Branch: juno

commit 389b0f0dd886388fefae0cece2844c35da28e405
Author: git-harry <email address hidden>
Date: Fri Jun 5 14:41:25 2015 +0100

    Enable udev for lvm in cinder-volume container

    The current configuration of LVM for cinder-volume has udev_sync=0.
    This means that udev is not creating the devices that appear in /dev.
    The device files created reference specific device numbers, and these
    persist between reboots. When the host is rebooted there is no
    guarantee that device numbers allocated to the logical volumes will
    match those defined in the device files. This can be observed by
    comparing the output of 'dmsetup info' and 'ls -l /dev/mapper'.

    LVM's use of udev was disabled in an attempt to protect the host from
    the potential that uevents generated would be processed by all
    containers on the host. In practice this should not be an issue because
    there are no other containers running on a cinder host.

    This commit adjusts the lvm.conf file created so that udev is used. It
    also adds a mount entry to create a devtmpfs on /dev. Finally
    'udevadm trigger' is run to add the devices under /dev/mapper.
    Closes-Bug: #1436999
    Change-Id: I9ab35cf4438a369563f8c08870c1acfd0cc394b0
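The commit touches three things: lvm.conf, a devtmpfs mount on /dev, and a 'udevadm trigger' run. A small sanity-check sketch for a container after the fix (helper names are hypothetical; file paths are the usual defaults, not confirmed from the role):

```shell
#!/bin/sh
# Sanity-check sketch for the fix described above (hypothetical helpers).

# True if the given /proc/mounts line mounts a devtmpfs on /dev.
is_devtmpfs_on_dev() {
    printf '%s\n' "$1" | awk '$2 == "/dev" && $3 == "devtmpfs" {found=1} END {exit !found}'
}

# True if the given lvm.conf content enables udev_sync.
udev_sync_enabled() {
    printf '%s\n' "$1" | grep -Eq '^[[:space:]]*udev_sync[[:space:]]*=[[:space:]]*1'
}

# Typical usage inside the cinder-volumes container:
#   is_devtmpfs_on_dev "$(grep ' /dev ' /proc/mounts)" && echo "/dev is devtmpfs"
#   udev_sync_enabled "$(cat /etc/lvm/lvm.conf)" && echo "udev_sync is on"
#   udevadm trigger   # re-create the nodes under /dev/mapper
```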

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-ansible-deployment (icehouse)

Fix proposed to branch: icehouse
Review: https://review.openstack.org/189259

Revision history for this message
Darren Birkett (darren-birkett) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/188394

Revision history for this message
Bjoern (bjoern-t) wrote :

This fix has been committed to 10.1.8, right?
The milestone still sits at 10.1.7; can we update the status of this bug, please?

Revision history for this message
Darren Birkett (darren-birkett) wrote :

@bjoern the fix was released in 10.1.7

10.1.8 was a fast-follow release to fix a minor versioning issue, but it obviously also contains all fixes from 10.1.7 and earlier.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (icehouse)

Reviewed: https://review.openstack.org/189259
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=7e48fa53c534d671f0899b9b220471a615bd83de
Submitter: Jenkins
Branch: icehouse

commit 7e48fa53c534d671f0899b9b220471a615bd83de
Author: git-harry <email address hidden>
Date: Fri Jun 5 14:41:25 2015 +0100

    Enable udev for lvm in cinder-volume container

    The current configuration of LVM for cinder-volume has udev_sync=0.
    This means that udev is not creating the devices that appear in /dev.
    The device files created reference specific device numbers, and these
    persist between reboots. When the host is rebooted there is no
    guarantee that device numbers allocated to the logical volumes will
    match those defined in the device files. This can be observed by
    comparing the output of 'dmsetup info' and 'ls -l /dev/mapper'.

    LVM's use of udev was disabled in an attempt to protect the host from
    the potential that uevents generated would be processed by all
    containers on the host. In practice this should not be an issue because
    there are no other containers running on a cinder host.

    This commit adjusts the lvm.conf file created so that udev is used. It
    also adds a mount entry to create a devtmpfs on /dev. Finally
    'udevadm trigger' is run to add the devices under /dev/mapper.

    (cherry picked from commit 389b0f0dd886388fefae0cece2844c35da28e405)

    Closes-Bug: #1436999
    Change-Id: I9ab35cf4438a369563f8c08870c1acfd0cc394b0

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (master)

Reviewed: https://review.openstack.org/188394
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=6ba292a295ab2e7effeefd053802e6afadd0ab9e
Submitter: Jenkins
Branch: master

commit 6ba292a295ab2e7effeefd053802e6afadd0ab9e
Author: git-harry <email address hidden>
Date: Tue Jun 2 15:17:33 2015 +0100

    Enable udev for lvm in cinder-volume container

    The current configuration of LVM for cinder-volume has udev_sync=0.
    This means that udev is not creating the devices that appear in /dev.
    The device files created reference specific device numbers, and these
    persist between reboots. When the host is rebooted there is no
    guarantee that device numbers allocated to the logical volumes will
    match those defined in the device files. This can be observed by
    comparing the output of 'dmsetup info' and 'ls -l /dev/mapper'.

    LVM's use of udev was disabled in an attempt to protect the host from
    the potential that uevents generated would be processed by all
    containers on the host. In practice this should not be an issue because
    there are no other containers running on a cinder host.

    This commit adjusts the lvm.conf file created so that udev is used. It
    also adds a mount entry to create a devtmpfs on /dev. Finally
    'udevadm trigger' is run to add the devices under /dev/mapper.
    Closes-Bug: #1436999
    Change-Id: I9ab35cf4438a369563f8c08870c1acfd0cc394b0

Changed in openstack-ansible:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-ansible-deployment (kilo)

Fix proposed to branch: kilo
Review: https://review.openstack.org/190041

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (kilo)

Reviewed: https://review.openstack.org/190041
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=e741ea00abc83f086ec2b1180d3736d0448c56d7
Submitter: Jenkins
Branch: kilo

commit e741ea00abc83f086ec2b1180d3736d0448c56d7
Author: git-harry <email address hidden>
Date: Tue Jun 2 15:17:33 2015 +0100

    Enable udev for lvm in cinder-volume container

    The current configuration of LVM for cinder-volume has udev_sync=0.
    This means that udev is not creating the devices that appear in /dev.
    The device files created reference specific device numbers, and these
    persist between reboots. When the host is rebooted there is no
    guarantee that device numbers allocated to the logical volumes will
    match those defined in the device files. This can be observed by
    comparing the output of 'dmsetup info' and 'ls -l /dev/mapper'.

    LVM's use of udev was disabled in an attempt to protect the host from
    the potential that uevents generated would be processed by all
    containers on the host. In practice this should not be an issue because
    there are no other containers running on a cinder host.

    This commit adjusts the lvm.conf file created so that udev is used. It
    also adds a mount entry to create a devtmpfs on /dev. Finally
    'udevadm trigger' is run to add the devices under /dev/mapper.
    Closes-Bug: #1436999
    Change-Id: I9ab35cf4438a369563f8c08870c1acfd0cc394b0
    (cherry picked from commit 6ba292a295ab2e7effeefd053802e6afadd0ab9e)

Revision history for this message
Paul Halmos (paul-halmos) wrote :

I manually rolled the changes into an environment today on a single cinder node. The tenant had shut down the instances with volumes attached prior to the maintenance. After rebooting the cinder node, the state files located in /var/lib/cinder/volumes/ were found to be missing. This manifested itself when attempting to start an instance that had a volume on the affected cinder node:

2015-06-09 18:54:58.804 19459 TRACE oslo.messaging.rpc.dispatcher libvirtError: Failed to open file '/dev/disk/by-path/ip-10.17.150.69:3260-iscsi-iqn.2010-10.org.openstack:volume-ab377917-b1fa-416a-b001-ef6b9ff09715-lun-1': No such device or address

When attempting to discover the targets the following errors were seen:

# iscsiadm -m discovery -t st -p 10.17.150.69:3260
iscsiadm: Connection to Discovery Address 10.17.150.69 closed
iscsiadm: Login I/O error, failed to receive a PDU
iscsiadm: retrying discovery login to 10.17.150.69

2015-06-09 19:23:43.073 19459 TRACE oslo.messaging.rpc.dispatcher Command: sudo nova-rootwrap /etc/nova/rootwrap.conf iscsiadm -m node -T iqn.2010-10.org.openstack:volume-ab377917-b1fa-416a-b001-ef6b9ff09715 -p 10.17.150.69:3260 --rescan
2015-06-09 19:23:43.073 19459 TRACE oslo.messaging.rpc.dispatcher Exit code: 21
2015-06-09 19:23:43.073 19459 TRACE oslo.messaging.rpc.dispatcher Stdout: u''
2015-06-09 19:23:43.073 19459 TRACE oslo.messaging.rpc.dispatcher Stderr: u'iscsiadm: No session found.\n'

On the cinder node:

2015-06-09 18:29:18.882 1411 ERROR cinder.volume.manager [req-62b66a03-c879-482c-8be5-942e5b35180d - - - - -] Failed to re-export volume ab377917-b1fa-416a-b001-ef6b9ff09715: setting to error state
2015-06-09 18:29:18.883 1411 ERROR cinder.volume.manager [req-62b66a03-c879-482c-8be5-942e5b35180d - - - - -] Failed to create iscsi target for volume volume-ab377917-b1fa-416a-b001-ef6b9ff09715.
2015-06-09 18:29:18.883 1411 TRACE cinder.volume.manager Traceback (most recent call last):
2015-06-09 18:29:18.883 1411 TRACE cinder.volume.manager File "/usr/local/lib/python2.7/dist-packages/cinder/volume/manager.py", line 276, in init_host
2015-06-09 18:29:18.883 1411 TRACE cinder.volume.manager self.driver.ensure_export(ctxt, volume)
2015-06-09 18:29:18.883 1411 TRACE cinder.volume.manager File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 105, in wrapper
2015-06-09 18:29:18.883 1411 TRACE cinder.volume.manager return f(*args, **kwargs)
2015-06-09 18:29:18.883 1411 TRACE cinder.volume.manager File "/usr/local/lib/python2.7/dist-packages/cinder/volume/drivers/lvm.py", line 543, in ensure_export
2015-06-09 18:29:18.883 1411 TRACE cinder.volume.manager self.configuration)
2015-06-09 18:29:18.883 1411 TRACE cinder.volume.manager File "/usr/local/lib/python2.7/dist-packages/cinder/volume/iscsi.py", line 116, in ensure_export
2015-06-09 18:29:18.883 1411 TRACE cinder.volume.manager write_cache=conf.iscsi_write_cache)
2015-06-09 18:29:18.883 1411 TRACE cinder.volume.manager File "/usr/local/lib/python2.7/dist-packages/cinder/brick/iscsi/iscsi.py", line 249, in create_iscsi_target
2015-06-09 18:29:18.883 1411 TRACE cinder.volume.manager raise exception.ISCSITargetCr...


Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/openstack-ansible 11.2.11

This issue was fixed in the openstack/openstack-ansible 11.2.11 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/openstack-ansible 11.2.12

This issue was fixed in the openstack/openstack-ansible 11.2.12 release.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/openstack-ansible 11.2.14

This issue was fixed in the openstack/openstack-ansible 11.2.14 release.
