On Fedora, stack.sh randomly fails on "sudo vgs stack-volumes-default"

Bug #1441236 reported by Victor Stinner
This bug affects 8 people
Affects: devstack
Status: Fix Released
Importance: Undecided
Assigned to: Bogdan Katynski
Milestone: (none)

Bug Description

When trying to start DevStack (git d1549d624dea9b00d7aa0f935759c69290e9ea17), stack.sh randomly fails on the creation of the volume groups. I'm running DevStack on Fedora 21 in a VM. The host of the VM is also Fedora 21; I'm using virt-manager with KVM. The VM has 4 GB of memory and 2 vCPUs.

Logs:

2015-04-07 15:49:20.727 | + is_service_enabled c-vol
2015-04-07 15:49:20.730 | + return 0
2015-04-07 15:49:20.730 | + [[ -n lvm:lvmdriver-1 ]]
2015-04-07 15:49:20.731 | + local be be_name be_type
2015-04-07 15:49:20.731 | + for be in '${CINDER_ENABLED_BACKENDS//,/ }'
2015-04-07 15:49:20.731 | + be_type=lvm
2015-04-07 15:49:20.731 | + be_name=lvmdriver-1
2015-04-07 15:49:20.731 | + type init_cinder_backend_lvm
2015-04-07 15:49:20.731 | + [[ lvm == \l\v\m ]]
2015-04-07 15:49:20.731 | + init_default_lvm_volume_group
2015-04-07 15:49:20.731 | + [[ 0 = \0 ]]
2015-04-07 15:49:20.732 | + init_lvm_volume_group stack-volumes-default 10250M
2015-04-07 15:49:20.732 | + local vg=stack-volumes-default
2015-04-07 15:49:20.732 | + local size=10250M
2015-04-07 15:49:20.732 | + is_fedora
2015-04-07 15:49:20.732 | + [[ -z Fedora ]]
2015-04-07 15:49:20.732 | + '[' Fedora = Fedora ']'
2015-04-07 15:49:20.732 | + start_service lvm2-lvmetad
2015-04-07 15:49:20.733 | + is_ubuntu
2015-04-07 15:49:20.733 | + [[ -z rpm ]]
2015-04-07 15:49:20.733 | + '[' rpm = deb ']'
2015-04-07 15:49:20.733 | + sudo /sbin/service lvm2-lvmetad start
2015-04-07 15:49:20.753 | Redirecting to /bin/systemctl start lvm2-lvmetad.service
2015-04-07 15:49:20.763 | + '[' tgtadm = tgtadm ']'
2015-04-07 15:49:20.764 | + start_service tgtd
2015-04-07 15:49:20.764 | + is_ubuntu
2015-04-07 15:49:20.764 | + [[ -z rpm ]]
2015-04-07 15:49:20.764 | + '[' rpm = deb ']'
2015-04-07 15:49:20.764 | + sudo /sbin/service tgtd start
2015-04-07 15:49:20.782 | Redirecting to /bin/systemctl start tgtd.service
2015-04-07 15:49:26.143 | + _create_lvm_volume_group stack-volumes-default 10250M
2015-04-07 15:49:26.144 | + local vg=stack-volumes-default
2015-04-07 15:49:26.144 | + local size=10250M
2015-04-07 15:49:26.144 | + local backing_file=/opt/stack/data/stack-volumes-default-backing-file
2015-04-07 15:49:26.144 | + sudo vgs stack-volumes-default
2015-04-07 15:49:26.162 | Volume group "stack-volumes-default" not found
2015-04-07 15:49:26.163 | Cannot process volume group stack-volumes-default
2015-04-07 15:49:26.165 | + [[ -f /opt/stack/data/stack-volumes-default-backing-file ]]
2015-04-07 15:49:26.165 | ++ sudo losetup -f --show /opt/stack/data/stack-volumes-default-backing-file
2015-04-07 15:49:26.198 | + local vg_dev=/dev/loop0
2015-04-07 15:49:26.198 | + sudo vgs stack-volumes-default
2015-04-07 15:49:26.213 | Volume group "stack-volumes-default" not found
2015-04-07 15:49:26.213 | Cannot process volume group stack-volumes-default
2015-04-07 15:49:26.216 | + sudo vgcreate stack-volumes-default /dev/loop0
2015-04-07 15:49:26.257 | A volume group called stack-volumes-default already exists.
2015-04-07 15:49:26.260 | + exit_trap
2015-04-07 15:49:26.260 | + local r=5
2015-04-07 15:49:26.261 | ++ jobs -p
2015-04-07 15:49:26.262 | + jobs=
2015-04-07 15:49:26.262 | + [[ -n '' ]]
2015-04-07 15:49:26.262 | + kill_spinner
2015-04-07 15:49:26.262 | + '[' '!' -z '' ']'
2015-04-07 15:49:26.262 | + [[ 5 -ne 0 ]]
2015-04-07 15:49:26.262 | + echo 'Error on exit'
2015-04-07 15:49:26.262 | Error on exit
2015-04-07 15:49:26.263 | + [[ -z /opt/stack/data/logs ]]
2015-04-07 15:49:26.263 | + /home/fedora/devstack/tools/worlddump.py -d /opt/stack/data/logs
2015-04-07 15:49:26.340 | + exit 5

It's strange because stack-volumes-default *does* exist:

fedora@myhost$ sudo vgs stack-volumes-default
  VG #PV #LV #SN Attr VSize VFree
  stack-volumes-default 1 0 0 wz--n- 10,01g 10,01g
fedora@myhost$ sudo losetup
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE
/dev/loop0 0 0 0 0 /opt/stack/data/stack-volumes-default-backing-file

Maybe it's a bug in the code that detects whether the volume group is present. Or a race condition? But how? I'm not running stack.sh in parallel?! (Hmm, I do have the environment variable MAKEFLAGS=-j3, but I don't see how that could be related.)
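For context, the check-and-create sequence visible in the xtrace above boils down to roughly the following (a simplified reconstruction from the trace, not the literal devstack code; the variable names and backing-file path are illustrative):

# Simplified reconstruction of _create_lvm_volume_group as seen in the xtrace
# above (illustrative only; the real function lives in devstack's lvm code).
_create_lvm_volume_group() {
    local vg=$1
    local size=$2
    local backing_file=$DATA_DIR/${vg}-backing-file

    # First check: does the VG already exist?
    if ! sudo vgs $vg; then
        # Create the backing file if missing and attach it to a free loop device
        [[ -f $backing_file ]] || truncate -s $size $backing_file
        local vg_dev
        vg_dev=$(sudo losetup -f --show $backing_file)
        # Second check, then create the VG on the loop device
        if ! sudo vgs $vg; then
            sudo vgcreate $vg $vg_dev
        fi
    fi
}

Both vgs checks report "not found" while vgcreate then complains the VG already exists, so the two tools appear to see different state. Since lvm2-lvmetad is started only a few seconds earlier in the trace, a stale or not-yet-populated metadata cache seems like a plausible explanation, though that is only a guess.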

localrc:

# Configuration for DevStack (used on OpenStack Kilo in 2015)
# Read http://docs.openstack.org/developer/devstack/configuration.html

# Passwords
DATABASE_PASSWORD=password
ADMIN_PASSWORD=password
MYSQL_PASSWORD=password
RABBIT_PASSWORD=password
SERVICE_PASSWORD=password
SERVICE_TOKEN=token

# By default, all git repositories are updated (OFFLINE is False)
# WARNING: local changes are dropped!
#OFFLINE=True

# stack.sh will freshen each repo on each run
RECLONE=True

# Note: using the Icehouse branches below doesn't work unless you also check
# out the Icehouse branch of DevStack itself
#GLANCE_BRANCH=stable/icehouse
#HORIZON_BRANCH=stable/icehouse
#KEYSTONE_BRANCH=stable/icehouse
#NEUTRON_BRANCH=stable/icehouse
#NOVA_BRANCH=stable/icehouse
#SWIFT_BRANCH=1.10.0

# Write logs into /opt/stack/data/logs
LOGFILE=$DATA_DIR/logs/stack.log
# Keep at most 2 days of logs
LOGDAYS=2

# Disable Heat
disable_service heat h-api h-api-cfn h-api-cw h-eng

# Disable swift, tempest
disable_service swift tempest

# Disable horizon
disable_service horizon

# Disable Cinder
disable_service cinder

# Use Neutron to manage networks instead of nova-network
Q_PLUGIN=ml2
disable_service n-net
enable_service q-svc q-agt q-dhcp q-l3 q-meta q-lbaas

Note: SELinux enforcement is disabled (/etc/selinux/config contains SELINUX=permissive).

Revision history for this message
gordon chung (chungg) wrote :

I saw this once too... and then it disappeared.

Revision history for this message
Ed Balduf (ebalduf) wrote :

I'm in this situation right now and spent some time tracking this down. It looks like the volume group stack-volumes-default was somehow created but not cleaned up. The interesting parts of the devstack log are:

2015-05-30 22:25:48.806 | + truncate -s 16G /opt/stack/data/stack-volumes-lvmdriver-1-backing-file
2015-05-30 22:25:48.808 | ++ sudo losetup -f --show /opt/stack/data/stack-volumes-lvmdriver-1-backing-file
2015-05-30 22:25:48.830 | + local vg_dev=/dev/loop1
2015-05-30 22:25:48.830 | + sudo vgs stack-volumes-lvmdriver-1
2015-05-30 22:25:48.845 | Volume group "stack-volumes-lvmdriver-1" not found
2015-05-30 22:25:48.845 | Cannot process volume group stack-volumes-lvmdriver-1
2015-05-30 22:25:48.846 | + sudo vgcreate stack-volumes-lvmdriver-1 /dev/loop1
2015-05-30 22:25:48.871 | Physical volume '/dev/loop1' is already in volume group 'stack-volumes-default'
2015-05-30 22:25:48.871 | Unable to add physical volume '/dev/loop1' to volume group 'stack-volumes-lvmdriver-1'.

But when I do a losetup, I see this:

[devstack-kilo2 data]$ sudo losetup
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE
/dev/loop0 0 0 1 0 /opt/stack/data/swift/drives/images/swift.img
/dev/loop1 0 0 0 0 /opt/stack/data/stack-volumes-lvmdriver-1-backing-file

which indicates that /dev/loop1 is correct, but vgs shows that stack-volumes-default is still present:

[devstack-kilo2 data]$ sudo vgs
  VG #PV #LV #SN Attr VSize VFree
  fedora-server 1 2 0 wz--n- 39.51g 44.00m
  stack-volumes-default 1 0 0 wz--n- 16.00g 16.00g

When I go to remove stack-volumes-default, it errors out indicating problems with /dev/loop1:

[devstack-kilo2 data]$ sudo vgremove stack-volumes-default
  Incorrect metadata area header checksum on /dev/loop1 at offset 4096
  vg_remove_mdas stack-volumes-default failed

I suspect that I changed my local.conf file while devstack was running and then ran unstack afterwards, which left stack-volumes-default around and attached to /dev/loop1, even though the original backing file and loop attachment are gone. Not sure if this is a bug or a user procedure error.
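If you end up in this state, the following manual recovery is only a rough sketch (not an official procedure): the VG name and loop device below are examples, so check sudo vgs and sudo losetup first, and only touch a loop device that nothing else is using.

# Example manual recovery from a stale stack-volumes-* volume group.
# Names are illustrative; verify them with `sudo vgs` and `sudo losetup`.
sudo pvscan --cache                      # refresh the lvmetad cache from the real devices
sudo vgremove -f stack-volumes-default   # try to force-remove the stale VG
sudo pvremove -ff /dev/loop1             # if needed, force-remove the stale PV label
sudo losetup -d /dev/loop1               # detach the loop device
# If metadata errors persist, a reboot also clears the stale cached state.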

Revision history for this message
Joe Hakim Rahme (rahmu) wrote :

I got this on my Fedora 24 after rebooting my VM without running `./unstack` first. Just running `./unstack` and re-running `./stack.sh` did it for me.

Revision history for this message
gordon chung (chungg) wrote :

This bad boy came back with a vengeance. I've no idea why, but the only solution we have is to reboot the system. It seems Cinder or something is creating a volume and can't clean it up.

Revision history for this message
Bogdan Katynski (bkatynski) wrote :

This issue is happening to me as well intermittently.

I'm running ./stack.sh on a VirtualBox VM with 4 vCPUs and 8 GB of RAM, running Ubuntu 16.04.

Revision history for this message
Sean Dague (sdague) wrote :

Automatically discovered version kilo in description. If this is incorrect, please update the description to include 'https://api.launchpad.net/1.0/devstack version: ...'

Revision history for this message
Sean Dague (sdague) wrote :

Automatically discovered version kilo in description. If this is incorrect, please update the description to include 'devstack version: ...'

Sean Dague (sdague)
tags: added: openstack-version.kilo
Changed in devstack:
assignee: nobody → Bogdan Katynski (bkatynski)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (master)

Fix proposed to branch: master
Review: https://review.openstack.org/478525

Changed in devstack:
status: New → In Progress
Revision history for this message
Bogdan Katynski (bkatynski) wrote :

I'd appreciate a code review from the folks who reported and commented on this bug.

Thanks

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack (master)

Reviewed: https://review.openstack.org/478525
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=8175df1912a5722c2e308dc27623063ac542f496
Submitter: Jenkins
Branch: master

commit 8175df1912a5722c2e308dc27623063ac542f496
Author: Bogdan Katynski <email address hidden>
Date: Wed Jun 28 12:34:52 2017 +0100

    Remove VGs when cleaning LVM up in devstack

    This patch adds removing of the volume group
    before removing the loopback device and
    the backing file when performing LVM cleanup
    in unstack.sh

    Without this commit:

    unstack.sh removes logical volumes, removes the
    loopback devices and deletes the backing file
    but leaves a dangling volume group

      $ ./stack.sh && ./unstack.sh
      $ sudo vgs
      VG #PV #LV #SN Attr VSize VFree
      stack-volumes-default 1 0 0 wz--n- 10.01g 10.01g
      $ sudo losetup -a
      $ sudo vgremove stack-volumes-default
      /dev/loop0: lseek 4096 failed: Invalid argument
      vg_remove_mdas stack-volumes-default failed

    With this commit:

    unstack.sh removes volume groups after removing
    all logical volumes but before removing
    the loopback device and deleting the backing file

    Partial-Bug: 1441236
    Change-Id: Id9c06fa50f6cad28764f5a3396f559cac9999649
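
In essence, the fix changes the unstack.sh cleanup ordering to something like the following (a simplified sketch of the ordering described in the commit message; the function name and variables are illustrative, and the actual devstack code differs):

# Sketch of the cleanup ordering described above (illustrative, not the literal code).
cleanup_lvm_volume_group() {
    local vg=$1
    local backing_file=$DATA_DIR/${vg}-backing-file

    # 1. Remove any logical volumes still in the volume group
    sudo lvremove -f $vg
    # 2. New in this fix: remove the VG while its loop device still exists
    sudo vgremove -f $vg
    # 3. Detach the loopback device backing the VG
    local vg_dev
    vg_dev=$(sudo losetup -j $backing_file | cut -d':' -f1)
    [[ -n $vg_dev ]] && sudo losetup -d $vg_dev
    # 4. Finally delete the backing file
    rm -f $backing_file
}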

Revision history for this message
Bogdan Katynski (bkatynski) wrote :

After the bugfix was merged, I did some manual tests in devstack (Ubuntu 16.04).

I ran ./stack.sh a few times consecutively (no unstack.sh in between), and also:

./stack.sh && ./unstack.sh && ./stack.sh

and didn't see the error.

Can someone else verify too?

Changed in devstack:
status: In Progress → Fix Released