stable/train volume failed to build and is in ERROR status, Permission denied: '/var/lib/cinder/groups'

Bug #1908750 reported by wes hayutin
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Alan Bishop

Bug Description

In tempest we get:

https://logserver.rdoproject.org/10/26510/11/check/periodic-tripleo-ci-centos-8-standalone-train/0ceb28b/logs/undercloud/var/log/tempest/stestr_results.html.gz

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/api/volume/test_volumes_get.py", line 131, in test_volume_create_get_update_delete_from_image
    imageRef=CONF.compute.image_ref, size=disk_size)
  File "/usr/lib/python3.6/site-packages/tempest/api/volume/test_volumes_get.py", line 42, in _volume_create_get_update_delete
    volume['id'], 'available')
  File "/usr/lib/python3.6/site-packages/tempest/common/waiters.py", line 210, in wait_for_volume_resource_status
    resource_name=resource_name, resource_id=resource_id)
tempest.exceptions.VolumeResourceBuildErrorException: volume daee4a64-8697-4860-ba7b-83ccb7b08575 failed to build and is in ERROR status

================ no valid backend found =========
bf8b0eae67474cc9acc63d008737f49c af714ff5314a4c7b987518d4a72ae808 - default default] Failed to run task cinder.scheduler.flows.create_volume.ScheduleCreateVolumeTask;volume:create: No valid backend was found. No weighed backends available: cinder.exception.NoValidBackend: No valid backend was found. No weighed backends available

============== cinder error =============
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service Traceback (most recent call last):
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/tooz/drivers/file.py", line 278, in _start
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service fileutils.ensure_tree(a_dir)
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/oslo_utils/fileutils.py", line 42, in ensure_tree
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service os.makedirs(path, mode)
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service File "/usr/lib64/python3.6/os.py", line 220, in makedirs
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service mkdir(name, mode)
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service PermissionError: [Errno 13] Permission denied: '/var/lib/cinder/groups'
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service During handling of the above exception, another exception occurred:
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service Traceback (most recent call last):
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/oslo_service/service.py", line 810, in run_service
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service service.start()
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/cinder/service.py", line 220, in start
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service coordination.COORDINATOR.start()
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/cinder/coordination.py", line 67, in start
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service self.coordinator.start(start_heart=True)
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/tooz/coordination.py", line 690, in start
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service super(CoordinationDriverWithExecutor, self).start(start_heart)
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/tooz/coordination.py", line 426, in start
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service self._start()
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/tooz/drivers/file.py", line 280, in _start
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service raise coordination.ToozConnectionError(e)
2020-12-18 17:54:00.328 ERROR /var/log/containers/cinder/cinder-volume.log: 19 ERROR oslo_service.service tooz.coordination.ToozConnectionError: [Errno 13] Permission denied: '/var/lib/cinder/groups'

wes hayutin (weshayutin)
Changed in tripleo:
importance: Undecided → Critical
wes hayutin (weshayutin) wrote :

reproduced the same failures here https://review.rdoproject.org/r/#/c/26510/

Luigi Toscano (ltoscano) wrote :

Just to be sure: is that the only stable/train job where that failure is visible? Do all volume tempest tests fail?

It seems that a recheck passed:
https://logserver.rdoproject.org/10/26510/12/check/periodic-tripleo-ci-centos-8-standalone-full-tempest-scenario-train/de15562/logs/undercloud/var/log/tempest/stestr_results.html.gz

Alan Bishop (alan-bishop) wrote :

I'm really baffled. The failure occurs very early in the cinder-volume service startup, where it initializes its coordination locking scheme. This defaults to tooz's "file" driver, which in turn needs to create the /var/lib/cinder/groups directory. The permission denied failure suggests cinder doesn't own the /var/lib/cinder directory. More disturbing is that a failure like this prevents the cinder-volume service from running at all, which would cause all sorts of cascading failures beyond a few tempest tests.
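
The failure path described above can be sketched with the standard library alone. This is a minimal illustration, not tooz's actual code: `CoordinationStartError` and `start_file_coordinator` are hypothetical stand-ins for `ToozConnectionError` and the file driver's `_start`, and the wrap-and-reraise shape is inferred from the traceback in the bug description.

```python
import os


class CoordinationStartError(Exception):
    """Hypothetical stand-in for tooz.coordination.ToozConnectionError."""


def start_file_coordinator(base_dir):
    """Create the 'groups' directory a file-based coordinator needs.

    Any OSError (including PermissionError when base_dir is not writable
    by the current user) is re-raised as CoordinationStartError, so the
    calling service fails during startup.
    """
    groups_dir = os.path.join(base_dir, "groups")
    try:
        os.makedirs(groups_dir, exist_ok=True)
    except OSError as exc:
        raise CoordinationStartError(str(exc)) from exc
    return groups_dir
```

If a service calls something like this from its start routine and does not catch the error, it never finishes starting, which would be consistent with the scheduler later reporting that no valid backend was found.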

It's hard to tell from the logs what's causing the permission error. Can this be reproduced locally? I'll see what I can do, but traditionally I've had no success using CI's quickstart reproducer.

Alan Bishop (alan-bishop) wrote :

I reproduced this using a standalone build based on train. The cinder-volume service is running as the 'cinder' user, but /var/lib/cinder is owned by root and so cinder cannot create the /var/lib/cinder/groups directory.
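
A quick way to confirm this kind of mismatch is to compare the directory's owner with the user the service runs as. The sketch below is illustrative only: the helper names `owner_name` and `owned_by` are hypothetical, and the path and user in the usage comment are the ones named in this report.

```python
import os
import pwd


def owner_name(path):
    """Return the owning user's name for path, or the numeric uid as text."""
    uid = os.stat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # The uid has no passwd entry (this can happen inside containers).
        return str(uid)


def owned_by(path, user):
    """Return True if path is owned by the named user."""
    return owner_name(path) == user


# Example, run inside the cinder-volume container:
#   owned_by("/var/lib/cinder", "cinder")
```

In the failing deployment described here, where the directory is owned by root, such a check would be expected to report a mismatch.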

Note: In train's standalone deployment, cinder-volume is not run under pacemaker. In later releases, even standalone deployment runs c-vol under pacemaker. Under pacemaker, c-vol runs as root, so it has perms to create the /var/lib/cinder/groups directory.

I think the issue is that the TCIB framework isn't installing the extend_start.sh script that kolla images provide. See [1], where the c-vol extend_start.sh sets the ownership of /var/lib/cinder.

[1] https://opendev.org/openstack/kolla/src/branch/master/docker/cinder/cinder-volume/extend_start.sh#L4

I looked for something equivalent in [2], where the TCIB extend_start scripts appear to be located, but I don't see anything for cinder-volume.

[2] https://opendev.org/openstack/tripleo-common/src/branch/stable/train/container-images/kolla

This is pretty much the limit of my knowledge of TCIB, but hopefully someone who knows more can make something of this.
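
For illustration only, the kind of ownership fix that [1] performs in shell can be approximated in Python as below. This is a sketch under stated assumptions, not the kolla or TCIB script: `ensure_owner` and its injectable `chown` parameter are hypothetical, and changing ownership to a different user requires root.

```python
import os


def ensure_owner(path, uid, gid, chown=os.chown):
    """Chown path to uid:gid only if it is not already owned that way.

    Returns True if ownership was changed, False if it was already correct.
    The chown callable is injectable so the logic can be exercised without
    root privileges.
    """
    st = os.stat(path)
    if (st.st_uid, st.st_gid) == (uid, gid):
        return False
    chown(path, uid, gid)
    return True
```

A container start hook could look up the uid and gid with `pwd.getpwnam("cinder")` and call this on /var/lib/cinder before the service starts. Checking first keeps the call a no-op when ownership is already correct.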

Alan Bishop (alan-bishop) wrote :

I am working on a fix for this.

Changed in tripleo:
assignee: nobody → Alan Bishop (alan-bishop)
Bogdan Dobrelya (bogdando) wrote :

@Alan, JFYI https://review.opendev.org/q/Ib2ca2ca46ff4efa419b6b9236299e70b39f8639e resembles the subject but for Ironic services

Ronelle Landy (rlandy) wrote :

Closing this out

Changed in tripleo:
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 14.0.0

This issue was fixed in the openstack/tripleo-common 14.0.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 12.4.3

This issue was fixed in the openstack/tripleo-common 12.4.3 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 11.5.0

This issue was fixed in the openstack/tripleo-common 11.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 13.2.0

This issue was fixed in the openstack/tripleo-common 13.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-common/+/873032

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/873032
Committed: https://opendev.org/openstack/tripleo-common/commit/b6cf34a9536ce413b2a46ab455c54e4488f49f2d
Submitter: "Zuul (22348)"
Branch: master

commit b6cf34a9536ce413b2a46ab455c54e4488f49f2d
Author: Alan Bishop <email address hidden>
Date: Tue Feb 7 13:05:47 2023 -0800

    TCIB: Add cinder-backup extend_start.sh script

    Add a kolla_extend_start script to the cinder-backup service that
    ensures /var/lib/cinder is owned by the 'cinder' user. See
    I2d82c1ca86735d2a8d69b3e28e8cea7acd637f0b for details on what was
    done for the cinder-volume service. cinder-backup also needs to run
    the script because there's no guarantee the cinder-volume service
    is running on every cinder-backup node.

    Resolves: rhbz#2167954
    Related-Bug: #1908750
    Change-Id: I7fcca9fbfea87ac4b245856a3aecae9ffd211938

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/zed)

Related fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/tripleo-common/+/873715

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/873715
Committed: https://opendev.org/openstack/tripleo-common/commit/8509b2a4b2fc0c5f53c6b46c0c8ac105cee53178
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 8509b2a4b2fc0c5f53c6b46c0c8ac105cee53178
Author: Alan Bishop <email address hidden>
Date: Tue Feb 7 13:05:47 2023 -0800

    TCIB: Add cinder-backup extend_start.sh script

    Add a kolla_extend_start script to the cinder-backup service that
    ensures /var/lib/cinder is owned by the 'cinder' user. See
    I2d82c1ca86735d2a8d69b3e28e8cea7acd637f0b for details on what was
    done for the cinder-volume service. cinder-backup also needs to run
    the script because there's no guarantee the cinder-volume service
    is running on every cinder-backup node.

    Resolves: rhbz#2167954
    Related-Bug: #1908750
    Change-Id: I7fcca9fbfea87ac4b245856a3aecae9ffd211938
    (cherry picked from commit b6cf34a9536ce413b2a46ab455c54e4488f49f2d)

tags: added: in-stable-zed
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-common/+/874527

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/874527
Committed: https://opendev.org/openstack/tripleo-common/commit/d447618dd8ef0713fee3d9b109afdf15dc0d9438
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit d447618dd8ef0713fee3d9b109afdf15dc0d9438
Author: Alan Bishop <email address hidden>
Date: Tue Feb 7 13:05:47 2023 -0800

    TCIB: Add cinder-backup extend_start.sh script

    Add a kolla_extend_start script to the cinder-backup service that
    ensures /var/lib/cinder is owned by the 'cinder' user. See
    I2d82c1ca86735d2a8d69b3e28e8cea7acd637f0b for details on what was
    done for the cinder-volume service. cinder-backup also needs to run
    the script because there's no guarantee the cinder-volume service
    is running on every cinder-backup node.

    Resolves: rhbz#2167954
    Related-Bug: #1908750
    Change-Id: I7fcca9fbfea87ac4b245856a3aecae9ffd211938
    (cherry picked from commit b6cf34a9536ce413b2a46ab455c54e4488f49f2d)
    (cherry picked from commit 8509b2a4b2fc0c5f53c6b46c0c8ac105cee53178)

tags: added: in-stable-wallaby