docker permission denied on upgrade prep containers

Bug #1775209 reported by Rafael Folco on 2018-06-05
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Critical
Sofer Athlan-Guyot

Bug Description

Error: 2018-06-05 12:21:08 | Error while fetching server API version: ('Connection aborted.', error(13, 'Permission denied'))

Command: Running skopeo inspect --tls-verify=false docker://192.168.24.1:8787/tripleomaster/centos-binary-rsyslog-base:3a65b17da7b98c83dbfda432af88bb56d3501de9_dba04735

http://logs.openstack.org/87/572187/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/ca1ee14/logs/undercloud/home/zuul/upgrade_overcloud_prep_containers.log.txt.gz#_2018-06-05_12_21_08

2018-06-05 12:21:07 | Starting new HTTPS connection (1): 192.168.24.1
2018-06-05 12:21:07 | imagename: docker.io/tripleomaster/centos-binary-rsyslog-base:3a65b17da7b98c83dbfda432af88bb56d3501de9_dba04735
2018-06-05 12:21:07 | Running skopeo inspect docker://docker.io/tripleomaster/centos-binary-rsyslog-base:3a65b17da7b98c83dbfda432af88bb56d3501de9_dba04735
2018-06-05 12:21:08 | Running skopeo inspect --tls-verify=false docker://192.168.24.1:8787/tripleomaster/centos-binary-rsyslog-base:3a65b17da7b98c83dbfda432af88bb56d3501de9_dba04735
2018-06-05 12:21:08 | Trying paths: ['/home/zuul/.docker/config.json', '/home/zuul/.dockercfg']
2018-06-05 12:21:08 | No config file found
2018-06-05 12:21:08 | Error while fetching server API version: ('Connection aborted.', error(13, 'Permission denied'))
2018-06-05 12:21:08 | Traceback (most recent call last):
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/cliff/app.py", line 399, in run_subcommand
2018-06-05 12:21:08 | result = cmd.run(parsed_args)
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/tripleoclient/command.py", line 25, in run
2018-06-05 12:21:08 | super(Command, self).run(parsed_args)
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
2018-06-05 12:21:08 | return super(Command, self).run(parsed_args)
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/cliff/command.py", line 184, in run
2018-06-05 12:21:08 | return_code = self.take_action(parsed_args) or 0
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/container_image.py", line 78, in take_action
2018-06-05 12:21:08 | uploader.upload()
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 114, in upload
2018-06-05 12:21:08 | uploader.run_tasks()
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 455, in run_tasks
2018-06-05 12:21:08 | result = self.upload_image(*first)
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 237, in upload_image
2018-06-05 12:21:08 | dockerc = Client(base_url='unix://var/run/docker.sock', version='auto')
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/docker/api/client.py", line 147, in __init__
2018-06-05 12:21:08 | self._version = self._retrieve_server_version()
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/docker/api/client.py", line 174, in _retrieve_server_version
2018-06-05 12:21:08 | 'Error while fetching server API version: {0}'.format(e)
2018-06-05 12:21:08 | DockerException: Error while fetching server API version: ('Connection aborted.', error(13, 'Permission denied'))
2018-06-05 12:21:08 | clean_up UploadImage: Error while fetching server API version: ('Connection aborted.', error(13, 'Permission denied'))
2018-06-05 12:21:08 | Traceback (most recent call last):
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 134, in run
2018-06-05 12:21:08 | ret_val = super(OpenStackShell, self).run(argv)
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/cliff/app.py", line 278, in run
2018-06-05 12:21:08 | result = self.run_subcommand(remainder)
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 169, in run_subcommand
2018-06-05 12:21:08 | ret_value = super(OpenStackShell, self).run_subcommand(argv)
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/cliff/app.py", line 399, in run_subcommand
2018-06-05 12:21:08 | result = cmd.run(parsed_args)
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/tripleoclient/command.py", line 25, in run
2018-06-05 12:21:08 | super(Command, self).run(parsed_args)
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
2018-06-05 12:21:08 | return super(Command, self).run(parsed_args)
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/cliff/command.py", line 184, in run
2018-06-05 12:21:08 | return_code = self.take_action(parsed_args) or 0
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/container_image.py", line 78, in take_action
2018-06-05 12:21:08 | uploader.upload()
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 114, in upload
2018-06-05 12:21:08 | uploader.run_tasks()
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 455, in run_tasks
2018-06-05 12:21:08 | result = self.upload_image(*first)
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 237, in upload_image
2018-06-05 12:21:08 | dockerc = Client(base_url='unix://var/run/docker.sock', version='auto')
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/docker/api/client.py", line 147, in __init__
2018-06-05 12:21:08 | self._version = self._retrieve_server_version()
2018-06-05 12:21:08 | File "/usr/lib/python2.7/site-packages/docker/api/client.py", line 174, in _retrieve_server_version
2018-06-05 12:21:08 | 'Error while fetching server API version: {0}'.format(e)
2018-06-05 12:21:08 | DockerException: Error while fetching server API version: ('Connection aborted.', error(13, 'Permission denied'))
2018-06-05 12:21:08 |
2018-06-05 12:21:08 | END return value: 1

Job:
tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades FAILURE in 1h 59m 35s (non-voting)
Change: https://review.openstack.org/#/c/572187/

Last successful job: experimental
tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades SUCCESS in 2h 43m 00s (non-voting)
http://logs.openstack.org/25/558425/18/experimental/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/31d0f31/

Observed strange errors like:
http://logs.openstack.org/87/572187/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/ca1ee14/job-output.txt.gz#_2018-06-05_10_36_36_957834

Changed in tripleo:
milestone: rocky-2 → rocky-3
wes hayutin (weshayutin) on 2018-06-05
Changed in tripleo:
status: New → Triaged
tags: added: alert
Steve Baker (steve-stevebaker) wrote :

Its not actually skopeo raising this error, its docker-py creating a client object.

This happens when the deploy user isn't in the docker group, or the shell session hasn't been refreshed since the groupadd.

Changed in tripleo:
assignee: nobody → Steve Baker (steve-stevebaker)
Steve Baker (steve-stevebaker) wrote :

The group add appears to be happening as expected:

2018-06-05 11:11:57 | 2018-06-05 11:11:57,760 INFO: Notice: /Stage[main]/Main/Group[docker]/ensure: created
2018-06-05 11:11:57 | 2018-06-05 11:11:57,798 INFO: Notice: /Stage[main]/Main/User[docker_user]/groups: groups changed 'mock' to ['docker', 'mock']

http://logs.openstack.org/87/572187/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/ca1ee14/logs/undercloud/home/zuul/undercloud_install.log.txt.gz#_2018-06-05_11_11_57

summary: - skopeo permission denied on upgrade prep containers
+ docker permission denied on upgrade prep containers
Marios Andreou (marios-b) wrote :

Looking at the container to container upgrade job ( gate-tripleo-ci-centos-7-container-to-container-upgrades-master-nv) on sbaker debug review @ https://review.openstack.org/#/c/572663/1 the the overcloud-prep-containers actually passes OK https://logs.rdoproject.org/63/572663/1/openstack-check/gate-tripleo-ci-centos-7-container-to-container-upgrades-master-nv/Zaa1df255c82e4686b259750e28fb9e47/undercloud/home/jenkins/overcloud_prep_containers.log.txt.gz#_2018-06-06_03_21_38

The failure is for the overcloud deploy - the heat stack is created OK (this is config download) but the host_prep_tasks are failing here https://logs.rdoproject.org/63/572663/1/openstack-check/gate-tripleo-ci-centos-7-container-to-container-upgrades-master-nv/Zaa1df255c82e4686b259750e28fb9e47/undercloud/home/jenkins/overcloud_deploy.log.txt.gz#_2018-06-06_04_00_01

I think this is a different bug though (and fwiw I can't spot the problem in the host_prep_tasks or the conditional that actually fails in https://logs.rdoproject.org/63/572663/1/openstack-check/gate-tripleo-ci-centos-7-container-to-container-upgrades-master-nv/Zaa1df255c82e4686b259750e28fb9e47/undercloud/var/lib/mistral/abe945bd-153b-41b0-839f-339527656491/Controller/host_prep_tasks.yaml.txt.gz and https://logs.rdoproject.org/63/572663/1/openstack-check/gate-tripleo-ci-centos-7-container-to-container-upgrades-master-nv/Zaa1df255c82e4686b259750e28fb9e47/undercloud/var/lib/mistral/abe945bd-153b-41b0-839f-339527656491/deploy_steps_playbook.yaml.txt.gz )

Marios Andreou (marios-b) wrote :

so to update, for sbaker test review it looks like the scenario000-multinode-oooq-container-upgrades job is failing for the same reason as this bug

http://logs.openstack.org/63/572663/1/experimental/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/01a5631/logs/undercloud/home/zuul/

The initial prep images is fine at http://logs.openstack.org/63/572663/1/experimental/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/01a5631/logs/undercloud/home/zuul/overcloud_prep_containers.log.txt.gz but then when it is run again for upgrade you can see it fails at http://logs.openstack.org/63/572663/1/experimental/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/01a5631/logs/undercloud/home/zuul/upgrade_overcloud_prep_containers.log.txt.gz

from sbaker debug you can see the docker group is missing in the second case?

2018-06-06 09:48:14 | + groups
2018-06-06 09:48:14 | zuul mock

vs

2018-06-06 09:08:17 | + groups
2018-06-06 09:08:17 | zuul mock docker

Marios Andreou (marios-b) wrote :
Download full text (3.4 KiB)

tl;dr _something_ is removing the zuul user from the docker groups and that is the root cause of the bug.

Looking at undercloud journal just before the failing overcloud prep containers you can see (this is for the originally reference change in the bug description @ http://logs.openstack.org/87/572187/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/ca1ee14/logs/undercloud):

    Jun 05 12:19:36 centos-7-rax-dfw-0004294805 sudo[17190]: zuul : TTY=pts/1 ; PWD=/home/zuul ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-banjdiooyiqxxhrkxjetvzfpsuqhlbcj; /usr/bin/python /home/zuul/.ansible/tmp/ansible-tmp-1528201174.53-40600594451444/user.py; rm -rf "/home/zuul/.ansible/tmp/ansible-tmp-1528201174.53-40600594451444/" > /dev/null 2>&1
    Jun 05 12:19:36 centos-7-rax-dfw-0004294805 python[17193]: ansible-user Invoked with comment=None ssh_key_bits=0 update_password=always non_unique=False force=False ssh_key_type=rsa ssh_key_passphrase=NOT_LOGGING_PARAMETER createhome=True uid=None home=None append=False skeleton=None ssh_key_comment=ansible-generated on centos-7-rax-dfw-0004294805 group=None system=False state=present local=None shell=None expires=None ssh_key_file=None groups=['mock'] move_home=False password=NOT_LOGGING_PARAMETER name=zuul seuser=None remove=False login_class=None generate_ssh_key=None
    Jun 05 12:19:37 centos-7-rax-dfw-0004294805 usermod[17201]: delete 'zuul' from group 'docker'
    Jun 05 12:19:37 centos-7-rax-dfw-0004294805 usermod[17201]: delete 'zuul' from shadow group 'docker'

(the prep containers, which fails is just after this starting @ 12:20 http://logs.openstack.org/87/572187/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/ca1ee14/logs/undercloud/home/zuul/upgrade_overcloud_prep_containers.log.txt.gz#_2018-06-05_12_21_08 ).

You can see the same thing on sbaker test review (undercloud journal @ http://logs.openstack.org/63/572663/1/experimental/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/01a5631/logs/undercloud/var/log/journal.txt.gz)

    Jun 06 09:46:59 centos-7-rax-iad-0004313964 sudo[18175]: zuul : TTY=pts/1 ; PWD=/home/zuul ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-aqycodoijcsiqhjqpomuilxhhwadxqxi; /usr/bin/python /home/zuul/.ansible/tmp/ansible-tmp-1528278417.64-135284418991881/user.py; rm -rf "/home/zuul/.ansible/tmp/ansible-tmp-1528278417.64-135284418991881/" > /dev/null 2>&1
    Jun 06 09:46:59 centos-7-rax-iad-0004313964 python[18178]: ansible-user Invoked with comment=None ssh_key_bits=0 update_password=always non_unique=False force=False ssh_key_type=rsa ssh_key_passphrase=NOT_LOGGING_PARAMETER createhome=True uid=None home=None append=False skeleton=None ssh_key_comment=ansible-generated on centos-7-rax-iad-0004313964 group=None system=False state=present local=None shell=None expires=None ssh_key_file=None groups=['mock'] move_home=False password=NOT_LOGGING_PARAMETER name=zuul seuser=None remove=False login_class=None generate_ssh_key=None
    Jun 06 09:46:59 centos-7-rax-iad-0004313964 usermod[18186]: delete 'zuul' from group 'docker'
    Jun 06 09:46:59 centos-7-rax-iad-0004313964 usermod[18186]: dele...

Read more...

Looking at the toci logs, we can see that the "Add user to mock group" is running at that exact time[1]:

    TASK [build-test-packages : Add user to mock group] ****************************
    Tuesday 05 June 2018 12:19:34 +0000 (0:00:02.888) 0:00:14.307 **********
    changed: [undercloud]

Pushing a review that may fix the issue here https://review.openstack.org/#/c/573229/

Waiting for ci ...

[1] complete logs: http://logs.openstack.org/87/572187/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/ca1ee14/ara-report/result/68983110-96cf-405b-b4c6-ce880ae12307/

Changed in tripleo:
assignee: Steve Baker (steve-stevebaker) → Sofer Athlan-Guyot (sofer-athlan-guyot)

Reviewed: https://review.openstack.org/573229
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=de06a05f69f86d6a5cf31e7fbb5701038244e6d2
Submitter: Zuul
Branch: master

commit de06a05f69f86d6a5cf31e7fbb5701038244e6d2
Author: Sofer Athlan-Guyot <email address hidden>
Date: Thu Jun 7 15:17:03 2018 +0200

    Make sure we append mock group to the ansible user.

    This make sure we don't drop important groups like docker from the
    ansible user.

    Change-Id: Ife79266566af23a0d8aa6af92ec8b59875ca8ad6
    Closes-Bug: #1775209

Changed in tripleo:
status: Triaged → Fix Released

Change abandoned by Steve Baker (<email address hidden>) on branch: master
Review: https://review.openstack.org/572663

This issue was fixed in the openstack/tripleo-quickstart-extras 2.1.1 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers