Via https://bugs.launchpad.net/tripleo/+bug/1746305 and https://review.openstack.org/#/c/539383/ we made the image upload a bit more resilient, but I seem to still be hitting this rather often:
Extracting
Extracting
Extracting
Pull complete
imagename: docker.io/tripleomaster/centos-binary-neutron-dhcp-agent:current-tripleo-rdo
Trying paths: ['/home/stack/.docker/config.json', '/home/stack/.dockercfg']
No config file found
None: None
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 400, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/tripleoclient/command.py", line 25, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/container_image.py", line 66, in take_action
    uploader.upload()
  File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 95, in upload
    uploader.run_tasks()
  File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 380, in run_tasks
    for result in p.map(docker_upload, self.upload_tasks):
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
ReadTimeoutError: None: None
clean_up UploadImage: None: None
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 134, in run
    ret_val = super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 279, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 169, in run_subcommand
    ret_value = super(OpenStackShell, self).run_subcommand(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 400, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/tripleoclient/command.py", line 25, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/container_image.py", line 66, in take_action
    uploader.upload()
  File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 95, in upload
    uploader.run_tasks()
  File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 380, in run_tasks
    for result in p.map(docker_upload, self.upload_tasks):
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
ReadTimeoutError: None: None
END return value: 1
http://localhost:None "GET /version HTTP/1.1" 200 278
Running skopeo inspect docker://docker.io/tripleomaster/centos-binary-neutron-dhcp-agent:current-tripleo-rdo
Note that I am using a local registry mirror, so the network should not be a constant source of problems (hopefully). That is, in /etc/docker/daemon.json I have:
{"registry-mirrors":["http://mrg-09.mpc.lab.eng.bos.redhat.com:5000"],"debug":true}
I think the relevant docker errors when this happens are:
Feb 15 08:59:10 undercloud dockerd-current[6188]: time="2018-02-15T08:59:10.005057742Z" level=debug msg="Calling GET /version"
Feb 15 08:59:10 undercloud dockerd-current[6188]: time="2018-02-15T08:59:10.005257946Z" level=debug msg="{Action=version, Username=stack, LoginUID=1000, PID=11260}"
Feb 15 08:59:10 undercloud dockerd-current[6188]: time="2018-02-15T08:59:10.014262853Z" level=error msg="Error trying v2 registry: context canceled"
Feb 15 08:59:10 undercloud dockerd-current[6188]: time="2018-02-15T08:59:10.014298212Z" level=error msg="Not continuing with pull after error: context canceled"
This looks like a multiprocessing timeout error, so I guess the child process didn't respond in time because the pull was taking a while.
I'll look into a fix that performs the Pool.map call in a way that allows specifying a custom, longer timeout.
I might also use this bug to switch the pull/push retry loops to tenacity.