openstack overcloud container image upload fails from time to time

Bug #1749663 reported by Michele Baldessari
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Steve Baker

Bug Description

Via https://bugs.launchpad.net/tripleo/+bug/1746305 and https://review.openstack.org/#/c/539383/ we made the image upload somewhat more resilient, but I am still hitting the failure rather often:
Extracting
Extracting
Extracting
Pull complete
imagename: docker.io/tripleomaster/centos-binary-neutron-dhcp-agent:current-tripleo-rdo
Trying paths: ['/home/stack/.docker/config.json', '/home/stack/.dockercfg']
No config file found
None: None
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 400, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/tripleoclient/command.py", line 25, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/container_image.py", line 66, in take_action
    uploader.upload()
  File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 95, in upload
    uploader.run_tasks()
  File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 380, in run_tasks
    for result in p.map(docker_upload, self.upload_tasks):
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
ReadTimeoutError: None: None
clean_up UploadImage: None: None
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 134, in run
    ret_val = super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 279, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 169, in run_subcommand
    ret_value = super(OpenStackShell, self).run_subcommand(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 400, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/tripleoclient/command.py", line 25, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/container_image.py", line 66, in take_action
    uploader.upload()
  File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 95, in upload
    uploader.run_tasks()
  File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 380, in run_tasks
    for result in p.map(docker_upload, self.upload_tasks):
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
ReadTimeoutError: None: None

END return value: 1
http://localhost:None "GET /version HTTP/1.1" 200 278
Running skopeo inspect docker://docker.io/tripleomaster/centos-binary-neutron-dhcp-agent:current-tripleo-rdo

Note that I am using a local mirror, so network problems should (hopefully) not be a recurring cause. In /etc/docker/daemon.json I have:
{"registry-mirrors":["http://mrg-09.mpc.lab.eng.bos.redhat.com:5000"],"debug":true}
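To confirm which mirrors the daemon is actually configured with, the file can be parsed directly. This is only a sketch: the JSON string is copied from the line above, and the helper name `registry_mirrors` is made up for illustration.

```python
import json

# Contents of /etc/docker/daemon.json as quoted in this report.
daemon_json = (
    '{"registry-mirrors":["http://mrg-09.mpc.lab.eng.bos.redhat.com:5000"],'
    '"debug":true}'
)


def registry_mirrors(text):
    """Return the list of configured registry mirrors (empty if none)."""
    config = json.loads(text)
    return config.get("registry-mirrors", [])


mirrors = registry_mirrors(daemon_json)
for mirror in mirrors:
    print("mirror:", mirror)
```

On a real host the same helper could be fed the text read from /etc/docker/daemon.json instead of the inline string.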

I think the relevant docker errors when this happens are these:
Feb 15 08:59:10 undercloud dockerd-current[6188]: time="2018-02-15T08:59:10.005057742Z" level=debug msg="Calling GET /version"
Feb 15 08:59:10 undercloud dockerd-current[6188]: time="2018-02-15T08:59:10.005257946Z" level=debug msg="{Action=version, Username=stack, LoginUID=1000, PID=11260}"
Feb 15 08:59:10 undercloud dockerd-current[6188]: time="2018-02-15T08:59:10.014262853Z" level=error msg="Error trying v2 registry: context canceled"
Feb 15 08:59:10 undercloud dockerd-current[6188]: time="2018-02-15T08:59:10.014298212Z" level=error msg="Not continuing with pull after error: context canceled"

Tags: containers
Steve Baker (steve-stevebaker) wrote :

This looks like a multiprocessing timeout error, so I guess the child process didn't respond in time because the pull was taking a while.

I'll look into a fix that calls Pool.map in a way that allows a longer, custom timeout.

I might also use this bug to switch the pull/push retry loops to use tenacity.

Changed in tripleo:
assignee: nobody → Steve Baker (steve-stevebaker)
Steve Baker (steve-stevebaker) wrote :

Hmm, actually ReadTimeoutError comes from urllib3, so it's probably coming from a docker client call. Forget what I said about multiprocessing.
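The traceback ends in multiprocessing/pool.py only because the pool's map() re-raises, in the caller, whatever exception a worker hit. A minimal sketch of that behaviour follows. It assumes a thread-backed pool so that it runs anywhere, and the `ReadTimeoutError` class and the body of `docker_upload` are hypothetical stand-ins, not the tripleo-common code.

```python
from multiprocessing.pool import ThreadPool


class ReadTimeoutError(Exception):
    """Hypothetical stand-in for urllib3's ReadTimeoutError."""


def docker_upload(task):
    # Simulate one task whose registry call times out inside the worker.
    if task == "slow-image":
        raise ReadTimeoutError("None: None")
    return task


def run_tasks(tasks):
    pool = ThreadPool(2)
    try:
        # map() gathers results; a worker's exception is stored and
        # re-raised here in the caller, not where it first occurred.
        return pool.map(docker_upload, tasks)
    finally:
        pool.close()
        pool.join()


try:
    run_tasks(["fast-image", "slow-image"])
    outcome = "no error"
except ReadTimeoutError as exc:
    outcome = "caller saw %s" % type(exc).__name__

print(outcome)
```

So the last frames of the traceback show where the error surfaced, while the exception type points at where it originated.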

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/548914

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/548914
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=13678cb045ca0ab1abc72455b7f71b7cce7406bc
Submitter: Zuul
Branch: master

commit 13678cb045ca0ab1abc72455b7f71b7cce7406bc
Author: Steve Baker <email address hidden>
Date: Thu Mar 1 10:50:40 2018 +0000

    Use tenacity for image upload retries

    This change replaces the custom image pull/push retry loop with a
    tenacity annotation.

    The docker client uses requests, so this fixes an issue where requests
    can raise a vendored urllib3 exception if the registry doesn't respond
    at all. This will now be caught by tenacity and treated as a retry
    condition.

    Change-Id: Ifa08f21ccbc1bf04732462d28449546cf146c53a
    Closes-Bug: #1749663
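
The retry behaviour described in the commit message can be sketched roughly as below. This is a stdlib-only stand-in so that it runs without tenacity installed; tenacity's `retry` decorator packages the same idea. The exception classes, the `retry` helper, and `pull_image` are all hypothetical names for illustration, not the merged change.

```python
import time


class ConnectionFailure(Exception):
    """Hypothetical stand-in for a requests-level exception."""


class VendoredReadTimeout(Exception):
    """Hypothetical stand-in for the vendored urllib3 ReadTimeoutError."""


# Both kinds of failure are treated as a reason to try again.
RETRYABLE = (ConnectionFailure, VendoredReadTimeout)


def retry(exceptions, attempts=5, delay=0.0):
    """Retry the wrapped call on the given exceptions, up to `attempts` times."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"count": 0}


@retry(RETRYABLE, attempts=5)
def pull_image(name):
    # Fail twice with the "registry did not respond" error, then succeed.
    calls["count"] += 1
    if calls["count"] < 3:
        raise VendoredReadTimeout("registry did not respond")
    return "pulled %s" % name


result = pull_image("centos-binary-neutron-dhcp-agent")
print(result, "after", calls["count"], "attempts")
```

The point of the fix is the contents of the retryable set: an exception type that the old loop did not catch now counts as a retry condition instead of aborting the whole upload.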

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 8.5.0

This issue was fixed in the openstack/tripleo-common 8.5.0 release.
