Upgrade from Ocata to Pike fails to pull containers during ControllerDeployment_Step1

Bug #1714412 reported by Emilien Macchi
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Steve Baker

Bug Description

Upgrades from Ocata to Pike now fail at ControllerDeployment_Step1, when containers are pulled during the upgrade process:

http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/subnode-2/var/log/messages.txt.gz#_Aug_31_23_36_11

Snippet:

Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,318 ERROR: 45497 -- ERROR configuring haproxy",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,318 ERROR: 45497 -- ERROR configuring nova_placement",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,318 ERROR: 45497 -- ERROR configuring heat_api",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,318 ERROR: 45497 -- ERROR configuring nova_libvirt",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,318 ERROR: 45497 -- ERROR configuring rabbitmq",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring iscsid",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring heat",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring glance_api",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring keystone",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring nova",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring mysql",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring heat_api_cfn",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring memcached",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring crond",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring neutron"

Tags: alert ci upgrade
Bogdan Dobrelya (bogdando) wrote :

AFAIK, "http: server gave HTTP response to HTTPS client" means the docker is missing the insecure registry option applied, like missing it in the config or having it in the config file, but puppet failed/skipped/unscheduled the docker service restart. The root cause for the latter may be the "bad" initial state of the /etc/sysconfig/docker file so augeas fails to parse it (this can be made visible with 'puppet -tvd' or the like). What is 192.168.24.1 and how it's get configured for tripleo-ci? We need to make a local reproduce and add the aforementioned debug info.

A registry connectivity test could also help to dig up some details:

curl -s 192.168.24.1:8787/v2/_catalog | python -mjson.tool

It would be nice to have that output in the debug results.
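
For reference, a minimal sketch of the node-side checks this implies (assuming the registry is 192.168.24.1:8787 and the CentOS 7 /etc/sysconfig/docker layout; option and section names may differ):

# is the insecure-registry option present in the config file?
grep -i insecure /etc/sysconfig/docker

# has the running daemon actually picked it up, i.e. was docker restarted after the change?
docker info 2>/dev/null | grep -A1 -i 'insecure registries'

# if the option is in the file but not in 'docker info', restart the daemon
sudo systemctl restart docker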

Marios Andreou (marios-b) wrote :

o/ hi, I spent some time looking here too - no root cause yet, but some info & pointers in case they are useful for anyone else:

* upgrade_tasks seems to complete step5 @ http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/subnode-2/var/log/messages.txt.gz#_Aug_31_23_34_01

* I thought the error here from docker-storage-setup, just before the image errors, might be of significance: http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/subnode-2/var/log/messages.txt.gz#_Aug_31_23_36_07 - but git blame leads me to https://review.openstack.org/#/c/451916/7/spec/classes/tripleo_profile_base_docker_spec.rb (for "devicemapper"), and it's from March

* You can see the http request/response for the image logged on the undercloud too, here http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/undercloud/var/log/messages.txt.gz#_Aug_31_23_36_10

So it looks like the Docker Container Engine starts up OK and then, soon after, docker-cleanup runs (http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/subnode-2/var/log/messages.txt.gz#_Aug_31_23_36_08), which I think is what causes the dockerd-current and registry issues.

Anyone aware of any recent changes to docker images/registry config that may be behind this?
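
As a hedged sketch, one way to confirm that sequencing on the node (assuming the stock CentOS docker-cleanup unit is what ran; the unit names here are an assumption, not taken from the logs):

# relative ordering of dockerd and docker-cleanup (re)starts
sudo journalctl -u docker -u docker-cleanup --no-pager | tail -n 50

# is docker-cleanup driven by a timer on this node?
systemctl list-timers 'docker-cleanup*' --all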

Martin André (mandre) wrote :

It's a bit of a mystery to me as well.

When looking at the overcloud node docker setup, it seems we've successfully set 192.168.24.1:8787 as an insecure registry:

http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/subnode-2/var/log/extra/docker/docker_allinfo.log.txt.gz

So I don't think the "HTTP response to HTTPS client" error is relevant. According to https://docs.docker.com/registry/insecure/, docker will try https first anyway.

The logs on the undercloud suggest the images are uploaded correctly to the 192.168.24.1:8787 registry:
http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/undercloud/home/jenkins/overcloud_upgrade_console.log.txt.gz#_2017-08-31_23_02_12

But then, docker returns a 404 when queried for the images:
http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/undercloud/var/log/messages.txt.gz#_Aug_31_23_36_10
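
A quick cross-check of what the registry actually holds would be to query the v2 API directly from the undercloud (a sketch; centos-binary-rabbitmq is just one of the image names appearing later in the logs):

curl -s 192.168.24.1:8787/v2/_catalog | python -mjson.tool
curl -s 192.168.24.1:8787/v2/centos-binary-rabbitmq/tags/list | python -mjson.tool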

Martin André (mandre) wrote :

Looking at the undercloud's /var/log/messages, the push to the local registry doesn't seem to work so well after all:

Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 dockerd-current: time="2017-08-31T23:12:48.961420450Z" level=debug msg="Pushing repository: 192.168.24.1:8787/centos-binary-rabbitmq:latest"
Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 dockerd-current: time="2017-08-31T23:12:48.962981829Z" level=debug msg="Checking for presence of layer sha256:64745862514c538f29fa0f6e7ce64cd10f557e49f0d3bc44a6171e13aa2a955b (sha256:fff2c763c2f18eb11c18f042eb27abdfec8d2f96aea74804a26d2f5fc25c4a1b) in 192.168.24.1:8787/centos-binary-rabbitmq"
Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 dockerd-current: time="2017-08-31T23:12:48.963218174Z" level=debug msg="Checking for presence of layer sha256:09fecf9fbf7a5be06610b6408b4b3e3500d611528f2e0471d1472c7702c0eb5e (sha256:c93a5859334766b76f51f6164fdbb79fe5de4ac41f6164f1e04e75e5abdc2878) in 192.168.24.1:8787/centos-binary-rabbitmq"
Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 dockerd-current: time="2017-08-31T23:12:48.964151714Z" level=debug msg="Checking for presence of layer sha256:eeea14b858a4f352006973dcebe7ffa921f39b3f8a84fd13570cc8b15d128b53 (sha256:32d6f31a7a57f7cf8186e91c30df7f65ca1298a80ce163e5b7eaf6afdec87cd3) in 192.168.24.1:8787/centos-binary-rabbitmq"
Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 dockerd-current: time="2017-08-31T23:12:48.964314380Z" level=debug msg="Checking for presence of layer sha256:c14f1fbd13d19b1e443d3691f283217828ea7f9b6817076173c5aa5631667bf0 (sha256:a180cd39a22a920313281e235c4c943758d0d0ae30c428e2855906ff57fd5c3a) in 192.168.24.1:8787/centos-binary-rabbitmq"
Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 dockerd-current: time="2017-08-31T23:12:48.966393447Z" level=debug msg="Checking for presence of layer sha256:0d14e4801b0cb05dc561cb19cbf68f86971205f7469af432ac11829ed4bbb336 (sha256:253832471aa49670aab81ed492c62cf5027a072962730c86b01b4c5326ea482e) in 192.168.24.1:8787/centos-binary-rabbitmq"
Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 registry: time="2017-08-31T23:12:48Z" level=error msg="response completed with error" err.code="blob unknown" err.detail=sha256:c93a5859334766b76f51f6164fdbb79fe5de4ac41f6164f1e04e75e5abdc2878 err.message="blob unknown to registry" go.version=go1.7.4 http.request.host="192.168.24.1:8787" http.request.id=1648a1ef-3b4d-42bd-9c36-ed543b0c254f http.request.method=HEAD http.request.remoteaddr="192.168.24.1:41018" http.request.uri="/v2/centos-binary-rabbitmq/blobs/sha256:c93a5859334766b76f51f6164fdbb79fe5de4ac41f6164f1e04e75e5abdc2878" http.request.useragent="docker/1.12.6 go/go1.7.4 kernel/3.10.0-514.26.2.el7.x86_64 os/linux arch/amd64 UpstreamClient(docker-sdk-python/2.4.2)" http.response.contenttype="application/json; charset=utf-8" http.response.duration=10.875742ms http.response.status=404 http.response.written=157 instance.id=945b223b-1b12-493f-8060-b32042fc2aac vars.digest="sha256:c93a5859334766b76f51f6164fdbb79fe5de4ac41f6164f1e04e75e5abdc2878" vars.name=centos-binary-rabbitmq version="v2.6.0+unknown"

Martin André (mandre) wrote :

Same content in a pastebin so it's easier to read: http://paste.openstack.org/show/620216/

Martin André (mandre) wrote :

So it appears the "blob unknown" error in the docker logs is not a real error, because I can see the same thing in a successful job:

http://logs.openstack.org/37/487537/5/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/b00b9df/

I've tried turning on docker debug logging in https://review.openstack.org/#/c/500112/; let's see if that provides more information.
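
For context: the "blob unknown" lines are the registry answering the HEAD existence checks docker issues before pushing each layer; a 404 there simply means "layer not uploaded yet, please push it". One of those checks can be reproduced by hand (digest copied from the log above):

curl -sI 192.168.24.1:8787/v2/centos-binary-rabbitmq/blobs/sha256:c93a5859334766b76f51f6164fdbb79fe5de4ac41f6164f1e04e75e5abdc2878
# 404 before the layer has been pushed, 200 once it is present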

Steve Baker (steve-stevebaker) wrote :

The insecure_registry is in the hieradata as expected [1], but the datestamp on service_configs.json is 6 minutes after the docker error <finger on chin emoji>.

I'll look for evidence of when the puppet class ::tripleo::profile::base::docker is being applied.

[1] http://logs.openstack.org/12/500112/3/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/7735d7e/logs/subnode-2/etc/puppet/hieradata/service_configs.json.txt.gz
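
A small sketch of the timestamp comparison that observation suggests (paths taken from the linked logs; purely illustrative):

# when was the hieradata written vs. when did docker log the errors?
stat -c '%y %n' /etc/puppet/hieradata/service_configs.json
sudo journalctl -u docker --no-pager | grep -iE 'insecure|error' | tail -n 20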

Steve Baker (steve-stevebaker) wrote :

My current theory is that the step_config from puppet/services/docker.yaml ("include ::tripleo::profile::base::docker") is not being applied in the upgrade flow, at least not before AllNodesPostUpgradeSteps.ControllerDeployment_Step1, which is the first thing that attempts to pull an image.

Someone with a better understanding of the upgrade flow might know when this should be running.
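
If that theory holds, a manual check on the node would be to apply the class directly and see whether the docker config and restart then happen (a sketch only; in the real flow this is wired through the docker.yaml step_config and hieradata):

sudo puppet apply --detailed-exitcodes -e 'include ::tripleo::profile::base::docker'
# then confirm the daemon picked up the registry option
docker info 2>/dev/null | grep -A1 -i 'insecure registries'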

Emilien Macchi (emilienm) wrote :

Trying a revert: https://review.openstack.org/500246
That patch was the first one to fail in the gate (I looked in logstash), so it may be related.

Steve Baker (steve-stevebaker) wrote :

If the --pull-source deprecation is the cause, this already-proposed change would help:

https://review.openstack.org/#/c/493391/

gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv passes on this change; I'll recheck to make sure.

If this works, I'd rather fail forward and not revert https://review.openstack.org/500246, since we need that change to support ceph images properly in the gate.
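
For clarity, the shape of the change is roughly the following (the --pull-source and --namespace flags come from the commit message; the command line and namespace values here are illustrative assumptions, not copied from the quickstart-extras script):

# before: deprecated pull source plus short namespace
openstack overcloud container image prepare --pull-source 192.168.24.1:8787 --namespace tripleoupstream ...

# after: a single fully-qualified namespace, no --pull-source
openstack overcloud container image prepare --namespace 192.168.24.1:8787/tripleoupstream ...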

Changed in tripleo:
assignee: nobody → Steve Baker (steve-stevebaker)
status: Triaged → In Progress
Emilien Macchi (emilienm) wrote :

Steve, it worked fine. I went ahead and approved the patch. Someone will have to explain why it was supposed to be backward compatible when it actually broke upgrades.

Could we make sure this process is well documented for our users?

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.openstack.org/493391
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=e8e41e2d9d8a81b158683603362813c4e0c3564d
Submitter: Jenkins
Branch: master

commit e8e41e2d9d8a81b158683603362813c4e0c3564d
Author: Steve Baker <email address hidden>
Date: Mon Aug 14 13:18:38 2017 +1200

    Stop using deprecated --pull-source

    --pull-source has been deprecated since it interacts poorly with
    non-kolla images such as ceph. This change switches to fully-qualified
    --namespace arguments instead.

    Also the --exclude ceph is no longer required due to the service based
    image filtering.

    Change-Id: I7f7950ba7c95d42f23246932fb27f1ab457590d1
    Related-Bug: #1708303
    Related-Bug: #1714225
    Closes-Bug: #1714412

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-quickstart-extras 2.1.1

This issue was fixed in the openstack/tripleo-quickstart-extras 2.1.1 release.
