Upgrade from Ocata to Pike fails to pull containers during ControllerDeployment_Step1

Bug #1714412 reported by Emilien Macchi
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Steve Baker

Bug Description

Upgrades from Ocata to Pike now fail at ControllerDeployment_Step1, when containers are pulled during the upgrade process:

http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/subnode-2/var/log/messages.txt.gz#_Aug_31_23_36_11

Snippet:

Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,318 ERROR: 45497 -- ERROR configuring haproxy",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,318 ERROR: 45497 -- ERROR configuring nova_placement",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,318 ERROR: 45497 -- ERROR configuring heat_api",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,318 ERROR: 45497 -- ERROR configuring nova_libvirt",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,318 ERROR: 45497 -- ERROR configuring rabbitmq",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring iscsid",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring heat",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring glance_api",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring keystone",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring nova",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring mysql",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring heat_api_cfn",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring memcached",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring crond",
Aug 31 23:36:11 centos-7-2-node-rax-ord-10727868-862288 os-collect-config: "2017-08-31 23:36:11,319 ERROR: 45497 -- ERROR configuring neutron"

Tags: alert ci upgrade
Bogdan Dobrelya (bogdando) wrote :

AFAIK, "http: server gave HTTP response to HTTPS client" means the docker is missing the insecure registry option applied, like missing it in the config or having it in the config file, but puppet failed/skipped/unscheduled the docker service restart. The root cause for the latter may be the "bad" initial state of the /etc/sysconfig/docker file so augeas fails to parse it (this can be made visible with 'puppet -tvd' or the like). What is 192.168.24.1 and how it's get configured for tripleo-ci? We need to make a local reproduce and add the aforementioned debug info.

A registry connectivity test could also help to dig up some details:

curl -s 192.168.24.1:8787/v2/_catalog | python -mjson.tool

It would be nice to have that output in the debug results.
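
For reference, a minimal sketch of the node-side checks this implies (assuming the registry is 192.168.24.1:8787 and the CentOS 7 /etc/sysconfig/docker layout; option and section names may differ):

# is the insecure-registry option present in the config file?
grep -i insecure /etc/sysconfig/docker

# has the running daemon actually picked it up, i.e. was docker restarted after the change?
docker info 2>/dev/null | grep -A1 -i 'insecure registries'

# if the option is in the file but not in 'docker info', restart the daemon
sudo systemctl restart docker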

Marios Andreou (marios-b) wrote :

o/ hi, I spent some time looking here too - no root cause yet, but some info & pointers in case they are useful for anyone else:

* upgrade_tasks seems to complete step5 @ http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/subnode-2/var/log/messages.txt.gz#_Aug_31_23_34_01

* I thought the error here from docker-storage-setup, just before the image errors, might be of significance: http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/subnode-2/var/log/messages.txt.gz#_Aug_31_23_36_07 - but git blame leads me to https://review.openstack.org/#/c/451916/7/spec/classes/tripleo_profile_base_docker_spec.rb (for "devicemapper"), and it's from March

* You can see the http request/response for the image logged on the undercloud too, here http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/undercloud/var/log/messages.txt.gz#_Aug_31_23_36_10

So it looks like the Docker Container Engine starts up OK and then, soon after, docker-cleanup runs (http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/subnode-2/var/log/messages.txt.gz#_Aug_31_23_36_08), which I think is what causes the dockerd-current and registry issues.

Anyone aware of any recent changes to docker images/registry config that may be behind this?
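
As a hedged sketch, one way to confirm that sequencing on the node (assuming the stock CentOS docker-cleanup unit is what ran; the unit names here are an assumption, not taken from the logs):

# relative ordering of dockerd and docker-cleanup (re)starts
sudo journalctl -u docker -u docker-cleanup --no-pager | tail -n 50

# is docker-cleanup driven by a timer on this node?
systemctl list-timers 'docker-cleanup*' --all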

Martin André (mandre) wrote :

It's a bit of a mystery to me as well.

When looking at the overcloud node docker setup, it seems we've successfully set 192.168.24.1:8787 as an insecure registry:

http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/subnode-2/var/log/extra/docker/docker_allinfo.log.txt.gz

So I don't think the "HTTP response to HTTPS client" error is relevant. According to https://docs.docker.com/registry/insecure/, docker will try https first anyway.

The logs on the undercloud suggest the images are uploaded correctly to the 192.168.24.1:8787 registry:
http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/undercloud/home/jenkins/overcloud_upgrade_console.log.txt.gz#_2017-08-31_23_02_12

But then, docker returns a 404 when queried for the images:
http://logs.openstack.org/57/498057/7/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/d714b99/logs/undercloud/var/log/messages.txt.gz#_Aug_31_23_36_10
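
A quick cross-check of what the registry actually holds would be to query the v2 API directly from the undercloud (a sketch; centos-binary-rabbitmq is just one of the image names appearing later in the logs):

curl -s 192.168.24.1:8787/v2/_catalog | python -mjson.tool
curl -s 192.168.24.1:8787/v2/centos-binary-rabbitmq/tags/list | python -mjson.tool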

Martin André (mandre) wrote :

Looking at the undercloud's /var/log/messages, the push to the local registry doesn't seem to work so well after all:

Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 dockerd-current: time="2017-08-31T23:12:48.961420450Z" level=debug msg="Pushing repository: 192.168.24.1:8787/centos-binary-rabbitmq:latest"
Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 dockerd-current: time="2017-08-31T23:12:48.962981829Z" level=debug msg="Checking for presence of layer sha256:64745862514c538f29fa0f6e7ce64cd10f557e49f0d3bc44a6171e13aa2a955b (sha256:fff2c763c2f18eb11c18f042eb27abdfec8d2f96aea74804a26d2f5fc25c4a1b) in 192.168.24.1:8787/centos-binary-rabbitmq"
Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 dockerd-current: time="2017-08-31T23:12:48.963218174Z" level=debug msg="Checking for presence of layer sha256:09fecf9fbf7a5be06610b6408b4b3e3500d611528f2e0471d1472c7702c0eb5e (sha256:c93a5859334766b76f51f6164fdbb79fe5de4ac41f6164f1e04e75e5abdc2878) in 192.168.24.1:8787/centos-binary-rabbitmq"
Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 dockerd-current: time="2017-08-31T23:12:48.964151714Z" level=debug msg="Checking for presence of layer sha256:eeea14b858a4f352006973dcebe7ffa921f39b3f8a84fd13570cc8b15d128b53 (sha256:32d6f31a7a57f7cf8186e91c30df7f65ca1298a80ce163e5b7eaf6afdec87cd3) in 192.168.24.1:8787/centos-binary-rabbitmq"
Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 dockerd-current: time="2017-08-31T23:12:48.964314380Z" level=debug msg="Checking for presence of layer sha256:c14f1fbd13d19b1e443d3691f283217828ea7f9b6817076173c5aa5631667bf0 (sha256:a180cd39a22a920313281e235c4c943758d0d0ae30c428e2855906ff57fd5c3a) in 192.168.24.1:8787/centos-binary-rabbitmq"
Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 dockerd-current: time="2017-08-31T23:12:48.966393447Z" level=debug msg="Checking for presence of layer sha256:0d14e4801b0cb05dc561cb19cbf68f86971205f7469af432ac11829ed4bbb336 (sha256:253832471aa49670aab81ed492c62cf5027a072962730c86b01b4c5326ea482e) in 192.168.24.1:8787/centos-binary-rabbitmq"
Aug 31 23:12:48 centos-7-2-node-rax-ord-10727868 registry: time="2017-08-31T23:12:48Z" level=error msg="response completed with error" err.code="blob unknown" err.detail=sha256:c93a5859334766b76f51f6164fdbb79fe5de4ac41f6164f1e04e75e5abdc2878 err.message="blob unknown to registry" go.version=go1.7.4 http.request.host="192.168.24.1:8787" http.request.id=1648a1ef-3b4d-42bd-9c36-ed543b0c254f http.request.method=HEAD http.request.remoteaddr="192.168.24.1:41018" http.request.uri="/v2/centos-binary-rabbitmq/blobs/sha256:c93a5859334766b76f51f6164fdbb79fe5de4ac41f6164f1e04e75e5abdc2878" http.request.useragent="docker/1.12.6 go/go1.7.4 kernel/3.10.0-514.26.2.el7.x86_64 os/linux arch/amd64 UpstreamClient(docker-sdk-python/2.4.2)" http.response.contenttype="application/json; charset=utf-8" http.response.duration=10.875742ms http.response.status=404 http.response.written=157 instance.id=945b223b-1b12-493f-8060-b32042fc2aac vars.digest="sha256:c93a5859334766b76f51f6164fdbb79fe5de4ac41f6164f1e04e75e5abdc2878" vars.name=centos-binary-rabbitmq version="v2.6.0+unknown"

Martin André (mandre) wrote :

Same content in a pastebin so it's easier to read: http://paste.openstack.org/show/620216/

Martin André (mandre) wrote :

So it appears the "blob unknown" error in the docker logs is not a real error, because I can see the same thing in a successful job:

http://logs.openstack.org/37/487537/5/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/b00b9df/

I've tried turning on docker debug logging in https://review.openstack.org/#/c/500112/; let's see if that provides more information.
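
For context: the "blob unknown" lines are the registry answering the HEAD existence checks docker issues before pushing each layer; a 404 there simply means "layer not uploaded yet, please push it". One of those checks can be reproduced by hand (digest copied from the log above):

curl -sI 192.168.24.1:8787/v2/centos-binary-rabbitmq/blobs/sha256:c93a5859334766b76f51f6164fdbb79fe5de4ac41f6164f1e04e75e5abdc2878
# 404 before the layer has been pushed, 200 once it is present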

Steve Baker (steve-stevebaker) wrote :

The insecure_registry is in the hieradata as expected [1], but the datestamp on service_configs.json is 6 minutes after the docker error <finger on chin emoji>.

I'll look for evidence of when the puppet class ::tripleo::profile::base::docker is being applied.

[1] http://logs.openstack.org/12/500112/3/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/7735d7e/logs/subnode-2/etc/puppet/hieradata/service_configs.json.txt.gz
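
A small sketch of the timestamp comparison that observation suggests (paths taken from the linked logs; purely illustrative):

# when was the hieradata written vs. when did docker log the errors?
stat -c '%y %n' /etc/puppet/hieradata/service_configs.json
sudo journalctl -u docker --no-pager | grep -iE 'insecure|error' | tail -n 20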

Steve Baker (steve-stevebaker) wrote :

My current theory is that the step_config from puppet/services/docker.yaml ("include ::tripleo::profile::base::docker") is not being applied in the upgrade flow, at least not before AllNodesPostUpgradeSteps.ControllerDeployment_Step1, which is the first thing that attempts to pull an image.

Someone with a better understanding of the upgrade flow might know when this should be running.
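
If that theory holds, a manual check on the node would be to apply the class directly and see whether the docker config and restart then happen (a sketch only; in the real flow this is wired through the docker.yaml step_config and hieradata):

sudo puppet apply --detailed-exitcodes -e 'include ::tripleo::profile::base::docker'
# then confirm the daemon picked up the registry option
docker info 2>/dev/null | grep -A1 -i 'insecure registries'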

Emilien Macchi (emilienm) wrote :

Trying a revert: https://review.openstack.org/500246
That patch was the first one to fail in the gate (I looked in logstash), so it may be related.

Steve Baker (steve-stevebaker) wrote :

If the --pull-source deprecation is the cause, this already-proposed change would help:

https://review.openstack.org/#/c/493391/

gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv passes on this change; I'll recheck to make sure.

If this works, I'd rather fail forward and not revert https://review.openstack.org/500246, since we need that change to support ceph images properly in the gate.
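
For clarity, the shape of the change is roughly the following (the --pull-source and --namespace flags come from the commit message; the command line and namespace values here are illustrative assumptions, not copied from the quickstart-extras script):

# before: deprecated pull source plus short namespace
openstack overcloud container image prepare --pull-source 192.168.24.1:8787 --namespace tripleoupstream ...

# after: a single fully-qualified namespace, no --pull-source
openstack overcloud container image prepare --namespace 192.168.24.1:8787/tripleoupstream ...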

Changed in tripleo:
assignee: nobody → Steve Baker (steve-stevebaker)
status: Triaged → In Progress
Emilien Macchi (emilienm) wrote :

Steve, it worked fine. I went ahead and approved the patch. Someone will have to explain why it was supposed to be backward compatible when it actually broke upgrades.

Could we make sure this process is well documented for our users?

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.openstack.org/493391
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=e8e41e2d9d8a81b158683603362813c4e0c3564d
Submitter: Jenkins
Branch: master

commit e8e41e2d9d8a81b158683603362813c4e0c3564d
Author: Steve Baker <email address hidden>
Date: Mon Aug 14 13:18:38 2017 +1200

    Stop using deprecated --pull-source

    --pull-source has been deprecated since it interacts poorly with
    non-kolla images such as ceph. This change switches to fully-qualified
    --namespace arguments instead.

    Also the --exclude ceph is no longer required due to the service based
    image filtering.

    Change-Id: I7f7950ba7c95d42f23246932fb27f1ab457590d1
    Related-Bug: #1708303
    Related-Bug: #1714225
    Closes-Bug: #1714412

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-quickstart-extras 2.1.1

This issue was fixed in the openstack/tripleo-quickstart-extras 2.1.1 release.
