Wrong versions of tripleo-common in container images updated in CI

Bug #1786764 reported by Martin Schuppert on 2018-08-13
This bug affects 5 people
Affects: tripleo
Importance: Critical
Assigned to: Emilien Macchi
Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
milestone: none → rocky-rc1
assignee: nobody → Gabriele Cerami (gcerami)
Bogdan Dobrelya (bogdando) wrote :

The longer we're running w/o promotions, the higher the chances of a timeout. That is inevitable as we update packages in containers for CI.

tags: added: ci
Bogdan Dobrelya (bogdando) wrote :

Shall we push for promotions, even having the promotion blocked, then fix the blockers? I think we should, otherwise everything will become blocked like this issue

Jose Luis Franco (jfrancoa) wrote :

After checking the logs for some time I found this in the mistral logs:

2018-08-10 17:12:35.386 7 ERROR mistral.engine.task_handler [req-2a2f73f7-9bab-4e1a-8517-18a2cdbb4b78 8472d43aa8d8497389896dd99b217bbc 333c5e13d8ab41dab559f11853625ce3 - default default] Failed to run task [error=Invalid input [name=tripleo.package_update.update_stack, class=tripleo_common.actions.package_update.UpdateStackAction, missing=['ceph_ansible_playbook']], wf=tripleo.package_update.v1.package_update_plan, task=update]:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mistral/engine/task_handler.py", line 63, in run_task
    task.run()
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 159, in wrapper
    result = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 390, in run
    self._run_new()
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 159, in wrapper
    result = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 419, in _run_new
    self._schedule_actions()
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 488, in _schedule_actions
    action.validate_input(input_dict)
  File "/usr/lib/python2.7/site-packages/mistral/engine/actions.py", line 326, in validate_input
    self.action_def.action_class
  File "/usr/lib/python2.7/site-packages/mistral/engine/utils.py", line 66, in validate_input
    raise exc.InputException(msg % tuple(msg_props))
InputException: Invalid input [name=tripleo.package_update.update_stack, class=tripleo_common.actions.package_update.UpdateStackAction, missing=['ceph_ansible_playbook']]
: InputException: Invalid input [name=tripleo.package_update.update_stack, class=tripleo_common.actions.package_update.UpdateStackAction, missing=['ceph_ansible_playbook']]
2018-08-10 17:12:35.395 7 INFO workflow_trace [req-2a2f73f7-9bab-4e1a-8517-18a2cdbb4b78 8472d43aa8d8497389896dd99b217bbc 333c5e13d8ab41dab559f11853625ce3 - default default] Task 'update' (a91e8766-d6d1-408d-950f-8d7892fd1fa7) [RUNNING -> ERROR, msg=Failed to run task [error=Invalid input [name=tripleo.package_update.update_stack, class=tripleo_common.actions.package_update.UpdateStackAction, missing=['ceph_ansible_playbook']], wf=tripleo.package_update.v1.package_update_plan, task=update]:

http://logs.openstack.org/83/590683/1/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/d4d4120/logs/undercloud/var/log/containers/mistral/engine.log.txt.gz#_2018-08-10_17_12_35_386
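
The `missing=[...]` part of the error above can be read as a comparison between the input handed to the action and the constructor signature of the action class that is actually installed. Below is a minimal sketch of that kind of check, assuming validation is driven by the `__init__` signature. The class `OldUpdateStackAction` and the helper `missing_inputs` are hypothetical stand-ins for illustration, not Mistral's or tripleo-common's real code.

```python
import inspect


class OldUpdateStackAction:
    """Hypothetical stand-in for a stale action class that still
    requires ceph_ansible_playbook."""

    def __init__(self, timeout, ceph_ansible_playbook, container="overcloud"):
        self.timeout = timeout
        self.ceph_ansible_playbook = ceph_ansible_playbook
        self.container = container


def missing_inputs(action_class, input_dict):
    """Return required __init__ parameters that are absent from input_dict."""
    params = inspect.signature(action_class.__init__).parameters
    named_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return [
        name
        for name, param in params.items()
        if name != "self"
        and param.kind in named_kinds
        and param.default is inspect.Parameter.empty
        and name not in input_dict
    ]


# A caller built from newer code no longer passes ceph_ansible_playbook.
print(missing_inputs(OldUpdateStackAction, {"timeout": 240}))
```

Under this model the error does not mean the caller is wrong. It only means the caller and the installed action class come from different versions of the code.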

Jose Luis Franco (jfrancoa) wrote :

The error was probably inserted by some of these patches: https://review.openstack.org/#/q/topic:external-update-upgrade

Jiří Stránský (jistr) wrote :

tripleo_common.actions.package_update.UpdateStackAction, missing=['ceph_ansible_playbook']

^ that must be caused by some stale content in containers. In the latest code there is no ceph_ansible_playbook in anything update/upgrade related, both in tripleo-common and in tripleoclient.

Jiří Stránský (jistr) wrote :

Here are CI results from 20 minutes ago and they're green:

https://review.openstack.org/#/c/591374/

tags: added: alert
Jiří Stránský (jistr) wrote :

The current issue is different: now it complains about a missing 'container_registry' instead of 'ceph_ansible_playbook'. It's a desync between tripleo-common and tripleoclient versions after the removal of deprecations we did recently. The content of repos is fine but apparently the containers are not. The fix is simple -- we need containers with fresh RPM content.

Jiří Stránský (jistr) wrote :

It's not about input to the *workflow*, it's about input to the *action*. I think what we have is a desync within mistral containers. Likely we have this patch in one of the mistral containers (e.g. API or engine) but we don't have it in another (e.g. executor):

https://review.openstack.org/#/c/571186/

What i said earlier on this bug still holds true i think -- all we need is fresh content of containers, AFAICT. When i deployed with `update_containers: true`, the problem disappeared in my dev env.

If we promoted recently and it didn't fix the problem, we should check our promotion process (did we leave out some image by accident maybe?).

Jiří Stránský (jistr) wrote :

Also, let's land this patch please, finishing the parameter removal: https://review.openstack.org/#/c/589487

Landing it is not a prerequisite to fixing the bug though. The state in repositories is already correct, likely it's container content which is broken.

Also notice that the job only fails on t-h-t patches and not on tripleo-common patches (on tripleo-common patches we actually do pull latest tripleo-common -- the one being tested -- into containers, i presume).

Jiří Stránský (jistr) wrote :

I saw a broken job here, the container images are "updated-20180815220725":

http://logs.openstack.org/02/573102/11/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/f89af9f/logs/undercloud/var/log/extra/docker/docker_allinfo.log.txt.gz

The error message talks about expecting this parameter which we recently removed:

https://review.openstack.org/#/c/571186/6/tripleo_common/actions/package_update.py

In other words, the error message talks about code which is not present anywhere since Aug 13. It shouldn't even know about that code. We have some problem in building the updated containers it seems.

Jiří Stránský (jistr) wrote :

The mistral action definitions are in its database. Is it possible that we populate the database with non-updated mistral code (pre- August 13) and then we actually run Mistral processes with an updated container (post- August 15), without re-running `sudo mistral-db-manage populate`? That would result in such error messages i think.

Jiří Stránský (jistr) wrote :

I pulled latest mistral-executor:current-tripleo image and it indeed has stale content.

docker.io/tripleomaster/centos-binary-mistral-executor current-tripleo bfb76e4acb94 47 hours ago 1.29 GB

()[mistral@57e2751fa018 /]$ grep -ri container_registry /usr/lib/python2.7/site-packages/tripleo_common
/usr/lib/python2.7/site-packages/tripleo_common/actions/package_update.py: def __init__(self, timeout, container_registry,
/usr/lib/python2.7/site-packages/tripleo_common/actions/package_update.py: self.container_registry = container_registry
/usr/lib/python2.7/site-packages/tripleo_common/actions/package_update.py: if self.container_registry is not None:
/usr/lib/python2.7/site-packages/tripleo_common/actions/package_update.py: update_env.update(self.container_registry)
/usr/lib/python2.7/site-packages/tripleo_common/actions/package_update.py: if self.container_registry is not None:
/usr/lib/python2.7/site-packages/tripleo_common/actions/package_update.py: parameters.update(self.container_registry['parameter_defaults'])

()[mistral@57e2751fa018 /]$ rpm -qa | grep tripleo-common
openstack-tripleo-common-9.2.1-0.20180811014734.336cd3c.el7.noarch
openstack-tripleo-common-containers-9.2.1-0.20180811014734.336cd3c.el7.noarch
python2-tripleo-common-9.2.1-0.20180811014734.336cd3c.el7.noarch
openstack-tripleo-common-container-base-9.2.1-0.20180811014734.336cd3c.el7.noarch

Jiří Stránský (jistr) wrote :

And then i ran with my locally updated mistral-executor image, "updated-20180814162422", and compared the results:

()[mistral@913f62985e6b /]$ grep -ri container_registry /usr/lib/python2.7/site-packages/tripleo_common

^ nothing :)

Marios Andreou (marios-b) wrote :

folks, adding a comment as i came from https://bugs.launchpad.net/tripleo/+bug/1787226 which is marked as a duplicate of this. I added a comment about the missing container registry issue as reported above by jfrancoa, abishop and others, so duplicating it here. If the ceph issue is different then let's use the two bugs, one for each?

(copy paste from https://bugs.launchpad.net/tripleo/+bug/1787226):

  * for the container registry parameter removal [1,2] it could only happen if you are using a python-tripleoclient with the change, and then a tripleo-common without it. There is a Depends-On though, and they both merged 3 days ago. The multinode-oooq-container-updates job is green there too. However, you can see the error in [4,5] and it looks like

  2018-08-15 10:11:40.190 ERROR /var/log/containers/mistral/engine.log: 7 ERROR mistral.engine.task_handler [req-fd8fbc0e-350e-4297-8a63-6eb0060ff25a 2bea853d47d643b38acfbd8f7c91504b 7f74b99fd5cc4cf0abaefee63f693317 - default default] Failed to run task [error=Invalid input [name=tripleo.package_update.update_stack, class=tripleo_common.actions.package_update.UpdateStackAction, missing=['container_registry']], wf=tripleo.package_update.v1.package_update_plan, task=update]:

  * for the nova cert error examples are at [5] (immediately following that container registry issue) and also [6] and looks like

  2018-08-15 09:32:20.998 ERROR /var/log/containers/mistral/mistral-db-manage.log: 11 ERROR mistral.actions.openstack.action_generator.base [-] Failed to create action: nova.certs_convert_into_with_meta: AttributeError: 'Client' object has no attribute 'certs'

[0] http://zuul.openstack.org/builds.html?job_name=tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates
[1] https://review.openstack.org/#/c/570893/ python-tripleoclient
[2] https://review.openstack.org/#/c/571186/ tripleo-common
[3] https://bugs.launchpad.net/tripleo/+bug/1787227
[4] http://logs.openstack.org/48/588148/3/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/8a5b8b2/logs/undercloud/var/log/extra/errors.txt.gz#_2018-08-16_08_22_08_249
[5] http://logs.openstack.org/70/585370/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/027eb5f/logs/undercloud/var/log/extra/errors.txt.gz#_2018-08-15_09_32_20_998
[6] http://logs.openstack.org/48/588148/3/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/8a5b8b2/logs/undercloud/var/log/extra/errors.txt.gz#_2018-08-16_07_33_58_362

Jiří Stránský (jistr) wrote :

The working container has:

()[mistral@af2c5169e7cf /]$ rpm -qa | grep tripleo-common
openstack-tripleo-common-container-base-9.2.1-0.20180814123159.042f43d.el7.noarch
openstack-tripleo-common-containers-9.2.1-0.20180814123159.042f43d.el7.noarch
openstack-tripleo-common-9.2.1-0.20180814123159.042f43d.el7.noarch
python2-tripleo-common-9.2.1-0.20180814123159.042f43d.el7.noarch

It would probably help if we'd be able to get to the containers which are used in CI and check the RPM versions there. Note that the containers in CI have the "updated" part of name *later* than what i have in my working local env, yet they still hit the problem. There might be something fishy in how we build/update containers for CI...
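
Checking RPM versions across containers, as suggested above, comes down to parsing `rpm -qa` lines and flagging packages whose builds differ. Here is a rough sketch of that comparison. The helper names and the container labels are hypothetical, and the two package strings are the builds quoted in this thread.

```python
def parse_nvra(line):
    """Split an `rpm -qa` line of the form name-version-release.arch."""
    nvr, arch = line.strip().rsplit(".", 1)
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release, arch


def mismatched_packages(containers):
    """Map package name -> {container: version-release} where builds differ."""
    seen = {}
    for container, lines in containers.items():
        for line in lines:
            name, version, release, _arch = parse_nvra(line)
            seen.setdefault(name, {})[container] = f"{version}-{release}"
    return {
        name: builds
        for name, builds in seen.items()
        if len(set(builds.values())) > 1
    }


# Hypothetical container labels; package strings copied from the comments above.
containers = {
    "stale-executor": [
        "python2-tripleo-common-9.2.1-0.20180811014734.336cd3c.el7.noarch",
    ],
    "working-executor": [
        "python2-tripleo-common-9.2.1-0.20180814123159.042f43d.el7.noarch",
    ],
}
for name, builds in mismatched_packages(containers).items():
    print(name, builds)
```

Feeding it the `rpm -qa` output of every Mistral container used by a CI job would show directly whether API, engine and executor carry the same tripleo-common build.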

Jiří Stránský (jistr) wrote :

Ok we may be closing in on the root cause:

https://github.com/openstack/tripleo-quickstart-extras/blob/ff959379665658b454df5507ea09cc265fffdb9e/roles/undercloud-deploy/templates/containers-prepare-parameter.yaml.j2#L15

https://github.com/openstack/ansible-role-tripleo-modify-image#role-variables

"If set, packages from this repo will be updated. Other repos will only be used for dependencies of these updates."

Perhaps thanks to ^^ jobs running on tripleo-common patches get correct/fresh tripleo-common, and update job passes, but jobs running on t-h-t patches don't get fresh tripleo-common, and update job fails. We either need to tag `current-tripleo` images much more often, or we need to use ansible-role-tripleo-modify-image to update more packages than just the one we're gating.

Jiří Stránský (jistr) wrote :

Had a call with Wes about this bug, it's probably two issues:

* Aside from updating RPMs from the gating repo, we should update the packages from delorean-current repo too.

* We should update all container images and not just some. (What triggered this bug was probably that different containers had different version of tripleo-common. Patch interdependencies probably played no part in this. One patch without any depends-on is enough to break things, if it happens to be installed e.g. in mistral-api container but not in mistral-executor container.)

Fix proposed to branch: master
Review: https://review.openstack.org/592577

Changed in tripleo:
assignee: Gabriele Cerami (gcerami) → wes hayutin (weshayutin)
status: Triaged → In Progress

The problem is that any container that is affected by [1] will get updated by dlrn-current too. This will mismatch the versions of [2]. So to ensure that the packages in [2] are all at the same level across all of the containers, we need to make the change in [3].

The concern is that the package update will take too long, and it might. However the includepkgs protects us a bit from that [4].

[1] packages_for_update="$(repoquery --disablerepo='*' --enablerepo={{ gating_repo_name }} --qf %{NAME} -a 2>{{ working_dir }}/repoquery.err.log | sort -u | xargs)"

[2] #includepkgs=instack,instack-undercloud,os-apply-config,os-collect-config,os-net-config,os-refresh-config,python-tripleoclient*,python*-tripleo-common,openstack-tripleo-*,puppet-*,python-paunch

[3] https://review.openstack.org/592577

[4] [root@undercloud yum.repos.d]# cat delorean-current.repo | grep include
includepkgs=instack,instack-undercloud,os-apply-config,os-collect-config,os-net-config,os-refresh-config,python-tripleoclient*,python*-tripleo-common,openstack-tripleo-*,puppet-*,python-paunch
[root@undercloud yum.repos.d]# repoquery -q --repoid=delorean-current | wc -l
221
[root@undercloud yum.repos.d]# vi delorean-current.repo
[root@undercloud yum.repos.d]# repoquery -q --repoid=delorean-current | wc -l
1091
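
The drop from 1091 to 221 packages is the effect of the `includepkgs` filter. As a rough illustration, assuming the patterns are matched as shell-style globs against package names (approximated here with Python's `fnmatch`, not yum's own matcher), the filter can be sketched like this. The candidate package list is made up for the example.

```python
from fnmatch import fnmatchcase

# Pattern list copied from the delorean-current.repo line quoted above.
INCLUDEPKGS = (
    "instack,instack-undercloud,os-apply-config,os-collect-config,"
    "os-net-config,os-refresh-config,python-tripleoclient*,"
    "python*-tripleo-common,openstack-tripleo-*,puppet-*,python-paunch"
).split(",")


def included(package_name, patterns=INCLUDEPKGS):
    """True if the package name matches at least one includepkgs pattern."""
    return any(fnmatchcase(package_name, pattern) for pattern in patterns)


# Illustrative candidates only, not a real repo listing.
candidates = [
    "python2-tripleo-common",
    "openstack-tripleo-common-containers",
    "openstack-mistral-executor",
    "puppet-tripleo",
]
print([name for name in candidates if included(name)])
```

Only names matching a pattern are visible from the repo, which is why the filter bounds how much an update from delorean-current can pull in.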

Change abandoned by wes hayutin (<email address hidden>) on branch: master
Review: https://review.openstack.org/592241

Changed in tripleo:
assignee: wes hayutin (weshayutin) → Jiří Stránský (jistr)

Reviewed: https://review.openstack.org/593169
Committed: https://git.openstack.org/cgit/openstack/ansible-role-tripleo-modify-image/commit/?id=7b587fe0f5b0b9ead2adbd9d450109f5fe1c6696
Submitter: Zuul
Branch: master

commit 7b587fe0f5b0b9ead2adbd9d450109f5fe1c6696
Author: Alex Schultz <email address hidden>
Date: Fri Aug 17 14:21:27 2018 -0600

    Only do yum update when needed

    Currently the logic for this results in a full yum update when no
    package updates are found in the provided repository. This can lead to
    job timeouts when nothing was built in CI because it effectively does a
    yum update on every container and applies other system packages rather
    than ones that actually changed.

    There is a larger issue in that we can still get out of sync with the
    host OS.

    Change-Id: Iaf41691ea3cb6e78186741ac5e15614fb73f89ff
    Related-Bug: #1786764
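
The behaviour described in this commit message can be pictured as a guard around the update command: an empty package list must mean "do nothing", not "update everything". This is a minimal sketch of that guard with a hypothetical helper name; it is not the role's actual task logic.

```python
def yum_update_command(packages_for_update):
    """Return the yum argv to run, or None when there is nothing to update."""
    packages = sorted(set(packages_for_update))
    if not packages:
        # `yum update` with no package arguments updates every installed
        # package, which is the full update the commit message warns about.
        return None
    return ["yum", "-y", "update"] + packages


print(yum_update_command([]))
print(yum_update_command(["python2-tripleo-common", "openstack-tripleo-common"]))
```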

Changed in tripleo:
assignee: Jiří Stránský (jistr) → Sorin Sbarnea (ssbarnea)

Reviewed: https://review.openstack.org/592577
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=5d8f67f74aa2e6d54263b0455ad16c35347a1bb5
Submitter: Zuul
Branch: master

commit 5d8f67f74aa2e6d54263b0455ad16c35347a1bb5
Author: Wes Hayutin <email address hidden>
Date: Thu Aug 16 12:46:33 2018 -0400

    add delorean-current to repolist for updates

    The list of containers that are updated are generated
    from a list of rpms from the gating repo. These
    containers are updated w/ the gating repo and dlrn-current.

    That makes the above set of containers out of sync w/
    the rest of the containers. The list of containers
    that are updated needs to include changes required
    by dlrn-current AND the gating repo.

    The --enablerepo parameter for repoquery seems to support
    comma-delimited lists, we'll take advantage of that so that we don't
    need edit ansible-role-tripleo-modify-image parameter interface.

    Co-Authored-By: Jiri Stransky <email address hidden>
    Closes-Bug: #1786764
    Change-Id: Ie12021ace7e9eb1695aa97ac5d97f3b948be9d86
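
The commit relies on `--enablerepo` accepting a comma-delimited list, so one query can cover both repos. A small sketch of building such a command line follows. The helper name is hypothetical, the flags are the ones from the `packages_for_update` snippet quoted earlier in this bug, and `gating-repo` stands in for whatever `gating_repo_name` is set to.

```python
import shlex


def repoquery_argv(repos):
    """Build a repoquery argv that lists package names from the given repos only."""
    return [
        "repoquery",
        "--disablerepo=*",
        "--enablerepo=" + ",".join(repos),
        "--qf", "%{NAME}",
        "-a",
    ]


argv = repoquery_argv(["gating-repo", "delorean-current"])
# Quote each argument so the line can be pasted into a shell as-is.
print(" ".join(shlex.quote(arg) for arg in argv))
```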

Changed in tripleo:
status: In Progress → Fix Released

This needs to be fixed for all upstream releases.

include_pkgs can be used on repos that have tripleo jobs in their gate.
do not use wild cards, be explicit

Changed in tripleo:
status: Fix Released → Triaged
Sorin Sbarnea (ssbarnea) wrote :

I am working now on updating the patch to avoid using wildcards for all versions.

Changed in tripleo:
milestone: rocky-rc1 → rocky-rc2

I am not sure what happens but while trying to configure a timeout for the update command in order to avoid failure to collect logs, I found the last line from http://logs.openstack.org/81/596381/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/41a018f/logs/undercloud/home/zuul/overcloud_update_prepare.log.txt.gz

WARNING tripleoclient.plugin [-] Waiting for messages on queue 'tripleo' with no timeout.

This makes me believe that this will never finish. My patch will improve the output and ease reading build logs but will not fix this bug.

Jiří Stránský (jistr) wrote :

The patch did not fix the issue, it's still very much the same problem:

InputException: Invalid input [name=tripleo.package_update.update_stack, class=tripleo_common.actions.package_update.UpdateStackAction, missing=['container_registry']]

There's still something broken in the way we run the ansible modify-image role, or in the modify-image role itself.

Jiří Stránský (jistr) wrote :

After promotion the updates job is now passing, but there's no reason to believe that the root cause is now fixed. I changed the bug title.

summary: - tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates times
- out on prepare
+ Wrong versions of tripleo-common in container images updated in CI
Jiří Stránský (jistr) wrote :

Sagi most likely found the root cause, here's what will hopefully fix things: https://review.openstack.org/#/c/598089/

Sorin Sbarnea (ssbarnea) on 2018-09-02
description: updated
Sorin Sbarnea (ssbarnea) wrote :

I don't think that we can close this because the scenario has never run successfully since Sagi's patch was merged two days ago.

http://cistatus.tripleo.org/#tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates

Still, the breakages could be caused by other reasons as I was not able to see the same error.

yatin (yatinkarel) wrote :

<<< Still, the breakages could be caused by other reasons as I was not able to see the same error.

>> Correct, The current failures are after https://review.openstack.org/#/c/573476/, which merged around the time for sagi's patch, updates jobs need to adopt changes as per bp/container-prepare-workflow(https://review.openstack.org/#/q/topic:bp/container-prepare-workflow+(status:open+OR+status:merged))

Alex Schultz (alex-schultz) wrote :

Moving milestone to Stein-1 as this is not required for Rocky RC2.

Changed in tripleo:
milestone: rocky-rc2 → stein-1
Sorin Sbarnea (ssbarnea) on 2018-09-04
Changed in tripleo:
assignee: Sorin Sbarnea (ssbarnea) → nobody

Fix proposed to branch: master
Review: https://review.openstack.org/600273

Changed in tripleo:
assignee: nobody → Steve Baker (steve-stevebaker)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/600277
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=a8b048e1162846d1460c51de706b4c08bf0a9473
Submitter: Zuul
Branch: master

commit a8b048e1162846d1460c51de706b4c08bf0a9473
Author: Steve Baker <email address hidden>
Date: Thu Sep 6 11:20:42 2018 +1200

    Don't set compare_host_packages:True

    Now that the undercloud is containerized, there will be very few host
    packages to compare to, so there is a high risk that required package
    updates will be skipped.

    This is a strategy inherited from container-update.py that was
    intended to avoid unnecessary calls to yum update, however we now have
    a better approach using the repoquery, so host package comparison is
    no longer required, and probably causing some of the instances of bug

    This strategy is removed from the role in change Iab7b9d6377494001d904bb84b058ea293d73110c

    Change-Id: I3bb0ba1f56daf475b7498283a5b7e6dcd1540e7d
    Partial-Bug: #1786764
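
To see why host comparison can skip required updates, consider a simplified model based only on the commit message above: an update is triggered only when a package installed in both the container and the host differs between them. The function name and the dict layout are assumptions for illustration, not the role's real implementation.

```python
def needs_update_by_host_comparison(container_rpms, host_rpms):
    """True if any package installed in both places has a different build."""
    shared = set(container_rpms) & set(host_rpms)
    return any(container_rpms[name] != host_rpms[name] for name in shared)


stale_container = {"python2-tripleo-common": "9.2.1-0.20180811014734.336cd3c.el7"}

# Non-containerized host: the newer build is installed, so the diff is seen.
old_style_host = {"python2-tripleo-common": "9.2.1-0.20180814123159.042f43d.el7"}
print(needs_update_by_host_comparison(stale_container, old_style_host))

# Containerized undercloud: the package is not on the host at all,
# so there is nothing to compare and the stale build goes unnoticed.
containerized_host = {}
print(needs_update_by_host_comparison(stale_container, containerized_host))
```

In this model the second case reports no update needed even though the container is stale, which matches the risk the commit message describes.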

Bogdan Dobrelya (bogdando) wrote :

This is probably fixed with https://review.openstack.org/#/c/599315/ finally?

Changed in tripleo:
assignee: Steve Baker (steve-stevebaker) → Emilien Macchi (emilienm)
wes hayutin (weshayutin) wrote :

openstack-tripleo-common-container-base.noarch
                                   9.3.1-0.20180918151848.c794510.el7 @gating-repo

http://logs.openstack.org/22/603322/3/check/tripleo-ci-centos-7-scenario002-multinode-oooq-container/edb58a0/logs/undercloud/var/log/extra/docker/containers/heat_api/docker_info.log.txt.gz

Changed in tripleo:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/600273
Committed: https://git.openstack.org/cgit/openstack/ansible-role-tripleo-modify-image/commit/?id=c9d085729f62dfcfeeaecccf36c3c0161414afb7
Submitter: Zuul
Branch: master

commit c9d085729f62dfcfeeaecccf36c3c0161414afb7
Author: Steve Baker <email address hidden>
Date: Thu Sep 6 10:35:38 2018 +1200

    Remove compare_host_packages strategy

    Now that the undercloud is containerized, there will be very few host
    packages to compare to, so there is a high risk that required package
    updates will be skipped.

    This is a strategy inherited from container-update.py that was
    intended to avoid unnecessary calls to yum update, however we now have
    a better approach using the repoquery, so host package comparison is
    no longer required, and probably causing some of the instances of bug

    Change-Id: Iab7b9d6377494001d904bb84b058ea293d73110c
    Partial-Bug: #1786764

Change abandoned by Sorin Sbarnea (<email address hidden>) on branch: master
Review: https://review.openstack.org/596381
Reason: True, timestamper_cmd would need escaping to work inside a subshell and this is a real challenge.

I will abandon it as I no longer have time to address it, I will revive it if I see other timeouts happening.

This issue was fixed in the openstack/tripleo-quickstart-extras 2.1.1 release.
