Intermittent failures for container-puppet-ironic_inspector with rsync error

Bug #1868934 reported by Rabi Mishra
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Emilien Macchi

Bug Description

Noticed it a few times at the gate. Possibly a race with rsync after switching to the tripleo_container_manage role(?)

Noticed at:
-----------

https://f2812ca26965a251a9f1-8c8e1caf92c572d2c7c7d35e44c895e7.ssl.cf2.rackcdn.com/714317/4/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates/3fdf6bf/logs/undercloud/home/zuul/undercloud_install.log

undercloud install trace back:
------------------------------
....

TASK [tripleo_container_manage : Wait for containers to be exited] *************
Wednesday 25 March 2020 01:56:32 +0000 (0:00:00.174) 0:25:11.631 *******
FAILED - RETRYING: Wait for containers to be exited (30 retries left).
FAILED - RETRYING: Wait for containers to be exited (29 retries left).
ok: [undercloud]

TASK [tripleo_container_manage : Create a list of containers which didn't exit] ***
Wednesday 25 March 2020 01:56:55 +0000 (0:00:23.466) 0:25:35.097 *******
ok: [undercloud]

TASK [tripleo_container_manage : Create a list of containers with bad Exit Codes] ***
Wednesday 25 March 2020 01:56:56 +0000 (0:00:00.350) 0:25:35.448 *******
ok: [undercloud]

TASK [tripleo_container_manage : Print running containers] *********************
Wednesday 25 March 2020 01:56:56 +0000 (0:00:00.301) 0:25:35.749 *******
skipping: [undercloud]

TASK [tripleo_container_manage : Print failing containers] *********************
Wednesday 25 March 2020 01:56:56 +0000 (0:00:00.238) 0:25:35.988 *******
fatal: [undercloud]: FAILED! => changed=false
  msg: 'Container(s) with bad ExitCode: [''container-puppet-ironic_inspector''], check logs in /var/log/containers/stdouts/'

stdout for container-puppet-ironic_inspector:
------------------------------------------

https://f2812ca26965a251a9f1-8c8e1caf92c572d2c7c7d35e44c895e7.ssl.cf2.rackcdn.com/714317/4/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates/3fdf6bf/logs/undercloud/var/log/containers/stdouts/container-puppet-ironic_inspector.log

2020-03-25T01:56:43.100183386+00:00 stderr F + rc=2
2020-03-25T01:56:43.100183386+00:00 stderr F + '[' false = false ']'
2020-03-25T01:56:43.100183386+00:00 stderr F + set +x
2020-03-25T01:56:43.101723399+00:00 stdout F Evaluating config files to be removed for the ironic_inspector configuration
2020-03-25T01:56:43.148069935+00:00 stdout F Rsyncing config files from /etc /root /opt /var/lib/ironic/tftpboot /var/lib/ironic/httpboot /var/spool/cron into /var/lib/config-data/ironic_inspector
2020-03-25T01:56:43.379851996+00:00 stderr F file has vanished: "/var/lib/ironic/tftpboot/undionly.kpxe20200325-17-69r029"
2020-03-25T01:56:43.509309933+00:00 stderr F rsync warning: some files vanished before they could be transferred (code 24) at main.c(1189) [sender=3.1.3]

Maybe both container-puppet-ironic and container-puppet-ironic_inspector rsync from the same host folder, /var/lib/ironic/tftpboot, and hence race with each other, though I have no clue how or why.
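
For reference, rsync exit code 24 means "partial transfer due to vanished source files", which matches the warning in the log above. The following is a minimal sketch of how a wrapper could tell that case apart from other failures. The function name and the three-way classification are assumptions made for illustration; this is not the logic used by container-puppet.sh.

```python
# Sketch only: this exit-code policy is an assumption for illustration,
# not what container-puppet.sh actually does.
RSYNC_OK = 0
RSYNC_VANISHED = 24  # rsync: partial transfer due to vanished source files


def classify_rsync_exit(rc: int) -> str:
    """Map an rsync exit code to 'ok', 'vanished', or 'error'."""
    if rc == RSYNC_OK:
        return "ok"
    if rc == RSYNC_VANISHED:
        # A source file disappeared between the file-list scan and the copy.
        return "vanished"
    return "error"


print(classify_rsync_exit(24))
```

Whether "vanished" should be tolerated or treated as fatal is a separate decision; the sketch only separates the cases.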

Rabi Mishra (rabi)
Changed in tripleo:
status: New → Triaged
importance: Undecided → Medium
milestone: none → ussuri-3
Rabi Mishra (rabi) wrote :
Changed in tripleo:
importance: Medium → High
Bogdan Dobrelya (bogdando) wrote :

Concurrent puppet and rsync runs racing with each other may technically corrupt the configs created for the related containers, such as ironic* or nova*.

tags: added: train-backport-potential
tags: added: queens-backport-potential
tags: added: stein-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/714896

Changed in tripleo:
assignee: nobody → Bogdan Dobrelya (bogdando)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/714937

Bogdan Dobrelya (bogdando) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/714940

Rabi Mishra (rabi) wrote :

> @Rabi, it seems master shouldn't be affected by that issue

Job failures reported are on master only. Though I don't know if it's the same issue you're talking about or not.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/714942

Emilien Macchi (emilienm) wrote :

I think I found the reason.

Both container-puppet-ironic and container-puppet-ironic_inspector run at the same time:

ironic 01:56:21.141861229 - 01:56:51.610467343
ironic_inspector 01:56:24.338754035 - 01:56:43.509309933

They also don't have the same config_volume parameter, so the code that prevents the same config volume from being archived by two containers at once does not apply here:
https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/ansible_plugins/modules/container_puppet_config.py#L204-L230
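
To illustrate the point, here is a minimal sketch of a guard keyed on config_volume. It is not the code of container_puppet_config.py: the helper name and the dictionaries are hypothetical, and the config_volume value "ironic" is an assumption (only "ironic_inspector" appears in the log above). A guard of this kind only serializes entries that share a key, so the two containers land in separate groups and nothing stops them from running together.

```python
from collections import defaultdict

# Hypothetical entries: the container names come from this report, the
# config_volume value "ironic" is an assumption.
configs = [
    {"name": "container-puppet-ironic", "config_volume": "ironic"},
    {
        "name": "container-puppet-ironic_inspector",
        "config_volume": "ironic_inspector",
    },
]


def group_by_config_volume(entries):
    """Group container names by the config volume they write to."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["config_volume"]].append(entry["name"])
    return dict(groups)


groups = group_by_config_volume(configs)
for volume, names in groups.items():
    # Only names inside one group would be kept from running concurrently.
    print(volume, names)
```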

The ironic-conductor service creates /var/lib/ironic/httpboot with Ansible and seems to configure other things in that directory, so the content of that directory may change while the container is being configured. That is a problem if the ironic inspector container is being configured at the same time and the config volume is removed.

It would be safer to use the same config volume for both of these containers.
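
As a quick check of the two time ranges quoted in this comment, the sketch below computes how long both containers were running at once. The helper names are made up for illustration; the timestamps are copied from the comment. The inspector run falls entirely inside the ironic run, so the overlap is roughly 19.17 seconds.

```python
def to_ns(stamp: str) -> int:
    """Convert 'HH:MM:SS.nnnnnnnnn' to nanoseconds since midnight."""
    clock, frac = stamp.split(".")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    whole_seconds = (hours * 60 + minutes) * 60 + seconds
    return whole_seconds * 1_000_000_000 + int(frac.ljust(9, "0"))


def overlap_ns(a, b) -> int:
    """Return the overlap of two (start, end) ranges in nanoseconds."""
    start = max(to_ns(a[0]), to_ns(b[0]))
    end = min(to_ns(a[1]), to_ns(b[1]))
    return max(0, end - start)


ironic = ("01:56:21.141861229", "01:56:51.610467343")
inspector = ("01:56:24.338754035", "01:56:43.509309933")

print(overlap_ns(ironic, inspector) / 1e9, "seconds of overlap")
```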

Emilien Macchi (emilienm) wrote :

Actually it's not possible to use the same config volumes for all ironic containers since they don't share the same Config image; so for example we couldn't use the ContainerIronicConfigImage or ContainerIronicApiConfigImage for ironic inspector config container, since the inspector config would be missing in /etc/.

I'm looking at something else.

Emilien Macchi (emilienm) wrote :

Bogdan, I commented in your patch but I think this is the way to go (what you started). See my last comment.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/stein)

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/714940

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/queens)

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/714942

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/train)

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/714937

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/715144

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/716439

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/716447

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/716547

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ansible (master)

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: master
Review: https://review.opendev.org/715144

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/716658

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: master
Review: https://review.opendev.org/716447
Reason: https://review.opendev.org/#/c/716658/

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Emilien Macchi (<email address hidden>) on branch: master
Review: https://review.opendev.org/716439
Reason: https://review.opendev.org/#/c/716658/

Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Emilien Macchi (emilienm)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ansible (master)

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: master
Review: https://review.opendev.org/716547

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: master
Review: https://review.opendev.org/714896

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/716658
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=b4ff60f0a8dc1e602a0e2f7a20c1d91ef0b2f07e
Submitter: Zuul
Branch: master

commit b4ff60f0a8dc1e602a0e2f7a20c1d91ef0b2f07e
Author: Emilien Macchi <email address hidden>
Date: Wed Apr 1 12:35:05 2020 -0400

    Exclude /var/lib/ironic/* from container-puppet.sh rsync

    Exclude /var/lib/ironic/* from container-puppet.sh rsync, this is a
    leftover from the initial containerization of TripleO; now we have
    host prep tasks, the ironic conductor and inspector bind mount
    /var/lib/ironic and generate the data that they need. But this data should
    not be in the config volume or it can conflict from each other when rsync
    runs at the same time.

    TripleO upgrade tasks and host prep tasks will take care of removing
    the var directory from the config volumes and the containers will just use
    the bind mount, like it should be doing now.
    These tasks will run during a minor update, major upgrade, and fast
    forward upgrade.

    Note: this will have to be backported to stable/train, and cleaned up
    later after Ussuri as the tasks won't be needed anymore.

    Root cause patches:
    - I3a195466a5039e7641e843c11e5436440bfc5a01
    - Ibcff99f03e6751fbf3197adefd5d344178b71fc2

    Related-Bug: #1868934
    Co-Authored-By: Alex Schultz <email address hidden>

    Change-Id: I69a1d1059bdfc0b99cf6a4a3dc78a1eb9a43ad0b
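
The effect of the exclusion described in this commit can be pictured as filtering the rsync source list. The sketch below is not the actual patch to container-puppet.sh: the helper name is hypothetical, and the source paths are the ones printed in the "Rsyncing config files from ..." log line in the bug description.

```python
# Source paths as printed in the "Rsyncing config files from ..." log line.
RSYNC_SOURCES = [
    "/etc",
    "/root",
    "/opt",
    "/var/lib/ironic/tftpboot",
    "/var/lib/ironic/httpboot",
    "/var/spool/cron",
]


def drop_under(paths, excluded_root):
    """Return paths that are neither excluded_root nor located beneath it."""
    root = excluded_root.rstrip("/")
    return [p for p in paths if p != root and not p.startswith(root + "/")]


# Hypothetical illustration of the exclusion, not the real shell change.
kept = drop_under(RSYNC_SOURCES, "/var/lib/ironic")
print(" ".join(kept))
```

With /var/lib/ironic out of the list, the files that the conductor and inspector create there at runtime are no longer copied into the config volume, so they cannot vanish mid-transfer.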

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/718548

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/718548
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=3b5fca2961399312dac4f7ac3000ed156dd4815e
Submitter: Zuul
Branch: stable/train

commit 3b5fca2961399312dac4f7ac3000ed156dd4815e
Author: Emilien Macchi <email address hidden>
Date: Wed Apr 1 12:35:05 2020 -0400

    Exclude /var/lib/ironic/* from container-puppet.sh rsync

    Exclude /var/lib/ironic/* from container-puppet.sh rsync, this is a
    leftover from the initial containerization of TripleO; now we have
    host prep tasks, the ironic conductor and inspector bind mount
    /var/lib/ironic and generate the data that they need. But this data should
    not be in the config volume or it can conflict from each other when rsync
    runs at the same time.

    TripleO upgrade tasks and host prep tasks will take care of removing
    the var directory from the config volumes and the containers will just use
    the bind mount, like it should be doing now.
    These tasks will run during a minor update, major upgrade, and fast
    forward upgrade.

    Note: this will have to be backported to stable/train, and cleaned up
    later after Ussuri as the tasks won't be needed anymore.

    Root cause patches:
    - I3a195466a5039e7641e843c11e5436440bfc5a01
    - Ibcff99f03e6751fbf3197adefd5d344178b71fc2

    Related-Bug: #1868934
    Co-Authored-By: Alex Schultz <email address hidden>

    Change-Id: I69a1d1059bdfc0b99cf6a4a3dc78a1eb9a43ad0b
    (cherry picked from commit b4ff60f0a8dc1e602a0e2f7a20c1d91ef0b2f07e)

tags: added: in-stable-train
Bogdan Dobrelya (bogdando) wrote :
Changed in tripleo:
status: In Progress → Fix Released