CI: Images introspection fails in OVB jobs

Bug #1770972 reported by Sagi (Sergey) Shnaidman
This bug affects 3 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Derek Higgins

Bug Description

OVB jobs fail when running introspection:

https://logs.rdoproject.org/57/568057/1/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/Zf543f41f687940dda0c8347b7749d7d0/undercloud/home/jenkins/overcloud_prep_images.log.txt.gz

https://logs.rdoproject.org/57/568057/1/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Zf543f41f687940dda0c8347b7749d7d0/undercloud/home/jenkins/overcloud_prep_images.log.txt.gz

2018-05-13 12:33:50 | 4 node(s) successfully moved to the "manageable" state.
2018-05-13 12:33:50 | Successfully registered node UUID fe2443d0-f3fa-40ea-b05d-2a161472e446
2018-05-13 12:33:50 | Successfully registered node UUID 540e682b-2e3a-4f27-89dc-16cde20e3bed
2018-05-13 12:33:50 | Successfully registered node UUID 8cb61185-5164-4be8-a0cd-ff8cbe720f5d
2018-05-13 12:33:50 | Successfully registered node UUID 253f0815-585e-4e7a-9113-98142429772a
2018-05-13 12:33:50 | + openstack overcloud node introspect --all-manageable
2018-05-13 12:33:54 | Waiting for messages on queue 'tripleo' with no timeout.
2018-05-13 13:34:45 | Exception introspecting nodes: {u'status': u'RUNNING', u'node_uuids': [u'fe2443d0-f3fa-40ea-b05d-2a161472e446', u'540e682b-2e3a-4f27-89dc-16cde20e3bed', u'8cb61185-5164-4be8-a0cd-ff8cbe720f5d', u'253f0815-585e-4e7a-9113-98142429772a'], u'failed_introspection': [u'fe2443d0-f3fa-40ea-b05d-2a161472e446', u'540e682b-2e3a-4f27-89dc-16cde20e3bed', u'8cb61185-5164-4be8-a0cd-ff8cbe720f5d', u'253f0815-585e-4e7a-9113-98142429772a'], u'result': None, u'introspected_nodes': {u'540e682b-2e3a-4f27-89dc-16cde20e3bed': {u'uuid': u'540e682b-2e3a-4f27-89dc-16cde20e3bed', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/540e682b-2e3a-4f27-89dc-16cde20e3bed', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-05-13T12:33:59'}, u'8cb61185-5164-4be8-a0cd-ff8cbe720f5d': {u'uuid': u'8cb61185-5164-4be8-a0cd-ff8cbe720f5d', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/8cb61185-5164-4be8-a0cd-ff8cbe720f5d', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-05-13T12:34:00'}, u'fe2443d0-f3fa-40ea-b05d-2a161472e446': {u'uuid': u'fe2443d0-f3fa-40ea-b05d-2a161472e446', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/fe2443d0-f3fa-40ea-b05d-2a161472e446', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-05-13T12:33:58'}, u'253f0815-585e-4e7a-9113-98142429772a': {u'uuid': u'253f0815-585e-4e7a-9113-98142429772a', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/253f0815-585e-4e7a-9113-98142429772a', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-05-13T12:34:01'}}, u'message': u'Retrying 4 nodes that failed introspection. Attempt 2 of 3 ', u'introspection_attempt': 2}
2018-05-13 13:34:45 | Waiting for introspection to finish...
2018-05-13 13:34:45 | Introspection of node 8cb61185-5164-4be8-a0cd-ff8cbe720f5d timed out.
2018-05-13 13:34:45 | Introspection of node 540e682b-2e3a-4f27-89dc-16cde20e3bed timed out.
2018-05-13 13:34:45 | Introspection of node fe2443d0-f3fa-40ea-b05d-2a161472e446 timed out.
2018-05-13 13:34:45 | Introspection of node 253f0815-585e-4e7a-9113-98142429772a timed out.
2018-05-13 13:34:45 | Retrying 4 nodes that failed introspection. Attempt 2 of 3
2018-05-13 13:34:45 | Introspection of node 253f0815-585e-4e7a-9113-98142429772a timed out.
2018-05-13 13:34:45 | Introspection of node 8cb61185-5164-4be8-a0cd-ff8cbe720f5d timed out.
2018-05-13 13:34:45 | Introspection of node fe2443d0-f3fa-40ea-b05d-2a161472e446 timed out.
2018-05-13 13:34:45 | Introspection of node 540e682b-2e3a-4f27-89dc-16cde20e3bed timed out.
2018-05-13 13:34:45 | Retrying 4 nodes that failed introspection. Attempt 3 of 3
2018-05-13 13:34:45 | Introspection of node fe2443d0-f3fa-40ea-b05d-2a161472e446 timed out.
2018-05-13 13:34:45 | Introspection of node 540e682b-2e3a-4f27-89dc-16cde20e3bed timed out.
2018-05-13 13:34:45 | Introspection of node 253f0815-585e-4e7a-9113-98142429772a timed out.
2018-05-13 13:34:45 | Introspection of node 8cb61185-5164-4be8-a0cd-ff8cbe720f5d timed out.
2018-05-13 13:34:45 | Retry limit reached with 4 nodes still failing introspection
2018-05-13 13:34:45 | {u'status': u'RUNNING', u'node_uuids': [u'fe2443d0-f3fa-40ea-b05d-2a161472e446', u'540e682b-2e3a-4f27-89dc-16cde20e3bed', u'8cb61185-5164-4be8-a0cd-ff8cbe720f5d', u'253f0815-585e-4e7a-9113-98142429772a'], u'failed_introspection': [u'fe2443d0-f3fa-40ea-b05d-2a161472e446', u'540e682b-2e3a-4f27-89dc-16cde20e3bed', u'8cb61185-5164-4be8-a0cd-ff8cbe720f5d', u'253f0815-585e-4e7a-9113-98142429772a'], u'result': None, u'introspected_nodes': {u'540e682b-2e3a-4f27-89dc-16cde20e3bed': {u'uuid': u'540e682b-2e3a-4f27-89dc-16cde20e3bed', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/540e682b-2e3a-4f27-89dc-16cde20e3bed', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-05-13T12:33:59'}, u'8cb61185-5164-4be8-a0cd-ff8cbe720f5d': {u'uuid': u'8cb61185-5164-4be8-a0cd-ff8cbe720f5d', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/8cb61185-5164-4be8-a0cd-ff8cbe720f5d', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-05-13T12:34:00'}, u'fe2443d0-f3fa-40ea-b05d-2a161472e446': {u'uuid': u'fe2443d0-f3fa-40ea-b05d-2a161472e446', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/fe2443d0-f3fa-40ea-b05d-2a161472e446', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-05-13T12:33:58'}, u'253f0815-585e-4e7a-9113-98142429772a': {u'uuid': u'253f0815-585e-4e7a-9113-98142429772a', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/253f0815-585e-4e7a-9113-98142429772a', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-05-13T12:34:01'}}, u'message': u'Retrying 4 nodes that failed introspection. Attempt 2 of 3 ', u'introspection_attempt': 2}
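
A log like the one above can be condensed with a short script when triaging many job runs. The following is a minimal sketch, assuming the log lines keep the wording shown here ("Introspection of node <uuid> timed out." and "Attempt N of M"); the function name `summarize_introspection` is hypothetical and not part of any TripleO tool.

```python
import re

# Patterns follow the wording of the log excerpt in this report (an assumption
# about the log format, not a documented interface).
TIMEOUT_RE = re.compile(r"Introspection of node ([0-9a-f-]{36}) timed out\.")
RETRY_RE = re.compile(r"Attempt (\d+) of (\d+)")


def summarize_introspection(log_text):
    """Return (sorted unique timed-out node UUIDs, highest attempt number seen)."""
    timed_out = set()
    last_attempt = 1  # the first attempt is not announced in the log
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            timed_out.add(match.group(1))
        match = RETRY_RE.search(line)
        if match:
            last_attempt = max(last_attempt, int(match.group(1)))
    return sorted(timed_out), last_attempt
```

Applied to the excerpt above, this would report the same four node UUIDs and a final attempt of 3, which matches the "Retry limit reached" line.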

Changed in tripleo:
assignee: Sagi (Sergey) Shnaidman (sshnaidm) → Dmitry Tantsur (divius)
Dmitry Tantsur (divius) wrote :
Dmitry Tantsur (divius) wrote :
Derek Higgins (derekh) wrote :

Console logs suggest a problem starting networking

Derek Higgins (derekh) wrote :
Changed in tripleo:
assignee: Dmitry Tantsur (divius) → Derek Higgins (derekh)
Derek Higgins (derekh) wrote :

The IPA image appears to have been corrupted (or not packaged correctly).

I opened it up and set the root password so I could log on to it. Inspection with the new image works, and nothing else has changed.
I've had 4 or 5 successful inspections with this image, then switched back to the original and it FAILED.
Command used to package the ramdisk:
  "sudo find . | sudo cpio -H newc -o | gzip -9 > ../ramdisk"

Matt Young (halcyondude) wrote :

tripleo-ci triage - monitoring jobs. We are concerned that we don't have a root cause for corruption and/or mitigation for this class of issue moving forward.

tags: removed: alert
Matt Young (halcyondude) wrote :
Matt Young (halcyondude) wrote :
wes hayutin (weshayutin) wrote :
Matt Young (halcyondude) wrote :

Kicking off a recreate of this issue.

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

If this is a corrupt image again, we need to find out why it is sometimes corrupted and sometimes not.

Matt Young (halcyondude) wrote :

Reproducer failed, trying again. Since this is being seen in the gates, adding the alert tag.

tags: added: alert
Matthias Runge (mrunge) wrote :

FWIW, I've been struggling with this kind of issue for a while now on a separate deployment. This bug may be related: https://bugs.launchpad.net/tripleo/+bug/1749707

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

We update our images in the job: https://logs.rdoproject.org/15/568715/2/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z5df1951657694a9ebaad63e71362a76a/console.txt.gz#_2018-05-16_04_15_05_346
In code it's here: https://github.com/openstack/tripleo-quickstart-extras/blob/69ad943adda9000f79277f0230a5751869de9cb3/roles/modify-image/tasks/manual.yml#L33-L70

But when running the update we get an error in yum: https://logs.rdoproject.org/15/568715/2/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z5df1951657694a9ebaad63e71362a76a/undercloud/home/jenkins/repo_setup.sh.1526444104.log.txt.gz

2018-05-16 04:15:04 | + sudo rm -rf '/etc/yum.repos.d/delorean*'
2018-05-16 04:15:04 | + sudo rm -rf '/etc/yum.repos.d/*.rpmsave'
2018-05-16 04:15:04 | + sudo yum clean all
2018-05-16 04:15:04 | error: Failed to initialize NSS library
2018-05-16 04:15:04 | There was a problem importing one of the Python modules
2018-05-16 04:15:04 | required to run yum. The error leading to this problem was:
2018-05-16 04:15:04 |
2018-05-16 04:15:04 | cannot import name ts
2018-05-16 04:15:04 |
2018-05-16 04:15:04 | Please install a package which provides this module, or
2018-05-16 04:15:04 | verify that the module is installed correctly.
2018-05-16 04:15:04 |
2018-05-16 04:15:04 | It's possible that the above module doesn't match the
2018-05-16 04:15:04 | current version of Python, which is:
2018-05-16 04:15:04 | 2.7.5 (default, Apr 11 2018, 07:36:10)
2018-05-16 04:15:04 | [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
2018-05-16 04:15:04 |
2018-05-16 04:15:04 | If you cannot solve this problem yourself, please go to
2018-05-16 04:15:04 | the yum faq at:
2018-05-16 04:15:04 | http://yum.baseurl.org/wiki/Faq
2018-05-16 04:15:04 |
2018-05-16 04:15:04 |

This may be a reason for the failures. In any case, it means we can't patch images with built changes.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master)

Change abandoned by Sagi Shnaidman (<email address hidden>) on branch: master
Review: https://review.openstack.org/568805

Matt Young (halcyondude) wrote :

Sagi has a related patch being tested now:

```
https://review.openstack.org/568838 Mount /dev for chrooted environment

Yum fails to run in chrooted environment because of blocking
access to /dev/urandom. Mount hosts /dev to chrooted.
```

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

The chroot fix doesn't help introspection, but the chroot problem is an issue in itself. Opened a bug about it: https://bugs.launchpad.net/tripleo/+bug/1771755

Introspection still fails.

Derek Higgins (derekh) wrote :

A problem appears to be getting introduced to the ramdisk image while CI is updating it:

promoted ramdisks work fine
promoted ramdisks updated by quickstart fail
promoted ramdisks updated manually work fine
   There must be some difference between the quickstart image update and the manual update I've been doing, but I've yet to find it.

I've also managed to get onto an updated image while it was failing [1]; journal log attached. I suspect something is wrong with /dev: there are problems accessing /dev/urandom and setting SELinux contexts.

May 17 11:01:56 localhost.localdomain chronyd[243]: Can't open /dev/urandom : Permission denied
May 17 11:01:56 localhost.localdomain sshd-keygen[250]: /sbin/restorecon set context /etc/ssh/ssh_host_rsa_key.pub->system_u:object_r:sshd_key_t:s0 failed:'Operation not supported'

[1] - set a rootpwd= in /httpboot/inspector.ipxe
      ssh to the ipv6 linklocal address (as IPv4 isn't working)
        ssh root@fe80::f816:3eff:fe13:8481%br-ctlplane

Cédric Jeanneret (cjeanner) wrote :

In the meantime, can't we make the OVB jobs non-voting? They block the reviews that use them, and it has been a week now :/.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.openstack.org/568833
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=a835a69e3ba63bc88203584b55ca34d3c116b1c9
Submitter: Zuul
Branch: master

commit a835a69e3ba63bc88203584b55ca34d3c116b1c9
Author: Sagi Shnaidman <email address hidden>
Date: Wed May 16 15:44:58 2018 +0300

    Temporary disable update of IPA image in jobs

    Don't update IPA image, use the provided from promotion.
    Introspection fails when we unpack and pack again IPA images,
    let's use promoted one in jobs until bug 1770972 if fixed.

    Related-Bug: 1770972

    Change-Id: Iab5565d1743d7a6d9aa9fdb5473b3b3e5fea7c62

Derek Higgins (derekh) wrote :

After taking two ramdisks, one working and one not, and stripping out everything I think is irrelevant (inode numbers, timestamps, etc.), the only difference I'm left with that I think is relevant is that the "/" directory has the permissions 700 in the non-working ramdisk and 775 in the working ramdisk.

If this turns out to be the problem, I suspect that the umask has changed recently with a newer version of something (centos or ansible maybe).

I've confirmed this locally but CI appears to be having other problems at the moment
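
A comparison of this kind can be approximated without unpacking either image, because the cpio "newc" format stores each entry's mode in a fixed-width ASCII header. The sketch below is an illustration only, not the procedure used in this comment: it assumes each ramdisk is a single gzip-compressed newc archive (as produced by the packaging command quoted earlier in this report), and the names `cpio_modes`, `mode_differences`, and `load_ramdisk` are hypothetical.

```python
import gzip
import stat

HEADER_LEN = 110  # 6-byte magic followed by 13 fields of 8 hex digits each


def _align4(offset):
    """Round an offset up to the next multiple of 4 (newc padding rule)."""
    return (offset + 3) & ~3


def cpio_modes(data):
    """Return {entry name: permission bits} for a newc cpio archive in memory."""
    modes = {}
    pos = 0
    while pos + HEADER_LEN <= len(data):
        header = data[pos:pos + HEADER_LEN]
        if header[:6] not in (b"070701", b"070702"):
            raise ValueError("no newc header at offset %d" % pos)
        fields = [int(header[6 + 8 * i:14 + 8 * i], 16) for i in range(13)]
        mode, filesize, namesize = fields[1], fields[6], fields[11]
        name_start = pos + HEADER_LEN
        # namesize counts the trailing NUL byte, which is dropped here.
        name = data[name_start:name_start + namesize - 1].decode()
        if name == "TRAILER!!!":
            break
        modes[name] = stat.S_IMODE(mode)
        data_start = _align4(name_start + namesize)
        pos = _align4(data_start + filesize)
    return modes


def mode_differences(good, bad):
    """Return {name: (good mode, bad mode)} for entries whose modes differ."""
    a, b = cpio_modes(good), cpio_modes(bad)
    return {n: (a[n], b[n]) for n in a.keys() & b.keys() if a[n] != b[n]}


def load_ramdisk(path):
    """Read a gzip-compressed ramdisk into memory (assumes one gzip stream)."""
    with gzip.open(path, "rb") as handle:
        return handle.read()
```

Because `find .` emits `.` as its first result, the archive root appears as an entry named `.`, so `mode_differences(load_ramdisk(good_path), load_ramdisk(bad_path)).get(".")` would surface a 775 versus 700 difference directly.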

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-quickstart-extras (master)

Fix proposed to branch: master
Review: https://review.openstack.org/569468

Changed in tripleo:
status: Triaged → In Progress
assignee: Derek Higgins (derekh) → Sagi (Sergey) Shnaidman (sshnaidm)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by Derek Higgins (<email address hidden>) on branch: master
Review: https://review.openstack.org/517368
Reason: Look here instead https://review.openstack.org/#/c/569468/

Changed in tripleo:
assignee: Sagi (Sergey) Shnaidman (sshnaidm) → Derek Higgins (derekh)
wes hayutin (weshayutin) wrote :

Thanks Derek!!!!!!

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.openstack.org/569468
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=d19178ca4f04a81a5a2151df3845518317942b37
Submitter: Zuul
Branch: master

commit d19178ca4f04a81a5a2151df3845518317942b37
Author: Derek Higgins <email address hidden>
Date: Thu May 17 12:55:05 2018 +0100

    Set the permissions for "/" in the ramdisk

    Assert that the "/" directory in the ramdisk has the permissions
    of 755 and not leave it up to umask (or chance).

    Change-Id: I609914f15b8eca6f3cb8e72099c130b88f294224
    Closes-Bug: #1770972
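
The idea in this commit message, setting the mode explicitly instead of inheriting it from however the directory was created, can be illustrated in a few lines of Python. This is a sketch under stated assumptions, not the merged change (which lives in the tripleo-quickstart-extras role and is not reproduced here); `root_mode` and `set_root_mode` are hypothetical helper names.

```python
import os
import stat


def root_mode(root_dir):
    """Return the permission bits of the directory that will become "/"."""
    return stat.S_IMODE(os.stat(root_dir).st_mode)


def set_root_mode(root_dir, mode=0o755):
    """Force the unpacked ramdisk root to a known mode before repacking."""
    os.chmod(root_dir, mode)
    return root_mode(root_dir)
```

As one example of how a restrictive root can arise, Python's `tempfile.mkdtemp()` always creates its directory with mode 700; whether something similar happened in the CI job is not established in this report. Calling `set_root_mode()` on the unpack directory before running `find . | cpio -H newc -o` makes the packed "/" entry independent of umask or of the tool that created the directory.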

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-quickstart-extras 2.1.1

This issue was fixed in the openstack/tripleo-quickstart-extras 2.1.1 release.
