Introspection racily times out because machines' boot order gets messed up

Bug #1718898 reported by Raoul Scarazzini on 2017-09-22
Affects: tripleo — Importance: Medium — Assigned to: Unassigned

Bug Description

We're seeing a lot of introspection failures in rdophase2 for Newton [1] and Ocata [2], starting from when we reintroduced the packaged iPXE (instead of the upstream one) in our jobs.
We are not seeing this in Pike. What is strange is that the iPXE package version is the same in every case (ipxe-bootimgs-20170123-1.git4e85b27.el7_4.1.noarch), yet the problem happens (frequently, but still racily) only in Newton and Ocata. So even if the trigger seems to be the packaged iPXE, its version is identical in both environments. The Ironic packages, of course, differ:

Ocata (with the introspection timeout error):

openstack-ironic-api-6.2.5-0.20170711215022.58a4181.el7.centos.noarch
openstack-ironic-common-6.2.5-0.20170711215022.58a4181.el7.centos.noarch
openstack-ironic-conductor-6.2.5-0.20170711215022.58a4181.el7.centos.noarch
openstack-ironic-inspector-4.2.3-0.20170711232452.b34fa9c.el7.centos.noarch
puppet-ironic-9.5.1-0.20170912042852.ec33c9a.el7.centos.noarch
python2-ironicclient-1.7.1-0.20161205104628.c0abab8.el7.centos.noarch
python-ironic-inspector-client-1.10.0-0.20170202170017.0eae82e.el7.centos.noarch
python-ironic-lib-2.1.3-0.20170209073115.152cf28.el7.centos.noarch

Pike (all good on introspection side):

openstack-ironic-api-9.1.1-0.20170908114346.feb64c2.el7.centos.noarch
openstack-ironic-common-9.1.1-0.20170908114346.feb64c2.el7.centos.noarch
openstack-ironic-conductor-9.1.1-0.20170908114346.feb64c2.el7.centos.noarch
openstack-ironic-inspector-6.0.1-0.20170824132804.0e72dcb.el7.centos.noarch
puppet-ironic-11.3.1-0.20170907060708.13e23f7.el7.centos.noarch
python2-ironicclient-1.17.0-0.20170906171257.cdff7a0.el7.centos.noarch
python-ironic-inspector-client-2.1.0-0.20170915002324.bdcab9f.el7.centos.noarch
python-ironic-lib-2.10.0-0.20170906171416.1fa0a5f.el7.centos.noarch

The introspection (done via the ipxe_ipmitool driver) times out because the failed machines end up with a boot order different from the one that was set up.
By hand we always set the iPXE network interfaces *FIRST* in the BIOS, but when we reboot the timed-out machines and look into the BIOS, the hard disk is the first boot device. So something is acting on the boot order.
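A quick way to check whether the BMC still reports PXE as the persistent boot device, and to force it back, is plain ipmitool. This is a dry-run sketch: the BMC address and credentials are placeholders, and each command is echoed instead of executed (drop the `echo` in `run` to talk to a real node).

```shell
#!/bin/sh
# Hypothetical BMC address and credentials -- substitute your own.
BMC_HOST=${BMC_HOST:-192.0.2.10}
BMC_USER=${BMC_USER:-admin}
BMC_PASS=${BMC_PASS:-secret}

# Dry-run wrapper: prints each command instead of executing it, so the
# sketch can be read without a live BMC. Drop the 'echo' to run for real.
run() { echo "+ $*"; }

# Show boot parameter 5 (the boot-device selector in the IPMI spec).
run ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASS" \
    chassis bootparam get 5

# Force PXE back as the persistent first boot device.
run ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASS" \
    chassis bootdev pxe options=persistent
```

Comparing `bootparam get 5` output on a timed-out node against a healthy one would confirm whether the change happened at the BMC level or purely in the BIOS setup.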

Now, if we look at this failure [3], we see that it failed with introspection timeouts, but the next deployment DID succeed at introspection [4] without any manual intervention. So we can suppose that the boot order was messed up during the failure and then fixed again during the success.

What we need to understand is WHO changes the boot order, when and why.
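One way to narrow down the "who" is to scan the ironic-conductor debug logs for boot-device changes and correlate them with the failure window. The sketch below is a minimal starting point: the sample log lines and the regex are assumptions, since the exact log format varies between Ironic releases, so adapt the pattern to what the real conductor log contains.

```python
import re

# Hypothetical sample lines: the exact ironic-conductor log format differs
# by release, so treat the pattern below as illustrative, not authoritative.
SAMPLE_LOG = """\
2017-09-21 10:02:11.123 12345 DEBUG ironic.conductor.manager [-] Setting boot device to disk for node 7a1c
2017-09-21 10:02:13.456 12345 DEBUG ironic.conductor.manager [-] Setting boot device to pxe for node 7a1c
"""

BOOT_DEVICE_RE = re.compile(
    r"Setting boot device to (?P<device>\w+) for node (?P<node>\S+)"
)

def boot_device_changes(log_text):
    """Yield (node, device) pairs for every boot-device change in the log."""
    for line in log_text.splitlines():
        m = BOOT_DEVICE_RE.search(line)
        if m:
            yield m.group("node"), m.group("device")

if __name__ == "__main__":
    for node, device in boot_device_changes(SAMPLE_LOG):
        print(node, "->", device)
```

Running this over the conductor logs from the FAILURE tarball and diffing against the SUCCESS run should show whether Ironic itself (e.g. during stack deletion or node cleaning) is the one flipping the boot device to disk.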

[1] https://thirdparty.logs.rdoproject.org/jenkins-oooq-newton-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-48/haa-01.ha.lab.eng.bos.redhat.com/home/stack/overcloud_prep_images.log.txt.gz
[2] https://thirdparty.logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-251/haa-01.ha.lab.eng.bos.redhat.com/home/stack/overcloud_prep_images.log.txt.gz
[3] https://thirdparty.logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-248/haa-01.ha.lab.eng.bos.redhat.com/home/stack/overcloud_prep_images.log.txt.gz
[4] https://thirdparty.logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-249/haa-01.ha.lab.eng.bos.redhat.com/home/stack/overcloud_prep_images.log.txt.gz

Changed in tripleo:
status: New → Triaged
importance: Undecided → Medium
milestone: none → queens-1
Changed in tripleo:
milestone: queens-1 → queens-2
Raoul Scarazzini (rasca) wrote:

I worked on isolating the problem, without any big success.
I was, however, able to increase the debug level and take the logs from the last SUCCESSFUL deployment [1] and the next FAILURE [2]. These logs also include the tcpdump pcap files from the introspection process.

Now, what we know for sure is that:

1) If I continuously deploy an env only up to the introspection (i.e. once the introspection finishes I just restart from scratch), the problem does not happen;
2) If I continuously deploy a complete env, including deleting the overcloud stack before restarting from scratch, then the problem happens every 3 or 4 deployments;
3) The nodes timing out are not always the same ones: this is extremely racy.

That said, I'm still not able to reproduce the problem on demand, nor to find a workaround that avoids it.
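The full-cycle reproduction described in point 2 above can be sketched as a loop. This is a dry-run outline (each command is echoed, not executed), and the exact tripleoclient invocations are assumptions for an Ocata-era undercloud; adapt them before running for real.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction loop from point 2: echo instead of
# executing, so it can be reviewed safely. The client commands below are
# assumed Ocata-era tripleoclient syntax.
run() { echo "+ $*"; }

i=1
while [ "$i" -le 4 ]; do
    echo "=== deployment attempt $i ==="
    run openstack overcloud node introspect --all-manageable --provide
    run openstack overcloud deploy --templates
    # Deleting the overcloud stack before the next round is the step that
    # seems to correlate with the boot-order corruption.
    run openstack stack delete overcloud --yes --wait
    i=$((i + 1))
done
```

Per point 2, roughly one iteration in every 3 or 4 should hit the introspection timeout, while skipping the deploy/delete steps never does.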

[1] http://file.rdu.redhat.com/~rscarazz/LP1718898/collect-logs_SUCCESS.tar.bz2
[2] http://file.rdu.redhat.com/~rscarazz/LP1718898/collect-logs_FAILURE.tar.bz2

Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3

Closing it since it wasn't reproducible.

Changed in tripleo:
status: Triaged → Invalid