Introspection racily times out because machines' boot order gets messed up

Bug #1718898 reported by Raoul Scarazzini on 2017-09-22
Affects: tripleo — Importance: Medium — Assigned to: Unassigned

Bug Description

We're seeing a lot of introspection failures in rdophase2 for Newton [1] and Ocata [2], starting from when we reintroduced the packaged iPXE (instead of the upstream one) in our jobs.
We are not seeing this in Pike. What is strange is that the iPXE package version is the same in every case (ipxe-bootimgs-20170123-1.git4e85b27.el7_4.1.noarch), yet the problem happens (frequently, but still racily) only in Newton and Ocata. So even if the trigger seems to be the packaged iPXE, its version is identical in both environments. The Ironic packages, of course, differ:

Ocata (with the introspection timeout error):

openstack-ironic-api-6.2.5-0.20170711215022.58a4181.el7.centos.noarch
openstack-ironic-common-6.2.5-0.20170711215022.58a4181.el7.centos.noarch
openstack-ironic-conductor-6.2.5-0.20170711215022.58a4181.el7.centos.noarch
openstack-ironic-inspector-4.2.3-0.20170711232452.b34fa9c.el7.centos.noarch
puppet-ironic-9.5.1-0.20170912042852.ec33c9a.el7.centos.noarch
python2-ironicclient-1.7.1-0.20161205104628.c0abab8.el7.centos.noarch
python-ironic-inspector-client-1.10.0-0.20170202170017.0eae82e.el7.centos.noarch
python-ironic-lib-2.1.3-0.20170209073115.152cf28.el7.centos.noarch

Pike (all good on introspection side):

openstack-ironic-api-9.1.1-0.20170908114346.feb64c2.el7.centos.noarch
openstack-ironic-common-9.1.1-0.20170908114346.feb64c2.el7.centos.noarch
openstack-ironic-conductor-9.1.1-0.20170908114346.feb64c2.el7.centos.noarch
openstack-ironic-inspector-6.0.1-0.20170824132804.0e72dcb.el7.centos.noarch
puppet-ironic-11.3.1-0.20170907060708.13e23f7.el7.centos.noarch
python2-ironicclient-1.17.0-0.20170906171257.cdff7a0.el7.centos.noarch
python-ironic-inspector-client-2.1.0-0.20170915002324.bdcab9f.el7.centos.noarch
python-ironic-lib-2.10.0-0.20170906171416.1fa0a5f.el7.centos.noarch

The introspection (done via the ipxe_ipmitool driver) times out because the failed machines end up with a boot order different from the one that was set up.
By hand we always set the iPXE network interfaces *FIRST* in the BIOS, but when we reboot the timed-out machines and look into the BIOS, the hard disk is the first boot device. So something is acting on the boot order.
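A quick way to check whether the BMC still reports PXE as the persistent boot device, and to force it back, is plain ipmitool. This is a dry-run sketch: the BMC address and credentials are placeholders, and each command is echoed instead of executed (drop the `echo` in `run` to talk to a real node).

```shell
#!/bin/sh
# Hypothetical BMC address and credentials -- substitute your own.
BMC_HOST=${BMC_HOST:-192.0.2.10}
BMC_USER=${BMC_USER:-admin}
BMC_PASS=${BMC_PASS:-secret}

# Dry-run wrapper: prints each command instead of executing it, so the
# sketch can be read without a live BMC. Drop the 'echo' to run for real.
run() { echo "+ $*"; }

# Show boot parameter 5 (the boot-device selector in the IPMI spec).
run ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASS" \
    chassis bootparam get 5

# Force PXE back as the persistent first boot device.
run ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASS" \
    chassis bootdev pxe options=persistent
```

Comparing `bootparam get 5` output on a timed-out node against a healthy one would confirm whether the change happened at the BMC level or purely in the BIOS setup.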

Now, if we look at this failure [3], we see that it failed with introspection timeouts, but the next deployment DID succeed at introspection [4] without any manual intervention. So we can suppose that the boot order was messed up during the failure and then fixed again during the success.

What we need to understand is WHO changes the boot order, when and why.
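One way to narrow down the "who" is to scan the ironic-conductor debug logs for boot-device changes and correlate them with the failure window. The sketch below is a minimal starting point: the sample log lines and the regex are assumptions, since the exact log format varies between Ironic releases, so adapt the pattern to what the real conductor log contains.

```python
import re

# Hypothetical sample lines: the exact ironic-conductor log format differs
# by release, so treat the pattern below as illustrative, not authoritative.
SAMPLE_LOG = """\
2017-09-21 10:02:11.123 12345 DEBUG ironic.conductor.manager [-] Setting boot device to disk for node 7a1c
2017-09-21 10:02:13.456 12345 DEBUG ironic.conductor.manager [-] Setting boot device to pxe for node 7a1c
"""

BOOT_DEVICE_RE = re.compile(
    r"Setting boot device to (?P<device>\w+) for node (?P<node>\S+)"
)

def boot_device_changes(log_text):
    """Yield (node, device) pairs for every boot-device change in the log."""
    for line in log_text.splitlines():
        m = BOOT_DEVICE_RE.search(line)
        if m:
            yield m.group("node"), m.group("device")

if __name__ == "__main__":
    for node, device in boot_device_changes(SAMPLE_LOG):
        print(node, "->", device)
```

Running this over the conductor logs from the FAILURE tarball and diffing against the SUCCESS run should show whether Ironic itself (e.g. during stack deletion or node cleaning) is the one flipping the boot device to disk.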

[1] https://thirdparty.logs.rdoproject.org/jenkins-oooq-newton-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-48/haa-01.ha.lab.eng.bos.redhat.com/home/stack/overcloud_prep_images.log.txt.gz
[2] https://thirdparty.logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-251/haa-01.ha.lab.eng.bos.redhat.com/home/stack/overcloud_prep_images.log.txt.gz
[3] https://thirdparty.logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-248/haa-01.ha.lab.eng.bos.redhat.com/home/stack/overcloud_prep_images.log.txt.gz
[4] https://thirdparty.logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-249/haa-01.ha.lab.eng.bos.redhat.com/home/stack/overcloud_prep_images.log.txt.gz

Changed in tripleo:
status: New → Triaged
importance: Undecided → Medium
milestone: none → queens-1
Changed in tripleo:
milestone: queens-1 → queens-2
Raoul Scarazzini (rasca) wrote:

I worked on isolating the problem, without any big success.
I was, however, able to increase the debug level and take the logs from the last SUCCESSFUL deployment [1] and the next FAILURE [2]. These logs also include the tcpdump pcap files from the introspection process.

Now, what we know for sure is that:

1) If I continuously deploy an env only up to the introspection (i.e. once the introspection finishes I just restart from scratch), the problem does not happen;
2) If I continuously deploy a complete env, including deleting the overcloud stack before restarting from scratch, then the problem happens every 3 or 4 deployments;
3) The nodes timing out are not always the same ones: this is extremely racy.

That said, I'm still not able to reproduce the problem on demand, nor to find a workaround that avoids it.
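The full-cycle reproduction described in point 2 above can be sketched as a loop. This is a dry-run outline (each command is echoed, not executed), and the exact tripleoclient invocations are assumptions for an Ocata-era undercloud; adapt them before running for real.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction loop from point 2: echo instead of
# executing, so it can be reviewed safely. The client commands below are
# assumed Ocata-era tripleoclient syntax.
run() { echo "+ $*"; }

i=1
while [ "$i" -le 4 ]; do
    echo "=== deployment attempt $i ==="
    run openstack overcloud node introspect --all-manageable --provide
    run openstack overcloud deploy --templates
    # Deleting the overcloud stack before the next round is the step that
    # seems to correlate with the boot-order corruption.
    run openstack stack delete overcloud --yes --wait
    i=$((i + 1))
done
```

Per point 2, roughly one iteration in every 3 or 4 should hit the introspection timeout, while skipping the deploy/delete steps never does.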

[1] http://file.rdu.redhat.com/~rscarazz/LP1718898/collect-logs_SUCCESS.tar.bz2
[2] http://file.rdu.redhat.com/~rscarazz/LP1718898/collect-logs_FAILURE.tar.bz2

Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3

Closing it since it wasn't reproducible.

Changed in tripleo:
status: Triaged → Invalid