We're seeing a lot of introspection failures in rdophase2 for newton [1] and ocata [2], starting from when we reintroduced the packaged iPXE (instead of the upstream one) in our jobs.
We are not seeing this in pike, which is what makes it strange: the ipxe package version is the same in every case (ipxe-bootimgs-20170123-1.git4e85b27.el7_4.1.noarch), yet the problem happens (frequently, but still racily) only in newton and ocata. So even though the trigger seems to be the use of the packaged ipxe, the package itself is identical in all environments. The ironic packages, of course, differ:
Ocata (with the introspection timeout error):
openstack-ironic-api-6.2.5-0.20170711215022.58a4181.el7.centos.noarch
openstack-ironic-common-6.2.5-0.20170711215022.58a4181.el7.centos.noarch
openstack-ironic-conductor-6.2.5-0.20170711215022.58a4181.el7.centos.noarch
openstack-ironic-inspector-4.2.3-0.20170711232452.b34fa9c.el7.centos.noarch
puppet-ironic-9.5.1-0.20170912042852.ec33c9a.el7.centos.noarch
python2-ironicclient-1.7.1-0.20161205104628.c0abab8.el7.centos.noarch
python-ironic-inspector-client-1.10.0-0.20170202170017.0eae82e.el7.centos.noarch
python-ironic-lib-2.1.3-0.20170209073115.152cf28.el7.centos.noarch
Pike (all good on introspection side):
openstack-ironic-api-9.1.1-0.20170908114346.feb64c2.el7.centos.noarch
openstack-ironic-common-9.1.1-0.20170908114346.feb64c2.el7.centos.noarch
openstack-ironic-conductor-9.1.1-0.20170908114346.feb64c2.el7.centos.noarch
openstack-ironic-inspector-6.0.1-0.20170824132804.0e72dcb.el7.centos.noarch
puppet-ironic-11.3.1-0.20170907060708.13e23f7.el7.centos.noarch
python2-ironicclient-1.17.0-0.20170906171257.cdff7a0.el7.centos.noarch
python-ironic-inspector-client-2.1.0-0.20170915002324.bdcab9f.el7.centos.noarch
python-ironic-lib-2.10.0-0.20170906171416.1fa0a5f.el7.centos.noarch
The reason the introspection (done via the ipxe_ipmitool driver) times out is that the failed machines end up with a boot order different from the one that was set up.
By hand we always set the iPXE network interfaces *FIRST* in the BIOS, but when we reboot a timed-out machine and look into the BIOS, we find the hard disk as the first boot device. So something is acting on the boot order.
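To check this without rebooting into the BIOS, the boot device override can also be read out-of-band over IPMI. A minimal sketch, assuming lanplus access to the node's BMC (BMC_HOST, BMC_USER and BMC_PASS are placeholders for the lab credentials, not values from the logs):

```shell
# Extract the "Boot Device Selector" field from
# `ipmitool chassis bootparam get 5` output (stdin -> stdout).
parse_boot_device() {
    awk -F': ' '/Boot Device Selector/ {print $2}'
}

# Out-of-band query (needs real BMC access, shown commented for reference):
# ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASS" \
#     chassis bootparam get 5 | parse_boot_device

# And to force PXE persistently on a suspect node, one could use:
# ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASS" \
#     chassis bootdev pxe options=persistent
```

Comparing this value on a timed-out node against a healthy one right after the failure would confirm whether the override (rather than the permanent BIOS order) is what gets flipped.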
Now, if we take a look at this failure [3], we see that it failed with introspection timeouts, but the very next deployment DID succeed on introspection [4] without any manual intervention. So we can suppose the boot order was messed up during the failed run and then restored during the successful one.
What we need to understand is WHO changes the boot order, when and why.
[1] https://thirdparty.logs.rdoproject.org/jenkins-oooq-newton-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-48/haa-01.ha.lab.eng.bos.redhat.com/home/stack/overcloud_prep_images.log.txt.gz
[2] https://thirdparty.logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-251/haa-01.ha.lab.eng.bos.redhat.com/home/stack/overcloud_prep_images.log.txt.gz
[3] https://thirdparty.logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-248/haa-01.ha.lab.eng.bos.redhat.com/home/stack/overcloud_prep_images.log.txt.gz
[4] https://thirdparty.logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-249/haa-01.ha.lab.eng.bos.redhat.com/home/stack/overcloud_prep_images.log.txt.gz
I worked on isolating the problem, without much success.
But I was able to increase the debug level and take the logs from the last SUCCESSFUL deployment [1] and the next FAILURE [2]. These logs also include a tcpdump pcap file captured during the introspection process.
Now, what we know for sure is that:
1) If I continuously deploy an env up to the introspection step (i.e. once introspection finishes, I just restart from scratch), the problem does not happen;
2) If I continuously deploy a complete env, including deleting the overcloud stack before restarting from scratch, then the problem happens every 3 or 4 deployments;
3) The nodes that time out are not always the same: this is extremely racy.
That said, I'm still not able to reproduce the problem reliably, nor to find a workaround that avoids it.
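Given how racy this is, one way to answer the "WHO changes the boot order, when and why" question is to sample the BMC-reported boot device for every node throughout a full deployment and log only the transitions. A sketch, assuming ipmitool lanplus access to each BMC (host list, credentials and the 10-second poll interval are illustrative assumptions):

```shell
# Pure helper: stdin is one boot-device reading per line;
# print a line only when the value changes from the previous reading.
report_changes() {
    awk 'NR==1 || $0 != prev {print "change ->", $0} {prev=$0}'
}

# Polling driver (needs real BMC access; commented out for reference):
# while true; do
#     ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASS" \
#         chassis bootparam get 5 \
#         | awk -F': ' '/Boot Device Selector/ {print $2}'
#     sleep 10
# done | report_changes \
#      | while read -r line; do echo "$(date -u +%FT%TZ) $line"; done \
#      >> /tmp/bootorder.log
```

Correlating the timestamps in such a log with the ironic-conductor and ironic-inspector logs should narrow down which component issues the boot-device change during the failing window.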
[1] http://file.rdu.redhat.com/~rscarazz/LP1718898/collect-logs_SUCCESS.tar.bz2
[2] http://file.rdu.redhat.com/~rscarazz/LP1718898/collect-logs_FAILURE.tar.bz2