ping test is periodically failing for the gate-tripleo-ci-centos-7-nonha-multinode-oooq

Bug #1718387 reported by Marios Andreou
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Emilien Macchi
Marios Andreou (marios-b) wrote :

I just posted an elastic-recheck query for this one so we can see how frequent it is: https://review.openstack.org/505574

tags: added: ci
tags: added: alert
Changed in tripleo:
importance: High → Critical
Jose Luis Franco (jfrancoa) wrote :
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :
Jose Luis Franco (jfrancoa) wrote :

Checking some more logs, the problem seems to come from nova:

2017-09-28 12:31:16.872 66756 ERROR nova.compute.manager [instance: ba7c96d7-1179-4c52-bf12-8a6b1d94d190] NovaException: Unable to get host UUID: /etc/machine-id is empty
2017-09-28 12:31:16.872 66756 ERROR nova.compute.manager [instance: ba7c96d7-1179-4c52-bf12-8a6b1d94d190]

http://logs.openstack.org/67/474967/28/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/ecb43a8/logs/subnode-2/var/log/nova/nova-compute.log.txt.gz#_2017-09-28_12_31_16_872

There is a related Bugzilla report, although it affected the containerized deployment: https://bugzilla.redhat.com/show_bug.cgi?id=1464182 [CLOSED]
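
For context, the message suggests that the host UUID is derived from /etc/machine-id and that the lookup fails when the file has no content. A rough sketch of such a check follows. It is not nova's actual code, and `host_uuid_from_machine_id` is a hypothetical helper name used only for illustration:

```shell
# Hypothetical helper (not part of nova): print the machine ID, or fail
# with a message shaped like the one in the log above.
host_uuid_from_machine_id() {
    path="${1:-/etc/machine-id}"
    if [ ! -e "$path" ]; then
        echo "Unable to get host UUID: $path does not exist" >&2
        return 1
    fi
    # Take the first whitespace-separated token; this is empty when the
    # file is zero-length or contains only whitespace.
    id=$(awk 'NF { print $1; exit }' "$path")
    if [ -z "$id" ]; then
        echo "Unable to get host UUID: $path is empty" >&2
        return 1
    fi
    printf '%s\n' "$id"
}
```

Running `host_uuid_from_machine_id` on the affected subnode would be one quick way to confirm whether the file is missing or merely empty.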

Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

Hi,

Thanks to bandini's comment about the heat-engine log, here is the error:

http://logs.openstack.org/67/474967/28/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/ecb43a8/logs/subnode-2/var/log/heat/heat-engine.log.txt.gz#_2017-09-28_12_31_28_313

2017-09-28 12:31:28.313 71973 ERROR heat.engine.resource ResourceInError: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2017-09-28 12:31:28.313 71973 ERROR heat.engine.resource

Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

The final error is this:

http://logs.openstack.org/67/474967/28/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/ecb43a8/logs/subnode-2/var/log/nova/nova-conductor.log.txt.gz#_2017-09-28_12_34_39_463

File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1981, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'RescheduledException: Build of instance 6d97341e-fdaa-4c95-9b7b-bbbc16edcafd was re-scheduled: Unable to get host UUID: /etc/machine-id is empty\n']

subnode-2 doesn't have /etc/machine-id.

We need to have

 [ -e /etc/machine-id ] || \
       systemd-machine-id-setup

run somewhere.
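
Since the nova message says the file is empty rather than missing, a guard could test for a non-empty file (`-s`) instead of mere existence (`-e`). A minimal sketch, assuming a hypothetical helper named `needs_machine_id` and that `systemd-machine-id-setup` is available and run as root:

```shell
# Hypothetical helper: returns 0 when the machine-id file is missing or
# zero-length (i.e. it needs to be generated), non-zero otherwise.
needs_machine_id() {
    path="${1:-/etc/machine-id}"
    [ ! -s "$path" ]
}

# Usage (as root): regenerate only when needed.
#   needs_machine_id && systemd-machine-id-setup
```

With `-e` alone, an existing but empty /etc/machine-id would pass the test and the setup command would never run.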

Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

Oki, my bad. The previous comment was wrong, as the machine-id is properly set up during bootup:

Sep 28 10:52:17 localhost systemd[1]: Initializing machine ID from random generator.[1]

but it's deleted by instack [2]:

Sep 28 08:31:20 centos-7-2-node-rax-iad-11184996-929537 sudo: jenkins : TTY=unknown ; PWD=/home/jenkins ; USER=root ; COMMAND=/bin/instack -e centos7 enable-packages-install install-types selinux-permissive hosts baremetal dhcp-all-interfaces os-collect-config overcloud-full overcloud-controller overcloud-compute overcloud-ceph-storage puppet-modules hiera os-net-config stable-interface-names grub2 element-manifest network-gateway dynamic-login enable-packages-install pip-and-virtualenv-override remove-machine-id -k extra-data pre-install install post-install -b 05-fstab-rootfs-label 00-fix-requiretty 90-rebuild-ramdisk 00-usr-local-bin-secure-path -x delorean-repo -d

Look for the remove-machine-id element from https://github.com/openstack/tripleo-puppet-elements

[1]: http://logs.openstack.org/67/474967/28/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/ecb43a8/logs/subnode-2/var/log/messages.txt.gz#_Sep_28_10_52_17

[2]: http://logs.openstack.org/67/474967/28/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/5b3a972/logs/subnode-2/var/log/secure.txt.gz#_Sep_28_08_31_20

Alex Schultz (alex-schultz) wrote :

remove-machine-id is needed for image building, but running it on an existing system is a problem. It seems that it needs to be excluded from CI for this.
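
One way to picture the exclusion is to drop `remove-machine-id` from the element list before it is handed to instack on an already-running node. A minimal sketch only: `filter_elements` is a hypothetical helper, and this is not necessarily how the actual fix was implemented.

```shell
# Hypothetical helper: print every element name given as an argument,
# one per line, except remove-machine-id.
filter_elements() {
    for element in "$@"; do
        [ "$element" = "remove-machine-id" ] || printf '%s\n' "$element"
    done
}
```

For example, `filter_elements hosts remove-machine-id grub2` prints `hosts` and `grub2` on separate lines, leaving the rest of the list untouched.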

Alex Schultz (alex-schultz) wrote :
Michele Baldessari (michele) wrote :

https://review.openstack.org/#/c/508226/ should fix this, once merged

Changed in tripleo:
assignee: Marios Andreou (marios-b) → Sofer Athlan-Guyot (sofer-athlan-guyot)
Michele Baldessari (michele) wrote :

https://review.openstack.org/#/c/508226/ has merged. Removing alert.

tags: removed: alert
Changed in tripleo:
status: Triaged → Fix Released
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
status: Fix Released → Triaged
tags: added: alert
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
assignee: Sofer Athlan-Guyot (sofer-athlan-guyot) → Emilien Macchi (emilienm)
status: Triaged → In Progress
Emilien Macchi (emilienm) wrote :

The actual fix is in https://review.openstack.org/#/c/510312/. Please review.

Changed in tripleo:
status: In Progress → Fix Released
tags: removed: alert