OpenStack Core Infrastructure

Booting from stale snapshots in rax-dfw leads to miscellaneous failures

Bug #1354829 reported by YAMAMOTO Takashi on 2014-08-10

This bug affects 4 people

Affects	Status	Importance	Assigned to	Milestone
OpenStack Core Infrastructure	In Progress	Critical	Jeremy Stanley	OpenStack Core Infrastructure kilo
OpenStack Dashboard (Horizon)	Invalid	Undecided	Unassigned
neutron	Invalid	Undecided	Unassigned
tempest	Invalid	Undecided	Unassigned

Bug Description

http://logs.openstack.org/86/110186/1/check/gate-neutron-python26/9dec53b/console.html

2014-08-10 06:51:57.341 | Started by user anonymous
2014-08-10 06:51:57.343 | Building remotely on bare-centos6-rax-dfw-1380820 in workspace /home/jenkins/workspace/gate-neutron-python26
2014-08-10 06:51:57.458 | [gate-neutron-python26] $ /bin/bash -xe /tmp/hudson1812648831707400818.sh
2014-08-10 06:51:57.540 | + rpm -ql libffi-devel
2014-08-10 06:51:57.543 | /tmp/hudson1812648831707400818.sh: line 2: rpm: command not found
2014-08-10 06:51:57.543 | + sudo yum install -y libffi-devel
2014-08-10 06:51:57.549 | sudo: no tty present and no askpass program specified
2014-08-10 06:51:57.551 | Sorry, try again.
2014-08-10 06:51:57.552 | sudo: no tty present and no askpass program specified
2014-08-10 06:51:57.552 | Sorry, try again.
2014-08-10 06:51:57.553 | sudo: no tty present and no askpass program specified
2014-08-10 06:51:57.553 | Sorry, try again.
2014-08-10 06:51:57.553 | sudo: 3 incorrect password attempts
2014-08-10 06:51:57.571 | Build step 'Execute shell' marked build as failure

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOiBcInN1ZG86IDMgaW5jb3JyZWN0IHBhc3N3b3JkIGF0dGVtcHRzXCIgQU5EIGZpbGVuYW1lOiBcImNvbnNvbGUuaHRtbFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA3NjcyNjY4NDc3fQ==
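
The part of the logstash link after the `#` appears to be a base64-encoded JSON object describing the saved search (query string, timeframe, and so on). As a minimal sketch, assuming that encoding holds, the fragment can be decoded to read the query without opening the dashboard. The function name and the sample URL below are illustrative only and are not part of any logstash or Kibana API.

```python
import base64
import json
from urllib.parse import urlsplit


def decode_dashboard_fragment(url):
    """Return the JSON object encoded in the URL fragment (the part after '#').

    Assumes the fragment is standard base64 wrapping a UTF-8 JSON document.
    """
    fragment = urlsplit(url).fragment
    # Restore any base64 padding that may have been lost in copy/paste.
    fragment += "=" * (-len(fragment) % 4)
    return json.loads(base64.b64decode(fragment).decode("utf-8"))


if __name__ == "__main__":
    # Round-trip a small hypothetical query to show the shape of the data.
    sample = {
        "search": 'message:"example" AND filename:"console.html"',
        "timeframe": "172800",
    }
    encoded = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
    decoded = decode_dashboard_fragment("http://logstash.example.org/#" + encoded)
    print(decoded["search"])
```

Passing one of the logstash.openstack.org links from this report to `decode_dashboard_fragment` should yield a dictionary whose `search` key holds the query text.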

Sam Betts (sambetts) wrote on 2014-08-11:

After looking through logstash, this does not appear to be specific to Neutron. Several of the Jenkins servers have hit this error while testing different projects, including Neutron, Horizon and Tempest.

Matt Riedemann (mriedem) wrote on 2014-08-12:

Seeing something related here: the rpm and yum commands aren't found. This only happens on centos6 python 2.6 jobs:

http://logs.openstack.org/21/112421/1/check/gate-nova-python26/5016324/console.html

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwic3VkbzogeXVtOiBjb21tYW5kIG5vdCBmb3VuZFwiIEFORCB0YWdzOlwiY29uc29sZVwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA3ODU3OTg0MDIxfQ==

Matt Riedemann (mriedem) wrote on 2014-08-12:

Actually, they look unrelated: the query in comment 2 only hits on py26 / centos6 jobs, while the other hits on trusty nodes and non-py26 jobs.

Matt Riedemann (mriedem) on 2014-08-12

Changed in horizon:
status:	New → Invalid
Changed in neutron:
status:	New → Invalid
Changed in tempest:
status:	New → Invalid
summary:

- "sudo: 3 incorrect password attempts" in gate-neutron-python26
+ "sudo: 3 incorrect password attempts" in host setup

OpenStack Infra (hudson-openstack) wrote on 2014-08-12: Related fix proposed to elastic-recheck (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/113575

Matt Riedemann (mriedem) wrote on 2014-08-12: Re: "sudo: 3 incorrect password attempts" in host setup

(11:24:31 AM) fungi: it looks like there was briefly a "sudo yum install -y libffi-devel" in the job builders after revoke-sudo
(11:24:42 AM) fungi: checking config history now to confirm
(11:25:11 AM) clarkb: fungi: mordreds fix for that wasnt correct? that is possible
(11:25:33 AM) fungi: yep, caused by https://review.openstack.org/112972 which was only that way briefly i think

Jeremy Stanley (fungi) wrote on 2014-08-12:

It looks like the incident for which this was first reported was due to https://review.openstack.org/112972 but hits in other jobs will be for unrelated reasons. I'm worried that the proposed e-r query is over-broad and will highlight a variety of different issues leading to the same symptom, the majority of which will probably be legitimate failures for bad changes.

Jeremy Stanley (fungi) wrote on 2014-08-12:

Actually, on closer inspection that change does position the sudo call in the job prior to the revoke-sudo builder, so this may be an indication that a sudoers file is not getting correctly installed.

OpenStack Infra (hudson-openstack) wrote on 2014-08-12: Related fix merged to elastic-recheck (master)

Reviewed: https://review.openstack.org/113575
Committed: https://git.openstack.org/cgit/openstack-infra/elastic-recheck/commit/?id=0359ac9ca0f69c42306b792e8a8b8e2b2fa51620
Submitter: Jenkins
Branch: master

commit 0359ac9ca0f69c42306b792e8a8b8e2b2fa51620
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 8 16:21:41 2014 -0700

    Add query for infra setup bug 1354829

    135 hits in 7 days, check and gate, all failures.

    This appears to be an infra issue with the rax-dfw
    nodes so restricting the query to those using
    wildcards.

    Change-Id: I0d5eeaa334ca928a82afb69ea42779f17b509b25
    Related-Bug: #1354829
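
The commit message above describes restricting the query to rax-dfw nodes with wildcards. The actual elastic-recheck query text is not reproduced in this report, so the following is only a sketch of the same idea applied to already-fetched log records. The wildcard pattern, the `build_node` field name, and the non-DFW node name are assumptions for illustration; only `bare-centos6-rax-dfw-1380820` comes from the console log in the bug description.

```python
from fnmatch import fnmatchcase

# Assumed pattern: any node whose name contains the rax-dfw region marker.
RAX_DFW_PATTERN = "*-rax-dfw-*"


def is_rax_dfw_node(node_name):
    """Return True if the node name matches the assumed rax-dfw wildcard."""
    return fnmatchcase(node_name, RAX_DFW_PATTERN)


def filter_rax_dfw_hits(hits):
    """Keep only hits whose (hypothetical) 'build_node' field is a rax-dfw node."""
    return [hit for hit in hits if is_rax_dfw_node(hit.get("build_node", ""))]


if __name__ == "__main__":
    hits = [
        # Node name taken from the console log in the bug description.
        {"build_node": "bare-centos6-rax-dfw-1380820"},
        # Hypothetical node in another region, for contrast.
        {"build_node": "bare-centos6-rax-ord-0000001"},
    ]
    for hit in filter_rax_dfw_hits(hits):
        print(hit["build_node"])
```

Narrowing on the node name keeps the query from matching the same symptom on other providers or regions, which is the over-broad matching concern raised earlier in this report.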

Jeremy Stanley (fungi) on 2014-08-13

Changed in openstack-ci:
status:	New → In Progress
importance:	Undecided → Critical
assignee:	nobody → Jeremy Stanley (fungi)
milestone:	none → juno

Jeremy Stanley (fungi) wrote on 2014-08-24:

I've provided some detailed information to Rackspace about instance UUIDs which hit this and the various other failures marked as duplicates, and they're hunting for an underlying cause. According to their logs we're booting these instances from the snapshots we request, but we're seeing strong evidence to suggest that they're actually booting from older snapshots we've deleted entirely. Also, we're only witnessing this behavior in their DFW region, not in the other regions where we have identical setups.

summary:

- "sudo: 3 incorrect password attempts" in host setup
+ Booting from stale snapshots in rax-dfw leads to miscellaneous failures

Jeremy Stanley (fungi) wrote on 2014-08-24:

James has also amended the nodepool prep scripts and some more common slave scripts to report the name of the snapshot from which a given worker was booted so that it can be cross-referenced from the Jenkins console logs of many jobs. This should help us confirm for certain whether instances are booting from the intended snapshots or older ones which should no longer exist at all.
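
As a rough sketch of the cross-referencing described above: once each console log carries the name of the snapshot the worker booted from, logs can be scanned for names that are not in the set of snapshots expected to exist. The log line format (`Booted from snapshot: <name>`), the function names, and the snapshot names in the example are all hypothetical, since the actual output of the amended scripts is not shown in this report.

```python
import re

# Hypothetical log line format; the real scripts' output is not shown in this report.
SNAPSHOT_LINE = re.compile(r"Booted from snapshot:\s*(\S+)")


def reported_snapshot(console_text):
    """Return the snapshot name found in a console log, or None if absent."""
    match = SNAPSHOT_LINE.search(console_text)
    return match.group(1) if match else None


def find_stale_boots(console_logs, current_snapshots):
    """Map job id -> snapshot name for jobs booted from an unexpected snapshot.

    console_logs: dict of job id -> console log text.
    current_snapshots: set of snapshot names that should currently exist.
    """
    stale = {}
    for job_id, text in console_logs.items():
        name = reported_snapshot(text)
        if name is not None and name not in current_snapshots:
            stale[job_id] = name
    return stale


if __name__ == "__main__":
    logs = {
        "job-a": "Booted from snapshot: example-snapshot-new\n+ rpm -ql libffi-devel",
        "job-b": "Booted from snapshot: example-snapshot-old\n+ rpm -ql libffi-devel",
    }
    print(find_stale_boots(logs, {"example-snapshot-new"}))
```

Any job id returned by `find_stale_boots` would be a candidate case of an instance booting from a snapshot that should no longer exist.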

Jeremy Stanley (fungi) wrote on 2014-08-26:

Unfortunately, as time goes on, the manifestation of this issue changes depending on (under the present best theory) which old broken snapshots get used in place of the ones requested. Therefore, old queries which highlight it as continuing to occur will stop showing new hits, and new symptoms will need to be correlated to the same underlying cause.

Jeremy Stanley (fungi) on 2014-10-27