Booting from stale snapshots in rax-dfw leads to miscellaneous failures

Bug #1354829 reported by YAMAMOTO Takashi on 2014-08-10
This bug affects 4 people
Affects                         Status       Importance  Assigned to     Milestone
OpenStack Core Infrastructure   In Progress  Critical    Jeremy Stanley  kilo
OpenStack Dashboard (Horizon)   Invalid      Undecided   Unassigned
neutron                         Invalid      Undecided   Unassigned
tempest                         Invalid      Undecided   Unassigned

Bug Description

http://logs.openstack.org/86/110186/1/check/gate-neutron-python26/9dec53b/console.html

2014-08-10 06:51:57.341 | Started by user anonymous
2014-08-10 06:51:57.343 | Building remotely on bare-centos6-rax-dfw-1380820 in workspace /home/jenkins/workspace/gate-neutron-python26
2014-08-10 06:51:57.458 | [gate-neutron-python26] $ /bin/bash -xe /tmp/hudson1812648831707400818.sh
2014-08-10 06:51:57.540 | + rpm -ql libffi-devel
2014-08-10 06:51:57.543 | /tmp/hudson1812648831707400818.sh: line 2: rpm: command not found
2014-08-10 06:51:57.543 | + sudo yum install -y libffi-devel
2014-08-10 06:51:57.549 | sudo: no tty present and no askpass program specified
2014-08-10 06:51:57.551 | Sorry, try again.
2014-08-10 06:51:57.552 | sudo: no tty present and no askpass program specified
2014-08-10 06:51:57.552 | Sorry, try again.
2014-08-10 06:51:57.553 | sudo: no tty present and no askpass program specified
2014-08-10 06:51:57.553 | Sorry, try again.
2014-08-10 06:51:57.553 | sudo: 3 incorrect password attempts
2014-08-10 06:51:57.571 | Build step 'Execute shell' marked build as failure

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOiBcInN1ZG86IDMgaW5jb3JyZWN0IHBhc3N3b3JkIGF0dGVtcHRzXCIgQU5EIGZpbGVuYW1lOiBcImNvbnNvbGUuaHRtbFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA3NjcyNjY4NDc3fQ==
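
The logstash link above carries its search in the URL fragment as base64-encoded JSON. A minimal Python sketch for reading the query without opening the dashboard follows. The helper name `decode_logstash_fragment` is hypothetical, and the sketch assumes the fragment is standard base64 over UTF-8 JSON, which holds for this particular URL.

```python
import base64
import json


def decode_logstash_fragment(url):
    """Return the JSON object stored in the fragment of a logstash URL."""
    fragment = url.split("#", 1)[1]
    return json.loads(base64.b64decode(fragment).decode("utf-8"))


# The URL from the bug description, split across lines for readability.
URL = (
    "http://logstash.openstack.org/#"
    "eyJzZWFyY2giOiJtZXNzYWdlOiBcInN1ZG86IDMgaW5jb3JyZWN0IHBhc3N3b3Jk"
    "IGF0dGVtcHRzXCIgQU5EIGZpbGVuYW1lOiBcImNvbnNvbGUuaHRtbFwiIiwiZmll"
    "bGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1v"
    "ZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjox"
    "NDA3NjcyNjY4NDc3fQ=="
)

query = decode_logstash_fragment(URL)
print(query["search"])
print(query["timeframe"])
```

The decoded search is `message: "sudo: 3 incorrect password attempts" AND filename: "console.html"`, with a `timeframe` of `"172800"` (two days, if the unit is seconds).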

Sam Betts (sambetts) wrote :

After looking through logstash it would appear that this is not specific to Neutron. Several of the Jenkins servers have suffered from this error when testing different projects, including Neutron, Horizon and Tempest.

Matt Riedemann (mriedem) wrote :

Actually, they look unrelated: the query in comment 2 only hits on py26 / centos6 jobs, while the other hits on trusty nodes and non-py26 jobs.

Matt Riedemann (mriedem) on 2014-08-12
Changed in horizon:
status: New → Invalid
Changed in neutron:
status: New → Invalid
Changed in tempest:
status: New → Invalid
summary: - "sudo: 3 incorrect password attempts" in gate-neutron-python26
+ "sudo: 3 incorrect password attempts" in host setup

(11:24:31 AM) fungi: it looks like there was briefly a "sudo yum install -y libffi-devel" in the job builders after revoke-sudo
(11:24:42 AM) fungi: checking config history now to confirm
(11:25:11 AM) clarkb: fungi: mordreds fix for that wasnt correct? that is possible
(11:25:33 AM) fungi: yep, caused by https://review.openstack.org/112972 which was only that way briefly i think

Jeremy Stanley (fungi) wrote :

It looks like the incident for which this was first reported was due to https://review.openstack.org/112972 but hits in other jobs will be for unrelated reasons. I'm worried that the proposed e-r query is over-broad and will highlight a variety of different issues leading to the same symptom, the majority of which will probably be legitimate failures for bad changes.

Jeremy Stanley (fungi) wrote :

Actually, on closer inspection that change does position the sudo call in the job prior to the revoke-sudo builder, so this may be an indication that a sudoers file is not getting correctly installed.

Reviewed: https://review.openstack.org/113575
Committed: https://git.openstack.org/cgit/openstack-infra/elastic-recheck/commit/?id=0359ac9ca0f69c42306b792e8a8b8e2b2fa51620
Submitter: Jenkins
Branch: master

commit 0359ac9ca0f69c42306b792e8a8b8e2b2fa51620
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 8 16:21:41 2014 -0700

    Add query for infra setup bug 1354829

    135 hits in 7 days, check and gate, all failures.

    This appears to be an infra issue with the rax-dfw
    nodes so restricting the query to those using
    wildcards.

    Change-Id: I0d5eeaa334ca928a82afb69ea42779f17b509b25
    Related-Bug: #1354829
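
The idea in the commit message (match the symptom, but only on rax-dfw nodes, using a wildcard) can be sketched locally against a saved console log. This is not the elastic-recheck query itself, which runs against logstash. The node pattern `*-rax-dfw-*` is an assumption inferred from the single node name in the console log above, and the function names are hypothetical.

```python
import fnmatch
import re

# "Building remotely on <node> in workspace <path>", as in the console log.
NODE_RE = re.compile(r"Building remotely on (\S+) in workspace")
SYMPTOM = "sudo: 3 incorrect password attempts"


def build_node(console_text):
    """Return the Jenkins node name from a console log, or None."""
    match = NODE_RE.search(console_text)
    return match.group(1) if match else None


def matches_bug(console_text, node_pattern="*-rax-dfw-*"):
    """True if the log shows the symptom on a node matching the pattern."""
    node = build_node(console_text)
    return (
        node is not None
        and fnmatch.fnmatch(node, node_pattern)
        and SYMPTOM in console_text
    )


# Two lines taken from the console log in the bug description.
sample = (
    "2014-08-10 06:51:57.343 | Building remotely on "
    "bare-centos6-rax-dfw-1380820 in workspace "
    "/home/jenkins/workspace/gate-neutron-python26\n"
    "2014-08-10 06:51:57.553 | sudo: 3 incorrect password attempts\n"
)
print(build_node(sample))   # bare-centos6-rax-dfw-1380820
print(matches_bug(sample))  # True
```

Restricting on the node name keeps the same symptom on other providers or regions from being counted against this bug, which is the over-breadth concern raised earlier in the thread.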

Jeremy Stanley (fungi) on 2014-08-13
Changed in openstack-ci:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Jeremy Stanley (fungi)
milestone: none → juno

Jeremy Stanley (fungi) wrote :

I've provided Rackspace with detailed information about the instance UUIDs which hit this and the various other failures marked as duplicates, and they're hunting for an underlying cause. According to their logs we're booting these instances from the snapshots we request, but we're seeing strong evidence to suggest that they're actually booting from older snapshots we've deleted entirely. Also, we're witnessing this behavior only in their DFW region, not in the other regions where we have identical setups.

summary: - "sudo: 3 incorrect password attempts" in host setup
+ Booting from stale snapshots in rax-dfw leads to miscellaneous failures

Jeremy Stanley (fungi) wrote :

James has also amended the nodepool prep scripts and some more common slave scripts to report the name of the snapshot from which a given worker was booted so that it can be cross-referenced from the Jenkins console logs of many jobs. This should help us confirm for certain whether instances are booting from the intended snapshots or older ones which should no longer exist at all.
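
Once each job reports the snapshot it booted from, the cross-reference described above amounts to comparing the reported name with the one that was requested. A minimal sketch, assuming the two names have already been extracted per run: the `BootRecord` type, its field names and the sample values are all hypothetical placeholders, since the thread does not show the actual output format of the prep scripts.

```python
from collections import namedtuple

# Hypothetical record, one per job run. The field names are illustrative
# and are not taken from nodepool.
BootRecord = namedtuple(
    "BootRecord", "node requested_snapshot reported_snapshot"
)


def stale_boots(records):
    """Return records whose reported snapshot differs from the requested one."""
    return [r for r in records if r.reported_snapshot != r.requested_snapshot]


# Placeholder values for illustration only, not real node or snapshot names.
records = [
    BootRecord("node-a", "snapshot-new", "snapshot-new"),
    BootRecord("node-b", "snapshot-new", "snapshot-old"),
]
for record in stale_boots(records):
    print(record.node, record.reported_snapshot)
```

Any record returned here would be direct evidence of an instance booting from a snapshot other than the intended one.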

Jeremy Stanley (fungi) wrote :

Unfortunately as time goes on, the manifestation of this issue changes depending on (under the present best theory) which old broken snapshots get used in place of the ones requested. Therefore old queries which highlight it as continuing to occur will stop showing new hits and new symptoms will need to be correlated to the same underlying cause.

Jeremy Stanley (fungi) on 2014-10-27
Changed in openstack-ci:
milestone: juno → kilo