gate-tempest-dsvm-postgres-full ssh timeout failures spiking since 7/4 (or 7/7)

Bug #1600005 reported by Tom Swanson
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Matt Riedemann
Milestone: (none)

Bug Description

Description
===========
The gate-tempest-dsvm-postgres-full job is failing on OVH nodes (perhaps others as well).

Steps to reproduce
==================
Push up a Nova, Cinder, or any other patch that runs the gate-tempest-dsvm-postgres-full job.

Expected Result
===============
The job passes.

Actual Result
=============
The job fails:
http://logs.openstack.org/48/338548/5/check/gate-tempest-dsvm-postgres-full/e578f53/

Environment
===========
Infra

Logs and Configs
================
http://logs.openstack.org/48/338548/5/check/gate-tempest-dsvm-postgres-full/e578f53/

Revision history for this message
Matt Riedemann (mriedem) wrote : Re: gate-tempest-dsvm-postgres-full ssh timeout failures spiking since 7/4

http://logstash.openstack.org/#/dashboard/file/logstash.json?query=%28message:\%22TestVolumeBootPattern:test_volume_boot_pattern\%22%20OR\n%20message:\%22TestVolumeBootPatternV2:test_volume_boot_pattern\%22%29%20AND\nmessage:\%22Initializing%20SSH%20connection%20to\%22%20AND%20message:\%22via%20SSH%20timed%20out.\%22%20AND%20tags:\%22tempest.txt\%22%20AND%20voting:1&from=7d
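
Decoded, that query is roughly (the %xx escapes expanded for readability):

    (message:"TestVolumeBootPattern:test_volume_boot_pattern"
     OR message:"TestVolumeBootPatternV2:test_volume_boot_pattern")
    AND message:"Initializing SSH connection to"
    AND message:"via SSH timed out."
    AND tags:"tempest.txt" AND voting:1

i.e. SSH timeouts in the volume boot pattern tests on voting jobs, over the last 7 days.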

It's not just OVH. The postgres job runs the nova metadata API so that's probably causing some issues.

summary: - gate-tempest-dsvm-postgres-full failing on node ovh
+ gate-tempest-dsvm-postgres-full ssh timeout failures spiking since 7/4
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

http://tinyurl.com/zx6fabn also shows that it's really just the postgres job that's spiking with failures.

Revision history for this message
Matt Riedemann (mriedem) wrote :

At this point we have an active server:

http://logs.openstack.org/21/299621/7/gate/gate-tempest-dsvm-postgres-full/c182c88/console.html#_2016-07-07_21_14_39_905789

The fixed IP is 10.1.0.3.

Then the test adds a floating IP:

http://logs.openstack.org/21/299621/7/gate/gate-tempest-dsvm-postgres-full/c182c88/console.html#_2016-07-07_21_14_39_906137

The floating IP is 172.24.5.10.

From the guest's console output when this fails, it looks like the guest is able to get the instance-id from the metadata service successfully:

2016-07-07 21:14:39.916195 | checking http://169.254.169.254/2009-04-04/instance-id
2016-07-07 21:14:39.916213 | successful after 1/20 tries: up 6.07. iid=i-00000060
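
That check is just an HTTP GET against the EC2-style metadata endpoint; something like this (a minimal sketch assuming a guest with Python available -- the cirros image does the equivalent from its init scripts):

    import urllib.request

    # Ask the nova metadata service, on the link-local metadata address, for
    # this instance's EC2-style id. A working metadata path returns something
    # like "i-00000060", matching the console output above.
    url = "http://169.254.169.254/2009-04-04/meta-data/instance-id"
    with urllib.request.urlopen(url, timeout=10) as resp:
        print(resp.read().decode())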

I see the fixed IP in the guest:

2016-07-07 21:14:39.916707 | === network info ===
2016-07-07 21:14:39.916721 | if-info: lo,up,127.0.0.1,8,::1
2016-07-07 21:14:39.916739 | if-info: eth0,up,10.1.0.3,20,fe80::f816:3eff:fe4f:4a13
2016-07-07 21:14:39.916754 | ip-route:default via 10.1.0.1 dev eth0
2016-07-07 21:14:39.916780 | ip-route:10.1.0.0/20 dev eth0 src 10.1.0.3

So maybe the floating IP didn't get set up properly.
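
One quick way to see what the API thinks about the association (a rough sketch with placeholder endpoint/token/server values, not the exact tempest calls; note the API can still report the floating IP even if the NAT rules on the compute host were never set up):

    import requests

    NOVA = "http://127.0.0.1:8774/v2.1"     # devstack's local endpoint
    TOKEN = "<keystone token>"              # placeholder
    SERVER = "<server uuid>"                # placeholder

    # GET /servers/{id} includes an "addresses" map; a correctly associated
    # floating IP shows up next to the fixed IP with type "floating".
    resp = requests.get("%s/servers/%s" % (NOVA, SERVER),
                        headers={"X-Auth-Token": TOKEN})
    for net, addrs in resp.json()["server"]["addresses"].items():
        for addr in addrs:
            print(net, addr.get("OS-EXT-IPS:type"), addr["addr"])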

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is when the floating IP add request is made in the API:

2016-07-07 21:06:49,936 20905 INFO [tempest.lib.common.rest_client] Request (TestSnapshotPattern:test_snapshot_pattern): 202 POST http://127.0.0.1:8774/v2.1/servers/73709451-bd45-4f6f-baf1-aa9c28c25b70/action 0.751s
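
(For reference, that 202 POST is the standard addFloatingIp server action; its body is just a small JSON document along the lines of the following, address illustrative:)

    {"addFloatingIp": {"address": "172.24.5.10"}}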

At this point we've refreshed the nw info cache on the instance and it has the floating IP:

http://logs.openstack.org/21/299621/7/gate/gate-tempest-dsvm-postgres-full/c182c88/logs/screen-n-net.txt.gz#_2016-07-07_21_06_49_926

The lease on the fixed IP comes in 9 seconds after that though:

http://logs.openstack.org/21/299621/7/gate/gate-tempest-dsvm-postgres-full/c182c88/logs/screen-n-net.txt.gz#_2016-07-07_21_06_58_051

And from syslog:

http://logs.openstack.org/21/299621/7/gate/gate-tempest-dsvm-postgres-full/c182c88/logs/syslog.txt.gz#_Jul__7_21_06_55

Jul 7 21:06:55 ubuntu-trusty-osic-cloud1-2340260 dnsmasq-dhcp[21163]: DHCPDISCOVER(br100) fa:16:3e:4f:4a:13
Jul 7 21:06:55 ubuntu-trusty-osic-cloud1-2340260 dnsmasq-dhcp[21163]: DHCPOFFER(br100) 10.1.0.3 fa:16:3e:4f:4a:13
Jul 7 21:06:55 ubuntu-trusty-osic-cloud1-2340260 dnsmasq-dhcp[21163]: DHCPREQUEST(br100) 10.1.0.3 fa:16:3e:4f:4a:13
Jul 7 21:06:55 ubuntu-trusty-osic-cloud1-2340260 dnsmasq-dhcp[21163]: Ignoring domain common.compute-instance-1293652051.novalocal for DHCP host name tempest
Jul 7 21:06:55 ubuntu-trusty-osic-cloud1-2340260 dnsmasq-dhcp[21163]: DHCPACK(br100) 10.1.0.3 fa:16:3e:4f:4a:13 tempest

Revision history for this message
Matt Riedemann (mriedem) wrote :

http://tinyurl.com/j4tj76v makes it look like something happened across all jobs around 7/4, but the spike on 7/7 is just the postgres job.

summary: - gate-tempest-dsvm-postgres-full ssh timeout failures spiking since 7/4
+ gate-tempest-dsvm-postgres-full ssh timeout failures spiking since 7/4 (or 7/7)
Revision history for this message
Matt Riedemann (mriedem) wrote :

project-config change to make the pg job non-voting while we debug: https://review.openstack.org/#/c/339824/

Revision history for this message
Dan Smith (danms) wrote :

I reverted from 3e48d05d2174824c334ba2f6ed0476c4171277e7..master last week and got three consecutive successful runs of the test out of it (which I think is statistically significant). Retrying this morning with d8909c43430bb1fd1cc19282166d686e31196a09..master to narrow down a problematic change. However, nothing in the original set stood out as a potential culprit.
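
The approach boils down to reverting everything after a known-good commit and letting the gate run the postgres job against the reverted tree; roughly (a sketch of the idea, not Dan's actual commands):

    import subprocess

    KNOWN_GOOD = "3e48d05d2174824c334ba2f6ed0476c4171277e7"

    # Revert every nova commit merged after the known-good point onto a
    # throwaway branch, then push that up for review so the gate runs
    # gate-tempest-dsvm-postgres-full against it a few times.
    subprocess.check_call(["git", "checkout", "-b", "pg-bisect", "origin/master"])
    subprocess.check_call(["git", "revert", "--no-edit",
                           KNOWN_GOOD + "..origin/master"])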

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/340479

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/340479
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c1a83a3fb8b7a2ad6bd0d0bafb4a38719cf487ae
Submitter: Jenkins
Branch: master

commit c1a83a3fb8b7a2ad6bd0d0bafb4a38719cf487ae
Author: Matt Riedemann <email address hidden>
Date: Mon Jul 11 17:14:16 2016 +0000

    Revert "Detach volume after deleting instance with no host"

    This reverts commit 5ce74fa06c0e7a70fdc927b2c1f364af83f7de1d.

    We think this is causing a race and the postgres job to fail
    since it's erroneously doing local deletes and not cleaning
    up from the computes. We're not totally sure why it would
    only impact the postgres job though, but the original change
    failed the job and was rechecked, and the time it was merged
    coincides with when the postgres job started spiking with
    failures.

    Related-Bug: #1600005
    Closes-Bug: #1600031

    Change-Id: I0ed184b579b8a8d861e4d2a7c317bf0bc0623d50

Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → Fix Released