Attempts to verify guests are running via SSH fail. SSH connection to guest does not work.

Bug #1253896 reported by Clark Boylan
This bug affects 40 people
Affects                   Status        Importance  Assigned to         Milestone
OpenStack Compute (nova)  Fix Released  High        Sean Dague
neutron                   Fix Released  Critical    Salvatore Orlando
Havana                    Won't Fix     Critical    Unassigned
tempest                   Invalid       Critical    Unassigned

Bug Description

An example of this can be found at http://logs.openstack.org/74/57774/2/gate/gate-tempest-devstack-vm-full/e592961/console.html. This test failure seems to cause the tearDownClass failure and the process exit code failure.

Judging by the logs below, the VM is coming up, and the test is connecting to the SSH server (dropbear) running in the VM, but the authentication is failing. It appears that authentication is attempted several times before paramiko gives up, causing the test to fail. I think this indicates there isn't a network or compute problem; instead it is possible that the client doesn't have the correct key or that the authorized keys aren't configured properly on the server side. But these are just guesses, and I haven't been able to get any concrete data that would support these theories.

2013-11-22 05:36:33.980 | 2013-11-22 05:32:17,029 Adding <SecurityGroupRule from_port=-1, group={}, id=14, ip_protocol=icmp, ip_range={u'cidr': u'0.0.0.0/0'}, parent_group_id=105, to_port=-1> to shared resources of TestMinimumBasicScenario
2013-11-22 05:36:33.980 | 2013-11-22 05:32:34,226 starting thread (client mode): 0x52e7e50L
2013-11-22 05:36:33.980 | 2013-11-22 05:32:34,232 Connected (version 2.0, client dropbear_2012.55)
2013-11-22 05:36:33.981 | 2013-11-22 05:32:34,237 kex algos:['diffie-hellman-group1-sha1', 'diffie-hellman-group14-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes128-ctr', '3des-ctr', 'aes256-ctr', 'aes128-cbc', '3des-cbc', 'aes256-cbc', 'twofish256-cbc', 'twofish-cbc', 'twofish128-cbc'] server encrypt:['aes128-ctr', '3des-ctr', 'aes256-ctr', 'aes128-cbc', '3des-cbc', 'aes256-cbc', 'twofish256-cbc', 'twofish-cbc', 'twofish128-cbc'] client mac:['hmac-sha1-96', 'hmac-sha1', 'hmac-md5'] server mac:['hmac-sha1-96', 'hmac-sha1', 'hmac-md5'] client compress:['none'] server compress:['none'] client lang:[''] server lang:[''] kex follows?False
2013-11-22 05:36:33.981 | 2013-11-22 05:32:34,238 Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
2013-11-22 05:36:33.981 | 2013-11-22 05:32:34,238 using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
2013-11-22 05:36:33.981 | 2013-11-22 05:32:34,433 Switch to new keys ...
2013-11-22 05:36:33.982 | 2013-11-22 05:32:34,434 Adding ssh-rsa host key for 172.24.4.227: 189c16acb93fe44ae975e1c653f1856c
2013-11-22 05:36:33.982 | 2013-11-22 05:32:34,434 Trying SSH key 9a9afe52a9485c15495a59b94ebca6b6
2013-11-22 05:36:33.982 | 2013-11-22 05:32:34,437 userauth is OK
2013-11-22 05:36:33.982 | 2013-11-22 05:32:35,104 Authentication (publickey) failed.
2013-11-22 05:36:33.982 | 2013-11-22 05:32:36,693 starting thread (client mode): 0x52f9190L
2013-11-22 05:36:33.982 | 2013-11-22 05:32:36,697 Connected (version 2.0, client dropbear_2012.55)
2013-11-22 05:36:33.983 | 2013-11-22 05:32:36,699 kex algos:['diffie-hellman-group1-sha1', 'diffie-hellman-group14-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes128-ctr', '3des-ctr', 'aes256-ctr', 'aes128-cbc', '3des-cbc', 'aes256-cbc', 'twofish256-cbc', 'twofish-cbc', 'twofish128-cbc'] server encrypt:['aes128-ctr', '3des-ctr', 'aes256-ctr', 'aes128-cbc', '3des-cbc', 'aes256-cbc', 'twofish256-cbc', 'twofish-cbc', 'twofish128-cbc'] client mac:['hmac-sha1-96', 'hmac-sha1', 'hmac-md5'] server mac:['hmac-sha1-96', 'hmac-sha1', 'hmac-md5'] client compress:['none'] server compress:['none'] client lang:[''] server lang:[''] kex follows?False
2013-11-22 05:36:33.983 | 2013-11-22 05:32:36,699 Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
2013-11-22 05:36:33.983 | 2013-11-22 05:32:36,699 using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
2013-11-22 05:36:33.983 | 2013-11-22 05:32:36,903 Switch to new keys ...
2013-11-22 05:36:33.983 | 2013-11-22 05:32:36,904 Trying SSH key 9a9afe52a9485c15495a59b94ebca6b6
2013-11-22 05:36:33.984 | 2013-11-22 05:32:36,906 userauth is OK
2013-11-22 05:36:33.984 | 2013-11-22 05:32:37,438 Authentication (publickey) failed.
2013-11-22 05:36:33.984 | 2013-11-22 05:32:39,035 starting thread (client mode): 0x24c62d0L
2013-11-22 05:36:33.984 | 2013-11-22 05:32:39,043 Connected (version 2.0, client dropbear_2012.55)
2013-11-22 05:36:33.991 | 2013-11-22 05:32:39,045 kex algos:['diffie-hellman-group1-sha1', 'diffie-hellman-group14-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes128-ctr', '3des-ctr', 'aes256-ctr', 'aes128-cbc', '3des-cbc', 'aes256-cbc', 'twofish256-cbc', 'twofish-cbc', 'twofish128-cbc'] server encrypt:['aes128-ctr', '3des-ctr', 'aes256-ctr', 'aes128-cbc', '3des-cbc', 'aes256-cbc', 'twofish256-cbc', 'twofish-cbc', 'twofish128-cbc'] client mac:['hmac-sha1-96', 'hmac-sha1', 'hmac-md5'] server mac:['hmac-sha1-96', 'hmac-sha1', 'hmac-md5'] client compress:['none'] server compress:['none'] client lang:[''] server lang:[''] kex follows?False
2013-11-22 05:36:33.992 | 2013-11-22 05:32:39,045 Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
2013-11-22 05:36:33.992 | 2013-11-22 05:32:39,045 using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
2013-11-22 05:36:33.992 | 2013-11-22 05:32:39,222 Switch to new keys ...
2013-11-22 05:36:33.992 | 2013-11-22 05:32:39,237 Trying SSH key 9a9afe52a9485c15495a59b94ebca6b6
2013-11-22 05:36:33.992 | 2013-11-22 05:32:39,239 userauth is OK
2013-11-22 05:36:33.993 | 2013-11-22 05:32:39,723 Authentication (publickey) failed.
2013-11-22 05:36:33.993 | 2013-11-22 05:32:41,343 starting thread (client mode): 0x52f9f10L
2013-11-22 05:36:33.993 | 2013-11-22 05:32:41,348 Connected (version 2.0, client dropbear_2012.55)
2013-11-22 05:36:33.993 | 2013-11-22 05:32:41,350 kex algos:['diffie-hellman-group1-sha1', 'diffie-hellman-group14-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes128-ctr', '3des-ctr', 'aes256-ctr', 'aes128-cbc', '3des-cbc', 'aes256-cbc', 'twofish256-cbc', 'twofish-cbc', 'twofish128-cbc'] server encrypt:['aes128-ctr', '3des-ctr', 'aes256-ctr', 'aes128-cbc', '3des-cbc', 'aes256-cbc', 'twofish256-cbc', 'twofish-cbc', 'twofish128-cbc'] client mac:['hmac-sha1-96', 'hmac-sha1', 'hmac-md5'] server mac:['hmac-sha1-96', 'hmac-sha1', 'hmac-md5'] client compress:['none'] server compress:['none'] client lang:[''] server lang:[''] kex follows?False
2013-11-22 05:36:33.993 | 2013-11-22 05:32:41,350 Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
2013-11-22 05:36:33.994 | 2013-11-22 05:32:41,350 using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
2013-11-22 05:36:33.994 | 2013-11-22 05:32:41,602 Switch to new keys ...
2013-11-22 05:36:33.994 | 2013-11-22 05:32:41,607 Trying SSH key 9a9afe52a9485c15495a59b94ebca6b6
2013-11-22 05:36:33.994 | 2013-11-22 05:32:41,610 userauth is OK
2013-11-22 05:36:33.994 | 2013-11-22 05:32:42,237 Authentication (publickey) failed.
2013-11-22 05:36:33.994 | 2013-11-22 05:32:43,884 starting thread (client mode): 0x24c6350L
2013-11-22 05:36:33.995 | 2013-11-22 05:32:43,917 Connected (version 2.0, client dropbear_2012.55)
2013-11-22 05:36:33.995 | 2013-11-22 05:32:43,919 kex algos:['diffie-hellman-group1-sha1', 'diffie-hellman-group14-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes128-ctr', '3des-ctr', 'aes256-ctr', 'aes128-cbc', '3des-cbc', 'aes256-cbc', 'twofish256-cbc', 'twofish-cbc', 'twofish128-cbc'] server encrypt:['aes128-ctr', '3des-ctr', 'aes256-ctr', 'aes128-cbc', '3des-cbc', 'aes256-cbc', 'twofish256-cbc', 'twofish-cbc', 'twofish128-cbc'] client mac:['hmac-sha1-96', 'hmac-sha1', 'hmac-md5'] server mac:['hmac-sha1-96', 'hmac-sha1', 'hmac-md5'] client compress:['none'] server compress:['none'] client lang:[''] server lang:[''] kex follows?False
2013-11-22 05:36:33.995 | 2013-11-22 05:32:43,919 Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
2013-11-22 05:36:33.995 | 2013-11-22 05:32:43,919 using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
2013-11-22 05:36:33.995 | 2013-11-22 05:32:44,015 Switch to new keys ...
2013-11-22 05:36:33.996 | 2013-11-22 05:32:44,016 Trying SSH key 9a9afe52a9485c15495a59b94ebca6b6
2013-11-22 05:36:33.996 | 2013-11-22 05:32:44,017 userauth is OK
2013-11-22 05:36:33.996 | 2013-11-22 05:32:44,509 Authentication (publickey) failed.
2013-11-22 05:36:33.996 | 2013-11-22 05:32:46,157 starting thread (client mode): 0x5301c50L
2013-11-22 05:36:33.996 | 2013-11-22 05:32:46,157 Exception: Error reading SSH protocol banner
2013-11-22 05:36:33.997 | 2013-11-22 05:32:46,158 Traceback (most recent call last):
2013-11-22 05:36:33.997 | 2013-11-22 05:32:46,159 File "/usr/local/lib/python2.7/dist-packages/paramiko/transport.py", line 1557, in run
2013-11-22 05:36:33.997 | 2013-11-22 05:32:46,159 self._check_banner()
2013-11-22 05:36:33.997 | 2013-11-22 05:32:46,159 File "/usr/local/lib/python2.7/dist-packages/paramiko/transport.py", line 1683, in _check_banner
2013-11-22 05:36:33.998 | 2013-11-22 05:32:46,159 raise SSHException('Error reading SSH protocol banner' + str(x))
2013-11-22 05:36:33.998 | 2013-11-22 05:32:46,159 SSHException: Error reading SSH protocol banner
2013-11-22 05:36:33.998 | 2013-11-22 05:32:46,159
2013-11-22 05:36:33.998 | 2013-11-22 05:32:46,214 EOF in transport thread
2013-11-22 05:36:33.999 | 2013-11-22 05:32:46,295 EOF in transport thread
2013-11-22 05:36:33.999 | 2013-11-22 05:32:46,396 EOF in transport thread
2013-11-22 05:36:33.999 | 2013-11-22 05:32:46,496 EOF in transport thread
2013-11-22 05:36:34.000 | 2013-11-22 05:32:46,596 EOF in transport thread
2013-11-22 05:36:34.001 | 2013-11-22 05:32:46,635

Tags: testing
Revision history for this message
David Kranz (david-kranz) wrote :

Logstash says this is a persistent flaky bug. We have a number of these, including the console log one, that all have one thing in common: they involve expected activity on the booted VM between when the VM is reported as 'active' and when it is "really booted". I checked in logstash and this never happens in neutron, except for the tests that have been trying to run neutron in parallel. So this is likely an issue of super-slow response from the VM, both due to the load on the host and because the VM is nested. I saw that the banner checking code in paramiko allows 15s for the first line to appear but only 2 seconds (hard-wired) for each subsequent line. Also, the code in tempest that does the SSH connection catches socket errors (presumably for when the VM networking is not up yet) and AuthenticationException and then retries, but it does not catch SSHException.

I will try catching this exception and see if that fixes the problem. The console log bugs might be similar in that they demand that 10 lines show up, but perhaps sometimes that just takes a really long time. Please tell me if this makes no sense.
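A minimal sketch of the retry change described above, as a standalone helper; the function name, parameters, and timeouts are illustrative rather than the actual tempest SSH client code:

    import socket
    import time

    import paramiko


    def connect_with_retry(host, username, pkey, timeout=300, sleep=1):
        """Retry the SSH connection while the guest boots, also tolerating
        SSHException (e.g. 'Error reading SSH protocol banner')."""
        end = time.time() + timeout
        while True:
            client = paramiko.SSHClient()
            client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            try:
                client.connect(host, username=username, pkey=pkey, timeout=10)
                return client
            except (socket.error,
                    paramiko.AuthenticationException,
                    paramiko.SSHException):  # the exception not caught today
                client.close()
                if time.time() > end:
                    raise
                time.sleep(sleep)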

Revision history for this message
Joe Gordon (jogo) wrote :

message:"BadRequest: Invalid volume: Volume status must be available or error, but current status is: in-use (HTTP 400)" AND filename:"console.html"

That query seems to also work for this bug.

Also, paramiko is not parallel friendly.

Revision history for this message
David Kranz (david-kranz) wrote :

Indeed. It seems there is a bug in this test. The volume is created with a method that will delete the volume during tearDownClass, but the code that attaches the volume does not add a cleanup to put the volume back in a state where the delete can succeed. So when the SSH step, which sits between the attach and detach calls, raises an exception, the volume is left in an undeletable state.
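A hypothetical illustration of the missing cleanup; the method and client names are assumptions about the scenario test, not the exact tempest code:

    def nova_volume_attach(self):
        self.servers_client.attach_volume(self.server['id'],
                                          self.volume['id'],
                                          device='/dev/vdb')
        # Register the detach as a cleanup so that a later SSH failure still
        # leaves the volume in a deletable state for tearDownClass.
        self.addCleanup(self.nova_volume_detach)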

Revision history for this message
Joe Gordon (jogo) wrote :

Ignore the comment about parallel

Revision history for this message
Joe Gordon (jogo) wrote :

Since we are seeing this fairly often in the gate, marking as critical.

I think the problem is that we don't wait long enough for the server to come up for SSH; dropbear may delay starting if there isn't enough entropy.

Changed in tempest:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → icehouse-1
Revision history for this message
Joe Gordon (jogo) wrote :
Revision history for this message
Sean Dague (sdague) wrote :
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

For the failures on neutron, I suspect the same root cause as the remaining occurrences of bug 1224001.

The OVS agent is unable to wire the port before the timeout expires.
By enabling polling we reduced occurrences, but we have since identified other issues, which are being worked on now.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Adding neutron, marking as critical, and assigning it to me to triage.

Changed in neutron:
importance: Undecided → High
assignee: nobody → Salvatore Orlando (salvatore-orlando)
milestone: none → icehouse-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/58769

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/58827

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/58860

Xiang Hui (xianghui)
Changed in neutron:
status: New → Confirmed
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/58827
Committed: http://github.com/openstack/neutron/commit/da00ed76e6008bd06dada0f0441ae436dd759cdf
Submitter: Jenkins
Branch: master

commit da00ed76e6008bd06dada0f0441ae436dd759cdf
Author: Salvatore Orlando <email address hidden>
Date: Wed Nov 27 14:09:25 2013 +0000

    Revert "ML2 plugin should not delete ports on subnet deletion"

    This reverts commit 0d131ff0e9964cb6a65f64809270f9d597c2d5d1

    There is really no problem with this change. However, it is probably
    triggering a port_update notification to the agent for each port
    with an allocated IP.
    The agent handles that notification in a way which might be improved
    from a scalability perspective.

    I don't actually want this change to removed, I am just checking
    whether neutron without it passess jobs.

    Change-Id: I5494b607127b261043dcddfdc10c93a28ec20af5
    Related-Bug: 1253896
    Related-Bug: 1254236

Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-1 → icehouse-2
Revision history for this message
Anita Kuno (anteaya) wrote : Re: test_minimum_basic_scenario fails with SSHException: Error reading SSH protocol banner
Revision history for this message
Joe Gordon (jogo) wrote :
Changed in neutron:
status: In Progress → Invalid
Changed in nova:
importance: Undecided → Critical
Revision history for this message
Joe Gordon (jogo) wrote :

Since we aren't seeing this for neutron in the gate, this must be a nova-network related issue.

Revision history for this message
Robert Collins (lifeless) wrote :

Paramiko is parallel safe; eventlet, by breaking all the standard tools, breaks it, that's all :) But tempest isn't eventlet based AIUI, so we can ignore that.

Is this test using config drive or metadata server? My first suspicion would be failed cloud-init in the VM leading to no creds - so either a metadata server failure, or config drive not being attached properly or some such. The block device being attached at the same time could suggest an interaction w/config drive.

Revision history for this message
Shawn Hartsock (hartsock) wrote :

Just to summarize what I'm seeing here... the root cause is:

    raise SSHException('Error reading SSH protocol banner' + str(x))

Which typically happens when the SSH server has "bogged down" and is processing requests really slowly.

So, IMHO, the right thing to do, instead of simply retrying, is to back off and idle until the SSH server becomes less busy (a sketch follows below).

If I'm right:
* you'll only see this on boxen under heavy load
* you'll not see this on boxen with light load
* you'll be able to relieve the problem by slowing down the boxen acting as clients.
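A minimal sketch of the back-off idea, under the assumption that connect is any callable that raises paramiko.SSHException on failure; the attempt counts and delays are illustrative:

    import time

    import paramiko


    def ssh_with_backoff(connect, attempts=8, initial_delay=1, max_delay=60):
        # Wait longer after each failed attempt instead of hammering the
        # already busy SSH server at a fixed interval.
        delay = initial_delay
        for attempt in range(attempts):
            try:
                return connect()
            except paramiko.SSHException:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)
                delay = min(delay * 2, max_delay)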

Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-2 → none
Thierry Carrez (ttx)
Changed in nova:
milestone: none → icehouse-2
Changed in nova:
status: New → Triaged
Revision history for this message
Joe Gordon (jogo) wrote :
Revision history for this message
Joe Gordon (jogo) wrote :

Spot-checking a case of this, I don't see much load around the time of the test failure:

http://logs.openstack.org/28/61328/1/gate/gate-tempest-dsvm-neutron/79f110b/testr_results.html.gz
http://logs.openstack.org/28/61328/1/gate/gate-tempest-dsvm-neutron/79f110b/logs/screen-sysstat.txt.gz

There appears to be a significant rise in the frequency of this bug, so I wonder if there was a regression somewhere in neutron in the last day or two. Perhaps it was a change in how we configure the gate as well.

Revision history for this message
Sean Dague (sdague) wrote :

This bug is not invalid for neutron; it's unfortunate that jog0 said that earlier, as neutron is exposing this more often than nova-network.

Changed in neutron:
importance: High → Critical
status: Invalid → Confirmed
Revision history for this message
Joe Gordon (jogo) wrote :

FWIW, a week ago this didn't appear to be a neutron issue. It looks like there was a recent spike in this bug, which leads me to believe that there was a regression in neutron (or something related).

Revision history for this message
Joe Gordon (jogo) wrote :

This bug may be a duplicate of https://bugs.launchpad.net/tempest/+bug/1132879, which has code to reproduce the bug.

Revision history for this message
Attila Fazekas (afazekas) wrote :

In our case the VMs get the SSH key from the metadata service; it is usually worth checking the nova-api logs for metadata requests.

I wonder whether it would be useful to log the VM's "serial" console output on every SSH connection failure, in order to see more information about metadata requests and DHCP from the VM's point of view.
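A rough sketch of that suggestion, with hypothetical client and variable names rather than actual tempest attributes: dump the guest's serial console through the nova API whenever an SSH attempt fails, so the DHCP and metadata traffic seen by the VM ends up in the test logs.

    import logging

    LOG = logging.getLogger(__name__)

    try:
        ssh_client.validate_authentication()
    except Exception:
        # Capture the tail of the guest console before re-raising, so DHCP
        # and metadata requests seen by the VM end up in the test logs.
        console = servers_client.get_console_output(server['id'], length=100)
        LOG.debug("Guest console log after SSH failure:\n%s", console)
        raise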

Sean Dague (sdague)
summary: - test_minimum_basic_scenario fails with SSHException: Error reading SSH
- protocol banner
+ Attempts to verify guests are running via SSH fails. SSH connection to
+ guest does not work.
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I am not able to identify any specific patch which might have made things so much worse.
As an interesting data point, the large majority of the failures occur with the non-isolated job.

I am not yet able even to root-cause the failure. I looked at 3 distinct traces and in every case:
- the VM VIF is wired before the timeout
- the DHCP is active and distributes an address before the timeout expires (the DHCP ACK for the VIF's MAC can be seen in syslog)
- the internal router port is always configured on the l3 agent and wired on the ovs agent before the timeout
- the floating IP is configured on the l3 agent before the timeout

Security groups issues would cause 100% of jobs to fail (and also since we're not using the hybrid driver it seems they're not even enforced).

The most likely cause of failure, even if I cannot confirm it, should in this case reside in the L3 agent, and in particular in the floating IP setup. If the machine gets an IP from its DHCP server, L2 connectivity is there, and the correct IP is given. The router port is also correctly configured.
Even though I analysed only 3 failures, I noticed that I did not see any crash in test_network_basic_ops, while in all 3 failures all three scenario tests creating a floating IP via nova failed. I do not know if this might be a good trail to follow, but the failure pattern is definitely peculiar.

In the logs the address which ends up being unreachable is assigned to other qg-xxx interfaces before the failure. The same address being on multiple qg interfaces on br-ex could be a problem, even if those interfaces are in different namespaces, since they are all on the same network.

I also checked devstack, devstack-gate, and tempest for relevant changes, but even there no luck.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/62098

Changed in neutron:
assignee: Salvatore Orlando (salvatore-orlando) → Mark McClain (markmcclain)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/62098
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ec2b8b50182f0629fec5aee5c80b1560b2512235
Submitter: Jenkins
Branch: master

commit ec2b8b50182f0629fec5aee5c80b1560b2512235
Author: Mark McClain <email address hidden>
Date: Fri Dec 13 16:53:09 2013 -0500

    Revert "move rpc_setup to the last step of __init__"

    This reverts commit f691ebe03916a78cbf18017d628a28b17f147700.

    Change-Id: Idc0bb0e4401e2561f2d8e3132f14df14b7537448
    Partial-Bug: #1253896

Thierry Carrez (ttx)
Changed in neutron:
milestone: none → icehouse-2
Revision history for this message
Attila Fazekas (afazekas) wrote :

The test_minimum_basic_scenario does a reboot before trying to SSH; both the 'rebuild' and 'reboot' actions cause random failures of the SSH connection via floating IP.

https://bugs.launchpad.net/nova/+bug/1132879
https://bugs.launchpad.net/nova/+bug/1240849

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/58860
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2b375c0f15fd43af23fbd28b85929d63a753548b
Submitter: Jenkins
Branch: master

commit 2b375c0f15fd43af23fbd28b85929d63a753548b
Author: Salvatore Orlando <email address hidden>
Date: Wed Nov 27 13:26:07 2013 -0800

    Do not trigger agent notification if bindings do not change

    The method _process_port_bindings for the ml2 plugin should not
    return True when the host binding does not change, otherwise an
    unnecessary notification will be sent to the agent

    Closes-Bug: #1255680
    Related-Bug: #1253896
    Partially Implements: neutron-tempest-parallel

    Change-Id: I8a40090af347ca430ff6c8e2211fa34bb2bd0f8c

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

We have identified several causes for the SSH timeout, but one particular failure mode has always puzzled me.
It can be observed here for instance: http://logs.openstack.org/58/63558/2/check/check-tempest-dsvm-neutron-pg/1bd26df

What happens is that everything appears to be perfectly connected.
The VM port is up, the dnsmasq instance is up and entries in the host file properly populated.
Also, router connections and floating IP setup is done properly.

However, no DHCPDISCOVER is seen from the VM (although the DHCPRELEASE is sent when the VM is destroyed, confirming the host file was correctly configured).

It seems that I now have a hint about the root cause. It just seems too easy to be true.
Basically, the problem happens when the network is created immediately after the service and the agent start up. What happens is that the DHCP port is wired on the DEAD VLAN instead of being tagged with the appropriate tag.

The agent fails to retrieve the port details from the plugin, because the ML2 plugin failed the binding.
Looking at the neutron server logs, it seems that the binding fails simply because there are no agents. The other failure modes are i) no active agent and ii) error while adding the segment, but according to the logs neither of these two conditions is happening.

Indeed the neutron server log (excerpt: http://paste.openstack.org/show/55918/) reveals the first state report from the OVS agent is received 128 milliseconds after the create_dhcp_port message from the DHCP Agent.

How to solve this problem?
One way would be to discuss what to do with unbound ports. There might be a sort of automatic rescheduling which gets triggered once agents become available; this is doable but perhaps it is something that deserves more discussion.

If a port failed to bind, it will stay in a DOWN status, which is a consistent representation of the fact the port does not work. Since that's a DHCP port, this is a problem for the whole network. Tempest should not even attempt to boot the VM in this case. This would make identifying the issue way easier.

This is a problem which might occur at startup, but also when agents fail to report their state under load.
We already have at least two developers actively working on this. Gate-wise, I would say we should adopt the following mitigation measures:
1 - Delay the tempest test start until the relevant agents are reported as active (a rough sketch follows below). If they do not become active, that's a valid reason for failing tempest (or devstack), in my opinion.
2 - Increase the timeout interval for reporting agents dead. This will decrease the chance that agents are declared dead when higher load on the gate might cause delays in processing state reports.
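A rough sketch of mitigation 1, assuming python-neutronclient's list_agents() call; the agent type strings, timeout, and client construction are illustrative:

    import time

    from neutronclient.v2_0 import client as neutron_client


    def wait_for_agents(neutron, required=('Open vSwitch agent', 'DHCP agent'),
                        timeout=120):
        # Block until every required neutron agent type reports as alive.
        deadline = time.time() + timeout
        while time.time() < deadline:
            alive = {a['agent_type']
                     for a in neutron.list_agents()['agents'] if a['alive']}
            if all(agent_type in alive for agent_type in required):
                return
            time.sleep(5)
        raise RuntimeError("neutron agents did not become active in time")


    # e.g. wait_for_agents(neutron_client.Client(username='admin',
    #                                            password='secret',
    #                                            tenant_name='admin',
    #                                            auth_url=KEYSTONE_URL))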

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

afazekas: I agree it would be good to extract relevant network config info from the console log.
I talked about this recently with sdague.

I am now querying the console-log as follows:

nova console-log <instance-id> | grep -P "(if-info|ip-route)"

and

nova console-log <instance-id> | grep -i -A2 "sending discover"

This shows the current network configuration, as well as whether the DHCP discover was sent and how many times.

Ideally we would like to have timestamps for these events, but perhaps this is asking too much.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

> Indeed the neutron server log (excerpt:
> http://paste.openstack.org/show/55918/) reveals the first state report
> from the OVS agent is received 128 milliseconds after the
> create_dhcp_port message from the DHCP Agent.

I wonder how that could happen. The failed test is not the first in the tempest suite and not the first of the network tests.
How is it possible that the state report in this piece of the logs is the first?

Also, an interesting (and obvious) observation is that most of the failures are happening with -pg jobs.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

The test is definitely not the first, but the network was created when the test run began, as the job is not isolated.
If you look at the logs, the create_dhcp_port is actually not even triggered by a network_create_end message but during synchronization, which is also interesting as it indicates a non-negligible lag between neutron-server and neutron-dhcp-agent startup.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Eugene,

I looked back at your comment as I too found it weird that a tempest network was created so early.
That network is indeed the 'private' devstack network, which is used throughout the non-isolated jobs.

It seems the network (and the subnet) are created before the q-agt service is started, and this could lead to the situation described above. This failure mode would affect only non-isolated jobs. But if we look at [1], non-isolated jobs currently account for 70% of the total failures for bug 1253896. I do not have a way to say how many of these failures currently occur for this issue, but I guess that it is a significant percentage of them.

Perhaps this could be easily solved by rearranging how devstack creates the network and starts the neutron services.

[1] http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiU1NIVGltZW91dDogQ29ubmVjdGlvbiB0byB0aGVcIiBBTkQgbWVzc2FnZTpcInZpYSBTU0ggdGltZWQgb3V0LlwiIEFORCBmaWxlbmFtZTpcImNvbnNvbGUuaHRtbFwiXG4iLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsIm9mZnNldCI6MCwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwibW9kZSI6InNjb3JlIiwiYW5hbHl6ZV9maWVsZCI6ImJ1aWxkX25hbWUifQ==

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I made a devstack change here: https://review.openstack.org/#/c/63641/

I am running rechecks to see whether the neutron job fails for this bug.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

9 hours after the change merged, we have a few occurrences of this bug, which I think will stay with us until the end of days.
Perhaps the problem is that I am the root cause of the bug, I don't know...

Seriously: 10 occurrences in 9 hours after the patch merged.
However:
* 5 failures are in the same job, an experimental one, started before the patch merged. The job is probably even parallelized, which adds new failure modes for this bug that we're addressing with other patches.
* 4 failures are in non-neutron jobs; I have looked at the logs, but have no clue about the root cause.
* 1 failure is "genuine". The error is the usual one, the VM not getting DHCP, but the fault this time is different. [1] says the port is added incorrectly, or at least too early, because at the next command it fails to find the VIF on the integration bridge. Note that the VIF was instead found a few seconds before, as detected by the ovsdb monitor. It's hard to understand what happened here. It might have been a KVM issue, an OVS issue, or for some reason nova unplugged the VIF (unlikely). But for this failure we should rule out buggy behaviour from neutron.

So things are not bad so far. I will keep looking at each failure, and put forward a proposal to start filing different bugs with logstash queries which look for faults and errors rather than failures. This way it would be easier to understand how we are making progress and to have multiple people looking at this kind of issue.

[1] http://logs.openstack.org/21/60721/2/gate/gate-tempest-dsvm-neutron/353705a/logs/screen-q-agt.txt.gz#_2013-12-24_05_51_45_093

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Over 50% of hits for this bug are related to experimental jobs.
They enable parallel tests. The current trunk code is prone to timeouts under load and this is already being addressed.

To get a more realistic estimate of the impact of this bug, I think we should be able, at least, to exclude experimental jobs from the logstash query for elastic-recheck.

Revision history for this message
Anita Kuno (anteaya) wrote :
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

In the last 24 hours the failure has been spotted 35 times in non-experimental jobs.
Only 3 experimental failures detected.

This points to a potential regression.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I've run the queries [1] and [2] for a 24-hour period each. Query 1 is for Dec 23 (GMT time zone), Query 2 is for Jan 2 (GMT time zone).
Experimental builds are excluded.

Query 1 showed 145 failures (elastic recheck says 154 because of 9 experimental jobs - ratio 94%) of which
126 in non-isolated jobs
8 in isolated jobs
11 in 'full jobs' - which are non-voting and run parallel tests. With the current code we expect this kind of failure in parallel tests

1471 non-isolated jobs were run in this interval with a failure rate of 8.56%, which is rather bad.
290 isolated jobs were run in this interval with a failure rate of 2.75%, which is barely OK.

Query 2 showed 29 failures (elastic recheck says 38 because of 9 experimental jobs - ratio 76%) of which:
0 in non-isolated jobs
4 in non-neutron jobs
5 in 'full jobs' - which are non-voting and run parallel tests. With the current code we expect this kind of failure in parallel tests
14 in grenade jobs. It is unclear at the moment whether grenade is running parallel tests for neutron, but it seems this job is non-voting as well
6 in isolated jobs. These will lead to a jenkins failure. The last patch we merged (on Christmas eve?) was only for non-isolated jobs.

98 isolated jobs were run in this interval with a failure rate of 6.12%, which is still unfortunately far from acceptable, and shows there has probably been a regression.
632 non-isolated jobs were run in this interval with a failure rate of 0%, which is more acceptable, and shows that the last devstack patch merged cleared timeout messages for non-isolated jobs.

ACTION ITEMS:
1 - investigate the grenade failures. Are they running parallel tests? If not, the root cause should be tracked down and fixed (it might either be flakiness in the test or an underlying problem in neutron).
2 - investigate rise in failures with isolated jobs. Failure rate increased from 2.75% to 6.12% in the two intervals considered.

[1] http://logstash.openstack.org/#eyJmaWVsZHMiOlsiYnVpbGRfcXVldWUiXSwic2VhcmNoIjoibWVzc2FnZTpcIlNTSFRpbWVvdXQ6IENvbm5lY3Rpb24gdG8gdGhlXCIgQU5EIG1lc3NhZ2U6XCJ2aWEgU1NIIHRpbWVkIG91dC5cIiBBTkQgZmlsZW5hbWU6XCJjb25zb2xlLmh0bWxcIiBBTkQgTk9UIGJ1aWxkX3F1ZXVlOlwiZXhwZXJpbWVudGFsXCIiLCJ0aW1lZnJhbWUiOiJjdXN0b20iLCJncmFwaG1vZGUiOiJjb3VudCIsIm9mZnNldCI6MCwidGltZSI6eyJmcm9tIjoiMjAxMy0xMi0yM1QwMDowMDowMCswMDowMCIsInRvIjoiMjAxMy0xMi0yNFQwMDowMDowMCswMDowMCIsInVzZXJfaW50ZXJ2YWwiOiIwIn0sIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIiwic3RhbXAiOjEzODg3MzU5OTUyNTl9
[2] http://logstash.openstack.org/#eyJmaWVsZHMiOlsiYnVpbGRfcXVldWUiXSwic2VhcmNoIjoibWVzc2FnZTpcIlNTSFRpbWVvdXQ6IENvbm5lY3Rpb24gdG8gdGhlXCIgQU5EIG1lc3NhZ2U6XCJ2aWEgU1NIIHRpbWVkIG91dC5cIiBBTkQgZmlsZW5hbWU6XCJjb25zb2xlLmh0bWxcIiBBTkQgTk9UIGJ1aWxkX3F1ZXVlOlwiZXhwZXJpbWVudGFsXCIiLCJ0aW1lZnJhbWUiOiJjdXN0b20iLCJncmFwaG1vZGUiOiJjb3VudCIsIm9mZnNldCI6MCwidGltZSI6eyJmcm9tIjoiMjAxNC0wMS0wMlQwMDowMDowMCswMDowMCIsInRvIjoiMjAxNC0wMS0wM1QwMDowMDowMCswMDowMCIsInVzZXJfaW50ZXJ2YWwiOiIwIn0sIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIiwic3RhbXAiOjEzODg3MzYxMzg4Mzl9

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I probably have one more hint on Neutron failures.
When neutron isolated jobs fail, there is sometimes an error in adding a VIF. This error is somewhat hidden, at least in the logs [1].

As a result, the VIF is not wired and SSH connection fails even if the floating IP is correctly wired.
The fault lies in the regex parsing of 'ovs-vsctl find Interface', which expects the ofport to always be populated; the ofport is instead set asynchronously by vswitchd. Instead of returning an integer, 'ovs-vsctl find Interface' returns '[]' for ofport, thus causing the parsing regex search to fail.

This has been seen in 3 out of 3 failure logs analysed ([1], [2], and [3])

A possible solution would be to make the regex more robust (see the sketch after the links below); however, it is paramount to find a reliable way to deal with ports for which the ofport has not yet been set.

[1] http://logs.openstack.org/59/53459/4/experimental/check-tempest-dsvm-neutron-isolated/e0aefb5/logs/screen-q-agt.txt.gz#_2014-01-06_22_51_40_376
[2] http://logs.openstack.org/39/64139/6/check/check-tempest-dsvm-neutron-isolated/2f967f8/console.html
[3] http://logs.openstack.org/17/57517/7/check/check-tempest-dsvm-neutron-isolated/11505f4/console.html
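An illustrative sketch of the parsing issue, not the actual agent code: treat an ofport value of '[]' as "not assigned yet" instead of letting a strict integer regex fail.

    def parse_ofport(raw_value):
        # raw_value is the ofport column from 'ovs-vsctl find Interface ...'.
        value = raw_value.strip()
        if value in ('[]', ''):
            return None  # ofport not yet assigned by vswitchd; retry later
        return int(value)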

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/65838

Changed in neutron:
assignee: Mark McClain (markmcclain) → Salvatore Orlando (salvatore-orlando)
Changed in nova:
milestone: icehouse-2 → icehouse-3
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-2 → icehouse-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/68899

Changed in tempest:
assignee: nobody → Darragh O'Reilly (darragh-oreilly)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/68899
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=6b636677c2a97a363cc7f1517ad10b4b9fd20a5c
Submitter: Jenkins
Branch: master

commit 6b636677c2a97a363cc7f1517ad10b4b9fd20a5c
Author: Darragh O'Reilly <email address hidden>
Date: Fri Jan 24 12:17:40 2014 +0000

    Increase ping timeout from 60 to 120 seconds

    The Cirros image only sends out three DHCPDISCOVERs with 60 second
    waits between. The ovs_neutron_agent may not have the port wired in
    time for the first one, and with ping_timeout=60 the test will be
    terminated before the second one is sent. This patch increases the
    timeout to 120 seconds to prevent that.

    Change-Id: I03fcec50b91fd89ad64e16fce82f9211c2e086a1
    Partial-Bug: 1253896

Revision history for this message
Valerii Zhelezniakov (vzhelezniakov) wrote :

Hi,

I investigated this issue and found one more place where the timeout should be increased.
It is the hardcoded constant 'self.banner_timeout = 15' in /usr/local/lib/python2.7/dist-packages/paramiko/transport.py.
I propose increasing this value or making it configurable.

Description:
As I see in the logs at http://logs.openstack.org/84/58284/2/check/check-tempest-devstack-vm-full/a2ce10a/testr_results.html.gz,
the message 'Error reading SSH protocol banner' indicates the main problem.
If we look at the sources of 'paramiko' we can see that, before starting the data exchange, the worker thread checks the banner.

Banner checking is performed by writing data to the packetizer and then reading the answer from it.

 /usr/local/lib/python2.7/dist-packages/paramiko/transport.py
1530 def run(self):
                ...
                self.packetizer.write_all(self.local_version + '\r\n')
                self._check_banner()
                ...

where 'packetizer' is the connection provider, as I understand it.
1669 def _check_banner(self):
         ...
         timeout = self.banner_timeout
         ...
         try:
             buf = self.packetizer.readline(timeout)
         except ProxyCommandFailure:
             raise
         except Exception, x:
             raise SSHException('Error reading SSH protocol banner' + str(x))

As we can see, the initial value for the timeout is 15 seconds, which is too low for tempest.
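A minimal sketch of overriding that timeout without patching paramiko, assuming the banner_timeout attribute shown in the quoted transport.py; the address is just a placeholder taken from the log above:

    import socket

    import paramiko

    sock = socket.create_connection(('172.24.4.227', 22), timeout=30)
    transport = paramiko.Transport(sock)
    transport.banner_timeout = 60   # override the hard-coded 15 second default
    transport.start_client()        # raises SSHException on banner/kex failure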

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/66201
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=c3128c085c2635d82c4909d1be5d016df4978632
Submitter: Jenkins
Branch: master

commit c3128c085c2635d82c4909d1be5d016df4978632
Author: Gary Kotton <email address hidden>
Date: Sun Jan 12 06:59:45 2014 -0800

    Add log info for tempest SSH connection issues

    The patch does the following:
    1. Add in extra information
    2. Logs current attempts
    3. Waits a little longer prior to the next SSH attempt

    Related-bug: #1253896

    Change-Id: I878362899b4cf81143c02c15c55936b33cf65ec8

Revision history for this message
Matt Riedemann (mriedem) wrote :

This might be a stab in the dark, but I'm wondering if this patch maybe has something to do with this bug and possibly bug 1265495:

https://review.openstack.org/#/c/56174/

That merged around the time this bug was reported, but more interestingly when I backported that back to stable/havana in a topic branch:

https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/havana+topic:backport-neutron-auth-fixes,n,z

I'm getting continual fails on the backport:

https://review.openstack.org/#/c/70474/

Against bug 1265495.

The change seems trivial enough, but when I grep the nova source tree (excluding tests) for "neutronv2.get_client" there are 42 matches. There are only 4 cases where admin=True is passed in the call, which are in these nova.network.neutronv2.api methods:

allocate_for_instance // setting port bindings
_refresh_neutron_extensions_cache
_build_network_info_model

I'm not sure how to tell if the other calls to get the client should be using admin credentials or not, and if that somehow causes a race when working with networks depending on the context being used, but it seems fishy.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Matt,

I think the rationale behind change #56174 was to avoid continuous round trips to keystone, but I'm not sure.
Still, you are seeing an interesting relationship with bug 1265495.
Also, I am looking at the spike in occurrences of bug 1253896, and I'm not sure the two events are unrelated.

I will be able to provide more info later, hopefully.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

79 of the 93 failures observed in the last 24 hours happen in the newly merged load balancing scenario: http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJmaWxlbmFtZTpcImNvbnNvbGUuaHRtbFwiIEFORCBtZXNzYWdlOlwidGVtcGVzdC5zY2VuYXJpby50ZXN0X2xvYWRfYmFsYW5jZXJfYmFzaWMuVGVzdExvYWRCYWxhbmNlckJhc2ljLnRlc3RfbG9hZF9iYWxhbmNlcl9iYXNpY1tjb21wdXRlLGdhdGUsbmV0d29yayxzbW9rZV0gLi4uIEZBSUxcIiIsInRpbWVmcmFtZSI6Ijg2NDAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJvZmZzZXQiOjAsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkxNzcxNDg5Mzc0fQ==

The failure occurs while trying to connect to a worker VM to start a fake web server with netcat.
This time the connection happens over the private IP, so the L3 agent does not come into the picture. The port appears to be correctly wired in all analyzed failures. Therefore we need to look at potential security group/DHCP issues.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

A few observations about neutron failures.

3 distinct failure modes have been observed. They are all related to the fact that we've completely turned off injection and are now relying on metadata, which of course is not working as well as it should. (*)

Failure mode 1 -> q-meta crashes at startup: http://logs.openstack.org/92/69792/10/check/check-tempest-dsvm-neutron-isolated/82e1716/logs/screen-q-meta.txt.gz
It seems a rather easy fix.
Failure mode 2 -> failure in nova api while retrieving metadata: http://logs.openstack.org/78/70178/5/check/check-tempest-dsvm-neutron-isolated/f65aaa8/logs/screen-n-api.txt.gz#_2014-02-05_16_22_12_713
This needs some investigation and appears to be the most common failure mode.
Failure mode 3 -> q-vpn crashes at startup. Not yet sure why the lbaas agent is called into play.

I will also add instance console logging to the load balancing scenario to verify whether there is an additional failure mode.

(*) Personally, I was expecting this.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Scratch failure mode #3, as it comes from a patch which introduces this unreasonable behaviour. The patch is WIP now, otherwise it would have been a -2 IMO until the relevant blueprint was discussed and approved.

Failure mode #1 also happened only 4 times in the past week, so it is not really a priority.

Failure mode #2 shows up often in logstash because it occurs multiple time in the same run. But in the past week it affected only 5 build changes. http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiRmF1bHRXcmFwcGVyOiAnTm9uZVR5cGUnIG9iamVjdCBoYXMgbm8gYXR0cmlidXRlICdpbnN0YW5jZSdcIiIsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50Iiwib2Zmc2V0IjowLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MTc4MDc3MTY1NCwibW9kZSI6InNjb3JlIiwiYW5hbHl6ZV9maWVsZCI6ImJ1aWxkX2NoYW5nZSJ9

Summarizing, I am no longer sure the previously extracted information has any value, and I will focus again on neutron issues which might have been uncovered by:
https://review.openstack.org/#/c/53459/
https://review.openstack.org/#/c/58697/

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I spoke with Eugene over IRC and he confirmed that the load balancing scenario is triggering the failures (https://review.openstack.org/#/c/58697/)

This happens because the test assumes that instances are reachable on the private network, which does not hold for isolated tests. It seems that at the moment we need to skip this test.

Joe Gordon (jogo)
tags: added: testing
Revision history for this message
Valerii Zhelezniakov (vzhelezniakov) wrote :

Hi Salvatore,

Could you share the results of the unit tests where this bug is reproduced? I mean logs like http://logs.openstack.org/84/58284/2/check/check-tempest-devstack-vm-full/a2ce10a/testr_results.html.gz or maybe other logs.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/65838
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2702baed390d094b0eac07d0ae167ed236868d00
Submitter: Jenkins
Branch: master

commit 2702baed390d094b0eac07d0ae167ed236868d00
Author: Salvatore Orlando <email address hidden>
Date: Mon Jan 13 12:51:03 2014 -0800

    Avoid processing ports which are not yet ready

    This patch changes get_vif_port_set in order to not return
    OVS ports for which the ofport is not yet assigned, thus avoiding
    a regex match failure in get_vif_port_by_id.
    Because of this failure, treat_vif_port is unable to wire
    the port.
    As get_vif_port_by_id is also used elsewhere in the agent, it has
    been enhanced in order to tolerate situations in which ofport might
    have not yet been assigned.

    The ofport field is added to the list of those monitored by the
    SimpleInterfaceMonitor. This will guarantee an event is generated
    when the ofport is assigned to a port. Otherwise there is a risk
    a port would be never processed if it was not yet ready the first
    time is was detected. This change won't trigger any extra processing
    on the agent side.

    Finally, this patch avoids fetching device details from the plugin
    for ports which have disappeared from the OVS bridge. This is a
    little optimization which might be beneficial for short lived ports.

    Change-Id: Icf7f0c7d6fe5239a358567cc9dc9db8ec11c15be
    Partial-Bug: #1253896

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

The bug's criticality has now decreased a little, since it is no longer ranked #1.
This bug is, however, still on the podium, sitting at #3.

It is perhaps a good idea to have a detailed look at the failures to understand whether this bug is still actually critical, and what we should do about it.

[1] is a score-by-branch for the past 24 hours.
About 20% of the failures occur on stable/havana. Considering that jobs targeting stable/havana are probably less than 20% of overall jobs, this means we've probably improved the resiliency to this bug for icehouse. However, the stable/havana failures appear to all be related to neutron, and the reason is likely that the improvements for neutron have not been backported. Frankly, I'm not sure they could be considered backportable at all, as there are some large patches.

[2] is a score-by-job for master branch in the past 24 hours.

check-tempest-dvsm-heat-neutron-slow, currently non-voting, accounts for over 50% of the failures. Without this job, there would have been "only" 25 failures on the master branch in the past 24 hours. I have not yet submitted a tempest bug for this job, but the problem appears to be that it is trying to SSH to an instance over a private network, which I don't think can work with neutron.

There are 16 failures with nova-network enabled (a little less than 30% of master failures). 7 of them however occur on grenade jobs. I have not yet looked into them. I hope someone from the nova team can help on this matter.

There are also 9 neutron failures. 6 of them occurred in the same job, and the root cause is that the metadata service did not start. Currently devstack does not return an error when some neutron agents fail to start; bug 128182 has been filed for this.

For the remaining 3 neutron failures, 2 are caused by a regex parsing error. This problem is being tracked by bug 1280827, for which there is a patch under review.

The third failure is being investigated.

[1] http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiU1NIVGltZW91dDogQ29ubmVjdGlvbiB0byB0aGVcIiBBTkQgbWVzc2FnZTpcInZpYSBTU0ggdGltZWQgb3V0LlwiIEFORCBmaWxlbmFtZTpcImNvbnNvbGUuaHRtbFwiIiwidGltZWZyYW1lIjoiODY0MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsIm9mZnNldCI6MCwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTI2NTQ5Mjk4NzIsIm1vZGUiOiJzY29yZSIsImFuYWx5emVfZmllbGQiOiJidWlsZF9icmFuY2gifQ==

[2] http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiU1NIVGltZW91dDogQ29ubmVjdGlvbiB0byB0aGVcIiBBTkQgbWVzc2FnZTpcInZpYSBTU0ggdGltZWQgb3V0LlwiIEFORCBmaWxlbmFtZTpcImNvbnNvbGUuaHRtbFwiIEFORCBidWlsZF9icmFuY2g6XCJtYXN0ZXJcIiIsInRpbWVmcmFtZSI6Ijg2NDAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJvZmZzZXQiOjAsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkyNjU0OTI5ODcyLCJtb2RlIjoic2NvcmUiLCJhbmFseXplX2ZpZWxkIjoiYnVpbGRfbmFtZSJ9

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

The bug concerning the metadata agent is actually bug 1281182.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/74218

Revision history for this message
Alan Pevec (apevec) wrote :

> However stable/havana failures appear to be all related to the neutron, and the reason has to be searched in the fact that the improvements for neutron have not been backported. Frankly I'm not sure they could be considered backportable at all as there are some large patches.

To make it clear, I'll mark Neutron/Havana Won't Fix and add Tempest/Havana task.

So what do we do now with stable/havana gate jobs? Which tests or whole jobs should be disabled for Havana?

Revision history for this message
Matt Riedemann (mriedem) wrote :

@Alan - here is probably one worth backporting to stable/havana for Tempest: https://review.openstack.org/#/c/74504/

Changed in nova:
importance: Critical → High
milestone: icehouse-3 → none
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

More updates on Neutron

Hits in the past 24 hours: 14
Jobs affected: 10 (some jobs hit the failure multiple times, in this case there is a common root, like metadata not working)

The count excludes the always-failing heat job (not a neutron issue), and includes the newly introduced full job.

Compared with 10 days ago:
Hits in 24 hours: 14 down from 29 (~52% reduction)
Excluding the full job it would be only 6 failures (~80% reduction)

Changed in neutron:
milestone: icehouse-3 → icehouse-rc1
Revision history for this message
Mark McClain (markmcclain) wrote :

This is no longer showing up in tests in the gate where Neutron is enabled.

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
Mark McLoughlin (markmc) wrote :

AIUI, the large-ops job failures which elastic-recheck classified as this bug in the last 24 hours should be fixed by https://review.openstack.org/84319

It looks like https://review.openstack.org/80643 caused the problem

Revision history for this message
Mark McLoughlin (markmc) wrote :

Yeah, all the largeops failures had:

  Running tempest smoke tests

and now they're back to:

  Running large ops tests

Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-rc1 → 2014.1
Revision history for this message
Joe Gordon (jogo) wrote :
Revision history for this message
Joe Gordon (jogo) wrote :

And for Neutron only, re-opening for neutron

Revision history for this message
Kyle Mestery (mestery) wrote :

If you look at this patch here [1], you can see that for patch set #2 both check-tempest-dvsm-full and check-grenade-dvsm-neutron fail. In the former case, nova networking is being used. In the latter, neutron is being used. The failure is the same: SSH to the guest is failing. This seems to imply an issue with the tests or something which crosses both nova-network and neutron.

[1] https://review.openstack.org/#/c/91887/

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/92018

Revision history for this message
Joe Gordon (jogo) wrote :
Revision history for this message
Kyle Mestery (mestery) wrote :

Ack @jogo, Neutron hits are much too high on this one.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I think patch #92018 nails the issue.
However, I don't want this bug to be reopened.
Otherwise we'll get into the habit of reopening it every time the network does not work.

If you agree, I would even remove it from the e-r query and file a new bug report for the malfunction which patch 92018 is fixing.

Revision history for this message
Kyle Mestery (mestery) wrote :

I agree with salv-orlando's comment in #71. Let's file a new bug, update the e-r query, and retarget patch #92018 at the new bug.

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/92018
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=45381fe1c742c75773d97f1c0bd1f3cb1e7a6468
Submitter: Jenkins
Branch: master

commit 45381fe1c742c75773d97f1c0bd1f3cb1e7a6468
Author: Attila Fazekas <email address hidden>
Date: Sun May 4 19:54:37 2014 +0200

    L3 RPC loop could delete a router on concurrent update

    routers_updated does not acquire any lock just updates
    a set for future rpc loop processing.

    The self.updated_routers can be changed by concurrent update
    notification. If this change happens at the time around the
    self.plugin_rpc.get_routers call, the additional routers
    - by mistake - is considered as admin_state_up=false routers, which
     are safe to delete.

    Creating a local copy of the updated_routers and preserve
    the fresh updated_routers entries for the next _rpc_loop
    operations.

    Change-Id: Icc7377f9c29e248c3b34562465e859b15ecc2ec3
    Closes-Bug: #1315467
    Partial-Bug: #1253896

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/92630

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/icehouse)

Reviewed: https://review.openstack.org/92630
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=74a9365edd9fb2206c52d8ee9fa781b772d1738d
Submitter: Jenkins
Branch: stable/icehouse

commit 74a9365edd9fb2206c52d8ee9fa781b772d1738d
Author: Attila Fazekas <email address hidden>
Date: Sun May 4 19:54:37 2014 +0200

    L3 RPC loop could delete a router on concurrent update

    routers_updated does not acquire any lock just updates
    a set for future rpc loop processing.

    The self.updated_routers can be changed by concurrent update
    notification. If this change happens at the time around the
    self.plugin_rpc.get_routers call, the additional routers
    - by mistake - is considered as admin_state_up=false routers, which
     are safe to delete.

    Creating a local copy of the updated_routers and preserve
    the fresh updated_routers entries for the next _rpc_loop
    operations.

    Change-Id: Icc7377f9c29e248c3b34562465e859b15ecc2ec3
    Closes-Bug: #1315467
    Partial-Bug: #1253896
    (cherry picked from commit 45381fe1c742c75773d97f1c0bd1f3cb1e7a6468)

tags: added: in-stable-icehouse
Alan Pevec (apevec)
tags: removed: in-stable-icehouse
Revision history for this message
Takahiro Shida (shida) wrote :

I tried to reproduce it, and I succeeded.
I can also avoid this problem by modifying the ssh_timeout (196 -> 300).
So is this not a bug in the code?

Revision history for this message
Henry Gessau (gessau) wrote :

A five minute timeout seems excessive, but we may need that for reducing the gate failures until we can figure out why the servers take so long to respond.

Here is another occurrence that happened when there were only a couple of other jobs in the gate:
http://logs.openstack.org/17/143017/5/check/check-tempest-dsvm-neutron-full/003dcbe/console.html

no longer affects: tempest/havana
Sean Dague (sdague)
Changed in nova:
status: Triaged → Confirmed
Revision history for this message
Attila Fazekas (afazekas) wrote :

Setting the tempest status to incomplete, since we do not know of an actionable item on the tempest side.

Changed in tempest:
status: Confirmed → Incomplete
Tom Fifield (fifieldt)
Changed in tempest:
milestone: havana-stable → none
Revision history for this message
Sean Dague (sdague) wrote :

https://review.openstack.org/#/c/273042/ seems to have largely fixed this from the Nova side. The final major issue was reuse of not fully deallocated fixed IPs.

Changed in nova:
assignee: nobody → Sean Dague (sdague)
status: Confirmed → Fix Released
Changed in tempest:
status: Incomplete → Invalid