DVR functional tests failing intermittently

Bug #1521815 reported by Assaf Muller
This bug affects 2 people
Affects: neutron
Status: Expired
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Some console logs:

http://logs.openstack.org/18/248418/3/check/gate-neutron-dsvm-functional/8a6dfcf/console.html
http://logs.openstack.org/00/189500/23/check/gate-neutron-dsvm-functional/d949ce0/console.html
http://logs.openstack.org/72/252072/1/check/gate-neutron-dsvm-functional/aafcd9a/console.html
http://logs.openstack.org/32/192032/26/check/gate-neutron-dsvm-functional/b267f83/console.html
http://logs.openstack.org/02/251502/3/check/gate-neutron-dsvm-functional/b074a96/console.html

Tests seen failing so far (may not be comprehensive):
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_add_fips_on_restarted_agent
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_lifecycle_without_ha_with_snat_with_fips
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_lifecycle_ha_with_snat_with_fips
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_lifecycle_without_ha_without_snat_with_fips

The commonality is:
1) DVR tests (With and without HA, with and without SNAT)
2) The tests are taking thousands of seconds to fail, causing the job to time out, even though we're supposed to have a per-test timeout of 180 seconds defined in tox.ini. This means (I suspect) that we're not getting the functional test logs. (See the sketch below for how that per-test timeout is typically wired up.)
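
A minimal sketch of how such a per-test timeout is usually enforced in OpenStack test suites, assuming the common oslotest/fixtures convention of an OS_TEST_TIMEOUT environment variable exported from tox.ini; this is illustrative, not quoted from neutron's actual base test class:

import os

import fixtures
import testtools


class TimedTestCase(testtools.TestCase):
    def setUp(self):
        super(TimedTestCase, self).setUp()
        # tox.ini would export something like OS_TEST_TIMEOUT=180 for the
        # functional env; 0 or an unparsable value disables the timeout.
        try:
            timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
        except ValueError:
            timeout = 0
        if timeout > 0:
            # gentle=True makes the fixture raise TimeoutException inside the
            # test instead of killing the worker, so the per-test failure and
            # its logs should still be captured.
            self.useFixture(fixtures.Timeout(timeout, gentle=True))

If the timeout fixture never fires (or the test is not torn down cleanly when it does), the job-level timeout kills the whole run instead, which would explain the missing per-test logs.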

Revision history for this message
Assaf Muller (amuller) wrote :

I'm getting FIP namespace leftovers; I don't know if this is related.
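
For reference, a hypothetical way to check for those leftovers after a run (the 'fip-' prefix is the naming convention for DVR floating-IP agent namespaces; the helper itself is illustrative, not part of the test suite):

import subprocess


def leftover_fip_namespaces():
    # 'ip netns list' prints one namespace per line, e.g. 'fip-<ext-net-id>',
    # possibly followed by '(id: N)' on newer iproute2 versions.
    out = subprocess.check_output(['ip', 'netns', 'list']).decode()
    return [line.split()[0] for line in out.splitlines()
            if line.startswith('fip-')]


if __name__ == '__main__':
    for ns in leftover_fip_namespaces():
        print('leftover namespace: %s' % ns)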

Doug Wiegley (dougwig)
Changed in neutron:
status: New → Confirmed
assignee: nobody → Ryan Moats (rmoats)
milestone: none → mitaka-2
tags: added: needs-attention
Revision history for this message
Ryan Moats (rmoats) wrote :

This is looking like a regression:

(1) The first link shows that neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_rem_fips_on_restarted_agent passed, but took 5365.125131 seconds to do so.
(2) A logstash query for failures of neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_lifecycle_without_ha_with_snat_with_fips
in the functional job shows 19 failures over the past 7 days. At 12/1 16:51:45 there was a failure that took 12.8 seconds, while at 12/1 18:27:59 UTC there was a failure that took 3700 seconds. Logstash queries for the other failures cited confirm this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/252589

Ryan Moats (rmoats)
Changed in neutron:
assignee: Ryan Moats (rmoats) → Swaminathan Vasudevan (swaminathan-vasudevan)
Revision history for this message
Ryan Moats (rmoats) wrote :

assigned to Swami as a first step (although obviously others are looking at it)

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

I am not able to see that long timeout anymore in my setup.

Just running the DVR tests alone does not cause any issues, and the runtimes seem reasonable so far from my observation.

Test id Runtime (s)
--------------------------------------------------------------------------------------------------------------------------- -----------
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_ha_router_failover 38.055
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_lifecycle_ha_with_snat_with_fips 25.563
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_fips_for_multiple_ext_networks 22.574
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_rem_fips_on_restarted_agent 18.662
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_snat_namespace_with_interface_remove 14.792
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_add_fips_on_restarted_agent 14.539
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_lifecycle_without_ha_with_snat_with_fips 13.854
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_fip_late_binding 10.959
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_lifecycle_without_ha_without_snat_with_fips 10.009
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_fip_agent_mismatch 8.463
___________________________________ summary ____________________________________
  dsvm-functional: commands succeeded
  congratulations :)

Revision history for this message
Assaf Muller (amuller) wrote :

I wasn't able to reproduce this locally either (Ran all functional tests with 8 cores).

Revision history for this message
Doug Wiegley (dougwig) wrote :

Here is a timeout from this morning. Related? https://review.openstack.org/#/c/252686/

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Thanks Doug. The issue here is that we are not able to get any logs out of the failures from the gate at this point, and reproducing the same failure downstream is very difficult.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

See this console output from a test run that passed Jenkins. The failures in it are from dsvm-functional-py34, but they might be related to the bug we are chasing in the non-py34 job: the log explicitly states that some tests are timing out after 60 seconds. #link http://logs.openstack.org/77/252677/2/check/gate-neutron-dsvm-functional-py34/d47bb78/console.html

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Assaf Muller (<email address hidden>) on branch: master
Review: https://review.openstack.org/252589
Reason: OK, it does in the general case, just not for those specific DVR functional tests...

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This started showing up in stable/liberty too, and looking at what merged in Liberty I can't think of anything that went into stable that could have caused the spike, so I am tempted to rule out a Neutron regression. [1] did merge in both master and stable in a short span, and unless something crazy happened, I can't see why it would be related.

[1] https://review.openstack.org/#/c/248024/

Revision history for this message
Assaf Muller (amuller) wrote :

I wanted to see if a dependency changed around the 1st of December.

pip freeze from recent job failure:
http://logs.openstack.org/81/230481/13/check/gate-neutron-dsvm-functional/d1a4d77/logs/pip-freeze.txt.gz

pip freeze from a success before the 1st of December (24th of November):
http://logs.openstack.org/32/249232/1/check/gate-neutron-dsvm-functional/4d8262e/logs/pip-freeze.txt.gz

The diff is:
old:
py==1.4.30
new:
py==1.4.31
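
A small illustrative helper for the kind of comparison done here, parsing two pip-freeze listings and reporting changed pins (the file names in the usage comment are placeholders):

def parse_freeze(text):
    # Lines look like 'name==version'; ignore anything else (comments, VCS pins).
    pins = {}
    for line in text.splitlines():
        if '==' in line:
            name, version = line.strip().split('==', 1)
            pins[name] = version
    return pins


def diff_freeze(old_text, new_text):
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    return {name: (old.get(name), new.get(name))
            for name in sorted(set(old) | set(new))
            if old.get(name) != new.get(name)}


# Example: diff_freeze(open('freeze-old.txt').read(), open('freeze-new.txt').read())
# would report {'py': ('1.4.30', '1.4.31')} for the two runs linked above.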

dpkg -l from recent job failure:
http://logs.openstack.org/81/230481/13/check/gate-neutron-dsvm-functional/d1a4d77/logs/dpkg-l.txt.gz

And from old successful run:
http://logs.openstack.org/32/249232/1/check/gate-neutron-dsvm-functional/4d8262e/logs/dpkg-l.txt.gz

[amuller@dhcp-1-3 ~]$ diff dpkg-l.txt.gz dpkg-l.txt.gz.1
76,77c76,77
< ii dpkg 1.17.5ubuntu5.5 amd64 Debian package management system
< ii dpkg-dev 1.17.5ubuntu5.5 all Debian package development tools
---
> ii dpkg 1.17.5ubuntu5.4 amd64 Debian package management system
> ii dpkg-dev 1.17.5ubuntu5.4 all Debian package development tools
309c309
< ii libdpkg-perl 1.17.5ubuntu5.5 all Dpkg perl modules
---
> ii libdpkg-perl 1.17.5ubuntu5.4 all Dpkg perl modules
365,366c365,366
< ii libgnutls-openssl27:amd64 2.12.23-12ubuntu2.3 amd64 GNU TLS library - OpenSSL wrapper
< ii libgnutls26:amd64 2.12.23-12ubuntu2.3 amd64 GNU TLS library - runtime library
---
> ii libgnutls-openssl27:amd64 2.12.23-12ubuntu2.2 amd64 GNU TLS library - OpenSSL wrapper
> ii libgnutls26:amd64 2.12.23-12ubuntu2.2 amd64 GNU TLS library - runtime library
698c698
< ii linux-generic 3.13.0.70.76 amd64 Complete Generic Linux kernel and headers
---
> ii linux-generic 3.13.0.68.74 amd64 Complete Generic Linux kernel and headers
701,704c701,704
< ii linux-headers-3.13.0-70 3.13.0-70.113 all Header files related to Linux kernel version 3.13.0
< ii linux-headers-3.13.0-70-generic 3.13.0-70.113 amd64 Linux kernel headers for version 3.13.0 on 64 bit x86 SMP
< ii linux-headers-generic 3.13.0.70.76 amd64 Generic Linux kernel headers
< ii linux-headers-virtual 3.13.0.70.76 amd64 Transitional package.
---
> ii linux-headers-3.13.0-68 3.13.0-68.111 all Header files related to Linux kernel version 3.13.0
> ii linux-headers-3.13.0-68-generic 3.13.0-68.111 amd64 Linux kernel headers for version 3.13.0 on 64 bit x86 SMP
> ii linux-headers-generic 3.13.0.68.74 amd64 Generic Linux...


Revision history for this message
Ryan Moats (rmoats) wrote :

Checking logstash for all of those tests shows that the occurrences of long-tailed failures in those tests ended on 12/3 - not sure at this point why...

Ryan Moats (rmoats)
Changed in neutron:
importance: High → Low
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This disappeared into thin air. 0 occurrences in the last 7 days.

Changed in neutron:
status: Confirmed → Incomplete
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → nobody
Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Yes, we noticed it, and that's the reason we lowered the severity of this bug.
It is just on a watchlist for now.

Changed in neutron:
milestone: mitaka-2 → none
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired