ServerPersonalityTestJSON:test_rebuild_server_with_personality fails intermittent (linuxbridge)

Bug #1531862 reported by Andreas Scheuring
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: Critical
Assigned to: Kevin Benton
Milestone: mitaka-rc1

Bug Description

24 Hits in the last 30 days - all of them after the 31st of December! [1]

2016-01-07 11:08:30.061 | Captured traceback:
2016-01-07 11:08:30.061 | ~~~~~~~~~~~~~~~~~~~
2016-01-07 11:08:30.061 | Traceback (most recent call last):
2016-01-07 11:08:30.062 | File "tempest/api/compute/servers/test_server_personality.py", line 81, in test_rebuild_server_with_personality
2016-01-07 11:08:30.062 | waiters.wait_for_server_status(self.client, server_id, 'ACTIVE')
2016-01-07 11:08:30.062 | File "tempest/common/waiters.py", line 95, in wait_for_server_status
2016-01-07 11:08:30.062 | raise exceptions.TimeoutException(message)
2016-01-07 11:08:30.062 | tempest.exceptions.TimeoutException: Request timed out
2016-01-07 11:08:30.062 | Details: (ServerPersonalityTestJSON:test_rebuild_server_with_personality) Server f22f189f-6fa2-4d6d-9a60-89ca045c6dbf failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: REBUILD. Current task state: rebuild_spawning.

Example Log: http://logs.openstack.org/18/246318/13/check/gate-tempest-dsvm-neutron-linuxbridge/589bb9b/

[1] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22ServerPersonalityTestJSON%3Atest_rebuild_server_with_personality)%20Server%5C%22

Revision history for this message
Andreas Scheuring (andreas-scheuring) wrote :

I'm not 100% sure what causes this issue, but I think it goes like this:

- Instance gets started, rebuild is triggered
- Destroy starts
- Agent loop detects that the related tap device exists
- Instance destroyed, and its tap device with it
- Instance start triggered; nova starts listening for the network-vif-plugged event
2016-01-11 04:00:46.887 DEBUG nova.compute.manager [req-d99131b4-6ad7-4d60-8d1f-5c44075c719c tempest-ServerPersonalityTestJSON-1787158816 tempest-ServerPersonalityTestJSON-675574703] [instance: 5cc0ffad-aa65-4780-8cdc-a7c5040824fe] Preparing to wait for external event network-vif-plugged-7094cb28-f242-4e85-85a2-7da2a147e4f0 prepare_for_instance_event /opt/stack/new/nova/nova/compute/manager.py:514

- Instance created and tap device created
- Agent loop detects that the device is still there and assumes it is active - it does not realize that the device disappeared and appeared again
---> so the device-up message is missing --> the port will never become ACTIVE!
- Instance stays in paused state until it gets deleted
2016-01-11 04:03:37.974 DEBUG nova.compute.manager [req-a7841493-f6c3-485d-a9e2-720d5386dcf2 tempest-ServerPersonalityTestJSON-1787158816 tempest-ServerPersonalityTestJSON-675574703] [instance: 5cc0ffad-aa65-4780-8cdc-a7c5040824fe] Events pending at deletion: network-vif-plugged-7094cb28-f242-4e85-85a2-7da2a147e4f0 _delete_instance /opt/stack/new/nova/nova/compute/manager.py:2391

A long time after the instance got deleted, the network-vif-plugged timeout occurs.

So probably the test case did trigger the delete...
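
To illustrate the sequence above, here is a minimal Python sketch (illustrative only, not the actual linuxbridge agent code; the tap name is just derived from the port id in the logs) of why a purely name-based scan sees nothing to report when the remove and re-add both fall inside one polling interval:

    # Minimal sketch: a scan that only diffs device-name sets between
    # iterations cannot see a tap that was removed and re-created with the
    # same name inside a single polling interval.
    def scan_devices(current_taps, previous_taps):
        added = current_taps - previous_taps
        removed = previous_taps - current_taps
        return added, removed

    # Scan N: the tap of the old instance is present.
    previous = {'tap7094cb28-f2'}

    # Between scan N and scan N+1 the instance is destroyed and rebuilt,
    # so the tap disappears and reappears under the same name.

    # Scan N+1: the tap is present again.
    current = {'tap7094cb28-f2'}

    added, removed = scan_devices(current, previous)
    print(added, removed)   # set() set() -> no device-up sent, port stays DOWN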

I cannot verify this assumption from the logs 100%, as I cannot figure out the exact point in time when the tap device got created. But at least the agent never detected it as down. The timeline looks like this:

agent-loop scan: [6] 2016-01-11 04:00:41.727 (device in current + added due to resync)
libvirt: [2] 2016-01-11 04:00:45.875+0000: 16554: VM destroyed
nova log: [5] 2016-01-11 04:00:45.928 - vm destroyed

agent-loop scan: [7] 2016-01-11 04:00:47.560 Loop iteration exceeded interval (2 vs. 5.83269119263)! daemon_loop /opt/stack/
agent-loop scan: [7] 2016-01-11 04:00:47.560 (device in current)
libvirt: [1] 2016-01-11 04:00:47.731+0000: 16552: instance started (or to be started?)
nova log [4] 2016-01-11 04:00:47.957 instance started
agent-loop scan: [8] 2016-01-11 04:00:51.560 (device in current)
nova log [9] deletion triggered although still waiting for event

nova log [3] vif timeout
--> The device should have been detected as deleted by the agent somewhere in between!

[1] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/libvirt/libvirtd.txt.gz#_2016-01-11_04_00_47_731
[2] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/libvirt/libvirtd.txt.gz#_2016-01-11_04_00_45_875
[3] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_05_47_959
[4] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_0...


Revision history for this message
Andreas Scheuring (andreas-scheuring) wrote :

I think the problem in this particular case is the big resync that happened due to https://bugs.launchpad.net/neutron/+bug/1532171, which occurred in a parallel-executed test case...

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
tags: added: gate-failure
Changed in neutron:
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This seems to affect linuxbridge only.

tags: added: linuxbridge
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Andreas: I think your analysis is correct. It looks like the test builds and rebuilds the server too fast for the linuxbridge agent to detect that the tap interface has indeed gone away and been re-added. Now I am no linuxbridge agent expert, but if the tap interface looks exactly the same after a rebuild, then this is a chronic issue that was bound to happen.

We'd need to figure out a way to establish a state change on the tap interface between two runs of the daemon_loop. All things remaining equal, the only solution right now would be to shorten the daemon_loop period.
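
For context, a rough sketch of the kind of poll-and-diff loop involved (assumed structure and names, not the actual neutron common agent implementation), showing why shortening polling_interval only narrows the race window rather than closing it:

    import time

    def daemon_loop(list_tap_devices, handle_added, handle_removed,
                    polling_interval=2):
        previous = set()
        while True:
            start = time.time()
            current = set(list_tap_devices())
            for dev in current - previous:
                handle_added(dev)       # leads to device-up / port ACTIVE
            for dev in previous - current:
                handle_removed(dev)     # leads to device-down
            previous = current
            # Any remove + re-add of the same tap name that completes inside
            # this sleep window produces no set difference at all, so a
            # smaller polling_interval only makes the race less likely.
            elapsed = time.time() - start
            time.sleep(max(polling_interval - elapsed, 0))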

Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm going to mark this as incomplete for nova until we know if there is an issue that we need to address in nova for this.

summary: ServerPersonalityTestJSON:test_rebuild_server_with_personality fails
- intermittent
+ intermittent (linuxbridge)
tags: added: neutron
Changed in nova:
status: New → Confirmed
status: Confirmed → Incomplete
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Matt: we might be able to handle this from the Neutron side only. As soon as I learn more, I'll remove Nova from affected projects.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/291042

Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
status: Confirmed → In Progress
no longer affects: nova
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/291042
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7afbd3a6b896d6d1285343f31517e9a656cfa606
Submitter: Jenkins
Branch: master

commit 7afbd3a6b896d6d1285343f31517e9a656cfa606
Author: Kevin Benton <email address hidden>
Date: Wed Mar 9 23:47:35 2016 -0800

    Check tap bridge timestamps to detect local changes

    During a quick instance rebuild on Nova, it may remove a tap
    interface and then plug it in within the agent polling window.
    In this scenario the agent will not realize a device has changed
    and will therefore not ask the server for info and subsequently
    update its status. This will prevent the notification from being
    sent back to Nova that the vif plugging has finished so the
    VM will never resume state.

    This adds a new timestamp collection method to the common agent
    manager interface for devices that is used by the common agent
    loop to determine if a device has locally changed.

    The linux bridge implementation of it checks the timestamps on the
    tap interface's 'bridge' directory, which will change whenever
    the tap is added to a bridge.

    Closes-Bug: #1531862
    Change-Id: If172470e907848556b6a8aff13520f94245919bb
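
A minimal sketch of the idea behind this fix, using an assumed sysfs path and helper names rather than the exact neutron implementation: collect a per-device timestamp during each scan and treat a changed timestamp as a local change, even when the device set itself is unchanged.

    import os

    # Assumed path: a per-port sysfs entry whose timestamp changes when the
    # tap is (re)attached to a bridge. The real patch adds a timestamp
    # collection method to the common agent manager interface.
    SYSFS_BRPORT = '/sys/class/net/%s/brport'

    def get_devices_timestamps(devices):
        timestamps = {}
        for dev in devices:
            try:
                timestamps[dev] = os.stat(SYSFS_BRPORT % dev).st_mtime
            except OSError:
                timestamps[dev] = None      # device vanished mid-scan
        return timestamps

    def locally_changed(current_ts, previous_ts):
        # Devices whose timestamp differs from the previous scan are treated
        # as changed, even though they never left the device set.
        return {dev for dev, ts in current_ts.items()
                if dev in previous_ts and previous_ts[dev] != ts}

With timestamps compared between scans, a tap that was deleted and re-attached to its bridge within one polling interval still shows up as locally changed and gets re-processed, so the device-up notification reaches Nova.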

Changed in neutron:
status: In Progress → Fix Released
Changed in neutron:
milestone: none → mitaka-rc1
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0rc1

This issue was fixed in the openstack/neutron 8.0.0.0rc1 release candidate.

Revision history for this message
Dustin Lundquist (dlundquist) wrote :

This occurred on a stable/kilo change too:

2016-03-23 22:32:37.809 | tempest.api.compute.servers.test_server_personality.ServerPersonalityTestJSON.test_rebuild_server_with_personality[id-128966d8-71fc-443c-8cab-08e24114ecc9]
2016-03-23 22:32:37.810 | -----------------------------------------------------------------------------------------------------------------------------------------------------------
2016-03-23 22:32:37.810 |
2016-03-23 22:32:37.810 | Captured traceback:
2016-03-23 22:32:37.810 | ~~~~~~~~~~~~~~~~~~~
2016-03-23 22:32:37.810 | Traceback (most recent call last):
2016-03-23 22:32:37.810 | File "tempest/api/compute/servers/test_server_personality.py", line 84, in test_rebuild_server_with_personality
2016-03-23 22:32:37.810 | waiters.wait_for_server_status(self.client, server_id, 'ACTIVE')
2016-03-23 22:32:37.811 | File "tempest/common/waiters.py", line 95, in wait_for_server_status
2016-03-23 22:32:37.811 | raise exceptions.TimeoutException(message)
2016-03-23 22:32:37.811 | tempest.exceptions.TimeoutException: Request timed out
2016-03-23 22:32:37.811 | Details: (ServerPersonalityTestJSON:test_rebuild_server_with_personality) Server d0060300-59b2-432f-89e9-3f56d45f0e50 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: REBUILD. Current task state: rebuild_spawning.

http://logs.openstack.org/59/296659/2/check/gate-tempest-dsvm-neutron-linuxbridge/e51c86d/console.html
