Sometimes the grenade job fails with NetworkNotFound because a network delete request took too long

Bug #1680912 reported by Ihar Hrachyshka on 2017-04-07
Affects: neutron
Importance: Critical
Assigned to: Ihar Hrachyshka

Bug Description

http://logs.openstack.org/38/453838/2/gate/gate-grenade-dsvm-neutron-ubuntu-xenial/84cf5aa/logs/grenade.sh.txt.gz#_2017-04-07_03_45_19_220

The DELETE request in question is:
2017-04-07 03:45:19.220 | 2017-04-07 03:41:31,143 18539 WARNING [urllib3.connectionpool] Retrying (Retry(total=9, connect=None, read=None, redirect=5)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='149.202.181.79', port=9696): Read timed out. (read timeout=60)",)': /v2.0/networks/46b0776a-3917-440d-9b90-4ab02a735188
2017-04-07 03:45:19.220 | 2017-04-07 03:41:34,053 18539 INFO [tempest.lib.common.rest_client] Request (NetworksIpV6Test:_run_cleanups): 404 DELETE http://149.202.181.79:9696/v2.0/networks/46b0776a-3917-440d-9b90-4ab02a735188 62.919s
2017-04-07 03:45:19.220 | 2017-04-07 03:41:34,053 18539 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
2017-04-07 03:45:19.220 | Body: None
2017-04-07 03:45:19.221 | Response - Headers: {u'content-length': '138', u'content-type': 'application/json', u'x-openstack-request-id': 'req-d1dfda92-9b9b-421f-b45f-7ef4746e43c6', u'date': 'Fri, 07 Apr 2017 03:41:34 GMT', u'connection': 'close', 'content-location': 'http://149.202.181.79:9696/v2.0/networks/46b0776a-3917-440d-9b90-4ab02a735188', 'status': '404'}
2017-04-07 03:45:19.221 | Body: {"NeutronError": {"message": "Network 46b0776a-3917-440d-9b90-4ab02a735188 could not be found.", "type": "NetworkNotFound", "detail": ""}}

What we see is that the first attempt to delete the network failed after 60 seconds, so the client retried the DELETE, at which point the network was no longer there.
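The retry itself is standard urllib3 behaviour: a ReadTimeoutError consumes one unit of the retry budget and the request is reissued, which is why the warning above shows Retry(total=9, ...) on what is already the second attempt. A minimal sketch (the URL placeholder is illustrative):

```python
from urllib3.util.retry import Retry
from urllib3.exceptions import ReadTimeoutError

# Fresh budget before the first attempt; the log's Retry(total=9, ...) is the
# state after one attempt has already failed.
retry = Retry(total=10, connect=None, read=None, redirect=5)

# Simulate the first DELETE breaking with a read timeout (a real error would
# carry the HTTPConnectionPool instead of None).
err = ReadTimeoutError(None, "/v2.0/networks/<uuid>",
                       "Read timed out. (read timeout=60)")
retry = retry.increment(method="DELETE", url="/v2.0/networks/<uuid>", error=err)

print(retry.total)  # -> 9; urllib3 now reissues the DELETE
```

DELETE is in urllib3's default set of retryable methods, which is why the timed-out request is silently repeated rather than surfaced to the test.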

In the neutron-server log, we see that the first DELETE attempt was received with req_id req-933bd8b3-cd8f-4d2f-8f0c-505f85347b9d at 03:40:31.344 but completed only at 03:41:33.951 (probably because the second DELETE attempt triggered something that unblocked the first request).

In the log for the first DELETE request, we see a loop repeatedly trying to de-allocate the same port:

2017-04-07 03:40:34.227 8785 DEBUG neutron.plugins.ml2.plugin [req-933bd8b3-cd8f-4d2f-8f0c-505f85347b9d tempest-NetworksIpV6Test-1590417059 tempest-NetworksIpV6Test-1590417059] Ports to auto-deallocate: set([u'b5ff06a6-4084-4961-b239-9f29aa42c68a']) delete_subnet /opt/stack/old/neutron/neutron/plugins/ml2/plugin.py:1093
2017-04-07 03:40:34.231 8785 DEBUG neutron.callbacks.manager [req-933bd8b3-cd8f-4d2f-8f0c-505f85347b9d tempest-NetworksIpV6Test-1590417059 tempest-NetworksIpV6Test-1590417059] Notify callbacks [] for subnet, before_delete _notify_loop /opt/stack/old/neutron/neutron/callbacks/manager.py:142
2017-04-07 03:40:34.338 8785 DEBUG neutron.plugins.ml2.plugin [req-933bd8b3-cd8f-4d2f-8f0c-505f85347b9d tempest-NetworksIpV6Test-1590417059 tempest-NetworksIpV6Test-1590417059] Ports to auto-deallocate: set([u'b5ff06a6-4084-4961-b239-9f29aa42c68a']) delete_subnet /opt/stack/old/neutron/neutron/plugins/ml2/plugin.py:1093
2017-04-07 03:40:34.340 8785 DEBUG neutron.callbacks.manager [req-933bd8b3-cd8f-4d2f-8f0c-505f85347b9d tempest-NetworksIpV6Test-1590417059 tempest-NetworksIpV6Test-1590417059] Notify callbacks [] for subnet, before_delete _notify_loop /opt/stack/old/neutron/neutron/callbacks/manager.py:142

It goes on like that until:

2017-04-07 03:41:32.644 8785 DEBUG neutron.plugins.ml2.plugin [req-933bd8b3-cd8f-4d2f-8f0c-505f85347b9d tempest-NetworksIpV6Test-1590417059 tempest-NetworksIpV6Test-1590417059] Deleting subnet record delete_subnet /opt/stack/old/neutron/neutron/plugins/ml2/plugin.py:1145
2017-04-07 03:41:32.644 8785 DEBUG neutron.ipam.driver [req-933bd8b3-cd8f-4d2f-8f0c-505f85347b9d tempest-NetworksIpV6Test-1590417059 tempest-NetworksIpV6Test-1590417059] Loading ipam driver: internal get_instance /opt/stack/old/neutron/neutron/ipam/driver.py:51
2017-04-07 03:41:32.698 8785 DEBUG neutron.plugins.ml2.plugin [req-933bd8b3-cd8f-4d2f-8f0c-505f85347b9d tempest-NetworksIpV6Test-1590417059 tempest-NetworksIpV6Test-1590417059] Committing transaction delete_subnet /opt/stack/old/neutron/neutron/plugins/ml2/plugin.py:1152

Right before the thread is unblocked and makes progress, we see that there is no longer a port to de-allocate:

2017-04-07 03:41:32.421 8785 DEBUG neutron.plugins.ml2.plugin [req-933bd8b3-cd8f-4d2f-8f0c-505f85347b9d tempest-NetworksIpV6Test-1590417059 tempest-NetworksIpV6Test-1590417059] Ports to auto-deallocate: set([]) delete_subnet /opt/stack/old/neutron/neutron/plugins/ml2/plugin.py:1093

I think this was unblocked by another cleanup (DELETE) request for the port that happened just before:

2017-04-07 03:41:31.994 8785 DEBUG neutron.plugins.ml2.plugin [req-d1dfda92-9b9b-421f-b45f-7ef4746e43c6 tempest-NetworksIpV6Test-1590417059 tempest-NetworksIpV6Test-1590417059] Calling delete_port for b5ff06a6-4084-4961-b239-9f29aa42c68a owned by network:dhcp delete_port /opt/stack/old/neutron/neutron/plugins/ml2/plugin.py:1637
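The suspected dynamics can be sketched with a toy simulation (the names and structure are illustrative, not neutron's actual code): the first request spins in delete_subnet's retry loop for as long as a port remains to auto-deallocate, and only the second request's delete_port removes the port and lets the first one finish.

```python
import threading
import time

ports = {"b5ff06a6"}       # port owned by network:dhcp
lock = threading.Lock()

def delete_subnet():
    # Mimics ml2 delete_subnet re-reading "Ports to auto-deallocate"
    # on every iteration of its retry loop.
    while True:
        with lock:
            if not ports:  # set([]) -> safe to delete the subnet record
                return "subnet deleted"
        time.sleep(0.01)   # stand-in for the retried DB transaction

results = []
t = threading.Thread(target=lambda: results.append(delete_subnet()))
t.start()

time.sleep(0.05)           # first DELETE is stuck looping the whole time
with lock:
    ports.clear()          # second DELETE's delete_port removes the port
t.join(timeout=5)

print(results)             # -> ['subnet deleted']
```

In the real system nothing guarantees a second request ever arrives, so the loop can spin until the client-side read timeout fires, which matches the 60+ second DELETE seen in the log.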

I suspect this is related to the following patch, where we first caught the situation but landed the change nevertheless: https://review.openstack.org/#/q/I924fa7e36ea9e45bf0ef3480972341a851bda86c,n,z

We may want to revert those patches. We may also want to cut a new Newton release, because the patch got into 9.3.0.

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Read%20timed%20out.%20(read%20timeout%3D60)%5C%22

^ suggests there were 3 failures of gate-grenade-dsvm-neutron-ubuntu-xenial overall in the last week, though there are more timeouts in other jobs.

Changed in neutron:
importance: Undecided → Critical
tags: added: gate-failure
tags: added: db
Changed in neutron:
status: New → Confirmed
assignee: nobody → Ihar Hrachyshka (ihar-hrachyshka)

Reviewed: https://review.openstack.org/454805
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0a2ca17d871c88a03feb3bbe845c3951af2d95b6
Submitter: Jenkins
Branch: stable/newton

commit 0a2ca17d871c88a03feb3bbe845c3951af2d95b6
Author: Ihar Hrachyshka <email address hidden>
Date: Fri Apr 7 17:14:18 2017 +0000

    Revert "Fix DetachedInstanceError on subnet delete"

    This reverts commit 20b9b8934331556e217e63b9a0dab6e71c2ef7c6.

    The patch made us loop indefinitely on network delete request.

    Change-Id: I18517bafcee495d7985ce29508435ff6fe610970
    Closes-Bug: #1680912

tags: added: in-stable-newton

Reviewed: https://review.openstack.org/454803
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a919f2b3541e2338fd8765dc22cda6b9781f61c8
Submitter: Jenkins
Branch: stable/ocata

commit a919f2b3541e2338fd8765dc22cda6b9781f61c8
Author: Ihar Hrachyshka <email address hidden>
Date: Fri Apr 7 17:13:35 2017 +0000

    Revert "Fix DetachedInstanceError on subnet delete"

    This reverts commit b9242c348cbf44ef1bd061826067c2ec96b8d2f0.

    The patch made us loop indefinitely on network delete request.

    Change-Id: I67eed7c0cb9ca738cea1e827676b4f8f350e5aef
    Closes-Bug: #1680912

tags: added: in-stable-ocata

This issue was fixed in the openstack/neutron 10.0.1 release.

This issue was fixed in the openstack/neutron 9.3.1 release.

Changed in neutron:
status: Confirmed → Fix Released