test_create_port_in_allowed_allocation_pools test fails on ironic grenade

Bug #1707160 reported by Vladyslav Drok
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Critical
Kevin Benton
oslo.messaging
Fix Released
Undecided
Ken Giusti

Bug Description

Here is an example of a job at http://logs.openstack.org/58/487458/6/check/gate-grenade-dsvm-ironic-ubuntu-xenial/d8f187e/console.html#_2017-07-28_09_33_52_031224

2017-07-28 09:33:52.027473 | Captured pythonlogging:
2017-07-28 09:33:52.027484 | ~~~~~~~~~~~~~~~~~~~~~~~
2017-07-28 09:33:52.027539 | 2017-07-28 09:15:48,746 9778 INFO [tempest.lib.common.rest_client] Request (PortsTestJSON:test_create_port_in_allowed_allocation_pools): 201 POST http://149.202.183.40:9696/v2.0/networks 0.342s
2017-07-28 09:33:52.027604 | 2017-07-28 09:15:48,746 9778 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
2017-07-28 09:33:52.027633 | Body: {"network": {"name": "tempest-PortsTestJSON-test-network-1596805013"}}
2017-07-28 09:33:52.027728 | Response - Headers: {u'date': 'Fri, 28 Jul 2017 09:15:48 GMT', u'x-openstack-request-id': 'req-0502025a-db49-4f1f-b30d-c38b8098b79e', u'content-type': 'application/json', u'content-length': '582', 'content-location': 'http://149.202.183.40:9696/v2.0/networks', 'status': '201', u'connection': 'close'}
2017-07-28 09:33:52.027880 | Body: {"network":{"status":"ACTIVE","router:external":false,"availability_zone_hints":[],"availability_zones":[],"description":"","subnets":[],"shared":false,"tenant_id":"5c851bb85bef4b008714ef04d1fe3671","created_at":"2017-07-28T09:15:48Z","tags":[],"ipv6_address_scope":null,"mtu":1450,"updated_at":"2017-07-28T09:15:48Z","admin_state_up":true,"revision_number":2,"ipv4_address_scope":null,"is_default":false,"port_security_enabled":true,"project_id":"5c851bb85bef4b008714ef04d1fe3671","id":"b8a3fb1c-86a4-4518-8c3a-dd12db585659","name":"tempest-PortsTestJSON-test-network-1596805013"}}
2017-07-28 09:33:52.027936 | 2017-07-28 09:15:49,430 9778 INFO [tempest.lib.common.rest_client] Request (PortsTestJSON:test_create_port_in_allowed_allocation_pools): 201 POST http://149.202.183.40:9696/v2.0/subnets 0.682s
2017-07-28 09:33:52.027998 | 2017-07-28 09:15:49,431 9778 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
2017-07-28 09:33:52.028054 | Body: {"subnet": {"ip_version": 4, "allocation_pools": [{"end": "10.1.0.14", "start": "10.1.0.2"}], "network_id": "b8a3fb1c-86a4-4518-8c3a-dd12db585659", "gateway_ip": "10.1.0.1", "cidr": "10.1.0.0/28"}}
2017-07-28 09:33:52.028135 | Response - Headers: {u'date': 'Fri, 28 Jul 2017 09:15:49 GMT', u'x-openstack-request-id': 'req-1a50b739-8683-4aaa-ba4a-6e9daf73f1c8', u'content-type': 'application/json', u'content-length': '594', 'content-location': 'http://149.202.183.40:9696/v2.0/subnets', 'status': '201', u'connection': 'close'}
2017-07-28 09:33:52.030085 | Body: {"subnet":{"service_types":[],"description":"","enable_dhcp":true,"tags":[],"network_id":"b8a3fb1c-86a4-4518-8c3a-dd12db585659","tenant_id":"5c851bb85bef4b008714ef04d1fe3671","created_at":"2017-07-28T09:15:49Z","dns_nameservers":[],"updated_at":"2017-07-28T09:15:49Z","gateway_ip":"10.1.0.1","ipv6_ra_mode":null,"allocation_pools":[{"start":"10.1.0.2","end":"10.1.0.14"}],"host_routes":[],"revision_number":0,"ip_version":4,"ipv6_address_mode":null,"cidr":"10.1.0.0/28","project_id":"5c851bb85bef4b008714ef04d1fe3671","id":"be974b50-e56b-44a8-86a9-6bcc345f9d55","subnetpool_id":null,"name":""}}
2017-07-28 09:33:52.030176 | 2017-07-28 09:15:50,616 9778 INFO [tempest.lib.common.rest_client] Request (PortsTestJSON:test_create_port_in_allowed_allocation_pools): 201 POST http://149.202.183.40:9696/v2.0/ports 1.185s
2017-07-28 09:33:52.030232 | 2017-07-28 09:15:50,617 9778 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
2017-07-28 09:33:52.030259 | Body: {"port": {"network_id": "b8a3fb1c-86a4-4518-8c3a-dd12db585659"}}
2017-07-28 09:33:52.030369 | Response - Headers: {u'date': 'Fri, 28 Jul 2017 09:15:50 GMT', u'x-openstack-request-id': 'req-6b57ff81-c874-4e97-8183-bd57c7e8de81', u'content-type': 'application/json', u'content-length': '691', 'content-location': 'http://149.202.183.40:9696/v2.0/ports', 'status': '201', u'connection': 'close'}
2017-07-28 09:33:52.030596 | Body: {"port":{"status":"DOWN","created_at":"2017-07-28T09:15:49Z","description":"","allowed_address_pairs":[],"tags":[],"network_id":"b8a3fb1c-86a4-4518-8c3a-dd12db585659","tenant_id":"5c851bb85bef4b008714ef04d1fe3671","extra_dhcp_opts":[],"admin_state_up":true,"updated_at":"2017-07-28T09:15:50Z","name":"","device_owner":"","revision_number":3,"mac_address":"fa:16:3e:a4:39:53","binding:vnic_type":"normal","port_security_enabled":true,"project_id":"5c851bb85bef4b008714ef04d1fe3671","fixed_ips":[{"subnet_id":"be974b50-e56b-44a8-86a9-6bcc345f9d55","ip_address":"10.1.0.4"}],"id":"a873e67a-a171-4860-9f47-be50dc061b10","security_groups":["107c3ae4-4c9f-4851-935f-5602c35270dc"],"device_id":""}}
2017-07-28 09:33:52.030660 | 2017-07-28 09:15:51,354 9778 INFO [tempest.lib.common.rest_client] Request (PortsTestJSON:_run_cleanups): 204 DELETE http://149.202.183.40:9696/v2.0/ports/a873e67a-a171-4860-9f47-be50dc061b10 0.735s
2017-07-28 09:33:52.030736 | 2017-07-28 09:15:51,354 9778 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
2017-07-28 09:33:52.030747 | Body: None
2017-07-28 09:33:52.030824 | Response - Headers: {u'date': 'Fri, 28 Jul 2017 09:15:51 GMT', u'x-openstack-request-id': 'req-018a5b4a-cb5e-4dc1-b807-e3223b1fc83a', u'content-length': '0', 'content-location': 'http://149.202.183.40:9696/v2.0/ports/a873e67a-a171-4860-9f47-be50dc061b10', 'status': '204', u'connection': 'close'}
2017-07-28 09:33:52.030835 | Body:
2017-07-28 09:33:52.030918 | 2017-07-28 09:16:51,390 9778 WARNING [urllib3.connectionpool] Retrying (Retry(total=9, connect=None, read=None, redirect=5, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='149.202.183.40', port=9696): Read timed out. (read timeout=60)",)': /v2.0/subnets/be974b50-e56b-44a8-86a9-6bcc345f9d55
2017-07-28 09:33:52.030974 | 2017-07-28 09:16:51,508 9778 INFO [tempest.lib.common.rest_client] Request (PortsTestJSON:_run_cleanups): 404 DELETE http://149.202.183.40:9696/v2.0/subnets/be974b50-e56b-44a8-86a9-6bcc345f9d55 60.153s
2017-07-28 09:33:52.031023 | 2017-07-28 09:16:51,508 9778 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
2017-07-28 09:33:52.031047 | Body: None
2017-07-28 09:33:52.031161 | Response - Headers: {u'date': 'Fri, 28 Jul 2017 09:16:51 GMT', u'x-openstack-request-id': 'req-86d01b29-c1b1-4396-a0a8-c360c78dcbd5', u'content-type': 'application/json', u'content-length': '136', 'content-location': 'http://149.202.183.40:9696/v2.0/subnets/be974b50-e56b-44a8-86a9-6bcc345f9d55', 'status': '404', u'connection': 'close'}
2017-07-28 09:33:52.031224 | Body: {"NeutronError": {"message": "Subnet be974b50-e56b-44a8-86a9-6bcc345f9d55 could not be found.", "type": "SubnetNotFound", "detail": ""}}

There seems to be some kind of race here; this failure should not be related to ironic.

tags: added: gate-failure tempest
Revision history for this message
Jakub Libosvar (libosvar) wrote :

That's some pretty precise troubleshooting.

neutron-client calls delete subnet, but neutron is slow for some reason (possibly a RabbitMQ issue where neutron-server notifies the DHCP agent to delete the port and namespace), and the operation takes 97 seconds to delete the subnet. The client times out and tries to delete again, but the first call eventually completes; the second call then fails because the subnet was already deleted by the first one.
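The failure mode Jakub describes can be sketched in a few lines of Python. This is an illustrative sketch only — the client object and exception names are stand-ins, not tempest's or neutronclient's real API: the point is that a 404 seen on a retry after a client-side timeout means the slow first call finished, so cleanup should treat it as success.

```python
# Sketch of the race: a DELETE times out client-side while the server is
# still processing it.  On retry the resource is already gone, so a 404
# during cleanup should be treated as success.  Names are illustrative.

class NotFound(Exception):
    pass

def delete_subnet(client, subnet_id, retries=2):
    for _attempt in range(retries):
        try:
            return client.delete(subnet_id)   # first call may time out...
        except TimeoutError:
            continue                          # ...so try again
        except NotFound:
            # The earlier, slow call eventually succeeded: cleanup is done.
            return None
    raise TimeoutError("subnet %s: all delete attempts timed out" % subnet_id)
```

Tempest's cleanup helpers already swallow NotFound for exactly this reason, which is why the 404 above only produces noise rather than an immediate test failure.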

Changed in neutron:
status: New → Confirmed
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

We landed a fix for https://bugs.launchpad.net/neutron/+bug/1705351 and think it's related and should help here. Could you please check if you still have the problem?

Revision history for this message
Vladyslav Drok (vdrok) wrote :

Hi Ihar, do you mean the ignoring-exception patch? If so, it does help by not raising the exception further up, but here we just see some extreme slowness from neutron or rabbit, which might be related. This log is from after that fix merged. Sometimes we see this NotFound exception for a subnet, sometimes for a router, sometimes for a router interface, on random API tests, as far as I can tell.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

FYI the neutron fix in question that should have fixed this was: https://review.openstack.org/#/c/485365/

Changed in neutron:
status: Confirmed → Incomplete
status: Incomplete → Confirmed
Revision history for this message
Vladyslav Drok (vdrok) wrote :

And now I also see this stuff on grenade -- http://logs.openstack.org/42/358142/35/check/gate-grenade-dsvm-ironic-ubuntu-xenial/2e6b101/logs/new/screen-q-svc.txt.gz#_2017-08-01_17_07_08_874

I'm not even sure whether it's a separate issue, or whether all of these share the same root cause.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I think the security-groups NotFound error is the result of https://review.openstack.org/#/c/410422/ landing. A possible fix is: https://review.openstack.org/486687

Changed in neutron:
importance: Undecided → Critical
assignee: nobody → Ihar Hrachyshka (ihar-hrachyshka)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/489697

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Testing two options: a revert of the security group push notifications patch, and a patch to catch NotFound for all RPC endpoints, using these two Depends-On patches for ironic:

https://review.openstack.org/489699
https://review.openstack.org/489701

Revision history for this message
John L. Villalovos (happycamp) wrote :

FYI: even though https://review.openstack.org/#/c/486687/ (Catch exceptions for all rpc casts) merged, it has not resolved the Ironic gate errors we are seeing.

The test of the revert has passed twice; I am doing another recheck now.
https://review.openstack.org/#/c/489699/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: master
Review: https://review.openstack.org/489697

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

The latest recheck on the revert patch suggests that it doesn't help either.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

In the last failure, it's not neutron API but nova API that times out:

http://logs.openstack.org/99/489699/1/check/gate-grenade-dsvm-ironic-ubuntu-xenial/2e8b6fa/logs/testr_results.html.gz

The relevant log messages for the Nova API request start at:

http://logs.openstack.org/99/489699/1/check/gate-grenade-dsvm-ironic-ubuntu-xenial/2e8b6fa/logs/new/screen-n-api.txt.gz#_2017-08-02_17_40_21_213

2017-08-02 17:40:21.149 30615 DEBUG nova.osapi_compute.wsgi.server [req-84752f96-9ca4-43dd-86ed-64620eb7d8ca - -] (30615) accepted ('10.0.1.169', 52578) server /usr/local/lib/python2.7/dist-packages/eventlet/wsgi.py:883
2017-08-02 17:40:21.213 30615 DEBUG nova.api.openstack.wsgi [req-04a0710f-c696-415e-a2a7-1381bf0174b1 tempest-TestServerBasicOps-512068857 tempest-TestServerBasicOps-512068857] Action: 'create', calling method: <function version_select at 0x7f90a183bc08>, body: {"security_group": {"name": "tempest-TestServerBasicOps-722492525", "description": "tempest-TestServerBasicOps-722492525 description"}} _process_stack /opt/stack/new/nova/nova/api/openstack/wsgi.py:609
2017-08-02 17:40:21.619 30615 INFO nova.osapi_compute.wsgi.server [req-04a0710f-c696-415e-a2a7-1381bf0174b1 tempest-TestServerBasicOps-512068857 tempest-TestServerBasicOps-512068857] 10.0.1.169 "POST /v2.1/os-security-groups HTTP/1.1" status: 200 len: 637 time: 0.4696331
2017-08-02 17:40:21.623 30615 DEBUG nova.osapi_compute.wsgi.server [req-84752f96-9ca4-43dd-86ed-64620eb7d8ca - -] (30615) accepted ('10.0.1.169', 52582) server /usr/local/lib/python2.7/dist-packages/eventlet/wsgi.py:883
2017-08-02 17:40:21.651 30615 DEBUG nova.api.openstack.wsgi [req-983ceec8-1c1e-4d6e-b560-9f4aa358bf6a tempest-TestServerBasicOps-512068857 tempest-TestServerBasicOps-512068857] Action: 'create', calling method: <function version_select at 0x7f90a183b140>, body: {"security_group_rule": {"from_port": 22, "to_port": 22, "ip_protocol": "tcp", "cidr": "0.0.0.0/0", "parent_group_id": "1da383d4-6d09-451b-8e87-1d708803ae02"}} _process_stack /opt/stack/new/nova/nova/api/openstack/wsgi.py:609

There seems to be no completion for the request.

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I marked the bug as fixed for neutron; if you still experience problems with Neutron though, feel free to reopen (a ping in IRC would also help to get my attention).

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

OK, I still see this failing on a neutron api request here: http://logs.openstack.org/01/489701/1/check/gate-grenade-dsvm-ironic-ubuntu-xenial/50785e5/logs/testr_results.html.gz

In the neutron-server log, we first see a NotFound ERROR message from the missing fanout exchange (that's not an issue per se, and I sent a patch to downgrade the message to DEBUG [1]); then the request handler attempts to fanout another event to agents; then the request is blocked for ~115 seconds, ending with:

2017-08-01 22:50:59.696 1787 ERROR oslo.messaging._drivers.impl_rabbit [req-765ebd21-ee44-4db2-85f8-f6b7e52cdf19 tempest-RoutersTest-175057974 tempest-RoutersTest-175057974] [fdb10a5b-b3c8-436c-ba74-193c721329bb] AMQP server on 104.130.216.130:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2017-08-01 22:51:00.715 1787 INFO oslo.messaging._drivers.impl_rabbit [req-765ebd21-ee44-4db2-85f8-f6b7e52cdf19 tempest-RoutersTest-175057974 tempest-RoutersTest-175057974] [fdb10a5b-b3c8-436c-ba74-193c721329bb] Reconnected to AMQP server on 104.130.216.130:5672 via [amqp] client with port 51878.

When we look at rabbitmq logs around the time of failure we see:

=ERROR REPORT==== 1-Aug-2017::22:49:05 ===
Channel error on connection <0.3221.0> (104.130.216.130:43578 -> 104.130.216.130:5672, vhost: '/', user: 'stackrabbit'), channel 1:
{amqp_error,not_found,
            "no exchange 'q-agent-notifier-port-delete_fanout' in vhost '/'",
            'basic.publish'}

=WARNING REPORT==== 1-Aug-2017::22:50:59 ===
closing AMQP connection <0.3221.0> (104.130.216.130:43578 -> 104.130.216.130:5672):
connection_closed_abruptly

I suspect that after NotFound is raised, the oslo.messaging channel gets into a state that doesn't allow sending any more messages, and only a connection reset restores connectivity.

[1] https://review.openstack.org/#/c/490139/

Changed in neutron:
status: Fix Released → Confirmed
milestone: none → pike-3
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

The issue is extensively being discussed in #openstack-oslo; logs: http://eavesdrop.openstack.org/irclogs/%23openstack-oslo/%23openstack-oslo.2017-08-02.log.html#t2017-08-02T19:18:05

Current thinking is that we hit a race between

1) rabbitmq cleaning up the topic rabbit_transient_queues_ttl [30 mins] after its last usage by a neutron agent (the new agent doesn't use it anymore), and
2) oslo.messaging sending a fanout to a topic.

In a regular setup, a fanout to a non-existent topic results in its creation; but if the topic is pulled out from under the fanout attempt mid-operation, we may hit this issue.

A possible solution would then be to catch the error, do any necessary cleanup on the connection channel, and retry the attempt, this time creating the topic first.
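That proposed recovery path can be sketched as follows. The transport object and its method names are illustrative stand-ins, not kombu's or oslo.messaging's actual API — the shape of the idea is: on a missing-exchange error, reset the poisoned channel, declare the exchange explicitly, and publish once more.

```python
# Sketch of the proposed workaround: if publishing to a fanout exchange
# fails because the exchange no longer exists (it was reaped by the
# transient-queue TTL), reset the channel, recreate the exchange, and
# retry the publish.  All names here are illustrative stand-ins.

class ExchangeNotFound(Exception):
    pass

def publish_fanout(transport, exchange, message):
    try:
        transport.publish(exchange, message)
    except ExchangeNotFound:
        transport.reset_channel()             # drop the broken channel state
        transport.declare_exchange(exchange)  # recreate the missing exchange
        transport.publish(exchange, message)  # retry against a known target
```

The single retry is deliberate: after an explicit declare, a second ExchangeNotFound would indicate a different problem, and it's better to let it propagate than to loop.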

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Looking at the logs and the discussion above, this bug seems related to oslo.messaging rather than Nova.

Any reason why we don't move this bug to the Oslo component?

Changed in nova:
importance: Undecided → Critical
Revision history for this message
Sean Dague (sdague) wrote :

It is not clear that this is really a Nova bug, so I'm putting it into Incomplete. It's definitely not critical in Nova unless it's confirmed that it's something we need to handle.

Changed in nova:
status: New → Incomplete
importance: Critical → Medium
Changed in neutron:
milestone: pike-3 → pike-rc1
Changed in neutron:
assignee: Ihar Hrachyshka (ihar-hrachyshka) → huanzhangzhao (yujiamayi)
Changed in neutron:
assignee: huanzhangzhao (yujiamayi) → nobody
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

For Neutron, a workaround was committed as: https://review.openstack.org/#/c/490305/

Changed in neutron:
status: Confirmed → Fix Released
assignee: nobody → Kevin Benton (kevinbenton)
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Adding oslo to the list of affected projects. The fix is here: https://review.openstack.org/#/c/486706/

Changed in oslo.messaging:
assignee: nobody → Ken Giusti (kgiusti)
Changed in oslo.messaging:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/486706
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=f059bba6ae9642675bc1c9004a4a4adc5d1b19a6
Submitter: Jenkins
Branch: master

commit f059bba6ae9642675bc1c9004a4a4adc5d1b19a6
Author: Kenneth Giusti <email address hidden>
Date: Mon Jul 24 13:36:17 2017 -0400

    Prevent rabbit from raising unexpected exceptions

    Publishing a message using the kombu connection autoretry method may
    allow exceptions from the py-amqp library to be raised up to the
    application. This does not conform to the documented oslo.messaging
    API.

    Enhance the try except block to capture any exception and translate it
    into a MessageDeliveryFailure.

    There are a few cases where exceptions will be raised during autoretry
    publishing: recoverable connection or channel errors, and
    non-recoverable connection or channel errors.

    autoretry will only retry if the error is recoverable. Non recoverable
    errors are re-raised immediately regardless of the retry count.

    In the case of a recoverable error it seems unlikely that retrying
    either the connection or the channel yet again is going to get us
    anywhere, so in this case we simply clean up the channel state, log an
    error and fail the operation.

    In the case of non-recoverable error we are out of luck (think
    authentication failure) - further retrying will not help. Best we can
    do is clean up state and log the heck out of it.

    Change-Id: I2f65d2ee19a8c3e9a323b30404abbf0cbb45a216
    Closes-Bug: #1705351
    Closes-Bug: #1707160
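The pattern the commit message describes can be sketched as follows. MessageDeliveryFailure is the documented oslo.messaging publish-side error, but the other names here are illustrative stand-ins for the real impl_rabbit code, not its actual implementation:

```python
# Sketch of the merged fix: wrap the (kombu autoretry-style) publish so
# that any exception leaking up from the AMQP layer is translated into
# the one error type the oslo.messaging API documents.  Names other than
# MessageDeliveryFailure are illustrative stand-ins.

class MessageDeliveryFailure(Exception):
    """The documented oslo.messaging publish-side failure."""

class RecoverableConnectionError(Exception):
    """Stand-in for py-amqp's recoverable connection/channel errors."""

def safe_publish(connection, send):
    try:
        send()
    except RecoverableConnectionError as exc:
        # autoretry already retried; retrying yet again is unlikely to
        # help, so clean up channel state and fail with the documented
        # exception.
        connection.reset_channel()
        raise MessageDeliveryFailure(str(exc))
    except Exception as exc:
        # Non-recoverable (e.g. authentication failure): no retry will
        # help, so translate and re-raise immediately.
        raise MessageDeliveryFailure(str(exc))
```

Callers can then rely on catching MessageDeliveryFailure alone, instead of a mix of py-amqp internals.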

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.31.0

This issue was fixed in the openstack/oslo.messaging 5.31.0 release.

Lee Yarwood (lyarwood)
no longer affects: nova