Check for FIP status in scenario tests cause instability in neutron gate jobs

Bug #1381617 reported by Armando Migliaccio
Affects: tempest
Status: Fix Released
Importance: Undecided
Assigned to: Carl Baldwin
Milestone: none
Tags: gate-failure
Changed in neutron:
importance: Undecided → Critical
status: New → Confirmed
Revision history for this message
Rajeev Grover (rajeev-grover) wrote :

Analyzing the logs, the status reported by the agent appears to be correct, but the test checks the status too soon, which is why the test fails. Since the status is being correctly reported by the agent, the Importance of this defect should be lowered.

Here are relevant entries from the logs:

Test case checking the status (note the time):

2014-10-15 09:07:21,845 7581 INFO [tempest.common.rest_client] Request (TestNetworkBasicOps:test_network_basic_ops): 200 GET http://127.0.0.1:9696/v2.0/floatingips/5c026cb3-0b11-4f83-8a6b-488dc853334c 0.010s

Agent reporting the status (again, note the time):

2014-10-15 09:07:23.382 2281 DEBUG neutron.common.rpc [-] neutron.agent.l3_agent.L3PluginApi method call called with arguments (<neutron.context.ContextBase object at 0x7f47726ad690>, {'args': {'router_id': u'c65d9fd9-7da9-485d-9c45-b159f0f861ea', 'fip_statuses': {u'5c026cb3-0b11-4f83-8a6b-488dc853334c': 'DOWN'}}, 'namespace': None, 'method': 'update_floatingip_statuses'}) {'version': '1.1'} wrapper /opt/stack/new/neutron/neutron/common/log.py:33

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Do you have an opinion as to why this is affecting DVR more than the non-DVR case?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
summary: - status check for FloatingIP in network scenario showed instability in
- DVR
+ status check for FloatingIP in network scenario showed caused
+ instability in neutron gate jobs
summary: - status check for FloatingIP in network scenario showed caused
- instability in neutron gate jobs
+ Check for FIP status in scenario tests cause instability in neutron gate
+ jobs
description: updated
tags: removed: l3-dvr-backlog
tags: added: gate-failure
Revision history for this message
Rajeev Grover (rajeev-grover) wrote :

Armando, to your question: "Do you have an opinion as to why this is affecting DVR more than the non-DVR case?"
No concrete measurements, but floating IP creation and deletion involve more steps in the DVR case and therefore could take slightly longer than in the non-DVR case.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Agreed, but how much longer? The change affects the regular gate as well, but DVR just went up the wall! Perhaps by having an idea of the timing requirements between the two we can determine whether there is anything that can be done to speed up the state transition in the DVR case.

Revision history for this message
Marios Andreou (marios-b) wrote :

Hi

Indeed, the failed tempest-dsvm-neutron-full at [1] confirms that this happens in non-DVR cases. Concentrating on this case for a moment - the initial thought was (as Rajeev notes) 'the test is checking the status too soon' - however, looking at the logs [2], the update and subsequent GET are 4 seconds apart... I don't know how long is 'enough', but we're definitely not talking milliseconds here. For context, the code that causes this looks like [4] - so the following shows the update (disassociate op) and then the GET for verification from _check_public_network_connectivity:

2014-10-16 14:53:54,177 26442 DEBUG [tempest.common.rest_client] Request (TestNetworkBasicOps:test_network_basic_ops): 200 PUT http://127.0.0.1:9696/v2.0/floatingips/d355cc43-3ae4-4d25-ba57-b66195f2dc68 0.493s
...
2014-10-16 14:53:58,251 26442 DEBUG [tempest.common.rest_client] Request (TestNetworkBasicOps:test_network_basic_ops): 200 GET http://127.0.0.1:9696/v2.0/floatingips/d355cc43-3ae4-4d25-ba57-b66195f2dc68 0.028s

However, after following the trace I find there is a hell of a lot that goes on in _update_floatingip (_update_floatingip && _update_fip_assoc && _check_and_get_fip_assoc --> get_assoc_data && _internal_fip_assoc_data && _get_port --> _get_router_for_floatingip). I will revisit with fresh eyes next week; I would be surprised if we can't streamline this a bit - perhaps only perform a subset of the checks when we are dealing with a disassociation op, if possible. I'll assign it to myself for now, but feel free to ping me if you really want to work on it/already are working on it.

A random sampling of the 'latest' neutron reviews shows me that the "check-tempest-dsvm-neutron-dvr" fails reliably with exactly this problem - trace looks like [3].

thanks, marios

[1] https://review.openstack.org/#/c/128772

[2] http://logs.openstack.org/72/128772/2/check/check-tempest-dsvm-neutron-full/274f6d1/logs/testr_results.html.gz

[3] Traceback (most recent call last):
  File "tempest/test.py", line 113, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "tempest/scenario/test_network_basic_ops.py", line 378, in test_network_basic_ops
    msg="after disassociate "
  File "tempest/scenario/test_network_basic_ops.py", line 203, in _check_public_network_connectivity
    self.check_floating_ip_status(floating_ip, floatingip_status)
  File "tempest/scenario/manager.py", line 639, in check_floating_ip_status
    st=status))
  File "/usr/local/lib/python2.7/dist-packages/testtools/testcase.py", line 348, in assertEqual
    self.assertThat(observed, matcher, message)
  File "/usr/local/lib/python2.7/dist-packages/testtools/testcase.py", line 433, in assertThat
    raise mismatch_error
MismatchError: 'DOWN' != u'ACTIVE': FloatingIP: <"FloatingIP" addr="172.24.4.73" id="d355cc43-3ae4-4d25-ba57-b66195f2dc68"> is at status: ACTIVE. failed to reach status: DOWN

[4] self._check_network_external_connectivity()
        self._disassociate_floating_ips()
        self._check_public_network_connectivity(should_connect=False,
                                                msg="after disassociate...


Changed in neutron:
assignee: nobody → Marios Andreou (marios-b)
Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

If you are looking for the exact time that it takes between a delete and the status update, it takes around one minute and a couple of seconds.

I am looking at the trace in my log.

Here it is requesting a delete.

2014-10-15 13:33:23.986 INFO neutron.wsgi [req-66de7aa0-d4f5-442f-b6c6-2bd4895894b0 admin 89100eaae2764aa88c0b12cb65240eb6] 192.168.15.147 - - [15/Oct/2014 13:33:23] "DELETE /v2.0/floatingips/714c30ac-41b5-4c74-9b7d-bea0672dfff2.json HTTP/1.1" 204 149 0.138742

Here it is sending an rpc message back to the server to update the status.

2014-10-15 13:34:25.513 DEBUG neutron.api.rpc.handlers.l3_rpc [req-88c465d8-02d6-46c7-b78a-966ec622fb2a None None] New status for floating IP 714c30ac-41b5-4c74-9b7d-bea0672dfff2: DOWN from (pid=60349) update_floatingip_statuses /opt/stack/neutron/neutron/api/rpc/handlers/l3_rpc.py:

Revision history for this message
Marios Andreou (marios-b) wrote :

Spent some more time poking today... I think the following should address Salvatore's question:

* timing for the update is not a problem (as I originally thought), but only in the NON-DVR case... I don't think this is happening enough in the non-DVR case to be a concern (apart from the specific example above I haven't seen any others - is this others' experience too?)

* in 'regular' l3_db.py, the _update_fip_assoc [1] is not doing anything major - for the given update case here, all the conditionals at [2] prove false so the fip is updated with None for the port.

* for the DVR case, afaics the key difference (besides admin context), is in the overridden _update_fip_assoc [3]. In the dvr case, the conditional [4] holds true for the PUT /floating_ip operation here and this kicks off garbage collection to run for the agent gateway port [5]. This happens before the fip db update and thus also before eventually being able to send the rpc notification (which will lead to the status being updated to DOWN, right?). The check_fips_availability_on_host is quite an expensive operation (get all routers, get all floating_ip for those routers, get port for each floating ip).

If others agree that this could in fact be the problem, a proposed solution is for l3_dvr_db to override update_floatingip [6] and move the agent check from [3] there, but only after the update and notification have been sent (a rough sketch follows the reference list below).

does it make sense?

[1] https://github.com/openstack/neutron/blob/master/neutron/db/l3_db.py#L740
[2] https://github.com/openstack/neutron/blob/master/neutron/db/l3_db.py#L713
[3] https://github.com/openstack/neutron/blob/master/neutron/db/l3_dvr_db.py#L145
[4] if (not ('port_id' in fip and fip['port_id'])) and (floatingip_db['fixed_port_id'] is not None):
[5] https://github.com/openstack/neutron/blob/master/neutron/db/l3_dvr_db.py#L409
[6] https://github.com/openstack/neutron/blob/master/neutron/db/l3_db.py#L1156
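
A rough sketch of that ordering change, for illustration only (this is not the actual patch that was later posted for review; _clear_unused_gw_port is a hypothetical stand-in for the agent gateway port cleanup referenced at [5], and imports/signatures are simplified):

    # Sketch only: override update_floatingip in neutron/db/l3_dvr_db.py so
    # the DB update and RPC notification happen before the expensive check.
    from neutron.db import l3_db

    class L3_NAT_with_dvr_db_mixin(l3_db.L3_NAT_db_mixin):

        def update_floatingip(self, context, id, floatingip):
            old_fip = self.get_floatingip(context, id)
            # Let the base class update the record and notify first, so the
            # status can transition without waiting for the cleanup.
            new_fip = super(L3_NAT_with_dvr_db_mixin, self).update_floatingip(
                context, id, floatingip)
            # Only afterwards run the agent gateway port check that the
            # overridden _update_fip_assoc currently performs inline on a
            # disassociation (the conditional at [4]).
            if (not floatingip['floatingip'].get('port_id')
                    and old_fip.get('port_id')):
                self._clear_unused_gw_port(context, old_fip)  # hypothetical helper
            return new_fip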

Revision history for this message
Marios Andreou (marios-b) wrote :

excuse me, it was Armando's question

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Hi Marios, thanks for looking into this. Your analysis seems accurate, we'd need to see how that plays out, if you can put a patch together we can take it from there.

Cheers,
Armando

Revision history for this message
Marios Andreou (marios-b) wrote :

sure, am just waiting for tests to pass on my dev box (tox py27 and pep8), but basically looks like http://paste.fedoraproject.org/144144/13982336/ (will submit to gerrit once tests are green)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/130221

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/130414

Changed in neutron:
assignee: Marios Andreou (marios-b) → Armando Migliaccio (armando-migliaccio)
Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → Marios Andreou (marios-b)
Changed in neutron:
assignee: Marios Andreou (marios-b) → Armando Migliaccio (armando-migliaccio)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

With attempted fix:

https://review.openstack.org/130414

There is a possibility that the FIP is marked DOWN, but the agent has not processed it yet (or, even worse, never will). This is clearly a potential source of inconsistency.

Ultimately, wouldn't a client check whether the FIP is first and foremost reachable? What does ACTIVE really mean? As a matter of fact, today there's a window where the FIP is no longer reachable but its status is still ACTIVE.

Another potential fix would be to change the L3 agent logic so that the agent turns the status of the FIPs to DOWN first, before tearing down rules and namespaces. However, this is not an easy fix.

The problem here stems from the fact that the tempest test does not wait for the state transition at all:

https://review.openstack.org/#/c/102700/30/tempest/scenario/manager.py

As soon as the FIP is no longer reachable, the test 'assumes' the status is DOWN. So another potential fix is to make the method

  check_floating_ip_status

wait for the state transition to occur.
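
A minimal sketch of what such a relaxed check could look like (illustrative only, not the patch that eventually merged; the network_client attribute and the show_floatingip call are assumptions about the tempest client):

    import time

    def check_floating_ip_status(self, floating_ip, expected_status,
                                 timeout=60, interval=5):
        """Poll the floating IP until it reaches expected_status or time out."""
        fip_id = floating_ip['id']
        status = None
        deadline = time.time() + timeout
        while time.time() < deadline:
            # Client call name is assumed; the real tempest client may differ.
            body = self.network_client.show_floatingip(fip_id)
            status = body['floatingip']['status']
            if status == expected_status:
                return
            time.sleep(interval)
        raise AssertionError(
            "FloatingIP %s did not reach status %s within %ss (last seen: %s)"
            % (fip_id, expected_status, timeout, status))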

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

A revised approach in:

https://review.openstack.org/130414/4

Changes the logic only on the agent side to deal with the problem observed.

Changed in neutron:
milestone: none → kilo-1
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Added Tempest to the list of affected projects for the following reason:

The problem found here stems from the fact that the tempest test does not wait for the state transition at all:

https://review.openstack.org/#/c/102700/30/tempest/scenario/manager.py

As soon as the FIP is no longer reachable (a), the test 'assumes' the status is DOWN (b). However, there is a lag between (a) and (b).

The method check_floating_ip_status could be relaxed to wait until the transition occurs.

Changed in tempest:
assignee: nobody → Carl Baldwin (carl-baldwin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/130414
Reason: Perhaps a more promising course of action here is to relax the Tempest test to allow for some slack in waiting for the status to go from ACTIVE to DOWN. This sounds sensible to me, but I don't understand why the test was not designed this way in the first place.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Perhaps a more promising course of action here is to relax the Tempest test to allow for some slack in waiting for the status to go from ACTIVE to DOWN. This sounds sensible to me, but I don't understand why the test was not designed this way in the first place.

Changed in tempest:
status: New → Invalid
status: Invalid → Confirmed
status: Confirmed → Opinion
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/130655

Changed in tempest:
status: Opinion → In Progress
Revision history for this message
Yair Fried (yfried) wrote :

As the one who added the status check, I assumed that once connectivity status is verified, the DB should have had enough time to update.
I wonder what is considered an acceptable delay between connectivity and the status change in the API - 5 sec? 5 min?
The initial approach was to have the test verify the status with a waiter prior to the connectivity check. This was rejected since it would have potentially masked races.
I don't like adding a waiter after the connectivity check for the same reason.

Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → Marios Andreou (marios-b)
Revision history for this message
Marios Andreou (marios-b) wrote :

So I'm not sure we actually can/should relax the test - it depends on how bound we are by our API ;) and on how complete my analysis at https://review.openstack.org/#/c/130414/ is.

For PUT /v2.0/floatingips/{floatingip_id} we return 200. So it is OK for the client to have the expectation that the request was completed and successful. They should be able to verify that with a subsequent GET on that resource (which is exactly what the test does). If we ask the test to change, then really we also need to change our API to return 202 here.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Marios: My suggestion is not to go down the API response code rabbit hole :)

@Yair: in the existing implementation there is a lag between connectivity teardown and status update. There's no race per se, that's how it works... so having the test fail on the first attempt is prone to intermittent failures... the 'race' is in the test itself, not in the tested system!

Changed in neutron:
importance: Critical → High
Revision history for this message
Rajeev Grover (rajeev-grover) wrote :

Analyzing the logs from different days, the time difference between the status check by the test and the status reported by the L3 agent is < 5 sec. If a wait is added to the tests, a wait of > 6-8 sec appears to be a good number to start experimenting with.

Status checked by test (pid)    Status reported by agent
2014-10-24 16:24:18,428 6944    2014-10-24 16:24:19.975
2014-10-23 09:10:15,677 2474    2014-10-23 09:10:19.028
2014-10-22 18:56:56,389 26774   2014-10-22 18:56:58.177
2014-10-22 15:46:20,526 27000   2014-10-22 15:46:25.172
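
For reference, the lag in these pairs can be computed directly from the timestamps (a small standalone script; the values are copied from the table above, with the PID column from the tempest log lines dropped):

    from datetime import datetime

    pairs = [
        ("2014-10-24 16:24:18,428", "2014-10-24 16:24:19.975"),
        ("2014-10-23 09:10:15,677", "2014-10-23 09:10:19.028"),
        ("2014-10-22 18:56:56,389", "2014-10-22 18:56:58.177"),
        ("2014-10-22 15:46:20,526", "2014-10-22 15:46:25.172"),
    ]

    def parse(ts):
        # tempest logs use a comma before the milliseconds, the agent a dot
        return datetime.strptime(ts.replace(",", "."), "%Y-%m-%d %H:%M:%S.%f")

    for checked, reported in pairs:
        print("lag: %.3f s" % (parse(reported) - parse(checked)).total_seconds())
    # Prints lags of roughly 1.5 to 4.7 seconds, which is why a wait of more
    # than 6-8 seconds is suggested as a starting point.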

Revision history for this message
Marios Andreou (marios-b) wrote :

@Armando, yes fair enough (I'll note the interesting discussion at [1] and Carl's discussion of the addition of status reporting).

As noted in the relevant places, I would still advocate for at least a note on the API page.

thanks! marios

[1] https://review.openstack.org/#/c/130414

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/130655
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=a754e2d12cdc2c1a7f860f20f2d915dd576c5155
Submitter: Jenkins
Branch: master

commit a754e2d12cdc2c1a7f860f20f2d915dd576c5155
Author: Carl Baldwin <email address hidden>
Date: Thu Oct 23 22:47:41 2014 +0000

    Allow some lag for floating IP to reach a status

    The current test checks the floating IP status as soon as possible
    without allowing the status to converge. This doesn't allow for the
    fact that implementations don't report the new status until after
    connectivity has been torn down. There may be some expected lag that
    is not allowed by the current test.

    Change-Id: Icfe4bb2c40fde540bc2a6371630aa3f12dd945ff
    Closes-Bug: #1381617

Changed in tempest:
status: In Progress → Fix Released
Changed in neutron:
status: In Progress → Confirmed
Kyle Mestery (mestery)
Changed in neutron:
milestone: kilo-1 → kilo-2
Kyle Mestery (mestery)
Changed in neutron:
milestone: kilo-2 → kilo-3
Kyle Mestery (mestery)
Changed in neutron:
milestone: kilo-3 → kilo-rc1
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

The tempest fix healed the breakage that was captured by this report. Having said that, the stability of the DVR seems pretty good these days:

http://graphite.openstack.org/render/?from=-10days&height=500&until=now&width=1200&bgcolor=ffffff&fgcolor=000000&yMax=100&yMin=0&target=color(alias(movingAverage(asPercent(stats.zuul.pipeline.check.job.check-tempest-dsvm-neutron-dvr.FAILURE,sum(stats.zuul.pipeline.check.job.check-tempest-dsvm-neutron-dvr.{SUCCESS,FAILURE})),%2736hours%27),%20%27check-tempest-dsvm-neutron-dvr%27),%27orange%27)&target=color(alias(movingAverage(asPercent(stats.zuul.pipeline.check.job.check-tempest-dsvm-neutron-full.FAILURE,sum(stats.zuul.pipeline.check.job.check-tempest-dsvm-neutron-full.{SUCCESS,FAILURE})),%2736hours%27),%20%27check-tempest-dsvm-neutron-full%27),%27blue%27)

no longer affects: neutron