FlushError on IPAllocation

Bug #1627424 reported by Armando Migliaccio
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Miguel Lavalle
Milestone: (none)
description: updated
tags: added: gate-failure
Changed in neutron:
status: New → Confirmed
importance: Undecided → High
tags: added: db l3-ipam-dhcp
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

This is looking a lot like another manifestation of this bug [1]. It is a port update that is called from a subnet delete and there are a lot of retries in the logs around the time of the failure. Fixes for the other bug [1] started pouring in the day after these occurrences. I'm inclined to look through logstash to see if there are more occurrences of this exception since then.

[1] https://launchpad.net/bugs/1622616

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Fix [1] postdates the traces. I would be curious to see whether these have gone away since.

[1] https://review.openstack.org/#/c/373536/

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

I see a number of failures with this logstash query [1]. If you include successful runs, there are about 10x more hits, presumably because the operations are retried and succeed on retry about that often.

Most of the occurrences are in the Rally jobs. We should definitely keep an eye on this. I'll dive into the code now and see if I can figure out what is going on.

One odd thing: logstash doesn't show any occurrences of this exception between sometime on September 21st and later on the 26th. I don't know why that would be; the gap could be entirely unrelated to this problem, but I thought it worth noting.

[1] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22FlushError%3A%20New%20instance%20%3CIPAllocation%20at%5C%22%20AND%20build_status%3AFAILURE

Revision history for this message
Ryan Tidwell (ryan-tidwell) wrote :

Just following up on this: it is still appearing. I found 48 hits over the last week with the above logstash query.

Revision history for this message
Miguel Lavalle (minsel) wrote :

The SQLAlchemy documentation describes the 'FlushError ... conflicts with persistent instance.... ' exception we see in the tracebacks here: http://docs.sqlalchemy.org/en/latest/orm/session_state_management.html. I'll continue digging.
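
For reference, the conflict the docs describe can be reproduced with a minimal toy sketch (a simplified model standing in for neutron.db.models_v2.IPAllocation, assuming SQLAlchemy 1.4+; this is not Neutron code): a brand-new object is add()-ed whose identity key matches a persistent instance already tracked by the session, and the next flush fails before any SQL is emitted.

# Toy reproduction of "FlushError: New instance ... conflicts with
# persistent instance" (hypothetical simplified model, not Neutron code;
# assumes SQLAlchemy 1.4+ for the sqlalchemy.orm imports).
from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class IPAllocation(Base):
    __tablename__ = 'ipallocations'
    ip_address = Column(String(64), primary_key=True)
    port_id = Column(String(36), primary_key=True)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = Session(bind=engine)

# Make one row persistent in this session's identity map.
session.add(IPAllocation(ip_address='1.1.0.246', port_id='d363aa64'))
session.flush()

# Add a brand-new object with the same identity key while the persistent
# one is still tracked by the session; the next flush raises
# sqlalchemy.orm.exc.FlushError instead of emitting SQL.
session.add(IPAllocation(ip_address='1.1.0.246', port_id='d363aa64'))
session.flush()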

Miguel Lavalle (minsel)
Changed in neutron:
assignee: nobody → Miguel Lavalle (minsel)
Revision history for this message
Brian Haley (brian-haley) wrote :

I took a look at this. I don't have an answer yet, just some notes.

We are actually making it as far as update_port_with_ips() in ipam_pluggable_backend.py, from what I can tell. That code is going to add items to the session by calling _delete_ip_allocation() and _store_ip_allocation(). From the log http://logs.openstack.org/86/399986/4/check/gate-rally-dsvm-networking-odl-carbon-snapshot/c6fff55/logs/screen-q-svc.txt (which could go away):

Delete allocated IP 1.1.0.206 (0aa0f60a-f4b2-4f7a-bfcb-744a26a5a236/26f0040c-1bf9-4e0c-b813-129ccace18cd) _delete_ip_allocation /opt/stack/new/neutron/neutron/db/db_base_plugin_common.py:100
Allocated IP 1.1.0.246 (0aa0f60a-f4b2-4f7a-bfcb-744a26a5a236/2b209aca-3fff-4b4c-9d32-b4fcdc510898/d363aa64-a1c2-4b43-99c2-eb244c78ac40) _store_ip_allocation /opt/stack/new/neutron/neutron/db/db_base_plugin_common.py:116

Then less than a second later:

delete failed: Exception deleting fixed_ip from port d363aa64-a1c2-4b43-99c2-eb244c78ac40

After that is done, a call to _update_db_port() is made, which actually updates the port. That might be what's blowing up, and the exception unwinds things until we get back to update_port().

But the interesting thing is that update_port() will only raise a RetryRequest() if the exception is IpAddressAllocationNotFound(), which is raised in only one place: deallocate() in the IPAM driver. That is most likely getting called from _ipam_deallocate_ips().

Which gets us back to the FlushError:

FlushError: New instance <IPAllocation at 0x7f5e2ec1b550> with identity key (<class 'neutron.db.models_v2.IPAllocation'>, ('1.1.0.246', u'2b209aca-3fff-4b4c-9d32-b4fcdc510898', u'0aa0f60a-f4b2-4f7a-bfcb-744a26a5a236')) conflicts with persistent instance <IPAllocation at 0x7f5e2e0a3d50>

So does that mean we're trying to delete and create an object for that IP (1.1.0.246) in the same session? There are so many layers here that I haven't been able to figure that out.

Revision history for this message
Ryan Tidwell (ryan-tidwell) wrote :

This really has me scratching my head. I see this bubble up during what looks like an attempt to delete a subnet. As I trace the logs, it looks like the subnet delete that triggers this is an IPv4 subnet, so I'm wondering why we even get to L1171 and attempt to invoke update_port(). It looks to me like this should only occur if there are SLAAC allocations on the port. At L1153 in plugins/ml2/plugin.py I would expect to_deallocate to be empty since I don't see any SLAAC allocations to handle. If there are IPv4 allocations, shouldn't that stop the delete_subnet() operation in its tracks? What am I missing?

Revision history for this message
Miguel Lavalle (minsel) wrote :

This is an analysis of one instance of this failure, triggered by two concurrent subnet delete requests. The analyzed failure reflects the most common case found in Kibana for this bug. The first request (req-991c797f-6905-4f85-a54b-b604d39b3cce) is traced here: http://paste.openstack.org/show/597114/. The second request (req-b60423b8-2fd6-4f25-975d-e4824837360f) is traced here: http://paste.openstack.org/show/597276/. The complete Neutron server log is here: http://logs.openstack.org/14/425714/6/check/gate-rally-dsvm-neutron-rally-ubuntu-xenial/e757329/logs/screen-q-svc.txt#_2017-01-30_15_16_09_904 (this points to the failure of the first request, req-991c797f-6905-4f85-a54b-b604d39b3cce).

1) We have a port with id 290d5c99-3140-4f63-aa7b-3180b251c2ca. We will refer to it as P.

2) P has IP address 1.1.0.98 on subnet 27b645fe-8d45-4573-86f5-2ff888e92f3a, referred to from now on as IP1 and S1 respectively. S1 is being deleted by request req-991c797f-6905-4f85-a54b-b604d39b3cce.

3) P also has IP address 10.2.5.2 on subnet 278f2e3e-b5e8-4a9c-a5bc-acd21ae78b5a, referred to from now on as IP2 and S2 respectively. S2 is being deleted by request req-b60423b8-2fd6-4f25-975d-e4824837360f.

4) The deletion of S1 triggers a call to update_port (ml2) to remove IP1 from P's fixed ips.

5) In parallel to the previous point, the deletion of S2 triggers a call to update_port (ml2) to remove IP2 from P's fixed ips.

6) The update_port to remove IP2 succeeds.

7) When the update_port to remove IP1 attempts to commit the change, it gets an error: StaleDataError: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched.

8) The update_port is executed again by our retry mechanism: Performing DB retry for function neutron.plugins.ml2.plugin.update_port wrapper /usr/local/lib/python2.7/dist-packages/oslo_db/api.py:153

9) The problem here is that update_port to remove IP1 was called (by delete_subnet) with a fixed_ips attribute that includes IP2: https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/plugin.py#L1151

10) This leads the retried execution of update_port to not only delete IP1 but also add IP2 back, which causes the FlushError exception and the 500 returned to the API.

We need to retry the complete delete_subnet operation, so the retried update_port receives an updated fixed_ips attribute that doesn't include IP2.
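
To make the retry-boundary point concrete, here is a small toy sketch (ConflictError, PORTS, get_fixed_ips, update_port and retry are all hypothetical stand-ins, not the real Neutron/oslo.db code). Retrying only the port update replays the stale fixed_ips snapshot and re-adds IP2; retrying the whole subnet delete recomputes fixed_ips on every attempt:

# Toy illustration of the retry boundary; none of these names are real
# Neutron or oslo.db APIs.
class ConflictError(Exception):
    """Stand-in for the StaleDataError hit by the losing commit."""

PORTS = {'P': [('IP1', 'S1'), ('IP2', 'S2')]}   # port -> list of (ip, subnet)
_race_pending = [True]

def get_fixed_ips(port_id):
    return list(PORTS[port_id])

def update_port(port_id, fixed_ips):
    if _race_pending and _race_pending.pop():
        # Simulate the concurrent delete of S2 winning the commit first:
        # IP2 is already gone, and our own update fails with stale data.
        PORTS[port_id] = [('IP1', 'S1')]
        raise ConflictError()
    PORTS[port_id] = fixed_ips

def retry(func, *args):
    for _ in range(2):
        try:
            return func(*args)
        except ConflictError:
            pass

# Current shape: the retry wraps only update_port, so the second attempt
# replays the fixed_ips snapshot taken before the race and re-adds IP2
# (the situation that ends in the FlushError).
def delete_subnet_inner_retry(subnet_id, port_id):
    fixed_ips = [fi for fi in get_fixed_ips(port_id) if fi[1] != subnet_id]
    retry(update_port, port_id, fixed_ips)

# Suggested shape: retry the complete delete_subnet, so fixed_ips is
# recomputed from current state on every attempt and no longer has IP2.
def delete_subnet_outer_retry(subnet_id, port_id):
    def attempt():
        fixed_ips = [fi for fi in get_fixed_ips(port_id) if fi[1] != subnet_id]
        update_port(port_id, fixed_ips)
    retry(attempt)

delete_subnet_inner_retry('S1', 'P')
print(PORTS['P'])   # [('IP2', 'S2')]: IP2 was incorrectly written back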

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/429421

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Miguel Lavalle (<email address hidden>) on branch: master
Review: https://review.openstack.org/429421

Revision history for this message
Miguel Lavalle (minsel) wrote :

I have traced this error to the creation or update of DHCP ports while the port's network or subnets are being deleted concurrently. Besides the example mentioned in note #8, I traced this DHCP port creation (http://paste.openstack.org/raw/598918/) while the network was being deleted concurrently (http://paste.openstack.org/raw/599252/). The port creation fails with a FlushError exception, while the network deletion succeeds.

Patchsets https://review.openstack.org/#/c/423584 and https://review.openstack.org/#/c/428774 significantly change the way networks and subnets are deleted, including the logic that deletes or updates network-owned ports, which currently means DHCP ports. These changes also alter how long some of the corresponding SQLAlchemy objects are kept in memory. As a consequence, this bug might go away.

We need to keep an eye on these two changes and, after they merge, verify whether this bug gets fixed.

Revision history for this message
Miguel Lavalle (minsel) wrote :

The current plan is to fix bug https://bugs.launchpad.net/neutron/+bug/1676235 and see what effect it has on this one.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I don't see any failing hits for the bug, so I'm removing gate-failure now.

I do see some hits that haven't resulted in a failure, so I'm leaving the bug open for the assignee to handle.

tags: removed: gate-failure
Changed in neutron:
importance: High → Medium
Revision history for this message
Miguel Lavalle (minsel) wrote :

Using Kibana query:

message:"FlushError: New instance <IPAllocation at" AND build_status:FAILURE

I cannot find any hits for this bug over the past 7 days.

Modifying the query to just:

message:"FlushError: New instance <IPAllocation at"

I can find 7 hits. 5 of them belong to the Neutron project but on branches stable/ocata (3 hits) or stable/newton (2 hits), which don't have this fix: https://review.openstack.org/#/c/452501/.

The remaining 2 hits are in the Tempest project, which doesn't use Neutron master. I went through the log messages in screen-q-svc and can see messages issued by the ML2 plugin that are no longer in the code.

As a consequence, I am closing this bug.

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
Miguel Lavalle (minsel) wrote :
Changed in neutron:
importance: Medium → High
Changed in neutron:
status: Fix Committed → Fix Released