Network: concurrent issue for create network operation

Bug #1800417 reported by Le, Huifeng
This bug affects 1 person
Affects: neutron
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

High level description:
When running rally test cases in parallel against the network creation API, it is possible to encounter errors while updating a segmentation allocation record in the DB. The cause is that the code queries the database for a free entry and then updates that tuple as two independent operations, with no mutual exclusion between concurrent requests. Since the neutron API runs as multiple child processes, a collision can occur when two processes attempt to access the same DB tuple.
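A minimal sketch of the race described above, assuming a simplified allocation table in an in-memory SQLite database. The table name `allocations`, its columns, and the helper functions are hypothetical and are not neutron's schema or code. The two "workers" are interleaved by hand so that both run the query before either runs the update.

```python
import sqlite3

# Hypothetical stand-in for a segmentation allocation table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE allocations ("
    "segment_id INTEGER PRIMARY KEY, allocated INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO allocations VALUES (?, 0)", [(100,), (101,), (102,)]
)


def find_free_segment(conn):
    """Step 1: query for the lowest-numbered unallocated segment."""
    row = conn.execute(
        "SELECT segment_id FROM allocations "
        "WHERE allocated = 0 ORDER BY segment_id LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_allocated(conn, segment_id):
    """Step 2: unconditional update, issued separately from the query."""
    conn.execute(
        "UPDATE allocations SET allocated = 1 WHERE segment_id = ?",
        (segment_id,),
    )


# Two API workers whose steps interleave: both query before either updates.
worker_a = find_free_segment(conn)
worker_b = find_free_segment(conn)
mark_allocated(conn, worker_a)
mark_allocated(conn, worker_b)

# Both workers now believe they own the same segment.
collision = worker_a == worker_b
```

Because the update does not check that the row is still free, neither worker can tell that the other one picked the same segment.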

Version: latest devstack

Tags: api db
zhaobo (zhaobo6) wrote :

So, do you still want to raise https://bugs.launchpad.net/neutron/+bug/1800599 ? Having just read your analysis, this seems to be the same as https://bugs.launchpad.net/neutron/+bug/1800599 . So maybe this is a common/known issue, right? Could you please test more cases with other resources and other operations? Or is this issue only hit during network creation in the parallel test?

Changed in neutron:
status: New → Incomplete
tags: added: api db
Le, Huifeng (hle2) wrote :

Thanks much for the review. This is a different issue from #1800599 (which is caused by multiple long-lived clients using up the neutron processing threads).

This issue is specific to create_network, because this operation queries and then updates the segment allocation table, which causes conflicts under concurrent access. As a workaround, we added a global InterProcessLock to serialize the create_network operation, but we would like to know whether the community has a better solution or suggestion for such cases. Thanks!
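A rough sketch of the kind of workaround described above: serializing the allocation behind a lock shared by all worker processes on a host. This is not the lock class the reporter used. It is a stand-in built on `fcntl.flock` from the Python standard library (POSIX only), and the lock path and function names are hypothetical.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock file; any path shared by all API worker processes would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "create_network.lock")


@contextmanager
def interprocess_lock(path=LOCK_PATH):
    """Hold an exclusive advisory file lock for the duration of the block."""
    with open(path, "w") as lock_file:
        # Blocks until no other process holds the lock on this file.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def create_network_serialized(allocate):
    """Run the query-then-update allocation while holding the global lock."""
    with interprocess_lock():
        return allocate()


# The callable stands in for the real segment allocation.
result = create_network_serialized(lambda: 100)
```

A file lock like this only serializes workers on a single host and makes every network creation wait on the others, which is presumably why the reporter asks for a better approach.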

zhaobo (zhaobo6) wrote :

Hi, sorry for the late reply.

So we are NOT talking about the function "add_network_segment", but about the "reserve_provider_segment" function in each type driver, right? If multiple API requests with the same fields request the same segmentation for a network, I think that's your case.

Also, you said this causes errors. Could you please show us more logs if you tested it with rally? Thanks. For me, it's hard to imagine how they conflict. ;-)

Le, Huifeng (hle2) wrote :

Thanks for the comments! Yes, it is a rally case and it is hard to reproduce.
At the code level:
in reserve_provider_segment -> allocate_partially_specified_segment, if multiple processes/threads call

count = (session.query(self.model).filter_by(allocated=False, **raw_segment).update({"allocated": True}))

concurrently, it may cause the same segment to be allocated to different networks. The code:

alloc = random.choice(allocs)

helps to reduce the probability, but the issue can still occur.
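For illustration, here is a minimal sketch of a conditional update like the one quoted above, written against an in-memory SQLite table rather than neutron's models. The table name `allocations`, its columns, and `try_claim` are hypothetical. It assumes the database applies each single UPDATE statement atomically, so the affected-row count tells a caller whether it actually claimed the row.

```python
import sqlite3

# Hypothetical stand-in for a segmentation allocation table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE allocations ("
    "segment_id INTEGER PRIMARY KEY, allocated INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO allocations VALUES (?, 0)", [(100,), (101,)])


def try_claim(conn, segment_id):
    """Flip the row to allocated only if it is still free.

    Returns True when exactly one row changed, i.e. this caller won.
    """
    cur = conn.execute(
        "UPDATE allocations SET allocated = 1 "
        "WHERE segment_id = ? AND allocated = 0",
        (segment_id,),
    )
    return cur.rowcount == 1


first = try_claim(conn, 100)   # row was free, so the update matches it
second = try_claim(conn, 100)  # row is already allocated, nothing matches
```

In this sketch a caller that gets `False` knows it lost and can pick another segment or retry. A driver that ignores the row count, or that selects candidates deterministically, would see the conflict far more often.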

We found the issue with a customized type driver which does not select the segment randomly. Let me see whether I can get some logs for you.

And do you know whether the community has a solution for this kind of issue? Thanks!

Slawek Kaplonski (slaweq) wrote :

Maybe using the retry_db_errors decorator (https://github.com/openstack/neutron-lib/blob/0ff55899bad4082a965014ef753a3370bdeb2b43/neutron_lib/db/api.py#L157) for this method would help.
Could you provide some additional neutron-server logs with the errors you got? I don't think we have spotted such an issue in the rally job in our CI so far.
Maybe you also have a rally scenario (or something else) which can help to reproduce this issue?

Le, Huifeng (hle2) wrote :

Slawek,

Thanks much for the comments; retry_db_errors should work.
Actually, this issue is the same as https://github.com/openstack/neutron/commit/1d9fd2aec00cb85034e5a23cc1beac33c74e0110, and it has been mostly addressed by that fix, but there is still a low probability of it being triggered in large system deployments. Sorry, we do not have additional error logs. Do you think it makes sense to add @retry_db_errors as additional protection for this issue? If yes, where do you think the decorator should be added? (Or maybe add it to the _create_network_db call?)
@db_api.retry_db_errors
def _create_network_db(self, context, network):

Le, Huifeng (hle2) wrote :

Slawek,

I retested the cases with some mock code to force a conflict. It turns out that the retry mechanism is already implemented in upstream code (e.g. by @db_api.retry_if_session_inactive() on def create_network(self, context, network)) and that it handles the conflict correctly (the retry allocates a new segment id, so the operation eventually succeeds). So I suppose this issue has already been resolved upstream. Thanks!
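The retry behaviour described above can be sketched roughly as follows. This is not the neutron-lib implementation. `AllocationConflict`, `retry_on_conflict`, and `allocate_segment` are hypothetical names, and the conflict is simulated with a set of segments that a competing worker has already taken.

```python
import functools


class AllocationConflict(Exception):
    """Hypothetical error: another worker claimed the segment first."""


def retry_on_conflict(max_retries=5):
    """Hypothetical decorator: re-run the call when it hits a conflict."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except AllocationConflict:
                    # Give up and re-raise once the retry budget is spent.
                    if attempt == max_retries - 1:
                        raise
        return wrapper
    return decorator


free_segments = [100, 101, 102]
taken_by_other_worker = {100}  # simulated concurrent allocation
attempts = []


@retry_on_conflict(max_retries=3)
def allocate_segment():
    """Pick the next candidate and fail if a competitor already took it."""
    candidate = free_segments.pop(0)
    attempts.append(candidate)
    if candidate in taken_by_other_worker:
        raise AllocationConflict(candidate)
    return candidate


segment = allocate_segment()
```

The first attempt conflicts on segment 100 and the retry succeeds with 101, which mirrors the observation that a retried operation ends up with a different segment id. A production decorator would normally also add a backoff between attempts, which this sketch omits.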

Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired