somehow getting duplicate openvswitch agents for the same host

Bug #1254246 reported by Clint Byrum on 2013-11-23
This bug affects 5 people
Affects   Importance   Assigned to
neutron   Undecided    Roman Podoliaka
tripleo   High         Unassigned

Bug Description

While investigating spurious failures in our TripleO continuous deployment, I had this problem:

+--------------------------------------+--------------------+-------------------------------------+-------+----------------+
| id | agent_type | host | alive | admin_state_up |
+--------------------------------------+--------------------+-------------------------------------+-------+----------------+
| 3a9c6aca-e91f-49c9-850a-67db219fdf58 | L3 agent | overcloud-notcompute-wjo2jbvvd2sm | :-) | True |
| 3fb9f6cf-b545-4a34-a490-dda834973d1e | Open vSwitch agent | overcloud-novacompute0-ubrjpv4jz64a | xxx | True |
| 855349b2-b0fc-4270-bb96-385b61aa5a6c | DHCP agent | overcloud-notcompute-wjo2jbvvd2sm | :-) | True |
| 8b8a4128-9716-42ee-b886-f053db166ce3 | Metadata agent | overcloud-notcompute-wjo2jbvvd2sm | :-) | True |
| c8297e0d-8575-47f0-ae65-499c1e0319b3 | Open vSwitch agent | overcloud-notcompute-wjo2jbvvd2sm | :-) | True |
| f746fc1d-9083-46f4-a922-739c5d332d7c | Open vSwitch agent | overcloud-novacompute0-ubrjpv4jz64a | xxx | True |
+--------------------------------------+--------------------+-------------------------------------+-------+----------------+

Note that overcloud-novacompute0-ubrjpv4jz64a has _two_ Open vSwitch agents.

This caused many 'vif_type=binding_failed' errors when booting nova instances.

Deleting f746fc1d-9083-46f4-a922-739c5d332d7c resulted in the problem going away.

Seems like there might be a race if the agent restarts quickly: it does not see its own agent record yet and sends a second RPC that creates another one. I am not entirely sure how this works, so that is just a hypothesis.

Clint Byrum (clint-fewbar) wrote :

Adding TripleO as we have to work around this to keep moving forward with automated deployment.

Changed in tripleo:
status: New → In Progress
importance: Undecided → High
importance: High → Critical
assignee: nobody → Clint Byrum (clint-fewbar)
Robert Collins (lifeless) wrote :

We're SQL backed though, surely that's atomic and immediately visible?

Changed in neutron:
assignee: nobody → Roman Podoliaka (rpodolyaka)

Fix proposed to branch: master
Review: https://review.openstack.org/58814

Changed in neutron:
status: New → In Progress
Maru Newby (maru) on 2013-12-08
tags: added: havana-backport-potential

Reviewed: https://review.openstack.org/58814
Committed: http://github.com/openstack/neutron/commit/5529071bf1393d0d448bc495cc906a68bc30a820
Submitter: Jenkins
Branch: master

commit 5529071bf1393d0d448bc495cc906a68bc30a820
Author: Roman Podoliaka <email address hidden>
Date: Wed Nov 27 18:57:56 2013 +0200

    Fix a race condition in agents status update code

    Code handling agents status updates coming via RPC checks,
    if a corresponding entry for the given (agent_type, host)
    pair already exists in DB and updates it. And if it doesn't
    exist, a new entry is created.

    Without a unique constraint this can cause a race condition
    resulting in adding of two agent entries having the same value
    of (agent_type, host) pair.

    Note, that it's already not allowed to have multiple agents of
    the same type having the same host value, but currently it's
    enforced only at code level, not at DB schema level, which
    effectively makes race conditions possible.

    Closes-Bug: #1254246

    Change-Id: I1ebaa111154b3d6b34074705b579097ab730594c
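
The race described in the commit message above can be reproduced in miniature. The following is only a sketch under stated assumptions, not the neutron code: it uses an in-memory SQLite table with a simplified, hypothetical `agents` schema, and it simulates the bad interleaving by letting two status reports both run their existence check before either one inserts. Without a unique constraint on (agent_type, host) both inserts succeed; with it, the second insert is rejected by the database.

```python
import sqlite3


def make_db(with_unique_constraint):
    """Create a simplified, hypothetical agents table in memory."""
    conn = sqlite3.connect(":memory:")
    uc = ", UNIQUE (agent_type, host)" if with_unique_constraint else ""
    conn.execute(
        "CREATE TABLE agents ("
        "id INTEGER PRIMARY KEY, "
        "agent_type TEXT NOT NULL, "
        "host TEXT NOT NULL" + uc + ")"
    )
    return conn


def agent_exists(conn, agent_type, host):
    """The 'check' half of check-then-insert."""
    row = conn.execute(
        "SELECT 1 FROM agents WHERE agent_type = ? AND host = ?",
        (agent_type, host),
    ).fetchone()
    return row is not None


def insert_agent(conn, agent_type, host):
    """The 'insert' half of check-then-insert."""
    conn.execute(
        "INSERT INTO agents (agent_type, host) VALUES (?, ?)",
        (agent_type, host),
    )


def simulate_race(conn, agent_type, host):
    """Both reports check before either inserts, then both try to insert."""
    first_sees_row = agent_exists(conn, agent_type, host)
    second_sees_row = agent_exists(conn, agent_type, host)
    rejected = 0
    for sees_row in (first_sees_row, second_sees_row):
        if not sees_row:
            try:
                insert_agent(conn, agent_type, host)
            except sqlite3.IntegrityError:
                # Only reachable when the unique constraint exists.
                rejected += 1
    count = conn.execute("SELECT COUNT(*) FROM agents").fetchone()[0]
    return count, rejected


rows_without_uc, _ = simulate_race(
    make_db(False), "Open vSwitch agent", "compute0")
rows_with_uc, rejected_with_uc = simulate_race(
    make_db(True), "Open vSwitch agent", "compute0")
print(rows_without_uc, rows_with_uc, rejected_with_uc)
```

With the constraint in place the caller still has to handle the rejected insert (for example by retrying as an update); the sketch only counts it.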

Changed in neutron:
status: In Progress → Fix Committed
Sean M. Collins (scollins) wrote :

+1 for this breaking devstack - I'm really surprised this did not get caught by the gate?

http://paste.openstack.org/show/60821/

Henry Gessau (gessau) wrote :

It is not seen in the gate because the gate runs with the lbaas service enabled, which masks the issue. Try adding 'enable_service q-lbaas' in localrc as a workaround for devstack.

Ladislav Smola (lsmola) wrote :

Seems like 'enable_service q-lbaas' doesn't help.

So far, devstack works only after checking out this patch:
https://review.openstack.org/#/c/61663/

Changed in tripleo:
importance: Critical → High
Thierry Carrez (ttx) on 2014-01-22
Changed in neutron:
milestone: none → icehouse-2
status: Fix Committed → Fix Released

Reviewed: https://review.openstack.org/61663
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8b67ee3d94a4866fd4ecd27e8fc37d2f17595aa9
Submitter: Jenkins
Branch: master

commit 8b67ee3d94a4866fd4ecd27e8fc37d2f17595aa9
Author: Roman Podoliaka <email address hidden>
Date: Thu Dec 12 08:20:15 2013 +0200

    Fix the migration adding a UC to agents table

    The migration script mistakenly assumes that all core
    plugins use agents extension, which is not true (e.g.
    plumgrid and bigswitch don't).

    Apply this migration script only for plugins that are
    stated in the original migration script adding agents
    table (511471cc46b_agent_ext_model_supp.py).

    Related-Bug: #1254246

    Change-Id: I7915ef8d183782eb5d46ac47f45014aa9e9640fb
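
The idea in the commit message above, running a migration only when one of the plugins it targets is active, can be sketched as follows. This is a hypothetical illustration, not the actual migration script: the plugin names, the `should_run` helper, the `upgrade` signature, and the recorded statement are all assumptions made up for the example.

```python
# Hypothetical plugin identifiers; the real list lives in the migration
# script that originally added the agents table.
AGENT_PLUGINS = ["example.plugins.UsesAgents"]


def should_run(active_plugins, migration_for_plugins):
    """Return True if the migration targets at least one active plugin."""
    if "*" in migration_for_plugins:
        return True
    return bool(set(active_plugins) & set(migration_for_plugins))


def upgrade(active_plugins, applied):
    """Record the schema change only for plugins that have an agents table."""
    if not should_run(active_plugins, AGENT_PLUGINS):
        return
    # Placeholder for the real schema change (adding the unique constraint).
    applied.append("add unique constraint on agents (agent_type, host)")


applied_with_agents = []
upgrade(["example.plugins.UsesAgents"], applied_with_agents)

applied_without_agents = []
upgrade(["example.plugins.NoAgents"], applied_without_agents)

print(len(applied_with_agents), len(applied_without_agents))
```

A plugin that never created the agents table skips the step entirely, which is the failure the second commit avoids.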

Thierry Carrez (ttx) on 2014-04-17
Changed in neutron:
milestone: icehouse-2 → 2014.1

Hm... do we plan to commit this fix for the Havana release?

Sean M. Collins (scollins) wrote :

The quick and dirty fix for me was deleting the rows that were duplicates.
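
Before deleting anything, it helps to list which (agent_type, host) pairs are registered more than once. A minimal sketch, assuming a simplified `agents` table with only `id`, `agent_type` and `host` columns (the real table has more) and loading three of the rows from the report above into an in-memory SQLite database purely for illustration:

```python
import sqlite3
from collections import defaultdict


def find_duplicate_agents(conn):
    """Map each duplicated (agent_type, host) pair to its agent ids."""
    groups = defaultdict(list)
    query = "SELECT id, agent_type, host FROM agents ORDER BY id"
    for agent_id, agent_type, host in conn.execute(query):
        groups[(agent_type, host)].append(agent_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agents (id TEXT, agent_type TEXT, host TEXT)")
conn.executemany(
    "INSERT INTO agents VALUES (?, ?, ?)",
    [
        ("3fb9f6cf-b545-4a34-a490-dda834973d1e", "Open vSwitch agent",
         "overcloud-novacompute0-ubrjpv4jz64a"),
        ("c8297e0d-8575-47f0-ae65-499c1e0319b3", "Open vSwitch agent",
         "overcloud-notcompute-wjo2jbvvd2sm"),
        ("f746fc1d-9083-46f4-a922-739c5d332d7c", "Open vSwitch agent",
         "overcloud-novacompute0-ubrjpv4jz64a"),
    ],
)

duplicates = find_duplicate_agents(conn)
for (agent_type, host), ids in duplicates.items():
    print(agent_type, host, ids)
```

Which of the duplicate rows to delete is left to the operator; the sketch only reports them.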

Clint Byrum (clint-fewbar) wrote :

Does this still actually affect TripleO, which only deploys trunk?

Changed in tripleo:
status: In Progress → Incomplete
assignee: Clint Byrum (clint-fewbar) → nobody
Roman Podoliaka (rpodolyaka) wrote :

No, this is fixed in trunk.

Changed in tripleo:
status: Incomplete → Fix Released