nova-compute host is added to scheduling pool before Neutron can bind network ports on said host

Bug #1260440 reported by Clint Byrum
This bug affects 4 people
Affects                    Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)   Expired       Undecided   Unassigned
neutron                    Invalid       Undecided   Unassigned
tripleo                    Fix Released  Critical    Clint Byrum

Bug Description

This is a race condition.

Given a cloud with 0 compute nodes available, on a compute node:
* Start up neutron-openvswitch-agent
* Start up nova-compute
* nova boot an instance

Scenario 1:
* neutron-openvswitch-agent registers with Neutron before nova tries to boot instance
* port is bound to agent
* instance boots with correct networking

Scenario 2:
* nova schedules instance to host before neutron-openvswitch-agent is registered with Neutron
* nova instance fails with vif_type=binding_failed
* instance is in ERROR state

I would expect that Nova would not try to schedule instances on compute hosts that are not ready.
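
The readiness check expected here can be sketched roughly as follows. This is only an illustration, not Nova or Neutron code: `hosts_ready_for_scheduling` is a hypothetical helper, and the agent records are assumed to be dictionaries with `agent_type`, `host`, and `alive` keys, similar to what Neutron's agent listing reports. The `"Open vSwitch agent"` type string is likewise an assumption.

```python
# Assumed agent_type value reported by neutron-openvswitch-agent.
OVS_AGENT_TYPE = "Open vSwitch agent"


def hosts_ready_for_scheduling(compute_hosts, agents, agent_type=OVS_AGENT_TYPE):
    """Return the compute hosts that have a live L2 agent registered.

    compute_hosts: iterable of host names known to Nova.
    agents: iterable of dicts with "agent_type", "host" and "alive" keys
            (an assumed shape, modelled on Neutron's agent listing).
    """
    hosts_with_agent = {
        agent["host"]
        for agent in agents
        if agent.get("agent_type") == agent_type and agent.get("alive")
    }
    # Preserve the caller's ordering of hosts.
    return [host for host in compute_hosts if host in hosts_with_agent]


# Scenario 2: nova-compute is up on both hosts, but the agent on
# "compute-1" has not registered with Neutron yet.
agents = [{"agent_type": OVS_AGENT_TYPE, "host": "compute-0", "alive": True}]
print(hosts_ready_for_scheduling(["compute-0", "compute-1"], agents))
# prints ['compute-0']
```

With a filter of this kind, "compute-1" would stay out of the scheduling pool until its agent registers, so the instance would not end up with vif_type=binding_failed.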

Please also see this mailing list thread for more info:

http://lists.openstack.org/pipermail/openstack-dev/2013-December/022084.html

Clint Byrum (clint-fewbar) wrote :

This breaks deployment of new clouds in TripleO sometimes, and will likely break scaling too. Hence the Critical status.

Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical

Russell Bryant (russellb) wrote :

Agreed that this is important, but the Nova change is blocked until Nova is able to ask Neutron whether it's ready, as discussed on the list.

Changed in nova:
status: New → Confirmed
importance: Undecided → High

Ian Wells (ijw-ubuntu) wrote :

Rather than doing anything clever and REST-oriented, maybe Nova could just create a directory of scripts that it runs to test the status of auxiliary tools? Neutron can put a test script or scripts in there, and the script can vary per driver, calling out to a central controller or testing a local process. It's not going to work if you call it before every VM run, I guess, but it would be OK if you used it for initial readiness.
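
A minimal sketch of the script-directory idea, under stated assumptions: the function name `auxiliary_services_ready`, the convention that exit status 0 means "ready", and the per-script timeout are all hypothetical and not part of Nova.

```python
import os
import subprocess


def auxiliary_services_ready(check_dir, timeout=10):
    """Run every executable file in check_dir; ready only if all exit 0.

    check_dir is a hypothetical directory that other services (for
    example Neutron) would drop their readiness scripts into.
    """
    if not os.path.isdir(check_dir):
        # No checks installed: treat the host as ready (a design choice).
        return True
    for name in sorted(os.listdir(check_dir)):
        path = os.path.join(check_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue  # skip non-executable entries such as README files
        try:
            result = subprocess.run([path], timeout=timeout)
        except (subprocess.TimeoutExpired, OSError):
            return False  # a hung or unrunnable check counts as not ready
        if result.returncode != 0:
            return False
    return True
```

Because each call spawns one process per script, this fits the "initial readiness" use described above better than a check before every VM boot.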

Changed in tripleo:
assignee: nobody → Clint Byrum (clint-fewbar)
status: Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-incubator (master)

Reviewed: https://review.openstack.org/61608
Committed: https://git.openstack.org/cgit/openstack/tripleo-incubator/commit/?id=661884b5c7a47d01171c680c83b601d3c9a15d9f
Submitter: Jenkins
Branch: master

commit 661884b5c7a47d01171c680c83b601d3c9a15d9f
Author: Clint Byrum <email address hidden>
Date: Wed Dec 11 15:33:01 2013 -0800

    Wait for Neutron L2 Agent on Compute Node

    The L2 Agent sometimes does not register until later on in the
    deployment for some reason. This is just a work-around until that bug
    can be properly understood.

    Change-Id: Idbbc977aa2e13f2026de05ae7e6571bc9dd0a498
    Closes-Bug: #1260440
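
The commit itself is not reproduced here, so the following is only a generic sketch of the "wait for the agent" pattern the message describes, not the tripleo-incubator change. `wait_for` and its parameters are hypothetical names; the `check` callable stands in for whatever test reports that the L2 agent has registered.

```python
import time


def wait_for(check, timeout=300.0, interval=5.0):
    """Poll check() until it returns a true value or timeout seconds pass.

    Returns True if check() succeeded, False if the deadline was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in check that succeeds on the third poll.
attempts = iter([False, False, True])
print(wait_for(lambda: next(attempts), timeout=1.0, interval=0.01))
# prints True
```

A deployment script would run a loop like this on the compute node before booting instances, and fail loudly if it returns False.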

Changed in tripleo:
status: In Progress → Fix Released
Changed in neutron:
status: New → Confirmed
status: Confirmed → Triaged
tags: added: neutron-agent

Cedric Brandily (cbrandily) wrote :

This bug is > 365 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
status: Triaged → Incomplete

Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which led to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only release names that are still supported are valid (LIBERTY, MITAKA, OCATA, NEWTON).
  Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
importance: High → Undecided
status: Confirmed → Expired
Changed in neutron:
status: Incomplete → Invalid