[RFE] Ironic needs to synchronize external events with Neutron

Bug #1304673 reported by Adam Gandelman
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Ironic
Confirmed
Wishlist
Vasyl Saienko

Bug Description

Related to https://review.openstack.org/#/c/84361/ and https://bugs.launchpad.net/ironic/+bug/1300589

Updates to Neutron resources via its API are processed asynchronously on its backend. This exposes potential races with Ironic. Example: an API request from Ironic to update a port's DHCP settings will return successfully long before the associated dnsmasq config has been updated and the server restarted. There is a small potential for a race condition where Ironic boots a machine before its DHCP has been properly configured, especially if the machine boots very quickly (e.g. a local VM).

Though none are used by Ironic (yet?), other Neutron operations are dependent on some other state in Neutron. For instance, a firewall will stay PENDING until an associated router and router interface have been created and are ACTIVE.

We need a way to synchronize these events. During Icehouse, Nova solved almost identical issues regarding orchestration between Nova and Neutron via an admin API endpoint that Neutron can use to post back notifications. Ironic's Neutron usage is relatively limited at the moment, but providing a framework similar to Nova's for this type of orchestration would solve the current issues and allow future drivers to take advantage of other Neutron features.
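For reference, a rough sketch of the kind of notification body Neutron posts back to Nova's external events endpoint. The field names (`name`, `server_uuid`, `tag`, `status`) and the `network-vif-plugged` event name reflect my understanding of Nova's `os-server-external-events` API and should be checked against the API reference; the helper function itself is hypothetical and only illustrates the shape an Ironic equivalent might accept.

```python
import json


def build_vif_plugged_event(server_uuid, port_id, status="completed"):
    """Build a Nova-style external event body (hypothetical helper).

    The key names are an assumption based on Nova's external events
    API; verify them against the API reference before relying on them.
    """
    return {
        "events": [
            {
                "name": "network-vif-plugged",
                "server_uuid": server_uuid,
                "tag": port_id,  # Neutron port the event refers to
                "status": status,
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    body = build_vif_plugged_event("server-uuid-placeholder", "port-uuid-placeholder")
    print(json.dumps(body, indent=2))
```

The point is that the callback carries enough information (which instance, which port, what happened) for the receiver to unblock exactly the operation that was waiting on it.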

Tags: needs-spec rfe
aeva black (tenbrae)
Changed in ironic:
status: New → Triaged
importance: Undecided → Medium
Changed in ironic:
assignee: nobody → Shraddha Pandhe (shraddha-pandhe)
John Stafford (john-stafford) wrote :

Hi Shraddha,

Are you still working this bug?

Vasyl Saienko (vsaienko)
Changed in ironic:
assignee: Shraddha Pandhe (shraddha-pandhe) → Vasyl Saienko (vsaienko)
Vasyl Saienko (vsaienko) wrote :

The Neutron port remains in the DOWN state for Ironic instances; we should fix that first: https://bugs.launchpad.net/neutron/+bug/1599836

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/339489

Changed in ironic:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/345963

Ruby Loo (rloo) wrote :
Changed in ironic:
importance: Medium → Wishlist
summary: - Ironic needs to synchronize external events with Neutron
+ [RFE] Ironic needs to synchronize external events with Neutron
tags: added: rfe
tags: added: needs-spec
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (master)

Change abandoned by Vasyl Saienko (<email address hidden>) on branch: master
Review: https://review.openstack.org/339489

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Vasyl Saienko (<email address hidden>) on branch: master
Review: https://review.openstack.org/345963

Changed in ironic:
status: In Progress → Confirmed
Julia Kreger (juliaashleykreger) wrote :

So this is still an outstanding item, and stalled on the question of "how to know when to resume".

So to start off, where are we:

1) We can get events! https://github.com/openstack/ironic/blob/master/ironic/api/controllers/v1/event.py#L107
2) People even already configure it! But when you look at the link above you feel sad.
3) We *still* have a sleep in place deep inside Ironic's networking code.
https://github.com/openstack/ironic/blob/268b28f52782d20cd3f7bf27ead36438695b786a/ironic/dhcp/neutron.py#L160

I thought there was another sleep someplace, but we'll have to look for it.

Anyway, the issue is that Neutron sometimes takes time to complete binding, and we should wait some period of time for a callback from Neutron.

4) So, ideally, what we could do is set up a database table to append the events to, with a periodic task to delete any events older than, say, 15 minutes. We have some prior art here with the node history table.

5) We can then swap the sleep code around to look for an event in the new events table, and then unblock the flow once the event has been observed.

This would require an RPC change, to allow the inbound event to be submitted over RPC, as conductors are the database writers.

And the other conductor thread would just poll the events to determine what is required.

The event would remain in the table until the periodic task purges it.

The sleep(s) for configuration would then be updated to be the upper bound on how long to wait for network configuration to complete.
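To make steps 4 and 5 concrete, here is a minimal sketch of the idea, not Ironic code. It uses an in-memory SQLite table in place of Ironic's database layer, and the table name `network_events`, the function names, and the example event name are all hypothetical. It shows three pieces: appending an inbound event, a purge that a periodic task could call, and a wait that polls the table and gives up once the configured upper bound is reached.

```python
import sqlite3
import time

MAX_EVENT_AGE = 15 * 60  # seconds; the "say 15 minutes" from step 4


def create_table(conn):
    # Hypothetical schema, loosely modelled on an append-only history table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS network_events ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "port_id TEXT NOT NULL, "
        "event TEXT NOT NULL, "
        "created_at REAL NOT NULL)"
    )


def record_event(conn, port_id, event, now=None):
    """Append an inbound event (what the RPC-invoked writer would do)."""
    created_at = time.time() if now is None else now
    conn.execute(
        "INSERT INTO network_events (port_id, event, created_at) "
        "VALUES (?, ?, ?)",
        (port_id, event, created_at),
    )
    conn.commit()


def purge_old_events(conn, max_age=MAX_EVENT_AGE, now=None):
    """Delete events older than max_age seconds; return how many were removed."""
    cutoff = (time.time() if now is None else now) - max_age
    cur = conn.execute(
        "DELETE FROM network_events WHERE created_at < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount


def wait_for_event(conn, port_id, event, timeout, poll_interval=1.0):
    """Poll for a matching event; return False once timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        row = conn.execute(
            "SELECT 1 FROM network_events "
            "WHERE port_id = ? AND event = ? LIMIT 1",
            (port_id, event),
        ).fetchone()
        if row is not None:
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(poll_interval, remaining))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_table(conn)
    record_event(conn, "port-1", "network.bind_port")
    print(wait_for_event(conn, "port-1", "network.bind_port", timeout=1.0))
```

With this shape the existing sleep value becomes the `timeout` argument: the flow resumes as soon as the event is observed and only waits the full duration when no callback arrives. A real implementation would also need to ignore events recorded before the current operation started, which this sketch does not attempt.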
