Ironic gate breakage: deployed VMs do not get DHCP

Bug #1507558 reported by Dmitry Tantsur
Affects   Status         Importance   Assigned to       Milestone
Ironic    Invalid        Critical     Unassigned
neutron   Fix Released   Critical     Ihar Hrachyshka

Bug Description

See e.g. https://review.openstack.org/#/c/234186/. It started around midnight UTC, Mon Oct 19.

Dmitry Tantsur (divius) wrote :

I suspect Neutron is involved

Changed in ironic:
status: New → Confirmed
importance: Undecided → Critical

Ihar Hrachyshka (ihar-hrachyshka) wrote :

I believe the patch that could have broken it is: https://review.openstack.org/#/c/231031/5/neutron/db/l3_dvr_db.py It seems we assumed that the DVR mixin is triggered for DVR routers only, but I see DVR mentioned in the logs, and I see that the mixin is enabled for the generic L3 router plugin.

Changed in neutron:
importance: Undecided → Critical

Kevin Benton (kevinbenton) wrote :

Almost nothing went into Neutron around that time that should impact DHCP.

Can you get some more evidence that a change in Neutron seems to be the issue?

Changed in neutron:
status: New → Incomplete

Ihar Hrachyshka (ihar-hrachyshka) wrote :

Another recent patch that got us here is: https://review.openstack.org/#/c/215136/65/neutron/db/l3_dvr_db.py but it is probably not the culprit, because it merely hides the RouterNotFound exception, replacing it with a silent debug log. The comment there mentions a race condition, but it refers to the bug that the patch itself targets and gives no clue as to which race we are experiencing.

Overall, it seems weird that we get RouterNotFound from the db.

Ihar Hrachyshka (ihar-hrachyshka) wrote :

Some notes on the test failure:

- The test attaches a FIP to an instance and then attempts an SSH login using the FIP. The agent log contains no notification traces at the time the FIP is created, and in the server log we see:

http://logs.openstack.org/86/234186/3/check/gate-tempest-dsvm-ironic-pxe_ssh/317fd7e/logs/screen-q-svc.txt.gz#_2015-10-19_10_58_10_564

"2015-10-19 10:58:10.564 DEBUG neutron.db.l3_dvr_db [req-e5296590-770b-4ba1-a9d9-156b36f2117c tempest-BaremetalBasicOps-4546358 tempest-BaremetalBasicOps-193221136] Router 5f20788f-ddff-4a7f-bb07-7e564da0b3a9 not found. Just ingore this router. _notify_floating_ip_change /opt/stack/new/neutron/neutron/db/l3_dvr_db.py:706"

This means that we don't notify the agent, hence the FIP is not configured on its side.
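
The failure mode described above can be illustrated with a minimal, self-contained sketch. This is not Neutron's code: apart from the exception name RouterNotFound, which appears in this thread, every name here (get_router, notify_floating_ip_change, the in-memory ROUTERS dict) is hypothetical. It only shows how replacing a lookup error with a debug log makes the notification step disappear silently.

```python
import logging

LOG = logging.getLogger(__name__)


class RouterNotFound(Exception):
    """Stand-in for the exception named in the thread; not Neutron's class."""


# Hypothetical in-memory "database" of routers, keyed by router ID.
ROUTERS = {"router-a": {"id": "router-a"}}


def get_router(router_id):
    """Hypothetical lookup that raises when the router is unknown."""
    try:
        return ROUTERS[router_id]
    except KeyError:
        raise RouterNotFound(router_id)


def notify_floating_ip_change(floating_ip, sent):
    """Append a notification to `sent` unless the router lookup fails.

    The except branch mirrors the pattern discussed above: the error is
    logged at debug level and the function returns without notifying.
    """
    try:
        router = get_router(floating_ip["router_id"])
    except RouterNotFound:
        LOG.debug("Router %s not found, skipping", floating_ip["router_id"])
        return False
    sent.append((router["id"], floating_ip["id"]))
    return True


sent = []
ok = notify_floating_ip_change({"id": "fip-1", "router_id": "router-a"}, sent)
missed = notify_floating_ip_change({"id": "fip-2", "router_id": "router-b"}, sent)
```

In this sketch the second call returns without raising and without appending anything to `sent`, so a caller that ignores the return value has no way to tell that the agent was never notified.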

Changed in neutron:
status: Incomplete → Confirmed

Ihar Hrachyshka (ihar-hrachyshka) wrote :

Actually, it seems that both patches break notifications (the first one for create, the second one for update), but only the latter triggered the Ironic gate breakage, because the test in question updates an existing FIP with the address rather than creating it with the address from the very start.

I believe the real reason why we broke is that our L3/DVR/HA implementation is so reliant on random mixins melded in random order, in a twisted way that made the author think that a mixin called smth_DVR_smth is used for DVR routers only, while in reality it seems to be used unconditionally.

I wouldn't blame the author for such a mistake...
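
The mixin confusion described above can be sketched in a few lines. This is a hypothetical illustration, not Neutron's class hierarchy: the class names, the method name notify_fip_update, and the use of a "distributed" key to mark DVR routers are all assumptions made for the example. It shows that a mixin placed ahead of the base class handles every router regardless of its name, unless it explicitly checks the router type.

```python
class L3Base:
    """Hypothetical base plugin with the generic notification behaviour."""

    def notify_fip_update(self, router):
        return "notify all agents hosting %s" % router["id"]


class DvrNamedMixin:
    """Name suggests DVR-only, but nothing restricts it to DVR routers."""

    def notify_fip_update(self, router):
        # Called for every router, because this class comes first in the
        # method resolution order of any plugin that mixes it in.
        return "notify DVR agent for %s" % router["id"]


class GuardedDvrMixin:
    """Same override, but it defers to the next class for non-DVR routers."""

    def notify_fip_update(self, router):
        if not router.get("distributed"):
            return super().notify_fip_update(router)
        return "notify DVR agent for %s" % router["id"]


class Plugin(DvrNamedMixin, L3Base):
    pass


class GuardedPlugin(GuardedDvrMixin, L3Base):
    pass


legacy_router = {"id": "r1", "distributed": False}
dvr_router = {"id": "r2", "distributed": True}

unguarded_legacy = Plugin().notify_fip_update(legacy_router)
guarded_legacy = GuardedPlugin().notify_fip_update(legacy_router)
guarded_dvr = GuardedPlugin().notify_fip_update(dvr_router)
```

With the unguarded mixin, the legacy router goes down the DVR code path. The guarded variant only takes that path when the router is marked as distributed.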

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/236955

Changed in neutron:
assignee: nobody → Ihar Hrachyshka (ihar-hrachyshka)
status: Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/236955
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a2f7e0343a147a30a637af4e1cb9a866f557e87d
Submitter: Jenkins
Branch: master

commit a2f7e0343a147a30a637af4e1cb9a866f557e87d
Author: Ihar Hrachyshka <email address hidden>
Date: Mon Oct 19 12:44:58 2015 +0000

    Revert "DVR: Notify specific agent when update floatingip"

    This reverts commit 52e91f48f2327b47f126893f9cb12f153380a9a6.

    The patch broke notifications about FIP updates and triggered 100%
    gate failures for Ironic gate.

    I believe that I0cbe8c51c3714e6cbdc48ca37135b783f8014905 is also
    breaking notifications, but for FIP create, which probably was not
    utilized in any gate before and hence not caught in time.

    The change the reverted patch introduced made update_floatingip to
    fetch router based on FIP router_id field on every call, which was
    not the case before the patch. For some reason unknown at the
    moment, we get NotFound from database on this fetch.

    The patch does not answer the question why we get NotFound from
    database on fetching a FIP router_id, but that's another issue that
    should be investigated while Ironic gate is happy.

    Change-Id: I4affac49d7c63f47c5654b94b28f4cb7471e87b0
    Closes-Bug: #1507558
    Related-Bug: #1507602

Changed in neutron:
status: In Progress → Fix Committed
Dmitry Tantsur (divius)
Changed in ironic:
status: Confirmed → Invalid

Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b1

This issue was fixed in the openstack/neutron 8.0.0.0b1 development milestone.

Changed in neutron:
status: Fix Committed → Fix Released