KeyError on "subnet_dhcp_ip = subnet_to_interface_ip[subnet.id]"

Bug #1624079 reported by Richard Theis
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Kevin Benton

Bug Description

The networking-ovn gate-tempest-dsvm-networking-ovn job is seeing KeyErrors on "subnet_dhcp_ip = subnet_to_interface_ip[subnet.id]". It is unclear yet if this is contributing to the recent job failures.

LogStash Query:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22subnet_dhcp_ip%20%3D%20subnet_to_interface_ip%5Bsubnet.id%5D%5C%22

Revision history for this message
Hirofumi Ichihara (ichihara-hirofumi) wrote :

I can see many hits in Logstash. But I'm not sure whether neutron bug or ovn bug.

tags: added: l3-ipam-dhcp
tags: added: gate-failure
Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Richard Theis (rtheis) wrote :

I'm assuming this is a neutron bug since OVN isn't managing DHCP. Yesterday, Kevin Benton mentioned on IRC that he noticed an incorrect assumption being made. You may want to check with him on the details.

Revision history for this message
Richard Theis (rtheis) wrote :

I wonder if this may be related to the networking-ovn DHCP agent configuration: see https://review.openstack.org/#/c/374186/

Revision history for this message
Richard Theis (rtheis) wrote :

After a couple test runs with https://review.openstack.org/#/c/374186/, it appears that the problems may be related to the DHCP configuration options force_metadata=True and enable_isolated_metadata=True.

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

Just took a quick look at the traceback [1] and I thought I'd leave a few breadcrumbs behind. The index error happens here [2]. Looks like the interface for the subnet doesn't exist when it is expected to. Could it be a race between creating the interface and this code? That's just off the top of my head.

[1] http://paste.openstack.org/show/582442/
[2] https://github.com/openstack/neutron/blob/4ff772ff07c6f4275cdcdbf3ebaf1642ff721420/neutron/agent/linux/dhcp.py#L888

Revision history for this message
Richard Theis (rtheis) wrote :

In additional to the key error, the following errors are also seen in the DHCP logs.

Cannot remove namespace file:
http://logs.openstack.org/74/353074/7/check/gate-tempest-dsvm-networking-ovn/846e796/logs/screen-q-dhcp.txt.gz#_2016-09-22_11_22_37_032

NoFilterMatched:
http://logs.openstack.org/74/353074/7/check/gate-tempest-dsvm-networking-ovn/846e796/logs/screen-q-dhcp.txt.gz#_2016-09-22_11_34_59_137

With the networking-ovn workaround (https://review.openstack.org/#/c/374186/), the key errors and NoFilterMatched errors no longer occur. The workaround successfully completed the gate tests 7 times while the gate-tempest-dsvm-networking-ovn failure rate was around 50%.

Revision history for this message
Richard Theis (rtheis) wrote :

A couple more items to note:
1) "Timeout waiting for vif plugging" failures in nova are often seen with the errors because DHCP fails to complete port provisioning before the timeout.
2) Problems similar to https://bugs.launchpad.net/neutron/+bug/1622616 are often seen with the errors.

tags: added: newton-rc-potential
Revision history for this message
Richard Theis (rtheis) wrote :

I was able to recreate the failures noted in comments #6 and #7 with neutron as well. See https://review.openstack.org/#/c/374888/ and http://logs.openstack.org/88/374888/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/cf76519/

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

The subnet_to_interface_ip represents addresses on a tap device that also belong to one of subnets of the network. Looking for 'ip addr add' commands executed for a tap device in question in logs, I don't see any except for metadata ip. The list of ips to set inside init_l3 is build in DeviceManager.setup(). Apparently, setup_dhcp_port returned a port with some fixed ips missing. Can it suggest plugin somehow missing returning fixed ips for a port? Or not allocating any for ipv6 stateful network?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/375562

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: master
Review: https://review.openstack.org/375562
Reason: Wrong approach: the cache is supposed to reflect backend state not database.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/375645

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/366291
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d4f92ede9ed52cd0b9059d993e3cf1f42d5ff57e
Submitter: Jenkins
Branch: master

commit d4f92ede9ed52cd0b9059d993e3cf1f42d5ff57e
Author: Brian Haley <email address hidden>
Date: Tue Sep 6 14:44:41 2016 -0400

    Fix dhcp_release6 error when not supported

    If the system does not have a version of dnsmasq that supports
    dhcp_release6, warn about but otherwise handle it gracefully, and also
    don't even try to execute it the next time.

    Closes-Bug: #1622002
    Related-Bug: #1624079
    Change-Id: I6bb9547f9d9a9fcfb2357521f3f5bd1dc1dd5136

Revision history for this message
Kevin Benton (kevinbenton) wrote :

According to logstash[1], these failures stopped at about the same time (~ Sept 21st, 2:00AM PDT) this pair of patches went in: [2], [3]. I think either one would have fixed the issue of reloading based on inconsistent data between the active DHCP subnets and the DHCP agent's interface due to stale update messages.

1. http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22subnet_dhcp_ip%20%3D%20subnet_to_interface_ip%5Bsubnet.id%5D%5C%22
2. https://github.com/openstack/neutron/commit/a7e528bc8fe2dc957187fdd7eb135dbed8a762cf
3. https://github.com/openstack/neutron/commit/2e14c087a9ec6773ed0072cc9c0d9a42785343a0

Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
Revision history for this message
Kevin Benton (kevinbenton) wrote :
Revision history for this message
Kevin Benton (kevinbenton) wrote :

No luck. Logstash was just really far behind. Looking into it

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/375645
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f74a6c047f02e077e4fd943e2eb256df82e05509
Submitter: Jenkins
Branch: stable/newton

commit f74a6c047f02e077e4fd943e2eb256df82e05509
Author: Brian Haley <email address hidden>
Date: Tue Sep 6 14:44:41 2016 -0400

    Fix dhcp_release6 error when not supported

    If the system does not have a version of dnsmasq that supports
    dhcp_release6, warn about but otherwise handle it gracefully, and also
    don't even try to execute it the next time.

    Closes-Bug: #1622002
    Related-Bug: #1624079

    (cherry picked from commit d4f92ede9ed52cd0b9059d993e3cf1f42d5ff57e)
    Change-Id: I6bb9547f9d9a9fcfb2357521f3f5bd1dc1dd5136

tags: added: in-stable-newton
Changed in neutron:
milestone: none → newton-rc2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/375791
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d1fb4238306f854ab610204a50c5cdc20ea5c770
Submitter: Jenkins
Branch: master

commit d1fb4238306f854ab610204a50c5cdc20ea5c770
Author: Kevin Benton <email address hidden>
Date: Thu Sep 22 23:38:33 2016 -0700

    Fixes for missing IPs on subnets in DHCP agent

    There is a race condition server-side where a port request containing
    a subnet_id is processed at the same time the subnet is being deleted,
    the port operation may be successful without having a fixed IP on the
    requested subnet. This patch makes the DHCP agent resillient to this
    bug by checking the port response and raising a SubnetMismatchForPort
    to trigger a resync if it doesn't have all of the requested subnet IDs.

    Additionally, it avoids skipping assignment of IPv6 addresses to the
    interface if they are stateless. The original logic to skip assignment
    was only meant to be for SLAAC addresses.

    Both of these issues were resulting in the KeyError observed in the
    bug report.

    Related-Bug: #1627480
    Closes-Bug: #1624079
    Change-Id: I85ef1f4d60efd0309d6a0706e29fdbcc16f0b59d

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/377089

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/377089
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=571af6bc37315ff023981904bd2592aa8ec0ec14
Submitter: Jenkins
Branch: stable/newton

commit 571af6bc37315ff023981904bd2592aa8ec0ec14
Author: Kevin Benton <email address hidden>
Date: Thu Sep 22 23:38:33 2016 -0700

    Fixes for missing IPs on subnets in DHCP agent

    There is a race condition server-side where a port request containing
    a subnet_id is processed at the same time the subnet is being deleted,
    the port operation may be successful without having a fixed IP on the
    requested subnet. This patch makes the DHCP agent resillient to this
    bug by checking the port response and raising a SubnetMismatchForPort
    to trigger a resync if it doesn't have all of the requested subnet IDs.

    Additionally, it avoids skipping assignment of IPv6 addresses to the
    interface if they are stateless. The original logic to skip assignment
    was only meant to be for SLAAC addresses.

    Both of these issues were resulting in the KeyError observed in the
    bug report.

    Related-Bug: #1627480
    Closes-Bug: #1624079
    Change-Id: I85ef1f4d60efd0309d6a0706e29fdbcc16f0b59d
    (cherry picked from commit d1fb4238306f854ab610204a50c5cdc20ea5c770)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/377140

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/377142

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Did didn't work as it should have. It's safer to back out and take this slow.

Changed in neutron:
status: Fix Released → Confirmed
milestone: newton-rc2 → ocata-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/377626

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/377626
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c0d0986e8b7412f7c37db799177dc529ce9045d2
Submitter: Jenkins
Branch: master

commit c0d0986e8b7412f7c37db799177dc529ce9045d2
Author: Kevin Benton <email address hidden>
Date: Sun Sep 25 15:49:19 2016 -0700

    Correct fix for IPv6 auto address interfaces

    This is a partial revert of d1fb4238306f854ab610204a50c5cdc20ea5c770.
    That patch attempted to allow the DHCP agent to manage IPv6 interfaces
    to avoid a KeyError in the subnet_interface_map. However, this led
    to races with router advertisements and static configuration
    (see bug 1627902).

    The correct fix was actually a very simple conditional that was broken
    because this map shouldn't have been referenced for IPv6 subnets in
    the first place. The force metadata option was shortcutting the whole
    evaluation so the agent was trying to add a metadata route to a v6
    subnet.

    This patch undoes the v6 changes from
    d1fb4238306f854ab610204a50c5cdc20ea5c770, corrects the conditional,
    and adds a simple unit test that ensures the branch doesn't throw
    and error.

    Closes-Bug: #1624079
    Change-Id: Ide494b6333a4f1e279ab58aa27c0aa719e79545d

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/377793

Changed in neutron:
milestone: ocata-1 → newton-rc2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/377793
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=98c4c0876d4df31cf1eb34a2bcc40660e938a2ef
Submitter: Jenkins
Branch: stable/newton

commit 98c4c0876d4df31cf1eb34a2bcc40660e938a2ef
Author: Kevin Benton <email address hidden>
Date: Sun Sep 25 15:49:19 2016 -0700

    Correct fix for IPv6 auto address interfaces

    This is a partial revert of d1fb4238306f854ab610204a50c5cdc20ea5c770.
    That patch attempted to allow the DHCP agent to manage IPv6 interfaces
    to avoid a KeyError in the subnet_interface_map. However, this led
    to races with router advertisements and static configuration
    (see bug 1627902).

    The correct fix was actually a very simple conditional that was broken
    because this map shouldn't have been referenced for IPv6 subnets in
    the first place. The force metadata option was shortcutting the whole
    evaluation so the agent was trying to add a metadata route to a v6
    subnet.

    This patch undoes the v6 changes from
    d1fb4238306f854ab610204a50c5cdc20ea5c770, corrects the conditional,
    and adds a simple unit test that ensures the branch doesn't throw
    and error.

    Closes-Bug: #1624079
    Change-Id: Ide494b6333a4f1e279ab58aa27c0aa719e79545d
    (cherry picked from commit c0d0986e8b7412f7c37db799177dc529ce9045d2)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0rc2

This issue was fixed in the openstack/neutron 9.0.0.0rc2 release candidate.

tags: removed: newton-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.0.0b1

This issue was fixed in the openstack/neutron 10.0.0.0b1 development milestone.

tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.