Deleting csnat port fails due to no fixed ips

Bug #1609540 reported by Carl Baldwin on 2016-08-03
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
High
Miguel Lavalle

Bug Description

This code [1] ends up emitting an "IndexError: list index out of range" exception that ends up with a trace like this [2]. Essentially, there are no fixed ips on the port. Not sure yet how it got in to this state. This failure is linked to various tempest failures in gate-tempest-dsvm-neutron-dvr and gate-tempest-dsvm-neutron-dvr-multinode-full. Here are tests which have failed due to this. This logstash [3] finds them.

tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_multi_prefix_dhcpv6_stateless
tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_multi_prefix_slaac

[1] https://github.com/openstack/neutron/blob/5e8b8274fe94ca9eafbfe951134326df3a60373d/neutron/db/l3_dvr_db.py#L906
[2] http://paste.openstack.org/show/547840/
[3] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22if%20p%5B'fixed_ips'%5D%5B0%5D%5B'subnet_id'%5D%20%3D%3D%20subnet_id%5C%22

Carl Baldwin (carl-baldwin) wrote :

Kevin mentioned that he suspects that the code that handles multiple ipv6 addresses on a single csnat port may be the culprit. It is difficult to follow and complicates things. It was added in this patch [1].

Each ipv4 subnet gets its own port, so why not the ipv6 ones?

[1] https://review.openstack.org/#/c/225319

Changed in neutron:
status: New → Confirmed
importance: Undecided → Critical
tags: added: gate-failure ipv6 l3-dvr-backlog l3-ipam-dhcp
Changed in neutron:
assignee: nobody → Carl Baldwin (carl-baldwin)
Carl Baldwin (carl-baldwin) wrote :

The dualnet aspect of the test causes tempest to create a different network for the ipv6 subnets [1]. The multi_prefix aspect creates two ipv6 subnets instead of one. There are other tests which test both of these aspects individually [2] and with neither aspect but I haven't seen those tests fail in this way.

[1] https://github.com/openstack/tempest/blob/f8012dde023351dd7df9c726a3f616c3ac752160/tempest/scenario/test_network_v6.py#L79
[2] https://github.com/openstack/tempest/blob/f8012dde023351dd7df9c726a3f616c3ac752160/tempest/scenario/test_network_v6.py#L215-L249

Fix proposed to branch: master
Review: https://review.openstack.org/350783

Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Kevin Benton (kevinbenton)
status: Confirmed → In Progress
Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Carl Baldwin (carl-baldwin)
Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Kevin Benton (kevinbenton)
Carl Baldwin (carl-baldwin) wrote :

It occurred to me that regular router ports also use a single port for multiple ipv6 subnets. The csnat ports were likely designed to mirror this behavior. I also have a vague recollection of asking why this was the case. I think we were in the Adobe office in Utah for the Neutron mid-cycle when I asked. I don't recall the response. With that in mind, I dig this up [1]

[1] https://review.openstack.org/#/c/156360/25//COMMIT_MSG

Carl Baldwin (carl-baldwin) wrote :

When the proposed fix merges, we should file a follow up bug to keep an eye on the issue using the new log message.

Carl Baldwin (carl-baldwin) wrote :

I looked a little bit in the logs for a failure [1]. The router interface that got the 500 was the ipv4 one. Yet, this seems to only happen when coupled with another network (dualnet) with two ipv6 subnets. I still haven't figure it out and need to step away.

[1] http://logs.openstack.org/87/350387/2/check/gate-tempest-dsvm-neutron-dvr/a7cf20a/console.html

Reviewed: https://review.openstack.org/350783
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bcef61703061c578826a3ea26fbf846b6670fb0e
Submitter: Jenkins
Branch: master

commit bcef61703061c578826a3ea26fbf846b6670fb0e
Author: Kevin Benton <email address hidden>
Date: Tue Aug 2 18:46:11 2016 -0700

    Fix indexerror in delete_csnat_port

    The code assumed every port returned from the csnat port query
    would have a fixed_ip that it could compare the subnet it is
    looking for to. This should be a valid assumption however there
    is a path leading to a condition where it has no IPs. This makes
    the cleanup code handle this case and dump a warning until we can
    figure out what causes the interface to lose the IP.

    Partial-Bug: #1609540
    Change-Id: Ida024a231bb3fc09dad0e80498f57a5761ca3420

Carl Baldwin (carl-baldwin) wrote :

I'm going to watch this to see if "CSNAT port has no IPs" appears anywhere in logstash. So far, no occurrences of it but we should start to see them.

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Carl Baldwin (carl-baldwin)
Miguel Lavalle (minsel) wrote :

We need to keep watching logstash for occurrences of "CSNAT port has no IPs". I'll check with infra to make sure we get debug level messages from logstash

If this is no longer offending in the gate can we demote its severity?

Carl Baldwin (carl-baldwin) wrote :

This isn't causing gate failures anymore. But, we still need to treat it like a bug because we haven't found the root cause yet.

Changed in neutron:
importance: Critical → High
tags: removed: gate-failure
Carl Baldwin (carl-baldwin) wrote :

I wonder if my logstash queries are not searching debug log messages.

Miguel Lavalle (minsel) wrote :

I checked with the infra team today. Logstash / kibana doesn't capture debug level messages. That is why we never see the "CSNAT port has no IPs" message

Carl Baldwin (carl-baldwin) wrote :

I was afraid of that, especially since it was my suggestion to make it debug level in the first place. So, now to get any kind of data about this problem, we'd have to bump up the log level of that message.

Reviewed: https://review.openstack.org/370964
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=268e10ef4d3106e2bc7b4e679bcba67b4c9f0e9e
Submitter: Jenkins
Branch: master

commit 268e10ef4d3106e2bc7b4e679bcba67b4c9f0e9e
Author: Carl Baldwin <email address hidden>
Date: Thu Sep 15 09:47:40 2016 -0600

    Raise level of message to info

    We need to be able to search for occurences of this issue in the gate.
    But, logstash doesn't index debug messages. So, they are impossible to
    search for. Hence, raising the level to info which, I'm told, are
    indexed.

    Change-Id: Ie9116c362f3e3d5d424cb957dae23bdd222520fb
    Related-Bug: #1609540

Oleg Bondarev (obondarev) wrote :

Comment 13 from https://bugs.launchpad.net/neutron/+bug/1540983 says: "The test has been skipped in the gate since Aug. 3rd with: https://review.openstack.org/#/c/275457/ so I don't think the lack of elastic search hits is necessarily indicative of a fix." - may also explain why we can't see those logs.

Miguel Lavalle (minsel) wrote :

Following up https://review.openstack.org/#/c/374106. Checking for "IndexError: list index out of range"

Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Miguel Lavalle (minsel)
Miguel Lavalle (minsel) wrote :

Not seen over the past 4 weeks by triggering tests with this patchset: https://review.openstack.org/#/c/374106

Changed in neutron:
status: In Progress → Invalid
tags: added: neutron-proactive-backport-potential
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers