pyroute2 0.5.4 breaks nested deployments

Bug #1824846 reported by Michal Dulko
This bug affects 3 people
Affects: kuryr-kubernetes
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

It seems that some change in pyroute2 0.5.4 is breaking nested deployments. We get the following errors:

- RuntimeError (http://paste.openstack.org/show/749299/)
- NetlinkError: (17, 'File exists') (http://paste.openstack.org/show/749306/)

Workaround is to downgrade pyroute2 to 0.5.3.
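
A deployment script could guard against the affected release before starting. The following is only a minimal sketch: the helper names `parse_version` and `is_known_bad` are hypothetical, and the set of bad versions contains only 0.5.4, the one release this report names. Nothing is claimed about other releases.

```python
def parse_version(text):
    """Turn a plain dotted version such as '0.5.4' into (0, 5, 4).

    Suffixes like 'rc1' or '.dev0' are not handled in this sketch.
    """
    return tuple(int(part) for part in text.split("."))


# Only the release named in this bug report is listed here (assumption:
# no other release is known to be affected).
KNOWN_BAD = {(0, 5, 4)}


def is_known_bad(version_text):
    """Return True if the given pyroute2 version is the one reported broken."""
    return parse_version(version_text) in KNOWN_BAD
```

On Python 3.8 or later the installed version string can be obtained with `importlib.metadata.version("pyroute2")` and passed to `is_known_bad`.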

Changed in kuryr-kubernetes:
status: New → Incomplete
status: Incomplete → Confirmed
status: Confirmed → Triaged
importance: Undecided → High
Nayan Deshmukh (ndesh26) wrote :

I was unable to reproduce the RuntimeError. However, I was able to reproduce the NetlinkError.

I ran a git bisect to find the commit that causes this. The following commit is responsible for the error:

commit 4420185675b9ca9f71f7110653ee77b957ebbbcb
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Jan 17 16:37:31 2019 +0000

    Stop NetNS server gracefully

    This patch changes the way of stopping the NetNS server, by sending
    a SIGTERM signal and waiting for the server loop to finish. When the
    signal is received, the loop control flag is inverted. Once the loop
    is finished, the server process ends.

    This patch also modifies the Transport class receiver function loops.
    When the Transport object is closed, the receiver loops are stopped but
    not the file descriptors. Those file descriptors, created in the NetNS
    parent class, are closed at the end of the NetNS.close function, once
    the child process (Server) is finished and the Transport receiver loops
    are stopped. At this point, the file descriptors are not in use and can
    be closed.

    closes #578

From what I understood, with this patch, when the namespace is deleted, not all the files related to the iface that belongs to the namespace are deleted. Hence, when we try to create a new interface with the same name on a new CNI ADD request, we encounter this error.

One potential fix is to remove the existing iface with the same name, if it exists, as we do in the Bridge driver.

I will submit a patch with this fix.
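
The idea described above (remove a leftover interface with the same name before creating a new one) could look roughly like the sketch below. It assumes an object that exposes pyroute2's `link_lookup` and `link` calls, such as `pyroute2.IPRoute()` or a `pyroute2.NetNS` instance. The helper name `remove_stale_link` is hypothetical, and this is not the code from the actual kuryr-kubernetes patch.

```python
def remove_stale_link(ipr, ifname):
    """Delete every link named `ifname` that `ipr` can see.

    `ipr` is expected to behave like pyroute2.IPRoute: link_lookup()
    returns a list of interface indices, and link("del", index=...)
    removes one interface. Returns the number of links removed.
    """
    indices = ipr.link_lookup(ifname=ifname)
    for index in indices:
        ipr.link("del", index=index)
    return len(indices)
```

Called before the interface is created on a CNI ADD request, this would make the creation step idempotent with respect to leftovers. Running it against a real `IPRoute` or `NetNS` object requires sufficient privileges to delete links.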

Nayan Deshmukh (ndesh26) wrote :

I was finally able to find the cause of the error. The above-mentioned patch introduced a regression in the pyroute2 library. The error was due to leaked FDs, which led to NetlinkError: (17, 'File exists').

I have submitted a patch to fix the error (https://github.com/svinota/pyroute2/pull/624); with this patch the error should not happen. I was still unable to reproduce the RuntimeError with pyroute2 version 0.5.6.

Michal Dulko (michal-dulko-f) wrote :

Okay, seems like it's fixed now. We had a lot of tests with 0.5.6 as well and never hit the issue again.

Changed in kuryr-kubernetes:
status: Triaged → Fix Committed
status: Fix Committed → Fix Released