TestNetworkBasicOps:test_subnet_details intermittently fails with "cat: can't open '/var/run/udhcpc..pid': No such file or directory"

Bug #1813198 reported by Matt Riedemann
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tempest
Fix Released
High
Matt Riedemann

Bug Description

Seen here:

http://logs.openstack.org/78/570078/17/check/tempest-slow/161ea32/job-output.txt.gz#_2019-01-24_18_26_22_886987

2019-01-24 18:26:22.886987 | controller | {0} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_subnet_details [125.520364s] ... FAILED
2019-01-24 18:26:22.887067 | controller |
2019-01-24 18:26:22.887166 | controller | Captured traceback:
2019-01-24 18:26:22.887251 | controller | ~~~~~~~~~~~~~~~~~~~
2019-01-24 18:26:22.887370 | controller | Traceback (most recent call last):
2019-01-24 18:26:22.887545 | controller | File "tempest/common/utils/__init__.py", line 89, in wrapper
2019-01-24 18:26:22.887663 | controller | return f(*func_args, **func_kwargs)
2019-01-24 18:26:22.887868 | controller | File "tempest/scenario/test_network_basic_ops.py", line 629, in test_subnet_details
2019-01-24 18:26:22.887952 | controller | renew_delay),
2019-01-24 18:26:22.888141 | controller | File "tempest/lib/common/utils/test_utils.py", line 107, in call_until_true
2019-01-24 18:26:22.888237 | controller | if func(*args, **kwargs):
2019-01-24 18:26:22.888443 | controller | File "tempest/scenario/test_network_basic_ops.py", line 621, in check_new_dns_server
2019-01-24 18:26:22.888583 | controller | dhcp_client=CONF.scenario.dhcp_client)
2019-01-24 18:26:22.888776 | controller | File "tempest/common/utils/linux/remote_client.py", line 140, in renew_lease
2019-01-24 18:26:22.888957 | controller | return getattr(self, '_renew_lease_' + dhcp_client)(fixed_ip=fixed_ip)
2019-01-24 18:26:22.889161 | controller | File "tempest/common/utils/linux/remote_client.py", line 116, in _renew_lease_udhcpc
2019-01-24 18:26:22.889279 | controller | format(path=file_path, nic=nic_name))
2019-01-24 18:26:22.889474 | controller | File "tempest/lib/common/utils/linux/remote_client.py", line 33, in wrapper
2019-01-24 18:26:22.889595 | controller | return function(self, *args, **kwargs)
2019-01-24 18:26:22.889793 | controller | File "tempest/lib/common/utils/linux/remote_client.py", line 108, in exec_command
2019-01-24 18:26:22.890231 | controller | return self.ssh_client.exec_command(cmd)
2019-01-24 18:26:22.890402 | controller | File "tempest/lib/common/ssh.py", line 202, in exec_command
2019-01-24 18:26:22.890520 | controller | stderr=err_data, stdout=out_data)
2019-01-24 18:26:22.890848 | controller | tempest.lib.exceptions.SSHExecCommandFailed: Command 'set -eu -o pipefail; PATH=$PATH:/sbin; cat /var/run/udhcpc..pid', exit status: 1, stderr:
2019-01-24 18:26:22.891027 | controller | cat: can't open '/var/run/udhcpc..pid': No such file or directory
2019-01-24 18:26:22.891068 | controller |
2019-01-24 18:26:22.891142 | controller | stdout:

Looks like the problem would be in the file name "udchpc..pid" -- too many extension separator dots. Maybe something in this code:

http://git.openstack.org/cgit/openstack/tempest/tree/tempest/common/utils/linux/remote_client.py#n111

    def _renew_lease_udhcpc(self, fixed_ip=None):
        """Renews DHCP lease via udhcpc client. """
        file_path = '/var/run/udhcpc.'
        nic_name = self.get_nic_name_by_ip(fixed_ip)
        pid = self.exec_command('cat {path}{nic}.pid'.
                                format(path=file_path, nic=nic_name))
        pid = pid.strip()
        cmd = 'sudo /bin/kill -{sig} {pid}'.format(pid=pid, sig='USR1')
        self.exec_command(cmd)

The nic_name must be coming back empty and that's how we get /var/run/udhcpc..pid.

Tags: tempest
Revision history for this message
Matt Riedemann (mriedem) wrote :

That code also looks buggy since _renew_lease_udhcpc takes fixed_ip=None meaning it's optional, but get_nic_name_by_ip assumes fixed_ip is not None.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I also don't see anything within tempest itself calling that _renew_lease_udhcpc method...

Revision history for this message
Matt Riedemann (mriedem) wrote :

Oh it's called from here:

http://git.openstack.org/cgit/openstack/tempest/tree/tempest/common/utils/linux/remote_client.py#n140

Looks like right before the failure we run that command:

2019-01-24 18:26:06,439 5033 DEBUG [tempest.lib.common.utils.linux.remote_client] Remote command: set -eu -o pipefail; PATH=$PATH:/sbin; ip -o addr | awk '/10.1.0.3/ {print $2}'

But it doesn't log the result.

I'm also seeing this in the console output:

"Failed to update DNS nameservers"

Looks like this started around Jan 20:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22cat%3A%20can't%20open%20'%2Fvar%2Frun%2Fudhcpc..pid'%3A%20No%20such%20file%20or%20directory%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/521825/ merged on Jan 20 which changed the default cirros version to 0.4.0. Maybe something in that version is breaking something with how tempest works.

LIU Yulong (dragon889)
Changed in neutron:
status: New → Confirmed
Changed in neutron:
importance: Undecided → High
tags: added: tempest
Revision history for this message
Lajos Katona (lajos-katona) wrote :
Revision history for this message
Lajos Katona (lajos-katona) wrote :

I tried to reproduce manually doing the same kill /var/run/udhcp.ethx.pid inside cirros 0.4, without success, so hard to say what causing the disappearance of the pid file.

Revision history for this message
Matt Riedemann (mriedem) wrote :

It might not be that the pid file is missing, but the parsing code in tempest could be broken. We probably need a patch to tempest to log the output of the command that is used to parse the nic name for the pid file.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/633225

Revision history for this message
Matt Riedemann (mriedem) wrote :

Logstash now shows this is also failing on xenial nodes, so probably less to do with bionic and more to do with that new cirros 0.4.0 version we're using in devstack?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/633502

Changed in tempest:
assignee: nobody → Bence Romsics (bence-romsics)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/633566

Changed in tempest:
assignee: Bence Romsics (bence-romsics) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/633706

Matt Riedemann (mriedem)
Changed in tempest:
assignee: Matt Riedemann (mriedem) → Bence Romsics (bence-romsics)
importance: Undecided → High
no longer affects: neutron
Changed in tempest:
assignee: Bence Romsics (bence-romsics) → Matt Riedemann (mriedem)
Revision history for this message
Matt Riedemann (mriedem) wrote :

The failures seem to have magically dropped off after Jan 29:

http://status.openstack.org/elastic-recheck/#1813198

Looking at git commit history for devstack/tempest/requirements/neutron/nova it's not clear to me what has changed to make this work now.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I don't know if https://github.com/openstack/neutron/commit/99f4495c940011293e3cabbb590770dc1e7b6900 would be related to inadvertently fixing this, slaweq doesn't think so.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Ignore comment 13 - all logstash graphs dropped off around Jan 19 so it's a problem with logstash/elastic-recheck and the bug is not fixed yet.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/633566
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=f853ef7a250ebfb03bd3d761d66fea0851446101
Submitter: Zuul
Branch: master

commit f853ef7a250ebfb03bd3d761d66fea0851446101
Author: Matt Riedemann <email address hidden>
Date: Mon Jan 28 12:31:18 2019 -0500

    Skip test_subnet_details slow test until bug 1813198 is fixed

    The bug is killing throughput in our CI system [1] so until it's
    fixed I'm proposing that we skip it.

    http://status.openstack.org/elastic-recheck/#1813198
    http://status.openstack.org/elastic-recheck/gate.html#1813198

    Change-Id: I1a6f35dcb15110297cd491b902d7a8279861997e
    Related-Bug: #1813198

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/633225
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=acd6ecdbd9d33814d4b27359ad290cad5069129d
Submitter: Zuul
Branch: master

commit acd6ecdbd9d33814d4b27359ad290cad5069129d
Author: Matt Riedemann <email address hidden>
Date: Fri Jan 25 10:52:08 2019 -0500

    Log the result of command in get_nic_name_by_ip

    We need to know the result of the command run
    in get_nic_name_by_ip to see what is getting parsed
    since we are returning the empty string in some
    cases which results in an error because
    /var/run/udhcpc..pid does not exist in the guest.

    Change-Id: Ie15ac360a7fd384583607d5bf1d8ce7a3d2c3576
    Related-Bug: #1813198

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/633502
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=199e02290290d795b44cd174c22b3209b71e6d8d
Submitter: Zuul
Branch: master

commit 199e02290290d795b44cd174c22b3209b71e6d8d
Author: Bence Romsics <email address hidden>
Date: Mon Jan 28 14:18:00 2019 +0100

    Raise renew_delay in TestNetworkBasicOps:test_subnet_details

    Bug #1813198 can be reproduced both in cirros 0.3.5 and 0.4.0 by:

    while true
    do
        sudo /bin/kill -USR1 $( cat /var/run/udhcpc.$( ip -o addr | awk '/100.109.0.37/ {print $2}' ).pid )
        #sleep .1
    done

    Where 100.109.0.37 is the original IP of eth0 in cirros and the 'kill'
    line basically reproduces how test_subnet_details() renews the dhcp
    lease using busybox's udhcpc.

    However if you uncomment the very short sleep in that loop, the error
    is gone. It seems to me this because the following line in udhcpc:

    https://git.busybox.net/busybox/tree/networking/udhcp/dhcpc.c?h=1_23_stable#n1093

    Note: CirrOS 0.4.0 uses BusyBox v1.23.2.

    That is SIGUSR1 triggers a renew request in udhcpc. Internally udhcpc
    goes into RENEW_REQUESTED state. If it receives a 2nd SIGUSR1 signal
    while still in that state then it deconfigures the interface briefly
    before acquiring a new (many times still the same) address. For us it
    means we should not retry renewing the lease too fast, because if we do
    the address may get deconfigured leading to bug #1813198.

    Note: the call of the deconfig script can be easily confirmed by
    modifying it (for example to log something). In cirros it is located
    under '/sbin/cirros-dhcpc'. Look for the case handling the 'deconfig'
    parameter.

    This patch increases the delay in retries.

    Change-Id: I9b18ef2d835e02be30b7c557e5c10585722111a0
    Closes-Bug: #1813198

Changed in tempest:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tempest 20.0.0

This issue was fixed in the openstack/tempest 20.0.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/633706
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=21f53012f76d11e3df327adcf87e67edf9045d09
Submitter: Zuul
Branch: master

commit 21f53012f76d11e3df327adcf87e67edf9045d09
Author: Slawek Kaplonski <email address hidden>
Date: Tue Jan 29 12:52:01 2019 +0100

    Add unstable_test decorator

    This decorator can be used to temporarily mark some tests as unstable.
    This may help sometimes to debug such test which is failing often
    but not always in the gate because it will still be run but will
    not cause all job failure when it fails.

    This may be also used to easily track in logstash how often
    such test is failing by looking for describption of unstability
    reason set in decorator.

    Change-Id: I79ce70f479506ec2b3216747d533ff2e450685aa
    Related-Bug: #1813198

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.