tempest ssh timeout due to udhcpc failing in the cirros guest

Bug #2006467 reported by Balazs Gibizer
Affects                    Status      Importance   Assigned to   Milestone
OpenStack Compute (nova)   Confirmed   High         Unassigned
neutron                    New         Undecided    Unassigned

Bug Description

Tests trying to ssh into the guest fail intermittently with a timeout, as udhcpc fails in the guest:

2023-02-01 20:46:32.286979 | controller | Starting network: udhcpc: started, v1.29.3
2023-02-01 20:46:32.286987 | controller | udhcpc: sending discover
2023-02-01 20:46:32.287004 | controller | udhcpc: sending discover
2023-02-01 20:46:32.287013 | controller | udhcpc: sending discover
2023-02-01 20:46:32.287022 | controller | Usage: /sbin/cirros-dhcpc <up|down>
2023-02-01 20:46:32.287030 | controller | udhcpc: no lease, failing
2023-02-01 20:46:32.287039 | controller | FAIL

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in wrapper
    return f(*func_args, **func_kwargs)
  File "/opt/stack/tempest/tempest/api/compute/admin/test_volumes_negative.py", line 128, in test_multiattach_rw_volume_update_failure
    server1 = self.create_test_server(
  File "/opt/stack/tempest/tempest/api/compute/base.py", line 272, in create_test_server
    body, servers = compute.create_test_server(
  File "/opt/stack/tempest/tempest/common/compute.py", line 334, in create_test_server
    with excutils.save_and_reraise_exception():
  File "/opt/stack/tempest/.tox/tempest/lib/python3.10/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/opt/stack/tempest/.tox/tempest/lib/python3.10/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/opt/stack/tempest/tempest/common/compute.py", line 329, in create_test_server
    wait_for_ssh_or_ping(
  File "/opt/stack/tempest/tempest/common/compute.py", line 148, in wait_for_ssh_or_ping
    waiters.wait_for_ssh(
  File "/opt/stack/tempest/tempest/common/waiters.py", line 632, in wait_for_ssh
    raise lib_exc.TimeoutException()
tempest.lib.exceptions.TimeoutException: Request timed out
Details: None

Example failure https://zuul.opendev.org/t/openstack/build/f1c6b7e54b28415c952de0be833731a9/logs
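
For context, the wait_for_ssh in the traceback above is essentially a poll-until-deadline loop around the guest's SSH port. Below is a minimal sketch of the idea in plain Python, not Tempest's actual implementation; the host, timeout and interval values are placeholders:

    import socket
    import time

    def wait_for_ssh(host, port=22, timeout=300, interval=10):
        # Keep probing the SSH port until the deadline expires; Tempest raises
        # a TimeoutException in the equivalent situation.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                with socket.create_connection((host, port), timeout=interval):
                    return  # port is reachable, an SSH handshake can be attempted
            except OSError:
                time.sleep(interval)
        raise TimeoutError("guest never became reachable on %s:%d" % (host, port))

    # wait_for_ssh("172.24.5.161")  # example floating IP quoted later in this bug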

Signature
$ logsearch log --job-group nova-devstack --result FAILURE 'udhcpc: no lease, failing' --days 7
[snip]
Builds with matching logs 6/138:
+----------------------------------+---------------------+----------------+----------+-----------------------------------+----------------+--------------------------------------------+
| uuid | finished | project | pipeline | review | branch | job |
+----------------------------------+---------------------+----------------+----------+-----------------------------------+----------------+--------------------------------------------+
| 9bd5d568bfa84c119470df9fbff2de0b | 2023-02-03T12:36:54 | openstack/nova | check | https://review.opendev.org/857339 | master | nova-next |
| 3fae6edffe68483fa2627bc40002f524 | 2023-02-02T13:52:04 | openstack/nova | check | https://review.opendev.org/860285 | master | nova-next |
| 70eeeb8eb3184d8d9ee802ee53cb979b | 2023-02-02T13:33:57 | openstack/nova | check | https://review.opendev.org/860287 | master | nova-next |
| 492821b715974ae389c5d7f9127bb5c3 | 2023-02-02T05:14:11 | openstack/nova | check | https://review.opendev.org/871798 | stable/wallaby | tempest-integrated-compute-centos-8-stream |
| f1c6b7e54b28415c952de0be833731a9 | 2023-02-01T21:34:36 | openstack/nova | gate | https://review.opendev.org/872220 | master | nova-next |
| cca45d74a56f4204a299ee4bbbaad59d | 2023-02-01T06:17:04 | openstack/nova | check | https://review.opendev.org/871557 | stable/wallaby | tempest-integrated-compute-centos-8-stream |
+----------------------------------+---------------------+----------------+----------+-----------------------------------+----------------+--------------------------------------------+

Tags: gate-failure
tags: added: gate-failure
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Okay, I did a bit of digging today into some other CI failure I saw on another change, and eventually I found it was related to this bug.

So, lemme explain the issue here. First, I was looking at https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_6f9/868236/5/gate/nova-next/6f9f3d0/ and I was wondering why the SSH connection wasn't working.

When I looked at the nova logs, I found that the instance was spawned at 18:18:56:
Feb 14 18:18:56.514945 np0033093378 nova-compute[83239]: INFO nova.compute.manager [None req-053318ab-09ad-4a3a-8ddb-633cc0002c3e tempest-AttachVolumeNegativeTest-1605485622 tempest-AttachVolumeNegativeTest-1605485622-project] [instance: 6a265379-ebfd-4aea-a081-8b271f32c0ea] Took 8.58 seconds to build instance.

Then, Tempest tried to ssh into the instance at 18:18:59:
2023-02-14 18:22:39.102680 | controller | 2023-02-14 18:18:59,630 92653 INFO [tempest.lib.common.ssh] Creating ssh connection to '172.24.5.161:22' as 'cirros' with public key authentication

And eventually, 3 min 32 s after that (at 18:22:31), it stopped:
2023-02-14 18:22:39.103394 | controller | 2023-02-14 18:22:31,398 92653 ERROR [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.5.161 after 16 attempts. Proxy client: no proxy client

Then, I looked at the guest console and saw that udhcpc tried 3 times:
2023-02-14 18:22:39.129636 | controller | [ 12.638156] sr 0:0:0:0: Attached scsi generic sg0 type 5
[...]
2023-02-14 18:22:39.130384 | controller | Starting network: udhcpc: started, v1.29.3
2023-02-14 18:22:39.130415 | controller | udhcpc: sending discover
2023-02-14 18:22:39.130439 | controller | udhcpc: sending discover
2023-02-14 18:22:39.130461 | controller | udhcpc: sending discover

So, I was wondering how long the DHCP discovery took, and eventually I found that the cirros DHCP client actually hangs for 1 minute before requesting again.

So now I'm wondering why it takes so much time to get a DHCP address, and why the 2nd DHCP call doesn't get the IP address either.

Adding the Neutron team to this bug report, because maybe there is something to look at on the Neutron DHCP side.
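
For reference, the SSH retry window described above can be computed directly from the two quoted timestamps; a trivial sketch, only to make the budget explicit (nothing here beyond the log lines already quoted):

    # Timestamps from the log lines quoted above (same day, HH:MM:SS).
    first_ssh_attempt = 18 * 3600 + 18 * 60 + 59   # 18:18:59
    gave_up = 18 * 3600 + 22 * 60 + 31             # 18:22:31
    print(gave_up - first_ssh_attempt)             # 212 s, i.e. 3 min 32 s over 16 attempts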

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hi Sylvain:

I this bug is related to [1][2]. Please check the patch merged in Neutron (and the dependent one in tempest [3]). Maybe you need to update your "nova-next" job config (depending on the cirros image used) to update the DHCP client:
        test-config:
          $TEMPEST_CONFIG:
            scenario:
              dhcp_client: dhcpcd

Regards.

[1] https://review.opendev.org/c/openstack/neutron/+/871272
[2] https://bugs.launchpad.net/neutron/+bug/2003063
[3] https://review.opendev.org/c/openstack/tempest/+/871271
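
A hedged way to double-check the result on a devstack node, assuming the test-config fragment above gets rendered by devstack into the [scenario] section of tempest.conf (the path is the usual devstack location; the fallback string is not tempest's real default, just a marker):

    import configparser

    conf = configparser.ConfigParser()
    conf.read("/opt/stack/tempest/etc/tempest.conf")
    # Prints "dhcpcd" if the job override shown above was applied.
    print(conf.get("scenario", "dhcp_client", fallback="<option not set>"))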

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

no comment without mistake: "I *think* this bug is related"

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Hi Rodolfo,

In nova-next we are still using cirros-0.5.2, but I think we can upgrade to 0.6.0.
Do you think that switching from the udhcpc client to dhcpcd would solve the DHCP lease assignment issue?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/873934

Revision history for this message
yatin (yatinkarel) wrote :

I am not aware of any issues with udhcpc; the change (to switch to dhcpcd) in neutron/tempest was required for the move to cirros-0.6.1 (but I don't think that move was to fix any udhcpc issue in cirros-0.5.2).

It can be updated in nova too, but only once we have confirmed it's an issue with cirros/udhcpc.

As per [1] it's seen 7 times in the last 14 days on master and stable/zed, for these jobs:
nova-next 28.6%
neutron-ovs-tempest-dvr 28.6%
devstack-platform-ubuntu-jammy-ovs 28.6%
neutron-ovs-tempest-fips 14.3%

For the same example from comment #1 (network_id=9680b161-ad56-444b-aed2-053bed433701, port_id=77cc62f3-8d3f-4a4f-bbef-ac68fdcd9c9e), in the DHCP agent log:

Feb 14 18:18:50.991856 np0033093378 neutron-dhcp-agent[76814]: INFO neutron.agent.dhcp.agent [-] Trigger reload_allocations for port admin_state_up=True, allowed_address_pairs=[], binding:host_id=, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2023-02-14T18:18:50Z, description=, device_id=6a265379-ebfd-4aea-a081-8b271f32c0ea, device_owner=, extra_dhcp_opts=[], fixed_ips=[<neutron.agent.linux.dhcp.DictModel object at 0x7f5718c1f970>], id=77cc62f3-8d3f-4a4f-bbef-ac68fdcd9c9e, mac_address=fa:16:3e:57:88:0b, name=, network=admin_state_up=True, availability_zone_hints=[], availability_zones=['nova'], created_at=2023-02-14T18:18:26Z, description=, id=9680b161-ad56-444b-aed2-053bed433701, ipv4_address_scope=None, ipv6_address_scope=None, mtu=1380, name=tempest-AttachVolumeNegativeTest-1639666997-network, port_security_enabled=True, project_id=30309aa994164cd1a33985bb09feee0b, provider:network_type=vxlan, provider:physical_network=None, provider:segmentation_id=428, qos_policy_id=None, revision_number=2, router:external=False, shared=False, standard_attr_id=1540, status=ACTIVE, subnets=['2a932882-9ae4-4727-8396-40db023a31d1'], tags=[], tenant_id=30309aa994164cd1a33985bb09feee0b, updated_at=2023-02-14T18:18:26Z, vlan_transparent=None, network_id=9680b161-ad56-444b-aed2-053bed433701, port_security_enabled=True, project_id=30309aa994164cd1a33985bb09feee0b, qos_network_policy_id=None, qos_policy_id=None, resource_request=None, revision_number=1, security_groups=['0571006c-8dcf-45d6-8856-87ee19949ca3'], standard_attr_id=1555, status=DOWN, tags=[], tenant_id=30309aa994164cd1a33985bb09feee0b, updated_at=2023-02-14T18:18:50Z on network _ns_name=qdhcp-9680b161-ad56-444b-aed2-053bed433701, admin_state_up=True, id=9680b161-ad56-444b-aed2-053bed433701, mtu=1380, non_local_subnets=[], ports=[<neutron.agent.linux.dhcp.DictModel object at 0x7f5718be4310>, <neutron.agent.linux.dhcp.DictModel object at 0x7f5718be6650>], project_id=30309aa994164cd1a33985bb09feee0b, subnets=[<neutron.agent.linux.dhcp.DictModel object at 0x7f5718d42080>], tenant_id=30309aa994164cd1a33985bb09feee0b
Feb 14 18:18:50.993190 np0033093378 neutron-dhcp-agent[76814]: DEBUG neutron.agent.dhcp.agent [-] Calling driver for network: 9680b161-ad56-444b-aed2-053bed433701/seg=None action: reload_allocations {{(pid=76814) _call_driver /opt/stack/neutron/neutron/agent/dhcp/agent.py:242}}
Feb 14 18:18:50.996591 np0033093378 neutron-dhcp-agent[76814]: DEBUG neutron.agent.linux.dhcp [-] Building host file: /opt/s...

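To redo that kind of correlation, here is a small hypothetical helper (not part of any OpenStack project): it pulls from a downloaded neutron-dhcp-agent log the lines that mention the port or network from this bug, so the reload_allocations timing can be compared with the guest's discover attempts:

    import re
    import sys

    PORT_ID = "77cc62f3-8d3f-4a4f-bbef-ac68fdcd9c9e"
    NETWORK_ID = "9680b161-ad56-444b-aed2-053bed433701"

    def grep_dhcp_agent(path):
        pattern = re.compile("|".join((PORT_ID, NETWORK_ID)))
        with open(path, errors="replace") as log:
            for line in log:
                if pattern.search(line):
                    # journal lines start with a timestamp, so printing the raw
                    # line is enough to build a timeline.
                    print(line.rstrip())

    if __name__ == "__main__":
        grep_dhcp_agent(sys.argv[1])  # e.g. a downloaded screen-q-dhcp.txt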

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

That's right, the correct sequence for DHCP is [1]. But the DHCPREQUEST should be sent by the DHCP client inside the VM. If that is not happening, then I would suggest checking the logs of the client (if there are any).

Do you know who can help us with this DHCP client? How is it possible to enable logging?

However, I've seen that too using cirros 0.5.2 [2], so maybe we should consider something else. I'll keep checking the logs.

[1] https://paste.opendev.org/show/bCgFSv49GnS2wBPg2B3I/
[2] https://356b2093dea69c2f8b45-a9b411804046f27e6dcdc66ce2b9d993.ssl.cf1.rackcdn.com/871210/8/check/neutron-ovs-tempest-dvr/2dbee89/controller/logs/screen-q-dhcp.txt
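
One way to check that part of the sequence on the wire is to capture the DHCP traffic in the DHCP namespace; a debugging sketch assuming scapy is available on the controller, with the interface name only a guess based on Neutron's tap<port-prefix> naming (run via ip netns exec qdhcp-<network_id> ...):

    from scapy.all import DHCP, sniff

    MESSAGE_TYPES = {1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 5: "ACK", 6: "NAK"}

    def show_dhcp(pkt):
        # Print which step of the DISCOVER/OFFER/REQUEST/ACK exchange was seen.
        if pkt.haslayer(DHCP):
            for opt in pkt[DHCP].options:
                if isinstance(opt, tuple) and opt[0] == "message-type":
                    print(pkt.time, MESSAGE_TYPES.get(opt[1], opt[1]))

    # Interface name is a placeholder guessed from the port id quoted in the
    # DHCP agent log above.
    sniff(iface="tap77cc62f3-8d", filter="udp and (port 67 or port 68)",
          prn=show_dhcp, timeout=180)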
