tempest ssh timeout due to udhcpc fails in the cirros guest
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) |
Confirmed
|
High
|
Unassigned | ||
neutron |
New
|
Undecided
|
Unassigned |
Bug Description
Tests trying to ssh into the guest fails intermittently with timeout as udhcpc fails in the guest:
2023-02-01 20:46:32.286979 | controller | Starting network: udhcpc: started, v1.29.3
2023-02-01 20:46:32.286987 | controller | udhcp
2023-02-01 20:46:32.286996 | controller | c: sending discover
2023-02-01 20:46:32.287004 | controller | udhcpc: sending discover
2023-02-01 20:46:32.287013 | controller | udhcpc: sending discover
2023-02-01 20:46:32.287022 | controller | Usage: /sbin/cirros-dhcpc <up|down>
2023-02-01 20:46:32.287030 | controller | udhcpc: no lease, failing
2023-02-01 20:46:32.287039 | controller | FAIL
Traceback (most recent call last):
File "/opt/stack/
return f(*func_args, **func_kwargs)
File "/opt/stack/
server1 = self.create_
File "/opt/stack/
body, servers = compute.
File "/opt/stack/
with excutils.
File "/opt/stack/
self.
File "/opt/stack/
raise self.value
File "/opt/stack/
wait_
File "/opt/stack/
waiters.
File "/opt/stack/
raise lib_exc.
tempest.
Details: None
Example failure https:/
Signature
$ logsearch log --job-group nova-devstack --result FAILURE 'udhcpc: no lease, failing' --days 7
[snip]
Builds with matching logs 6/138:
+------
| uuid | finished | project | pipeline | review | branch | job |
+------
| 9bd5d568bfa84c1
| 3fae6edffe68483
| 70eeeb8eb3184d8
| 492821b715974ae
| f1c6b7e54b28415
| cca45d74a56f420
+------
tags: | added: gate-failure |
Changed in nova: | |
status: | New → Confirmed |
importance: | Undecided → High |
Okay, I did a bit of digging today for some other CI failure I saw on another change and eventually, I found this was related.
So, lemme explain the issue here. First, I was looking at https:/ /storage. bhs.cloud. ovh.net/ v1/AUTH_ dcaab5e32b234d5 6b626f72581e364 4c/zuul_ opendev_ logs_6f9/ 868236/ 5/gate/ nova-next/ 6f9f3d0/ and I was wondering why the SSH connection wasn't working.
When I looked at the nova logs, I found that the instance was spawned at 18:18:56 : 83239]: INFO nova.compute. manager [None req-053318ab- 09ad-4a3a- 8ddb-633cc0002c 3e tempest- AttachVolumeNeg ativeTest- 1605485622 tempest- AttachVolumeNeg ativeTest- 1605485622- project] [instance: 6a265379- ebfd-4aea- a081-8b271f32c0 ea] Took 8.58 seconds to build instance.
Feb 14 18:18:56.514945 np0033093378 nova-compute[
Then, Tempest tried to ssh the instance at 18:18:59 : lib.common. ssh] Creating ssh connection to '172.24.5.161:22' as 'cirros' with public key authentication
2023-02-14 18:22:39.102680 | controller | 2023-02-14 18:18:59,630 92653 INFO [tempest.
And eventually, 2mins32sec after that (18:22:31), it stopped : lib.common. ssh] Failed to establish authenticated ssh connection to cirros@172.24.5.161 after 16 attempts. Proxy client: no proxy client
2023-02-14 18:22:39.103394 | controller | 2023-02-14 18:22:31,398 92653 ERROR [tempest.
Then, I tried to look at the guest console, and I saw that udhcpc tried 3 times :
2023-02-14 18:22:39.129636 | controller | [ 12.638156] sr 0:0:0:0: Attached scsi generic sg0 type 5
[...]
2023-02-14 18:22:39.130384 | controller | Starting network: udhcpc: started, v1.29.3
2023-02-14 18:22:39.130415 | controller | udhcpc: sending discover
2023-02-14 18:22:39.130439 | controller | udhcpc: sending discover
2023-02-14 18:22:39.130461 | controller | udhcpc: sending discover
So, I was wondering how long the DHCP discovery was done and eventually, I found that cirros dhcp client actually hangs for 1 min before requesting again.
So, now I'm wondering why it takes so much time to get a DHCP address and why the 2nd DHCP call doesn't get the IP address.
Adding Neutron team to this bug report because maybe we have something about our DHCP controller.