tripleo-ci-centos-9-content-provider failing with invalid option error

Bug #1966165 reported by Rabi Mishra
This bug affects 1 person
Affects: tripleo
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

Noticed at:

https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_c58/831813/6/check/tripleo-ci-centos-9-content-provider/c58bc4d/job-output.txt

2022-03-23 10:21:36.324090 | primary | ++++(/etc/os-release:1): source(): NAME=Ubuntu
2022-03-23 10:21:36.324128 | primary | ++++(/etc/os-release:2): source(): VERSION='20.04.4 LTS (Focal Fossa)'
2022-03-23 10:21:36.324161 | primary | ++++(/etc/os-release:3): source(): ID=ubuntu
2022-03-23 10:21:36.324208 | primary | ++++(/etc/os-release:4): source(): ID_LIKE=debian
2022-03-23 10:21:36.324252 | primary | ++++(/etc/os-release:5): source(): PRETTY_NAME='Ubuntu 20.04.4 LTS'
2022-03-23 10:21:36.324275 | primary | ++++(/etc/os-release:6): source(): VERSION_ID=20.04
2022-03-23 10:21:36.324304 | primary | ++++(/etc/os-release:7): source(): HOME_URL=https://www.ubuntu.com/
2022-03-23 10:21:36.324333 | primary | ++++(/etc/os-release:8): source(): SUPPORT_URL=https://help.ubuntu.com/
2022-03-23 10:21:36.324362 | primary | ++++(/etc/os-release:9): source(): BUG_REPORT_URL=https://bugs.launchpad.net/ubuntu/
2022-03-23 10:21:36.324414 | primary | ++++(/etc/os-release:10): source(): PRIVACY_POLICY_URL=https://www.ubuntu.com/legal/terms-and-policies/privacy-policy
2022-03-23 10:21:36.324447 | primary | ++++(/etc/os-release:11): source(): VERSION_CODENAME=focal
2022-03-23 10:21:36.324477 | primary | ++++(/etc/os-release:12): source(): UBUNTU_CODENAME=focal
2022-03-23 10:21:36.324544 | primary | +++(/home/zuul/src/opendev.org/openstack/tripleo-ci/scripts/oooq_common_functions.sh:192): python_cmd(): distribution_major_version=20
2022-03-23 10:21:36.324579 | primary | +++(/home/zuul/src/opendev.org/openstack/tripleo-ci/scripts/oooq_common_functions.sh:193): python_cmd(): case $NAME in
2022-03-23 10:21:36.324609 | primary | +++(/home/zuul/src/opendev.org/openstack/tripleo-ci/scripts/oooq_common_functions.sh:212): python_cmd(): distribution=Ubuntu
2022-03-23 10:21:36.324639 | primary | +++(/home/zuul/src/opendev.org/openstack/tripleo-ci/scripts/oooq_common_functions.sh:236): python_cmd(): echo python3
2022-03-23 10:21:36.325299 | primary | ++(/home/zuul/src/opendev.org/openstack/tripleo-ci/scripts/oooq_common_functions.sh:241): package_manager(): '[' python3 == python3 ']'
2022-03-23 10:21:36.325363 | primary | ++(/home/zuul/src/opendev.org/openstack/tripleo-ci/scripts/oooq_common_functions.sh:242): package_manager(): echo ' -y --exclude=python2* '
2022-03-23 10:21:36.326136 | primary | +(/home/zuul/src/opendev.org/openstack/tripleo-ci/toci_gate_test.sh:47): sudo -y '--exclude=python2*' install python3-setuptools python3-requests python3-urllib3 python3-PyYAML
2022-03-23 10:21:36.331254 | primary | sudo: invalid option -- 'y'
2022-03-23 10:21:36.331306 | primary | usage: sudo -h | -K | -k | -V
2022-03-23 10:21:36.331349 | primary | usage: sudo -v [-AknS] [-g group] [-h host] [-p prompt] [-u user]
2022-03-23 10:21:36.331374 | primary | usage: sudo -l [-AknS] [-g group] [-h host] [-p prompt] [-U user] [-u user]
2022-03-23 10:21:36.331395 | primary | [command]
2022-03-23 10:21:36.331432 | primary | usage: sudo [-AbEHknPS] [-r role] [-t type] [-C num] [-g group] [-h host] [-p
2022-03-23 10:21:36.331465 | primary | prompt] [-T timeout] [-u user] [VAR=value] [-i|-s] [<command>]
2022-03-23 10:21:36.331487 | primary | usage: sudo -e [-AknS] [-r role] [-t type] [-C num] [-g group] [-h host] [-p
2022-03-23 10:21:36.331519 | primary | prompt] [-T timeout] [-u user] file ...
2022-03-23 10:21:36.457709 | primary | ERROR
2022-03-23 10:21:36.458155 | primary | {
2022-03-23 10:21:36.458311 | primary | "delta": "0:00:01.039973",
2022-03-23 10:21:36.458449 | primary | "end": "2022-03-23 10:21:36.332954",
2022-03-23 10:21:36.458579 | primary | "msg": "non-zero return code",
2022-03-23 10:21:36.458718 | primary | "rc": 1,
2022-03-23 10:21:36.458850 | primary | "start": "2022-03-23 10:21:35.292981"
2022-03-23 10:21:36.458981 | primary | }
2022-03-23 10:21:36.502840 |

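It looks like this is an Ubuntu Focal node, so the package manager lookup at [1] comes back empty and the command collapses to `sudo -y '--exclude=python2*' install ...`, which sudo rejects as an invalid option. The job is configured to use single-centos-9-node [2], so it is not clear why it ran on Focal.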

[1] https://github.com/openstack/tripleo-ci/blob/master/roles/run-test/templates/oooq_common_functions.sh.j2#L240
[2] https://github.com/openstack/tripleo-ci/blob/master/zuul.d/content-provider.yaml#L31
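The failure mode above can be reproduced with a small shell sketch. The function names package_manager_sketch and package_manager_checked are hypothetical, and picking dnf or yum via `command -v` is an assumption about how such a helper might work; this is not the actual code in oooq_common_functions.sh. The point it illustrates: when neither binary exists, only the flags survive, so `sudo $(...) install` becomes `sudo -y ... install`. The checked variant fails fast instead.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; NOT the real tripleo-ci package_manager().

# Mimics the failure: prints "<dnf-or-yum path> -y --exclude=python2*".
# On a node with neither dnf nor yum (e.g. Ubuntu), pkg stays empty and
# only the flags are printed.
package_manager_sketch() {
    local pkg
    pkg="$(command -v dnf || command -v yum)" || true
    echo "${pkg} -y --exclude=python2*"
}

# Guarded variant: refuse to print a command line with no package manager.
package_manager_checked() {
    local pkg
    pkg="$(command -v dnf || command -v yum)" || true
    if [ -z "${pkg}" ]; then
        echo "ERROR: neither dnf nor yum found; is this really a CentOS/RHEL node?" >&2
        return 1
    fi
    echo "${pkg} -y --exclude=python2*"
}
```

On a node without dnf or yum, `sudo $(package_manager_sketch) install python3-setuptools` expands to the same `sudo -y --exclude=python2* install ...` invocation that fails in the log, whereas package_manager_checked returns 1 with a readable message.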

Rabi Mishra (rabi) wrote :

Perhaps the node was labelled incorrectly in nodepool [1]? Provider jobs after that failure are green, e.g. https://zuul.opendev.org/t/openstack/build/b465b23988e241599e2c5f2850125ac0

[1]
2022-03-23 10:17:01.203650 |
2022-03-23 10:17:01.204083 | LOOP [emit-job-header : Print node information]
2022-03-23 10:17:01.472298 | localhost | ok:
2022-03-23 10:17:01.472666 | localhost | # Node Information
2022-03-23 10:17:01.472750 | localhost | Inventory Hostname: primary
2022-03-23 10:17:01.472861 | localhost | Hostname: ubuntu-focal-iweb-mtl01-0028848617
2022-03-23 10:17:01.472931 | localhost | Username: zuul
2022-03-23 10:17:01.473000 | localhost | Distro: Ubuntu 20.04
2022-03-23 10:17:01.473067 | localhost | Provider: iweb-mtl01
2022-03-23 10:17:01.473159 | localhost | Label: centos-9-stream
2022-03-23 10:17:01.473228 | localhost | Interface IP: 198.72.124.130
2022-03-23 10:17:01.549169 |
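The node information above shows the mismatch directly: the label is centos-9-stream, but the hostname and distro are Ubuntu 20.04. A job could catch this before installing any packages with a pre-flight check along these lines. check_node_distro is a hypothetical helper, not part of tripleo-ci or Zuul; it assumes the value to compare is the ID field of /etc/os-release (ubuntu on the node above, and, assuming a stock image, centos on CentOS Stream 9).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of tripleo-ci or Zuul.
# Usage: check_node_distro <expected-ID> [os-release-file]
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read.
check_node_distro() {
    local expected="$1"
    local os_release="${2:-/etc/os-release}"
    local actual
    # Source the file in a subshell so its variables do not leak into the
    # caller; clear ID first so a stale value from the environment is ignored.
    actual="$(ID=; . "${os_release}" && echo "${ID:-}")" || return 2
    if [ "${actual}" != "${expected}" ]; then
        echo "ERROR: expected ID=${expected}, but ${os_release} has ID=${actual:-<unset>}" >&2
        return 1
    fi
}
```

Run early in a job, `check_node_distro centos` would have failed with a one-line error on ubuntu-focal-iweb-mtl01-0028848617 instead of the sudo usage dump. It only detects the symptom; per the #openstack-infra discussion in this bug, the suspected cause was a leftover VM answering on the IP address nodepool had assigned to the new node.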

Changed in tripleo:
importance: Critical → Medium
Arx Cruz (arxcruz) wrote :

This is an upstream issue. Here is my conversation with fungi in #openstack-infra yesterday:

[14:34:25] <fungi> this is definitely strange
[14:34:27] <fungi> 2022-03-23 11:53:01,706 INFO nodepool.NodeLauncher: [e: e58f618f98a74351b8bc6867b3e803ce] [node_request: 200-0017642005] [node: 0028979246] Creating server with hostname centos-9-stream-iweb-mtl01-0028979246 in iweb-mtl01 from image centos-9-stream
[14:35:23] <fungi> 2022-03-23 11:53:03,134 DEBUG nodepool.NodeLauncher: [e: e58f618f98a74351b8bc6867b3e803ce] [node_request: 200-0017642005] [node: 0028979246] Waiting for server 7d657092-6bc2-42ef-8e54-8b14b7586634
[14:36:06] <fungi> that server instance uuid matches what's in the inventory: https://zuul.opendev.org/t/openstack/build/bf287b96d99442c58222fde07992e5d7/log/zuul-info/inventory.yaml#34
[14:37:45] <fungi> it was creating 0028979246 but the job ended up running on older 0028848617
[14:38:00] <fungi> i'll see if i can find when/where 0028848617 was created
[14:38:34] <fungi> but my gut says that's a rogue vm which never got cleared out in iweb's environment and got into an arp fight with the correct node for the job
[14:39:31] <arxcruz|ruck> rlandy ^
[14:39:34] <frickler> fungi: that node is running for 7d:
[14:39:36] <frickler> root@ubuntu-focal-iweb-mtl01-0028848617:~# w 13:39:11 up 7 days, 11:09, 1 user, load average: 0.00, 0.00, 0.00
[14:40:10] <ysandeep> fungi, we already hit that issue thrice in last 3 run: https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-content-provider&skip=0
[14:40:13] <fungi> yeah, week-old ubuntu node at the ip address nodepool expected the new node to have
[14:40:28] <ysandeep> but they all have same Interface IP:
[14:40:32] <fungi> ysandeep: always in inap-mtl01?
[14:40:36] <fungi> yeah
[14:40:45] <fungi> that pretty much confirms it
[14:41:11] <fungi> nova probably failed to completely delete the node and then lost track of it existing
[14:41:34] <fungi> and the ip address keeps getting handed to new nodes at random
[14:42:51] <ysandeep> that's interesting, I think I have seen that somewhere in an old OpenStack env, I wonder which version of Openstack we run on?
[14:42:53] <fungi> supposedly inap is discarding their openstack services completely a week from tomorrow, so i wouldn't be surprised if they've stopped any manual cleanup/grooming they might previously have been performing on that environment
[14:44:11] <fungi> but since we basically have no operator contacts there any longer, there's probably not much we can do about this other than turn them off a week early and take the (significant) capacity hit
[14:44:29] <frickler> we could shutdown the rogue node manually until then
[14:45:05] <fungi> oh, yeah that's not a bad idea
[14:45:08] <frickler> not sure though what neutron would do with the port then
[14:45:26] <fungi> do a `sudo poewroff` so it hopefully won't get rebooted if the host restarts
[14:45:39] <fungi> er, `sudo poweroff`
[14:45:50] <frickler> but at le...


Rabi Mishra (rabi) wrote :

It could be a rogue/leftover VM, but the fact that it is labelled centos-9-stream in the first place is weird and confusing.

Arx Cruz (arxcruz) wrote :

From this morning:

[09:16:48] <arxcruz|ruck> fungi got another one last night
[09:16:48] <arxcruz|ruck> https://zuul.opendev.org/t/openstack/build/9b84177bfddb4f85bd60e54e5f802225
[09:16:54] <arxcruz|ruck> tkajinam ^
[09:23:33] <frickler> arxcruz|ruck: fungi: confirmed, another long running node, up for 12d even. I've powered it off now, too, maybe we should reconsider completely turning off the cloud early
[09:28:18] <arxcruz|ruck> frickler thanks :)

Rabi Mishra (rabi) wrote :