Comment 2 for bug 1966165

Arx Cruz (arxcruz) wrote:

This is an upstream issue; here is my conversation with fungi in #openstack-infra yesterday:

[14:34:25] <fungi> this is definitely strange
[14:34:27] <fungi> 2022-03-23 11:53:01,706 INFO nodepool.NodeLauncher: [e: e58f618f98a74351b8bc6867b3e803ce] [node_request: 200-0017642005] [node: 0028979246] Creating server with hostname centos-9-stream-iweb-mtl01-0028979246 in iweb-mtl01 from image centos-9-stream
[14:35:23] <fungi> 2022-03-23 11:53:03,134 DEBUG nodepool.NodeLauncher: [e: e58f618f98a74351b8bc6867b3e803ce] [node_request: 200-0017642005] [node: 0028979246] Waiting for server 7d657092-6bc2-42ef-8e54-8b14b7586634
[14:36:06] <fungi> that server instance uuid matches what's in the inventory: https://zuul.opendev.org/t/openstack/build/bf287b96d99442c58222fde07992e5d7/log/zuul-info/inventory.yaml#34
[14:37:45] <fungi> it was creating 0028979246 but the job ended up running on older 0028848617
[14:38:00] <fungi> i'll see if i can find when/where 0028848617 was created
[14:38:34] <fungi> but my gut says that's a rogue vm which never got cleared out in iweb's environment and got into an arp fight with the correct node for the job
[14:39:31] <arxcruz|ruck> rlandy ^
[14:39:34] <frickler> fungi: that node is running for 7d:
[14:39:36] <frickler> root@ubuntu-focal-iweb-mtl01-0028848617:~# w 13:39:11 up 7 days, 11:09, 1 user, load average: 0.00, 0.00, 0.00
[14:40:10] <ysandeep> fungi, we already hit that issue thrice in the last 3 runs: https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-content-provider&skip=0
[14:40:13] <fungi> yeah, week-old ubuntu node at the ip address nodepool expected the new node to have
[14:40:28] <ysandeep> but they all have the same Interface IP:
[14:40:32] <fungi> ysandeep: always in inap-mtl01?
[14:40:36] <fungi> yeah
[14:40:45] <fungi> that pretty much confirms it
[14:41:11] <fungi> nova probably failed to completely delete the node and then lost track of it existing
[14:41:34] <fungi> and the ip address keeps getting handed to new nodes at random
[14:42:51] <ysandeep> that's interesting, I think I have seen that somewhere in an old OpenStack env, I wonder which version of OpenStack we run on?
[14:42:53] <fungi> supposedly inap is discarding their openstack services completely a week from tomorrow, so i wouldn't be surprised if they've stopped any manual cleanup/grooming they might previously have been performing on that environment
[14:44:11] <fungi> but since we basically have no operator contacts there any longer, there's probably not much we can do about this other than turn them off a week early and take the (significant) capacity hit
[14:44:29] <frickler> we could shutdown the rogue node manually until then
[14:45:05] <fungi> oh, yeah that's not a bad idea
[14:45:08] <frickler> not sure though what neutron would do with the port then
[14:45:26] <fungi> do a `sudo poewroff` so it hopefully won't get rebooted if the host restarts
[14:45:39] <fungi> er, `sudo poweroff`
[14:45:50] <frickler> but at least it should give a network failure instead of the current issue
[14:46:03] <fungi> i'll do that now. it can't be any worse than having jobs run on a reused node which may be for an entirely different distro/version
[14:46:32] <frickler> did it already
[14:46:43] <frickler> now I get onto a different node immediately
[14:47:44] <fungi> yeah, in that case i probably powered off the wrong (newer) node, but whatever job was using that was also almost certainly struggling or about to be
[14:48:17] <fungi> yep, looking back, shell prompt says it was node 0028982952
[14:48:26] <fungi> i should have compared that before issuing the poweroff
[14:54:30] <frickler> nodepool should clean that node up soon, so the impact shouldn't be too bad. and I was surprised myself when I still could connect to that address after the poweroff
[15:22:19] <fungi> yeah, as soon as the executor for whatever job was running there ceased to be able to connect, it would have ended the build and probably retried it
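
For anyone chasing the same symptom, here is a rough sketch (assuming admin credentials for the affected region; the IP address below is a placeholder) of how an operator could confirm that a leaked instance is still holding the address nodepool assigned to the new node:

    # list every instance in the region that claims the address handed to the new node
    openstack server list --all-projects --long --ip '198\.51\.100\.42'

    # compare against the instance nodepool actually booted for the job
    openstack server show 7d657092-6bc2-42ef-8e54-8b14b7586634 -c id -c name -c status -c created

If the first command returns an old instance (here, ubuntu-focal-iweb-mtl01-0028848617) in addition to the freshly booted one, that is the rogue VM fungi describes above.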
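
And a small precaution for the manual-poweroff workaround frickler and fungi settle on above: confirm which nodepool node you are actually logged into before issuing the command, since the same address may already be answering from a different instance.

    hostname        # should match the node nodepool expects, e.g. ubuntu-focal-iweb-mtl01-0028848617
    sudo poweroff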