This is an upstream issue. Here is my conversation with fungi in #openstack-infra yesterday:

[14:34:25] this is definitely strange
[14:34:27] 2022-03-23 11:53:01,706 INFO nodepool.NodeLauncher: [e: e58f618f98a74351b8bc6867b3e803ce] [node_request: 200-0017642005] [node: 0028979246] Creating server with hostname centos-9-stream-iweb-mtl01-0028979246 in iweb-mtl01 from image centos-9-stream
[14:35:23] 2022-03-23 11:53:03,134 DEBUG nodepool.NodeLauncher: [e: e58f618f98a74351b8bc6867b3e803ce] [node_request: 200-0017642005] [node: 0028979246] Waiting for server 7d657092-6bc2-42ef-8e54-8b14b7586634
[14:36:06] that server instance uuid matches what's in the inventory: https://zuul.opendev.org/t/openstack/build/bf287b96d99442c58222fde07992e5d7/log/zuul-info/inventory.yaml#34
[14:37:45] it was creating 0028979246 but the job ended up running on older 0028848617
[14:38:00] i'll see if i can find when/where 0028848617 was created
[14:38:34] but my gut says that's a rogue vm which never got cleared out in iweb's environment and got into an arp fight with the correct node for the job
[14:39:31] rlandy ^
[14:39:34] fungi: that node is running for 7d:
[14:39:36] root@ubuntu-focal-iweb-mtl01-0028848617:~# w 13:39:11 up 7 days, 11:09, 1 user, load average: 0.00, 0.00, 0.00
[14:40:10] fungi, we already hit that issue thrice in last 3 run: https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-content-provider&skip=0
[14:40:13] yeah, week-old ubuntu node at the ip address nodepool expected the new node to have
[14:40:28] but they all have same Interface IP:
[14:40:32] ysandeep: always in inap-mtl01?
[14:40:36] yeah
[14:40:45] that pretty much confirms it
[14:41:11] nova probably failed to completely delete the node and then lost track of it existing
[14:41:34] and the ip address keeps getting handed to new nodes at random
[14:42:51] that's interesting, I think I have seen that somewhere in an old OpenStack env, I wonder which version of Openstack we run on?
[14:42:53] supposedly inap is discarding their openstack services completely a week from tomorrow, so i wouldn't be surprised if they've stopped any manual cleanup/grooming they might previously have been performing on that environment
[14:44:11] but since we basically have no operator contacts there any longer, there's probably not much we can do about this other than turn them off a week early and take the (significant) capacity hit
[14:44:29] we could shutdown the rogue node manually until then
[14:45:05] oh, yeah that's not a bad idea
[14:45:08] not sure though what neutron would do with the port then
[14:45:26] do a `sudo poewroff` so it hopefully won't get rebooted if the host restarts
[14:45:39] er, `sudo poweroff`
[14:45:50] but at least it should give a network failure instead of the current issue
[14:46:03] i'll do that now. it can't be any worse than having jobs run on a reused node which may be for an entirely different distro/version
[14:46:32] did it already
[14:46:43] now I get onto a different node immediately
[14:47:44] yeah, in that case i probably powered off the wrong (newer) node, but whatever job was using that was also almost certainly struggling or about to be
[14:48:17] yep, looking back, shell prompt says it was node 0028982952
[14:48:26] i should have compared that before issuing the poweroff
[14:54:30] nodepool should clean that node up soon, so the impact shouldn't be too bad. and I was surprised myself when I still could connect to that address after the poweroff
[15:22:19] yeah, as soon as the executor for whatever job was running there ceased to be able to connect, it would have ended the build and probably retried it
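For reference, the failure mode fungi describes (nova losing track of an instance whose fixed IP is then handed to a new node) can be spotted from the API side by listing which servers in the project claim the suspect address. Below is a minimal sketch using openstacksdk; the cloud name and IP are illustrative placeholders, not values from this incident, and it assumes credentials that can list the project's servers.

```python
#!/usr/bin/env python3
"""Sketch: find all servers in a cloud that report a given fixed IP.

More than one match (or a match much older than expected) suggests a
leaked/rogue instance still holding an address nodepool reassigned.
"""
import sys

import openstack  # openstacksdk

CLOUD_NAME = "iweb-mtl01"      # hypothetical clouds.yaml entry name
SUSPECT_IP = "198.51.100.23"   # placeholder for the address nodepool expected

conn = openstack.connect(cloud=CLOUD_NAME)

matches = []
for server in conn.compute.servers():
    # server.addresses maps network name -> list of {'addr': ..., ...} dicts
    for addrs in (server.addresses or {}).values():
        if any(a.get("addr") == SUSPECT_IP for a in addrs):
            matches.append(server)
            break

if not matches:
    print(f"No server in {CLOUD_NAME} reports {SUSPECT_IP}")
    sys.exit(0)

if len(matches) > 1:
    print(f"Multiple servers claim {SUSPECT_IP} -- likely a leaked instance:")

for s in matches:
    print(f"  {s.id}  {s.name}  status={s.status}  created={s.created_at}")
```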
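Since the job here expected centos-9-stream-iweb-mtl01-0028979246 but actually landed on the week-old ubuntu-focal-iweb-mtl01-0028848617, a cheap job-side mitigation is to compare the hostname of the machine the executor reached against the node name from the inventory before doing any real work. A minimal sketch follows; the EXPECTED_HOSTNAME variable is an assumption for illustration, not something Zuul or nodepool exports under that name.

```python
#!/usr/bin/env python3
"""Sketch: fail fast if this host is not the node the job was scheduled on."""
import os
import socket
import sys

expected = os.environ.get("EXPECTED_HOSTNAME", "")  # hypothetical; fed from the inventory
actual = socket.gethostname()

if not expected:
    sys.exit("EXPECTED_HOSTNAME not set; cannot verify node identity")

if actual.split(".")[0] != expected.split(".")[0]:
    # e.g. expected centos-9-stream-iweb-mtl01-0028979246 but landed on
    # ubuntu-focal-iweb-mtl01-0028848617 -- a stale node answering at the same IP
    sys.exit(f"Node mismatch: expected {expected!r}, running on {actual!r}")

print(f"Node identity OK: {actual}")
```

This would turn the silent "job ran on a reused node of the wrong distro" case into an immediate, clearly attributable failure.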