This is an upstream issue. Below is my conversation with Fungi in #openstack-infra yesterday (a diagnostic sketch follows the log):
[14:34:25] <fungi> this is definitely strange
[14:34:27] <fungi> 2022-03-23 11:53:01,706 INFO nodepool.NodeLauncher: [e: e58f618f98a74351b8bc6867b3e803ce] [node_request: 200-0017642005] [node: 0028979246] Creating server with hostname centos-9-stream-iweb-mtl01-0028979246 in iweb-mtl01 from image centos-9-stream
[14:35:23] <fungi> 2022-03-23 11:53:03,134 DEBUG nodepool.NodeLauncher: [e: e58f618f98a74351b8bc6867b3e803ce] [node_request: 200-0017642005] [node: 0028979246] Waiting for server 7d657092-6bc2-42ef-8e54-8b14b7586634
[14:36:06] <fungi> that server instance uuid matches what's in the inventory: https://zuul.opendev.org/t/openstack/build/bf287b96d99442c58222fde07992e5d7/log/zuul-info/inventory.yaml#34
[14:36:07] soniya29 (~soniya29@103.58.152.110) joined the channel
[14:37:45] <fungi> it was creating 0028979246 but the job ended up running on older 0028848617
[14:38:00] <fungi> i'll see if i can find when/where 0028848617 was created
[14:38:34] <fungi> but my gut says that's a rogue vm which never got cleared out in iweb's environment and got into an arp fight with the correct node for the job
[14:39:31] <arxcruz|ruck> rlandy ^
[14:39:34] <frickler> fungi: that node is running for 7d:
[14:39:36] <frickler> root@ubuntu-focal-iweb-mtl01-0028848617:~# w 13:39:11 up 7 days, 11:09, 1 user, load average: 0.00, 0.00, 0.00
[14:40:10] <ysandeep> fungi, we already hit that issue thrice in last 3 run: https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-content-provider&skip=0
[14:40:13] <fungi> yeah, week-old ubuntu node at the ip address nodepool expected the new node to have
[14:40:28] <ysandeep> but they all have same Interface IP:
[14:40:32] <fungi> ysandeep: always in inap-mtl01?
[14:40:36] <fungi> yeah
[14:40:45] <fungi> that pretty much confirms it
[14:41:11] <fungi> nova probably failed to completely delete the node and then lost track of it existing
[14:41:34] <fungi> and the ip address keeps getting handed to new nodes at random
[14:42:14] sean-k-mooney (~sean@86-44-155-110-dynamic.agg2.cty.lmk-pgs.eircom.net) left IRC (Quit: WeeChat 3.3)
[14:42:51] <ysandeep> that's interesting, I think I have seen that somewhere in an old OpenStack env, I wonder which version of Openstack we run on?
[14:42:53] <fungi> supposedly inap is discarding their openstack services completely a week from tomorrow, so i wouldn't be surprised if they've stopped any manual cleanup/grooming they might previously have been performing on that environment
[14:44:11] <fungi> but since we basically have no operator contacts there any longer, there's probably not much we can do about this other than turn them off a week early and take the (significant) capacity hit
[14:44:29] <frickler> we could shutdown the rogue node manually until then
[14:45:05] <fungi> oh, yeah that's not a bad idea
[14:45:08] <frickler> not sure though what neutron would do with the port then
[14:45:26] <fungi> do a `sudo poewroff` so it hopefully won't get rebooted if the host restarts
[14:45:39] <fungi> er, `sudo poweroff`
[14:45:50] <frickler> but at least it should give a network failure instead of the current issue
[14:46:03] <fungi> i'll do that now. it can't be any worse than having jobs run on a reused node which may be for an entirely different distro/version
[14:46:32] <frickler> did it already
[14:46:43] <frickler> now I get onto a different node immediately
[14:47:44] <fungi> yeah, in that case i probably powered off the wrong (newer) node, but whatever job was using that was also almost certainly struggling or about to be
[14:48:17] <fungi> yep, looking back, shell prompt says it was node 0028982952
[14:48:26] <fungi> i should have compared that before issuing the poweroff
[14:54:30] <frickler> nodepool should clean that node up soon, so the impact shouldn't be too bad. and I was surprised myself when I still could connect to that address after the poweroff
[14:57:32] sean-k-mooney (~sean@86-44-155-110-dynamic.agg2.cty.lmk-pgs.eircom.net) joined the channel
[14:58:08] dasm|off is now known as dasm
[15:22:19] <fungi> yeah, as soon as the executor for whatever job was running there ceased to be able to connect, it would have ended the build and probably retried it
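To make the failure mode above concrete: the job's inventory pointed at the IP nodepool assigned to the new centos-9-stream node, but a stale, never-deleted VM (ubuntu-focal-iweb-mtl01-0028848617) was still answering on that address. A quick way to confirm this is to compare the hostname reported by whatever machine actually answers on the IP against the hostname nodepool assigned. The sketch below is only illustrative, not part of nodepool or zuul; the IP is a placeholder and the expected hostname is taken from the log above.

```python
#!/usr/bin/env python3
"""Sketch: detect a rogue VM answering on an IP nodepool assigned to a new node.

Assumptions (hypothetical, adjust for the node under suspicion):
- NODE_IP is the interface IP from the job inventory (placeholder value here).
- EXPECTED is the hostname nodepool gave the node it launched.
"""
import subprocess
import sys

NODE_IP = "198.51.100.10"  # placeholder; use the IP from zuul-info/inventory.yaml
EXPECTED = "centos-9-stream-iweb-mtl01-0028979246"  # hostname nodepool assigned


def remote_hostname(ip: str) -> str:
    """Return the hostname reported by the machine actually answering at `ip`."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "StrictHostKeyChecking=no",
         f"root@{ip}", "hostname"],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    actual = remote_hostname(NODE_IP)
    if actual != EXPECTED:
        # A mismatch means something else (e.g. a stale, undeleted VM) is
        # answering on the IP nodepool expected the new node to have.
        sys.exit(f"rogue node: expected {EXPECTED}, got {actual}")
    print(f"OK: {actual}")
```

After frickler powered off the rogue VM, a check like this (or the job's SSH connection itself) would simply fail to connect, which is the "network failure instead of the current issue" outcome discussed above, and the executor would then retry the build on a fresh node.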