Comment 2 for bug 1966165

Arx Cruz (arxcruz) wrote:

This is an upstream issue; here is my conversation with fungi in #openstack-infra yesterday:

[14:34:25] <fungi> this is definitely strange
[14:34:27] <fungi> 2022-03-23 11:53:01,706 INFO nodepool.NodeLauncher: [e: e58f618f98a74351b8bc6867b3e803ce] [node_request: 200-0017642005] [node: 0028979246] Creating server with hostname centos-9-stream-iweb-mtl01-0028979246 in iweb-mtl01 from image centos-9-stream
[14:35:23] <fungi> 2022-03-23 11:53:03,134 DEBUG nodepool.NodeLauncher: [e: e58f618f98a74351b8bc6867b3e803ce] [node_request: 200-0017642005] [node: 0028979246] Waiting for server 7d657092-6bc2-42ef-8e54-8b14b7586634
[14:36:06] <fungi> that server instance uuid matches what's in the inventory: https://zuul.opendev.org/t/openstack/build/bf287b96d99442c58222fde07992e5d7/log/zuul-info/inventory.yaml#34
[14:37:45] <fungi> it was creating 0028979246 but the job ended up running on older 0028848617
[14:38:00] <fungi> i'll see if i can find when/where 0028848617 was created
[14:38:34] <fungi> but my gut says that's a rogue vm which never got cleared out in iweb's environment and got into an arp fight with the correct node for the job
[14:39:31] <arxcruz|ruck> rlandy ^
[14:39:34] <frickler> fungi: that node is running for 7d:
[14:39:36] <frickler> root@ubuntu-focal-iweb-mtl01-0028848617:~# w 13:39:11 up 7 days, 11:09, 1 user, load average: 0.00, 0.00, 0.00
[14:40:10] <ysandeep> fungi, we already hit that issue thrice in the last 3 runs: https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-content-provider&skip=0
[14:40:13] <fungi> yeah, week-old ubuntu node at the ip address nodepool expected the new node to have
[14:40:28] <ysandeep> but they all have the same Interface IP:
[14:40:32] <fungi> ysandeep: always in inap-mtl01?
[14:40:36] <fungi> yeah
[14:40:45] <fungi> that pretty much confirms it
[14:41:11] <fungi> nova probably failed to completely delete the node and then lost track of it existing
[14:41:34] <fungi> and the ip address keeps getting handed to new nodes at random
[14:42:51] <ysandeep> that's interesting, I think I have seen that somewhere in an old OpenStack env, I wonder which version of OpenStack we run on?
[14:42:53] <fungi> supposedly inap is discarding their openstack services completely a week from tomorrow, so i wouldn't be surprised if they've stopped any manual cleanup/grooming they might previously have been performing on that environment
[14:44:11] <fungi> but since we basically have no operator contacts there any longer, there's probably not much we can do about this other than turn them off a week early and take the (significant) capacity hit
[14:44:29] <frickler> we could shutdown the rogue node manually until then
[14:45:05] <fungi> oh, yeah that's not a bad idea
[14:45:08] <frickler> not sure though what neutron would do with the port then
[14:45:26] <fungi> do a `sudo poewroff` so it hopefully won't get rebooted if the host restarts
[14:45:39] <fungi> er, `sudo poweroff`
[14:45:50] <frickler> but at least it should give a network failure instead of the current issue
[14:46:03] <fungi> i'll do that now. it can't be any worse than having jobs run on a reused node which may be for an entirely different distro/version
[14:46:32] <frickler> did it already
[14:46:43] <frickler> now I get onto a different node immediately
[14:47:44] <fungi> yeah, in that case i probably powered off the wrong (newer) node, but whatever job was using that was also almost certainly struggling or about to be
[14:48:17] <fungi> yep, looking back, shell prompt says it was node 0028982952
[14:48:26] <fungi> i should have compared that before issuing the poweroff
[14:54:30] <frickler> nodepool should clean that node up soon, so the impact shouldn't be too bad. and I was surprised myself when I still could connect to that address after the poweroff
[15:22:19] <fungi> yeah, as soon as the executor for whatever job was running there ceased to be able to connect, it would have ended the build and probably retried it
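
For anyone chasing the same symptom, here is a rough sketch (assuming admin credentials for the affected region; the IP address below is a placeholder) of how an operator could confirm that a leaked instance is still holding the address nodepool assigned to the new node:

    # list every instance in the region that claims the address handed to the new node
    openstack server list --all-projects --long --ip '198\.51\.100\.42'

    # compare against the instance nodepool actually booted for the job
    openstack server show 7d657092-6bc2-42ef-8e54-8b14b7586634 -c id -c name -c status -c created

If the first command returns an old instance (here, ubuntu-focal-iweb-mtl01-0028848617) in addition to the freshly booted one, that is the rogue VM fungi describes above.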
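
And a small precaution for the manual-poweroff workaround frickler and fungi settle on above: confirm which nodepool node you are actually logged into before issuing the command, since the same address may already be answering from a different instance.

    hostname        # should match the node nodepool expects, e.g. ubuntu-focal-iweb-mtl01-0028848617
    sudo poweroff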