juju created duplicate servers

Bug #1988439 reported by Liam Young
This bug affects 3 people
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

Juju version: 2.9.33-ubuntu-amd64

During a recent deployment using the OpenStack provider I found that the IP address reported by juju status did not match the IP address of the unit when querying its network interfaces. On closer inspection, that was because juju had created two servers with the same name. The IP address reported by juju status belonged to one server, while juju ssh connected to the other:

$ juju status vault/0 --format oneline

- vault/0: 172.20.0.175 (agent:idle, workload:blocked) 8200/tcp

So juju status shows vault/0 has an IP address of 172.20.0.175.

SSHing into the unit shows a different IP:

$ juju ssh vault/0 "ip -o -4 a | grep 172"
2: ens3 inet 172.20.0.183/24 brd 172.20.0.255 scope global dynamic ens3\ valid_lft 38519sec preferred_lft 38519sec
Connection to 172.20.0.183 closed.

172.20.0.183 != 172.20.0.175

Querying openstack to see which servers have those IPs shows:

$ openstack server list | grep -E '172.20.0.175|172.20.0.183'
| c604e6f4-90ab-44a0-93f9-fe32939862db | juju-53f887-zaza-afdd1d2a7e48-1 | ACTIVE | gnuoy_oscify=172.20.0.175 | auto-sync/ubuntu-bionic-daily-amd64-server-20220829-disk1.img | m1.small |
| 7082589c-52c7-4a60-a68a-de4cde9fa2b9 | juju-53f887-zaza-afdd1d2a7e48-1 | ACTIVE | gnuoy_oscify=172.20.0.183 | auto-sync/ubuntu-bionic-daily-amd64-server-20220829-disk1.img | m1.small |

Two servers, both named juju-53f887-zaza-afdd1d2a7e48-1.

Felipe Reyes (freyes) wrote:

I just ran "juju add-machine" and for a few seconds (the cloud is not particularly loaded at the moment) juju status showed this:

Machine State Address Inst id Series AZ Message
...
28 pending pending focal instance "19a71d00-c4d1-4310-9d79-3e8842f54557" has status BUILD, wait 10 seconds before retry, attempt 1

So even after nova returned an instance id, it looks like it wasn't immediately recorded in juju's database.

Felipe Reyes (freyes) wrote:

If I'm reading this code correctly, juju blocks while the instance is in BUILD status, waiting for it to reach ACTIVE[0]. If that doesn't happen within 5 minutes[1], it times out and deletes the instance in order to launch a new one; if the deletion fails, it just logs the error and moves on[2].

[0] https://github.com/juju/juju/blob/2.9/provider/openstack/provider.go#L1188
[1] https://github.com/juju/juju/blob/2.9/provider/openstack/provider.go#L1238
[2] https://github.com/juju/juju/blob/2.9/provider/openstack/provider.go#L1244

John A Meinel (jameinel) wrote:

Interestingly, the machine-0.log that you uploaded does reference one of those instances:
2022-09-01 12:23:25 DEBUG goose logger.go:44 TRACE: MakeServiceURL: http://10.245.161.158:8774/v2.1/servers/c604e6f4-90ab-44a0-93f9-fe32939862db

2022-09-01 12:23:45 INFO juju.provider.openstack provider.go:1316 started instance "c604e6f4-90ab-44a0-93f9-fe32939862db"

However, I don't see any references to 7082589c.

How is the model name: zaza-afdd1d2a7e48 generated? It obviously looks like a uniquely generated model name and presumably you wouldn't ever conflict on those.
The juju-53f887 part is the tail of our internal model uuid: `{constraints 83dd362e-3658-42fd-8d47-ea177b53f887:e}`
(the model uuid is "83dd362e-3658-42fd-8d47-ea177b53f887" and the tail of that is the same 53f887).

I have a feeling that one thing to debug is why we never get to the point of seeing: started instance "7082589c..."

That definitely feels like we did a provisioning request against openstack, which *did* allocate 7082589c, but we never got that response back.

One thing that does concern me is lines like:
2022-09-01 10:41:36 DEBUG juju.apiserver request_notifier.go:188 <- [4115D] machine-0 {"request-id":1457,"type":"Undertaker","version":1,"request":"ProcessDyingModel","params":"'params redacted'"}
2022-09-01 10:41:36 DEBUG juju.apiserver request_notifier.go:220 -> [4115D] machine-0 2.416608ms {"request-id":1457,"error":"model not empty, found 9 machines, 9 applications","error-code":"model not empty","response":"'body redacted'"} Undertaker[""].ProcessDyingModel

John A Meinel (jameinel) wrote:

I think your analysis of provisioning is correct, but I didn't see any mention at all of the instance id in the logs that you shared. Nor do I find any line about "Failed to delete instance" anywhere.

Changed in juju:
importance: Undecided → High
status: New → Triaged