Comment 14 for bug 2008509

Trent Lloyd (lathiat) wrote :

Kelvin,

I have re-created it again. I'd like to reiterate that this reproduces 100% of the time with these charms, and that it also happened in a production environment, where it caused quite an impact.

This issue (among others) will cause problems for anyone upgrading Ubuntu OpenStack from Focal to Jammy, which is likely to become more and more common soon. This is a high-impact issue, so could you please take some extra time to try my reproducer thoroughly? It's fairly simple and takes about 30 minutes at most, most of which is waiting.

I understand in the first case you didn't have a matching VIP (that was an oversight on my part), but I have otherwise given very detailed reproduction instructions which you haven't attempted again.

To assist with that, I have modified my original reproducer script to automatically extract and set an appropriate VIP address from "juju subnets", to ensure it will deploy cleanly on anyone's AWS environment. You will need to use us-east-1a or, to use a different AZ, change it to another region's specific AZ in all of the commands and in the bundle file. We need to use a specific AZ because the 'vip' config option has to match the VPC subnet of the AZ we deploy into.

I have attached the following:
- lp2008509-aws-reproduction-terminal.txt - complete terminal output of reproducing the issue with the below script
- juju-crashdump-414db7ba-6214-41f7-aed5-85a5edf8f03a.tar.xz - juju-crashdump of the environment after the failed upgrade-machine
- juju-backup-20230704-090223.tar.gz - juju controller backup after the failed upgrade-machine

We need two outcomes:
- First, we need a workaround for anyone who is already in this situation, so they can get out of it without removing the broken unit. Removing and replacing units is not trivial in many OpenStack deployments (it's very disruptive), and even once this is fixed, people who haven't upgraded juju seem almost certain to attempt the upgrade and hit this issue.
- Second, we need a fix, including a backport to 2.9 (this reproduces identically on 2.9.43, 3.1.2 and 3.2.0).

# Revised reproducer script
# Requires 'jq' and 'python3' installed

juju bootstrap aws aws --bootstrap-constraints "instance-type=t2.micro arch=amd64 zones=us-east-1a"

juju add-model lp2008509

juju set-model-constraints instance-type=t2.micro arch=amd64 zones="us-east-1a"

VPC_SUBNET=$(juju subnets --format=json | jq -r '.subnets | to_entries[] | select(.value.zones[] == "us-east-1a" and (.value."provider-id" | contains("INFAN") | not)) | .key')

VPC_VIP=$(python3 -c 'import random, ipaddress, sys; print(str(random.choice(list(ipaddress.ip_network(sys.argv[1]).hosts()))))' "$VPC_SUBNET")

echo "VPC Subnet: ${VPC_SUBNET}, VPC VIP: ${VPC_VIP}"

# Note that this file has had the 'vip' field removed, compared to the one originally uploaded. Be sure to remove it from yours.
juju deploy ./keystone-focal-yoga.yaml

juju config keystone vip="${VPC_VIP}"

juju wait-for application keystone --query='name=="keystone" && (status=="active" || status=="idle")'

juju status

juju upgrade-machine 0 prepare ubuntu@22.04
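
# Optional sanity check (a sketch, not part of the reproducer as I ran it)
# Since the 'vip' has to fall inside the VPC subnet of the AZ, a small helper along these lines could confirm that before deploying. The name vip_in_subnet is hypothetical; it assumes only 'python3', which the script already requires.

```shell
# Hypothetical helper: exits 0 if the IP address in $1 lies inside the
# CIDR network in $2, and non-zero otherwise (including on invalid input).
vip_in_subnet() {
    python3 -c '
import ipaddress, sys
try:
    vip = ipaddress.ip_address(sys.argv[1])
    net = ipaddress.ip_network(sys.argv[2])
except ValueError:
    # Unparseable address or CIDR counts as "not in subnet"
    sys.exit(2)
sys.exit(0 if vip in net else 1)
' "$1" "$2"
}
```

It could be called right after the echo line, for example: vip_in_subnet "$VPC_VIP" "$VPC_SUBNET" || echo "VIP does not match the subnet". A failure here would also catch the case where the jq filter matched zero or several subnets, since VPC_SUBNET would then not be a single valid CIDR.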