Misconfigured local_ip in linuxbridge_agent.ini after Pike -> Queens upgrade
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack-Ansible | Won't Fix | Undecided | Unassigned |
Bug Description
Upgrading from Pike to Queens. First mistake was running `cleanup-
James was very kind to help us troubleshoot at what would have been late in his evening. He spotted a bad configuration on the infra nodes. In `/etc/neutron/
This matches the symptom: instances could talk to _each other_ on Neutron private subnets (so the compute nodes were configured correctly), but they could not communicate with their default gateway or anything north of it (i.e. they couldn't talk to anything on the infra hosts).
After verifying that Ansible in fact did have the correct tunnel interface IP addresses in the dynamically discovered facts (`ansible_br_vxlan` block in each infra host's `/etc/openstack
---
I started the original upgrade following the guide exactly, running everything up to and including setup-openstack.yml completing without error. Then I did something slightly weird: I switched from the 17.1.6 tag of OSA to the stable/queens tag. The diff set is fairly small, but I wanted to get odyssey4me's commit (https:/
Here is exactly what I did:
- Check out `stable/queens`
- `openstack-ansible "${UPGRADE_
- `openstack-ansible repo-install.yml`
- `${SCRIPTS_
- Redeploy Neutron, `openstack-ansible os-neutron-
- Could not connect to tombstone36 (a compute host), called UITS to ask for a restart
- Redeploy Neutron to tombstone36: `openstack-ansible os-neutron-
- `cleanup-
- Re-check-out 17.1.6 to clean up cinder, heat, and nova (then I switched back to stable/queens again)
- Later during troubleshooting I re-ran `openstack-ansible os-nova-
---
I don't know if the misconfiguration was due to cloud operator error, or a bug in the automation, or what. At some point, Ansible wrongly set the br-mgmt interface's IP as the `local_ip` in the Linuxbridge agent config, but we didn't find evidence of this in the `ansible_facts` files. Unfortunately it's likely infeasible for me to reproduce this problem, but I'm happy to look more at the config and answer questions. At least this records our experience and the fix, in case someone else encounters a similar issue.
The same thing happened to our "other" production cloud during its Pike -> Queens upgrade last Tuesday. `local_ip` in linuxbridge_agent.ini got misconfigured with each infra host's br-mgmt IP rather than its br-vxlan IP.
I verified that Ansible did have the correct tunnel interface IP addresses in the dynamically discovered facts, but unlike last time, re-deploying Neutron to the infra hosts did _not_ fix the issue. I fixed it by manually setting the correct IP address in linuxbridge_agent.ini, deleting the VXLAN bridges, and restarting the Neutron services.
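For anyone who wants to check for the same mismatch before (or after) touching anything, here is a rough sketch of the comparison we did by hand: read `local_ip` from the `[vxlan]` section of linuxbridge_agent.ini and compare it with the address Ansible recorded for br-vxlan. The function names are made up for illustration, the sample addresses are placeholders rather than values from our clouds, and the `ipv4` / `address` layout under `ansible_br_vxlan` is assumed to be the usual shape of Ansible interface facts. It is not part of OpenStack-Ansible.

```python
import configparser


def read_local_ip(ini_text):
    """Return local_ip from the [vxlan] section of the given INI text, or None."""
    # interpolation=None so a literal '%' in any value is not treated specially;
    # strict=False tolerates duplicated options in hand-edited files.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.get("vxlan", "local_ip", fallback=None)


def tunnel_ip_from_facts(facts, interface_fact="ansible_br_vxlan"):
    """Return the IPv4 address recorded for the tunnel bridge, or None.

    Assumes the facts dict looks like
    {"ansible_br_vxlan": {"ipv4": {"address": "..."}}}.
    """
    return facts.get(interface_fact, {}).get("ipv4", {}).get("address")


def check_local_ip(ini_text, facts):
    """Return (configured, expected, matches) for one host."""
    configured = read_local_ip(ini_text)
    expected = tunnel_ip_from_facts(facts)
    matches = configured is not None and configured == expected
    return configured, expected, matches


# Placeholder data: local_ip carries a management-network style address,
# while the facts hold a different address for br-vxlan.
sample_ini = """
[vxlan]
enable_vxlan = True
local_ip = 172.29.236.11
"""
sample_facts = {"ansible_br_vxlan": {"ipv4": {"address": "172.29.240.11"}}}

configured, expected, matches = check_local_ip(sample_ini, sample_facts)
if not matches:
    print(f"local_ip mismatch: configured={configured} expected={expected}")
```

To use it for real, you would load the agent config text and the cached facts JSON for each infra host and pass them in; a mismatch is what we saw on every affected infra host.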
Also, this time it was a much "cleaner" upgrade, going straight to the `stable/queens` branch (no switching branches mid-upgrade).
¯\_(ツ)_/¯