Misconfigured local_ip in linuxbridge_agent.ini after Pike -> Queens upgrade

Bug #1814160 reported by Chris Martin
This bug affects 1 person
Affects: OpenStack-Ansible
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

This happened while upgrading from Pike to Queens. My first mistake was running `cleanup-neutron.yml` at the end to blow away the Neutron agent containers without first shutting them off and verifying that everything worked with the agents running on bare metal. That caused a 6-hour cloud networking outage, which jamesdenton helped me troubleshoot and fix on IRC here:

http://eavesdrop.openstack.org/irclogs/%23openstack-ansible/%23openstack-ansible.2019-01-30.log.html#t2019-01-30T00:07:45

James was very kind to help us troubleshoot at what would be late in his evening. He spotted a bad configuration on the infra nodes: in `/etc/neutron/plugins/ml2/linuxbridge_agent.ini`, the `local_ip` field was set to the infra node's interface on the _management_ network rather than its interface on the VXLAN _tunnel_ network.
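
For reference, a quick way to check which network `local_ip` points at is to compare it against the addresses on the bridges themselves. A minimal sketch, run on an infra host (bridge names match our setup; adjust for yours):

    # Show the [vxlan] section the agent is actually using...
    grep -A 2 '^\[vxlan\]' /etc/neutron/plugins/ml2/linuxbridge_agent.ini
    # ...and compare against the addresses on the tunnel and management bridges.
    ip -4 -o addr show br-vxlan
    ip -4 -o addr show br-mgmt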

This matches up with the symptom that instances could talk to _each other_ on Neutron private subnets (so the compute nodes were configured correctly), but could not communicate with their default gateway or anything north of it (i.e. they couldn't talk to anything on the infra hosts).
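
One way to see the problem directly on an affected host is to look at the source address the kernel VXLAN interfaces were created with; with the bad config, the `local` field shows the br-mgmt address instead of the br-vxlan one:

    # On an infra host: each VXLAN interface reports the "local" (source) IP it
    # was built with; with the misconfiguration this is the br-mgmt address.
    ip -d link show type vxlan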

After verifying that Ansible did in fact have the correct tunnel interface IP addresses in the dynamically discovered facts (the `ansible_br_vxlan` block in each infra host's `/etc/openstack_deploy/ansible_facts` file), we fixed the issue simply by redeploying Neutron to the infra hosts and re-creating the VXLAN interfaces.
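
In case it helps anyone checking the same thing: the cached facts are plain JSON, one file per host under that directory (at least on our deployment), so a spot check can be as simple as the following. The `infra1` file name is just an example:

    # Pretty-print the cached facts for one infra host and look at the br-vxlan
    # block; the ipv4 address in there is what local_ip should have been set to.
    python -m json.tool /etc/openstack_deploy/ansible_facts/infra1 | grep -A 12 '"ansible_br_vxlan"'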

---

I started the original upgrade following the guide exactly, running everything up to and including `setup-openstack.yml`, all of which completed without error. Then I did something slightly unusual: I switched from the 17.1.6 tag of OSA to the stable/queens branch. The diff between them is fairly small, but I wanted to pick up odyssey4me's commit (https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=4bd1c9b3dc936ea61b9ea34d22c058d174aa925a), which automates the Neutron agent migration from containers to bare metal, so that the subsequent Rocky upgrade would go more smoothly.

Here is exactly what I did:

- Check out `stable/queens`
- `openstack-ansible "${UPGRADE_PLAYBOOKS}/neutron-tmp-inventory.yml"`
- `openstack-ansible repo-install.yml`
- `${SCRIPTS_PATH}/bootstrap-ansible.sh` to pull in updated roles
- Redeploy Neutron, `openstack-ansible os-neutron-install.yml`
  - Could not connect to tombstone36 (a compute host); called UITS to ask for a restart
  - Redeploy Neutron to tombstone36: `openstack-ansible os-neutron-install.yml --limit tombstone36`
- `cleanup-neutron.yml`
- Check out 17.1.6 again to clean up cinder, heat, and nova (then I switched back to stable/queens again)
- Later during troubleshooting I re-ran `openstack-ansible os-nova-install.yml`, apparently to no effect
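
In hindsight, a quick spot check after the Neutron redeploys above would have caught the bad `local_ip` before `cleanup-neutron.yml` removed the agent containers. A sketch, assuming the standard OSA inventory group name and the usual deploy-host layout:

    # Run from the OSA playbooks directory on the deploy host.
    cd /opt/openstack-ansible/playbooks
    ansible neutron_linuxbridge_agent -m shell \
        -a "grep ^local_ip /etc/neutron/plugins/ml2/linuxbridge_agent.ini && ip -4 -o addr show br-vxlan"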

---

I don't know whether the misconfiguration was due to cloud operator error, a bug in the automation, or something else. At some point Ansible wrongly set the br-mgmt interface's IP as `local_ip` in the Linuxbridge agent config, but we didn't find any trace of the wrong value in the `ansible_facts` files. Unfortunately it's likely infeasible for me to reproduce this problem, but I'm happy to look more closely at the config and answer questions. At least this records our experience and the fix, in case someone else encounters a similar issue.

Revision history for this message
Chris Martin (6-chris-z) wrote:

The same thing happened for our "other" production cloud during its Pike -> Queens upgrade last Tuesday. `local_ip` in linuxbridge_agent.ini got misconfigured with each infra host's br-mgmt IP rather than br-vxlan IP.

I verified that Ansible did have the correct tunnel interface IP addresses in the dynamically discovered facts, but unlike last time, re-deploying Neutron to the infra hosts did _not_ fix the issue. I fixed it by manually setting the correct IP address in linuxbridge_agent.ini, deleting the VXLAN bridges, and restarting the Neutron services.
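
For the record, the manual fix was roughly the following on each affected infra host (a sketch rather than an exact transcript; the interface name in the delete step is made up):

    # 1. Point local_ip in the [vxlan] section at the host's br-vxlan address.
    vi /etc/neutron/plugins/ml2/linuxbridge_agent.ini
    # 2. Remove the VXLAN interfaces that were built with the wrong source IP;
    #    the agent recreates them using the corrected local_ip.
    ip -d link show type vxlan
    ip link delete vxlan-101        # repeat for each listed vxlan-<vni> interface
    # 3. Restart the agent (unit name may differ by distro) and any other
    #    affected Neutron services.
    systemctl restart neutron-linuxbridge-agent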

Also, this time it was a much "cleaner" upgrade, going straight to the `stable/queens` branch (no switching branches mid-upgrade).

¯\_(ツ)_/¯

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote:

Sorry, Pike and Queens have been EOLed by now.

But the upgrade process has been improved dramatically recently, and CI tests now cover it.

I think a similar issue was fixed quite recently, though that one was caused by running the neutron playbook with tags:
https://review.opendev.org/q/Ib1e3a47acc34ff6d8e6de9555aea59ee8aa244e7

Changed in openstack-ansible:
status: New → Won't Fix