OVN deployment issues with high CPU load and disconnects

Bug #1884734 reported by Justinas Balciunas
This bug affects 1 person
Affects                   Status         Importance  Assigned to      Milestone
Ubuntu Cloud Archive
  Ussuri                  Fix Released   Medium      Unassigned
  Victoria                Fix Released   Medium      Unassigned
kolla-ansible             Fix Released   Undecided   Michal Nasiadka
  Ussuri                  Triaged        Undecided   Unassigned
  Victoria                Fix Released   Undecided   Michal Nasiadka
openvswitch (Ubuntu)      Fix Released   Medium      Unassigned
  Focal                   Fix Released   Medium      Unassigned
  Groovy                  Fix Released   Medium      Unassigned

Bug Description

Using: kolla and kolla-ansible 10.0.0rc2, Ussuri

Problem: networking environment in OpenStack is unstable and unusable.

OS, OVN versions and logs: http://paste.openstack.org/show/795087/
Neutron server logs: http://paste.openstack.org/show/795088/

Michal Nasiadka (mnasiadka) wrote :

Adding Ubuntu Cloud Archive: the bug is related to the Open vSwitch version in UCA Train not containing this commit: https://github.com/openvswitch/ovs/commit/db5a066c17bdeaa7ecac08870331ae583f5ddfcc#diff-85b0dffbc2791acbfc7bcae4499df672

Michal Nasiadka (mnasiadka) wrote :

Sorry, not Train - but Ussuri.

Justinas Balciunas (justinas-balciunas) wrote :

Log files from all three controllers for OVN components and neutron server: https://filebin.net/phvao10y38wcozre

Michal Nasiadka (mnasiadka) wrote :

Update: it seems a bug in kolla-ansible is causing this. We set the OVS system-id based on {{ inventory_hostname_short }}, so if the Ansible inventory contains IP addresses, multiple OVN chassis end up with the same system-id and "override" each other. OVN then gets multiple create/delete events per second, which results in 100% CPU usage by Neutron and OVSDB.
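
To illustrate the collision described above, here is a minimal Python sketch. It assumes that inventory_hostname_short is simply the part of the inventory name before the first dot, which is how Ansible documents it. The helper names and the sample addresses are hypothetical and are not taken from kolla-ansible.

```python
from collections import defaultdict


def inventory_hostname_short(inventory_hostname: str) -> str:
    """Approximate Ansible's inventory_hostname_short: text before the first dot."""
    return inventory_hostname.split(".")[0]


def find_system_id_collisions(inventory_hosts):
    """Group inventory entries by the system-id they would get; keep only duplicates."""
    by_system_id = defaultdict(list)
    for host in inventory_hosts:
        by_system_id[inventory_hostname_short(host)].append(host)
    return {sid: hosts for sid, hosts in by_system_id.items() if len(hosts) > 1}


# Hypothetical inventory that lists controllers by IP address.
hosts = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
print(find_system_id_collisions(hosts))
# -> {'10': ['10.0.0.11', '10.0.0.12', '10.0.0.13']}
```

With DNS-style names such as controller1.example.com the short names stay distinct and the function returns an empty dict, which is why the problem only shows up with IP-based inventories.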

no longer affects: cloud-archive
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/737921

Changed in kolla-ansible:
assignee: nobody → Michal Nasiadka (mnasiadka)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/737921
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=cecdb6a175eb8fe1bfff1d9321d06b7ba59990b1
Submitter: Zuul
Branch: master

commit cecdb6a175eb8fe1bfff1d9321d06b7ba59990b1
Author: Michal Nasiadka <email address hidden>
Date: Thu Jun 25 10:24:22 2020 +0200

    openvswitch: Use ansible_hostname for system-id

    Currently openvswitch sets system-id based on inventory_hostname, but when
    Ansible inventory contains ip addresses - then it will only take first ip
    octet - resulting in multiple OVN chassis being named i.e. "10".
    Then Neutron and OVN have problems functioning, because a chassis named "10"
    will be created and deleted multiple times per second - this ends up in
    ovsdb and neutron-server processes using up to 100% CPU.

    Adding openvswitch role to ovn CI job triggers.

    Change-Id: Id22eb3e74867230da02543abd93234a5fb12b31d
    Closes-Bug: #1884734

Changed in kolla-ansible:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/ussuri)

Reviewed: https://review.opendev.org/738164
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=c5ddd19eacb6d6ea1fead663c98d965ee1575af8
Submitter: Zuul
Branch: stable/ussuri

commit c5ddd19eacb6d6ea1fead663c98d965ee1575af8
Author: Michal Nasiadka <email address hidden>
Date: Thu Jun 25 10:24:22 2020 +0200

    openvswitch: Use ansible_hostname for system-id

    Currently openvswitch sets system-id based on inventory_hostname, but when
    Ansible inventory contains ip addresses - then it will only take first ip
    octet - resulting in multiple OVN chassis being named i.e. "10".
    Then Neutron and OVN have problems functioning, because a chassis named "10"
    will be created and deleted multiple times per second - this ends up in
    ovsdb and neutron-server processes using up to 100% CPU.

    Adding openvswitch role to ovn CI job triggers.

    Change-Id: Id22eb3e74867230da02543abd93234a5fb12b31d
    Closes-Bug: #1884734
    (cherry picked from commit cecdb6a175eb8fe1bfff1d9321d06b7ba59990b1)

Corey Bryant (corey.bryant) wrote :

Triaging focal/ussuri for Ubuntu as medium, as it appears this has been worked around. It sounds like we can pick up the referenced commit in the next stable point release.

Changed in openvswitch (Ubuntu Focal):
status: New → Triaged
importance: Undecided → Medium
Changed in openvswitch (Ubuntu Groovy):
status: New → Triaged
importance: Undecided → Medium
Corey Bryant (corey.bryant) wrote :

Also doing the same for groovy/victoria.

James Page (james-page) wrote :

We've done quite a few OVS point releases since this bug was raised and the versions in >= focal now include the referenced commit and subsequent further fixes and improvements in this area.

Marking fix released across all targets for Ubuntu and the UCA.

Changed in openvswitch (Ubuntu):
status: Triaged → Fix Released
Changed in openvswitch (Ubuntu Groovy):
status: Triaged → Fix Released
Changed in openvswitch (Ubuntu Focal):
status: Triaged → Fix Released