test_reassign_port_between_servers failing in networking-ovn-tempest-dsvm-ovs-release CI job

Bug #1835029 reported by Slawek Kaplonski on 2019-07-02
Affects: networking-ovn
Importance: Undecided
Assigned to: Unassigned

Bug Description

The test tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON.test_reassign_port_between_servers is failing quite often, especially in the networking-ovn-tempest-dsvm-ovs-release job in Neutron's check queue.

Example of failure: http://logs.openstack.org/78/668378/1/check/networking-ovn-tempest-dsvm-ovs-release/2af6cc8/testr_results.html.gz

Logstash query: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22line%20314%2C%20in%20test_reassign_port_between_servers%5C%22%20AND%20build_name%3A%5C%22networking-ovn-tempest-dsvm-ovs-release%5C%22

It is also failing quite often in other networking-ovn jobs, see: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22line%20314%2C%20in%20test_reassign_port_between_servers%5C%22
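
For readers who want to see what the first Logstash query above actually matches, here is a minimal sketch that percent-decodes it with the Python standard library. The encoded string is copied from the part of the URL after "query="; the variable names are only illustrative.

```python
from urllib.parse import unquote

# Query string copied from the first Logstash URL above (the part after "query=").
encoded = (
    "message%3A%5C%22line%20314%2C%20in%20"
    "test_reassign_port_between_servers%5C%22"
    "%20AND%20build_name%3A%5C%22"
    "networking-ovn-tempest-dsvm-ovs-release%5C%22"
)

# unquote() turns %XX escapes back into characters (%3A is ":", %5C is a
# backslash, %22 is a double quote, %20 is a space, %2C is a comma).
decoded = unquote(encoded)
print(decoded)
```

The decoded query has two clauses joined by AND: one on the traceback line in the test, one on the job name. The second URL drops the build_name clause, which is why it also matches other jobs.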

Daniel Alvarez (dalvarezs) wrote :

This looks like a core OVN issue. I am still investigating, but ovn-northd gets stuck retrying the same transaction (to set the FIP on the port) over and over:

2019-07-12T15:01:36.979Z|2796980|jsonrpc|DBG|unix:/usr/local/var/run/openvswitch/ovnsb_db.sock: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},{"where":[["_uuid","==",["uuid","44c638c5-7b92-41e3-9a25-a1f4bf4c6ae1"]]],"row":{"nat_addresses":"fa:16:3e:de:ea:d5 172.24.5.169 is_chassis_resident(\"cr-lrp-01adeeb9-99f5-43ec-8da4-a7bb343c82f1\")"},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses","insert",["set",["fa:16:3e:de:ea:d5 172.24.5.125 is_chassis_resident(\"cr-lrp-01adeeb9-99f5-43ec-8da4-a7bb343c82f1\")"]]]],"where":[["_uuid","==",["uuid","44c638c5-7b92-41e3-9a25-a1f4bf4c6ae1"]]],"op":"mutate","table":"Port_Binding"},{"where":[["_uuid","==",["uuid","104adb06-d1db-4617-95c6-3af720271e30"]]],"row":{"options":["map",[["ipv6_ra_address_mode","slaac"],["ipv6_ra_max_interval","600"],["ipv6_ra_min_interval","200"],["ipv6_ra_mtu","1392"],["ipv6_ra_prefixes","fd8b:814:608f::/64"],["ipv6_ra_send_periodic","true"],["ipv6_ra_src_addr","fe80::f816:3eff:fe8f:98c5"],["ipv6_ra_src_eth","fa:16:3e:8f:98:c5"],["peer","fa25893c-d654-47a2-bbee-f3acc89464c3"]]]},"op":"update","table":"Port_Binding"}], id=1456621
2019-07-12T15:01:36.979Z|2796981|poll_loop|DBG|wakeup due to [POLLIN] on fd 11 (<->/usr/local/var/run/openvswitch/ovnsb_db.sock) at lib/stream-fd.c:157 (83% CPU usage)
2019-07-12T15:01:36.979Z|2796982|jsonrpc|DBG|unix:/usr/local/var/run/openvswitch/ovnsb_db.sock: received reply, result=[{},{"count":1},{"count":1},{"count":1}], id=1456621
2019-07-12T15:01:36.979Z|2796983|poll_loop|DBG|wakeup due to 0-ms timeout at lib/ovsdb-idl.c:5397 (83% CPU usage)
2019-07-12T15:01:36.981Z|2796984|jsonrpc|DBG|unix:/usr/local/var/run/openvswitch/ovnsb_db.sock: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},{"where":[["_uuid","==",["uuid","44c638c5-7b92-41e3-9a25-a1f4bf4c6ae1"]]],"row":{"nat_addresses":"fa:16:3e:de:ea:d5 172.24.5.169 is_chassis_resident(\"cr-lrp-01adeeb9-99f5-43ec-8da4-a7bb343c82f1\")"},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses","insert",["set",["fa:16:3e:de:ea:d5 172.24.5.125 is_chassis_resident(\"cr-lrp-01adeeb9-99f5-43ec-8da4-a7bb343c82f1\")"]]]],"where":[["_uuid","==",["uuid","44c638c5-7b92-41e3-9a25-a1f4bf4c6ae1"]]],"op":"mutate","table":"Port_Binding"},{"where":[["_uuid","==",["uuid","104adb06-d1db-4617-95c6-3af720271e30"]]],"row":{"options":["map",[["ipv6_ra_address_mode","slaac"],["ipv6_ra_max_interval","600"],["ipv6_ra_min_interval","200"],["ipv6_ra_mtu","1392"],["ipv6_ra_prefixes","fd8b:814:608f::/64"],["ipv6_ra_send_periodic","true"],["ipv6_ra_src_addr","fe80::f816:3eff:fe8f:98c5"],["ipv6_ra_src_eth","fa:16:3e:8f:98:c5"],["peer","fa25893c-d654-47a2-bbee-f3acc89464c3"]]]},"op":"update","table":"Port_Binding"}], id=1456622
2019-07-12T15:01:36.981Z|2796985|poll_loop|DBG|wakeup due to [POLLIN] on fd 11 (<->/usr/local/var/run/openvswitch/ovnsb_db.sock) at lib/stream-fd.c:157 (83% CPU usage)
2019-07-12T15:01:36.981Z|2796986|jsonrpc|DBG|unix:/usr/local/var/ru...
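
A retry loop like the one in the excerpt above can be spotted mechanically by counting how often an identical "transact" request is sent. The following is a minimal sketch, assuming the log line layout shown in the excerpt; the function names and the sample lines are hypothetical, and the sample lines are shortened synthetic stand-ins, not real ovn-northd output.

```python
import re
from collections import Counter

# Matches jsonrpc "send request" lines of the shape seen in the excerpt:
#   ...: send request, method="transact", params=[...], id=<number>
SEND_RE = re.compile(
    r'send request, method="transact", params=(?P<params>.*), id=(?P<id>\d+)$'
)


def count_transactions(lines):
    """Count how many times each distinct transact params string was sent."""
    counts = Counter()
    for line in lines:
        match = SEND_RE.search(line.rstrip())
        if match:
            counts[match.group("params")] += 1
    return counts


def repeated_transactions(lines, threshold=2):
    """Return the params strings that were sent at least `threshold` times."""
    counts = count_transactions(lines)
    return {params: n for params, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    # Synthetic, shortened lines for illustration only.
    sample = [
        'T1|1|jsonrpc|DBG|unix:/tmp/sb.sock: send request, '
        'method="transact", params=["OVN_Southbound",{"op":"assert"}], id=1',
        'T1|2|jsonrpc|DBG|unix:/tmp/sb.sock: received reply, result=[{}], id=1',
        'T2|3|jsonrpc|DBG|unix:/tmp/sb.sock: send request, '
        'method="transact", params=["OVN_Southbound",{"op":"assert"}], id=2',
    ]
    for params, n in repeated_transactions(sample).items():
        print(n, params)
```

The request id is deliberately excluded from the comparison, since it increments on every retry (1456621, then 1456622 in the excerpt) even though the params are identical.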


Jakub Libosvar (libosvar) wrote :

We figured out that the issue is reproducible with specific contents of the OVN NB DB. We have a backup of the DB, and the issue reproduces as soon as northd is started against it. It does not reproduce if we comment out this call, https://github.com/openvswitch/ovs/blob/master/ovn/northd/ovn-northd.c#L2581, which adds a value to the nat_addresses column.

Jakub Libosvar (libosvar) wrote :

It turned out the issue was in the networking-ovn mechanism driver: it used the wrong OVN NB DB call and tried to "set" a NAT resource that had been removed in the same transaction. The resulting missing entry in the OVN NB DB was later restored by the maintenance task, but given the timeouts in the test and the run period of the maintenance task, the failure was not always reproducible.
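
To illustrate why a "set" on a resource removed in the same transaction has no effect, here is a toy model in plain Python. It is only a sketch: it does not use the networking-ovn or OVN NB DB APIs, the function name and the operation names are hypothetical, and the IP addresses are documentation placeholders.

```python
def apply_transaction(nat_rules, operations):
    """Apply (op, external_ip, logical_ip) tuples in order to a copy of nat_rules.

    nat_rules maps a floating IP to the fixed IP it is NATed to.
    """
    result = dict(nat_rules)
    for op, external_ip, logical_ip in operations:
        if op == "delete":
            result.pop(external_ip, None)
        elif op == "set":
            # "set" only updates a rule that still exists. If the rule is
            # gone, nothing happens and no error is raised: a silent no-op.
            if external_ip in result:
                result[external_ip] = logical_ip
        elif op == "add":
            # "add" creates the rule regardless of whether it existed before.
            result[external_ip] = logical_ip
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


# A floating IP currently mapped to the first server's port (placeholder IPs).
before = {"203.0.113.10": "10.0.0.5"}

# Failing pattern: remove the rule, then try to update it in the same transaction.
delete_then_set = apply_transaction(
    before,
    [("delete", "203.0.113.10", None), ("set", "203.0.113.10", "10.0.0.6")],
)

# Working pattern: remove the rule, then add it again for the new port.
delete_then_add = apply_transaction(
    before,
    [("delete", "203.0.113.10", None), ("add", "203.0.113.10", "10.0.0.6")],
)

print(delete_then_set)
print(delete_then_add)
```

In this model the first transaction leaves the floating IP with no NAT rule at all, which corresponds to the missing entry that the maintenance task later had to repair, while the second one leaves it mapped to the new port.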

Reviewed: https://review.opendev.org/670868
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=d662f444280f5dc0a304a0b4de4767c91304c747
Submitter: Zuul
Branch: master

commit d662f444280f5dc0a304a0b4de4767c91304c747
Author: Daniel Alvarez <email address hidden>
Date: Mon Jul 15 17:29:03 2019 +0200

    Always add NAT rule to a LR when updating a FIP

    Before this patch, the NAT rule was attempted to be updated when a
    FIP was reassigned to another port. However, this was a noop
    since the NAT rule didn't exist anymore causing the reassigning to
    be ineffective and to fail silently.

    This patch is always adding the NAT rule to the Logical Router
    no matter if the Neutron FIP is being added or updated.

    The bug that this patch addresses was being hit in the gate with
    around a 30% ratio because the maintenance task was fixing it while
    tempest was still trying to SSH into it.

    Change-Id: Icebf4a82f64989112c3ca810b4358de490108c2d
    Closes-Bug: #1835029
    Closes-Bug: #1833820
    Co-Authored-By: Jakub Libosvar <email address hidden>
    Signed-off-by: Daniel Alvarez <email address hidden>

Changed in networking-ovn:
status: New → Fix Released

Reviewed: https://review.opendev.org/671969
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=13689522d6b97b4d2246e5de6af0447b39ec50ed
Submitter: Zuul
Branch: stable/stein

commit 13689522d6b97b4d2246e5de6af0447b39ec50ed
Author: Daniel Alvarez <email address hidden>
Date: Mon Jul 15 17:29:03 2019 +0200

    Always add NAT rule to a LR when updating a FIP

    Before this patch, the NAT rule was attempted to be updated when a
    FIP was reassigned to another port. However, this was a noop
    since the NAT rule didn't exist anymore causing the reassigning to
    be ineffective and to fail silently.

    This patch is always adding the NAT rule to the Logical Router
    no matter if the Neutron FIP is being added or updated.

    The bug that this patch addresses was being hit in the gate with
    around a 30% ratio because the maintenance task was fixing it while
    tempest was still trying to SSH into it.

    Conflicts:
     networking_ovn/common/ovn_client.py

    Change-Id: Icebf4a82f64989112c3ca810b4358de490108c2d
    Closes-Bug: #1835029
    Closes-Bug: #1833820
    Co-Authored-By: Jakub Libosvar <email address hidden>
    Signed-off-by: Daniel Alvarez <email address hidden>
    (cherry picked from commit d662f444280f5dc0a304a0b4de4767c91304c747)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/671970
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=9490a84f21eef0c112c625227748a61738334d7a
Submitter: Zuul
Branch: stable/rocky

commit 9490a84f21eef0c112c625227748a61738334d7a
Author: Daniel Alvarez <email address hidden>
Date: Mon Jul 15 17:29:03 2019 +0200

    Always add NAT rule to a LR when updating a FIP

    Before this patch, the NAT rule was attempted to be updated when a
    FIP was reassigned to another port. However, this was a noop
    since the NAT rule didn't exist anymore causing the reassigning to
    be ineffective and to fail silently.

    This patch is always adding the NAT rule to the Logical Router
    no matter if the Neutron FIP is being added or updated.

    The bug that this patch addresses was being hit in the gate with
    around a 30% ratio because the maintenance task was fixing it while
    tempest was still trying to SSH into it.

    Conflicts:
     networking_ovn/common/ovn_client.py

    Change-Id: Icebf4a82f64989112c3ca810b4358de490108c2d
    Closes-Bug: #1835029
    Closes-Bug: #1833820
    Co-Authored-By: Jakub Libosvar <email address hidden>
    Signed-off-by: Daniel Alvarez <email address hidden>
    (cherry picked from commit d662f444280f5dc0a304a0b4de4767c91304c747)
    (cherry picked from commit 13689522d6b97b4d2246e5de6af0447b39ec50ed)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/671971
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=fcc79fe9f7539b5665847f9edd8aa8cb37b6dc0f
Submitter: Zuul
Branch: stable/queens

commit fcc79fe9f7539b5665847f9edd8aa8cb37b6dc0f
Author: Daniel Alvarez <email address hidden>
Date: Mon Jul 15 17:29:03 2019 +0200

    Always add NAT rule to a LR when updating a FIP

    Before this patch, the NAT rule was attempted to be updated when a
    FIP was reassigned to another port. However, this was a noop
    since the NAT rule didn't exist anymore causing the reassigning to
    be ineffective and to fail silently.

    This patch is always adding the NAT rule to the Logical Router
    no matter if the Neutron FIP is being added or updated.

    The bug that this patch addresses was being hit in the gate with
    around a 30% ratio because the maintenance task was fixing it while
    tempest was still trying to SSH into it.

    Conflicts:
     networking_ovn/common/ovn_client.py

    Change-Id: Icebf4a82f64989112c3ca810b4358de490108c2d
    Closes-Bug: #1835029
    Closes-Bug: #1833820
    Co-Authored-By: Jakub Libosvar <email address hidden>
    Signed-off-by: Daniel Alvarez <email address hidden>
    (cherry picked from commit d662f444280f5dc0a304a0b4de4767c91304c747)
    (cherry picked from commit 13689522d6b97b4d2246e5de6af0447b39ec50ed)
    (cherry picked from commit 9490a84f21eef0c112c625227748a61738334d7a)

tags: added: in-stable-queens
tags: added: networking-ovn-proactive-backport-potential