Concurrent report_state from multiple agents: segment_host_mapping fails - StaleDataError

Bug #1743579 reported by Harald Jensås on 2018-01-16
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Harald Jensås

Bug Description

When multiple host agents call report_state for the first time in rapid succession, we get StaleDataError and _update_segment_host_mapping_for_agent does not complete for all hosts.
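For readers unfamiliar with the error: SQLAlchemy raises StaleDataError when an UPDATE or DELETE matches a different number of rows than the ORM expected, which typically means another transaction changed the row first. The report does not say which statement in neutron hits this, so the following is only a minimal stdlib sketch of the general pattern. The `segments` table, its `revision` column, and the local `StaleDataError` class are assumptions for illustration, not neutron's schema or code.

```python
import sqlite3


class StaleDataError(Exception):
    """Local stand-in for sqlalchemy.orm.exc.StaleDataError."""


def bump_revision(conn, segment_id, seen_revision):
    """Optimistic update: succeeds only if the row still has the revision we read."""
    cur = conn.execute(
        "UPDATE segments SET revision = revision + 1 "
        "WHERE id = ? AND revision = ?",
        (segment_id, seen_revision),
    )
    if cur.rowcount != 1:
        raise StaleDataError(
            "expected to update 1 row; %d matched" % cur.rowcount
        )


def simulate_race():
    """Two workers read the same revision, then both try to write."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table, used only to demonstrate the row-count check.
    conn.execute("CREATE TABLE segments (id TEXT PRIMARY KEY, revision INTEGER)")
    conn.execute("INSERT INTO segments VALUES ('seg-a', 0)")

    # Both workers load the row before either one writes.
    seen_a = conn.execute(
        "SELECT revision FROM segments WHERE id = 'seg-a'"
    ).fetchone()[0]
    seen_b = seen_a

    outcomes = []
    for worker, seen in (("worker-a", seen_a), ("worker-b", seen_b)):
        try:
            bump_revision(conn, "seg-a", seen)
            outcomes.append((worker, "ok"))
        except StaleDataError:
            outcomes.append((worker, "stale"))
    conn.close()
    return outcomes
```

In this sketch the first writer wins and the second sees a stale row, which is why a retry that re-reads current state lets the second writer succeed.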

Attached is a file with logs, a reproducer script, and instructions on how to set up a devstack environment similar to the one I am using.

To Reproduce:
-------------

Run the script with the delay, time.sleep(10), commented out.
Results:
  * 2x StaleDataError
  * Only 1 attempt to add a host to placement/host-aggregate.

MariaDB [neutron]> SELECT * FROM segmenthostmappings;
+--------------------------------------+---------------------------------+
| segment_id                           | host                            |
+--------------------------------------+---------------------------------+
| a974ae4c-1389-4e41-9ab9-820165c26acd | host2                           |
| a974ae4c-1389-4e41-9ab9-820165c26acd | routed-devstack.lab.example.com |
| bc626d3d-5503-4875-9db8-e1bcfad35979 | host2                           |
| bc626d3d-5503-4875-9db8-e1bcfad35979 | routed-devstack.lab.example.com |
| ec7717dd-8533-464f-a3c8-4ecc7dc08d10 | host2                           |
| ec7717dd-8533-464f-a3c8-4ecc7dc08d10 | routed-devstack.lab.example.com |
+--------------------------------------+---------------------------------+

Conclusions:
  * 2x StaleDataError
  * 1x successful _update_segment_host_mapping after_create.

*** We should see 3 attempts to add to placement/host-aggregate, one for each host agent. ***

Running the reproducer script with the delay uncommented (No issue):
--------------------------------------------------------------------

Run the script with the delay, time.sleep(10), enabled.
Results:
  * No StaleDataError
  * 3 attempts to add the host to placement/host-aggregate.

MariaDB [neutron]> SELECT * FROM segmenthostmappings;
+--------------------------------------+---------------------------------+
| segment_id                           | host                            |
+--------------------------------------+---------------------------------+
| 11b9258f-8712-43b7-8f39-3eab627a8c7f | host0                           |
| 11b9258f-8712-43b7-8f39-3eab627a8c7f | host1                           |
| 11b9258f-8712-43b7-8f39-3eab627a8c7f | host2                           |
| 11b9258f-8712-43b7-8f39-3eab627a8c7f | routed-devstack.lab.example.com |
| 89f96bee-424c-4ee2-8639-2ca8e07a70e6 | host0                           |
| 89f96bee-424c-4ee2-8639-2ca8e07a70e6 | host1                           |
| 89f96bee-424c-4ee2-8639-2ca8e07a70e6 | host2                           |
| 89f96bee-424c-4ee2-8639-2ca8e07a70e6 | routed-devstack.lab.example.com |
| a7a7d2f4-c809-4ebb-916f-930c97fbec47 | host0                           |
| a7a7d2f4-c809-4ebb-916f-930c97fbec47 | host1                           |
| a7a7d2f4-c809-4ebb-916f-930c97fbec47 | host2                           |
| a7a7d2f4-c809-4ebb-916f-930c97fbec47 | routed-devstack.lab.example.com |
+--------------------------------------+---------------------------------+

Conclusion:
  * 3x successful _update_segment_host_mapping after_create.

** NOTE: **
The RESP BODY: {"itemNotFound": {"message": "Compute host host1 could not be found.", "code": 404}} errors in the logs are expected, because the fake hosts do not exist in Nova.

Harald Jensås (harald-jensas) wrote :

I proposed a fix here: https://review.openstack.org/#/c/534449/

/me wonder why it was not automatically posted here ...

Changed in neutron:
assignee: nobody → Harald Jensås (harald-jensas)
status: New → In Progress
Changed in neutron:
assignee: Harald Jensås (harald-jensas) → nobody
status: In Progress → New
Changed in neutron:
assignee: nobody → Harald Jensås (harald-jensas)
status: New → In Progress

Reviewed: https://review.openstack.org/534449
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f84781f246004651e0636f8b6507ee1e48bac6b0
Submitter: Zuul
Branch: master

commit f84781f246004651e0636f8b6507ee1e48bac6b0
Author: Harald Jensas <email address hidden>
Date: Tue Jan 16 21:15:22 2018 +0100

    Add retry decorator update_segment_host_mapping()

    When multiple agents register at the same time there is
    a possible race condition causing segment host mappings
    updates to fail. StaleDataError raised by SQLAlchemy ORM.

    Adding retry_if_session_inactive() decorator to the method
    fixes the issue.

    Also serialize the method with lockutils. It takes 25+
    seconds to update segment host mappings for 10 agents with
    the retry decorator alone. With the method serialized the
    same operation completes in less than 1 second. The retry
    decorator is still required for active/active scenarios.

    Closes-Bug: #1743579
    Change-Id: I616457f094d000a4016c610b454be8269d9b4948

Changed in neutron:
status: In Progress → Fix Released

This issue was fixed in the openstack/neutron 12.0.0.0b3 development milestone.
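The master commit message above describes two measures: a retry decorator (retry_if_session_inactive()) and serialization of the method with lockutils. The sketch below is not the neutron patch; it is a stdlib-only illustration of how the two ideas combine. `retry_on_stale`, `serialized`, the in-memory `segment_host_mappings` set, and `report_state_concurrently` are hypothetical stand-ins, and the local `StaleDataError` class substitutes for the SQLAlchemy exception.

```python
import functools
import threading
import time


class StaleDataError(Exception):
    """Local stand-in for sqlalchemy.orm.exc.StaleDataError."""


def retry_on_stale(max_retries=5, delay=0.01):
    """Hypothetical retry decorator: re-run the call when StaleDataError is raised."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except StaleDataError:
                    if attempt == max_retries - 1:
                        raise  # out of retries, surface the error
                    time.sleep(delay * (attempt + 1))  # simple linear backoff
        return wrapper
    return decorator


_mapping_lock = threading.Lock()
segment_host_mappings = set()  # in-memory stand-in for the DB table


def serialized(func):
    """Hypothetical stand-in for a lock-based synchronized decorator."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _mapping_lock:
            return func(*args, **kwargs)
    return wrapper


@retry_on_stale()
@serialized
def update_segment_host_mapping(host, segment_ids):
    """Replace all mappings for one host with the given segments."""
    old = {(seg, h) for (seg, h) in segment_host_mappings if h == host}
    segment_host_mappings.difference_update(old)
    segment_host_mappings.update((seg, host) for seg in segment_ids)


def report_state_concurrently(hosts, segment_ids):
    """Simulate several agents reporting state at the same time."""
    threads = [
        threading.Thread(target=update_segment_host_mapping,
                         args=(host, segment_ids))
        for host in hosts
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(segment_host_mappings)
```

A process-local lock only serializes callers inside one server process, which is consistent with the commit message's remark that the retry decorator is still required for active/active scenarios.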

Akihiro Motoki (amotoki) wrote :

(not related to the main topic)

> /me wonder why it was not automatically posted here ...

It is because the bug number in patch set 1 referred to a different bug.
It seems "Fix proposed" is posted automatically only when a new change is first proposed.

Changed in neutron:
importance: Undecided → Medium

Reviewed: https://review.openstack.org/536940
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=59520d6006ce3d6eccda25d81a5155d5328fbd96
Submitter: Zuul
Branch: stable/pike

commit 59520d6006ce3d6eccda25d81a5155d5328fbd96
Author: Harald Jensas <email address hidden>
Date: Tue Jan 16 21:15:22 2018 +0100

    Add retry decorator update_segment_host_mapping()

    When multiple agents register at the same time there is
    a possible race condition causing segment host mappings
    updates to fail. StaleDataError raised by SQLAlchemy ORM.

    Adding retry_if_session_inactive() decorator to the method
    fixes the issue.

    Also serialize the method with lockutils. It takes 25+
    seconds to update segment host mappings for 10 agents with
    the retry decorator alone. With the method serialized the
    same operation completes in less than 1 second. The retry
    decorator is still required for active/active scenarios.

    Closes-Bug: #1743579
    Change-Id: I616457f094d000a4016c610b454be8269d9b4948
    (cherry picked from commit f84781f246004651e0636f8b6507ee1e48bac6b0)

tags: added: in-stable-pike

This issue was fixed in the openstack/neutron 11.0.3 release.

Reviewed: https://review.opendev.org/640447
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=490edd4cc8015fbfad7ca5866a60316728d3c9fd
Submitter: Zuul
Branch: stable/ocata

commit 490edd4cc8015fbfad7ca5866a60316728d3c9fd
Author: Harald Jensas <email address hidden>
Date: Tue Jan 16 21:15:22 2018 +0100

    Add retry decorator update_segment_host_mapping()

    When multiple agents register at the same time there is
    a possible race condition causing segment host mappings
    updates to fail. StaleDataError raised by SQLAlchemy ORM.

    Adding retry_if_session_inactive() decorator to the method
    fixes the issue.

    Also serialize the method with lockutils. It takes 25+
    seconds to update segment host mappings for 10 agents with
    the retry decorator alone. With the method serialized the
    same operation completes in less than 1 second. The retry
    decorator is still required for active/active scenarios.

    Closes-Bug: #1743579
    Change-Id: I616457f094d000a4016c610b454be8269d9b4948
    (cherry picked from commit f84781f246004651e0636f8b6507ee1e48bac6b0)
    (cherry picked from commit 59520d6006ce3d6eccda25d81a5155d5328fbd96)

tags: added: in-stable-ocata