BSN consistency hash not multi-server safe

Bug #1374261 reported by Kevin Benton
This bug affects 1 person
Affects   Status         Importance   Assigned to    Milestone
neutron   Fix Released   Medium       Kevin Benton
Juno      Fix Released   Medium       Kevin Benton

Bug Description

Multiple Neutron servers may read from the consistency hash table in the Big Switch plugin simultaneously, which causes the server that sends the later request to receive an inconsistency error.

This affects RPC-induced backend requests (e.g. port update) and active-active deployments.
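The race can be illustrated with a toy simulation. This is only a sketch under stated assumptions: `ToyBackend`, `InconsistencyError`, `simulate_race`, and the way the next hash is derived are all hypothetical stand-ins, not the Big Switch controller API or the plugin's code.

```python
import hashlib


class InconsistencyError(Exception):
    """Raised by the toy backend when a request carries a stale hash."""


class ToyBackend:
    """Hypothetical stand-in for the controller; not the real API."""

    def __init__(self):
        self.current_hash = "initial"

    def rest_call(self, sent_hash, payload):
        # Reject the request unless it carries the hash from the last update.
        if sent_hash != self.current_hash:
            raise InconsistencyError("stale consistency hash: %s" % sent_hash)
        # Derive the next hash from the old hash and the payload (assumed scheme).
        self.current_hash = hashlib.sha256(
            (self.current_hash + payload).encode()).hexdigest()
        return self.current_hash


def simulate_race():
    backend = ToyBackend()
    table = {"hash": backend.current_hash}  # stands in for the shared DB table

    # Both servers read the table before either has stored a response hash.
    hash_seen_by_a = table["hash"]
    hash_seen_by_b = table["hash"]

    # Server A goes first and stores the hash from the response.
    table["hash"] = backend.rest_call(hash_seen_by_a, "port-update-1")

    # Server B's request arrives later and still carries the old hash.
    try:
        backend.rest_call(hash_seen_by_b, "port-update-2")
    except InconsistencyError:
        return "server B got an inconsistency error"
    return "no conflict"
```

In this sketch the second request fails only because both readers saw the table before the first response was written back, which is the window the report describes.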

tags: added: folsom-backport-potential
tags: added: icehouse-backport-potential
removed: folsom-backport-potential
Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/124265

Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/124336

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/icehouse)

Change abandoned by Kevin Benton (<email address hidden>) on branch: stable/icehouse
Review: https://review.openstack.org/124336
Reason: revisit once fix is in master

Changed in neutron:
importance: Undecided → Medium
milestone: none → kilo-1
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/124265
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cdaa502f899d4c90c5e40ccd745f1d92bfd1127b
Submitter: Jenkins
Branch: master

commit cdaa502f899d4c90c5e40ccd745f1d92bfd1127b
Author: Kevin Benton <email address hidden>
Date: Thu Sep 25 21:42:39 2014 -0700

    BSN: Optimistic locking strategy for consistency

    Summary:
      Adds an optimistic locking strategy for the Big Switch
      server manager so multiple Neutron servers wanting to
      communicate with the backend do not receive the consistency
      hash for use simultaneously.

      The bsn-rest-call semaphore is removed because serialization
      is now provided by the new locking scheme.

      A new DB engine is added because the consistency hashes
      need a life-cycle with rollbacks and other DB operations
      that cannot impact or be impacted by database operations
      happening on the regular Neutron objects.

      Unit tests are included for each of the new branches
      introduced.

    Problem Statement:
      Requests to the Big Switch controllers must contain the
      consistency hash value received from the previous update.
      Otherwise, an inconsistency error will be triggered which
      will force a synchronization. Essentially, a new backend
      call must be prevented from reading from the consistency
      hash table in the DB until the previous call has updated
      the table with the hash from the server response.

      This can be addressed by a semaphore around the rest_call
      function for the single server use case and by a table lock
      on the consistency table for multiple Neutron servers.
      However, both solutions are inadequate because a single
      Neutron server does not scale and a table lock is not
      supported by common SQL HA deployments (e.g. Galera).

      This issue was previously addressed by deploying servers
      in an active-standby configuration. However, that only
      prevented the problem for HTTP API calls. All Neutron
      servers would respond to RPC messages, some of which would
      result in a port update and possible backend call which
      would trigger a conflict if it happened at the same time
      as a backend call from another server. These unnecessary
      syncs are unsustainable as the topology increases beyond
      ~3k VMs.

      Any solution needs to be back-portable to Icehouse so new
      database tables, new requirements, etc. are all out of the
      question.

    Solution:
      This patch stores the lock for the consistency hash as a part
      of the DB record. The guarantees the database offers around
      atomic insertion and constrained atomic updates offer the
      primitives necessary to ensure that only one process/thread
      can lock the record at once.

      The read_for_update method is modified to not return the hash
      in the database until an identifier is inserted into the
      current record or added as a new record. By using an UPDATE
      query with a WHERE clause restricting to the current state,
      only one of many concurrent caller...


Changed in neutron:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/136275

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/juno)

Reviewed: https://review.openstack.org/136275
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=615e2d6399ca443a27ca3476d326faa2ae7a3fa2
Submitter: Jenkins
Branch: stable/juno

commit 615e2d6399ca443a27ca3476d326faa2ae7a3fa2
Author: Kevin Benton <email address hidden>
Date: Thu Sep 25 21:42:39 2014 -0700

    BSN: Optimistic locking strategy for consistency

Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-1 → 2015.1.0