Serialization process in Nailgun locks the database

Bug #1643024 reported by Evgeniy L
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Fix Committed | High | Georgy Kibardin |
  (Nominated for Ocata by Oleksiy Molchanov)
Mitaka | Fix Released | High | Georgy Kibardin |
Newton | Fix Released | High | Georgy Kibardin |

Bug Description

Detailed bug description:
 On a 300-node environment, Nailgun locks the database, so no other handler can respond.
Steps to reproduce:
 Run a 300-node cluster, start deployment, force nailgun-agent to start, and run the `fuel node` handler.
Expected results:
 `fuel node` responds slowly but shows the result.
Actual result:
 `fuel node` hangs until the serialization process is finished.
Reproducibility:
 100%
Workaround:
 No workaround.
Impact:
 Huge impact for big deployments.
Description of the environment:
 9.1, 8 cores, 16G RAM.

Another bug was found during the same debugging session:
https://bugs.launchpad.net/fuel/+bug/1643008

Changed in fuel:
milestone: none → 9.2
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
status: New → Confirmed
milestone: 9.2 → 11.0
tags: added: area-python
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)
Georgy Kibardin (gkibardin) wrote :

Evgeniy, could you please give me access to your lab so that I can experiment with it?

Evgeniy L (rustyrobot) wrote :

Georgy, we don't have the lab anymore. Take a regular ISO, create 300 nodes, start the serialization process, and try to get a list of tasks or nodes.

Important: do it with the ISO, and run Nailgun as it runs in production, i.e. without fake mode.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-web (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/408406

Georgy Kibardin (gkibardin) wrote :

So, the reason is that we have a very long transaction (covering the whole graph serialization) in which statements like:

UPDATE nodes SET primary_tags=ARRAY['controller'] WHERE nodes.id = 1

block simultaneous updates of node timestamps by nailgun agent:

UPDATE nodes SET timestamp='2016-12-09T05:54:08.388083'::timestamp WHERE nodes.id = 1

At some point this results in all web server workers being blocked, waiting for the DB to complete the transaction.
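The blocking described above can be reproduced in miniature. The following is a minimal sketch, not Nailgun code: it uses Python's built-in `sqlite3` in place of PostgreSQL, and the function name `demo_blocked_update` and the simplified `nodes` table are assumptions made for illustration. SQLite locks the whole database for writers rather than individual rows, and it raises an error after a timeout where PostgreSQL would simply keep the second statement waiting, but the effect is the same: the agent-style update cannot proceed until the serializer's transaction ends.

```python
import os
import sqlite3
import tempfile


def demo_blocked_update():
    """Show an open write transaction blocking a second writer.

    Returns (blocked_during, row): whether the agent-style UPDATE failed
    while the serializer transaction was open, and the final row contents.
    """
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    # isolation_level=None means we issue BEGIN/COMMIT explicitly.
    serializer = sqlite3.connect(path, isolation_level=None)
    # Short timeout so the blocked writer gives up quickly in this demo.
    agent = sqlite3.connect(path, isolation_level=None, timeout=0.1)
    try:
        serializer.execute(
            "CREATE TABLE nodes ("
            "id INTEGER PRIMARY KEY, primary_tags TEXT, timestamp TEXT)"
        )
        serializer.execute("INSERT INTO nodes (id) VALUES (1)")

        # The "long" transaction: the write lock is held until COMMIT.
        serializer.execute("BEGIN")
        serializer.execute(
            "UPDATE nodes SET primary_tags = 'controller' WHERE id = 1"
        )

        agent_sql = (
            "UPDATE nodes SET timestamp = '2016-12-09T05:54:08.388083' "
            "WHERE id = 1"
        )
        try:
            agent.execute(agent_sql)
            blocked_during = False
        except sqlite3.OperationalError:
            # "database is locked": the serializer still holds the lock.
            blocked_during = True

        # Once the transaction ends, the same statement goes through.
        serializer.execute("COMMIT")
        agent.execute(agent_sql)
        row = agent.execute(
            "SELECT primary_tags, timestamp FROM nodes WHERE id = 1"
        ).fetchone()
        return blocked_during, row
    finally:
        serializer.close()
        agent.close()
        os.remove(path)
```

Calling `demo_blocked_update()` returns a flag saying whether the agent update was refused while the transaction was open, plus the row as it looks after both updates have been applied.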

Georgy Kibardin (gkibardin) wrote :

It seems that tags are populated on first read, i.e. they are updated if they aren't set yet. This explains why the lockup can be reproduced only once for a given set of nodes.

Georgy Kibardin (gkibardin) wrote :

Here is what causes the problem:

https://github.com/openstack/fuel-web/blob/master/nailgun/nailgun/orchestrator/deployment_serializers.py#L901

I'm thinking about where to move this.

Evgeniy L (rustyrobot) wrote :

Georgy, are you sure that the root cause is tag assignment? There were no tags in Fuel 9.1.

Georgy Kibardin (gkibardin) wrote :

Evgeniy, this is definitely the root cause: PostgreSQL shows what is locked by what. Before tags were introduced, there was an identical update for roles, which is what locks everything up in your case.

Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/411692

Changed in fuel:
status: In Progress → Invalid
Changed in fuel:
status: Invalid → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/411692
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=acf86ab9e23a2ca59425055397650e892af5f7ad
Submitter: Jenkins
Branch: master

commit acf86ab9e23a2ca59425055397650e892af5f7ad
Author: Georgy Kibardin <email address hidden>
Date: Fri Dec 16 11:18:56 2016 +0300

    Do not lock nodes records for a long time

    As a part of cluster serialization primary tags on nodes are updated.
    This results in updated nodes record to be locked until the transaction
    where serialization happens ends. When there is a lot of nodes this
    transaction can be very long and, as a result, a lot of node updates by
    nailgun agent can be blocked until it ends. At some points this results
    in all workers being busy blocking any REST call.

    Change-Id: Ie5341290097bddce299bf5726e2607d150c23768
    Closes-Bug: #1643024
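
The commit message does not say how the lock time was shortened. One general way to get that effect is to commit the write before the slow part of the work begins, so the lock is released early. The sketch below illustrates that idea only and is not the patch itself: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and `run_serialization`, `commit_before_slow_part`, and the simplified `nodes` table are hypothetical names. It counts how many agent-style timestamp updates are refused while the "serialization" is running.

```python
import os
import sqlite3
import tempfile


def run_serialization(node_count, commit_before_slow_part):
    """Return how many agent-style updates were blocked during serialization."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    # isolation_level=None means we issue BEGIN/COMMIT explicitly.
    serializer = sqlite3.connect(path, isolation_level=None)
    # Short timeout so a blocked agent update fails fast in this demo.
    agent = sqlite3.connect(path, isolation_level=None, timeout=0.05)
    blocked = 0
    try:
        serializer.execute(
            "CREATE TABLE nodes ("
            "id INTEGER PRIMARY KEY, primary_tags TEXT, timestamp TEXT)"
        )
        serializer.executemany(
            "INSERT INTO nodes (id) VALUES (?)",
            [(i,) for i in range(1, node_count + 1)],
        )

        # The write that takes the lock (primary tag assignment).
        serializer.execute("BEGIN")
        serializer.execute(
            "UPDATE nodes SET primary_tags = 'controller' WHERE id = 1"
        )
        if commit_before_slow_part:
            # Short transaction: release the lock before the slow work.
            serializer.execute("COMMIT")

        # Stand-in for the slow serialization, during which the agent
        # keeps trying to refresh node timestamps.
        for node_id in range(1, node_count + 1):
            try:
                agent.execute(
                    "UPDATE nodes SET timestamp = 'now' WHERE id = ?",
                    (node_id,),
                )
            except sqlite3.OperationalError:
                blocked += 1

        if serializer.in_transaction:
            serializer.execute("COMMIT")
    finally:
        serializer.close()
        agent.close()
        os.remove(path)
    return blocked
```

Comparing `run_serialization(5, False)` with `run_serialization(5, True)` shows the difference between holding the lock for the whole run and releasing it early. Note that SQLite blocks every writer on the database, whereas in PostgreSQL only updates to the locked rows would wait.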

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/415877

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/415882

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (stable/mitaka)

Change abandoned by Georgy Kibardin (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/408406

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/mitaka)

Reviewed: https://review.openstack.org/415882
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=8bb1769ccdaf2804d79fc2d99cc52ac304e7c296
Submitter: Jenkins
Branch: stable/mitaka

commit 8bb1769ccdaf2804d79fc2d99cc52ac304e7c296
Author: Georgy Kibardin <email address hidden>
Date: Fri Dec 16 11:18:56 2016 +0300

    Do not lock nodes records for a long time

    As a part of cluster serialization primary tags on nodes are updated.
    This results in updated nodes record to be locked until the transaction
    where serialization happens ends. When there is a lot of nodes this
    transaction can be very long and, as a result, a lot of node updates by
    nailgun agent can be blocked until it ends. At some points this results
    in all workers being busy blocking any REST call.

    Change-Id: Ie5341290097bddce299bf5726e2607d150c23768
    Closes-Bug: #1643024
    (cherry picked from commit acf86ab9e23a2ca59425055397650e892af5f7ad)

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/newton)

Reviewed: https://review.openstack.org/415877
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=35798c0c0545097cd491fbdd6f1a69c6846ae51d
Submitter: Jenkins
Branch: stable/newton

commit 35798c0c0545097cd491fbdd6f1a69c6846ae51d
Author: Georgy Kibardin <email address hidden>
Date: Fri Dec 16 11:18:56 2016 +0300

    Do not lock nodes records for a long time

    As a part of cluster serialization primary tags on nodes are updated.
    This results in updated nodes record to be locked until the transaction
    where serialization happens ends. When there is a lot of nodes this
    transaction can be very long and, as a result, a lot of node updates by
    nailgun agent can be blocked until it ends. At some points this results
    in all workers being busy blocking any REST call.

    Change-Id: Ie5341290097bddce299bf5726e2607d150c23768
    Closes-Bug: #1643024
    (cherry picked from commit acf86ab9e23a2ca59425055397650e892af5f7ad)

tags: added: area-scale
tags: added: scale
Michael Semenov (msemenov) wrote :

9.2 is not certified with more than 200 nodes. So, set as Fix Released in 9.2.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-web 11.0.0.0rc1

This issue was fixed in the openstack/fuel-web 11.0.0.0rc1 release candidate.

Ilya Bumarskov (ibumarskov) wrote :

The bug should be verified on the scale lab.
