[nailgun][astute] Switch PostgresDB transaction isolation level to 'SERIALIZABLE'

Bug #1746052 reported by Miroslav Anashkin
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Won't Fix
Importance: High
Assigned to: Miroslav Anashkin
Milestone: (none listed)

Bug Description

Nailgun version: fuel-nailgun.noarch 9.0.0-1.mos8983 (updated to MOS 9.2 MU-3)
Fuel version: fuel.noarch 9.0.0-1.mos6430 (also updated to MOS 9.2 MU-3)

Description:

Currently Nailgun uses the default PostgresDB transaction isolation level, 'Repeatable Read'.
Since all Nailgun tasks run in parallel, this isolation level leads to serialization anomalies and even DB deadlocks under specific conditions. These conditions are usually a combination of a fast modern CPU, a virtualised environment (master node and OpenStack nodes) and high load, presumably on the disk that holds the PostgresDB data.
This makes the following scenario possible:

1. Astute finishes task serialization and casts a message to Nailgun to store the big list of serialized tasks. This list includes thousands of tasks and is inserted as a single big transaction.

2. Astute cannot directly commit the transaction that stores the task list in Nailgun; it can only cast messages to Nailgun.

3. Astute immediately starts executing the tasks, since it holds the task list in memory.
Because of a fast modern CPU or a slow disk under the Postgres DB, some tasks may finish before the earlier transaction with the full task list is committed to the DB. Astute then casts one more message to Nailgun to update the status of the finished task. This update requires the earlier transaction to be complete by that point. If the transaction with the task list has not finished, Nailgun threads writing concurrently to the DB may deadlock, or Nailgun may fail to find some object. Nailgun does not stop on such errors; it only writes an error message to app.log and continues accepting requests in other threads. Such requests may never complete because of the DB error or deadlock.
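
The "missed object" part of this scenario can be illustrated with a small, self-contained sketch. It is not Nailgun or Astute code. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the `tasks` table, its columns and the function name are hypothetical. The only point it makes is that rows inserted by a transaction that has not committed yet are invisible to a second connection, so a status update arriving at that moment finds nothing to update.

```python
import os
import sqlite3
import tempfile


def simulate_early_status_update(num_tasks=1000):
    """Return (rows visible before commit, rows visible after commit).

    'writer' plays the role of the big task-list transaction,
    'reader' plays the role of the thread handling a status update.
    sqlite3 is only a stand-in here; table and column names are made up.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # isolation_level=None: transactions are controlled explicitly.
        writer = sqlite3.connect(path, isolation_level=None)
        reader = sqlite3.connect(path, isolation_level=None)
        try:
            writer.execute(
                "CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)"
            )

            def visible_rows_for_task_0():
                rows = reader.execute(
                    "SELECT COUNT(*) FROM tasks WHERE id = 0"
                ).fetchall()
                return rows[0][0]

            # Start the big insert but do not commit yet.
            writer.execute("BEGIN")
            writer.executemany(
                "INSERT INTO tasks (id, status) VALUES (?, 'pending')",
                ((i,) for i in range(num_tasks)),
            )

            # A status update for task 0 arrives now: the row is not visible.
            before = visible_rows_for_task_0()

            writer.execute("COMMIT")

            # Only after the commit can the second connection see the task.
            after = visible_rows_for_task_0()
            return before, after
        finally:
            reader.close()
            writer.close()
```

Calling `simulate_early_status_update()` is expected to return `(0, 1)`: the task is missing while the insert transaction is open and present once it commits.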

Due to their size, the logs are attached as a separate file.

Miroslav Anashkin (manashkin) wrote :
tags: added: customer-found
Changed in fuel:
assignee: nobody → MOS Maintenance (mos-maintenance)
milestone: none → 9.x-updates
Changed in fuel:
assignee: MOS Maintenance (mos-maintenance) → Oleksiy Molchanov (omolchanov)
status: New → In Progress
Alexander Rubtsov (arubtsov) wrote :

sla1 for 9.0-updates

tags: added: sla1
Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → MOS Maintenance (mos-maintenance)
status: In Progress → Confirmed
Denis Meltsaykin (dmeltsaykin) wrote :

Miroslav, is there a way to reliably reproduce the issue? It seems the issue might be caused by a heavily overloaded environment.

Changed in fuel:
assignee: MOS Maintenance (mos-maintenance) → Miroslav Anashkin (manashkin)
status: Confirmed → Incomplete
Miroslav Anashkin (manashkin) wrote :

Yes, the issue is a result of overload.
A virtualized master node whose storage is shared with other VMs/services has a greater chance of hitting this issue.
High load creates an imbalance between the speed of the master node's block storage and the speed of the other nodes. As a result, an operation (saving to the DB) that under normal conditions completes faster than the operations requested from the slave nodes becomes slower under load. The slave nodes may then finish their tasks and send the results back to Nailgun, and Nailgun tries to save the new operation results to the DB while the already running transaction is still in progress. This is a result of the high level of parallelism in Nailgun.

To reproduce it I use a 4-core laptop with 16 GB of RAM, 2 GB per VM, and an SSD disk.
The Fuel 9.2 (MU) deployment runs entirely under VirtualBox.
In production the issue happens the same way, on a virtualized Fuel master node with CPU pass-through.

Denis Meltsaykin (dmeltsaykin) wrote :

Miroslav, since this is a long-standing issue which is barely fixable, are you OK with closing it as Won't Fix?

Miroslav Anashkin (manashkin) wrote :

I verified this.
It looks like we cannot simply set the transaction isolation level to SERIALIZABLE for all transactions.
Nailgun DB performance drops ~1000 times.
We need it only for some critical transactions in critical places.
However, that means we would need to change Astute, Nailgun, the task properties tables in the Nailgun DB schema and maybe the tasks in Fuel-library to reflect which tasks require serializable transactions.

So, for now it looks OK to close this issue as Won't Fix and hope that we implement such a feature in the future.
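
For illustration only, here is a minimal sketch of what "SERIALIZABLE only for some critical transactions" could look like on the Python side. It is not taken from Nailgun. The helper name `run_serializable`, its parameters and the retry policy are assumptions. It expects a DB-API style connection (for example psycopg2 in its default non-autocommit mode) whose exceptions expose a `pgcode` attribute. PostgreSQL reports serialization failures as SQLSTATE 40001 and detected deadlocks as 40P01, and the usual handling for both is to roll back and retry the whole transaction.

```python
import time

# SQLSTATE codes defined by PostgreSQL:
# 40001 = serialization_failure, 40P01 = deadlock_detected.
RETRYABLE_SQLSTATES = {"40001", "40P01"}


def is_retryable(exc):
    """True if the exception carries a retryable SQLSTATE in 'pgcode'."""
    return getattr(exc, "pgcode", None) in RETRYABLE_SQLSTATES


def run_serializable(conn, work, max_attempts=5, base_delay=0.05):
    """Run work(cursor) in one SERIALIZABLE transaction, retrying on conflict.

    conn: DB-API connection that is NOT in autocommit mode (assumption).
    work: callable with the critical statements; it must be safe to re-run.
    """
    for attempt in range(1, max_attempts + 1):
        cur = conn.cursor()
        try:
            # Must be the first statement of the transaction.
            cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
            result = work(cur)
            conn.commit()
            return result
        except Exception as exc:
            conn.rollback()
            if not is_retryable(exc) or attempt == max_attempts:
                raise
            # Simple linear backoff before the next attempt.
            time.sleep(base_delay * attempt)
        finally:
            cur.close()
```

Only the code paths wrapped in such a helper would pay the SERIALIZABLE cost; all other transactions keep the existing isolation level. Deciding which tasks count as critical is the open design question described above.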

Oleksiy Molchanov (omolchanov) wrote :

Thanks for the investigation, Miroslav.

Changed in fuel:
status: Incomplete → Won't Fix