[long testing] Cloud is not functional: "Lock wait timeout exceeded; try restarting transaction" error.

Bug #1490500 reported by Timur Nurlygayanov
Affects:     Mirantis OpenStack
Status:      Fix Released
Importance:  Critical
Assigned to: Sergii Golovatiuk
Milestone:   7.0

Bug Description

Note: this issue was reproduced on the hardware scale lab with 5 controller nodes and ~18 compute servers.

Steps to Reproduce:
1. Deploy an OpenStack cloud with 5 controllers and 18 compute nodes.
2. Run various tests over 4 days (Murano and Sahara integration tests, booting VMs with CPU / HDD / network load).
3. Wait 2 days without any tests or load on the cloud.
4. Log in to Horizon and try to create a network. Horizon is really slow, and after several minutes it fails with a 'Something went wrong!' error.
5. Log in to the controller nodes and check resource usage (a shell check is sketched below): all RAM (30 GB) and swap (16 GB) are full (please see the attached screenshots).
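
For reference, the resource check in step 5 can also be done from a controller shell (a sketch; assumes the stock procps and util-linux tools on the controller image):

root@node-7:~# free -m        # RAM and swap usage in MiB
root@node-7:~# swapon -s      # swap usage per swap device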

Expected Result:
The cloud is functional, all services work without issues, and memory is not exhausted.

Observed Result:
We can see that all RAM (30 GB) and swap (16 GB) are full (please see the attached screenshots). The cloud is not functional:

root@node-7:~# . openrc
root@node-7:~# nova list
ERROR (Unauthorized): Unauthorized (HTTP 401) (Request-ID: req-9d7dea89-d168-4bbd-b6f6-c8f2a21f81a3)

MySQL and RabbitMQ take a lot of memory:
______________________________________
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14463 mysql 20 0 11.994g 4.000g 1.325g S 0.3 12.7 122:30.16 mysqld
18106 rabbitmq 20 0 5312204 2.234g 2732 S 34.1 7.1 2813:41 beam.smp
22344 root 20 0 1014832 791396 7656 S 0.7 2.4 53:11.24 ceph-mon
17847 nova 20 0 388008 196548 3748 S 0.0 0.6 118:42.91 nova-conductor
 7209 cinder 20 0 4181612 181548 8340 S 0.0 0.6 122:01.94 cinder-volume
17844 nova 20 0 353480 161940 3748 S 0.0 0.5 118:45.47 nova-conductor
17835 nova 20 0 350800 159208 3748 S 0.0 0.5 118:46.23 nova-conductor
17841 nova 20 0 348972 157460 3752 S 0.0 0.5 118:43.65 nova-conductor
17849 nova 20 0 347052 155620 3748 S 0.0 0.5 118:45.61 nova-conductor
 7328 glance 20 0 2795720 152264 8748 S 0.0 0.5 3:47.63 glance-api
32470 neutron 20 0 388668 152008 2752 S 0.0 0.5 20:33.16 neutron-server
17838 nova 20 0 340748 149624 3756 S 0.0 0.5 118:41.81 nova-conductor
32469 neutron 20 0 385000 148808 2752 S 0.0 0.5 20:47.76 neutron-server

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Screenshot of 'atop' output

Changed in mos:
milestone: none → 7.0
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

root@node-7:~# keystone user-list
/usr/lib/python2.7/dist-packages/keystoneclient/shell.py:65: DeprecationWarning: The keystone CLI is deprecated in favor of python-openstackclient. For a Python library, continue using python-keystoneclient.
  'python-keystoneclient.', DeprecationWarning)
An unexpected error prevented the server from fulfilling your request: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'DELETE FROM revocation_event WHERE revocation_event.revoked_at < %s' (datetime.datetime(2015, 8, 31, 9, 7, 28, 737931),) (Disable debug mode to suppress these details.) (HTTP 500) (Request-ID: req-6de83571-e7b0-4a84-8b64-2f588c385e67)
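
The failing statement is keystone's cleanup of expired token-revocation events. The table's row count and the server's lock timeout can be checked directly (a sketch; assumes the default 'keystone' database name and root access to MySQL from the controller):

root@node-7:~# mysql keystone -e "SELECT COUNT(*) FROM revocation_event;"
root@node-7:~# mysql -e "SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';"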

summary: [long testing] SWAP is full after 4 days of OpenStack cloud life - cloud
- is not functional
+ is not functional: "Lock wait timeout exceeded; try restarting
+ transaction" error
Changed in mos:
assignee: nobody → Boris Bobrov (bbobrov)
Revision history for this message
Boris Bobrov (bbobrov) wrote : Re: [long testing] SWAP is full after 4 days of OpenStack cloud life - cloud is not functional: "Lock wait timeout exceeded; try restarting transaction" error

The exception really happens in keystone, but I can't say what the root cause is. There are 0 entries in the `revocation_event` table, and it seems that the process just could not grab the lock. I don't know how to check what grabbed the lock.
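
For reference, InnoDB can report which transaction holds a contended lock (a sketch; assumes the information_schema InnoDB tables present in the MySQL/Galera build shipped with MOS 7.0):

root@node-7:~# mysql -e "SHOW ENGINE INNODB STATUS\G"    # the TRANSACTIONS section lists waiting and blocking transactions
root@node-7:~# mysql -e "SELECT r.trx_id AS waiting_trx, r.trx_query AS waiting_query, b.trx_id AS blocking_trx, b.trx_query AS blocking_query FROM information_schema.INNODB_LOCK_WAITS w JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id\G"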

Changed in mos:
assignee: Boris Bobrov (bbobrov) → Timur Nurlygayanov (tnurlygayanov)
assignee: Timur Nurlygayanov (tnurlygayanov) → Boris Bobrov (bbobrov)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Comments from http://stackoverflow.com/a/15202926:

there is a problem somewhere:
- in your query (misplaced char, cartesian product, ...)
- very numerous records to edit
- complex joins or tests (MD5, substrings, LIKE %...%, etc.)
- data structure problem
- foreign key model (chain/loop locking)
- misindexed data

As @syedrakib said, it works, but it is not a long-term solution for production (the workaround is sketched below).
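
The workaround referenced in that answer is raising the InnoDB lock wait timeout (a sketch; the 120-second value is illustrative, and raising the timeout only masks the underlying contention):

root@node-7:~# mysql -e "SET GLOBAL innodb_lock_wait_timeout = 120;"

The same setting can be made persistent via innodb_lock_wait_timeout under [mysqld] in my.cnf.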

Revision history for this message
Boris Bobrov (bbobrov) wrote :

Sergii, do you have any idea what happened to the lock after the swap became 100% full?

Changed in mos:
assignee: Boris Bobrov (bbobrov) → Sergii Golovatiuk (sgolovatiuk)
Changed in mos:
status: Confirmed → In Progress
summary: - [long testing] SWAP is full after 4 days of OpenStack cloud life - cloud
- is not functional: "Lock wait timeout exceeded; try restarting
- transaction" error
+ [long testing] Cloud is not functional: "Lock wait timeout exceeded; try
+ restarting transaction" error.
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

RCA indicates that the partition for MySQL became full:

root@node-4:/var/lib/mysql# df -h /var/lib/mysql
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/mysql-root 9.8G 9.3G 4.0K 100% /var/lib/mysql

Further analysis indicated that most of the space was filled by MySQL binary logs (mysql-bin.*):

root@node-4:/var/lib/mysql# du -k mysql-bin.* | awk '{total=total+$1} END {print total/(1024 * 1024)"GB"}'
5.15004GB
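
For reference, binary log growth can be bounded so that old binlogs are purged automatically (a sketch; the values are illustrative and would go into the my.cnf used on the controllers):

root@node-4:~# mysql -e "PURGE BINARY LOGS BEFORE NOW();"    # one-off cleanup of binlogs that are no longer needed

[mysqld]
expire_logs_days = 1      # auto-purge binary logs older than one day
max_binlog_size = 100M    # rotate each binlog file at 100 MB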

Changed in mos:
status: In Progress → Invalid
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :
Changed in mos:
status: Invalid → Fix Committed
status: Fix Committed → In Progress
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :
Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Fix verified on MOS 7.0 ISO #265: the MySQL partition is now 20 GB, so there is more headroom before the cluster degrades.
It is also possible to configure disk partitions manually.

Changed in mos:
status: Fix Committed → Fix Released
tags: added: long-haul-testing