[long testing] Cloud is not functional: "Lock wait timeout exceeded; try restarting transaction" error.

Bug #1490500 reported by Timur Nurlygayanov
Affects:     Mirantis OpenStack
Status:      Fix Released
Importance:  Critical
Assigned to: Sergii Golovatiuk
Milestone:   7.0

Bug Description

Note: this issue was reproduced on the hardware scale lab with 5 controller nodes and ~18 compute servers.

Steps to Reproduce:
1. Deploy an OpenStack cloud with 5 controllers and 18 compute nodes.
2. Run various tests over 4 days (Murano and Sahara integration tests, booting VMs with CPU / HDD / network load).
3. Wait 2 days without any tests or load on the cloud.
4. Log in to Horizon and try to create a network. Horizon is really slow, and after several minutes it fails with a 'Something went wrong!' error.
5. Log in to the controller nodes and check resource usage (a shell check is sketched below): all RAM (30 GB) and swap (16 GB) are full (please see the attached screenshots).
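
For reference, the resource check in step 5 can also be done from a controller shell (a sketch; assumes the stock procps and util-linux tools on the controller image):

root@node-7:~# free -m        # RAM and swap usage in MiB
root@node-7:~# swapon -s      # swap usage per swap device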

Expected Result:
The cloud is functional, all services work without issues, and memory is not exhausted.

Observed Result:
We can see that all RAM (30 GB) and swap (16 GB) are full (please see the attached screenshots). The cloud is not functional:

root@node-7:~# . openrc
root@node-7:~# nova list
ERROR (Unauthorized): Unauthorized (HTTP 401) (Request-ID: req-9d7dea89-d168-4bbd-b6f6-c8f2a21f81a3)

MySQL and RabbitMQ take a lot of memory:
______________________________________
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14463 mysql 20 0 11.994g 4.000g 1.325g S 0.3 12.7 122:30.16 mysqld
18106 rabbitmq 20 0 5312204 2.234g 2732 S 34.1 7.1 2813:41 beam.smp
22344 root 20 0 1014832 791396 7656 S 0.7 2.4 53:11.24 ceph-mon
17847 nova 20 0 388008 196548 3748 S 0.0 0.6 118:42.91 nova-conductor
 7209 cinder 20 0 4181612 181548 8340 S 0.0 0.6 122:01.94 cinder-volume
17844 nova 20 0 353480 161940 3748 S 0.0 0.5 118:45.47 nova-conductor
17835 nova 20 0 350800 159208 3748 S 0.0 0.5 118:46.23 nova-conductor
17841 nova 20 0 348972 157460 3752 S 0.0 0.5 118:43.65 nova-conductor
17849 nova 20 0 347052 155620 3748 S 0.0 0.5 118:45.61 nova-conductor
 7328 glance 20 0 2795720 152264 8748 S 0.0 0.5 3:47.63 glance-api
32470 neutron 20 0 388668 152008 2752 S 0.0 0.5 20:33.16 neutron-server
17838 nova 20 0 340748 149624 3756 S 0.0 0.5 118:41.81 nova-conductor
32469 neutron 20 0 385000 148808 2752 S 0.0 0.5 20:47.76 neutron-server

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Screenshot of 'atop' output

Changed in mos:
milestone: none → 7.0
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

root@node-7:~# keystone user-list
/usr/lib/python2.7/dist-packages/keystoneclient/shell.py:65: DeprecationWarning: The keystone CLI is deprecated in favor of python-openstackclient. For a Python library, continue using python-keystoneclient.
  'python-keystoneclient.', DeprecationWarning)
An unexpected error prevented the server from fulfilling your request: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'DELETE FROM revocation_event WHERE revocation_event.revoked_at < %s' (datetime.datetime(2015, 8, 31, 9, 7, 28, 737931),) (Disable debug mode to suppress these details.) (HTTP 500) (Request-ID: req-6de83571-e7b0-4a84-8b64-2f588c385e67)
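
The failing statement is keystone's cleanup of expired token-revocation events. The table's row count and the server's lock timeout can be checked directly (a sketch; assumes the default 'keystone' database name and root access to MySQL from the controller):

root@node-7:~# mysql keystone -e "SELECT COUNT(*) FROM revocation_event;"
root@node-7:~# mysql -e "SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';"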

summary: [long testing] SWAP is full after 4 days of OpenStack cloud life - cloud
- is not functional
+ is not functional: "Lock wait timeout exceeded; try restarting
+ transaction" error
Changed in mos:
assignee: nobody → Boris Bobrov (bbobrov)
Revision history for this message
Boris Bobrov (bbobrov) wrote : Re: [long testing] SWAP is full after 4 days of OpenStack cloud life - cloud is not functional: "Lock wait timeout exceeded; try restarting transaction" error

The exception really happens in keystone, but I can't say what the root cause is. There are 0 entries in the `revocation_event` table, and it seems that the process just could not grab the lock. I don't know how to check what grabbed the lock.
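
For reference, InnoDB can report which transaction holds a contended lock (a sketch; assumes the information_schema InnoDB tables present in the MySQL/Galera build shipped with MOS 7.0):

root@node-7:~# mysql -e "SHOW ENGINE INNODB STATUS\G"    # the TRANSACTIONS section lists waiting and blocking transactions
root@node-7:~# mysql -e "SELECT r.trx_id AS waiting_trx, r.trx_query AS waiting_query, b.trx_id AS blocking_trx, b.trx_query AS blocking_query FROM information_schema.INNODB_LOCK_WAITS w JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id\G"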

Changed in mos:
assignee: Boris Bobrov (bbobrov) → Timur Nurlygayanov (tnurlygayanov)
assignee: Timur Nurlygayanov (tnurlygayanov) → Boris Bobrov (bbobrov)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Comments from http://stackoverflow.com/a/15202926:

there is a problem somewhere:
- in your query (misplaced char, cartesian product, ...)
- very numerous records to edit
- complex joins or tests (MD5, substrings, LIKE %...%, etc.)
- data structure problem
- foreign key model (chain/loop locking)
- misindexed data

As @syedrakib said, it works, but it is not a long-term solution for production (the workaround is sketched below).
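
The workaround referenced in that answer is raising the InnoDB lock wait timeout (a sketch; the 120-second value is illustrative, and raising the timeout only masks the underlying contention):

root@node-7:~# mysql -e "SET GLOBAL innodb_lock_wait_timeout = 120;"

The same setting can be made persistent via innodb_lock_wait_timeout under [mysqld] in my.cnf.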

Revision history for this message
Boris Bobrov (bbobrov) wrote :

Sergii, do you have any idea what happened to the lock after the swap became 100% full?

Changed in mos:
assignee: Boris Bobrov (bbobrov) → Sergii Golovatiuk (sgolovatiuk)
Changed in mos:
status: Confirmed → In Progress
summary: - [long testing] SWAP is full after 4 days of OpenStack cloud life - cloud
- is not functional: "Lock wait timeout exceeded; try restarting
- transaction" error
+ [long testing] Cloud is not functional: "Lock wait timeout exceeded; try
+ restarting transaction" error.
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

RCA indicates that the partition for MySQL became full:

root@node-4:/var/lib/mysql# df -h /var/lib/mysql
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/mysql-root 9.8G 9.3G 4.0K 100% /var/lib/mysql

Further analysis indicated that most of the space was filled by MySQL binary logs (mysql-bin.*):

root@node-4:/var/lib/mysql# du -k mysql-bin.* | awk '{total=total+$1} END {print total/(1024 * 1024)"GB"}'
5.15004GB
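
For reference, binary log growth can be bounded so that old binlogs are purged automatically (a sketch; the values are illustrative and would go into the my.cnf used on the controllers):

root@node-4:~# mysql -e "PURGE BINARY LOGS BEFORE NOW();"    # one-off cleanup of binlogs that are no longer needed

[mysqld]
expire_logs_days = 1      # auto-purge binary logs older than one day
max_binlog_size = 100M    # rotate each binlog file at 100 MB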

Changed in mos:
status: In Progress → Invalid
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :
Changed in mos:
status: Invalid → Fix Committed
status: Fix Committed → In Progress
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :
Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Fix verified on MOS 7.0 ISO #265: the MySQL partition is now 20 GB, so there is more headroom before the cluster degrades.
It is also possible to configure disk partitions manually.

Changed in mos:
status: Fix Committed → Fix Released
tags: added: long-haul-testing