kolla-ansible

http-request timeout can cause services to become out of sync

Bug #1917648 reported by Doug Szumski on 2021-03-03

This bug affects 1 person

Affects          Status          Importance    Assigned to     Milestone
kolla-ansible    Fix Released    Medium        Doug Szumski

Bug Description

In services which use the Apache HTTP server to handle HTTP requests,
there is a TimeOut directive [1] which defaults to 60 seconds. A
similar timeout also exists in HAProxy, and is set to 60 seconds. APIs
which come under heavy load, such as Cinder, can sometimes take longer
to respond than the shorter of these two periods, which results in an
HTTP 504 Gateway Timeout, or similar, being returned to the caller.
However, the backend service can still go on to complete the request
without error.
For example, if Nova calls the Cinder API to detach a volume, and
this operation takes longer than the shorter of the two timeouts, Nova
will emit a stack trace with a 504 Gateway Timeout. Some time later,
the request to detach the volume will succeed in Cinder, but Nova has
already treated it as failed. The Nova and Cinder DBs then become
out-of-sync with each other, and in the worst case DB surgery is
required.
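
The failure mode above can be reproduced in miniature. The following is
only a sketch, not code from Nova, Cinder or Kolla Ansible: the
VolumeBackend class, the function names and the sub-second timings are
all hypothetical, and a thread join with a timeout stands in for the
Apache/HAProxy timeout. It shows that the caller sees a 504 while the
backend still completes the detach, leaving the two views of the volume
in disagreement.

```python
import threading
import time


class VolumeBackend:
    """Hypothetical stand-in for a slow API under load (e.g. Cinder)."""

    def __init__(self, work_seconds):
        self.work_seconds = work_seconds
        self.attached = True
        self.done = threading.Event()

    def detach(self):
        time.sleep(self.work_seconds)  # slow operation under heavy load
        self.attached = False          # state changes whether or not anyone is still waiting
        self.done.set()


def call_with_proxy_timeout(backend, proxy_timeout):
    """Return 200 if the backend finishes in time, else 504.

    The backend thread is NOT cancelled on timeout, which mirrors the
    behaviour described in this bug.
    """
    worker = threading.Thread(target=backend.detach)
    worker.start()
    worker.join(timeout=proxy_timeout)
    return 504 if worker.is_alive() else 200


def simulate(work_seconds, proxy_timeout):
    """Return (status, caller_thinks_attached, backend_attached)."""
    backend = VolumeBackend(work_seconds)
    caller_thinks_attached = True      # the caller's (e.g. Nova's) record
    status = call_with_proxy_timeout(backend, proxy_timeout)
    if status == 200:
        caller_thinks_attached = False
    backend.done.wait()                # let the backend finish its work
    return status, caller_thinks_attached, backend.attached


if __name__ == "__main__":
    print(simulate(work_seconds=0.2, proxy_timeout=0.05))   # request outlives the timeout
    print(simulate(work_seconds=0.01, proxy_timeout=0.5))   # request completes in time
```

In the first call the caller still believes the volume is attached
while the backend has detached it; in the second call both sides agree.
Raising the timeout only makes the first case less likely, it does not
remove it.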

Although strictly this category of bugs should be fixed in the OpenStack
services themselves, it is not realistic to expect this to happen in the
short term. Therefore it makes sense to try to reduce the likelihood of
triggering such bugs in Kolla Ansible.

An example of a related bug is here:

https://bugs.launchpad.net/nova/+bug/1888665
