Instance fails to start: race condition prevents guestagent from stopping postgresql

Bug #1837832 reported by Mark Kirkwood
Affects: OpenStack DBaaS (Trove)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

This is master from a week or so ago (last commit 24 Jun). I'm running it via devstack. The host VM is Ubuntu bionic and the guest image is xenial with Postgres 9.6. The scenario is:

- create a database instance
- the instance goes to FAILED status (Service not active, status: failed to spawn)

The guestagent hits its timeout waiting for Postgres to shut down (from the guestagent log):

2019-07-25 03:06:08.872 ERROR trove.guestagent.datastore.service [-] Timeout while waiting for database status to change.Expected state shutdown, current state is running
2019-07-25 03:06:08.873 INFO trove.guestagent.datastore.service [-] Service status did not change to shutdown within the given timeout: 600s
2019-07-25 03:06:08.873 DEBUG trove.guestagent.datastore.service [-] Attempting to cleanup stalled services. from (pid=1389) _wait_for_database_service_status /home/ubuntu/trove/trove/guestagent/datastore/service.py:325
2019-07-25 03:06:08.874 DEBUG trove.guestagent.datastore.service [-] No cleanup action specified for this datastore. from (pid=1389) cleanup_stalled_db_services /home/ubuntu/trove/trove/guestagent/datastore/service.py:381
2019-07-25 03:06:08.875 ERROR trove.guestagent.datastore.manager [-] An error occurred preparing datastore: Database failed to stop.: RuntimeError: Database failed to stop.
2019-07-25 03:06:08.955 ERROR oslo_messaging.rpc.server [-] Exception during message handling: RuntimeError: Database failed to stop.
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server Traceback (most recent call last):
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server result = func(ctxt, **new_args)
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 160, in wrapper
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server result = f(*args, **kwargs)
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server File "/home/ubuntu/trove/trove/guestagent/datastore/manager.py", line 271, in prepare
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server cluster_config, snapshot, modules)
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server File "/home/ubuntu/trove/trove/guestagent/datastore/manager.py", line 286, in _prepare
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server cluster_config, snapshot)
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 160, in wrapper
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server result = f(*args, **kwargs)
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server File "/home/ubuntu/trove/trove/guestagent/datastore/experimental/postgresql/manager.py", line 219, in do_prepare
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server self.app.stop_db()
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server File "/home/ubuntu/trove/trove/guestagent/datastore/experimental/postgresql/service.py", line 396, in stop_db
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server disable_on_boot=do_not_start_on_reboot, update_db=update_db)
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server File "/home/ubuntu/trove/trove/guestagent/datastore/service.py", line 298, in stop_db_service
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server raise RuntimeError(_("Database failed to stop."))
2019-07-25 03:06:08.955 TRACE oslo_messaging.rpc.server RuntimeError: Database failed to stop.

Looking at the state of the postgresql service, systemd thinks postgres is down:

ubuntu@db0:/var/log/trove$ sudo service postgresql status
● postgresql.service - PostgreSQL RDBMS
   Loaded: loaded (/lib/systemd/system/postgresql.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2019-07-25 02:56:06 UTC; 1h 4min ago
  Process: 1468 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
 Main PID: 1468 (code=exited, status=0/SUCCESS)

Jul 25 02:56:02 db0 systemd[1]: Starting PostgreSQL RDBMS...
Jul 25 02:56:02 db0 systemd[1]: Started PostgreSQL RDBMS.
Jul 25 02:56:06 db0 systemd[1]: Stopped PostgreSQL RDBMS.

However pg_isready says it is up (which is correct):

ubuntu@db0:/var/log/trove$ pg_isready
/var/run/postgresql:5432 - accepting connections

My suspicion is that the guestagent is attempting to shut Postgres down just as systemd is starting it, leaving the two with inconsistent views of the service state.
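
The mismatch is easy to see side by side with a quick throwaway check (not Trove code, just illustrative) run on the guest:

import subprocess

# systemctl exits non-zero when the unit is not active
unit_active = subprocess.call(['systemctl', 'is-active', '--quiet', 'postgresql']) == 0
# pg_isready exits 0 only when the server is accepting connections
pg_up = subprocess.call(['pg_isready', '--quiet']) == 0

# On the failed instance above this prints: systemd active: False, accepting: True
print('systemd active: %s, accepting: %s' % (unit_active, pg_up))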

I can 'fix' this by making the guestagent wait 100s before trying to stop Postgres (see the attached patch). However, that is just a workaround; we need to figure out how to stop the guestagent from trying to stop Postgres before it is properly up.
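
The attached patch is not reproduced here; it is essentially just a fixed sleep before the stop_db() call in do_prepare() (where the traceback above fails). A less arbitrary variant (a sketch only, untested against Trove) would be to poll until Postgres is actually accepting connections before asking it to stop:

import subprocess
import time

def wait_until_pg_accepting(timeout=120, interval=5):
    """Poll pg_isready until Postgres accepts connections, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if subprocess.call(['pg_isready', '--quiet']) == 0:
            return True
        time.sleep(interval)
    return False

# Hypothetical placement in the Postgres manager's do_prepare(), before stop_db():
#     if wait_until_pg_accepting():
#         self.app.stop_db()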

Revision history for this message
Mark Kirkwood (mark-kirkwood) wrote :

Guestagent log

Revision history for this message
Mark Kirkwood (mark-kirkwood) wrote :

Postgres log

Revision history for this message
Mark Kirkwood (mark-kirkwood-k) wrote :

FWIW, some testing shows the absolute minimum sleep time to be 30s, which is interesting given that command_process_timeout is 30s. Setting 35s was required for 100% reliable startup.
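
For reference, command_process_timeout is an oslo.config option in Trove with a 30s default; it is declared roughly like this (location trove/common/cfg.py and help text paraphrased from memory):

from oslo_config import cfg

# Declaration along these lines; the 30s default matches the minimum
# sleep observed above.
command_process_timeout_opt = cfg.IntOpt('command_process_timeout', default=30,
                                         help='Timeout for executing commands.')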
