Why does it take so long for the prepare() message to make it from the task manager to the guest?

Bug #1511623 reported by Amrith Kumar
This bug affects 1 person
Affects: OpenStack DBaaS (Trove)
Status: Fix Released
Importance: Medium
Assigned to: Amrith Kumar

Bug Description

We can now see that it could take a long time to get a message from the task manager to the guest agent.

Consider these messages from the host:

2015-10-30 03:05:44.066 31192 DEBUG trove.db.models [-] Saving DBInstance: {u'cluster_id': None, u'shard_id': None, u'deleted_at': None, u'id': u'0b7d8b3b-5cf9-4323-a3a9-fbf24665619e', u'datastore_version_id': u'8c29840f-edf5-433e-a80c-ab67c66cf9a8', 'errors': {}, u'hostname': None, u'server_status': None, u'task_description': u'The instance is building.', u'volume_size': 1, u'type': None, u'updated': datetime.datetime(2015, 10, 30, 3, 5, 44, 66689), '_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x7fbb5da12b90>, u'deleted': 0, u'configuration_id': None, u'volume_id': u'4d257db1-a856-49cc-8dad-df719ef23b44', u'slave_of_id': None, u'task_start_time': None, u'name': u'TEST_4188c881-9fac-41a6-bc13-8bd0279c0ad1', u'task_id': 5, u'created': datetime.datetime(2015, 10, 30, 3, 5, 38), u'tenant_id': u'e77d43662fd04881bd125ba69bc8a075', u'compute_instance_id': u'162c57f9-cc4e-4881-8408-9e766efe2235', u'flavor_id': u'7'} save /opt/stack/new/trove/trove/db/models.py:62
2015-10-30 03:05:44.074 31192 DEBUG trove.taskmanager.models [-] End _create_server_volume_individually for id: 0b7d8b3b-5cf9-4323-a3a9-fbf24665619e _create_server_volume_individually /opt/stack/new/trove/trove/taskmanager/models.py:764
2015-10-30 03:05:44.091 31192 DEBUG trove.taskmanager.models [-] Entering guest_prepare _guest_prepare /opt/stack/new/trove/trove/taskmanager/models.py:877
2015-10-30 03:05:44.095 31192 DEBUG trove.guestagent.api [-] Sending the call to prepare the Guest. prepare /opt/stack/new/trove/trove/guestagent/api.py:232
2015-10-30 03:05:44.099 31192 DEBUG oslo_messaging._drivers.amqp [-] Pool creating new connection create /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqp.py:103
2015-10-30 03:05:44.101 31192 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on localhost:5672
2015-10-30 03:05:44.114 31192 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 127.0.0.1:5672
2015-10-30 03:05:44.126 31192 DEBUG trove.guestagent.api [-] Casting prepare _cast /opt/stack/new/trove/trove/guestagent/api.py:76
2015-10-30 03:05:44.128 31192 DEBUG oslo_messaging._drivers.amqp [-] Pool creating new connection create /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqp.py:103
2015-10-30 03:05:44.130 31192 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on localhost:5672
2015-10-30 03:05:44.142 31192 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 127.0.0.1:5672
2015-10-30 03:05:44.142 31192 DEBUG oslo_messaging._drivers.amqpdriver [-] CAST unique_id: 616ac9a7d92144e6a80a64596dd6790d exchange 'openstack' topic 'guestagent.0b7d8b3b-5cf9-4323-a3a9-fbf24665619e' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:448

Here is the time at which the message makes it to the guest:

2015-10-30 03:12:16.688 31263 DEBUG trove.conductor.manager [-] Message from 0b7d8b3b-5cf9-4323-a3a9-fbf24665619e: trove.guestagent.datastore.mysql.manager_base received prepare(), invoking begin_install(). log_message /opt/stack/new/trove/trove/conductor/manager.py:141

That is a delay of about six and a half minutes.
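As a sanity check, the gap can be computed directly from the two timestamps quoted verbatim from the logs above (the CAST on the task manager host, and the conductor logging the guest's "received prepare()" message):

```python
from datetime import datetime

# Timestamps copied from the log lines above.
fmt = "%Y-%m-%d %H:%M:%S.%f"
cast = datetime.strptime("2015-10-30 03:05:44.142", fmt)      # CAST unique_id ... _send
received = datetime.strptime("2015-10-30 03:12:16.688", fmt)  # conductor logs guest's prepare()

print(received - cast)  # 0:06:32.546000
```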

Anecdotally, killing the task manager makes the message get to the guest sooner.

Amrith Kumar (amrith)
Changed in trove:
importance: Undecided → Medium
Amrith Kumar (amrith) wrote:

At least one part of this problem has been narrowed down to an unfortunate side effect of the oslo.messaging change that went into Juno. To understand the problem, a simple repro was created.

You can recreate the problem and experiment with the proposed fix by uncommenting a line in the test program.

To run the repro:

Copy server.py, server.conf, client.py, and client.conf to a Linux machine (I used Ubuntu) and execute the following:

python client.py --config-file client.conf

You will get output like this:

amrith@amrith-work:~/source/rabbit$ python client.py --config-file client.conf
2016-01-12 13:30:06.948 101415 DEBUG __main__ [-] client starting main client.py:27
2016-01-12 13:30:07.005 101415 DEBUG __main__ [-] Now calling get_rpc_server() main client.py:56
2016-01-12 13:30:07.010 101415 DEBUG __main__ [-] Now calling server.start() main client.py:59
2016-01-12 13:30:07.010 101415 DEBUG oslo_messaging._drivers.amqp [-] Pool creating new connection create /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqp.py:103
2016-01-12 13:30:07.024 101415 DEBUG __main__ [-] Now calling rpcserver.stop() main client.py:67
2016-01-12 13:30:07.025 101415 DEBUG oslo_messaging._drivers.amqp [-] Pool creating new connection create /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqp.py:103
2016-01-12 13:30:07.031 101415 DEBUG oslo_messaging._drivers.amqpdriver [-] CAST unique_id: 435d43890c394e5ab41bdd3b336012e7 exchange 'openstack' topic 'fbb9b530-3299-44d3-b048-ee2411afce0c' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:448
2016-01-12 13:30:07.033 101415 DEBUG __main__ [-] Going to sleep for 240s main client.py:89
2016-01-12 13:30:07.033 101415 DEBUG oslo_messaging._drivers.amqpdriver [-] received message unique_id: 435d43890c394e5ab41bdd3b336012e7 __call__ /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:195

--

In another window, execute

python server.py --config-file server.conf

You get something like this:

amrith@amrith-work:~/source/rabbit$ python server.py --config-file server.conf
2016-01-12 13:31:01.532 101448 DEBUG __main__ [-] server starting main server.py:54
2016-01-12 13:31:01.589 101448 DEBUG __main__ [-] Server listening for server = guestclient.fbb9b530-3299-44d3-b048-ee2411afce0c, topic = fbb9b530-3299-44d3-b048-ee2411afce0c main server.py:64
2016-01-12 13:31:01.594 101448 DEBUG __main__ [-] Calling server.start() main server.py:76
2016-01-12 13:31:01.594 101448 DEBUG oslo_messaging._drivers.amqp [-] Pool creating new connection create /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqp.py:103
2016-01-12 13:31:01.606 101448 DEBUG __main__ [-] Going to sleep for 240s now main server.py:79
2016-01-12 13:33:07.023 101448 DEBUG oslo_messaging._drivers.amqpdriver [-] received message unique_id: 435d43890c394e5ab41bdd3b336012e7 __call__ /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:195
2016-01-12 13:33:07.024 101448 DEBUG __main__ [-] got a ping from 2016-01-12T13:30:07.024895 at 2016-01-12 13:33:07.024389 ping server.py:41

Observe that the message sent from the client at about 2016-01-12 13:30:0...
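Note what the client log above shows: the client's own, already-stopped RPC server consumes the cast immediately (the "received message unique_id: 435d..." line at 13:30:07), while the real server only sees the same unique_id at 13:33:07. The following is a toy, stdlib-only model of that delivery pattern. It is an illustration of the suspected mechanism only, not oslo.messaging or kombu code; all class and method names here are invented.

```python
# Toy model: a message delivered to a stopped-but-still-connected consumer
# sits unacknowledged until that connection closes, and only then is it
# redelivered to the live consumer. (Illustration only; not real AMQP code.)

class ToyBroker:
    def __init__(self):
        self.queue = []       # undelivered messages
        self.consumers = []   # connections currently subscribed

    def subscribe(self, consumer):
        self.consumers.append(consumer)
        self._deliver()

    def publish(self, msg):
        self.queue.append(msg)
        self._deliver()

    def close(self, consumer):
        # Connection closed: requeue anything it never acknowledged.
        self.consumers.remove(consumer)
        self.queue = consumer.unacked + self.queue
        consumer.unacked = []
        self._deliver()

    def _deliver(self):
        while self.queue and self.consumers:
            # The oldest subscription gets the message first.
            self.consumers[0].receive(self.queue.pop(0))


class Consumer:
    def __init__(self, name, running=True):
        self.name, self.running = name, running
        self.unacked, self.handled = [], []

    def receive(self, msg):
        if self.running:
            self.handled.append(msg)   # processed and acknowledged
        else:
            self.unacked.append(msg)   # stopped server: message is stuck here


broker = ToyBroker()

# The client starts and then stops an RPC server, but its connection stays open.
stale = Consumer("client-side server", running=False)
broker.subscribe(stale)

broker.publish("prepare")   # the CAST ends up on the stale connection

live = Consumer("guest agent")
broker.subscribe(live)      # the real server comes up later and sees nothing

broker.close(stale)         # connection finally closed (or process killed)
print(live.handled)         # ['prepare'] -- delivered only now
```

This also matches the anecdote in the bug description: killing the task manager closes the stale connection, so the broker redelivers the message to the guest sooner.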


OpenStack Infra (hudson-openstack) wrote: Fix proposed to trove (master)

Fix proposed to branch: master
Review: https://review.openstack.org/266528

Changed in trove:
status: New → In Progress
Amrith Kumar (amrith)
Changed in trove:
milestone: none → mitaka-2
Craig Vyvial (cp16net) wrote:

OpenStack Infra (hudson-openstack) wrote: Fix merged to trove (master)

Reviewed: https://review.openstack.org/266528
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=78e6895b4e8a2eb7f0d4f1079ec28c798ba05d3c
Submitter: Jenkins
Branch: master

commit 78e6895b4e8a2eb7f0d4f1079ec28c798ba05d3c
Author: Amrith Kumar <email address hidden>
Date: Tue Jan 12 13:49:35 2016 -0500

    Sometimes prepare messages are delayed

    This change addresses one issue that manifests itself as delayed
    delivery of prepare messages on the guest. A full description of the
    change is found in the bug report along with a set of sample programs
    to recreate the problem and test the fix.

    I'm marking this as 'partial-bug' only because I'm not sure this is
    the only issue that can cause this. It is just that this one is
    certainly the one that has been plaguing me on my dev environment.

    Change-Id: I065fb7c7f674ce587d4a480ac2f3462826ba2604
    Partial-Bug: #1511623

Changed in trove:
status: In Progress → Fix Committed
status: Fix Committed → Fix Released