serial message ack is slow

Bug #1734788 reported by gordon chung
This bug affects 2 people
Affects                          Status        Importance  Assigned to      Milestone
Ubuntu Cloud Archive             Fix Released  Medium      Unassigned
  Ocata                          Triaged       Medium      Unassigned
  Pike                           Triaged       Medium      Unassigned
oslo.messaging                   Fix Released  Critical    Mehdi Abaakouk
python-oslo.messaging (Ubuntu)   Fix Released  Medium      Unassigned
  Artful                         Triaged       Medium      Unassigned

Bug Description

This confirms sileht's guess: http://eavesdrop.openstack.org/irclogs/%23openstack-telemetry/%23openstack-telemetry.2017-11-21.log.html#t2017-11-21T14:17:07

In ceilometer we use a single thread per listener, and it appears that significantly more time is spent ack'ing messages than on everything else. I've made a hacky patch that adds logging; sample output: http://logs.openstack.org/09/520109/22/check/oslo.messaging-telemetry-dsvm-integration-rabbit/4d0cd5b/logs/screen-ceilometer-anotification.txt.gz

An example of slow ack'ing can be seen here:

Nov 27 23:23:19.309865 ubuntu-xenial-ovh-gra1-0001082777 ceilometer-agent-notification[16100]: 2017-11-27 23:23:19.309 16681 WARNING oslo_messaging.server [-] WHHHHHHAT(0): 1c152eebf46443dbb75a34e6cb08ad69, 5f93beb9-f0b3-4c9d-99f6-fa26a741d7ae
Nov 27 23:23:19.321231 ubuntu-xenial-ovh-gra1-0001082777 ceilometer-agent-notification[16100]: 2017-11-27 23:23:19.320 16681 WARNING oslo_messaging.notify.dispatcher [-] OKKKKKKK(f3f9b576-d3c9-11e7-986a-fa163eded301, 0:00:03.035383): 1c152eebf46443dbb75a34e6cb08ad69, 5f93beb9-f0b3-4c9d-99f6-fa26a741d7ae
Nov 27 23:23:20.892141 ubuntu-xenial-ovh-gra1-0001082777 ceilometer-agent-notification[16100]: 2017-11-27 23:23:20.888 16681 DEBUG ceilometer.publisher.gnocchi [-] 35 measures posted against 35 metrics through 10 resources batch_measures /opt/stack/new/ceilometer/ceilometer/publisher/gnocchi.py:421
Nov 27 23:23:20.892809 ubuntu-xenial-ovh-gra1-0001082777 ceilometer-agent-notification[16100]: 2017-11-27 23:23:20.888 16681 DEBUG ceilometer.publisher.gnocchi [-] Resource cache hit for update 6c185a83-36b9-46de-83b7-b73673ce57b8 _if_not_cached /opt/stack/new/ceilometer/ceilometer/publisher/gnocchi.py:456
Nov 27 23:23:20.893475 ubuntu-xenial-ovh-gra1-0001082777 ceilometer-agent-notification[16100]: 2017-11-27 23:23:20.889 16681 WARNING oslo_messaging.notify.dispatcher [-] DOOOOOOOONNNNNEE(f3f9b576-d3c9-11e7-986a-fa163eded301): 1c152eebf46443dbb75a34e6cb08ad69, 5f93beb9-f0b3-4c9d-99f6-fa26a741d7ae
Nov 27 23:23:24.315239 ubuntu-xenial-ovh-gra1-0001082777 ceilometer-agent-notification[16100]: 2017-11-27 23:23:24.313 16681 WARNING oslo_messaging.server [-] WHHHHHHAT(0): 7af7e903393b48d88e041aad1dce911f, 3d76c968-63b8-40f3-9dcb-7276d4e44688
Nov 27 23:23:29.319132 ubuntu-xenial-ovh-gra1-0001082777 ceilometer-agent-notification[16100]: 2017-11-27 23:23:29.318 16681 WARNING oslo_messaging.server [-] WHHHHHHAT(1): b882779aadd64a5a95b51b7743bbb19e, 08dcda5d-1241-4f1c-aabe-70a41b47b1fa
Nov 27 23:23:30.296866 ubuntu-xenial-ovh-gra1-0001082777 ceilometer-agent-notification[16100]: 2017-11-27 23:23:30.296 16681 WARNING oslo_messaging.notify.listener [-] FIIIIIIIIIINALLY: 1c152eebf46443dbb75a34e6cb08ad69, 5f93beb9-f0b3-4c9d-99f6-fa26a741d7ae

In the output above, all lines come from a single process (16681), and "1c152eebf46443dbb75a34e6cb08ad69" identifies one batch of messages:
- "WHHHHHHAT" is logged when a batch of messages is fetched and queued.
- "OKKKKKKK" is logged when the batch is picked up for processing (by the single listener thread).
- "DOOOOOOOONNNNNEE" is logged when the batch has been handled by all endpoints (~1.5s after pickup).
- "FIIIIIIIIIINALLY" is logged when all messages in the batch have been acknowledged (~10s after pickup).

There is no reason it should take ~10s to ack 35 messages when the real work of processing them takes only ~1.5s.
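
For illustration, the timing instrumentation in the hacky patch boils down to something like the sketch below (the helper name, structure, and log wording are made up here, not the actual patch; msg.acknowledge() stands in for whatever per-message ack the driver exposes). It logs how long dispatching the batch takes versus how long the serial per-message ack loop takes, which is the gap visible in the log excerpt above.

    import logging
    import time
    import uuid

    LOG = logging.getLogger(__name__)

    def process_batch(messages, dispatch):
        # Hypothetical helper: time dispatch vs. ack'ing for one batch.
        batch_id = uuid.uuid4().hex
        fetched = time.monotonic()
        LOG.warning("fetched batch %s (%d messages)", batch_id, len(messages))

        dispatch(messages)  # run all notification endpoints over the batch
        handled = time.monotonic()
        LOG.warning("handled batch %s in %.3fs", batch_id, handled - fetched)

        for msg in messages:
            msg.acknowledge()  # one ack per message, executed serially
        acked = time.monotonic()
        LOG.warning("acked batch %s in %.3fs", batch_id, acked - handled)

If each acknowledge() blocks until the driver's connection thread has actually performed the ack, the ack loop easily dominates the batch time, which is exactly the ~1.5s-vs-~10s split above.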

Revision history for this message
György Szombathelyi (gyurco) wrote :

Whatever the root cause is, oslo_messaging 5.17.1 doesn't have this issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/523431

Changed in oslo.messaging:
assignee: nobody → Mehdi Abaakouk (sileht)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/523431
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=c38857e1101027a734a35f4e80bc4084fabc034b
Submitter: Zuul
Branch: master

commit c38857e1101027a734a35f4e80bc4084fabc034b
Author: Mehdi Abaakouk <email address hidden>
Date: Tue Nov 28 15:47:54 2017 +0100

    rabbitmq: don't wait for message ack/requeue

    I don't see any obvious reason why we should wait until the ack/requeue is done.

    This waiter has already been removed from the amqp1 driver:

    https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/amqp1_driver/controller.py#n242

    So, this change removes it from the rabbitmq driver too.

    Closes-bug: #1734788

    Change-Id: I5ecedc762596181be19410b863851a0054fd6579
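
For context, the pattern this commit changes can be sketched roughly as follows (class and method names are hypothetical, not the actual oslo.messaging internals): listener threads schedule acks onto the connection's I/O thread, and before this change they also blocked until the I/O thread had executed the ack.

    import queue
    import threading

    class AckScheduler:
        # Listener threads enqueue acks; the connection I/O thread runs them.

        def __init__(self):
            self._pending = queue.Queue()

        def schedule(self, ack_callback):
            done = threading.Event()

            def _task():
                ack_callback()
                done.set()  # only observed by the pre-fix wait below

            self._pending.put(_task)
            # Before the fix, the listener thread blocked here until the
            # I/O thread had executed the ack:
            #     done.wait()
            # After the fix, schedule() returns immediately and the ack
            # completes asynchronously on the I/O thread.

        def process(self):
            # Called periodically from the connection/I/O thread.
            while True:
                try:
                    task = self._pending.get_nowait()
                except queue.Empty:
                    return
                task()

Dropping the wait lets the listener thread move on to the next batch while acks drain in the background, which removes the multi-second gap between "DOOOOOOOONNNNNEE" and "FIIIIIIIIIINALLY" seen in the bug description.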

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/524099

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/524100

gordon chung (chungg)
Changed in oslo.messaging:
importance: Undecided → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/pike)

Reviewed: https://review.openstack.org/524099
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=221664faa128557fa1eb415837675a55a1560131
Submitter: Zuul
Branch: stable/pike

commit 221664faa128557fa1eb415837675a55a1560131
Author: Mehdi Abaakouk <email address hidden>
Date: Tue Nov 28 15:47:54 2017 +0100

    rabbitmq: don't wait for message ack/requeue

    I don't see any obvious reason why we should wait until the ack/requeue is done.

    This waiter has already been removed from the amqp1 driver:

    https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/amqp1_driver/controller.py#n242

    So, this change removes it from the rabbitmq driver too.

    Closes-bug: #1734788

    Change-Id: I5ecedc762596181be19410b863851a0054fd6579
    (cherry picked from commit c38857e1101027a734a35f4e80bc4084fabc034b)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.34.1

This issue was fixed in the openstack/oslo.messaging 5.34.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/ocata)

Reviewed: https://review.openstack.org/524100
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=e1ac7f8862425e5a8f9b96e28234a0b59e579dbd
Submitter: Zuul
Branch: stable/ocata

commit e1ac7f8862425e5a8f9b96e28234a0b59e579dbd
Author: Mehdi Abaakouk <email address hidden>
Date: Tue Nov 28 15:47:54 2017 +0100

    rabbitmq: don't wait for message ack/requeue

    I don't see any obvious reason why we should wait until the ack/requeue is done.

    This waiter has already been removed from the amqp1 driver:

    https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/amqp1_driver/controller.py#n242

    So, this change removes it from the rabbitmq driver too.

    Closes-bug: #1734788

    Change-Id: I5ecedc762596181be19410b863851a0054fd6579
    (cherry picked from commit c38857e1101027a734a35f4e80bc4084fabc034b)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.30.2

This issue was fixed in the openstack/oslo.messaging 5.30.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.17.3

This issue was fixed in the openstack/oslo.messaging 5.17.3 release.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in python-oslo.messaging (Ubuntu):
status: New → Confirmed
Changed in python-oslo.messaging (Ubuntu):
status: Confirmed → Fix Released
Changed in python-oslo.messaging (Ubuntu Artful):
status: New → Triaged
importance: Undecided → Medium
Changed in cloud-archive:
status: New → Fix Released
importance: Undecided → Medium
Changed in python-oslo.messaging (Ubuntu):
importance: Undecided → Medium