Nailgun Receiver hangs and does not process Astute messages

Bug #1541885 reported by Ihor Kalnytskyi
This bug affects 1 person
Affects             Status         Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Committed  Medium      Georgy Kibardin
Mitaka              In Progress    Medium      Bulat Gaifullin
Newton              Fix Committed  Medium      Georgy Kibardin

Bug Description

Sometimes, if something goes wrong (an unexpected RabbitMQ crash or something similar), the TCP connections are invalidated. However, Nailgun Receiver may not be notified of this and may get stuck in a 'recv' call:

    [root@nailgun ~]# strace -p 44
    Process 44 attached
    recvfrom(3,

where

    3 is the socket connection to RabbitMQ
    44 is the PID of the receiver process

However, on the RabbitMQ side we can see that the Nailgun queue has no consumers (the Nailgun queue is used to retrieve messages from Astute), while there are messages waiting in the queue:

    [root@nailgun /]# rabbitmqctl list_queues name consumers messages
    Listing queues ...
    nailgun 0 7
    naily 7 0
    naily_service_1196f899-d676-4115-9b84-88df0dca8e48 1 0
    naily_service_4c4d3844-2721-4184-ac0d-ee381cceb230 1 0
    naily_service_72950288-2750-419d-bc01-0c5af39cb5cb 1 0
    naily_service_bd284d98-c028-48ee-a487-e1f14fa0b91a 1 0
    naily_service_c94ecd5e-0501-41be-8ce7-f799427fbf7d 1 0
    naily_service_daf069d2-bc0a-4781-adb6-467ec559d1f9 1 0
    naily_service_ed4a53ce-d198-4e8d-a8db-2648f606da26 1 0
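
The symptom in this listing (a queue with pending messages but zero consumers) can be checked automatically. The following is only a minimal sketch: the function name `find_stalled_queues` and the `SAMPLE` text are hypothetical, and it assumes the three-column output of `rabbitmqctl list_queues name consumers messages` shown above.

```python
def find_stalled_queues(listing):
    """Return (name, messages) for queues that hold messages but have no consumers."""
    stalled = []
    for line in listing.splitlines():
        parts = line.split()
        # Data rows have exactly three columns: name, consumers, messages.
        if len(parts) != 3:
            continue
        name, consumers, messages = parts
        # Skip banner lines such as "Listing queues ...".
        if not (consumers.isdigit() and messages.isdigit()):
            continue
        if int(consumers) == 0 and int(messages) > 0:
            stalled.append((name, int(messages)))
    return stalled


# Shortened copy of the listing from this report.
SAMPLE = """\
Listing queues ...
nailgun 0 7
naily 7 0
naily_service_1196f899-d676-4115-9b84-88df0dca8e48 1 0
"""

print(find_stalled_queues(SAMPLE))  # [('nailgun', 7)]
```

In practice the listing would come from running the `rabbitmqctl` command itself, for example through `subprocess`.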

Moreover, the socket on the Receiver side has only the "FREAD" flag set, while it should at least have "O_NONBLOCK" as well (since it is initially opened that way). This is very strange, and it means that we hang waiting for input from the socket and will find out that the connection is dead only when TCP keepalive checks it (after 7200 seconds by default).

Apparently, we must have some mechanism to detect this sooner. We need to either enable RabbitMQ heartbeats for the Receiver or decrease the TCP keepalive timeout to, let's say, 1 minute.
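
For the keepalive option, a per-socket override might look roughly like the sketch below. This is not the Receiver's code: `enable_keepalive` is a hypothetical helper, the values (30 s idle, 10 s between probes, 3 probes) are illustrative and give a worst-case detection time of about one minute, and the `TCP_KEEP*` options are platform-specific, so they are applied only where Python exposes them.

```python
import socket


def enable_keepalive(sock, idle=30, interval=10, probes=3):
    """Turn on TCP keepalive with short timers for a single socket.

    A dead peer is detected after roughly idle + interval * probes seconds.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained timers are not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
# A non-zero value means keepalive is enabled on this socket.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
sock.close()
```

The alternative is to lower the system-wide defaults, which affects every connection on the host rather than only the Receiver's socket.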

P.S. The issue with the hung receiver has occurred a few times on QA environments.

Dmitry Pyzhov (dpyzhov)
tags: added: module-nailgun
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check was performed automatically)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Georgy Kibardin (gkibardin)
status: Confirmed → In Progress
tags: added: keep-in-9.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/299270

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Georgy Kibardin (<email address hidden>) on branch: master
Review: https://review.openstack.org/299270

Georgy Kibardin (gkibardin) wrote :

It seems that the problem only appears in development use cases, i.e. when a VM is reverted to a snapshot. Introducing heartbeats requires additional implementation work: reconnects must be implemented, since kombu doesn't implement them.
Taking this into account, I propose to move this to the next release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/317987

Changed in fuel:
status: Confirmed → In Progress
Georgy Kibardin (gkibardin) wrote :

It turns out I was wrong: reconnection on an unanswered heartbeat is already implemented in ConsumerMixin. So we just need to turn heartbeats on.
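
To illustrate the idea behind heartbeat-based detection, here is a minimal, library-independent sketch. It is not kombu's or ConsumerMixin's implementation: the `HeartbeatMonitor` class and its methods are hypothetical, and the "two missed intervals" threshold follows the AMQP 0-9-1 recommendation that a peer seeing no traffic for two heartbeat intervals should treat the connection as dead.

```python
import time


class HeartbeatMonitor:
    """Track incoming traffic and report when a connection looks stalled."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval          # negotiated heartbeat interval, seconds
        self._clock = clock               # injectable for testing
        self._last_seen = clock()

    def on_traffic(self):
        """Call whenever any frame (data or heartbeat) arrives from the broker."""
        self._last_seen = self._clock()

    def is_stalled(self):
        """True once no traffic has been seen for two heartbeat intervals."""
        return self._clock() - self._last_seen >= 2 * self.interval


# Demonstration with a fake clock so that no real waiting is needed.
now = [0.0]
monitor = HeartbeatMonitor(interval=30, clock=lambda: now[0])

now[0] = 45.0
print(monitor.is_stalled())  # False: less than two intervals of silence

now[0] = 61.0
print(monitor.is_stalled())  # True: the caller should drop the socket and reconnect
```

Unlike a blocking `recv`, a consumer loop that polls with a timeout and consults such a check can notice a dead broker within about a minute instead of waiting for TCP keepalive.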

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/317987
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=1b448f6d98a2199c9951d660a0f0238d22030f74
Submitter: Jenkins
Branch: master

commit 1b448f6d98a2199c9951d660a0f0238d22030f74
Author: Georgy Kibardin <email address hidden>
Date: Wed May 18 13:57:11 2016 +0300

    Turning heartbeats on

    So that stalled connection could be detected and reconnects are
    attempted.

    Change-Id: Ie71d4e7049ea8011db7c95f10abc13af645816ec
    Closes-Bug: #1541885

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/378904

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (stable/mitaka)

Change abandoned by Andreas Jaeger (<email address hidden>) on branch: stable/mitaka
Review: https://review.opendev.org/378904
Reason: This repo is retired now, no further work will get merged.
