API hangs if it doesn't receive an ack from conductor

Bug #1510776 reported by Dimitry Ushakov
This bug affects 2 people
Affects  Status     Importance  Assigned to     Milestone
Magnum   Confirmed  Medium      Surojit Pathak

Bug Description

The API hangs while waiting for an ack from the Conductor when it is unable to send a message to RabbitMQ.

Test infrastructure:

$ uname -a
Linux containers-test 3.13.0-62-generic #102-Ubuntu SMP Tue Aug 11 14:29:36 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Stepping: 4
CPU MHz: 2800.000
BogoMIPS: 5601.67
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39

$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd     free   buff    cache   si   so    bi    bo   in   cs us sy id wa st
 5  0      0 95732384 315932 19664548    0    0    11    10    0    0  7  0 93  0  0

Test setup:

[magnum]
image_id = fedora-21-atomic-3
nic_id = public
keypair_id = default
flavor_id = m1.small

NOTES:

I did notice open sockets leaking in this environment (RabbitMQ set up out of the box - mismanaged).

$ lsof -i | grep amqp | wc -l
1027

$ lsof -i | grep amqp | grep magnum | wc -l
652
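
To see which processes hold the AMQP connections, the two `lsof` pipelines above can be combined into a per-command tally. The following is a minimal sketch, assuming the default `lsof -i` layout in which the first whitespace-separated field is the command name. The function name `count_amqp_sockets` and the sample lines are made up for illustration and are not output captured from this environment.

```python
from collections import Counter


def count_amqp_sockets(lsof_output):
    """Tally lines mentioning 'amqp' by command name (first column of lsof -i)."""
    counts = Counter()
    for line in lsof_output.splitlines():
        if "amqp" not in line:
            continue  # same filter as `grep amqp`
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


# Hypothetical sample input, shaped like `lsof -i` output (not real data).
SAMPLE_LSOF = """\
COMMAND    PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
magnum-ap 4242    stack   12u  IPv4 111111      0t0  TCP localhost:51234->localhost:amqp (ESTABLISHED)
magnum-ap 4242    stack   13u  IPv4 111112      0t0  TCP localhost:51235->localhost:amqp (ESTABLISHED)
magnum-co 4300    stack    9u  IPv4 111113      0t0  TCP localhost:51236->localhost:amqp (ESTABLISHED)
beam.smp  2011 rabbitmq   55u  IPv6 111114      0t0  TCP *:amqp (LISTEN)
sshd      1001     root    3u  IPv4 111115      0t0  TCP *:ssh (LISTEN)
"""
```

Calling `count_amqp_sockets(SAMPLE_LSOF)` returns a `Counter` keyed by command, so a count that keeps growing for one command between runs points at the process that is leaking.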

I asked others to reproduce this, and those using bare-metal servers
saw similar behavior. Digging in further, it looks like oslo-messaging
opens a new connection every time it attempts to send a message to
rabbit, even though rabbit might not be available to receive the
message (for whatever reason).

$ sudo rabbitmqctl status
Status of node 'rabbit@containers-test' ...
[{pid,2011},
{running_applications,[{rabbit,"RabbitMQ","3.5.6"},
{file_descriptors,[{total_limit,102300},
                    {total_used,1038},
                    {sockets_limit,92068},
                    {sockets_used,1036}]},
{processes,[{limit,1048576},{used,13132}]},
{run_queue,0},
{uptime,510653}]

As you can see here, sockets are not cleaned up but the run_queue is
empty.

hongbin (hongbin034)
Changed in magnum:
status: New → Confirmed
importance: Undecided → Medium
Adrian Otto (aotto)
Changed in magnum:
milestone: none → mitaka-1
Changed in magnum:
assignee: nobody → Surojit Pathak (suro-patz)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to magnum (master)

Fix proposed to branch: master
Review: https://review.openstack.org/264333

Changed in magnum:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to magnum (master)

Reviewed: https://review.openstack.org/264333
Committed: https://git.openstack.org/cgit/openstack/magnum/commit/?id=873214b6e2d8f206ba4a2750e911998c2000745a
Submitter: Jenkins
Branch: master

commit 873214b6e2d8f206ba4a2750e911998c2000745a
Author: Surojit Pathak <email address hidden>
Date: Wed Jan 6 19:30:10 2016 +0000

    Fix socket descriptor leak

    The connection to amqp was not getting cleaned up, even after the
    communication to conductor across amqp was complete, for a given
    request. Thus, sockets were leaking with each communication and finally
    led to a hang situation, where no more fds were available.

    Change-Id: I1deabdbce6ba448fe4c25d7694aabe5e5fec7b5a
    Closes-Bug: #1510776
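
The leak described in the commit message, and the general shape of a per-request cleanup, can be illustrated with a small sketch. This is not the code from the review above. `FakeTransport`, `call_conductor_leaky`, and `call_conductor` are hypothetical names, and the class only counts open "connections" instead of talking to AMQP.

```python
class FakeTransport:
    """Hypothetical stand-in for an AMQP connection; not the oslo.messaging API."""

    open_count = 0  # how many instances currently hold a "socket"

    def __init__(self):
        FakeTransport.open_count += 1
        self.closed = False

    def call(self, method, **kwargs):
        if self.closed:
            raise RuntimeError("transport already closed")
        return {"method": method, "args": kwargs}

    def cleanup(self):
        # Idempotent: releasing twice must not corrupt the counter.
        if not self.closed:
            self.closed = True
            FakeTransport.open_count -= 1


def call_conductor_leaky(method, **kwargs):
    """One connection per request, never released: the count only grows."""
    transport = FakeTransport()
    return transport.call(method, **kwargs)


def call_conductor(method, **kwargs):
    """Release the connection once the request completes, even on error."""
    transport = FakeTransport()
    try:
        return transport.call(method, **kwargs)
    finally:
        transport.cleanup()
```

With the leaky variant, `FakeTransport.open_count` rises by one per request. With the `try`/`finally` variant it returns to its previous value after each call, which is the behavior the commit message is aiming for.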

Changed in magnum:
status: In Progress → Fix Released
hongbin (hongbin034) wrote :

Reopening this bug, since the fix has been reverted: https://review.openstack.org/#/c/274910/

Changed in magnum:
status: Fix Released → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/magnum 2.0.0

This issue was fixed in the openstack/magnum 2.0.0 release.
