Volumes/instances couldn't be created after destroying a controller due to oslo.messaging._drivers.impl_rabbit [-] Failed to publish message to topic 'conductor': [Errno 110] Connection timed out

Bug #1384785 reported by Andrey Sledzinskiy
This bug affects 1 person
Affects              Status        Importance  Assigned to
Fuel for OpenStack   Invalid       High        Bogdan Dobrelya
Mirantis OpenStack   Fix Released  High        MOS Nova
5.1.x                Invalid       High        MOS Nova

Bug Description

{

    "build_id": "2014-10-22_00-01-06",
    "ostf_sha": "de177931b53fbe9655502b73d03910b8118e25f1",
    "build_number": "36",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "f4bf25da24c4e5b0d9eb86493945200deba3d92e",
    "production": "docker",
    "fuelmain_sha": "dab17913263bbea7e9a3b55de8a0f3af5ac0e3e2",
    "astute_sha": "6a11a7c481d116e6cfdb422fab1d4bbb29cbea1c",
    "feature_groups": [
        "mirantis"
    ],
    "release": "6.0",
    "release_versions": {
        "2014.2-6.0": {
            "VERSION": {
                "build_id": "2014-10-22_00-01-06",
                "ostf_sha": "de177931b53fbe9655502b73d03910b8118e25f1",
                "build_number": "36",
                "api": "1.0",
                "nailgun_sha": "f4bf25da24c4e5b0d9eb86493945200deba3d92e",
                "production": "docker",
                "fuelmain_sha": "dab17913263bbea7e9a3b55de8a0f3af5ac0e3e2",
                "astute_sha": "6a11a7c481d116e6cfdb422fab1d4bbb29cbea1c",
                "feature_groups": [
                    "mirantis"
                ],
                "release": "6.0",
                "fuellib_sha": "af2bf11d4a3a075fa4e9fa9b7b7209af29498a46"
            }
        }
    },
    "fuellib_sha": "af2bf11d4a3a075fa4e9fa9b7b7209af29498a46"

}

Steps:
1. Create a cluster: CentOS, HA, Neutron GRE, Cinder for volumes, 3 controllers + 2 computes + 1 cinder
2. Deploy cluster
3. Destroy non-primary controller
4. Try to create any instance/volume

Expected - the instance can be created
Actual - creation fails

Errors in nova/compute.log from both compute nodes (node-2, node-6):
2014-10-23 00:56:04.661 32264 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to publish message to topic 'conductor': [Errno 110] Connection timed out
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 655, in ensure
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit return method()
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 752, in _publish
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit publisher = cls(self.conf, self.channel, topic=topic, **kwargs)
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 378, in __init__
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit **options)
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 326, in __init__
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit self.reconnect(channel)
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 334, in reconnect
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit routing_key=self.routing_key)
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 84, in __init__
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit self.revive(self._channel)
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 218, in revive
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit self.declare()
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 104, in declare
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit self.exchange.declare()
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/entity.py", line 166, in declare
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit nowait=nowait, passive=passive,
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/amqp/channel.py", line 620, in exchange_declare
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit (40, 11), # Channel.exchange_declare_ok
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/amqp/abstract_channel.py", line 67, in wait
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit self.channel_id, allowed_methods)
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/amqp/connection.py", line 237, in _wait_method
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit self.method_reader.read_method()
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/amqp/method_framing.py", line 189, in read_method
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit raise m
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit error: [Errno 110] Connection timed out
2014-10-23 00:56:04.661 32264 TRACE oslo.messaging._drivers.impl_rabbit
2014-10-23 00:56:04.663 32264 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-10-23 00:56:05.664 32264 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 10.108.22.6:5673
2014-10-23 00:56:05.675 32264 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.108.22.6:5673
2014-10-23 00:56:20.522 32264 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Socket closed
2014-10-23 00:56:20.522 32264 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2014-10-23 00:56:20.522 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 655, in ensure
2014-10-23 00:56:20.522 32264 TRACE oslo.messaging._drivers.impl_rabbit return method()
2014-10-23 00:56:20.522 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 735, in _consume
2014-10-23 00:56:20.522 32264 TRACE oslo.messaging._drivers.impl_rabbit return self.connection.drain_events(timeout=timeout)
2014-10-23 00:56:20.522 32264 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/kombu/connection.py", line 279, in drain_events
2014-10-23 00:56:20.522 32264 TRACE oslo.messaging._drivers.impl_rabbit return self.transport.drain_events(self.connection, **kwargs)

Logs are attached

Tags: nova messaging
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Looks like the heartbeats patch didn't make it into this ISO: https://review.fuel-infra.org/#/q/project:openstack/oslo.messaging
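
For context, AMQP heartbeats let a client notice a dead broker within a couple of heartbeat intervals instead of blocking until the kernel's TCP timeout fires, which is what produces the [Errno 110] trace above. A toy model of that detection logic (illustrative sketch only, not the oslo.messaging implementation):

```python
import time

class HeartbeatMonitor:
    """Toy model of AMQP heartbeat detection: if no frame arrives
    within 2x the heartbeat interval, declare the connection dead so
    the client can reconnect promptly. Illustrative only."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self.clock = clock          # injectable clock, useful for testing
        self.last_seen = clock()

    def frame_received(self):
        # Any traffic (including a heartbeat frame) refreshes the timer.
        self.last_seen = self.clock()

    def is_dead(self):
        return self.clock() - self.last_seen > 2 * self.interval
```

With a 1-second interval, a broker that falls silent is flagged dead after roughly 2 seconds, rather than the multi-minute TCP timeout seen in the traceback.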

tags: added: messaging nova
Changed in mos:
status: New → Confirmed
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Let's test it again when we have the latest messaging code in oslo.messaging

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

The fix for oslo.messaging didn't make it to the ISO you tested. Should be fixed now.

Changed in mos:
status: Confirmed → Fix Committed
Revision history for this message
Egor Kotko (ykotko) wrote :

Got the same issue on 5.1.1
{"build_id": "2014-11-11_20-08-59", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "33", "auth_required": true, "api": "1.0", "nailgun_sha": "bbc9dfe78a0c33040dcd16de9a40a3491788719c", "production": "docker", "fuelmain_sha": "88d4289e88e4bd88a3cabaf15a11ae8fc9ded53f", "astute_sha": "702af3db6f5bca92525bc8322d7d5d7675ec857e", "feature_groups": ["mirantis"], "release": "5.1.1", "release_versions": {"2014.1.1-5.1.1": {"VERSION": {"build_id": "2014-11-11_20-08-59", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "33", "api": "1.0", "nailgun_sha": "bbc9dfe78a0c33040dcd16de9a40a3491788719c", "production": "docker", "fuelmain_sha": "88d4289e88e4bd88a3cabaf15a11ae8fc9ded53f", "astute_sha": "702af3db6f5bca92525bc8322d7d5d7675ec857e", "feature_groups": ["mirantis"], "release": "5.1.1", "fuellib_sha": "e5b3de834a400d98d8c6ba416249832a0c16076c"}}}, "fuellib_sha": "e5b3de834a400d98d8c6ba416249832a0c16076c"}

Revision history for this message
Egor Kotko (ykotko) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Egor, I checked the ISO you tested, and it contains the latest oslo.messaging code. From what I see, the issue you are seeing has nothing to do with this bug. According to the logs from the fuel snapshot you provided, all 3 controllers became unreachable simultaneously, so there is nothing Nova/oslo.messaging can do here.

Can you guys check that the RabbitMQ cluster works properly after the failure of 1 node?
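
One simple way to sanity-check the broker cluster after a node is destroyed is to compare the nodes reported as running against the configured membership (in practice, both lists come from `rabbitmqctl cluster_status` on a surviving controller). A hypothetical helper, assuming a majority-based health criterion:

```python
def cluster_healthy(configured_nodes, running_nodes):
    """Hypothetical health check: treat the cluster as usable while a
    majority of its configured nodes are still running. The node lists
    would come from `rabbitmqctl cluster_status` in a real check."""
    running = set(running_nodes) & set(configured_nodes)
    return len(running) > len(configured_nodes) // 2

nodes = ["rabbit@node-1", "rabbit@node-2", "rabbit@node-3"]
print(cluster_healthy(nodes, nodes[1:]))  # one controller down -> True
print(cluster_healthy(nodes, []))         # all controllers down -> False
```

With 3 controllers, losing 1 node still leaves a majority, which matches the expectation in the test scenario; losing all 3 simultaneously (as in Egor's snapshot) cannot be survived by any client-side logic.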

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Nova is trying to connect to a RabbitMQ instance in a loop: http://xsnippet.org/360274/raw/
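
The snippet shows the client cycling through connect attempts with a fixed delay, matching the "Delaying reconnect for 1.0 seconds..." lines in the log above. A minimal sketch of such a retry loop (assumed shape, not the actual oslo.messaging `ensure()`):

```python
import itertools
import time

def ensure(connect, max_retries=None, retry_interval=1.0):
    """Keep calling `connect` until it succeeds, sleeping between
    attempts -- the pattern behind the reconnect loop in the log.
    Sketch only; the real oslo.messaging ensure() differs in detail."""
    for attempt in itertools.count(1):
        try:
            return connect()
        except OSError:
            # With max_retries=None the loop retries forever, which is
            # why the log shows an endless reconnect cycle.
            if max_retries is not None and attempt >= max_retries:
                raise
            time.sleep(retry_interval)
```

If the broker never comes back (e.g. the whole cluster is down), such a loop spins indefinitely, which is consistent with the behavior Roman observed.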

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Since we consider the bug to be in Fuel, moving it to Invalid state in MOS

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
importance: Undecided → High
milestone: none → 6.0
status: New → Confirmed
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The test case description looks invalid; the proper steps should be:
3. Destroy non-primary controller
4. Wait for HA health check passed
5. Try to create any instance/volume

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Here is the sequence of events. It shows, first, that the *primary* controller was shut down instead of a non-primary one and, second, that the failover procedure hadn't finished yet (see the missing step 4 above):
http://pastebin.com/GV0dtZ2R

Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Alexander Gubanov (ogubanov) wrote :

I verified it on MOS 6.0 (build 56) - fixed!
To reproduce, I "destroyed" a non-primary controller, ran the OSTF tests for HA, and then created an instance - it was built successfully.
Env: MOS 6.0 (build 56), Ubuntu, HA, Neutron VLAN, 3 controllers, 2 computes, Cinder
Proof: http://pastebin.com/xJX9rA0e

Changed in mos:
status: Fix Committed → Fix Released