[mos] Floating IP is not accessible because of problems with neutron-l3-agent

Bug #1338966 reported by Anastasia Palkina
This bug affects 1 person
Affects             Status   Importance  Assigned to        Milestone
Mirantis OpenStack  Invalid  High        Anastasia Palkina
  5.0.x             Invalid  High        Eugene Nikanorov
  5.1.x             Invalid  High        Anastasia Palkina
  6.0.x             Invalid  High        Anastasia Palkina

Bug Description

"build_id": "2014-07-07_00-31-14",
"mirantis": "yes",
"build_number": "103",
"ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f",
"nailgun_sha": "217ae694e487211fc8e352e4a45c0ef66644e1d8",
"production": "docker",
"api": "1.0",
"fuelmain_sha": "cd72d39823b87aa5b7506a6e4f7d6ab0ed32de7b",
"astute_sha": "644d279970df3daa5f5a2d2ccf8b4d22d53386ff",
"release": "5.0.1",
"fuellib_sha": "869acab37a78d018a0806e6fc6b76aabb2cdf5f0"

1. Create new environment (Ubuntu, HA mode)
2. Choose VLAN segmentation
3. Choose Ceph for both volumes and images
4. Choose Murano
5. Add 3 controllers, 1 compute, 3 ceph
6. Start deployment
7. Stop deployment during provisioning
8. Wait until nodes become 'Pending addition'
9. Start deployment again. It was successful
10. Start OSTF tests
11. Test "Check network connectivity from instance via floating IP" failed with the error "Instance is not reachable by IP. Please refer to OpenStack logs for more details." on step "5. Check connectivity to the floating IP using ping command."
12. Log in to Horizon
13. Manually create an instance, add a security group with port 22 open, and associate a floating IP
14. The floating IP is not accessible

root@node-1:~# ssh cirros@192.168.111.11
ssh: connect to host 192.168.111.11 port 22: Connection timed out

Controllers: node-1, node-2, node-3
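
For reference, a minimal sketch of the manual check from steps 13-14, assuming a controller node and the standard Icehouse-era clients; the security group name and floating IP below are placeholders, not values taken from this environment:

root@node-1:~# nova list                                  # note the instance's floating IP
root@node-1:~# nova secgroup-list-rules <secgroup>        # confirm TCP 22 (and ICMP, for ping) are allowed
root@node-1:~# ping -c 3 <floating-ip>
root@node-1:~# ssh cirros@<floating-ip>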

Tags: neutron
Revision history for this message
Anastasia Palkina (apalkina) wrote :
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

I've checked the problem env while it was still alive. The problem is caused by /usr/bin/neutron-l3-agent, which got stuck:

2014-07-07 12:27:11.870 22389 INFO neutron.agent.l3_agent [req-da7de6b8-11bb-436c-a4a6-039980babbe2 None] L3 agent started
2014-07-07 12:27:13.402 22389 ERROR neutron.openstack.common.rpc.common [req-da7de6b8-11bb-436c-a4a6-039980babbe2 None] Failed to consume message from queue: (0, 0): (541) INTERNAL_ERROR
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common Traceback (most recent call last):
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/impl_kombu.py", line 594, in ensure
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common return method(*args, **kwargs)
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/impl_kombu.py", line 672, in _consume
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common queues_tail.consume(nowait=False)
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/impl_kombu.py", line 194, in consume
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common self.queue.consume(*args, callback=_callback, **options)
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/kombu/entity.py", line 611, in consume
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common nowait=nowait)
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/amqp/channel.py", line 1787, in basic_consume
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common (60, 21), # Channel.basic_consume_ok
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 67, in wait
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common self.channel_id, allowed_methods)
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 270, in _wait_method
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common self.wait()
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 69, in wait
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common return self.dispatch_method(method_sig, args, content)
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 87, in dispatch_method
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common return amqp_method(self, args)
2014-07-07 12:27:13.402 22389 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 526, in _close
2014-07-07 12:27:13.402 2238...
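
A hedged way to confirm the agent is stuck and to restart it (neutron agent-list is the standard client call; the pacemaker resource name p_neutron-l3-agent is how Fuel HA deployments usually manage the agent and is an assumption here):

root@node-1:~# neutron agent-list                         # the L3 agent should report as alive; xxx means it stopped checking in
root@node-1:~# crm resource status p_neutron-l3-agent     # assumes the agent is under pacemaker control (Fuel HA)
root@node-1:~# crm resource restart p_neutron-l3-agent    # restart the stuck agent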


Changed in mos:
assignee: nobody → MOS Neutron (mos-neutron)
Changed in mos:
milestone: none → 5.0.1
tags: added: neutron
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Changed target MOS milestone to 5.1: unless this is fixed in the next couple of hours before we build 5.0.1 RC, or unless it's critical and should block 5.0.1 release, it will miss 5.0.1.

Changed in mos:
milestone: 5.0.1 → 5.1
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

That is actually a problem with rabbit-mq, crashing with the reason (10.20.0.2\var\log\docker-logs\rabbitmq\):
    exception exit: client_timeout
      in function gen_server2:terminate/3
    ancestors: [<0.12639.0>,rabbit_stomp_client_sup_sup,rabbit_stomp_sup,
                  <0.260.0>]

That in turn means that the MQ server has missed a heartbeat from the stomp client.

This thread seems to have the resolution for the issue:

http://markmail.org/thread/kglwnkr7446ffgu4
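
A rough sketch of how to check RabbitMQ health on a controller in this situation (standard rabbitmqctl calls; the grep pattern simply matches the fanout queue named in the traceback above):

root@node-1:~# rabbitmqctl cluster_status                                   # all controllers should be listed as running nodes
root@node-1:~# rabbitmqctl list_queues name messages consumers | grep l3    # the l3_agent fanout queue should have a consumer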

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This happens on the master node and has nothing to do with the OpenStack cluster. How could this affect floating IP configuration?

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Vladimir, apparently the L3 agent can't get notifications or fetch info from the Neutron server because of the RabbitMQ issue, and thus it is unable to set up the floating IP properly.

As for the master node: could it be that the logs are delivered there while the RabbitMQ server is failing on one of the controllers?

Revision history for this message
Sergey Vasilenko (xenolog) wrote :

Network 192.168.111.0/24 is the tenant's private address space. This subnet shouldn't be accessible from anywhere; it can be reached only from the network namespaces of the L3 and DHCP agents. SSH to these addresses should work only from those namespaces.
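
A minimal sketch of such a check from a controller, assuming the usual qrouter/qdhcp namespace naming; the router ID is a placeholder:

root@node-1:~# ip netns list | grep -E 'qrouter|qdhcp'
root@node-1:~# ip netns exec qrouter-<router-id> ping -c 3 192.168.111.11
root@node-1:~# ip netns exec qrouter-<router-id> ssh cirros@192.168.111.11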

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

eugene:

"That is actually a problem with rabbit-mq, crashing with the reason (10.20.0.2\var\log\docker-logs\rabbitmq\):"

You are looking at the RabbitMQ logs of the RabbitMQ container on the master node; this is absolutely not related to the OpenStack cluster under discussion.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Not reproduced on my environment:

KVM, HA mode, ISO #103
{"build_id": "2014-07-07_00-31-14", "mirantis": "yes", "build_number": "103", "ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f", "nailgun_sha": "217ae694e487211fc8e352e4a45c0ef66644e1d8", "production": "docker", "api": "1.0", "fuelmain_sha": "cd72d39823b87aa5b7506a6e4f7d6ab0ed32de7b", "astute_sha": "644d279970df3daa5f5a2d2ccf8b4d22d53386ff", "release": "5.0.1", "fuellib_sha": "869acab37a78d018a0806e6fc6b76aabb2cdf5f0"}

All Sanity and Functional OSTF tests passed.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

OK, thanks for the guidance; I've found the logs from the controller.

It's still an issue with RabbitMQ:

2014-07-07T12:27:22.315162+00:00 err: =CRASH REPORT==== 7-Jul-2014::12:27:13 ===
2014-07-07T12:27:22.315162+00:00 err: crasher:
2014-07-07T12:27:22.315162+00:00 err: initial call: gen:init_it/6
2014-07-07T12:27:22.315162+00:00 err: pid: <0.11493.0>
2014-07-07T12:27:22.315162+00:00 err: registered_name: []
2014-07-07T12:27:22.315162+00:00 err: exception exit: {{function_clause,
2014-07-07T12:27:22.315162+00:00 err: [{rabbit_mirror_queue_slave,terminate,
2014-07-07T12:27:22.315162+00:00 err: [{function_clause,
2014-07-07T12:27:22.315162+00:00 err: [{rabbit_mirror_queue_slave,handle_pre_hibernate,
2014-07-07T12:27:22.315162+00:00 err: [{not_started,
2014-07-07T12:27:22.315162+00:00 err: {amqqueue,
2014-07-07T12:27:22.315162+00:00 err: {resource,<<"/">>,queue,
2014-07-07T12:27:22.315162+00:00 err: <<"l3_agent_fanout_f98ecb4c3ab34472b130bce536715ef0">>},
2014-07-07T12:27:22.315162+00:00 err: false,true,none,
2014-07-07T12:27:22.315162+00:00 err: [{<<"x-ha-policy">>,longstr,<<"all">>}],
2014-07-07T12:27:22.315162+00:00 err: <0.11493.0>,[],[],
2014-07-07T12:27:22.315162+00:00 err: [{vhost,<<"/">>},
2014-07-07T12:27:22.315162+00:00 err: {name,<<"ha-all">>},
2014-07-07T12:27:22.315162+00:00 err: {pattern,<<".">>},
2014-07-07T12:27:22.315162+00:00 err: {'apply-to',<<"all">>},
2014-07-07T12:27:22.315162+00:00 err: {definition,
2014-07-07T12:27:22.315162+00:00 err: [{<<"ha-mode">>,<<"all">>},
2014-07-07T12:27:22.315162+00:00 err: {<<"ha-sync-mode">>,<<"automatic">>}]},
2014-07-07T12:27:22.315162+00:00 err: {priority,0}],
2014-07-07T12:27:22.315162+00:00 err: [{<0.11495.0>,<0.11493.0>}],
2014-07-07T12:27:22.315162+00:00 err: []}}]},
2014-07-07T12:27:22.315162+00:00 err: {gen_server2,pre_hibernate,1},
2014-07-07T12:27:22.315162+00:00 err: {proc_lib,init_p_do_apply,3}]},
2014-07-07T12:27:22.315162+00:00 err: {not_started,
2014-07-07T12:27:22.315162+00:00 err: {amqqueue,
2014-07-07T12:27:22.315162+00:00 err: {resource,<<"/">>,queue,
2014-07-07T12:27:22.315162+00:00 err: <<"l3_agent_fanout_f98ecb4c3ab34472b130bce536715ef0">>},
2014-07-07T12:27:22.315162+00:00 err: false,true,none,
2014-07-07T12:27:22.315162+00:00 err: [{<<"x-ha-policy">>,longstr,<<"all">>}],
2014-07-07T12:27:22.315162+00:00 err: <0.11493.0>,...


Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Not reproduced on another of my environments with ISO #110: CentOS, simple mode, 1 controller and 1 compute, Neutron GRE.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Eugene,

can you please file a separate bug about the RabbitMQ errors?

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

This issue is not reproduced on environments with CentOS; we need to try to reproduce it on Ubuntu.

Revision history for this message
Anastasia Palkina (apalkina) wrote :

Reproduced on ISO #111
"build_id": "2014-07-09_14-02-00",
"mirantis": "yes",
"build_number": "111",
"ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f",
"nailgun_sha": "f5ff82558f99bb6ca7d5e1617eddddf7142fe857",
"production": "docker",
"api": "1.0",
"fuelmain_sha": "b6f0d76964d0e8a0c4d9cc705338fb84512ea9d5",
"astute_sha": "5df009e8eab611750309a4c5b5c9b0f7b9d85806",
"release": "5.0.1",
"fuellib_sha": "364dee37435cbdc85d6b814a61f57800b83bf22d"

Test case was the same.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

On the environment with reproduced issue:

root@node-1:~# nova secgroup-list
+--------------------------------------+----------------+-------------+
| Id                                   | Name           | Description |
+--------------------------------------+----------------+-------------+
| 7854f8e7-ca92-4ea7-a0ad-c271bd43814c | default        | default     |
| 3961150c-d64e-48cf-a525-35912e2ba122 | test_sec_group | test        |
+--------------------------------------+----------------+-------------+
root@node-1:~# nova secgroup-list-rules default
+-------------+-----------+---------+----------+--------------+
| IP Protocol | From Port | To Port | IP Range | Source Group |
+-------------+-----------+---------+----------+--------------+
|             |           |         |          | default      |
|             |           |         |          | default      |
+-------------+-----------+---------+----------+--------------+

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

root@node-1:~# neutron security-group-show 7854f8e7-ca92-4ea7-a0ad-c271bd43814c
unsupported locale setting
root@node-1:~#
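
The "unsupported locale setting" message typically comes from locale.setlocale() inside the Python client rather than from Neutron itself; a hedged workaround is to export a locale that is always available before re-running the command:

root@node-1:~# export LC_ALL=C
root@node-1:~# neutron security-group-show 7854f8e7-ca92-4ea7-a0ad-c271bd43814c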

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Looks like this is a problem with security groups ^^^

Revision history for this message
Anastasia Palkina (apalkina) wrote :

I added port 22 to the security group, but pinging the floating IP also failed.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

root@node-1:~# nova list
+--------------------------------------+---------------+--------+------------+-------------+-----------------------------------+
| ID                                   | Name          | Status | Task State | Power State | Networks                          |
+--------------------------------------+---------------+--------+------------+-------------+-----------------------------------+
| 29ff0485-21da-4ea9-bd1f-3adb4c4e2c70 | test_instance | ACTIVE | -          | Running     | net04=192.168.111.5, 172.16.0.140 |
+--------------------------------------+---------------+--------+------------+-------------+-----------------------------------+

root@node-1:~# nova list-secgroup 29ff0485-21da-4ea9-bd1f-3adb4c4e2c70
+--------------------------------------+----------------+-------------+
| Id                                   | Name           | Description |
+--------------------------------------+----------------+-------------+
| 7854f8e7-ca92-4ea7-a0ad-c271bd43814c | default        | default     |
| 3961150c-d64e-48cf-a525-35912e2ba122 | test_sec_group | test        |
+--------------------------------------+----------------+-------------+

root@node-1:~# nova secgroup-list-rules test_sec_group
+-------------+-----------+---------+-----------+--------------+
| IP Protocol | From Port | To Port | IP Range  | Source Group |
+-------------+-----------+---------+-----------+--------------+
| tcp         | 22        | 22      | 0.0.0.0/0 |              |
+-------------+-----------+---------+-----------+--------------+

root@node-2:~# telnet 172.16.0.140 22
Trying 172.16.0.140...
Connected to 172.16.0.140.
Escape character is '^]'.
SSH-2.0-dropbear_2012.55

root@node-2:~# ip netns exec qrouter-d43693c5-04f0-4ba6-b5a2-ae2ef9ad06f8 telnet 192.168.111.5 22
Trying 192.168.111.5...
Connected to 192.168.111.5.
Escape character is '^]'.
SSH-2.0-dropbear_2012.55

I see no problems on this env. Add an ICMP rule if you want to be able to ping the instance.
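
For completeness, a sketch of adding such a rule and re-checking, using the floating IP from the listing above (standard nova secgroup syntax of that release):

root@node-1:~# nova secgroup-add-rule test_sec_group icmp -1 -1 0.0.0.0/0
root@node-1:~# ping -c 3 172.16.0.140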

Revision history for this message
Anastasia Palkina (apalkina) wrote :

It is difficult to reproduce the initial situation.

Dmitry Ilyin (idv1985)
summary: - Floating IP is not accessible because of problems with neutron-l3-agent
+ [mos] Floating IP is not accessible because of problems with
+ neutron-l3-agent
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Faced this issue on
{

    "build_id": "2014-07-23_02-01-14",
    "ostf_sha": "c1b60d4bcee7cd26823079a86e99f3f65414498e",
    "build_number": "347",
    "auth_required": false,
    "api": "1.0",
    "nailgun_sha": "f5775d6b7f5a3853b28096e8c502ace566e7041f",
    "production": "docker",
    "fuelmain_sha": "74b9200955201fe763526ceb51607592274929cd",
    "astute_sha": "fd9b8e3b6f59b2727b1b037054f10e0dd7bd37f1",
    "feature_groups": [
        "mirantis"
    ],
    "release": "5.1",
    "fuellib_sha": "fb0e84c954a33c912584bf35054b60914d2a2360"

}
Steps:
1. Cluster configuration - CentOS, simple, Neutron GRE, Cinder for volumes, Ceilometer, 1 controller, 1 compute, 1 cinder, 2 mongo nodes
2. Deploy cluster
3. After running OSTF, the test "Check network connectivity from instance via floating IP" failed with the error "Instance is not reachable by IP. Please refer to OpenStack logs for more details." on step "5. Check connectivity to the floating IP using ping command."

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Looks like there are no logs of the L3 agent or Neutron server in the attached file.

Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/5.1.x
no longer affects: mos/5.1.x
Revision history for this message
Tatyana Dubyk (tdubyk) wrote :

I've reproduced this bug
{
"build_id":"2014-08-11_12-45-06",
"mirantis":"yes",
"build_number":"169",
"ostf_sha":"09b6bccf7d476771ac859bb3c76c9ebec9da9e1f",
"nailgun_sha":"04ada3cd7ef14f6741a05fd5d6690260f9198095",
"production":"docker",
"api":"1.0",
"fuelmain_sha":"43374c706b4fdce28aeb4ef11e69a53f41646740",
"astute_sha":"6db5f5031b74e67b92fcac1f7998eaa296d68025",
"release":"5.0.1",
"fuellib_sha":"a31dbac8fff9cf6bc4cd0d23459670e34b27a9ab"
}

1. Create new environment (CentOS, simple mode, KVM)
2. Choose Cinder=default, Glance=Ceph
3. Choose Nova-Network (FLAT with tagging)
4. Add 1 controller, 1 compute, 3 ceph
5. Start deployment
6. Stop deployment during provisioning
7. Wait until nodes become 'Pending addition'
8. Start deployment again. It was successful
9. Run the network verification check. It passed successfully
10. Start OSTF tests
11. Test "Check network connectivity from instance via floating IP" failed with the error "Instance is not reachable by IP. Please refer to OpenStack logs for more details."

In the health check, the following tests failed for this reason:
- Check internet connectivity from a compute
- Check network connectivity from instance without floating IP
- Check network connectivity from instance via floating IP

Revision history for this message
Tatyana Dubyk (tdubyk) wrote :

My snapshot

Changed in fuel:
status: Incomplete → Confirmed
no longer affects: fuel/5.0.x
no longer affects: fuel
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

If we can reliably reproduce this it's High priority. If we can't, status should be Incomplete.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Seems that the similar issue was reproduced on 6.0 community iso # 199: https://bugs.launchpad.net/fuel/+bug/1382529/comments/13

Revision history for this message
Alexander Ignatov (aignatov) wrote :

Floating issue. From the logs it's not clear what the root cause is. We need an environment with a stable repro. Moving it to the Incomplete state.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The issue has not been reproduced in a month, hence moving it to the Invalid state. Please reopen if it reoccurs.
