Problem with instance connectivity after l3 agent rescheduling

Bug #1392330 reported by Andrey Sledzinskiy on 2014-11-13
This bug affects 2 people
Affects              Importance  Assigned to
Fuel for OpenStack   High        Vladimir Kuklin
4.1.x                High        Sergey Kolekonov
5.0.x                High        Vitaly Sedelnik
5.1.x                High        Vladimir Kuklin
6.0.x                High        Vladimir Kuklin

Bug Description

{

    "build_id": "2014-11-11_23-18-10",
    "ostf_sha": "720cc1308c3a7081736edd167e7928ca61914aaa",
    "build_number": "93",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "e1e58a78f07d33ef5467eb2e7ac8c18b86783caa",
    "production": "docker",
    "fuelmain_sha": "f09f3624e55da3b58bf9ef6e2614c4a96cb5b072",
    "astute_sha": "3c374c9f7bfbdbcd7ce2f716cd704e3044e6fb41",
    "feature_groups": [
        "mirantis"
    ],
    "release": "6.0",
    "release_versions": {
        "2014.2-6.0": {
            "VERSION": {
                "build_id": "2014-11-11_23-18-10",
                "ostf_sha": "720cc1308c3a7081736edd167e7928ca61914aaa",
                "build_number": "93",
                "api": "1.0",
                "nailgun_sha": "e1e58a78f07d33ef5467eb2e7ac8c18b86783caa",
                "production": "docker",
                "fuelmain_sha": "f09f3624e55da3b58bf9ef6e2614c4a96cb5b072",
                "astute_sha": "3c374c9f7bfbdbcd7ce2f716cd704e3044e6fb41",
                "feature_groups": [
                    "mirantis"
                ],
                "release": "6.0",
                "fuellib_sha": "0b1c4ab522009681f27053a267dcc9dd4010b7ea"
            }
        }
    },
    "fuellib_sha": "0b1c4ab522009681f27053a267dcc9dd4010b7ea"

}

Steps:
1. Create the following cluster - HA, CentOS, Neutron GRE, 3 controllers, 2 computes, 1 cinder
2. Deploy cluster
3. Create instance
4. Manually reschedule router from one l3-agent to another
5. Check network connectivity from instance via dhcp namespace
ip netns exec qdhcp-bf6d240a-3057-4132-b6eb-cdffbed66093 ssh cirros@192.168.111.4 "ping -c 3 8.8.8.8"

6. Instance connectivity works fine
7. Go to node on which l3 agent that hosts router is running
8. Stop l3-agent resource via pacemaker
pcs resource ban p_neutron-l3-agent NODE
9. Wait until the router is rescheduled to another l3-agent
10. Log in to controller with l3 agent hosting router
11. Check instance connectivity via dhcp namespace
ip netns exec qdhcp-bf6d240a-3057-4132-b6eb-cdffbed66093 ssh cirros@192.168.111.4 "ping -c 3 8.8.8.8"

Expected - instance has connection to outside world
Actual - instance has no connection to outside world
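For regression checking, steps 8-11 above can be wrapped in a small script. This is only a sketch: the router name and node name are placeholders (hypothetical), the namespace ID and instance IP are taken from the report, and the script defaults to a dry run that just prints the commands.

```shell
#!/bin/sh
# Dry-run by default; set DRY_RUN=0 on a real controller (with admin
# credentials sourced) to actually execute the commands.
: "${DRY_RUN:=1}"

ROUTER="router04"                                      # placeholder name
DHCP_NS="qdhcp-bf6d240a-3057-4132-b6eb-cdffbed66093"   # from the report
INSTANCE_IP="192.168.111.4"                            # from the report
NODE="node-1"                                          # placeholder: node hosting the router

PLANNED=""
run() {
    # Record the command, then either print or execute it.
    PLANNED="$PLANNED$*
"
    if [ "$DRY_RUN" = 1 ]; then echo "DRY: $*"; else "$@"; fi
}

# Step 8: ban the l3-agent resource on the node hosting the router
run pcs resource ban p_neutron-l3-agent "$NODE"

# Step 9/10: check which l3-agent hosts the router after rescheduling
run neutron l3-agent-list-hosting-router "$ROUTER"

# Step 11: check instance connectivity through the dhcp namespace
run ip netns exec "$DHCP_NS" ssh cirros@"$INSTANCE_IP" "ping -c 3 8.8.8.8"
```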

It seems there is a problem with metadata migration: after router rescheduling, the qrouter namespace has no interfaces for the internal/external networks.
[root@node-1 ~]# ip netns exec qrouter-b300a3fc-8548-4d96-8164-e8370985b715 ip a
35: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

Also there are errors in neutron-ns-metadata-proxy-b300a3fc-8548-4d96-8164-e8370985b715.log on node-1
2014-11-12 12:52:15.980 14417 ERROR neutron.agent.linux.daemon [-] Error while handling pidfile: /var/lib/neutron/external/pids/b300a3fc-8548-4d96-8164-e8370985b715.pid
2014-11-12 12:52:15.980 14417 TRACE neutron.agent.linux.daemon Traceback (most recent call last):
2014-11-12 12:52:15.980 14417 TRACE neutron.agent.linux.daemon File "/usr/lib/python2.6/site-packages/neutron/agent/linux/daemon.py", line 33, in __init__
2014-11-12 12:52:15.980 14417 TRACE neutron.agent.linux.daemon fcntl.flock(self.fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
2014-11-12 12:52:15.980 14417 TRACE neutron.agent.linux.daemon IOError: [Errno 11] Resource temporarily unavailable
2014-11-12 12:52:15.980 14417 TRACE neutron.agent.linux.daemon
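The IOError above (errno 11, EAGAIN) is what a non-blocking flock() raises when the pidfile is already locked, i.e. another metadata-proxy process still holds it. A minimal standalone reproduction of that behaviour (not neutron's actual Pidfile class):

```python
import errno
import fcntl
import tempfile

# Take an exclusive non-blocking lock on a file, then try to lock it
# again through a second file descriptor. flock() locks belong to the
# open file description, so the second attempt fails immediately with
# EAGAIN (errno 11, "Resource temporarily unavailable") -- the same
# error seen in the metadata-proxy traceback.
path = tempfile.NamedTemporaryFile(delete=False).name

holder = open(path, "a")
fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)   # first locker wins

contender = open(path, "a")
caught = None
try:
    fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError as exc:   # IOError on Python 2
    caught = exc

print(caught is not None and caught.errno == errno.EAGAIN)
```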

Logs are attached

Changed in fuel:
status: New → Confirmed
assignee: Fuel Library Team (fuel-library) → Sergey Vasilenko (xenolog)
Sergey Vasilenko (xenolog) wrote :

I spent a lot of time reproducing this on ISO #114.
It was not simple.

but...

it looks like a combined Neutron and OVS issue:

2014-11-20 16:07:02.307 25086 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 5.0 seconds ...
2014-11-20 16:07:07.353 25086 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 127.0.0.1:5673
2014-11-20 16:07:08.702 25086 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 127.0.0.1:5673
2014-11-20 16:07:53.908 25086 ERROR neutron.agent.linux.utils [req-2d96d235-0261-4f69-a766-6203cfa3f495 None]
Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-ofctl', 'dump-flows', 'br-int', 'table=23']
Exit code: 1
Stdout: ''
Stderr: 'ovs-ofctl: /var/run/openvswitch/br-int.mgmt: failed to open socket (Connection refused)\n'
2014-11-20 16:07:53.909 25086 ERROR neutron.agent.linux.ovs_lib [req-2d96d235-0261-4f69-a766-6203cfa3f495 None] Unable to execute ['ovs-ofctl', 'dump-flows', 'br-int', 'table=23']. Exception:
Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-ofctl', 'dump-flows', 'br-int', 'table=23']
Exit code: 1
Stdout: ''
Stderr: 'ovs-ofctl: /var/run/openvswitch/br-int.mgmt: failed to open socket (Connection refused)\n'
2014-11-20 16:08:06.310 25086 WARNING neutron.agent.linux.ovs_lib [req-2d96d235-0261-4f69-a766-6203cfa3f495 None] Found not yet ready openvswitch port: [u'qr-b618e937-de', [u'map', [[u'attached-mac', u'fa:16:3e:46:22:9a'], [u'iface-id', u'b618e937-ded8-41d8-9ddf-f3bd054a2962'], [u'iface-status', u'active']]], [u'set', []]]
2014-11-20 16:08:13.343 25086 WARNING neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-2d96d235-0261-4f69-a766-6203cfa3f495 None] Device b618e937-ded8-41d8-9ddf-f3bd054a2962 not defined on plugin
2014-11-20 16:08:28.084 25086 ERROR neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-2d96d235-0261-4f69-a766-6203cfa3f495 None] process_ancillary_network_ports - iteration:28738 - failure while retrieving port details from server
2014-11-20 16:08:28.084 25086 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2014-11-20 16:08:28.084 25086 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py", line 1254, in process_ancillary_network_ports
2014-11-20 16:08:28.084 25086 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent self.treat_ancillary_devices_added(port_info['added'])
2014-11-20 16:08:28.084 25086 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py", line 1142, in treat_ancillary_devices_added
2014-11-20 16:08:28.084 25086 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent raise DeviceListRetrievalError(devices=devices, error=e)
2014-11-20 16:08:28.084 25086 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'19bc0777-d655-4fa4-b143-82e9d5b58b4c', u'a4414f4c-2192-4995-9fb1-3d5dae2b9bfd']) because of error: Remote error: DBConnecti...


Sergey Vasilenko (xenolog) wrote :

But run by hand, the same command succeeds:

root@node-2:/var/log/neutron# ovs-ofctl dump-flows br-int table=23
NXST_FLOW reply (xid=0x4):
 cookie=0x0, duration=7545.898s, table=23, n_packets=0, n_bytes=0, idle_age=7545, priority=0 actions=drop
root@node-2:/var/log/neutron#

See the neutron-openvswitch-agent log on node-2.

Sergey Vasilenko (xenolog) wrote :

After this error, neutron-openvswitch-agent hangs and does nothing except send state updates to the neutron-server.

Sergey Vasilenko (xenolog) wrote :

After killing it by hand, the agent logs the following:

2014-11-20 16:08:28.084 25086 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent
2014-11-20 16:22:16.715 25086 WARNING neutron.agent.linux.ovs_lib [req-2d96d235-0261-4f69-a766-6203cfa3f495 None] Found not yet ready openvswitch port: [u'qr-b618e937-de', [u'map', [[u'attached-mac', u'fa:16:3e:46:22:9a'], [u'iface-id', u'b618e937-ded8-41d8-9ddf-f3bd054a2962'], [u'iface-status', u'active']]], [u'set', []]]
2014-11-20 16:22:23.479 25086 WARNING neutron.agent.linux.ovs_lib [req-2d96d235-0261-4f69-a766-6203cfa3f495 None] Unable to parse interface details. Exception: list index out of range
2014-11-20 18:22:01.249 25086 ERROR neutron.agent.linux.utils [req-2d96d235-0261-4f69-a766-6203cfa3f495 None]
Command: ['ps', '--ppid', '25307', '-o', 'pid=']
Exit code: 1
Stdout: ''
Stderr: ''

After this, it dies.

Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → MOS Neutron (mos-neutron)
Sergey Vasilenko (xenolog) wrote :

Later I did the following:
* pinned the neutron server to node-1 only
* enabled debug on node-1
* restarted the server and agents on node-1

For some time the l3-agent did not create the router namespaces:
root@node-1:/var/log/neutron# ip netns
haproxy
root@node-1:/var/log/neutron# neutron l3-agent-list-hosting-router router04
+--------------------------------------+--------+----------------+-------+
| id | host | admin_state_up | alive |
+--------------------------------------+--------+----------------+-------+
| 021bce30-d7da-4143-bc61-0a2995a086ce | node-1 | True | :-) |
+--------------------------------------+--------+----------------+-------+
root@node-1:/var/log/neutron# ip netns
haproxy

but some time later:

root@node-1:/var/log/neutron# neutron l3-agent-list-hosting-router router04
+--------------------------------------+--------+----------------+-------+
| id | host | admin_state_up | alive |
+--------------------------------------+--------+----------------+-------+
| 021bce30-d7da-4143-bc61-0a2995a086ce | node-1 | True | :-) |
+--------------------------------------+--------+----------------+-------+
root@node-1:/var/log/neutron# ip netns
qrouter-8334420c-818c-4abb-9e78-3a9d12288f1a
qrouter-8dd0cb25-054e-4373-a191-91cb8dbc9c64
haproxy
root@node-1:/var/log/neutron# ip netns exec qrouter-8dd0cb25-054e-4373-a191-91cb8dbc9c64 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
80: qr-13a0e6db-b0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:91:5e:71 brd ff:ff:ff:ff:ff:ff
    inet 192.168.111.1/24 brd 192.168.111.255 scope global qr-13a0e6db-b0
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe91:5e71/64 scope link
       valid_lft forever preferred_lft forever
82: qg-19bc0777-d6: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:10:76:fd brd ff:ff:ff:ff:ff:ff
    inet 10.77.11.130/24 brd 10.77.11.255 scope global qg-19bc0777-d6
       valid_lft forever preferred_lft forever
    inet 10.77.11.138/32 brd 10.77.11.138 scope global qg-19bc0777-d6
       valid_lft forever preferred_lft forever
    inet 10.77.11.139/32 brd 10.77.11.139 scope global qg-19bc0777-d6
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe10:76fd/64 scope link
       valid_lft forever preferred_lft forever
root@node-1:/var/log/neutron# ip netns exec qrouter-8dd0cb25-054e-4373-a191-91cb8dbc9c64 ping 192.168.111.33
PING 192.168.111.33 (192.168.111.33) 56(84) bytes of data.
From 192.168.111.1 icmp_seq=1 Destination Host Unreachable
From 192.168.111.1 icmp_seq=2 Destination Host Unreachable
From 192.168.111.1 icmp_seq=3 Destination Host Unreachable
From 192.168.111.1 icmp_seq=4 Destination Host Unreachable
From 192.168.111.1 icmp_seq=5 Destination Host Unreachable
From 192.168.111.1 icmp_seq=6 Destination Host Unreachable
^C
--- 192.168.111....


Sergey Vasilenko (xenolog) wrote :

log files after enabling debug

Sergey Vasilenko (xenolog) wrote :

After all this I found that the instance had simply lost its IP address, i.e. the DHCP agent had been inoperable all that time.

I did the following:
* enabled debug on node-3 (where the dhcp agent was located)
* restarted all agents

After about 3 minutes (depending on the instance's dhcp client) the instance got an IP address and connectivity was restored.

Debug logs are attached.

Sergey Vasilenko (xenolog) wrote :
tags: added: neutron
Changed in fuel:
assignee: MOS Neutron (mos-neutron) → Andrey Sledzinskiy (asledzinskiy)
status: Confirmed → Incomplete
Sergey Kolekonov (skolekonov) wrote :

Is there any way to reproduce this problem reliably?

Changed in fuel:
assignee: Andrey Sledzinskiy (asledzinskiy) → MOS Neutron (mos-neutron)
status: Incomplete → Confirmed

Are the steps in the test description incorrect, or were you unable to reproduce it manually with the steps specified in the bug description?

Sergey Kolekonov (skolekonov) wrote :

Yes, I wasn't able to reproduce this issue manually on the 6.0 ISO #25. The only difference is that I used a cluster with 3 controllers and 1 compute+cinder node.

Sergey Kolekonov (skolekonov) wrote :

Also, I found the following error in the system tests Jenkins job (6.0_fuelmain.system_test.ubuntu.thread_5/44):
ERROR: Check l3-agent rescheduling after l3-agent dies
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/usr/lib/python2.7/dist-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/usr/lib/python2.7/dist-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/jenkins/workspace/6.0_fuelmain.system_test.ubuntu.thread_5/fuelweb_test/helpers/decorators.py", line 52, in wrapper
    return func(*args, **kwagrs)
  File "/home/jenkins/workspace/6.0_fuelmain.system_test.ubuntu.thread_5/fuelweb_test/tests/tests_strength/test_neutron.py", line 86, in deploy_ha_neutron
    node_fqdn, self.env.nodes().slaves[0:6])
  File "/home/jenkins/workspace/6.0_fuelmain.system_test.ubuntu.thread_5/fuelweb_test/__init__.py", line 48, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/6.0_fuelmain.system_test.ubuntu.thread_5/fuelweb_test/models/fuel_web_client.py", line 509, in find_devops_node_by_nailgun_fqdn
    macs = {i['mac'] for i in nailgun_node['meta']['interfaces']}
TypeError: 'NoneType' object has no attribute '__getitem__'

It seems this is not connected with Neutron.
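That TypeError is just Python 2 indexing None: find_devops_node_by_nailgun_fqdn apparently got no matching nailgun_node, so nailgun_node['meta'] blew up. A standalone illustration (note Python 3 words the message differently):

```python
# Indexing None raises TypeError. In Python 2 the message is
# "'NoneType' object has no attribute '__getitem__'"; in Python 3 it is
# "'NoneType' object is not subscriptable". Either way the root cause
# is a lookup that returned None instead of a node dict.
nailgun_node = None  # stand-in for a failed node lookup

error = None
try:
    macs = {i['mac'] for i in nailgun_node['meta']['interfaces']}
except TypeError as exc:
    error = exc

print(type(error).__name__)  # TypeError
```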

A fix for the last problem is on review now: https://review.openstack.org/#/c/136016/
With this patch, the test fails on the last step, where we check instance connectivity.

Changed in fuel:
assignee: MOS Neutron (mos-neutron) → Andrey Sledzinskiy (asledzinskiy)

The test constantly fails because the instance loses its connectivity.
Can someone look at the environment with the problem?

Changed in fuel:
assignee: Andrey Sledzinskiy (asledzinskiy) → MOS Neutron (mos-neutron)

Fix proposed to branch: master
Review: https://review.openstack.org/139130

Changed in fuel:
assignee: MOS Neutron (mos-neutron) → Vladimir Kuklin (vkuklin)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/139130
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=d33dd473bf49f88f70cb1e5b693979b866d8ee60
Submitter: Jenkins
Branch: master

commit d33dd473bf49f88f70cb1e5b693979b866d8ee60
Author: Vladimir Kuklin <email address hidden>
Date: Thu Dec 4 20:51:55 2014 +0300

    Set neutron OCF scripts umask to 0022

    For some reason pacemaker sets umask to 0026
    which leads to 0751 rights set for neutron
    agents scripts, which in turn create
    some of files with these rights set, which
    in turn can make metadata proxy process
    unmanagable, making router hang.

    Change-Id: I146f96c2215aff95af9fd53682f501f3a1b90349
    Closes-bug: #1392330
    Related-bug: #1397284
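The arithmetic behind the commit message can be checked directly: a file created with requested mode 0o777 under umask 0o026 ends up 0o751 (no group write, no other read/write), while the fixed umask 0o022 yields the expected 0o755. A small sketch (not fuel-library code):

```python
import os
import stat
import tempfile

def mode_under_umask(umask):
    """Create a file requesting mode 0o777 and return the mode it actually got."""
    old = os.umask(umask)
    try:
        path = tempfile.mktemp()
        # The kernel applies: effective mode = requested mode & ~umask
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o777)
        os.close(fd)
        mode = stat.S_IMODE(os.stat(path).st_mode)
        os.unlink(path)
        return mode
    finally:
        os.umask(old)  # restore the previous umask

print(oct(mode_under_umask(0o026)))  # 0o777 & ~0o026 == 0o751
print(oct(mode_under_umask(0o022)))  # 0o777 & ~0o022 == 0o755
```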

Changed in fuel:
status: In Progress → Fix Committed
tags: added: customer-found

Verified on:

{

    "build_id": "2014-12-26_14-25-46",
    "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4",
    "build_number": "58",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90",
    "production": "docker",
    "fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8",
    "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91",
    "feature_groups": [
        "mirantis"
    ],
    "release": "6.0",
    "release_versions": {
        "2014.2-6.0": {
            "VERSION": {
                "build_id": "2014-12-26_14-25-46",
                "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4",
                "build_number": "58",
                "api": "1.0",
                "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90",
                "production": "docker",
                "fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8",
                "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91",
                "feature_groups": [
                    "mirantis"
                ],
                "release": "6.0",
                "fuellib_sha": "fde8ba5e11a1acaf819d402c645c731af450aff0"
            }
        }
    },
    "fuellib_sha": "fde8ba5e11a1acaf819d402c645c731af450aff0"

}

Changed in fuel:
status: Fix Committed → Fix Released
tags: added: release-notes

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: stable/5.0
Review: https://review.openstack.org/146248
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
