Migrate agents tests are failed by timeout according to wait for password in ssh connectivity test

Bug #1493228 reported by Tatyanka
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Fix Released | High | Artem Panchenko |
7.0.x | Fix Released | High | Artem Panchenko |
8.0.x | Fix Released | High | Artem Panchenko |

Bug Description

The test failed with a timeout at the step "Check instance connectivity":
https://product-ci.infra.mirantis.net/job/7.0.system_test.ubuntu.ha_neutron_destructive_2/29/testReport/(root)/neutron_l3_migration_after_reset/neutron_l3_migration_after_reset/

At the same time, the same connectivity check works perfectly if we use login and password:
http://paste.openstack.org/show/449832/

So the problem is that we create the instance key on the first controller while the DHCP agent runs on the second controller. We SSH to the second controller and try to check connectivity on the assumption that the key is there, but the key is absent (it is still only on the first controller), so ssh hangs waiting for a password and, as a result, the test fails by timeout.

Maybe we need to copy the instance key to all the controllers to avoid such false-negative test results.
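
As an illustration of the hang described above, here is a minimal sketch (not the fix that was merged) of building the ssh command so that it fails fast instead of prompting for a password when the key is missing. It relies on the standard OpenSSH options BatchMode and ConnectTimeout; the function name and its parameters are hypothetical, and it uses Python 3's shlex.quote.

```python
import shlex


def build_instance_ssh_command(namespace, key_path, instance_ip, remote_cmd,
                               user="cirros", connect_timeout=10):
    """Build a non-interactive ssh command to run inside a net namespace.

    Hypothetical helper: BatchMode=yes makes ssh exit with an error
    instead of waiting for a password when key authentication fails.
    """
    ssh_args = [
        "ip", "netns", "exec", namespace,
        "ssh",
        "-i", key_path,
        "-o", "BatchMode=yes",
        "-o", "StrictHostKeyChecking=no",
        "-o", "ConnectTimeout={}".format(connect_timeout),
        "{}@{}".format(user, instance_ip),
        remote_cmd,
    ]
    # Quote every argument so the remote command stays a single word.
    return " ".join(shlex.quote(arg) for arg in ssh_args)
```

With such a command, a missing key would show up as an immediate authentication error in the test log rather than as a timeout.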

summary: - Need to increase timeout for check instance connectivity by fixed ip in
- agent migration tests
+ Migrate agents tests are failed by timeout according to wait for
+ password in ssh connectivity test
Changed in fuel:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/221715

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Dmitry Tyzhnenko (dtyzhnenko)
status: Confirmed → In Progress
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

I checked the test logs and the command output from the bug description and found the root cause of the failure: the DHCP agent was not working properly on the node-2 ('slave-01') controller and there was no net namespace for it:

2015-09-08 02:51:17,859 - DEBUG __init__.py:55 -- Calling: get_ssh_for_node with args: (<fuelweb_test.models.fuel_web_client.FuelWebClient object at 0x7fce60299750>, 'slave-01') {}
...
2015-09-08 02:51:18,580 - DEBUG helpers.py:330 -- Executing command: 'ip netns | grep ce4eb0e0-e6d3-46a2-93c8-1133795963bc'
2015-09-08 02:51:18,586 - DEBUG test_neutron.py:240 -- dhcp namespace is

So the namespace name was missing from the ssh/ping command and the connectivity check failed:

2015-09-08 02:51:39,333 - DEBUG __init__.py:55 -- Calling: check_instance_connectivity with args: (<class 'tests.tests_strength.test_neutron.TestNeutronFailover'>, <devops.helpers.helpers.SSHClient object at 0x7fce68298110>, '', u'192.168.111.4') {}
2015-09-08 02:51:39,333 - DEBUG helpers.py:330 -- Executing command: '. openrc; ip netns exec ssh -i /root/.ssh/webserver_rsa -o 'StrictHostKeyChecking no' cirros@192.168.111.4 "ping -c 1 8.8.8.8"'
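
The malformed command above (`ip netns exec ssh ...`) comes from an empty namespace name being passed along. A minimal sketch, assuming the `qdhcp-<network id>` naming seen elsewhere in this report, of a lookup that raises an explicit error instead of returning an empty string; the function name is hypothetical and is not part of fuel-qa:

```python
def find_dhcp_namespace(ip_netns_output, net_id):
    """Return the DHCP namespace name for net_id from `ip netns` output.

    Hypothetical helper: raises LookupError when the namespace is not
    present on the node, so the caller cannot build a broken command.
    """
    expected = "qdhcp-" + net_id
    for line in ip_netns_output.splitlines():
        fields = line.split()
        # Some iproute2 versions append "(id: N)" after the name,
        # so only the first field is compared.
        if fields and fields[0] == expected:
            return expected
    raise LookupError("namespace {} not found on this node".format(expected))
```

Failing at this point would make the log say that the namespace is absent on the chosen controller, instead of reporting a connectivity timeout later.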

I inspected the diagnostic snapshot and found that 'pcs status' reported that dhcp-agent was running fine on all controllers:

[node-2.test.domain.local] out: Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
[node-2.test.domain.local] out: Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]

But there were no dnsmasq processes on node-2 according to `ps` command output:

$ fgrep -l ce4eb0e0-e6d3-46a2-93c8-1133795963bc node-*.test.domain.local/commands/ps.txt
node-1.test.domain.local/commands/ps.txt
node-5.test.domain.local/commands/ps.txt

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

I checked a live environment and confirmed that DHCP agents for the 'net04' network are running on only 2 controllers, because the dhcp_agents_per_network parameter is set to 2:

https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/openstack/manifests/network.pp#L182
https://github.com/stackforge/fuel-library/commit/89ef0dadcbb8bf7b4b9ec36110f0a97848f99eff

So it looks like the tests which try to ping/access instances using the DHCP agent net namespace from a random controller should be fixed: the list of nodes with running agents for the specific network must be fetched from Neutron:

root@node-4:~# neutron agent-list | grep DHCP
| 1b233ec4-dbdf-4bfb-bbfd-79737df1210d | DHCP agent | node-6.test.domain.local | :-) | True | neutron-dhcp-agent |
| 5394460f-0cb3-4f53-af3c-f3ea26bf21c0 | DHCP agent | node-5.test.domain.local | :-) | True | neutron-dhcp-agent |
| e5d21dd2-6549-4289-be14-aa20bd631199 | DHCP agent | node-4.test.domain.local | :-) | True | neutron-dhcp-agent |

root@node-4:~# neutron dhcp-agent-list-hosting-net net04
+--------------------------------------+--------------------------+----------------+-------+
| id | host | admin_state_up | alive |
+--------------------------------------+--------------------------+----------------+-------+
| 1b233ec4-dbdf-4bfb-bbfd-79737df1210d | node-6.test.domain.local | True | :-) |
| 5394460f-0cb3-4f53-af3c-f3ea26bf21c0 | node-5.test.domain.local | True | :-) |
+--------------------------------------+--------------------------+----------------+-------+
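
For illustration, the hosts in the table above could be extracted with a small parser like the following. This is a sketch only: the function name is hypothetical, it assumes the column names shown in the output above (host, admin_state_up, alive), and it is not necessarily how the fuel-qa fix obtains the list.

```python
def parse_dhcp_agent_hosts(table_text):
    """Return hosts of enabled, alive DHCP agents from CLI table output.

    Hypothetical helper for the output of
    `neutron dhcp-agent-list-hosting-net <network>`.
    """
    hosts = []
    header = None
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +---+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe-delimited row holds column names
            continue
        row = dict(zip(header, cells))
        if row.get("admin_state_up") == "True" and row.get("alive") == ":-)":
            hosts.append(row["host"])
    return hosts
```

A test would then pick a controller only from the returned list before entering the DHCP namespace.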

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/221988

Changed in fuel:
assignee: Dmitry Tyzhnenko (dtyzhnenko) → Artem Panchenko (apanchenko-8)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/221988
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=b423cab21510c00f70720d8d5d21b3f539a20431
Submitter: Jenkins
Branch: master

commit b423cab21510c00f70720d8d5d21b3f539a20431
Author: Artem Panchenko <email address hidden>
Date: Wed Sep 9 22:56:43 2015 +0000

    Revert "Remove get_node_with_dhcp method"

    This reverts commit 3431eb870ab4d5d83d8749e9a10af26b7602ca37.

    Closes-bug: #1493228
    Change-Id: Id3ed8e15fc33c545658a4b51c749bb6ee6a2383a

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

The test still fails by timeout, and now the problem is that the node with the DHCP agent (node-3 in the example below) is reset after we have started a new instance:

#tests log:
2015-09-10 01:03:27,657 - DEBUG __init__.py:60 -- Done: get_fqdn_by_hostname with result: node-3.test.domain.local
2015-09-10 01:03:27,657 - DEBUG test_neutron.py:42 -- node name with dhcp is node-3.test.domain.local
...
2015-09-10 01:03:28,803 - DEBUG helpers.py:330 -- Executing command: 'ip netns | grep 4be6c957-0a3c-475d-ae8e-a362bbafb6b3'
2015-09-10 01:03:28,819 - DEBUG test_neutron.py:249 -- dhcp namespace is qdhcp-4be6c957-0a3c-475d-ae8e-a362bbafb6b3
...
2015-09-10 01:03:56,517 - DEBUG __init__.py:60 -- Done: reshedule_router_manually with result: None
2015-09-10 01:03:56,517 - DEBUG __init__.py:55 -- Calling: check_instance_connectivity with args: (<class 'tests.tests_strength.test_neutron.TestNeutronFailover'>, <devops.helpers.helpers.SSHClient object at 0x7f9c9f4d1e50>, 'qdhcp-4be6c957-0a3c-475d-ae8e-a362bbafb6b3', u'192.168.111.4') {}
2015-09-10 01:03:56,518 - DEBUG helpers.py:330 -- Executing command: '. openrc; ip netns exec qdhcp-4be6c957-0a3c-475d-ae8e-a362bbafb6b3 ssh -i /root/.ssh/webserver_rsa -o 'StrictHostKeyChecking no' cirros@192.168.111.4 "ping -c 1 8.8.8.8"'
...
2015-09-10 01:04:15,204 - DEBUG __init__.py:60 -- Done: check_instance_connectivity with result: None
...

2015-09-10 01:04:15,235 - DEBUG __init__.py:55 -- Calling: get_node_with_l3 with args: (<class 'tests.tests_strength.test_neutron.TestNeutronFailover'>, <tests.tests_strength.test_neutron.TestNeutronFailover object at 0x7f9c980ed490>, u'node-3.test.domain.local') {}
...
2015-09-10 01:04:15,235 - DEBUG test_neutron.py:51 -- new node with l3 is node-3.test.domain.local
...
2015-09-10 01:04:15,497 - DEBUG __init__.py:60 -- Done: get_node_with_l3 with result: Node object
2015-09-10 01:04:15,497 - INFO fuel_web_client.py:1642 -- Reboot (warm restart) nodes [u'slave-03']
2015-09-10 01:04:15,498 - INFO fuel_web_client.py:1605 -- Shutting down (warm) nodes [u'slave-03']
2015-09-10 01:04:15,498 - DEBUG fuel_web_client.py:1607 -- Shutdown node slave-03
...
2015-09-10 01:11:30,548 - DEBUG helpers.py:330 -- Executing command: 'mysql --connect_timeout=5 -sse "SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_ready';"'

So at 01:04 the test reset the node with the DHCP agent; at 01:11 the node was back online and the test started to ping the instance from node-3, but there was no DHCP agent namespace on node-3:

root@node-3:~# . openrc; ip netns exec qdhcp-4be6c957-0a3c-475d-ae8e-a362bbafb6b3 ssh -i /root/.ssh/webserver_rsa -o 'StrictHostKeyChecking no' cirros@192.168.111.4 "ping -c 1 8.8.8.8"
Cannot open network namespace "qdhcp-4be6c957-0a3c-475d-ae8e-a362bbafb6b3": No such file or directory

http://paste.openstack.org/show/454664/

IMHO we need to refresh the list of online DHCP agents from Neutron after a node reset in order to avoid such issues.
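
A minimal sketch of that refresh step, assuming the caller supplies a function that queries Neutron for the hosts currently running a DHCP agent for the network. The names wait_for_dhcp_hosts and fetch_hosts, as well as the default timeout and interval, are hypothetical and only illustrate the idea of polling after the reset:

```python
import time


def wait_for_dhcp_hosts(fetch_hosts, timeout=300, interval=10,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_hosts() until it returns a non-empty list of hosts.

    Hypothetical helper: fetch_hosts is any callable that asks Neutron
    which nodes currently host a DHCP agent for the network under test.
    """
    deadline = clock() + timeout
    while True:
        hosts = fetch_hosts()
        if hosts:
            return hosts
        if clock() >= deadline:
            raise RuntimeError(
                "no online DHCP agents within {}s".format(timeout))
        sleep(interval)
```

After the reset, the test would call this helper and open a new SSH connection to one of the returned hosts rather than reusing the node chosen before the reset.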

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/222088

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/222529

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/222537

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/222088
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=e2295b53fdb97b0d53c5d36255671a3a7e4a5ade
Submitter: Jenkins
Branch: master

commit e2295b53fdb97b0d53c5d36255671a3a7e4a5ade
Author: Artem Panchenko <email address hidden>
Date: Thu Sep 10 12:24:53 2015 +0300

    Refresh list of online DHCP agents after reset

    Get list of online DHCP agents for 'net04' network after
    node reset and re-initialize SSH connection to remote.
    Also create RSA key pair using API and copy private key
    on node before connecting to instance via SSH.

    Change-Id: Ia3d04fc1297c7075473f0c8184ca78d7472af5a7
    Partial-bug: #1493228

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/7.0)

Reviewed: https://review.openstack.org/222529
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=496430c8ab18582d79ea71b7e6aae6243ae98dcb
Submitter: Jenkins
Branch: stable/7.0

commit 496430c8ab18582d79ea71b7e6aae6243ae98dcb
Author: Artem Panchenko <email address hidden>
Date: Wed Sep 9 22:56:43 2015 +0000

    Revert "Remove get_node_with_dhcp method"

    This reverts commit 3431eb870ab4d5d83d8749e9a10af26b7602ca37.

    Closes-bug: #1493228
    Change-Id: Id3ed8e15fc33c545658a4b51c749bb6ee6a2383a
    (cherry picked from commit b423cab21510c00f70720d8d5d21b3f539a20431)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/222537
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=3a62f59bc954dfbae47164779f985a512e19505d
Submitter: Jenkins
Branch: stable/7.0

commit 3a62f59bc954dfbae47164779f985a512e19505d
Author: Artem Panchenko <email address hidden>
Date: Thu Sep 10 12:24:53 2015 +0300

    Refresh list of online DHCP agents after reset

    Get list of online DHCP agents for 'net04' network after
    node reset and re-initialize SSH connection to remote.
    Also create RSA key pair using API and copy private key
    on node before connecting to instance via SSH.

    Change-Id: Ia3d04fc1297c7075473f0c8184ca78d7472af5a7
    Partial-bug: #1493228

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-qa (master)

Change abandoned by Dmitry Tyzhnenko (<email address hidden>) on branch: master
Review: https://review.openstack.org/221715
Reason: already fixed

Dmitry Pyzhov (dpyzhov)
tags: added: area-qa
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 7.0 → 8.0
tags: removed: non-release
tags: added: non-release
tags: removed: non-release