Tempest scenario jobs failing due to no FIP connectivity

Bug #1754327 reported by Slawek Kaplonski
Affects: neutron
Importance: High
Assigned to: Slawek Kaplonski

Bug Description

Quite often (especially in the linuxbridge scenario job), random tests fail because ssh to the instance is not possible.
Example of such failed tests: http://logs.openstack.org/07/525607/12/check/neutron-tempest-plugin-scenario-linuxbridge/09f04f9/logs/testr_results.html.gz

The same issue sometimes appears in the dvr scenario job, but less often, probably because it is a multinode job, so the load on each host may be lower and instances can boot faster.

Slawek Kaplonski (slaweq) wrote :

I checked the logs from one such failed test: http://logs.openstack.org/07/525607/12/check/neutron-tempest-plugin-scenario-linuxbridge/09f04f9/logs/testr_results.html.gz

Instance checked: c6c1cdcf-c8ba-414d-a7e8-c7ac5e54e191

Fixed IP: 10.1.0.4
Floating IP: 172.24.5.14
Port's MAC: fa:16:3e:f7:57:4d

The timeline of events is:
1. Instance spawned and paused by nova at Mar 06 19:52:59.027047: http://logs.openstack.org/07/525607/12/check/neutron-tempest-plugin-scenario-linuxbridge/09f04f9/logs/screen-n-cpu.txt.gz?level=INFO#_Mar_06_19_52_59_027047
2. DHCP agent reports that the port is ready at Mar 06 19:52:42.035366 (the port is still unbound at that point, so it was probably created before the instance)
3. Neutron sends notification to nova that port is active at Mar 06 19:52:45.006898: http://logs.openstack.org/07/525607/12/check/neutron-tempest-plugin-scenario-linuxbridge/09f04f9/logs/screen-q-svc.txt.gz#_Mar_06_19_52_45_006898
4. Instance started and paused by nova-compute at Mar 06 19:52:59.027047: http://logs.openstack.org/07/525607/12/check/neutron-tempest-plugin-scenario-linuxbridge/09f04f9/logs/screen-n-cpu.txt.gz?level=INFO#_Mar_06_19_52_59_027047
5. Instance resumed by nova-compute at Mar 06 19:53:02.159936: http://logs.openstack.org/07/525607/12/check/neutron-tempest-plugin-scenario-linuxbridge/09f04f9/logs/screen-n-cpu.txt.gz?level=INFO#_Mar_06_19_53_02_159936
6. Tempest starts trying to ssh to instance at 2018-03-06 19:53:08.151: http://logs.openstack.org/07/525607/12/check/neutron-tempest-plugin-scenario-linuxbridge/09f04f9/logs/tempest.txt.gz?level=INFO#_2018-03-06_19_53_08_151
7. First DHCPREQUEST logged in dhcp agent at Mar 06 19:57:45.087006: http://logs.openstack.org/07/525607/12/check/neutron-tempest-plugin-scenario-linuxbridge/09f04f9/logs/screen-q-dhcp.txt.gz#_Mar_06_19_57_45_087006
8. Request to delete instance (failed test) sent at: 2018-03-06 19:59:56.375330: http://logs.openstack.org/07/525607/12/check/neutron-tempest-plugin-scenario-linuxbridge/09f04f9/job-output.txt.gz#_2018-03-06_19_59_56_375330

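So it looks like there is no problem with communication between services, nor any "lag" in the configuration of the port by the l2/dhcp agents: the port is reported active before the instance is even started, yet the first DHCPREQUEST arrives more than four and a half minutes after the instance is resumed. This suggests that the instance itself boots slowly.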
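To make the gaps in this timeline easier to see, here is a minimal sketch that computes the time elapsed between consecutive events. The timestamps are copied from items 2 to 8 of the list (item 1 shares its timestamp with item 4); the names `EVENTS`, `parse` and `gaps` are hypothetical helpers made up for this illustration and are not part of neutron or tempest.

```python
from datetime import datetime

# Timestamps copied from the timeline in this comment (all on 2018-03-06).
# The labels are shortened descriptions of the listed events.
EVENTS = [
    ("DHCP agent reports port ready", "19:52:42.035366"),
    ("neutron notifies nova that port is active", "19:52:45.006898"),
    ("instance started and paused", "19:52:59.027047"),
    ("instance resumed", "19:53:02.159936"),
    ("tempest starts trying to ssh", "19:53:08.151"),
    ("first DHCPREQUEST in dhcp agent log", "19:57:45.087006"),
    ("request to delete instance", "19:59:56.375330"),
]


def parse(ts):
    """Parse an HH:MM:SS.ffffff time-of-day string."""
    return datetime.strptime(ts, "%H:%M:%S.%f")


def gaps(events):
    """Return (label, seconds since the previous event) pairs."""
    result = []
    previous = None
    for label, ts in events:
        current = parse(ts)
        delta = 0.0 if previous is None else (current - previous).total_seconds()
        result.append((label, delta))
        previous = current
    return result


if __name__ == "__main__":
    for label, delta in gaps(EVENTS):
        print(f"{delta:8.1f}s  {label}")
```

The largest gap by far is the one before the first DHCPREQUEST, which is consistent with the instance needing a long time to boot.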

I also checked directly on a test node (thanks, infra team) that each of these instances does become ready after some time.
So I sent patch https://review.openstack.org/#/c/549324/ and the tests appear to pass with it, but in many cases the additional check_connectivity call is needed (so the test would have failed with a timeout in the normal case).
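For illustration only, the idea of waiting longer for a slow-booting instance can be sketched as a polling loop with an overall deadline. This is not the code from the patch above nor the tempest implementation; `wait_for_port` and all of its default values are assumptions chosen for this example, and it only checks that the TCP port accepts connections, not that an ssh login succeeds.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=5.0, connect_timeout=3.0):
    """Poll host:port until a TCP connection succeeds or `timeout` seconds pass.

    Returns True as soon as a connection is accepted, False if the
    deadline is reached first. All defaults are illustrative values.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (refused, unreachable, timed out)
            # while the instance is still booting.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

With a short overall timeout, an instance that only becomes reachable after several minutes is reported as a failure even though it would have come up eventually; a longer `timeout` trades test duration for fewer false failures.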

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/550832

Changed in neutron:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/550832
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0ab03003b9f9c4f0cace538eee84478a099c0c58
Submitter: Zuul
Branch: master

commit 0ab03003b9f9c4f0cace538eee84478a099c0c58
Author: Sławek Kapłoński <email address hidden>
Date: Thu Mar 8 14:18:31 2018 +0100

    [Scenario tests] Try longer SSH timeout for ubuntu image

    It looks that many scenario tests are failing because of too long
    instance booting time and reached ssh timeout during checking
    connectivity.
    So longer timeout should solve this problem and tests should
    not fail with this reason.

    Change-Id: I5d0678ea2383483e6106976c148353ef4352befd
    Closes-Bug: #1754327

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/554859

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/554859
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fc182f3171ae3a83d724898aafcdf528b1ace3ef
Submitter: Zuul
Branch: stable/queens

commit fc182f3171ae3a83d724898aafcdf528b1ace3ef
Author: Sławek Kapłoński <email address hidden>
Date: Thu Mar 8 14:18:31 2018 +0100

    [Scenario tests] Try longer SSH timeout for ubuntu image

    It looks that many scenario tests are failing because of too long
    instance booting time and reached ssh timeout during checking
    connectivity.
    So longer timeout should solve this problem and tests should
    not fail with this reason.

    Change-Id: I5d0678ea2383483e6106976c148353ef4352befd
    Closes-Bug: #1754327
    (cherry picked from commit 0ab03003b9f9c4f0cace538eee84478a099c0c58)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.1

This issue was fixed in the openstack/neutron 12.0.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.0.0b1

This issue was fixed in the openstack/neutron 13.0.0.0b1 development milestone.

tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/573632

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/573634

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/573634
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3bae7d57a8d3f8f1395d781866e7284f56460a83
Submitter: Zuul
Branch: stable/ocata

commit 3bae7d57a8d3f8f1395d781866e7284f56460a83
Author: Sławek Kapłoński <email address hidden>
Date: Thu Mar 8 14:18:31 2018 +0100

    [Scenario tests] Try longer SSH timeout for ubuntu image

    It looks that many scenario tests are failing because of too long
    instance booting time and reached ssh timeout during checking
    connectivity.
    So longer timeout should solve this problem and tests should
    not fail with this reason.

    Change-Id: I5d0678ea2383483e6106976c148353ef4352befd
    Closes-Bug: #1754327
    (cherry picked from commit 0ab03003b9f9c4f0cace538eee84478a099c0c58)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/573632
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=073b44b1005e7e93bdef6fa04153498009cc9412
Submitter: Zuul
Branch: stable/pike

commit 073b44b1005e7e93bdef6fa04153498009cc9412
Author: Sławek Kapłoński <email address hidden>
Date: Thu Mar 8 14:18:31 2018 +0100

    [Scenario tests] Try longer SSH timeout for ubuntu image

    It looks that many scenario tests are failing because of too long
    instance booting time and reached ssh timeout during checking
    connectivity.
    So longer timeout should solve this problem and tests should
    not fail with this reason.

    Change-Id: I5d0678ea2383483e6106976c148353ef4352befd
    Closes-Bug: #1754327
    (cherry picked from commit 0ab03003b9f9c4f0cace538eee84478a099c0c58)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.6

This issue was fixed in the openstack/neutron 11.0.6 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.
