tempest volume_boot_pattern and basic_ops running concurrently causing timeouts

Bug #1802971 reported by wes hayutin
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
neutron | Fix Released | Undecided | Slawek Kaplonski |
tripleo | Fix Released | Critical | chandan kumar |
Tags: ci tech-debt
wes hayutin (weshayutin) wrote :
chandan kumar (chkumar246) wrote :

In both logs, after ssh into the cirros image, it tries to find the instance-id of the data source but fails:
checking http://169.254.169.254/2009-04-04/instance-id
2018-11-12 17:40:46 | failed 1/20: up 188.34. request failed
2018-11-12 17:40:46 | failed 2/20: up 190.71. request failed
2018-11-12 17:40:46 | failed 3/20: up 192.88. request failed
2018-11-12 17:40:46 | failed 4/20: up 195.14. request failed
2018-11-12 17:40:46 | failed 5/20: up 197.43. request failed
2018-11-12 17:40:46 | failed 6/20: up 199.58. request failed
2018-11-12 17:40:46 | failed 7/20: up 201.87. request failed
2018-11-12 17:40:46 | failed 8/20: up 203.99. request failed
2018-11-12 17:40:46 | failed 9/20: up 206.25. request failed
2018-11-12 17:40:46 | failed 10/20: up 208.38. request failed
2018-11-12 17:40:46 | failed 11/20: up 210.69. request failed
2018-11-12 17:40:46 | failed 12/20: up 212.93. request failed
2018-11-12 17:40:46 | failed 13/20: up 215.10. request failed
2018-11-12 17:40:46 | failed 14/20: up 217.41. request failed
2018-11-12 17:40:46 | failed 15/20: up 219.55. request failed
2018-11-12 17:40:46 | failed 16/20: up 221.85. request failed
2018-11-12 17:40:46 | failed 17/20: up 223.98. request failed
2018-11-12 17:40:46 | failed 18/20: up 226.26. request failed
2018-11-12 17:40:46 | failed 19/20: up 228.43. request failed
2018-11-12 17:40:46 | failed 20/20: up 230.67. request failed
2018-11-12 17:40:46 | failed to read iid from metadata. tried 20
2018-11-12 17:40:46 | no results found for mode=net. up 232.92. searched: nocloud configdrive ec2
2018-11-12 17:40:46 | failed to get instance-id of datasource
2018-11-12 17:40:46 | Top of dropbear init script
2018-11-12 17:40:46 | Starting dropbear sshd: failed to get instance-id of datasource

This is what leads to the ssh timeout.
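
To spot this pattern quickly in other console logs, something like the following could help. It is only a rough sketch based on the line format in the excerpt above; the function name and the returned keys are made up for illustration and are not part of tempest or cirros.

```python
import re

# Matches cirros console lines such as:
#   "2018-11-12 17:40:46 | failed 20/20: up 230.67. request failed"
ATTEMPT_RE = re.compile(r"failed (\d+)/(\d+): up (\d+(?:\.\d+)?)\. request failed")

# Line printed by cirros once it stops retrying the metadata service.
GAVE_UP_MARKER = "failed to read iid from metadata"


def summarize_metadata_failures(console_log):
    """Summarize failed metadata requests found in a cirros console log.

    Hypothetical helper: the name and the dict keys are assumptions
    made for this sketch.
    """
    attempts = [
        (int(m.group(1)), int(m.group(2)), float(m.group(3)))
        for m in ATTEMPT_RE.finditer(console_log)
    ]
    return {
        # Number of "request failed" lines seen.
        "failed_attempts": len(attempts),
        # The N in "failed k/N" of the last attempt, 0 if none were seen.
        "max_attempts": attempts[-1][1] if attempts else 0,
        # Guest uptime (seconds) reported on the last failed attempt.
        "last_uptime": attempts[-1][2] if attempts else None,
        # True when cirros gave up reading the instance-id entirely.
        "gave_up": GAVE_UP_MARKER in console_log,
    }
```

If `gave_up` is true and the instance never got a fixed IP, the ssh timeout in the test is a consequence rather than the root cause.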

chandan kumar (chkumar246) wrote :

Based on discussion with slaweq on the #tripleo channel, http://logs.openstack.org/03/616203/9/gate/tripleo-ci-centos-7-standalone/75f1be5/logs/undercloud/home/zuul/tempest.log.txt.gz#_2018-11-12_17_40_46 -> the instance doesn't even have a fixed IP configured.
The first instance works fine but the second instance fails.

A similar issue is seen from time to time on the neutron gates.

chandan kumar (chkumar246) wrote :

We need the dnsmasq logs from the neutron_dhcp container in order to investigate and find out what is going wrong.

Rabi Mishra (rabi) wrote :

Probably relevant, I can see the following in the journal logs.

Nov 12 17:40:42 centos-7-inap-mtl01-0000480210 dockerd-current[14710]: dnsmasq: cannot open or create lease file /var/lib/neutron/dhcp/aab3cc80-e850-4695-957a-9dc8446aa78e/leases: No such file or directory

http://logs.openstack.org/03/616203/9/gate/tripleo-ci-centos-7-standalone/75f1be5/logs/undercloud/var/log/journal.txt.gz

Related errors in the dhcp agent logs:

/usr/bin/docker-current: Error response from daemon: Conflict. The container name "/neutron-dnsmasq-qdhcp-aab3cc80-e850-4695-957a-9dc8446aa78e" is already in use by container 7a54c41ab05b3e01ecbfd63def485263c07dfbd446e2db627f8b665c68132f75. You have to remove (or rename) that container to be able to reuse that name..

http://logs.openstack.org/03/616203/9/gate/tripleo-ci-centos-7-standalone/75f1be5/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2018-11-12_17_40_40_144

Slawek Kaplonski (slaweq) wrote :

@Rabi: I don't think it's related, as it is about a different network.
The ID of the network used in the failed test was 783701e2-8acc-4cb8-ab8a-c7040f549475, and for that network I found only logs like:

level=error msg="Handler for POST /v1.26/containers/create returned error: Conflict. The container name \"/neutron-dnsmasq-qdhcp-783701e2-8acc-4cb8-ab8a-c7040f549475\" is already in use by container fbc4bbc8f5d3ee398dd8e6637b33e466d1349e67742b6344c2c009cc169e2b4a. You have to remove (or rename) that container to be able to reuse that name."

in log file: http://logs.openstack.org/03/616203/9/gate/tripleo-ci-centos-7-standalone/75f1be5/logs/undercloud/var/log/journal.txt.gz#_Nov_12_17_29_13

I don't know whether that is related, because before this log entry it looks like the proper container was spawned. Also, the first instance in the test was configured fine and the issue was only with the second one, so IMO the dhcp server worked properly at least for the first instance.
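
To check which networks actually hit the container name conflict, the journal or dhcp agent log could be scanned roughly like this. It is just a sketch that assumes the message format quoted in the comments above (with or without escaped quotes); the helper name is hypothetical.

```python
import re

# Matches both forms seen in the logs:
#   The container name "/neutron-dnsmasq-qdhcp-<net-id>" is already in use
#   The container name \"/neutron-dnsmasq-qdhcp-<net-id>\" is already in use
CONFLICT_RE = re.compile(
    r'container name \\?"/neutron-dnsmasq-qdhcp-([0-9a-f-]{36})\\?" is already in use'
)


def conflicting_network_ids(log_text):
    """Return the set of network IDs whose dnsmasq container name conflicted.

    Hypothetical helper written for this bug; not part of neutron or tripleo.
    """
    return set(CONFLICT_RE.findall(log_text))
```

Comparing the result with the network ID used by the failed test shows whether a given conflict message can be related to that test at all.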

wes hayutin (weshayutin)
tags: added: promotion-blocker
wes hayutin (weshayutin) wrote :

hrm.. is it possible that this will fix the issue in tripleo? https://review.openstack.org/#/c/617022/

Emilien Macchi (emilienm) wrote :

https://review.openstack.org/#/c/617022/ isn't going to help as nothing uses this environment file.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/617845

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.openstack.org/617845
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=a554bed7e00599f9fa32abe1d15eb2df0dd62aa8
Submitter: Ian Wienand (<email address hidden>)
Branch: master

commit a554bed7e00599f9fa32abe1d15eb2df0dd62aa8
Author: Wes Hayutin <email address hidden>
Date: Tue Nov 13 15:48:43 2018 -0700

    remove volumebootpattern from master

    skip list for volumebootpattern

    Related-Bug: #1802971
    Change-Id: Ib7af2c2c6e4737cb1c3824f5131a885c2ce51434

tags: added: ci tech-debt
removed: alert promotion-blocker
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/617913
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=5102788c23e8984bd13e48377682d77f8d9065c6
Submitter: Zuul
Branch: master

commit 5102788c23e8984bd13e48377682d77f8d9065c6
Author: Chandan Kumar <email address hidden>
Date: Wed Nov 14 14:31:41 2018 +0530

    Fixed logstash file name for tempest

    * The file getting generated is tempest.log under /home/zuul/
      tempest.log not tempest-output.log that's why it is not able
      to indexed in logstash.
    * And tempest_log_file var is used twice in validate-tempest role
      and tempest.log is used at each place which also leads that
      tempest_output.log was never found in ci logs.

    Related-Bug:#1802971

    Change-Id: I9bb9f8bdd0a17d2a1481356caaf186ed6348f6ba

YAMAMOTO Takashi (yamamoto) wrote :

It seems Slawek is already looking at this.

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
status: New → In Progress
wes hayutin (weshayutin) wrote :

Any updates?

Changed in tripleo:
milestone: stein-2 → stein-3
Slawek Kaplonski (slaweq) wrote :

Do you have a link to logs showing this issue? The originally reported logs are no longer available.
Or maybe you have a logstash query to find such failures in current runs?

Changed in tripleo:
assignee: nobody → chandan kumar (chkumar246)
chandan kumar (chkumar246) wrote :

In the latest standalone run, http://logs.openstack.org/93/604293/128/check/tripleo-ci-centos-7-standalone/9e35890/logs/stackviz/#/testrepository.subunit/test-details/tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_port_security_macspoofing_port is passing.

See also http://logs.openstack.org/93/604293/128/check/tripleo-ci-centos-7-standalone/9e35890/logs/tempest.html. I think that in order to reproduce this we need a script that does the following:
* Monitor specific tests across a set of CI jobs and, when one of them fails, send a notification and share the result with whoever is debugging. @wes, what do you think?
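
A very rough sketch of the filtering part of such a script is below. It assumes the per-run test results have already been fetched and parsed into a dict of test id to status; fetching the results and sending the notification are left out. The function name, the `"fail"` status string and the input shape are assumptions for illustration only.

```python
# Test modules to watch, taken from the bug title; adjust as needed.
WATCHED_TESTS = (
    "tempest.scenario.test_volume_boot_pattern",
    "tempest.scenario.test_network_basic_ops",
)


def find_watched_failures(runs, watched=WATCHED_TESTS):
    """Return sorted (run_url, test_id) pairs for watched tests that failed.

    `runs` is assumed to map a CI run URL to a dict of
    {test id: status}, where status is e.g. "success" or "fail".
    This input shape is hypothetical, not an existing tool's output.
    """
    prefixes = tuple(watched)
    hits = []
    for run_url, results in runs.items():
        for test_id, status in results.items():
            # Only report failures of the tests we care about.
            if status == "fail" and test_id.startswith(prefixes):
                hits.append((run_url, test_id))
    return sorted(hits)
```

Each returned pair gives the run to open and the test to look at, which is the minimum needed to share with the person debugging.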

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/635478

chandan kumar (chkumar246) wrote :

https://review.openstack.org/635478 is a review that unskips the boot volume pattern tests so that we can get some logs showing why they are failing.

Slawek Kaplonski (slaweq) wrote :

From patch https://review.openstack.org/635478 it looks to me like this test is now working fine.

Changed in tripleo:
milestone: stein-3 → stein-rc1
wes hayutin (weshayutin) wrote :

no longer an upstream job, moved to standalone

wes hayutin (weshayutin) wrote :

disregard last statement, that was for another bug

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.openstack.org/635478
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=ed1364ecbcf829c29fff708ee025bc09209e7faa
Submitter: Zuul
Branch: master

commit ed1364ecbcf829c29fff708ee025bc09209e7faa
Author: Chandan Kumar <email address hidden>
Date: Thu Feb 7 16:56:50 2019 +0530

    Unskip bootvolumepattern from master

    Boot volume pattern tests was in skip list from very long time
    in order to fix the issue, we need to unskip it and add it in
    os_tempest role and tests so that if it fails again we can
    design a logstash elastic search query to fix it.

    https://tree.taiga.io/project/tripleo-ci-board/task/791

    Change-Id: I87b90a90d9258fbf40a2b9ea5344a8347b1ad076
    Related-Bug: 1802971

wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released
Changed in neutron:
status: In Progress → Fix Committed
Changed in neutron:
status: Fix Committed → Fix Released