Failed Master node installation

Bug #1587411 reported by Alexandr Kostrikov
This bug affects 4 people
Affects             Status         Importance  Assigned to           Milestone
Fuel for OpenStack  Fix Committed  Medium      Alexandr Kostrikov
Mitaka              Fix Committed  High        Alexander Kurenyshev

Bug Description

At https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.unlock_settings_tab/4/console there was a failure.

======================================================================
ERROR: Create environment and set up master node
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.unlock_settings_tab/fuelweb_test/helpers/decorators.py", line 120, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.unlock_settings_tab/fuelweb_test/tests/base_test_case.py", line 355, in setup_master
    self.env.setup_environment()
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.unlock_settings_tab/fuelweb_test/models/environment.py", line 403, in setup_environment
    self.wait_for_provisioning()
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.unlock_settings_tab/fuelweb_test/__init__.py", line 59, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.unlock_settings_tab/fuelweb_test/models/environment.py", line 455, in wait_for_provisioning
    (self.d_env.admin_net), 22), timeout=timeout)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/helpers.py", line 114, in _wait
    return raising_predicate()
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.unlock_settings_tab/fuelweb_test/models/environment.py", line 455, in <lambda>
    (self.d_env.admin_net), 22), timeout=timeout)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/helpers.py", line 62, in _tcp_ping
    s.connect((str(host), int(port)))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused

Additional information:
The environment still exists and may be investigated.
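The traceback above boils down to a TCP connect to port 22 on the admin node being refused while the test framework polls for SSH. A minimal sketch of that kind of polling loop is below; the names `tcp_ping` and `wait_for_port` are illustrative stand-ins, not the actual fuel-qa/devops API:

```python
import socket
import time

def tcp_ping(host, port, timeout=5):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((str(host), int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers ECONNREFUSED (Errno 111), timeouts, unreachable hosts.
        return False

def wait_for_port(host, port, timeout=300, interval=5):
    """Poll until the port accepts connections or the deadline passes."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if tcp_ping(host, port):
            return
        time.sleep(interval)
    raise TimeoutError("port %s:%s did not open within %ss" % (host, port, timeout))
```

In the failing run, every such connect attempt was refused until the outer `_wait` timeout expired, which is why the error surfaces as `[Errno 111] Connection refused` from deep inside the helper stack.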

Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

There is a workaround: restart such threads on CI, since this is a random error, to get true results for a thread in Jenkins.

Dina Belova (dbelova)
Changed in fuel:
status: New → Confirmed
Revision history for this message
Alex Schultz (alex-schultz) wrote :

Looks like this has been resolved in newer versions of the test. Setting to Incomplete unless it is reproduced again.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

@Alex - that is a confirmed bug, but with a very low reproduction rate.
There is a workaround in progress, but the root cause still needs to be fixed.
It is a bug in the ISO/libvirt interaction.

Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

So, either an update of libvirt on CI or fixes in the Fuel ISO will resolve this issue.

Changed in fuel:
assignee: MOS Linux (mos-linux) → Alexandr Kostrikov (akostrikov-mirantis)
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

We haven't observed this issue in the last runs, so removing the swarm-blocker tag.

tags: removed: swarm-blocker
Changed in fuel:
milestone: 9.0 → 10.0
importance: High → Medium
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

This can still be reproduced, but a known workaround exists: re-run the job.

Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :
Revision history for this message
Kyrylo Romanenko (kromanenko) wrote :
Revision history for this message
Kyrylo Romanenko (kromanenko) wrote :

Alexandr, could you take a look at this again?

Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

@Kyrylo, we have a workaround: re-run the job.
It was risky to fix it during acceptance; I am going to discuss the fix with the fuel-devops team and post a reply tomorrow.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :
Changed in fuel:
importance: Medium → Critical
tags: added: bvt-fail
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

smoke_neutron had started on srv30-bud, where a hung environment was running with full load on CPU and memory.

akostrikov@srv30-bud:~$ virsh list
 Id Name State
----------------------------------------------------
 14242 9.0-mos.main.ubuntu.bvt_2.591_admin running
 14243 9.0-mos.main.ubuntu.bvt_2.591_slave-01 running
 14244 9.0-mos.main.ubuntu.bvt_2.591_slave-02 running
 14245 9.0-mos.main.ubuntu.bvt_2.591_slave-03 running
 14246 9.0-mos.main.ubuntu.bvt_2.591_slave-04 running
 14247 9.0-mos.main.ubuntu.bvt_2.591_slave-05 running
 14248 9.0-mos.main.ubuntu.bvt_2.591_slave-06 running
 14311 9.0.system_test.ubuntu.bonding_ha.174.174_admin running
 14317 9.0.system_test.ubuntu.bonding_ha.174.174_slave-04 running
 14318 9.0.system_test.ubuntu.bonding_ha.174.174_slave-01 running
 14319 9.0.system_test.ubuntu.bonding_ha.174.174_slave-02 running
 14320 9.0.system_test.ubuntu.bonding_ha.174.174_slave-03 running
 14321 9.0.system_test.ubuntu.bonding_ha.174.174_slave-05 running

smoke_neutron has been restarted on another ISO.

For now the immediate issue is resolved, but the bug should still be fixed because of its random nature.
In such cases the framework should at least throw an exception that the machine is under load.
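The "throw an exception that the machine is under load" suggestion could be a simple pre-flight check on the hypervisor host before starting VMs. A sketch, assuming a Unix host; the function name and the per-CPU threshold are assumptions, not anything that exists in fuel-qa:

```python
import os

def assert_host_not_overloaded(max_load_per_cpu=1.5):
    """Raise if the 1-minute load average exceeds a per-CPU threshold.

    os.getloadavg() is available on Unix only; the 1.5 threshold is an
    illustrative default, not a tuned value.
    """
    load1, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    if load1 > max_load_per_cpu * cpus:
        raise RuntimeError(
            "host overloaded: 1-min load %.2f over %d CPUs (limit %.1f/cpu)"
            % (load1, cpus, max_load_per_cpu))
```

Failing fast with a message like this would have pointed at srv30-bud's leftover environments instead of surfacing as a timeout-then-connection-refused much later.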

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Maksim, as I see, this test is green for now; moved to High importance.

Changed in fuel:
importance: Critical → High
Revision history for this message
Alexey. Kalashnikov (akalashnikov) wrote :
tags: added: swarm-fail
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

It wasn't fixed in 9.0, let's try to fix it in 9.1.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/353372

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/353372
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=df51f7a04014f7d4243c151caa1d3bf8904c41e4
Submitter: Jenkins
Branch: master

commit df51f7a04014f7d4243c151caa1d3bf8904c41e4
Author: Alexandr Kostrikov <email address hidden>
Date: Wed Aug 10 13:28:12 2016 +0300

    Add retry to the start of VM

    The whole setup should be restarted
    to reinitialize VM resoruces.

    Change-Id: I3967c9be8aaeccf3a292a295ce180cf2b5fd64cc
    Closes-bug: 1587411

Changed in fuel:
status: In Progress → Fix Committed
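The committed fix adds a retry around the start of the VM: if the wait for the master node fails, the whole setup is restarted once to reinitialize VM resources. In outline the pattern looks like the sketch below; the callable names are placeholders, the real change lives in fuel-qa's environment model:

```python
def setup_with_retry(start_env, wait_for_ssh, retries=1):
    """Start the environment; if waiting for SSH fails, restart the
    whole setup to reinitialize VM resources.

    start_env and wait_for_ssh are hypothetical callables standing in
    for the environment setup and the port-22 wait in the framework.
    """
    for attempt in range(retries + 1):
        start_env()
        try:
            wait_for_ssh()
            return
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the real error
            # otherwise fall through and restart the whole setup
```

The key design choice is restarting the *entire* setup rather than just re-polling: re-polling had already been done by the wait helper, so a hang at this point means the VM itself needs fresh resources.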
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/353532

Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/mitaka)

Reviewed: https://review.openstack.org/353532
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=d030b3daaf335c3918d2341ebf1fef52a6e520d8
Submitter: Jenkins
Branch: stable/mitaka

commit d030b3daaf335c3918d2341ebf1fef52a6e520d8
Author: Alexandr Kostrikov <email address hidden>
Date: Wed Aug 10 13:28:12 2016 +0300

    Add retry to the start of VM

    The whole setup should be restarted
    to reinitialize VM resoruces.

    Change-Id: I3967c9be8aaeccf3a292a295ce180cf2b5fd64cc
    Closes-bug: 1587411
    (cherry picked from commit df51f7a04014f7d4243c151caa1d3bf8904c41e4)

tags: added: in-stable-mitaka
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

Reproduced at https://custom-ci.infra.mirantis.net/job/10.0.custom.system_test/1005/console.
The problem is that the disk state had not been erased, and anaconda artifacts from a previous build messed up the deployment.

Changed in fuel:
status: Fix Released → In Progress
Revision history for this message
Sergii Turivnyi (sturivnyi) wrote :

I've changed Importance to Critical.
This bug is a blocker for Tempest.

tags: added: blocker-for-qa
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

@Sergii, this bug is not related to those failures. The change that was made masked issues like https://bugs.launchpad.net/fuel/+bug/1612731 .

Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

Also, it may be related to the old devops version: https://bugs.launchpad.net/fuel/+bug/1612639

Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

I propose not to fix this in 9.1, because the fix causes more errors than it solves: we cannot fix it cleanly without the methods in devops 3. It is possible to add workarounds and helper methods, but that would lead to dirty code without any benefit. There have been no reproductions since the "broken" fix, so it is safer to fix it later than to mess with the code now.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The issue is reproduced again on CI. We need to understand the root cause and not just ignore that the master node sometimes fails to start. This issue is reproduced on MOS 9.0, MOS 9.1 and MOS 10.0. Let's not ignore it.

Revision history for this message
Anastasia Kuznetsova (akuznetsova) wrote :
tags: added: swarm-blocker
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

That is a known issue, which can be worked around by re-running the swarm.
The fix is not trivial, is unstable on the current devops, and causes more failures.

Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

We have tracked the problem with @akurenyshev and seem to have found a way to overcome it without a devops update.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Since the issue is floating and rare, moving to 9.2.

Changed in fuel:
importance: High → Medium
tags: removed: blocker-for-qa swarm-blocker swarm-fail
tags: removed: bvt-fail
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/390597

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/390597
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=4f486061a9e44f4e2993cd12199d96f873ef7f9e
Submitter: Jenkins
Branch: master

commit 4f486061a9e44f4e2993cd12199d96f873ef7f9e
Author: Alexandr Kostrikov <email address hidden>
Date: Tue Oct 25 18:44:01 2016 +0300

    Add reset of admin node after hang

    The first hang can be fixed with reset.
    If any other errors appear on that stage,
    the error should not be tolerated.

    Change-Id: I056879a89cb3cfab45852573730f0ced58043511
    Closes-bug: 1587411

Changed in fuel:
status: In Progress → Fix Committed
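This second fix narrows the recovery: the first hang of the admin node is handled with a reset, but any error after that reset must propagate. A sketch of that "reset once, then fail loudly" semantics, with placeholder callables rather than the actual fuel-qa methods:

```python
def wait_with_single_reset(wait_fn, reset_admin_node):
    """Wait for the admin node; on the first hang, reset it and wait
    again. Any error after the reset is not tolerated.

    wait_fn and reset_admin_node are hypothetical stand-ins for the
    framework's SSH wait and node-reset operations.
    """
    try:
        wait_fn()
    except Exception:
        # First hang: a reset is allowed to recover the node.
        reset_admin_node()
        wait_fn()  # a failure here propagates to the caller
```

Compared with the earlier retry of the whole setup, this limits recovery to a single node reset so that persistent failures are no longer masked by repeated restarts.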
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/392549

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/mitaka)

Reviewed: https://review.openstack.org/392549
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=132fb3ed0777cd21aa07503d74725b43f706f6ba
Submitter: Jenkins
Branch: stable/mitaka

commit 132fb3ed0777cd21aa07503d74725b43f706f6ba
Author: Alexandr Kostrikov <email address hidden>
Date: Tue Oct 25 18:44:01 2016 +0300

    Add reset of admin node after hang

    The first hang can be fixed with reset.
    If any other errors appear on that stage,
    the error should not be tolerated.

    Change-Id: I056879a89cb3cfab45852573730f0ced58043511
    Closes-bug: 1587411
    (cherry picked from commit 4f486061a9e44f4e2993cd12199d96f873ef7f9e)

Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :
Revision history for this message
Elena Ezhova (eezhova) wrote :

The problem in the following bug https://bugs.launchpad.net/fuel/+bug/1648832 can be the same. Can someone take a look at it?

Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :
Revision history for this message
Dmitry Kaigarodеsev (dkaiharodsev) wrote :

Raising the priority of the bug: as of January 20 there will be no green ISO for deployment tests (10-day ISO rotation), so we need to fix the issue with the BVT tests before the mentioned date and avoid a blocker.

Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :

I wonder how this could be fixed by the QA team.
We have implemented the workaround (restarting the master node when this problem occurs), but all these jobs failed after the restart with the same error: no route to host.
It seems the problem should be investigated and solved by developers.

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

Denis, Dmitry,
Jobs #802 - #815 [1] failed for a different reason: the master node couldn't be deployed from the ISO because the logrotate package couldn't be installed. The package exists on the ISO in the 'extra/proposed' repo, but the installer didn't see it.

The last ISO that contains logrotate in the 'Packages' directory is [2]; here is a part of its build log:
-----
...
08:01:10 2017-01-10 08:01:10 URL:http://mirror.seed-cz1.fuel-infra.org/pkgs/snapshots/centos-7.2.1511-2016-12-12-030000/updates/x86_64//Packages/logrotate-3.8.6-7.el7_2.x86_64.rpm [67276/67276] -> "/home/jenkins/workspace/tmp/9.0-community.all/local_mirror/centos/os/x86_64/Packages/logrotate-3.8.6-7.el7_2.x86_64.rpm" [1]
...
08:02:09 [proposed: 32 of 169 ] Downloading Packages/logrotate-3.8.6-7.el7~mos1.x86_64.rpm
08:02:09
logrotate-3.8.6-7.el7~mos1.x86_64.rpm | 66 kB 00:00
...
-----

The next ISO [3] doesn't contain such a string. It contains only the following:
-----
...
14:02:09 [proposed: 32 of 169 ] Downloading Packages/logrotate-3.8.6-7.el7~mos2.x86_64.rpm
14:02:09
logrotate-3.8.6-7.el7~mos2.x86_64.rpm | 66 kB 00:00
...
-----

It seems that our ISO build procedure is broken. It might be related to [4], merged recently.

[1] https://ci.fuel-infra.org/job/9.0-community.main.ubuntu.bvt_2/
[2] https://ci.fuel-infra.org/job/9.0-community.all/5153/
[3] https://ci.fuel-infra.org/job/9.0-community.all/5154/
[4] https://github.com/openstack/fuel-main/commit/cbb8fa306ca62f3c6e23bc7811e58e1464ee711b

Revision history for this message
Roman Vyalov (r0mikiam) wrote :

This change was merged because we use the ISO plus the proposed repos for testing changes for Fuel, since all tests on the Fuel CI use the ISO.

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

Let's move further discussion to the bug *really* related to the issue [1]

[1] https://bugs.launchpad.net/fuel/+bug/1655922

Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :

Since the problem turned out to be different, I am closing this bug.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-qa 11.0.0.0rc1

This issue was fixed in the openstack/fuel-qa 11.0.0.0rc1 release candidate.
