Failed Master node installation

Bug #1587411 reported by Alexandr Kostrikov on 2016-05-31
This bug affects 4 people
Affects               Status    Importance    Assigned to
Fuel for OpenStack              Medium        Alexandr Kostrikov
Mitaka                          High          Alexander Kurenyshev

Bug Description

At https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.unlock_settings_tab/4/console there was a failure.

======================================================================
ERROR: Create environment and set up master node
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.unlock_settings_tab/fuelweb_test/helpers/decorators.py", line 120, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.unlock_settings_tab/fuelweb_test/tests/base_test_case.py", line 355, in setup_master
    self.env.setup_environment()
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.unlock_settings_tab/fuelweb_test/models/environment.py", line 403, in setup_environment
    self.wait_for_provisioning()
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.unlock_settings_tab/fuelweb_test/__init__.py", line 59, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.unlock_settings_tab/fuelweb_test/models/environment.py", line 455, in wait_for_provisioning
    (self.d_env.admin_net), 22), timeout=timeout)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/helpers.py", line 114, in _wait
    return raising_predicate()
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.unlock_settings_tab/fuelweb_test/models/environment.py", line 455, in <lambda>
    (self.d_env.admin_net), 22), timeout=timeout)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/helpers.py", line 62, in _tcp_ping
    s.connect((str(host), int(port)))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
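
For context, the failure happens while the test framework polls the admin node's SSH port until it answers; below is a minimal sketch of that kind of TCP-ping wait loop (hypothetical names, not the actual devops/fuel-qa helpers):
-----
import socket
import time


def tcp_ping(host, port, conn_timeout=5):
    """Return True if a TCP connection to host:port can be established."""
    try:
        sock = socket.create_connection((str(host), int(port)), conn_timeout)
    except socket.error:  # e.g. ECONNREFUSED while sshd is not up yet
        return False
    sock.close()
    return True


def wait_for_ssh(host, port=22, timeout=600, interval=5):
    """Poll host:port until it accepts connections or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if tcp_ping(host, port):
            return
        time.sleep(interval)
    raise RuntimeError('Master node did not open %s:%s within %s seconds'
                       % (host, port, timeout))
-----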

Additional information:
The environment still exists and may be investigated.

There is a workaround: restart such threads on CI, since this is a random error, to get true results for a thread in Jenkins.

Dina Belova (dbelova) on 2016-06-01
Changed in fuel:
status: New → Confirmed
Alex Schultz (alex-schultz) wrote :

Looks like this has been resolved in newer versions of the test. Setting to Incomplete unless it is reproduced.

Changed in fuel:
status: Confirmed → Incomplete

@Alex - this is a confirmed bug, but with a very low reproduction rate.
There is a workaround in progress, but the root cause still needs to be fixed.
It is a bug in the ISO/libvirt interaction.

Changed in fuel:
status: Incomplete → Confirmed

So either an update of libvirt on CI or fixes in the Fuel ISO will resolve this issue.

Changed in fuel:
assignee: MOS Linux (mos-linux) → Alexandr Kostrikov (akostrikov-mirantis)

We haven't observed this issue in the last runs, so removing the swarm-blocker tag.

tags: removed: swarm-blocker
Changed in fuel:
milestone: 9.0 → 10.0
importance: High → Medium

This can be reproduced, but a known workaround exists - re-run the job.

Kyrylo Romanenko (kromanenko) wrote :

Alexandr, could you take a look at this again?

@Kyrylo, we have a workaround - re-run the job.
It was risky to fix this during acceptance; I am going to discuss the fix with the fuel-devops team and post a reply tomorrow.

Maksim Malchuk (mmalchuk) wrote :
Changed in fuel:
importance: Medium → Critical
tags: added: bvt-fail

smoke_neutron was started on srv30-bud, where a hung environment was consuming CPU and memory at full load.

akostrikov@srv30-bud:~$ virsh list
 Id Name State
----------------------------------------------------
 14242 9.0-mos.main.ubuntu.bvt_2.591_admin running
 14243 9.0-mos.main.ubuntu.bvt_2.591_slave-01 running
 14244 9.0-mos.main.ubuntu.bvt_2.591_slave-02 running
 14245 9.0-mos.main.ubuntu.bvt_2.591_slave-03 running
 14246 9.0-mos.main.ubuntu.bvt_2.591_slave-04 running
 14247 9.0-mos.main.ubuntu.bvt_2.591_slave-05 running
 14248 9.0-mos.main.ubuntu.bvt_2.591_slave-06 running
 14311 9.0.system_test.ubuntu.bonding_ha.174.174_admin running
 14317 9.0.system_test.ubuntu.bonding_ha.174.174_slave-04 running
 14318 9.0.system_test.ubuntu.bonding_ha.174.174_slave-01 running
 14319 9.0.system_test.ubuntu.bonding_ha.174.174_slave-02 running
 14320 9.0.system_test.ubuntu.bonding_ha.174.174_slave-03 running
 14321 9.0.system_test.ubuntu.bonding_ha.174.174_slave-05 running

smoke_neutron has been restarted on another ISO.

For now the immediate failure is resolved, but the bug should still be fixed due to its random nature.
In such cases, at the very least an exception should be thrown indicating that the machine is under load.
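
As an illustration of that suggestion, a minimal sketch of such a pre-flight load check (hypothetical helper, not part of fuel-qa) could compare the 1-minute load average to the CPU count before setting up the environment:
-----
import multiprocessing
import os


def assert_host_not_overloaded(max_load_per_cpu=1.5):
    """Raise if the 1-minute load average suggests the host is saturated."""
    load_1min, _, _ = os.getloadavg()
    cpu_count = multiprocessing.cpu_count()
    if load_1min > cpu_count * max_load_per_cpu:
        raise RuntimeError(
            'Host is under load: load average %.1f with %d CPUs; '
            'refusing to start the environment' % (load_1min, cpu_count))
-----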

Maksim Malchuk (mmalchuk) wrote :
Nastya Urlapova (aurlapova) wrote :

@Maksim, as I see, this test is green for now; moved to High importance.

Changed in fuel:
importance: Critical → High
tags: added: swarm-fail

It wasn't fixed in 9.0, let's try to fix it in 9.1.

Fix proposed to branch: master
Review: https://review.openstack.org/353372

Changed in fuel:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/353372
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=df51f7a04014f7d4243c151caa1d3bf8904c41e4
Submitter: Jenkins
Branch: master

commit df51f7a04014f7d4243c151caa1d3bf8904c41e4
Author: Alexandr Kostrikov <email address hidden>
Date: Wed Aug 10 13:28:12 2016 +0300

    Add retry to the start of VM

    The whole setup should be restarted
    to reinitialize VM resoruces.

    Change-Id: I3967c9be8aaeccf3a292a295ce180cf2b5fd64cc
    Closes-bug: 1587411
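
The change above adds a retry around VM startup; a minimal sketch of that pattern (hypothetical callables, not the actual fuel-qa code) might look like:
-----
import logging
import time

logger = logging.getLogger(__name__)


def start_vm_with_retry(start_vm, erase_vm, retries=2, delay=30):
    """Try to start a VM, re-creating its resources between failed attempts.

    `start_vm` and `erase_vm` are caller-supplied callables (hypothetical
    stand-ins for the devops node start/erase helpers).
    """
    for attempt in range(1, retries + 1):
        try:
            return start_vm()
        except Exception:
            logger.exception('VM start failed (attempt %d/%d)', attempt, retries)
            if attempt == retries:
                raise
            erase_vm()  # reinitialize VM resources before the next try
            time.sleep(delay)
-----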

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → Fix Released

Reviewed: https://review.openstack.org/353532
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=d030b3daaf335c3918d2341ebf1fef52a6e520d8
Submitter: Jenkins
Branch: stable/mitaka

commit d030b3daaf335c3918d2341ebf1fef52a6e520d8
Author: Alexandr Kostrikov <email address hidden>
Date: Wed Aug 10 13:28:12 2016 +0300

    Add retry to the start of VM

    The whole setup should be restarted
    to reinitialize VM resoruces.

    Change-Id: I3967c9be8aaeccf3a292a295ce180cf2b5fd64cc
    Closes-bug: 1587411
    (cherry picked from commit df51f7a04014f7d4243c151caa1d3bf8904c41e4)

tags: added: in-stable-mitaka

Reproduced at https://custom-ci.infra.mirantis.net/job/10.0.custom.system_test/1005/console.
The problem is that the disk state had not been erased, and Anaconda artifacts from the previous build messed up the deployment.
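
One way to avoid stale disk state, sketched below under the assumption that the nodes use ordinary libvirt volumes (hypothetical names, not the CI's actual cleanup code), is to delete and recreate a volume before reusing an environment slot:
-----
import subprocess


def recreate_volume(pool, volume, size_gb):
    """Drop a possibly dirty libvirt volume and recreate it empty."""
    # Ignore errors if the volume does not exist yet.
    subprocess.call(['virsh', 'vol-delete', volume, '--pool', pool])
    subprocess.check_call(['virsh', 'vol-create-as', pool, volume,
                           '%dG' % size_gb, '--format', 'qcow2'])
-----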

Changed in fuel:
status: Fix Released → In Progress
Sergii Turivnyi (sturivnyi) wrote :

I've changed the Importance to Critical.
This bug is a blocker for Tempest.

tags: added: blocker-for-qa

@Sergii, this bug is not related to those failures. The change that was made masked issues like https://bugs.launchpad.net/fuel/+bug/1612731 .

Also, it may be related to the old devops version: https://bugs.launchpad.net/fuel/+bug/1612639

I propose not to fix it in 9.1, since the fix causes more errors than it solves: we cannot fix it cleanly without methods from devops 3. It is possible to add workarounds and methods, but that would lead to dirty code without any benefit. There have been no reproductions since the "broken" fix, so it is safer to fix it later than to mess with the code now.

The issue has been reproduced again on CI. We need to understand the root cause and not just ignore that the master node sometimes fails to start. The issue has been reproduced on MOS 9.0, MOS 9.1 and MOS 10.0. Let's not ignore it.

tags: added: swarm-blocker

This is a known issue, which can be worked around by re-running swarm.
The fix is not trivial, is unstable on the current devops version, and causes more failures.

We have tracked down the problem with @akurenyshev and seem to have found a way to overcome it without a devops update.

Nastya Urlapova (aurlapova) wrote :

Since the issue is floating and rare, moved to 9.2.

Changed in fuel:
importance: High → Medium
tags: removed: blocker-for-qa swarm-blocker swarm-fail
tags: removed: bvt-fail

Reviewed: https://review.openstack.org/390597
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=4f486061a9e44f4e2993cd12199d96f873ef7f9e
Submitter: Jenkins
Branch: master

commit 4f486061a9e44f4e2993cd12199d96f873ef7f9e
Author: Alexandr Kostrikov <email address hidden>
Date: Tue Oct 25 18:44:01 2016 +0300

    Add reset of admin node after hang

    The first hang can be fixed with reset.
    If any other errors appear on that stage,
    the error should not be tolerated.

    Change-Id: I056879a89cb3cfab45852573730f0ced58043511
    Closes-bug: 1587411
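
A hedged sketch of the "reset once, then fail" behaviour described in the commit message (hypothetical callables, not the actual fuel-qa implementation):
-----
def wait_for_admin_with_single_reset(wait_for_ssh, reset_node):
    """Wait for the admin node; allow exactly one reset if the first wait hangs.

    `wait_for_ssh` and `reset_node` are caller-supplied callables
    (hypothetical stand-ins for the environment helpers).
    """
    try:
        wait_for_ssh()
    except Exception:
        # The first hang can be fixed with a reset ...
        reset_node()
        # ... but any further error at this stage must not be tolerated.
        wait_for_ssh()
-----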

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/392549
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=132fb3ed0777cd21aa07503d74725b43f706f6ba
Submitter: Jenkins
Branch: stable/mitaka

commit 132fb3ed0777cd21aa07503d74725b43f706f6ba
Author: Alexandr Kostrikov <email address hidden>
Date: Tue Oct 25 18:44:01 2016 +0300

    Add reset of admin node after hang

    The first hang can be fixed with reset.
    If any other errors appear on that stage,
    the error should not be tolerated.

    Change-Id: I056879a89cb3cfab45852573730f0ced58043511
    Closes-bug: 1587411
    (cherry picked from commit 4f486061a9e44f4e2993cd12199d96f873ef7f9e)

Elena Ezhova (eezhova) wrote :

The problem in the following bug https://bugs.launchpad.net/fuel/+bug/1648832 can be the same. Can someone take a look at it?

Raising the priority of the bug: since January 20 there will be no green ISO for deployment tests (10-day ISO rotation), so we need to fix the issue with BVT tests before the mentioned date to avoid a blocker.

I wonder how this could be fixed by the QA team.
We have implemented the workaround - restarting the master node if this problem occurs - but all these jobs failed after the restart with the same error: no route to host.
It seems the problem should be investigated and solved by developers.

Dmitry Teselkin (teselkin-d) wrote :

Denis, Dmitry,
Jobs #802 - #815 [1] failed for a different reason - the master node couldn't be deployed from the ISO because the logrotate package couldn't be installed. The package exists on the ISO in the 'extra/proposed' repo, but the installer didn't see it.

The last ISO that contains logrotate in the 'Packages' directory is [2]; here is a part of its build log:
-----
...
08:01:10 2017-01-10 08:01:10 URL:http://mirror.seed-cz1.fuel-infra.org/pkgs/snapshots/centos-7.2.1511-2016-12-12-030000/updates/x86_64//Packages/logrotate-3.8.6-7.el7_2.x86_64.rpm [67276/67276] -> "/home/jenkins/workspace/tmp/9.0-community.all/local_mirror/centos/os/x86_64/Packages/logrotate-3.8.6-7.el7_2.x86_64.rpm" [1]
...
08:02:09 [proposed: 32 of 169 ] Downloading Packages/logrotate-3.8.6-7.el7~mos1.x86_64.rpm
08:02:09
logrotate-3.8.6-7.el7~mos1.x86_64.rpm | 66 kB 00:00
...
-----

The next ISO [3] doesn't contain such a string. It contains only the following:
-----
...
14:02:09 [proposed: 32 of 169 ] Downloading Packages/logrotate-3.8.6-7.el7~mos2.x86_64.rpm
14:02:09
logrotate-3.8.6-7.el7~mos2.x86_64.rpm | 66 kB 00:00
...
-----

It seems that our ISO build procedure is broken. It might be related to [4], which was merged recently.

[1] https://ci.fuel-infra.org/job/9.0-community.main.ubuntu.bvt_2/
[2] https://ci.fuel-infra.org/job/9.0-community.all/5153/
[3] https://ci.fuel-infra.org/job/9.0-community.all/5154/
[4] https://github.com/openstack/fuel-main/commit/cbb8fa306ca62f3c6e23bc7811e58e1464ee711b

Roman Vyalov (r0mikiam) wrote :

This change was merged because we use the ISO plus the proposed repos for testing Fuel changes, since all tests on the Fuel CI use the ISO.

Dmitry Teselkin (teselkin-d) wrote :

Let's move further discussion to the bug *really* related to the issue [1]

[1] https://bugs.launchpad.net/fuel/+bug/1655922

If the problem is different, I'll close this bug.

This issue was fixed in the openstack/fuel-qa 11.0.0.0rc1 release candidate.
