Fuel for OpenStack

Node has not become online after repetitive cold reboot

Bug #1588877 reported by Artem Hrechanychenko on 2016-06-03

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Committed	High	Unassigned	Fuel for OpenStack 10.0
Mitaka	Fix Released	High	Unassigned	Fuel for OpenStack 9.1
Newton	Fix Committed	High	Unassigned	Fuel for OpenStack 10.0

Bug Description

Detailed bug description:
2016-06-03 07:57:56,726 - ERROR decorators.py:126 -- Traceback (most recent call last):
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.repetitive_restart/fuelweb_test/helpers/decorators.py", line 120, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.repetitive_restart/fuelweb_test/helpers/decorators.py", line 817, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.repetitive_restart/fuelweb_test/tests/tests_strength/test_repetitive_restart.py", line 120, in ceph_partitions_repetitive_cold_restart
    'slave-05']))
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.repetitive_restart/fuelweb_test/models/fuel_web_client.py", line 2005, in cold_restart_nodes
    ' after cold start'.format(node.name))
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/asserts.py", line 163, in assert_true
    raise ASSERTION_ERROR(message)
AssertionError: Node slave-01 has not become online after cold start

Steps to reproduce:
        1. Revert snapshot 'prepare_load_ceph_ha'
        2. Wait until MySQL Galera is UP on some controller
        3. Check Ceph status
        4. Run ostf
        5. Fill ceph partitions on all nodes up to 30%
        6. Check Ceph status
        7. Run RALLY
        8. 100 times repetitive reboot: <<<<failed on 3rd reboot
        9. Cold restart of all nodes
        10. Wait for HA services ready
        11. Wait until MySQL Galera is UP on some controller
        12. Run ostf
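
The repetitive reboot in step 8 can be sketched roughly as below. This is only an illustration, not the code from test_repetitive_restart.py: `repetitive_cold_restart`, `cold_restart` and `is_online` are hypothetical names, and the two callables stand in for whatever the test framework really uses. The point is that recording the iteration number in the assertion message makes a failure such as "failed on 3rd reboot" visible directly in the log.

```python
def repetitive_cold_restart(node_names, cold_restart, is_online, iterations=100):
    """Cold-restart all nodes `iterations` times and check they come back.

    `cold_restart(node_names)` and `is_online(name)` are hypothetical
    callables supplied by the caller; they are not fuel-qa APIs.
    """
    for iteration in range(1, iterations + 1):
        cold_restart(node_names)
        for name in node_names:
            if not is_online(name):
                # Same wording as the failure in this report, plus the
                # iteration number so the failing reboot is obvious.
                raise AssertionError(
                    "Node {0} has not become online after cold start "
                    "(iteration {1})".format(name, iteration))
    return iterations
```

With a fake `is_online` that reports slave-01 as offline from the third restart onwards, the function raises on iteration 3, which mirrors the behaviour described in this report.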

Expected results:
All nodes come online.
Actual result:
AssertionError: Node slave-01 has not become online after cold start

Reproducibility:
https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.repetitive_restart/97/console

Workaround:
-

Impact:
swarm

Tags:

Revision history for this message

Artem Hrechanychenko (agrechanichenko) wrote on 2016-06-03:

fail_error_ceph_partitions_repetitive_cold_restart_diagnostic-logs_2016_06_03__07_57_06.tgz (45.0 MiB, application/x-tar)

Changed in fuel:
assignee:	nobody → Fuel Sustaining (fuel-sustaining-team)

Ilya Kutukov (ikutukov) on 2016-06-03

Changed in fuel:
status:	New → Confirmed
tags:	added: area-python

Dmitry Pyzhov (dpyzhov) on 2016-06-03

tags:

added: area-library
removed: area-python

Kyrylo Galanov (kgalanov) on 2016-06-06

Changed in fuel:
assignee:	Fuel Sustaining (fuel-sustaining-team) → Kyrylo Galanov (kgalanov)

Maksim Malchuk (mmalchuk) on 2016-06-06

Changed in fuel:
assignee:	Kyrylo Galanov (kgalanov) → Oleksiy Molchanov (omolchanov)

Kyrylo Galanov (kgalanov) wrote on 2016-06-06:

The issue has not been reproduced in CI since build 97. The other environments used by this test look fine after a manual check.
Some improvements should be made in the devops library to make the restart procedure more reliable and repeatable.

Changed in fuel:
assignee:	Oleksiy Molchanov (omolchanov) → Kyrylo Galanov (kgalanov)
status:	Confirmed → Incomplete

Dmitry Pyzhov (dpyzhov) wrote on 2016-06-06:

According to Kyrylo, this is an issue in the test.

Changed in fuel:
assignee:	Kyrylo Galanov (kgalanov) → Fuel QA Team (fuel-qa)
tags:	added: area-qa removed: area-library

Sergey Shevorakov (sshevorakov) on 2016-06-08

tags:

added: swarm-fail

Andrey Sledzinskiy (asledzinskiy) wrote on 2016-06-09:

Let's track this issue in 10.0. It doesn't reproduce on the latest 9.0 runs.

Changed in fuel:
milestone:	9.0 → 10.0
status:	Incomplete → Confirmed

Andrey Sledzinskiy (asledzinskiy) wrote on 2016-06-13:

So the problem is that after the third reboot the nodes go into maintenance mode - https://docs.mirantis.com/openstack/fuel/fuel-8.0/operations.html#maintenance-mode

What we need to do in the test: disable the UMM feature to prevent this behaviour.

Quote from docs:
The configuration options are:

    UMM=yes
    REBOOT_COUNT=2
    COUNTER_RESET_TIME=10

where:

UMM
    tells the system to go into the maintenance mode based on the REBOOT_COUNT and COUNTER_RESET_TIME values. If the value is anything other than yes (or if the UMM.conf file is missing), the system will go into the native Ubuntu recovery mode.
REBOOT_COUNT
    determines the number of unclean reboots that trigger the system to go into the maintenance mode.
COUNTER_RESET_TIME
    determines the period of time (in minutes) before the Unclean reboot counter reset.
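
To see why the quoted defaults would stop the test on the third reboot, here is a toy model of the counter. It is not the real UMM logic. It assumes (my assumption, not stated in the docs) that maintenance mode trips once the number of unclean reboots exceeds REBOOT_COUNT, and that the counter resets when at least COUNTER_RESET_TIME minutes pass between two unclean reboots. The function names are made up for this sketch.

```python
def parse_umm_conf(text):
    """Parse KEY=VALUE lines (the format quoted above) into a dict."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def reboot_that_trips_umm(reboot_minutes, options):
    """Return the 1-based index of the unclean reboot that would trip
    maintenance mode, or None if it never trips.

    `reboot_minutes` lists the times of the unclean reboots, in minutes.
    Toy model only: the ">" comparison and the reset rule are assumptions.
    """
    if options.get("UMM") != "yes":
        # Per the quoted docs, maintenance mode is not used in this case
        # (the native Ubuntu recovery mode applies instead).
        return None
    limit = int(options["REBOOT_COUNT"])
    reset_after = int(options["COUNTER_RESET_TIME"])
    counter = 0
    previous = None
    for index, minute in enumerate(reboot_minutes, start=1):
        if previous is not None and minute - previous >= reset_after:
            counter = 0  # assumed: a long enough quiet period resets the counter
        counter += 1
        previous = minute
        if counter > limit:
            return index
    return None
```

Under these assumptions, three cold reboots a few minutes apart with REBOOT_COUNT=2 and COUNTER_RESET_TIME=10 trip on the third one, while reboots spaced more than ten minutes apart never do. That is consistent with the test failing on its third iteration.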

OpenStack Infra (hudson-openstack) wrote on 2016-06-13: Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/329146

Changed in fuel:
assignee:	Fuel QA Team (fuel-qa) → Maksym Strukov (unbelll)
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-07-12: Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/329146
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=5809bb6208ae464706f10ae39aa53fd5134205a3
Submitter: Jenkins
Branch: master

commit 5809bb6208ae464706f10ae39aa53fd5134205a3
Author: Maksym Strukov <email address hidden>
Date: Mon Jun 13 21:49:07 2016 +0300

Disable UMM before nodes repetitive restart

    The problem is that after third reboot nodes are going
    into maintenance mode and became unavailable for further
    testing. We need disable UMM feature to prevent such behaviour.

Change-Id: I1cce936201872f47d13e3c482e23e1ba4cfc24b2
Closes-Bug: #1588877

Changed in fuel:
status:	In Progress → Fix Committed

ElenaRossokhina (esolomina) wrote on 2016-07-18:

Reproduced again on CI: https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.repetitive_restart/140/console

OpenStack Infra (hudson-openstack) wrote on 2016-07-26: Fix proposed to fuel-qa (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/347198

OpenStack Infra (hudson-openstack) wrote on 2016-07-26: Fix merged to fuel-qa (stable/mitaka)

Reviewed: https://review.openstack.org/347198
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=e21c9a6099b8a0f56fd0314816c5ad7921b8c724
Submitter: Jenkins
Branch: stable/mitaka

commit e21c9a6099b8a0f56fd0314816c5ad7921b8c724
Author: Maksym Strukov <email address hidden>
Date: Mon Jun 13 21:49:07 2016 +0300

Disable UMM before nodes repetitive restart

    The problem is that after third reboot nodes are going
    into maintenance mode and became unavailable for further
    testing. We need disable UMM feature to prevent such behaviour.

Change-Id: I1cce936201872f47d13e3c482e23e1ba4cfc24b2
Closes-Bug: #1588877

Tatyana Kuterina (tkuterina) on 2016-08-02

tags:

added: on-verification

Tatyana Kuterina (tkuterina) wrote on 2016-08-02:

Verified on 9.1 snapshot #81

[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
495
cat /etc/fuel_build_number:
495
cat /etc/fuel_release:
9.0
cat /etc/fuel_openstack_version:
mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
python-packetary-9.0.0-1.mos142.noarch
fuel-migrate-9.0.0-1.mos8496.noarch
fuel-release-9.0.0-1.mos6349.noarch
fuel-bootstrap-cli-9.0.0-1.mos285.noarch
fuel-openstack-metadata-9.0.0-1.mos8748.noarch
fuel-ostf-9.0.0-1.mos938.noarch
shotgun-9.0.0-1.mos90.noarch
python-fuelclient-9.0.0-1.mos325.noarch
fuel-9.0.0-1.mos6349.noarch
fuel-misc-9.0.0-1.mos8496.noarch
fuel-provisioning-scripts-9.0.0-1.mos8748.noarch
rubygem-astute-9.0.0-1.mos753.noarch
fuel-setup-9.0.0-1.mos6349.noarch
network-checker-9.0.0-1.mos74.x86_64
fuel-agent-9.0.0-1.mos285.noarch
fuel-ui-9.0.0-1.mos2717.noarch
fuel-library9.0-9.0.0-1.mos8496.noarch
nailgun-mcagents-9.0.0-1.mos753.noarch
fuel-notify-9.0.0-1.mos8496.noarch
fuel-nailgun-9.0.0-1.mos8748.noarch
fuelmenu-9.0.0-1.mos274.noarch
fuel-mirror-9.0.0-1.mos142.noarch
fuel-utils-9.0.0-1.mos8496.noarch

FUEL_QA_COMMIT=bfb750898b0f5ef196eb0c8a295cc29279487ade
UBUNTU_MIRROR_ID=ubuntu-2016-07-31-170655
CENTOS_MIRROR_ID=centos-7.2.1511-2016-05-31-083834
MOS_UBUNTU_MIRROR_ID=9.0-2016-08-01-154321
MOS_CENTOS_OS_MIRROR_ID=os-2016-06-23-135731
MOS_CENTOS_PROPOSED_MIRROR_ID=proposed-2016-08-01-154321
MOS_CENTOS_UPDATES_MIRROR_ID=updates-2016-06-23-135916
MOS_CENTOS_HOLDBACK_MIRROR_ID=holdback-2016-06-23-140047
MOS_CENTOS_HOTFIX_MIRROR_ID=hotfix-2016-07-18-162958
MOS_CENTOS_SECURITY_MIRROR_ID=security-2016-06-23-140002