Node has not become online after repetitive cold reboot

Bug #1588877 reported by Artem Hrechanychenko
This bug affects 1 person
Affects              Status         Importance  Assigned to  Milestone
Fuel for OpenStack   Fix Committed  High        Unassigned
Mitaka               Fix Released   High        Unassigned
Newton               Fix Committed  High        Unassigned

Bug Description

Detailed bug description:
 016-06-03 07:57:56,726 - ERROR decorators.py:126 -- Traceback (most recent call last):
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.repetitive_restart/fuelweb_test/helpers/decorators.py", line 120, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.repetitive_restart/fuelweb_test/helpers/decorators.py", line 817, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.repetitive_restart/fuelweb_test/tests/tests_strength/test_repetitive_restart.py", line 120, in ceph_partitions_repetitive_cold_restart
    'slave-05']))
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.repetitive_restart/fuelweb_test/models/fuel_web_client.py", line 2005, in cold_restart_nodes
    ' after cold start'.format(node.name))
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/asserts.py", line 163, in assert_true
    raise ASSERTION_ERROR(message)
AssertionError: Node slave-01 has not become online after cold start

Steps to reproduce:
        1. Revert snapshot 'prepare_load_ceph_ha'
        2. Wait until MySQL Galera is UP on some controller
        3. Check Ceph status
        4. Run ostf
        5. Fill ceph partitions on all nodes up to 30%
        6. Check Ceph status
        7. Run RALLY
        8. 100 times repetitive reboot: <<<<failed on 3rd reboot
        9. Cold restart of all nodes
        10. Wait for HA services ready
        11. Wait until MySQL Galera is UP on some controller
        12. Run ostf

Expected results:
 All nodes become online after the cold restart
Actual result:
 AssertionError: Node slave-01 has not become online after cold start

Reproducibility:
 https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.repetitive_restart/97/console

Workaround:
 -

Impact:
 swarm

Artem Hrechanychenko (agrechanichenko) wrote :
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
Ilya Kutukov (ikutukov)
Changed in fuel:
status: New → Confirmed
tags: added: area-python
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
removed: area-python
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Kyrylo Galanov (kgalanov)
Changed in fuel:
assignee: Kyrylo Galanov (kgalanov) → Oleksiy Molchanov (omolchanov)
Kyrylo Galanov (kgalanov) wrote :

The issue has not been reproduced in CI since build 97. The other environments used by this test look fine after a manual check.
The devops library should be improved to make the restart procedure more reliable and repeatable.

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Kyrylo Galanov (kgalanov)
status: Confirmed → Incomplete
Dmitry Pyzhov (dpyzhov) wrote :

According to Kyrylo, this is an issue in the test.

Changed in fuel:
assignee: Kyrylo Galanov (kgalanov) → Fuel QA Team (fuel-qa)
tags: added: area-qa
removed: area-library
tags: added: swarm-fail
Andrey Sledzinskiy (asledzinskiy) wrote :

Let's try to track this issue in 10.0. It doesn't reproduce on the latest 9.0 runs.

Changed in fuel:
milestone: 9.0 → 10.0
status: Incomplete → Confirmed
Andrey Sledzinskiy (asledzinskiy) wrote :

So the problem is that after the third reboot the nodes go into maintenance mode - https://docs.mirantis.com/openstack/fuel/fuel-8.0/operations.html#maintenance-mode

What we need to do in the test: disable the UMM feature to prevent this behaviour.

Quote from docs:
The configuration options are:

    UMM=yes
    REBOOT_COUNT=2
    COUNTER_RESET_TIME=10

where:

UMM
    tells the system to go into the maintenance mode based on the REBOOT_COUNT and COUNTER_RESET_TIME values. If the value is anything other than yes (or if the UMM.conf file is missing), the system will go into the native Ubuntu recovery mode.
REBOOT_COUNT
    determines the number of unclean reboots that trigger the system to go into the maintenance mode.
COUNTER_RESET_TIME
    determines the period of time (in minutes) before the Unclean reboot counter reset.
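
The interaction of the three options quoted above can be illustrated with a small model. This is only a sketch under stated assumptions, not UMM's actual implementation: the class `UmmConfig` and the function `simulate_unclean_reboots` are hypothetical names, the threshold is assumed to be "counter greater than REBOOT_COUNT" (chosen because it matches the failure observed on the third reboot), and the reset interval is assumed to be measured from the previous unclean reboot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UmmConfig:
    """Hypothetical container for the three options quoted from the docs."""
    umm: str = "yes"
    reboot_count: int = 2
    counter_reset_time: int = 10  # minutes


def simulate_unclean_reboots(reboot_minutes, config=None):
    """Return the 1-based index of the reboot that would enter maintenance
    mode, or None if it is never triggered.

    reboot_minutes: ascending timestamps (in minutes) of unclean reboots.
    """
    if config is None:
        config = UmmConfig()
    counter = 0
    previous = None
    for index, minute in enumerate(reboot_minutes, start=1):
        # Assumption: the counter resets once COUNTER_RESET_TIME minutes
        # have passed since the previous unclean reboot.
        if previous is not None and minute - previous >= config.counter_reset_time:
            counter = 0
        counter += 1
        previous = minute
        # Assumption: maintenance mode starts when the counter exceeds
        # REBOOT_COUNT; any UMM value other than "yes" skips this path.
        if config.umm == "yes" and counter > config.reboot_count:
            return index
    return None
```

Under these assumptions, reboots a couple of minutes apart (as in the repetitive restart test) trigger maintenance mode on the third reboot with the default values, while reboots spaced further apart than COUNTER_RESET_TIME never do.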

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/329146

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Maksym Strukov (unbelll)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/329146
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=5809bb6208ae464706f10ae39aa53fd5134205a3
Submitter: Jenkins
Branch: master

commit 5809bb6208ae464706f10ae39aa53fd5134205a3
Author: Maksym Strukov <email address hidden>
Date: Mon Jun 13 21:49:07 2016 +0300

    Disable UMM before nodes repetitive restart

    The problem is that after third reboot nodes are going
    into maintenance mode and became unavailable for further
    testing. We need disable UMM feature to prevent such behaviour.

    Change-Id: I1cce936201872f47d13e3c482e23e1ba4cfc24b2
    Closes-Bug: #1588877
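
The commit message above describes disabling UMM before the repetitive restarts. The actual change is in review 329146 and is not reproduced in this report; the following is only a minimal sketch of what "disabling UMM" could look like at the configuration level. The helper name `disable_umm` is hypothetical, and the value "no" is an assumption: the quoted docs only state that any value other than "yes" keeps the node out of maintenance mode (it falls back to the native Ubuntu recovery mode).

```python
def disable_umm(conf_text):
    """Return the UMM config text with the UMM option set to 'no'.

    All other lines are kept unchanged. If no UMM option is present,
    one is appended. This is an illustrative helper, not fuel-qa code.
    """
    lines = []
    found = False
    for line in conf_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key == "UMM":
            lines.append("UMM=no")  # assumed "disabled" value
            found = True
        else:
            lines.append(line)
    if not found:
        lines.append("UMM=no")
    return "\n".join(lines) + "\n"
```

A test would apply such a change on every node before step 8 of the reproduction steps, so that repeated cold restarts no longer push the nodes into maintenance mode.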

Changed in fuel:
status: In Progress → Fix Committed
ElenaRossokhina (esolomina) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/347198

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/mitaka)

Reviewed: https://review.openstack.org/347198
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=e21c9a6099b8a0f56fd0314816c5ad7921b8c724
Submitter: Jenkins
Branch: stable/mitaka

commit e21c9a6099b8a0f56fd0314816c5ad7921b8c724
Author: Maksym Strukov <email address hidden>
Date: Mon Jun 13 21:49:07 2016 +0300

    Disable UMM before nodes repetitive restart

    The problem is that after third reboot nodes are going
    into maintenance mode and became unavailable for further
    testing. We need disable UMM feature to prevent such behaviour.

    Change-Id: I1cce936201872f47d13e3c482e23e1ba4cfc24b2
    Closes-Bug: #1588877

tags: added: on-verification
Tatyana Kuterina (tkuterina) wrote :

Verified on 9.1 snapshot #81

[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 495
cat /etc/fuel_build_number:
 495
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 python-packetary-9.0.0-1.mos142.noarch
 fuel-migrate-9.0.0-1.mos8496.noarch
 fuel-release-9.0.0-1.mos6349.noarch
 fuel-bootstrap-cli-9.0.0-1.mos285.noarch
 fuel-openstack-metadata-9.0.0-1.mos8748.noarch
 fuel-ostf-9.0.0-1.mos938.noarch
 shotgun-9.0.0-1.mos90.noarch
 python-fuelclient-9.0.0-1.mos325.noarch
 fuel-9.0.0-1.mos6349.noarch
 fuel-misc-9.0.0-1.mos8496.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8748.noarch
 rubygem-astute-9.0.0-1.mos753.noarch
 fuel-setup-9.0.0-1.mos6349.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-agent-9.0.0-1.mos285.noarch
 fuel-ui-9.0.0-1.mos2717.noarch
 fuel-library9.0-9.0.0-1.mos8496.noarch
 nailgun-mcagents-9.0.0-1.mos753.noarch
 fuel-notify-9.0.0-1.mos8496.noarch
 fuel-nailgun-9.0.0-1.mos8748.noarch
 fuelmenu-9.0.0-1.mos274.noarch
 fuel-mirror-9.0.0-1.mos142.noarch
 fuel-utils-9.0.0-1.mos8496.noarch

FUEL_QA_COMMIT=bfb750898b0f5ef196eb0c8a295cc29279487ade
UBUNTU_MIRROR_ID=ubuntu-2016-07-31-170655
CENTOS_MIRROR_ID=centos-7.2.1511-2016-05-31-083834
MOS_UBUNTU_MIRROR_ID=9.0-2016-08-01-154321
MOS_CENTOS_OS_MIRROR_ID=os-2016-06-23-135731
MOS_CENTOS_PROPOSED_MIRROR_ID=proposed-2016-08-01-154321
MOS_CENTOS_UPDATES_MIRROR_ID=updates-2016-06-23-135916
MOS_CENTOS_HOLDBACK_MIRROR_ID=holdback-2016-06-23-140047
MOS_CENTOS_HOTFIX_MIRROR_ID=hotfix-2016-07-18-162958
MOS_CENTOS_SECURITY_MIRROR_ID=security-2016-06-23-140002

tags: removed: on-verification
Maksym Strukov (unbelll)
Changed in fuel:
assignee: Maksym Strukov (unbelll) → nobody