Node failed to reboot after reset environment and then deploy

Bug #1318567 reported by Andrey Sledzinskiy
This bug affects 1 person
Affects            | Status       | Importance | Assigned to       | Milestone
-------------------|--------------|------------|-------------------|----------
Fuel for OpenStack | Fix Released | High       | Vladimir Sharshov |
5.0.x              | Won't Fix    | Medium     | Vladimir Sharshov |
5.1.x              | Fix Released | High       | Vladimir Sharshov |
6.0.x              | Fix Released | High       | Vladimir Sharshov |

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/0_0_swarm/job/master_fuelmain.system_test.ubuntu.thread_3/44/testReport/%28root%29/deploy_stop_reset_on_ha/deploy_stop_reset_on_ha/

http://jenkins-product.srt.mirantis.net:8080/view/0_0_swarm/job/master_fuelmain.system_test.ubuntu.thread_2/50/testReport/%28root%29/deploy_flat_stop_reset_on_provisioning/deploy_flat_stop_reset_on_provisioning/

Tests deploy_flat_stop_reset_on_provisioning and deploy_stop_reset_on_ha failed while waiting for nodes to get online status after stopping deployment

Error Message

Waiting timed out

Stacktrace

Traceback (most recent call last):
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "/usr/lib/python2.7/unittest/case.py", line 1044, in runTest
    self._testFunc()
  File "/home/jenkins/venv-nailgun-tests/local/lib/python2.7/site-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/home/jenkins/venv-nailgun-tests/local/lib/python2.7/site-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/home/jenkins/venv-nailgun-tests/local/lib/python2.7/site-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/jenkins/workspace/master_fuelmain.system_test.ubuntu.thread_3/fuelweb_test/helpers/decorators.py", line 49, in wrapper
    return func(*args, **kwagrs)
  File "/home/jenkins/workspace/master_fuelmain.system_test.ubuntu.thread_3/fuelweb_test/tests/test_environment_action.py", line 243, in deploy_stop_reset_on_ha
    self.fuel_web.wait_nodes_get_online_state(self.env.nodes().slaves[:3])
  File "/home/jenkins/workspace/master_fuelmain.system_test.ubuntu.thread_3/fuelweb_test/__init__.py", line 48, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/master_fuelmain.system_test.ubuntu.thread_3/fuelweb_test/models/fuel_web_client.py", line 862, in wait_nodes_get_online_state
    timeout=60 * 4)
  File "/home/jenkins/venv-nailgun-tests/local/lib/python2.7/site-packages/devops/helpers/helpers.py", line 95, in wait
    raise TimeoutError("Waiting timed out")
TimeoutError: Waiting timed out

Tags: system-tests
Changed in fuel:
milestone: 5.0 → 5.1
Changed in fuel:
status: New → Confirmed
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Can't we simply increase the timeout to make it pass? As far as I know, this passes fine manually.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Yes, we can, and it seems we should :)

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

It seems that the timeout is caused by this bug - https://bugs.launchpad.net/fuel/+bug/1316554 .
I tried it on ISO #214 - nodes got stuck in the provisioning state after stop and didn't reboot; their status went to offline, so waiting for them to come online timed out.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

api: '1.0'
astute_sha: a6f99f830ad356def866ded2e44f0f899e80ca24
build_id: 2014-06-25_14-51-54
build_number: '69'
fuellib_sha: 1cf02c3a06edaae9278129f5a534d4c9d0c33784
fuelmain_sha: ae0fd7b9f7db7ee974b0232f3d1c66d6c6f3cc49
mirantis: 'yes'
nailgun_sha: 3c3298ededbf876f08fa3937e434516f3e874c40
ostf_sha: 265eb80beec1ad53f58ae04b78e5755937752a50
production: docker
release: 5.0.1

This issue still reproduces on the 5.0.1 ISO and is caused by the 'unaccessible' error:

https://bugs.launchpad.net/fuel/+bug/1316583/comments/17

We need to backport the following two changes to stable/5.0 to fix it:

https://review.openstack.org/#/c/96116/
https://review.openstack.org/#/c/96488/

Dmitry Ilyin (idv1985)
summary: - [System tests] Node failed to get online state within allocated timeout
- after stopping deployment
+ [systest] Node failed to get online state within allocated timeout after
+ stopping deployment
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote : Re: [systest] Node failed to get online state within allocated timeout after stopping deployment

http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.centos.thread_2/124/testReport/%28root%29/deploy_flat_stop_on_deploying/deploy_flat_stop_on_deploying/

After stopping deployment, one of the nodes went offline and then deployment failed
Logs are attached
The issue needs to be reproduced because a snapshot wasn't created during the test

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Andrey, is the last snapshot enough, or are we waiting for the issue to be reproduced?

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

The issue wasn't reproduced on today's run. I think we need an environment with the issue reproduced

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Reproduced the issue locally - after stopping deployment the controller came back online, but after running deployment again it failed to reboot, and deployment failed with a provisioning timeout
Logs are attached

Changed in fuel:
status: Incomplete → Confirmed
assignee: Fuel QA Team (fuel-qa) → Vladimir Sharshov (vsharshov)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Reason: node-1 was not erased and rebooted. No errors in the logs. No filesystem mounted read-only. For some reason mcollec…

Good answers (when we start a deploy: the first attempt without a reboot, then with one):

MC agent 'erase_node', method 'erase_node', results: {:sender=>"1", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>0, :erased=>nil, :rebooted=>false, :debug_msg=>["Create file for discovery preventing /var/run/nodiscover"], :error_msg=>nil}}

MC agent 'erase_node', method 'erase_node', results: {:sender=>"1", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>0, :debug_msg=>["Create file for discovery preventing /var/run/nodiscover"], :error_msg=>nil, :erased=>nil, :rebooted=>true}}

Bad answer (though it looks correct):

MC agent 'erase_node', method 'erase_node', results: {:sender=>"1", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>0, :erased=>nil, :rebooted=>false, :debug_msg=>["Create file for discovery preventing /var/run/nodiscover"], :error_msg=>nil}}

We erase the node 3 times, and the last or next-to-last attempt failed. Maybe the reboot mechanism does not work for some reason, but this code is also used when we remove or reset an environment.

And Cobbler says that the node was rebooted.

2014-07-29 16:43:07 DEBUG [394] Successfully rebooted: node-1
2014-07-29 16:43:07 DEBUG [394] Reboot task status: node: node-1 status: [1406648575.853949, "Power management (reboot)", "complete", []]
2014-07-29 16:43:02 DEBUG [394] Reboot task status: node: node-1 status: [1406648575.853949, "Power management (reboot)", "running", []]
2014-07-29 16:42:57 DEBUG [394] Reboot task status: node: node-1 status: [1406648575.853949, "Power management (reboot)", "running", []]

Also, the file /var/run/nodiscover was created:

[root@bootstrap ~]# ls -all /var/run/nodiscover
-rw-r--r-- 1 root root 0 2014-07-29 15:42 /var/run/nodiscover

We have a strange moment with time (a 1-hour difference), but no errors. If we reboot the nodes by hand, they are provisioned without problems. I suppose Cobbler could not successfully reboot the node for some reason.
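
For illustration, a minimal Ruby sketch of the distinction drawn above (the helper name and standalone structure are assumptions, not the actual Astute code): across the erase_node attempts for one node, the "good" sequence eventually reports :rebooted => true, while the "bad" one never does.

```ruby
# Hypothetical helper: decide whether a node ever confirmed a reboot
# across all 'erase_node' agent responses collected for it. The response
# shape follows the log excerpts above.
def rebooted?(attempts)
  attempts.any? { |r| r[:statuscode] == 0 && r[:data][:rebooted] }
end

good = [
  { :sender => "1", :statuscode => 0, :data => { :rebooted => false } },
  { :sender => "1", :statuscode => 0, :data => { :rebooted => true } },
]
bad = [
  { :sender => "1", :statuscode => 0, :data => { :rebooted => false } },
]
puts rebooted?(good)  # => true
puts rebooted?(bad)   # => false (node silently stayed up)
```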

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

The issue is reproduced on the latest system tests run
Logs are attached

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Potentially we are hitting this Cobbler bug: https://github.com/cobbler/cobbler/issues/426. If we comment out the line ```utils.die(self.logger,"command succeeded (rc=%s), but output ('%s') was not understood" % (rc, output))``` in action_power.py, the problem goes away (it helped me once), but I then restored the original code, repeated the command with it, and it also always worked fine.

As a result, restarting the Cobbler container solved the problem, but did not give any additional information about the error.

I can add an insurance reboot via MCollective, which has helped with nodes that report a successful reboot but actually stay in the bootstrap stage.
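
A minimal sketch of that insurance reboot, assuming the standard MCollective RPC client is available on the master, and assuming the Fuel 'erase_node' agent accepts a :reboot option (its :rebooted result field suggests it does); this is not the actual Astute patch:

```ruby
require 'mcollective'
include MCollective::RPC

# Re-run erase_node with a reboot for one node and report the uids that
# still did not confirm the reboot.
def insurance_reboot(node_uid)
  mc = rpcclient('erase_node')
  mc.identity_filter(node_uid)              # target a single node by uid
  results = mc.erase_node(:reboot => true)  # assumed agent option
  mc.disconnect
  results.reject { |r| r[:data][:rebooted] }.map { |r| r[:sender] }
end

still_stuck = insurance_reboot('1')
puts "needs manual attention: #{still_stuck.inspect}" unless still_stuck.empty?
```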

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/111965

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote : Re: [systest] Node failed to get online state within allocated timeout after stopping deployment

Setting Medium priority, because the issue affects only a standalone 5.0.2 installation. A 5.1 master node with 5.0.2 environments is not affected.

Revision history for this message
Anastasia Palkina (apalkina) wrote :

Reproduced on ISO #448
"build_id": "2014-08-18_02-01-17",
"ostf_sha": "d2a894d228c1f3c22595a77f04b1e00d09d8e463",
"build_number": "448",
"auth_required": true,
"api": "1.0",
"nailgun_sha": "bc9e377dbe010732bc2ba47161ed9d433998e07b",
"production": "docker",
"fuelmain_sha": "08f04775dcfadd8f5b438a31c63e81f29276b7d3",
"astute_sha": "8e1db3926b2320b30b23d7a772122521b0d96166",
"feature_groups": ["mirantis", "experimental"],
"release": "5.1",
"fuellib_sha": "2c9ad4aec9f3b6fc060cb5a394733607f07063c1"

1. Create a new environment (Ubuntu, HA mode)
2. Choose VLAN segmentation
3. Choose Ceph for both volumes and images
4. Add 3 controller+ceph nodes and 1 compute
5. Start deployment
6. Provisioning didn't start on the 3rd controller

2014-08-18 12:49:56 DEBUG [434] Got node types: uid=9 type=target
2014-08-18 12:49:56 DEBUG [434] Got node types: uid=8 type=bootstrap
2014-08-18 12:49:56 DEBUG [434] Got node types: uid=7 type=target
2014-08-18 12:49:56 DEBUG [434] Got node types: uid=6 type=target

Manually rebooted the bootstrap controller; provisioning then started and finished.

Revision history for this message
Kirill Omelchenko (komelchenko) wrote :

One more reproduction on system tests:
http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.ubuntu.thread_2/153/testReport/?

Scenario:
CentOS, Simple
1. Add 1x controller, 1x compute
2. Deploy
3. Wait for the progress to reach 10%
4. Stop deployment, wait for nodes to become ready for deployment
5. Add 1x cinder node
6. Deploy

Expected:
Nodes get provisioned and the cluster is set up.

Actual:
Controller node is offline in the UI and CLI:
[root@nailgun ~]# fuel node
id | status | name | cluster | ip | mac | roles | pending_roles | online
---|-------------|---------------------|---------|---------------|-------------------|------------|---------------|-------
1 | provisioned | slave-01_controller | 1 | 10.108.30.166 | 64:ca:a9:60:57:43 | controller | | False
2 | provisioned | slave-02_compute | 1 | 10.108.30.4 | 64:92:27:73:36:d4 | compute | | True
3 | provisioned | slave-03_cinder | 1 | 10.108.30.5 | 64:d7:1b:35:ef:98 | cinder | | True
[root@nailgun ~]# ping node-1
PING node-1.test.domain.local (10.108.30.3) 56(84) bytes of data.
64 bytes from node-1.test.domain.local (10.108.30.3): icmp_seq=1 ttl=64 time=0.308 ms
64 bytes from node-1.test.domain.local (10.108.30.3): icmp_seq=2 ttl=64 time=0.531 ms

In fact, the node is in the bootstrap state.
astute.log contains these records: http://paste.openstack.org/show/102416/
I assume they indicate that the node either hasn't been rebooted or has been marked as bootstrap.
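
A quick hedged check for this mismatch from the master node (assumption: Fuel nodes record their current role in /etc/nailgun_systemtype, the file behind the "Got node types" log lines seen earlier):

```ruby
# Ask the node itself what it thinks it is. A node that Nailgun lists
# as 'provisioned' but that answers 'bootstrap' here is exactly the
# stuck case described in this comment.
node = 'node-1'
systemtype = `ssh #{node} cat /etc/nailgun_systemtype 2>/dev/null`.strip
puts "#{node} is actually: #{systemtype}"  # expected 'target'; stuck => 'bootstrap'
```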

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/111965
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=22f17f657027153cdd9eba176c0420c83818eb35
Submitter: Jenkins
Branch: master

commit 22f17f657027153cdd9eba176c0420c83818eb35
Author: Vladimir Sharshov <email address hidden>
Date: Tue Aug 5 13:33:27 2014 +0400

    Prevent problem with nodes reboot using Cobbler

    Sometimes Cobbler returns a success status for a reboot
    operation that actually failed. We additionally try to
    reboot these nodes using SSH.

    Change-Id: Idba1e1734f8446381acedef42a93e88fe4248d63
    Closes-Bug: #1318567
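
A sketch of the approach the commit message describes (assumed shape, not the merged code): after Cobbler claims a successful reboot, force an explicit reboot over SSH for any node that still looks alive in its old state.

```ruby
# Fallback reboot over SSH for nodes Cobbler failed to actually reboot.
def reboot_via_ssh(nodes)
  nodes.each do |node|
    # 'reboot' returns immediately; the SSH connection dropping
    # afterwards is expected and not an error.
    system("ssh -o ConnectTimeout=5 root@#{node} reboot")
  end
end

# Usage: Cobbler logged "Successfully rebooted: node-1", yet node-1 kept
# answering from its bootstrap system, so we force the reboot ourselves.
reboot_via_ssh(['node-1'])
```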

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote : Re: [systest] Node failed to get online state within allocated timeout after stopping deployment

{"build_id": "2014-09-12_00-01-11", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "4", "auth_required": true, "api": "1.0", "nailgun_sha": "d389bc6489fe296c9c210f7c65ac84e154a8b82b", "production": "docker", "fuelmain_sha": "d899675a5a393625f8166b29099d26f45d527035", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["experimental"], "release": "5.1", "release_versions": {"2014.1.1-5.1": {"VERSION": {"build_id": "2014-09-12_00-01-11", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "4", "api": "1.0", "nailgun_sha": "d389bc6489fe296c9c210f7c65ac84e154a8b82b", "production": "docker", "fuelmain_sha": "d899675a5a393625f8166b29099d26f45d527035", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["experimental"], "release": "5.1", "fuellib_sha": "395fd9d20a003603cc9ad26e16cb13c1c45e24e6"}}}, "fuellib_sha": "395fd9d20a003603cc9ad26e16cb13c1c45e24e6"}

Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote : Re: [systest] Node failed to reboot after reset environment

Bug was reproduced on CI
Logs are attached
Steps:
1. Create cluster - Centos, simple, Flat nova network, 1 controller, 1 compute
2. Deploy cluster
3. Reset cluster after deployment
4. Wait for nodes to get online state
5. Deploy cluster again

Actual result - controller failed to reboot

summary: - [systest] Node failed to get online state within allocated timeout after
- stopping deployment
+ [systest] Node failed to reboot after reset environment
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
summary: - [systest] Node failed to reboot after reset environment
+ [systest] Node failed to reboot after reset environment and then deploy
Revision history for this message
Vladimir Sharshov (vsharshov) wrote : Re: [systest] Node failed to reboot after reset environment and then deploy

If the problem reproduces tomorrow, I suggest delegating the reboot command from Cobbler to MClient, which we use when erasing/resetting a cluster.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

The issue was reproduced on stop deployment - after stopping deployment the controller came back online, but after running deployment again it failed to reboot, and deployment failed with a provisioning timeout
Logs are attached

Errors on the controller in mcollective:
2014-10-27 10:55:07 ERR 10:55:07.471470 #1192] ERROR -- : rabbitmq.rb:30:in `on_miscerr' Unexpected error on connection stomp://mcollective@10.108.0.2:61613: es_oldrecv: receive failed: stream closed
2014-10-27 10:55:07 ERR 10:55:07.471272 #1192] ERROR -- : rabbitmq.rb:50:in `on_hbread_fail' Heartbeat read failed from 'stomp://mcollective@10.108.0.2:61613': {"lock_fail"=>true, "lock_fail_count"=>2, "read_fail_count"=>0, "ticker_interval"=>29.5}
2014-10-27 10:54:38 ERR 10:54:37.967914 #1192] ERROR -- : rabbitmq.rb:50:in `on_hbread_fail' Heartbeat read failed from 'stomp://mcollective@10.108.0.2:61613': {"lock_fail"=>true, "lock_fail_count"=>1, "read_fail_count"=>0, "ticker_interval"=>29.5}

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Also reproduced on 5.1.1-7 ISO
Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
summary: - [systest] Node failed to reboot after reset environment and then deploy
+ Node failed to reboot after reset environment and then deploy
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/136728

Revision history for this message
Evgeniy L (rustyrobot) wrote :

I'm not sure we should backport it to 5.1.1 because, AFAIK, it happens rarely and only when the user resets an env and then tries to deploy it again.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/136728
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=122cdaab63357dc2138f01c89637c4188ad831ed
Submitter: Jenkins
Branch: master

commit 122cdaab63357dc2138f01c89637c4188ad831ed
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Mon Nov 24 12:53:00 2014 +0300

    Repeated erase and reboot for bootstrap nodes

    If Astute gets an unexpected 'bootstrap' status
    from a node in the middle of a provisioning operation,
    we try to erase and reboot the node. In most
    environments where this bug was reproduced,
    this operation was enough to solve the problem.

    If, for some reason, this does not help, we
    can with great confidence look for the
    problem in Cobbler.

    Change-Id: I9ffaa885bd150a6fa119f07bc04b79bd5afb92a1
    Closes-Bug: #1318567
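
A sketch of the retry logic this commit describes (structure and names are illustrative, not the merged Astute code). The two callbacks stand in for the real MCollective calls: one re-runs the erase_node agent with a reboot, the other re-reads the node type.

```ruby
# If a node reports type 'bootstrap' mid-provisioning, erase and reboot
# it again instead of failing outright; give up after a few attempts.
def retry_bootstrap_node(node_uid, erase_and_reboot:, node_type:, attempts: 2)
  attempts.times do
    erase_and_reboot.call(node_uid)
    return true if node_type.call(node_uid) == 'target'
  end
  false  # still in bootstrap: per the commit, suspect Cobbler itself
end

# Toy usage with stubbed callbacks: the node becomes 'target' after the
# second forced erase/reboot.
types = { '8' => ['bootstrap', 'target'] }
ok = retry_bootstrap_node('8',
  erase_and_reboot: ->(uid) { puts "erase_node + reboot for uid=#{uid}" },
  node_type:        ->(uid) { types[uid].shift })
puts ok  # => true
```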

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/5.1)

Fix proposed to branch: stable/5.1
Review: https://review.openstack.org/137365

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/5.1)

Reviewed: https://review.openstack.org/137365
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=1aa00584e0785a94ddc5d99c88ac45d02c4bbef1
Submitter: Jenkins
Branch: stable/5.1

commit 1aa00584e0785a94ddc5d99c88ac45d02c4bbef1
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Mon Nov 24 12:53:00 2014 +0300

    Repeated erase and reboot for bootstrap nodes

    If Astute gets an unexpected 'bootstrap' status
    from a node in the middle of a provisioning operation,
    we try to erase and reboot the node. In most
    environments where this bug was reproduced,
    this operation was enough to solve the problem.

    If, for some reason, this does not help, we
    can with great confidence look for the
    problem in Cobbler.

    Change-Id: I9ffaa885bd150a6fa119f07bc04b79bd5afb92a1
    Closes-Bug: #1318567
    (cherry picked from commit 122cdaab63357dc2138f01c89637c4188ad831ed)

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Released for 6.0:
{"build_id": "2014-12-02_22-41-00", "ostf_sha": "7e79964ddb5092fc4568c6fb08a348bb326df2a8", "build_number": "35", "auth_required": true, "api": "1.0", "nailgun_sha": "f71d8437783a4522aef4ff5a02393840b2a4a33f", "production": "docker", "fuelmain_sha": "b0f2f749ac2c1de3472f8ddeb3a0105798ca5837", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-12-02_22-41-00", "ostf_sha": "7e79964ddb5092fc4568c6fb08a348bb326df2a8", "build_number": "35", "api": "1.0", "nailgun_sha": "f71d8437783a4522aef4ff5a02393840b2a4a33f", "production": "docker", "fuelmain_sha": "b0f2f749ac2c1de3472f8ddeb3a0105798ca5837", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "adeaf01b19e37e357ef9103113927945a8034ccf"}}}, "fuellib_sha": "adeaf01b19e37e357ef9103113927945a8034ccf"}

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Released for 5.1.1:
{"build_id": "2014-11-27_23-41-13", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "45", "auth_required": true, "api": "1.0", "nailgun_sha": "500e36d08a45dbb389bf2bd97673d9bff48ee84d", "production": "docker", "fuelmain_sha": "51e66db7750e9c856ba128f35cfb6724895bf479", "astute_sha": "ef8aa0fd0e3ce20709612906f1f0551b5682a6ce", "feature_groups": ["mirantis"], "release": "5.1.1", "release_versions": {"2014.1.3-5.1.1": {"VERSION": {"build_id": "2014-11-27_23-41-13", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "45", "api": "1.0", "nailgun_sha": "500e36d08a45dbb389bf2bd97673d9bff48ee84d", "production": "docker", "fuelmain_sha": "51e66db7750e9c856ba128f35cfb6724895bf479", "astute_sha": "ef8aa0fd0e3ce20709612906f1f0551b5682a6ce", "feature_groups": ["mirantis"], "release": "5.1.1", "fuellib_sha": "15a387462f7be50c4f87ad986d0c81535025c125"}}}, "fuellib_sha": "15a387462f7be50c4f87ad986d0c81535025c125"}
