one node is in provisioning state during deployment task

Bug #1496046 reported by Leontii Istomin
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: Medium
Assigned to: Georgy Kibardin

Bug Description

During deployment we found that one of the compute nodes was still in the provisioning state while the deployment step was running:
[root@fuel ~]# fuel node 2>/dev/null | grep provisioning
347 | provisioning | compute_155 | 2 | 10.20.0.132 | 0c:c4:7a:1d:f1:d8 | compute | | True | 2
[root@fuel ~]# fuel task 2>/dev/null
id | status | name | cluster | progress | uuid
---|---------|----------------|---------|----------|-------------------------------------
51 | running | dump | None | 0 | 6ac4be7d-1243-45b2-8669-34399e9be961
46 | ready | check_networks | 2 | 100 | c97c65df-709e-4941-a7b8-bbf749f7da29
49 | ready | provision | 2 | 100 | 391b4a8c-08f2-4a49-8af0-647c0cf09af2
50 | error | deployment | 2 | None | 80a0cc1d-3b81-4675-aa4b-b080f542a1fd
45 | error | deploy | 2 | 100 | 6653f807-4bac-4682-b173-50630e6cb3b7

Deployment failed due to the following bug: https://bugs.launchpad.net/fuel/+bug/1496036

Cluster configuration:
Baremetal,Ubuntu,IBP,HA,Neutron-vxlan,DVR,Ceph-all,Nova-debug,Nova-quotas,7.0-293
Controllers: 3, Computes: 180, Computes+Ceph: 20

api: '1.0'
astute_sha: a717657232721a7fafc67ff5e1c696c9dbeb0b95
auth_required: true
build_id: '293'
build_number: '293'
feature_groups:
- mirantis
fuel-agent_sha: 082a47bf014002e515001be05f99040437281a2d
fuel-library_sha: 0c5a39a43bffaddc91b46179d16e1bb2dd85bc0c
fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
fuel-ostf_sha: 1f08e6e71021179b9881a824d9c999957fcc7045
fuelmain_sha: 6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85
nailgun_sha: 16a39d40120dd4257698795f12de4ae8200b1778
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 2864459e27b0510a0f7aedac6cdf27901ef5c481
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-09-15_15-38-24.tar.xz

description: updated
Revision history for this message
Andrey Maximov (maximov) wrote :

Is it possible to redeploy this node, or is it stalled in the "provisioning" state forever? I need this to set the priority of this bug correctly.

Revision history for this message
Leontii Istomin (listomin) wrote :

It was in the provisioning state after the deployment failure; then I performed "deploy changes" again and the node was still in the provisioning state. I didn't try to reset or delete the environment.

Revision history for this message
Alexander Gordeev (a-gordeev) wrote :

According to the bootstrap logs from node-347, this node was provisioned without any flaws and then rebooted. After the reboot, nailgun-agent also succeeded in reporting the 'target' system type from that node.

2015-09-15T14:49:19 debug: [579] Not provisioned: , got target OSes: 203,204,205,206,207,209,210,208,211,215,214,216,218,213,217,212,219,225,223,221,230,234,227,229,231,224,220,235,245,246,242,238,251,240,241,237,243,252,226,249,222,236,250,239,232,253,233,247,255,256,259,254,244,248,258,228,257,260,279,272,263,270,267,283,264,271,281,269,287,265,280,277,266,262,284,294,274,286,275,261,293,295,289,276,300,268,282,303,285,304,301,273,288,292,305,278,290,296,291,297,298,299,302,308,306,307,309,310,311,312,316,313,314,315,317,318,319,324,322,323,325,320,326,321,328,327,329,336,337,342,340,343,330,331,332,334,354,351,341,335,346,344,345,339,349,338,333,353,352,357,358,348,356,373,370,360,355,359,347,364,350,404,387,367,386,376,371,362,379,361,365,369,363,390,388,368,377,383,382,389,400,393,401,381,402,392,399,391,372,378,394,397,405,396,398,374,385,395,375,380,366,384,403
2015-09-15T14:49:19 info: [579] Provisioning finished

So, provisioning itself was not affected in any way; something went wrong in the nailgun tasks/state machine.

Revision history for this message
Vladimir Kozhukalov (kozhukalov) wrote :

From what I can see in the logs, node-347 (compute_155, per the 'fuel node' output) was in fact successfully provisioned. Astute (astute.log) sent a success status message to nailgun, and nailgun received this message (receiverd.log.1, 2015-09-15 14:38:08.104), but for some reason it did not change the node status to 'deploy' (app.log, 2015-09-15 14:38:08.445), while for other nodes in the same message the status was changed successfully (for example node-340, controller_2).

Revision history for this message
Vladimir Kozhukalov (kozhukalov) wrote :

It looks like there was some kind of race condition that prevented nailgun from updating the status for the node. Changing priority to medium and milestone to 8.0.

Changed in fuel:
status: New → Triaged
status: Triaged → Confirmed
importance: Undecided → Medium
milestone: none → 8.0
assignee: nobody → Fuel Python Team (fuel-python)
tags: added: tricky
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Revision history for this message
Alexander Kislitsky (akislitsky) wrote :

We passed SCF (soft code freeze) in 8.0. Moving the bug to 9.0.

Changed in fuel:
milestone: 8.0 → 9.0
Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check was performed automatically.)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Tricky bug with low reproducibility. Moving to 10.0.

Changed in fuel:
milestone: 9.0 → 10.0
Changed in fuel:
assignee: Fuel Python (Deprecated) (fuel-python) → Georgy Kibardin (gkibardin)
Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Here is what we have in receiverd.log:
2015-09-15 14:38:08.104 INFO [7f59e463b700] (receiver) RPC method provision_resp received: {"task_uuid": "391b4a8c-08f2-4a49-8af0-647c0cf09af2", "nodes": [{"status": "provisioned", "progress": 100, "uid": "340"}, {"status": "provisioned", "progress": 100, "uid": "345"}, {"status": "provisioned", "progress": 100, "uid": "346"}, {"status": "provisioned", "progress": 100, "uid": "347"}, {"status": "provisioned", "progress": 100, "uid": "349"}, {"status": "provisioned", "progress": 100, "uid": "350"}, {"status": "provisioned", "progress": 100, "uid": "352"}, {"status": "provisioned", "progress": 100, "uid": "353"}, {"status": "provisioned", "progress": 100, "uid": "355"}, {"status": "provisioned", "progress": 100, "uid": "356"}, {"status": "provisioned", "progress": 100, "uid": "358"}, {"status": "provisioning", "progress": 100, "uid": "381"}, {"status": "provisioning", "progress": 100, "uid": "382"}, {"status": "provisioning", "progress": 100, "uid": "383"}, {"status": "provisioning", "progress": 100, "uid": "384"}, {"status": "provisioning", "progress": 100, "uid": "389"}, {"status": "provisioning", "progress": 100, "uid": "390"}, {"status": "provisioning", "progress": 100, "uid": "391"}]}
2015-09-15 14:38:08.434 DEBUG [7f59e463b700] (task) Updating task: 391b4a8c-08f2-4a49-8af0-647c0cf09af2
2015-09-15 14:38:08.436 DEBUG [7f59e463b700] (task) Updating cluster status: 391b4a8c-08f2-4a49-8af0-647c0cf09af2 cluster_id: 2 status: None
2015-09-15 14:38:08.441 DEBUG [7f59e463b700] (task) Updating parent task: 6653f807-4bac-4682-b173-50630e6cb3b7.

All these nodes are locked for update and then their status is updated; however, this update did not make it into the DB.
At the same time, in app.log:

2015-09-15 14:38:08.445 DEBUG [7f2b9da9b740] (node) Node compute_155 has provisioning or error status - status not updated by agent
2015-09-15 14:38:08.445 INFO [7f2b9da9b740] (node) Volume information is locked for update on node compute_155
2015-09-15 14:38:08.455 INFO [7f2b9da9b740] (node) Interfaces are locked for update on node compute_155

Here is the code that handles this:

        # don't update provisioning and error back to discover
        if instance.status in ('provisioning', 'error'):
            if data.get('status', 'discover') == 'discover':
                logger.debug(
                    u"Node {0} has provisioning or error status - "
                    u"status not updated by agent".format(
                        instance.human_readable_name
                    )
                )

                # NOTE: instance.status read here may be stale - the
                # receiverd transaction marking the node 'provisioned'
                # has not committed yet (see the analysis below).
                data['status'] = instance.status

At the point the update from the agent arrived, instance.status was 'provisioning'; the receiverd transaction had not yet completed at that time. Then

data['status'] = instance.status

happened. Then the receiverd transaction completed. And then the node update carrying the stale status was written.

This is the most likely sequence of events.
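
To make the interleaving concrete, here is a minimal, self-contained sketch (plain Python, not nailgun code; all names here are invented for illustration) that forces the ordering described above and shows the agent's write-back clobbering the receiver's commit:

import threading

node = {"status": "provisioning"}   # stands in for the node row in the DB
agent_read_stale = threading.Event()
receiver_committed = threading.Event()

def receiver_transaction():
    # provision_resp handler: marks the node 'provisioned' ...
    pending_status = "provisioned"
    agent_read_stale.wait()           # ... but the commit lands after the agent read
    node["status"] = pending_status   # the receiverd transaction commits here
    receiver_committed.set()

def agent_update():
    data = {"status": "discover"}     # the agent always reports 'discover'
    # The buggy check: echo the current (stale) status back into the payload.
    if node["status"] in ("provisioning", "error"):
        if data.get("status", "discover") == "discover":
            data["status"] = node["status"]   # captures stale 'provisioning'
    agent_read_stale.set()
    receiver_committed.wait()         # the receiver commits first ...
    node["status"] = data["status"]   # ... and this write clobbers it

receiver = threading.Thread(target=receiver_transaction)
receiver.start()
agent_update()
receiver.join()
print(node["status"])                 # 'provisioning' - the 'provisioned' update is lost

Running the sketch prints 'provisioning': the 'provisioned' status committed by the receiver is silently overwritten, which matches the state node-347 was stuck in.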

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

The fix is to replace

data['status'] = instance.status

with

data.pop('status', None)

This is done here:

Reviewed: https://review.openstack.org/235264
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=477f52f2e5f2d7fcc26e19ec6f380d284ce87c2c
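
For reference, with that one-line fix applied the check would read as follows (a simplified sketch based on the snippet quoted above, not the verbatim commit):

        # don't update provisioning and error back to discover
        if instance.status in ('provisioning', 'error'):
            if data.get('status', 'discover') == 'discover':
                logger.debug(
                    u"Node {0} has provisioning or error status - "
                    u"status not updated by agent".format(
                        instance.human_readable_name
                    )
                )
                # Drop the key entirely: the agent update then leaves the
                # status column alone, so a stale value can never be
                # written back over a concurrent receiverd commit.
                data.pop('status', None)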

Changed in fuel:
status: Confirmed → Fix Committed