Configuration Updates which used ExternalResource fail

Bug #1709682 reported by John Fulton
Affects: OpenStack Heat
Status: Fix Released
Importance: High
Assigned to: Rabi Mishra

Bug Description

After a successful deploy using ceph-ansible as described in the new docs [1], re-running the same deploy command [2], as users do to make changes to the overcloud, results in the following error:

overcloud.AllNodesDeploySteps:
  resource_type: OS::TripleO::PostDeploySteps
  physical_resource_id: 32d9e831-d45a-4bd8-87d3-7061e70c183f
  status: UPDATE_FAILED
  status_reason: |
    resources.AllNodesDeploySteps: ValueError: resources.WorkflowTasks_Step2_Execution: Property actions not assigned

[1] https://review.openstack.org/487155
[2] The deploy command used:
time openstack overcloud deploy --templates ~/templates \
-e ~/templates/environments/docker.yaml \
-e ~/templates/environments/ceph-ansible/ceph-ansible.yaml \
-e ~/templates/environments/low-memory-usage.yaml \
-e ~/templates/environments/disable-telemetry.yaml \
-e ~/templates/environments/docker-centos-tripleoupstream.yaml \
-e ~/tripleo-ceph-ansible/tht/overcloud-ceph-ansible.yaml

Changed in tripleo:
importance: High → Critical
importance: Critical → High
summary: - Upgrades which used ceph-ansible workflow fail
+ Configuration Updates which used ceph-ansible workflow fail
Revision history for this message
John Fulton (jfulton-org) wrote : Re: Configuration Updates which used ceph-ansible workflow fail

Heat logs on undercloud where problem happened

Changed in tripleo:
assignee: Giulio Fidente (gfidente) → nobody
affects: tripleo → heat
Changed in heat:
milestone: pike-rc1 → none
Revision history for this message
Giulio Fidente (gfidente) wrote :

Thomas, this was triggered when updating an ExternalResource that was previously in a FAILED state. From the logs, it seems to be an issue with the deletion of the backup stack; it doesn't happen when updating a resource in a clean state.

Can you help us find the root cause?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/496216

Changed in heat:
assignee: nobody → Rabi Mishra (rabi)
status: Triaged → In Progress
Rabi Mishra (rabi)
Changed in heat:
milestone: none → queens-1
Revision history for this message
John Fulton (jfulton-org) wrote : Re: Configuration Updates which used ceph-ansible workflow fail

The proposed fix resolved my issue: https://review.openstack.org/#/c/496216

Details:
I set up a deployment with ceph-ansible that I knew would fail, because
I gave it block devices which do not exist and because the journal count
did not line up with the actual requested OSDs:

  CephAnsibleDisksConfig:
    devices:
      - /dev/vde
      - /dev/vdf
      - /dev/vdg
    raw_journal_devices:
      - /dev/vdd
      - /dev/vdd
    journal_size: 256 # vdd is 1024M
    journal_collocation: false
    raw_multi_journal: true
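
The journal mismatch can be checked mechanically. The helper below is a hypothetical sketch (not part of tripleo or ceph-ansible), assuming the constraint described above: with raw_multi_journal, each OSD device needs a corresponding raw_journal_devices entry (entries may repeat the same disk):

```python
def check_disks(devices, raw_journal_devices):
    """Return a list of problems with a CephAnsibleDisksConfig-style layout."""
    problems = []
    # One journal entry per OSD device is expected with raw_multi_journal.
    if len(raw_journal_devices) != len(devices):
        problems.append(
            "journal count %d does not line up with %d requested OSDs"
            % (len(raw_journal_devices), len(devices)))
    return problems

# The failing config above: three OSD devices but only two journal entries.
print(check_disks(['/dev/vde', '/dev/vdf', '/dev/vdg'],
                  ['/dev/vdd', '/dev/vdd']))
```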

I saw the failure.

overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::Mistral::ExternalResource
  physical_resource_id: 389fc7b1-6d9b-4daa-bfee-552f13d496b5
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR
Heat Stack create failed.
Heat Stack create failed.

real 25m46.850s
user 0m4.801s
sys 0m0.400s
(undercloud) [stack@undercloud tripleo-ceph-ansible]$

I then restored the Disk config to something that should work
in my environment:

  CephAnsibleDisksConfig:
    devices:
      - /dev/vdb
      - /dev/vdc
    raw_journal_devices:
      - /dev/vdd
      - /dev/vdd
    journal_size: 256 # vdd is 1024M
    journal_collocation: false
    raw_multi_journal: true

I then re-ran the same deployment command.

I saw that it started the Mistral -> Ansible execution without error:

2017-08-24 16:10:50Z [overcloud-AllNodesDeploySteps-2lll4xrc7jrr.ControllerDeployment_Step1]: UPDATE_COMPLETE state changed
2017-08-24 16:10:50Z [overcloud-AllNodesDeploySteps-2lll4xrc7jrr.WorkflowTasks_Step2_Execution]: UPDATE_IN_PROGRESS state changed
2017-08-24 16:10:51Z [overcloud-AllNodesDeploySteps-2lll4xrc7jrr.WorkflowTasks_Step2_Execution]: UPDATE_COMPLETE The Resource WorkflowTasks_Step2_Execution requires replacement.
2017-08-24 16:10:51Z [overcloud-AllNodesDeploySteps-2lll4xrc7jrr.WorkflowTasks_Step2_Execution]: CREATE_IN_PROGRESS state changed
...
2017-08-24 16:28:10Z [AllNodesDeploySteps]: UPDATE_COMPLETE state changed
2017-08-24 16:28:26Z [overcloud]: UPDATE_COMPLETE Stack UPDATE completed successfully

 Stack overcloud UPDATE_COMPLETE

Overcloud Endpoint: http://192.168.24.9:5000/v2.0
Overcloud Deployed

real 30m33.691s
user 0m6.225s
sys 0m0.463s
(undercloud) [stack@undercloud tripleo-ceph-ansible]$

and the Ceph cluster was fixed as well:

[root@overcloud-controller-0 ~]# ceph -s
    cluster 3d2ddd5a-88e1-11e7-8968-00979f13efb1
     health HEALTH_OK
     monmap e1: 1 mons at {overcloud-controller-0=192.168.24.15:6789/0}
            election epoch 3, quorum 0 overcloud-controller-0
      fsmap e5: 1/1/1 up {0=overcloud-controller-0=up:active}
     osdmap e18: 6 osds: 6 up, 6 in
            flags sortbitwise,require_jewel_osds
      pgmap v38: 240 pgs, 8 pools, 2068 bytes data, 20 objects
            199 MB used, 4382 MB / 4581 MB avail
                 240 active+clean
[root@overcloud-controller-0 ~]#

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/496216
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=f849b4deb277e2511cf4547fc26f1e849c987954
Submitter: Jenkins
Branch: master

commit f849b4deb277e2511cf4547fc26f1e849c987954
Author: rabi <email address hidden>
Date: Tue Aug 22 17:34:50 2017 +0530

    Set resource._properties_data=None when loading from db

    In I462ce7161497306483286b78416f9037ac80d6fa we changed to use the
    frozen_definition properties for delete. However, when deleting a
    resource from the backup stack, where the resource is in INIT_COMPLETE,
    setting the _stored_properties_data (_properties_data) to {} when
    loading the resource from the db results in an error when resources
    access properties in handle_delete.

    Change-Id: If76372c7ef9aee258efb1bfbc724d8637bc6a32c
    Closes-Bug: #1709682
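
The effect of the fix can be illustrated with a small, self-contained Python sketch. These are simplified stand-ins, not Heat's real classes: loading a never-created (INIT_COMPLETE) backup-stack resource with stored properties data of {} masks the template definition and makes property lookups in handle_delete fail, while loading it with None lets properties resolve from the definition:

```python
class Properties:
    """Stand-in for Heat's Properties: looks up values from a data dict."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, key):
        if key not in self.data:
            raise ValueError("Property %s not assigned" % key)
        return self.data[key]


class Resource:
    """Stand-in resource loaded from the db with stored properties data."""
    def __init__(self, definition, stored_properties_data):
        self.definition = definition           # template-supplied properties
        self._stored = stored_properties_data  # what was loaded from the db

    @property
    def properties(self):
        # A non-None stored dict (even an empty one) is trusted as-is;
        # None falls back to resolving from the resource definition.
        data = self.definition if self._stored is None else self._stored
        return Properties(data)

    def handle_delete(self):
        return self.properties['actions']


defn = {'actions': ['CREATE', 'UPDATE']}

# Before the fix: the INIT_COMPLETE resource was loaded with {} stored
# properties, so handle_delete raises "Property actions not assigned".
broken = Resource(defn, stored_properties_data={})
try:
    broken.handle_delete()
except ValueError as exc:
    print(exc)  # Property actions not assigned

# After the fix: stored properties None -> resolved from the definition.
fixed = Resource(defn, stored_properties_data=None)
print(fixed.handle_delete())  # ['CREATE', 'UPDATE']
```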

Changed in heat:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/499163

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/pike)

Reviewed: https://review.openstack.org/499163
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=d4867d4a5871277df8c505e8f09906813e0755d0
Submitter: Jenkins
Branch: stable/pike

commit d4867d4a5871277df8c505e8f09906813e0755d0
Author: rabi <email address hidden>
Date: Tue Aug 22 17:34:50 2017 +0530

    Set resource._properties_data=None when loading from db

    In I462ce7161497306483286b78416f9037ac80d6fa we changed to use the
    frozen_definition properties for delete. However, when deleting a
    resource from the backup stack, where the resource is in INIT_COMPLETE,
    setting the _stored_properties_data (_properties_data) to {} when
    loading the resource from the db results in an error when resources
    access properties in handle_delete.

    Change-Id: If76372c7ef9aee258efb1bfbc724d8637bc6a32c
    Closes-Bug: #1709682
    (cherry picked from commit f849b4deb277e2511cf4547fc26f1e849c987954)

tags: added: in-stable-pike
summary: - Configuration Updates which used ceph-ansible workflow fail
+ Configuration Updates which used ExternalResource fail
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 10.0.0.0b1

This issue was fixed in the openstack/heat 10.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 9.0.1

This issue was fixed in the openstack/heat 9.0.1 release.
