[queens promotion] - overcloud deployment fails step2: /dev/loop3 missing

Bug #1750311 reported by Ronelle Landy
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   Critical     yatin

Bug Description

This may be a duplicate of https://bugs.launchpad.net/tripleo/+bug/1749645? At the moment, the python-mistralclient branch of 1749645 is still open.

fs016 is failing to deploy the overcloud in Queens promotion jobs with the following error:

2018-02-19 02:49:26 | 2018-02-19 02:49:19Z [overcloud]: CREATE_FAILED Resource CREATE failed: resources.AllNodesDeploySteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
2018-02-19 02:49:26 |
2018-02-19 02:49:26 | Stack overcloud CREATE_FAILED
2018-02-19 02:49:26 |
2018-02-19 02:49:26 | overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
2018-02-19 02:49:26 | resource_type: OS::Mistral::ExternalResource
2018-02-19 02:49:26 | physical_resource_id: 6367dc09-8e45-487e-8b75-7b52e580a648
2018-02-19 02:49:26 | status: CREATE_FAILED
2018-02-19 02:49:26 | status_reason: |
2018-02-19 02:49:26 | resources.WorkflowTasks_Step2_Execution: ERROR
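
When triaging a long overcloud_deploy.log, it can help to pull out only the lines that report a failure. The following is a minimal Python sketch, not part of the CI tooling; the function name failed_lines and the assumption that every line carries a "YYYY-MM-DD HH:MM:SS | " prefix are illustrative and based only on the excerpt above.

```python
import re

# Matches the "<timestamp> | " prefix seen on each line of the excerpt above.
PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \|\s?")


def failed_lines(log_text):
    """Return log lines (timestamp prefix stripped) that mention CREATE_FAILED."""
    hits = []
    for line in log_text.splitlines():
        body = PREFIX.sub("", line)
        if "CREATE_FAILED" in body:
            hits.append(body.strip())
    return hits


# A few lines copied from the excerpt above.
sample = """\
2018-02-19 02:49:26 | Stack overcloud CREATE_FAILED
2018-02-19 02:49:26 |
2018-02-19 02:49:26 | status: CREATE_FAILED
2018-02-19 02:49:26 | status_reason: |
"""

for hit in failed_lines(sample):
    print(hit)
```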

Full logs are linked below:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset016-queens/8b4282d/undercloud/home/jenkins/overcloud_deploy.log.txt.gz

Not much info in:
https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset016-queens/8b4282d/undercloud/home/jenkins/failed_deployment_list.log.txt.gz

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset016-queens/8b4282d/undercloud/var/log/extra/errors.txt.gz shows some Heat/Mistral errors.

Revision history for this message
Ronelle Landy (rlandy) wrote :

Checking to see if this is new bug before adding tags.

Revision history for this message
yatin (yatinkarel) wrote :

Looks like it's a new issue occurring while installing Ceph: "Error: Could not stat device /dev/loop3 - No such file or directory.", "stderr_lines": ["Error: Could not stat device /dev/loop3 - No such file or directory."
https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset016-queens/8b4282d/undercloud/var/log/mistral/engine.log.txt.gz#_2018-02-19_02_49_17_318

Changed in tripleo:
milestone: none → queens-rc1
importance: Undecided → High
status: New → Triaged
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

There is also an 'Endpoint of type baremetal-introspection' error (2018-02-19T02:14:40.429 ./8b4282d/undercloud/var/log/mistral/mistral-db-manage.log.txt.gz) and a lot of ironic db sync "Permission denied" warnings before that in ./8b4282d/undercloud/var/log/ironic/ironic-dbsync.log.txt.gz

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

But yeah, the root cause of the failure seems to be Ceph PG issues:

Feb 19 02:49:13 ./8b4282d/subnode-2/var/log/journal.txt.gz :
Health check failed: Degraded data redundancy: 168 pgs unclean (PG_DEGRADED)

Revision history for this message
Ronelle Landy (rlandy) wrote :

Adding promotion-blocker since we are seeing this error in the Queens promotion jobs.

Changed in tripleo:
importance: High → Critical
tags: added: ci promotion-blocker
Changed in tripleo:
assignee: nobody → John Fulton (jfulton-org)
summary: - [queens promotion] - overcloud deployment fails
- resources.WorkflowTasks_Step2_Execution 'OS::Mistral::ExternalResource'
+ [queens promotion] - overcloud deployment fails step2: /dev/loop3
+ missing
Revision history for this message
John Fulton (jfulton-org) wrote :

- We added https://review.openstack.org/#/c/484963 in openstack-infra/tripleo-ci to create /dev/loop3, which the deployment then uses as an OSD.
- That task seems to have failed, as /dev/loop3 is not present according to the error message:

(item=/dev/loop3) => {"changed": false, "cmd": "parted --script /dev/loop3 print | egrep -sq \'^ 1.*ceph\'", "delta": "0:00:00.015108", "end": "2018-02-19 02:49:11.988895", "failed_when_result": false, "item": "/dev/loop3", "msg": "non-zero return code", "rc": 1, "start": "2018-02-19 02:49:11.973787", "stderr": "Error: Could not stat device /dev/loop3 - No such file or directory.", "stderr_lines": ["Error: Could not stat device /dev/loop3 - No such file or directory."]
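
The parted error above means the device node itself does not exist. A pre-check along the following lines could confirm that before the deployment tries to use the device as an OSD. This is only a sketch: the helper name is_block_device is hypothetical and is not part of tripleo-ci or ceph-ansible.

```python
import os
import stat


def is_block_device(path):
    """Return True if path exists and is a block device, otherwise False."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        # Same condition parted reported: the device node does not exist.
        return False
    return stat.S_ISBLK(mode)


# On the failing subnode this would have printed False for /dev/loop3.
print(is_block_device("/dev/loop3"))
```

Running a check like this on the subnode right after the bootstrap script would separate "the loop device was never created" from "the device exists but has no Ceph partition", which are the two reasons the parted/egrep pipeline can return non-zero.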

Search the following for "Could not stat device /dev/loop3":

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset016-queens/8b4282d/undercloud/var/log/mistral/engine.log.txt.gz#_2018-02-19_02_49_17_320
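
For anyone who prefers to download the log and search it locally, the search can be scripted. The sketch below assumes the .gz file has already been saved to disk; the function name find_in_gzip_log and the temporary demo file are illustrative only, and the demo writes a tiny stand-in log rather than fetching the real one.

```python
import gzip
import os
import tempfile


def find_in_gzip_log(path, needle):
    """Yield (line_number, line) for each line of a gzip log containing needle."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if needle in line:
                yield number, line.rstrip("\n")


# Demo on a small stand-in file; replace `demo` with the downloaded log path.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "engine.log.txt.gz")
    with gzip.open(demo, "wt", encoding="utf-8") as fh:
        fh.write("task started\n")
        fh.write("Error: Could not stat device /dev/loop3 - No such file or directory.\n")
    for number, line in find_in_gzip_log(demo, "Could not stat device /dev/loop3"):
        print(number, line)
```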

Looking for output of openstack-infra/tripleo-ci / scripts/bootstrap-overcloud-full-minimal.sh to see what happened.

Revision history for this message
John Fulton (jfulton-org) wrote :

In the past I used the output of 'lsblk' [1] in the file bootstrap-subnodes.log.txt.gz, from the CI report for scenario001-multinode-containers, to verify that /dev/loop3 was created. The output of the lsblk command would appear in undercloud/var/log/ from the CI job.

However, I have looked through this CI job's undercloud/var/log [2] and don't see it. Thus I am unable to verify whether the task that should have set up the block device failed in some way that might explain this issue.

[1] https://review.openstack.org/#/c/484963/13/scripts/bootstrap-overcloud-full-minimal.sh

[2] https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset016-queens/8b4282d/undercloud/var/log/

Revision history for this message
John Fulton (jfulton-org) wrote :

- I don't know why subnodes.log.txt.gz is missing from this job [1]
- It looks like it should be collected [2]
- Setting aside the missing log file, the deployment would not have gotten as far as it did if line 39 of bootstrap-overcloud-full-minimal.sh had failed (sudo yum -y install git python-heat-agent*)
- So, I assume this script ran, though we're missing the logs
- Is it possible that, when it ran, the condition on line 44 of bootstrap-overcloud-full-minimal.sh [3] was false?
- In other words, did "${TOCI_JOBTYPE:-''}" not contain "multinode" ?

[1] https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset016-queens/8b4282d/undercloud/var/log/

[2] https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/collect-logs/defaults/main.yml#L119

[3] https://github.com/openstack-infra/tripleo-ci/blob/master/scripts/bootstrap-overcloud-full-minimal.sh#L44
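
To make the question above concrete, here is a rough Python rendering of the check being asked about. It assumes only what the comment states, namely that the script tests whether "${TOCI_JOBTYPE:-''}" contains "multinode". The exact shell test on line 44 is not quoted in this report, and the job type string used in the example is made up.

```python
def jobtype_is_multinode(env):
    """Mimic testing whether "${TOCI_JOBTYPE:-''}" contains "multinode"."""
    # Bash's ${VAR:-default} falls back to the default when VAR is unset
    # or empty; here the default is the literal two-character string ''.
    jobtype = env.get("TOCI_JOBTYPE") or "''"
    return "multinode" in jobtype


# Hypothetical values, for illustration only.
print(jobtype_is_multinode({"TOCI_JOBTYPE": "periodic-multinode-1ctlr"}))  # True
print(jobtype_is_multinode({}))  # False: variable unset, default has no match
```

If the variable were unset or empty in the periodic job's environment, the condition would be false and the loop device setup would be skipped without any error.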

Revision history for this message
John Fulton (jfulton-org) wrote :

Next step:

rlandy is going to try a DNM patch which will 'ls /var/log' so we can get more information on what might be happening.

Thank you rlandy

Revision history for this message
yatin (yatinkarel) wrote :

Looking at the logs, it looks like the script (https://github.com/openstack-infra/tripleo-ci/blob/master/scripts/bootstrap-overcloud-full-minimal.sh) didn't run because BOOTSTRAP_SUBNODES_MINIMAL=0. I will propose a patch in openstack-infra/tripleo-ci to fix it.

Revision history for this message
yatin (yatinkarel) wrote :

Proposed patch in openstack-infra/tripleo-ci: https://review.openstack.org/#/c/546062/

Revision history for this message
John Fulton (jfulton-org) wrote :

Thank you rlandy for adding a patch to collect the missing logs going forward: https://review.openstack.org/#/c/546028

Thank you yatin for adding a patch to set the appropriate variables: https://review.openstack.org/#/c/546062

I am updating the bug assignment to yatin as he's proposing the fix.

Changed in tripleo:
assignee: John Fulton (jfulton-org) → nobody
assignee: nobody → yatin (yatinkarel)
Revision history for this message
yatin (yatinkarel) wrote :

After https://review.openstack.org/#/c/546062/, the issue is no longer seen; removing the promotion-blocker tag.

tags: removed: ci promotion-blocker
Changed in tripleo:
status: Triaged → Fix Committed
Changed in tripleo:
status: Fix Committed → Fix Released