pike promotion errors, overcloud images and containers were not promoted properly

Bug #1722640 reported by wes hayutin on 2017-10-10
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: Gabriele Cerami

Bug Description

New promotion on current-tripleo/ 2017-10-10 15:41

Commit hash
commits:
- commit_branch: master
  commit_hash: 39c29e7c7facf47a6131570d9602f8a7737fa4be
  distgit_dir: /home/centos-pike/data/openstack-rally_distro/
  distro_hash: baed7867b233f1e9b49723f7a8636da681a0e324
  dt_build: '1507649820'
  dt_commit: '1507649566'

Errors: in http://38.145.33.13/pike.log
http://paste.openstack.org/show/623274/

Images are available, but not promoted
https://images.rdoproject.org/pike/rdo_trunk/?C=M;O=D

current-tripleo is pointing to 10-07 not 10-10
rdo container registry is down

<dmsimard> weshay, trown, adarazs: I put the registry in read only for a bit, trying a solution from upstream to resolve an issue
<trown> dmsimard: ack

David Moreau Simard (dmsimard) wrote :

The *container* image upload/tag/promotion/push failure is most certainly my fault.

I am trying to implement a definitive solution to the image pruning issues we have been having with the help of upstream openshift devs. This should not reproduce.

John Trowbridge (trown) wrote :

logs for the image failure:

2017-10-10 20:02:45,020 28459 INFO promoter Promoting the qcow image for dlrn hash 39c29e7c7facf47a6131570d9602f8a7737fa4be_baed7867 on pike to current-tripleo
2017-10-10 20:02:45,026 28459 ERROR promoter QCOW IMAGE UPLOAD FAILED LOGS BELOW:
2017-10-10 20:02:45,026 28459 ERROR promoter + RELEASE=pike
+ PROMOTED_HASH=39c29e7c7facf47a6131570d9602f8a7737fa4be_baed7867
+ LINK_NAME=current-tripleo
+ image_path=pike/rdo_trunk/current-tripleo
+ ssh_cmd='ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
+ pushd /tmp/
/tmp ~
+ mkdir -p 39c29e7c7facf47a6131570d9602f8a7737fa4be_baed7867
+ ln -s 39c29e7c7facf47a6131570d9602f8a7737fa4be_baed7867 stable
ln: failed to create symbolic link ‘stable/39c29e7c7facf47a6131570d9602f8a7737fa4be_baed7867’: File exists

2017-10-10 20:02:45,026 28459 ERROR promoter Command '['bash', '/home/centos/ci-config/ci-scripts/promote-images.sh', 'pike', '39c29e7c7facf47a6131570d9602f8a7737fa4be_baed7867', 'current-tripleo']' returned non-zero exit status 1
Traceback (most recent call last):
  File "/home/centos/ci-config/ci-scripts/dlrnapi_promoter/dlrnapi_promoter.py", line 127, in tag_qcow_images
    stderr=subprocess.STDOUT
  File "/usr/lib64/python2.7/subprocess.py", line 575, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['bash', '/home/centos/ci-config/ci-scripts/promote-images.sh', 'pike', '39c29e7c7facf47a6131570d9602f8a7737fa4be_baed7867', 'current-tripleo']' returned non-zero exit status 1
2017-10-10 20:02:45,026 28459 ERROR promoter END OF QCOW IMAGE UPLOAD FAILURE
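
Note on the trace above: the `stable/<hash>` path in the error is consistent with how `ln -s` treats an existing destination. If `stable` is already a symlink to a directory (for example, left behind in /tmp/ by an earlier run), `ln` tries to create the new link inside that directory instead of replacing it. A minimal Python 3 sketch of a rerunnable alternative is shown here. `force_symlink` is a hypothetical helper name used only for illustration; it is not taken from promote-images.sh or from the linked reviews, and it assumes a POSIX filesystem.

```python
import os


def force_symlink(target: str, link_path: str) -> None:
    """Point link_path at target, replacing any symlink already at link_path.

    A temporary link is created next to link_path and then renamed over it,
    so the link is swapped in one step and repeated runs do not fail with
    "File exists".
    """
    link_dir = os.path.dirname(os.path.abspath(link_path))
    tmp_link = os.path.join(
        link_dir, ".{}.tmp.{}".format(os.path.basename(link_path), os.getpid())
    )
    # Remove a stale temporary link left by an interrupted run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    # rename() acts on the symlink itself and does not follow it, so an
    # existing link to a directory is replaced rather than entered.
    os.replace(tmp_link, link_path)
```

If `link_path` is a real directory rather than a symlink, `os.replace` raises an error instead of silently nesting the link, which makes that situation visible to the caller.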

John Trowbridge (trown) wrote :

https://review.rdoproject.org/r/10085 should fix the qcow image promote part of this bug.

David Moreau Simard (dmsimard) wrote :

The registry is once again available, I've posted details of the issue here: https://www.redhat.com/archives/rdo-list/2017-October/msg00037.html

wes hayutin (weshayutin) wrote :

Removing alert, hopefully one of the 4hr jobs (pike,master) will show this working by tomorrow, if not we'll poke at it by hand to confirm. The next build job should be able to give us the feedback we need.

tags: removed: alert

I don't think this is the right solution for the image problem, because the real issue is that the promoter script can't detect that the new and old hashes are the same and tries to promote anyway. I am trying to solve it here: https://review.rdoproject.org/r/10094
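
The check described in this comment can be sketched as follows. This is only an illustration in Python 3, not the content of review 10094: the function names `full_hash` and `needs_promotion` are hypothetical, and the combined hash layout (commit hash, underscore, first 8 characters of the distro hash) is inferred from the values shown in the bug description and the log.

```python
from typing import Optional


def full_hash(commit_hash: str, distro_hash: str) -> str:
    """Build the combined identifier seen in the promoter log.

    Assumed layout: <commit_hash>_<first 8 characters of distro_hash>.
    """
    return "{}_{}".format(commit_hash, distro_hash[:8])


def needs_promotion(candidate: str, promoted: Optional[str]) -> bool:
    """Return True only if the candidate differs from the promoted hash.

    promoted is None when nothing has been promoted to the link yet.
    """
    return promoted is None or candidate != promoted


candidate = full_hash(
    "39c29e7c7facf47a6131570d9602f8a7737fa4be",
    "baed7867b233f1e9b49723f7a8636da681a0e324",
)
# Promoting the same hash twice would be skipped under this check.
if not needs_promotion(candidate, candidate):
    print("hash already promoted, skipping")
```

Comparing the combined identifiers on both sides avoids the case where a bare commit hash is compared against a combined one and the two are never seen as equal.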

Let's revert the change that uses the tmp dir: https://review.rdoproject.org/r/#/c/10095/

Attila Darazs (adarazs) wrote :

https://review.rdoproject.org/r/10094 reviewed -- needs some adjustment, but that's definitely a problem. I didn't notice this when full_hash was added. Thanks for noticing.

Also I agree with reverting that change with the /tmp/ dir.

Arx Cruz (arxcruz) on 2017-10-11
Changed in tripleo:
assignee: John Trowbridge (trown) → Gabriele Cerami (gcerami)

Gabriele Cerami (gcerami) wrote :

The cloud images link was corrected manually yesterday. The fix for the link, https://review.rdoproject.org/api/10099, is merged.
Container images were uploaded manually by launching the upload script. The crontab was modified in the setup playbook, but not on the promoter server. It should be OK now.
Waiting for https://review.openstack.org/511334 to fix the URLs of the images and using overcloud images as the undercloud image.

Changed in tripleo:
milestone: queens-1 → queens-2

Gabriele Cerami (gcerami) wrote :

This happened again because of space problems on the image server. I am appending cleanup code to every promotion.
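
A cleanup step of this kind could look roughly like the Python 3 sketch below. Everything in it is an assumption made for illustration: the layout (one directory per hash, plus symlinks such as current-tripleo in the same directory), the function name `prune_image_dirs`, and the default of keeping three directories. It is not the cleanup code that was actually added to the promoter.

```python
import os
import shutil


def prune_image_dirs(root: str, keep: int = 3) -> list:
    """Delete old per-hash directories under root and return their names.

    The `keep` most recently modified directories are left alone, and so is
    any directory that a symlink in root currently resolves to.
    """
    entries = [os.path.join(root, name) for name in os.listdir(root)]
    # Directories that promotion links currently point at must survive.
    protected = {os.path.realpath(p) for p in entries if os.path.islink(p)}
    dirs = [p for p in entries if os.path.isdir(p) and not os.path.islink(p)]
    dirs.sort(key=os.path.getmtime, reverse=True)  # newest first
    removed = []
    for path in dirs[keep:]:
        if os.path.realpath(path) in protected:
            continue
        shutil.rmtree(path)
        removed.append(os.path.basename(path))
    return removed
```

Protecting link targets matters because the promoted hash is not always among the newest uploads, as when current-tripleo still pointed to the 10-07 images.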

Alan Pevec (apevec) wrote :

@Gabriele what is the preventive action and who owns it?

Attila Darazs (adarazs) wrote :

This just happened again. We need to correct it manually and fix the promoter ASAP.

Attila Darazs (adarazs) wrote :

Only the container image upload failed. I recreated the command run in the promoter script and got the same error:

failed: [localhost] (item=cinder-backup) => {"failed": true, "item": "cinder-backup", "msg": "Error pulling trunk.registry.rdoproject.org/pike/centos-binary-cinder-backup - code: None message: Get https://trunk.registry.rdoproject.org/v2/pike/centos-binary-cinder-backup/manifests/1e0dd6786e3a4326a37e47541844657524e689a9_81220c5f: unauthorized: authentication required"}

Something probably changed on the RDO registry side that prevents us from accessing it.

Attila Darazs (adarazs) wrote :

This seems to be a separate issue from this bug: the RDO registry is not allowing anonymous pulls, so I am opening a different bug for that.

Changed in tripleo:
milestone: queens-2 → queens-3

wes hayutin (weshayutin) on 2018-01-03
Changed in tripleo:
status: Triaged → Fix Released