There is no 'major version' upgrades job for ci

Bug #1583125 reported by Marios Andreou
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: mathieu bultel

Bug Description

Our CI doesn't currently run the major version upgrades workflow [1]. The full workflow involves 3 heat stack updates (upgrade init, upgrade controllers, upgrade converge) and individual invocations of 'upgrade_non_controller.sh' for each of your non-controller, non-cinder nodes.
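
To make that concrete, a rough bash sketch of what automating the workflow might look like. The environment file names, script options, and node-name pattern below are assumptions for illustration only and may not match the procedure finally documented in [1]:

#!/bin/bash
# Sketch of the three heat stack updates plus per-node script runs described
# above. Environment file names and options are assumptions.
set -eu

THT=/usr/share/openstack-tripleo-heat-templates   # assumed templates path

# 1. upgrade init: deliver the upgrade scripts / repo setup to the nodes
openstack overcloud deploy --templates "$THT" \
    -e "$THT/environments/major-upgrade-pacemaker-init.yaml"

# upgrade each non-controller, non-cinder node individually
# (single loop for brevity; where each node type fits in the sequence is per [1],
# and the node-name pattern is an assumption)
for node in $(openstack server list -f value -c Name | grep -E 'compute|ceph|swift'); do
    # script name/options as referenced in this bug; exact interface may differ
    upgrade_non_controller.sh --upgrade "$node"
done

# 2. upgrade the controllers (pacemaker-managed services)
openstack overcloud deploy --templates "$THT" \
    -e "$THT/environments/major-upgrade-pacemaker.yaml"

# 3. converge: re-apply the regular configuration across the whole overcloud
openstack overcloud deploy --templates "$THT" \
    -e "$THT/environments/major-upgrade-pacemaker-converge.yaml"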

Having said that, running the full upgrades workflow end-to-end can be done - in the worst case/at a first pass as a bash script. One immediate concern is how long such a job would take to run on our CI infra; the current upgrades job takes close to 2 hours (e.g. "21:16:49.930" to "23:13:42.119" [2]) and isn't running any package updates or any part of the upgrades workflow documented at [1]. Skipping the image build (using cached images) is one possibility towards mitigating that concern.
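
One way that cached-image mitigation could look in the job script; the cache path is hypothetical and the exact image build invocation varies by release:

#!/bin/bash
# Sketch: reuse previously built overcloud images when available instead of
# rebuilding on every run. The cache directory is hypothetical.
set -eu

CACHE_DIR=/opt/cached-images   # hypothetical cache location on the undercloud

if [ ! -f "$CACHE_DIR/overcloud-full.qcow2" ]; then
    echo "no cached images found, building (slow path)"
    openstack overcloud image build --all   # exact invocation varies by release
    mkdir -p "$CACHE_DIR"
    cp overcloud-full.* ironic-python-agent.* "$CACHE_DIR"/
fi

# upload whichever images we ended up with into glance for the deploy
openstack overcloud image upload --image-path "$CACHE_DIR"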

Another possibility towards the same goal is to only implement parts of the full upgrades workflow - for example, explore upgrading just the controller node to start with. A practical issue there is that the process may still involve at least 2 steps, since we would still need to converge to deliver datacenter-wide config changes (like a change of the rabbitmq password, which all services talking to rabbit would also need to know about and update their configs for).
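
A minimal sketch of that controller-only variant; the environment file names are again assumptions, and the nova rpc pinning concerns raised in the irc excerpt below would still have to be addressed:

#!/bin/bash
# Sketch: reduced, controller-only upgrade for CI - one stack update to upgrade
# the controllers, one converge so controller-side config changes (e.g. a
# rotated rabbitmq password) reach services on the other nodes.
# Environment file names are assumptions.
set -eu

THT=/usr/share/openstack-tripleo-heat-templates   # assumed templates path

openstack overcloud deploy --templates "$THT" \
    -e "$THT/environments/major-upgrade-pacemaker.yaml"

openstack overcloud deploy --templates "$THT" \
    -e "$THT/environments/major-upgrade-pacemaker-converge.yaml"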

This bug is intended to track the work for adding an upgrades job to our CI, and any general discussion around that work, so it can be targeted at an N2 delivery - resulting from an irc chat with shardy/jistr/social on freenode #tripleo [3].

[1] "Upgrade documentation" https://review.openstack.org/#/c/308985/

[2] http://logs.openstack.org/36/317736/1/check-tripleo/gate-tripleo-ci-f22-upgrades/3f34b22/console.html#_2016-05-17_23_13_42_119

[3] (excerpt from irc for context)
13:23 < shardy_> what would be good is improving the CI upgrade job so we can test/prove the monolithic implementation
13:24 < shardy_> then, we can make incremental improvements with assurance things still work
13:24 < shardy_> evidently there are still walltime challenges there, but perhaps we can think about if/how that might be possible e.g with cached images etc
13:25 < shardy_> probably we need an actually-test-upgrades bug for CI
13:27 < marios> shardy_: yeah the initial deploy in full (image build etc) doesn't have to (also) be exercised for the upgrade - so using images would probably save quite a bit of time. The
                greater challenge however is automating the current upgrades procedure (jistr has wip docs at https://review.openstack.org/#/c/308985/ btw) - in the worst case with a script
                :/ possibly with mistral but I am not sure yet about that

13:28 < matbu> hey guys, i have a ansible role for upgrade
13:28 < shardy_> marios: as a first step can we just run the controller upgrade?
13:28 < marios> shardy_: the fact of the matter is it is an operator driven procedure involving 3 heat stack updates and the 'upgrade_non_controller.sh' for each non controller (except
                cinder)
13:28 < matbu> it could be use in CI
13:28 < jistr> marios: i think script should be fine
13:29 < marios> shardy_: jistr i guess for the purposes of ci pinning nova may not be a big deal (thinking of just upgrading the controller)
13:29 < jistr> marios: i mean, i'm not sure if it makes sense to automate the current procedure in mistral if the procedure might change a lot
13:29 < jistr> e.g. will we even be using heat for the upgrades going forward...
13:30 < marios> shardy_: jistr i.e. without running sthing to pin nova-compute rpc version on the compute
13:30 < jistr> marios, shardy_: re just controller upgrade -- not sure if we could run converge afterwards. Maybe it would work in many cases, but it's not the general case. At the very
               least we'd feed into the converge step that it needs to keep nova pinned.
13:31 < jistr> marios: without pinning nova might stop working (i.e. pingtest after upgrade might fail)
13:31 < marios> jistr: i think we'd need to run converge ... for delivering sthing like rabbit password change to all nodes
13:31 < jistr> yea...
13:31 < jistr> so i'm wondering if we can skip the compute upgrade
13:31 < marios> jistr: e.g. if there is sthing which lands in /etc/foo/foo.conf on controller which also affects a service running on _other_node_
13:31 < jistr> but compute upgrade should be reasonably fast actually
13:32 < shardy_> Ok, few things to work out here - would it be a good idea to raise a bug, outline the issues, and discuss the options?
13:32 < shardy_> we can target that at n2 and ensure someone has bandwidth to work on it
13:33 < shardy_> This is why I want a bug - it makes it very easy to see who's assigned to the work :)
13:33 < marios> shardy_: OK I can file that
13:37 < matbu> marios: jistr shardy_ https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-upgrade

description: updated
Jiří Stránský (jistr) wrote :

Mathieu has made an ansible role to automate the overcloud upgrade workflow; it could be used in CI, or we could at least extract the automation logic from there. E.g. see the overcloud upgrade automation:

https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-upgrade/blob/ce61209a90c5206f09bd44a824fa39189f517368/templates/major-upgrade-overcloud.sh.j2
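
For example, a minimal way CI could consume the role; only the repo/role name comes from the link above, the playbook body, host group and inventory file are hypothetical:

#!/bin/bash
# Sketch: install the role from git and apply it on the undercloud.
# The playbook contents and the inventory are hypothetical.
set -eu

ansible-galaxy install git+https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-upgrade

# hypothetical one-off playbook that just applies the role on the undercloud
cat > upgrade-overcloud.yml <<'EOF'
- hosts: undercloud
  roles:
    - ansible-role-tripleo-overcloud-upgrade
EOF

ansible-playbook -i hosts upgrade-overcloud.yml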

Changed in tripleo:
assignee: nobody → mbu (mat-bultel)
Steven Hardy (shardy)
Changed in tripleo:
milestone: newton-2 → newton-3
Adriano Petrich (apetrich) wrote :

We have a job for it https://ci.centos.org/view/rdo/view/tripleo-periodic/job/tripleo-quickstart-upgrade-major-liberty-to-mitaka/

Currently it is failing with:
ERROR: Remote error: DBConnectionError (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '192.0.2.1' ([Errno 111] ECONNREFUSED)") [SQL: u'SELECT 1']

Emilien Macchi (emilienm) wrote :

mbu and I are working on https://review.openstack.org/#/c/323750/ that will address this bug.

Changed in tripleo:
status: Triaged → In Progress
Steven Hardy (shardy)
Changed in tripleo:
milestone: newton-3 → newton-rc1
Changed in tripleo:
milestone: newton-rc1 → newton-rc2
Emilien Macchi (emilienm) wrote :

Moving it to ocata-1 as we won't make it for the deadline.

Changed in tripleo:
milestone: newton-rc2 → ocata-1
Steven Hardy (shardy)
Changed in tripleo:
milestone: ocata-1 → ocata-2
mathieu bultel (mat-bultel) wrote :

I think the review is (almost) done.
I got some successful upgrades with the review (on a local env), and yesterday the job hit the timeout during the converge step:
http://logs.openstack.org/50/323750/106/experimental-tripleo/gate-tripleo-ci-centos-7-ovb-nonha-upgrades-nv/9da27ff/console.html

I don't really understand why some of the classic jobs failed, but any eyes on this review would be welcome.

I'm planning to remove the ceilometer migration to gain a few minutes and try not to hit the timeout this time.
I know the upgrade can be successful without this step.

Changed in tripleo:
milestone: ocata-2 → ocata-3
Steven Hardy (shardy) wrote :

This was completed via https://review.openstack.org/#/c/404831/ which is now non-voting in the t-h-t check queue.

Changed in tripleo:
status: In Progress → Fix Released