Nova services do not restart on N->O upgrade
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack-Ansible | Fix Released | Critical | Jesse Pretorius |
Bug Description

OS: Ubuntu 16.04
Version: 14.1.1, upgrading to stable/ocata
Docs - https:/
Possibly a duplicate of https:/
On an environment with successful Tempest results on stable/newton, upgrading to stable/ocata via the upgrade script causes the post-upgrade tests to fail. The nova logs report IncompatibleVersion errors, and all nova services report that they are still running version 14.1.1. Other services correctly run the 15.0.0 version after the upgrade.
http://
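To confirm which code each service is actually running, checking the venv paths of the running processes works; a minimal sketch, assuming the usual openstack-ansible venv layout of /openstack/venvs/<service>-<tag> (the play itself is illustrative, not part of the upgrade):

```yaml
# Illustrative check, assuming the openstack-ansible venv layout
# /openstack/venvs/nova-<tag>/...: list the executable paths of the
# running nova processes on every nova host.
- hosts: nova_all
  gather_facts: false
  tasks:
    - name: Collect the executable paths of running nova processes
      shell: ps -eo args | awk '/[n]ova/ {print $1}' | sort -u
      register: nova_proc_paths
      changed_when: false

    - name: Show which venv each nova process was started from
      debug:
        var: nova_proc_paths.stdout_lines
```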
Restarting all nodes/containers then results in all services running the 15.0.0 version and the Tempest tests passing.
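That manual workaround amounts to bouncing every nova service so it loads the 15.0.0 venv; roughly something like the following sketch (the service list is illustrative, and each host group only runs a subset of it):

```yaml
# Rough equivalent of the manual workaround: restart the nova services
# so they pick up the new venv. Service names here are illustrative.
- hosts: nova_all
  gather_facts: false
  tasks:
    - name: Restart nova services
      service:
        name: "{{ item }}"
        state: restarted
      with_items:
        - nova-api-os-compute
        - nova-conductor
        - nova-scheduler
        - nova-compute
      failed_when: false  # not every service exists on every host
```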
The Ansible output of the upgrade shows that the nova containers were not restarted:
```
TASK [Lxc container restart] *******
Friday 17 March 2017 15:09:03 -0500 (0:00:01.844) 0:07:59.862 **********
skipping: [infra01_
skipping: [infra02_
skipping: [infra01_
skipping: [infra03_
skipping: [infra02_
skipping: [infra01_
skipping: [infra03_
skipping: [infra02_
skipping: [infra01_
skipping: [infra03_
skipping: [infra02_
skipping: [infra01_
skipping: [infra03_
skipping: [infra02_
skipping: [infra01_
skipping: [infra03_
skipping: [infra02_
skipping: [compute02]
skipping: [compute03]
skipping: [compute01]
skipping: [compute06]
skipping: [compute07]
skipping: [compute04]
skipping: [compute05]
skipping: [compute08]
skipping: [compute09]
skipping: [infra01_
skipping: [infra03_
skipping: [infra03_
skipping: [infra02_
```
Full installation/
http://
Changed in openstack-ansible:
status: New → Confirmed
importance: Undecided → Critical

Changed in openstack-ansible:
assignee: Dan Kolb (dankolbrs) → Jesse Pretorius (jesse-pretorius)
The nova services within the containers should have restarted when the new set of venvs, config, and init scripts were dropped. If any of that didn't happen, we'd be facing an issue with the nova service as found within the os_nova role. I could see one possibility where the handler didn't fire. Was this an upgrade that ran into an issue mid-upgrade and was rerun?
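For context, the restart in the role is wired through the standard notify/handler pattern; a minimal sketch (illustrative, not the literal os_nova tasks) of how the restart can be missed on a re-run:

```yaml
# tasks/main.yml (sketch, not the literal os_nova role): dropping the
# config notifies the restart handler. On a re-run after a mid-upgrade
# failure, this task can report "ok" instead of "changed", in which
# case the handler never fires and the old venv keeps running.
- name: Drop nova config
  template:
    src: nova.conf.j2
    dest: /etc/nova/nova.conf
  notify: Restart nova services

# handlers/main.yml (sketch; "nova_service_names" is illustrative)
- name: Restart nova services
  service:
    name: "{{ item }}"
    state: restarted
  with_items: "{{ nova_service_names | default(['nova-compute']) }}"
```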
As for the container restarts, we don't restart all containers on every upgrade; in fact, we do our best not to restart containers in an effort to maximize uptime. So the fact that the container restart task was skipped everywhere, in my mind, means we did a decent job of not impacting uptime during the major migration.
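That is consistent with the usual shape of that task: the restart is gated on a condition and only fires when something in the container itself changed. A minimal sketch of the pattern (the real role's condition differs, and "lxc_config_result" is an illustrative register name):

```yaml
# Sketch of the conditional pattern behind the skipped task: the
# container is only bounced when its LXC configuration changed
# during the run.
- name: Lxc container restart
  lxc_container:
    name: "{{ inventory_hostname }}"
    state: restarted
  delegate_to: "{{ physical_host }}"
  when: lxc_config_result is defined and lxc_config_result is changed
```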
As for the fact that all of the services were running the proper version once the containers were restarted, I'd have to suspect that something else happened that caused the handler not to fire, which could be a bug in the role or something else. Would it be possible to get the entire log for the run? Also, have we confirmed this issue on multiple runs?