Upgrade liberty to mitaka fails

Bug #1608867 reported by Adriano Petrich
This bug affects 1 person
Affects   Status         Importance   Assigned to       Milestone
tripleo   Fix Released   Undecided    Adriano Petrich

Bug Description

I'm seeing two sets of errors when upgrading from Liberty to Mitaka. Most runs hit one or the other.

Here is an example of each kind of error:
https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-upgrade-major-liberty-to-mitaka-57/undercloud/home/stack/upgrade_console.log.gz
https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-upgrade-major-liberty-to-mitaka-53/undercloud/home/stack/upgrade_console.log.gz

Both runs used the same settings but produced two different errors.

After installing Liberty and upgrading the undercloud, CI does:

source /home/stack/stackrc

echo "execute aodh upgrade"
openstack overcloud deploy --templates tripleo-heat-templates \
    -e tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
    -e tripleo-heat-templates/environments/network-isolation.yaml \
    -e tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
    -e ~/network-environment.yaml \
    -e tripleo-heat-templates/environments/puppet-pacemaker.yaml \
    -e tripleo-heat-templates/environments/major-upgrade-aodh.yaml

## This works every time

echo "execute keystone upgrade"
openstack overcloud deploy --templates tripleo-heat-templates \
    -e tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
    -e tripleo-heat-templates/environments/network-isolation.yaml \
    -e tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
    -e ~/network-environment.yaml \
    -e tripleo-heat-templates/environments/puppet-pacemaker.yaml \
    -e tripleo-heat-templates/environments/major-upgrade-keystone-liberty-mitaka.yaml

## this sometimes fails, as in https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-upgrade-major-liberty-to-mitaka-53/undercloud/home/stack/upgrade_console.log.gz
# There's not much debug output for this failure:
# 2016-08-01 16:46:02 [overcloud-ControllerAllNodesDeployment-uf5x5k4knlbq]: UPDATE_COMPLETE Stack UPDATE completed successfully
# 2016-08-01 16:46:03 [overcloud-ControllerAllNodesValidationDeployment-33gvrro23v3e]: UPDATE_IN_PROGRESS Stack UPDATE started
# 2016-08-01 16:46:03 [0]: SIGNAL_COMPLETE Unknown
# Deployment failed: Heat Stack update failed.
# 2016-08-01 16:46:05 [overcloud-ControllerAllNodesValidationDeployment-33gvrro23v3e]: UPDATE_COMPLETE Stack UPDATE completed successfully
# 2016-08-01 16:46:05 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
# 2016-08-01 16:46:06 [ControllerAllNodesValidationDeployment]: UPDATE_COMPLETE state changed
# 2016-08-01 16:46:06 [0]: SIGNAL_COMPLETE Unknown
# 2016-08-01 16:46:07 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
# 2016-08-01 16:46:08 [0]: SIGNAL_COMPLETE Unknown
# Stack overcloud UPDATE_FAILED

echo "execute script delivery"
openstack overcloud deploy --templates tripleo-heat-templates \
    -e tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
    -e tripleo-heat-templates/environments/network-isolation.yaml \
    -e tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
    -e ~/network-environment.yaml \
    -e tripleo-heat-templates/environments/puppet-pacemaker.yaml \
    -e tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml \
    -e overcloud-repo.yaml

# this passes

echo "execute major upgrade controller"
openstack overcloud deploy --templates tripleo-heat-templates \
    -e tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
    -e tripleo-heat-templates/environments/network-isolation.yaml \
    -e tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
    -e ~/network-environment.yaml \
    -e tripleo-heat-templates/environments/puppet-pacemaker.yaml \
    -e tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml

# this fails in the other half of the runs

# 2016-08-02 08:24:23 [overcloud-CephStorageAllNodesValidationDeployment-p3z4pbdh3rnl]: UPDATE_COMPLETE Stack UPDATE completed successfully
# 2016-08-02 08:24:23 [overcloud-ComputeAllNodesValidationDeployment-6dnbx62nknc3]: UPDATE_IN_PROGRESS Stack UPDATE started
# 2016-08-02 08:24:24 [overcloud-ComputeAllNodesValidationDeployment-6dnbx62nknc3]: UPDATE_COMPLETE Stack UPDATE completed successfully
# 2016-08-02 08:24:25 [ComputeAllNodesValidationDeployment]: UPDATE_COMPLETE state changed
# ERROR: Timed out waiting for a reply to message ID e8479a0b7ef444558daf4ca0ed3c9edc
# 2016-08-02 08:24:38 [0]: SIGNAL_COMPLETE Unknown
# 2016-08-02 08:24:39 [CephStorageDeployment]: SIGNAL_COMPLETE Unknown
# 2016-08-02 08:24:39 [0]: SIGNAL_COMPLETE Unknown
# 2016-08-02 08:24:42 [0]: SIGNAL_COMPLETE Unknown
# 2016-08-02 08:24:43 [NetworkDeployment]: SIGNAL_COMPLETE Unknown

echo "execute converge"
openstack overcloud deploy --templates tripleo-heat-templates \
    -e tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
    -e tripleo-heat-templates/environments/puppet-pacemaker.yaml \
    -e tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml \
    -e tripleo-heat-templates/environments/network-isolation.yaml \
    -e tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
    -e ~/network-environment.yaml

## even when the previous step fails, this one still runs and fails with

+ openstack overcloud deploy --templates tripleo-heat-templates -e tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e tripleo-heat-templates/environments/puppet-pacemaker.yaml -e tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml -e tripleo-heat-templates/environments/network-isolation.yaml -e tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e /home/stack/network-environment.yaml
ERROR: Remote error: DBConnectionError (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '192.0.2.1' ([Errno 111] ECONNREFUSED)") [SQL: u'SELECT 1']
[u'

After that I can't even get the error logs:

heat resource-list --nested-depth 5 overcloud
An unexpected error prevented the server from fulfilling your request. (HTTP 500) (Request-ID: req-918d5aa7-2f53-4866-9397-90241d301972)
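
Because the converge step still runs after an earlier step has failed, one option is to have the CI script stop at the first failing step. The following is only a minimal sketch: `run_step` is a hypothetical helper name, not something the CI job already defines.

```shell
# Hypothetical helper (not part of the CI job): run one upgrade step,
# print its name first, and return non-zero if the command fails.
run_step() {
    local name="$1"
    shift
    echo "execute ${name}"
    if ! "$@"; then
        echo "step '${name}' failed, not running the remaining steps" >&2
        return 1
    fi
}
```

Each step could then be written as `run_step "keystone upgrade" openstack overcloud deploy ... || exit 1`, so a failed keystone or controller upgrade would end the run before converge is attempted.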

Tags: upgrade-bugs
Revision history for this message
Adriano Petrich (apetrich) wrote :

This is a non-HA deployment, by the way: 1 controller, 1 compute and 1 Ceph node.

For the keystone errors, here is everything we collect: https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-upgrade-major-liberty-to-mitaka-53/

Here, for example, is /var/log/cluster/corosync.log from the controller:

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-upgrade-major-liberty-to-mitaka-53/overcloud-controller-0/var/log/cluster/corosync.log.gz

Adriano Petrich (apetrich) wrote :

<bandini> apetrich: also the liberty cloud *must* have at least openstack-puppet-modules-7.1.2-1.el7ost installed on the overcloud. do we have that?
<bandini> this is due to https://bugzilla.redhat.com/show_bug.cgi?id=1347827
<apetrich> nope it has openstack-puppet-modules-7.1.2-0.20160711190332.3418a7c.el7.centos.noarch.rpm
<bandini> apetrich: ok so that is definitely needed, although I don't think it is at the root of this specific failure
<bandini> this one is weird, I never even see it trying to remove the openstack-keystone resource
* bandini digs some more
<apetrich> bandini, for liberty not even current has 7.1.2-1 :/

Adriano Petrich (apetrich) wrote :

On a closer look, I see that we have openstack-puppet-modules-8* installed, so the previous problem does not apply here.

Here are some better logs from a different failure:

heat resource-list --nested-depth 5 overcloud | grep FAIL
| AodhPostUpgradeDeployment | 0f467c87-64d6-49b4-8f60-21dd54a38396 | OS::Heat::SoftwareDeploymentGroup | DELETE_FAILED | 2016-08-01T15:45:49 | overcloud-UpdateWorkflow-5vgiolzyif2j |
| UpdateWorkflow | 8b1ea2b5-31e0-4af0-84bd-9f51499951e6 | OS::TripleO::Tasks::UpdateWorkflow | UPDATE_FAILED | 2016-08-01T16:00:01 | overcloud |
| allNodesConfig | 4dcb4af7-f18f-443d-9335-8f804ce7a8ba | OS::TripleO::AllNodes::SoftwareConfig | UPDATE_FAILED | 2016-08-01T16:00:05 | overcloud |
| CephClusterConfig | f3770bf2-ea63-49c6-af9c-bef31aadcde7 | OS::TripleO::CephClusterConfig::SoftwareConfig | UPDATE_FAILED | 2016-08-01T16:00:07 | overcloud |
| ControllerSwiftDeployment | 387fe262-23cf-48d7-951d-9d69de5d9937 | OS::Heat::StructuredDeployments | UPDATE_FAILED | 2016-08-01T16:00:09 | overcloud |
| ControllerBootstrapNodeDeployment | dd1c9e55-ebd4-4063-ae40-3fb5acb751a6 | OS::Heat::StructuredDeployments | UPDATE_FAILED | 2016-08-01T16:00:10 | overcloud |
| ControllerPacemakerUpgradeDeployment_Step1 | | OS::Heat::SoftwareDeploymentGroup | CREATE_FAILED | 2016-08-01T16:00:10 | overcloud-UpdateWorkflow-5vgiolzyif2j |
| ObjectStorageSwiftDeployment | 350b64b3-2fe2-44c3-a084-be3be5a31cbd | OS::Heat::StructuredDeployments | UPDATE_FAILED | 2016-08-01T16:00:11 | overcloud |
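
For a shorter view than the raw `grep FAIL` output, the table can be reduced to resource name and status. This is a rough sketch, assuming the column layout shown in the rows above (name in the first column, status in the fourth); `failed_resources` is a hypothetical helper name.

```shell
# Hypothetical helper: read `heat resource-list` table output on stdin and
# print "<resource_name> <status>" for every row whose status ends in _FAILED.
# Assumes the column order shown above: name first, status fourth.
failed_resources() {
    awk -F'|' '
        NF >= 6 {
            name = $2
            status = $5
            # Trim the padding spaces around each cell.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (status ~ /_FAILED$/) print name, status
        }'
}
```

Usage would be `heat resource-list --nested-depth 5 overcloud | failed_resources`. Rows with an empty physical ID, such as ControllerPacemakerUpgradeDeployment_Step1, are still reported because the cell separators are preserved.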

[stack@undercloud ~]$ heat resource-show overcloud UpdateWorkflow
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes | {} |
| creation_time | 2016-08-01T14:24:30 ...


Adriano Petrich (apetrich) wrote :

Looking at the heat-engine logs on the undercloud:

2016-08-01 14:20:37.315 16992 ERROR stevedore.extension [-] Could not load 'mistral': No module named mistralclient.api
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension [-] No module named mistralclient.api
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension Traceback (most recent call last):
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension File "/usr/lib/python2.7/site-packages/stevedore/extension.py", line 162, in _load_plugins
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension verify_requirements,
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension File "/usr/lib/python2.7/site-packages/stevedore/enabled.py", line 67, in _load_one_plugin
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension verify_requirements,
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension File "/usr/lib/python2.7/site-packages/stevedore/extension.py", line 183, in _load_one_plugin
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension plugin = ep.resolve()
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2235, in resolve
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension module = __import__(self.module_name, fromlist=['__name__'], level=0)
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension File "/usr/lib/python2.7/site-packages/heat/engine/clients/os/mistral.py", line 14, in <module>
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension from mistralclient.api import base as mistral_base
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension ImportError: No module named mistralclient.api
2016-08-01 14:20:37.315 16992 ERROR stevedore.extension
2016-08-01 14:20:37.340 16992 ERROR stevedore.extension [-] Could not load 'magnum': No module named magnumclient.openstack.common.apiclient
2016-08-01 14:20:37.340 16992 ERROR stevedore.extension [-] No module named magnumclient.openstack.common.apiclient
2016-08-01 14:20:37.340 16992 ERROR stevedore.extension Traceback (most recent call last):
2016-08-01 14:20:37.340 16992 ERROR stevedore.extension File "/usr/lib/python2.7/site-packages/stevedore/extension.py", line 162, in _load_plugins
2016-08-01 14:20:37.340 16992 ERROR stevedore.extension verify_requirements,
2016-08-01 14:20:37.340 16992 ERROR stevedore.extension File "/usr/lib/python2.7/site-packages/stevedore/enabled.py", line 67, in _load_one_plugin
2016-08-01 14:20:37.340 16992 ERROR stevedore.extension verify_requirements,
2016-08-01 14:20:37.340 16992 ERROR stevedore.extension File "/usr/lib/python2.7/site-packages/stevedore/extension.py", line 183, in _load_one_plugin
2016-08-01 14:20:37.340 16992 ERROR stevedore.extension plugin = ep.resolve()
2016-08-01 14:20:37.340 16992 ERROR stevedore.extension File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2235, in resolve
2016-08-01 14:20:37.340 16992 ERROR stevedore.extension module = __import__(self.module_name, fromlist=['__name__'], level=0)
2016-08-01 14:20:37.340 16992 ERROR stevedore.extension File "/usr/lib/python2.7/site-packages/heat/engine/clients/os/magnum.py", ...

tags: added: upgrade-bugs
Adriano Petrich (apetrich) wrote :

Adding the heat-engine log. The upgrade failed around 16:00.

mathieu bultel (mat-bultel) wrote :

Hi,

I think this issue is due to the undercloud mariadb upgrade.
I added the mysql migration to the RDO-CI upgrade role.
It looks good in my test; I will re-kick the periodic jobs later today.

Changed in tripleo:
assignee: nobody → mbu (mat-bultel)
Adriano Petrich (apetrich) wrote :

The upgrade still fails with the mariadb fix.

I also applied the fix from https://bugzilla.redhat.com/show_bug.cgi?id=1366392 and it is still broken.

It failed on the script delivery step.

heat stack-list shows that two steps failed:

| UpdateWorkflow | 2c761e90-1618-45bf-b419-d6e7cebe247a | OS::TripleO::Tasks::UpdateWorkflow | UPDATE_FAILED | 2016-08-15T13:20:36 |
| ControllerAllNodesDeployment | ee96879d-285a-499b-a3c2-a07a50d6d24c | OS::Heat::StructuredDeployments | UPDATE_FAILED | 2016-08-15T13:20:52 |
[stack@undercloud ~]$ heat resource-show overcloud UpdateWorkflow
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes | {} |
| creation_time | 2016-08-15T10:56:35 |
| description | |
| links | http://192.0.2.1:8004/v1/d63c071ed3414007a1e8298d892b0a07/stacks/overcloud/e67e2c37-eed8-4289-9bca-3380f7c42aa5/resources/UpdateWorkflow (self) |
| | http://192.0.2.1:8004/v1/d63c071ed3414007a1e8298d892b0a07/stacks/overcloud/e67e2c37-eed8-4289-9bca-3380f7c42aa5 (stack) |
| | http://192.0.2.1:8004/v1/d63c071ed3414007a1e8298d892b0a07/stacks/overcloud-UpdateWorkflow-fn2ipw2bdyuf/2c761e90-1618-45bf-b419-d6e7cebe247a (nested) |
| logical_resource_id | UpdateWorkflow |
| physical_resource_id | 2c761e90-1618-45bf-b419-d6e7cebe247a |
| required_by | AllNodesExtraConfig |
| resource_name | UpdateWorkflow |
| resource_status | UPDATE_FAILED |
| resource_status_reason | resourc...


Emilien Macchi (emilienm) wrote :

It sounds like a duplicate of 1612642; feel free to cancel it if not.

Adriano Petrich (apetrich) wrote :

No, I don't think it is a duplicate of that one.

Anyway, mbu found out that the undercloud was running out of memory and killing mariadb. I got it to pass by increasing the undercloud memory to 16 GB.
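
For anyone hitting the same symptoms, a quick way to check whether the kernel OOM killer took down the database is to search the kernel log on the undercloud. This is a sketch under two assumptions that are not from the logs above: the kernel uses its usual "Killed process <pid> (<name>)" wording, and the MariaDB server process is named mysqld. `oom_kills` is a hypothetical helper name.

```shell
# Hypothetical helper: read a kernel log on stdin and print the lines that
# record the OOM killer killing a process with the given name.
# Assumes the "Killed process <pid> (<name>)" message format.
oom_kills() {
    grep -E "Killed process [0-9]+ \($1\)"
}
```

It could be run as `dmesg | oom_kills mysqld` (or with `journalctl -k` as the input). No output and a non-zero exit status mean no matching kill was found in that log.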

Changed in tripleo:
assignee: mbu (mat-bultel) → Adriano Petrich (apetrich)
status: New → Fix Committed
Changed in tripleo:
status: Fix Committed → Fix Released