[ocata to pike] upgrade failing at step3 during dbsync

Bug #1724636 reported by Emilien Macchi on 2017-10-18
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: Sofer Athlan-Guyot

Bug Description

We are testing upgrades in the upstream gate, and they don't work yet.
We now reach step 3, but it fails during the dbsyncs.

Failure during upgrade:
http://logs.openstack.org/25/500625/15/check/legacy-tripleo-ci-centos-7-containers-multinode-upgrades/2a547a7/logs/undercloud/home/zuul/overcloud_upgrade_console.log.txt.gz#_2017-10-18_15_49_11

In http://logs.openstack.org/25/500625/15/check/legacy-tripleo-ci-centos-7-containers-multinode-upgrades/2a547a7/logs/subnode-2/var/log/messages.txt.gz :

Oct 18 15:48:38 centos-7-rax-ord-0000291272 os-collect-config: "Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Failed to call refresh: cinder-manage db sync returned 1 instead of one of [0]",

Confirmed here:
http://logs.openstack.org/25/500625/15/check/legacy-tripleo-ci-centos-7-containers-multinode-upgrades/2a547a7/logs/subnode-2/var/log/cinder/cinder-manage.log.txt.gz#_2017-10-18_15_45_50_862

DBError: (pymysql.err.InternalError) (1018, u'Can\'t read dir of \'./cinder/\' (errno: 13 "Permission denied")') [SQL: u'SHOW FULL TABLES FROM `cinder`']

Same for Heat (and probably others, but haven't reached that step yet).

Alex Schultz (alex-schultz) wrote :

This is likely from the user ID change as we go from non-containerized mariadb to containerized mariadb. I think the containerized version uses a different UID and so the existing ownership might need to change.

Michele Baldessari (michele) wrote :

NB: Until we fix https://bugs.launchpad.net/tripleo/+bug/1713007 in pike, upgrades just won't work (see also https://bugzilla.redhat.com/show_bug.cgi?id=1475404 for some more info).

I am totally surprised that galera is even somewhat up and running (as it normally just stays in slave mode without those fixes).

I'll look some more at this tomorrow with Damien.

Damien Ciabrini (dciabrin) wrote :

I wonder if it's a dup of https://bugs.launchpad.net/tripleo/+bug/1701485

I have to check the logs, but as said in comment #1, this looks like a
discrepancy between the mysql process which is running and the permissions
of /var/lib/mysql on disk.

So it's either the new containerized mysql service that tries to access the
DB on disk before it has been chowned to kolla's mysql uid, or it's the old
galera process which failed to stop during the upgrade and which cannot
access the DB on disk anymore.
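
For illustration, a minimal sketch of how the ownership side of this hypothesis could be checked on a node: walk the data directory and list every entry whose owner differs from the uid the database is expected to run as. The helper name `find_foreign_owned` is hypothetical, and the expected uid has to be supplied by hand, since the uid of kolla's mysql user is not stated in this report.

```python
import os
from typing import Iterator


def find_foreign_owned(root: str, expected_uid: int) -> Iterator[str]:
    """Yield paths under root (root included) not owned by expected_uid."""
    if os.lstat(root).st_uid != expected_uid:
        yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat checks a symlink itself rather than its target.
            if os.lstat(path).st_uid != expected_uid:
                yield path
```

Calling `list(find_foreign_owned("/var/lib/mysql", uid))` with the uid of the running database process would return an empty list if ownership is consistent, and the offending paths (for example the `cinder` directory) otherwise.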

Damien Ciabrini (dciabrin) wrote :

OK, so when looking at http://logs.openstack.org/25/500625/15/check/legacy-tripleo-ci-centos-7-containers-multinode-upgrades/2a547a7/logs/subnode-2/var/log/messages.txt.gz, I don't see any mention of a container "mysql_data_ownership", which should have been started to chown /var/lib/mysql for containers. It has not been run.

Likewise, looking at http://logs.openstack.org/25/500625/15/check/legacy-tripleo-ci-centos-7-containers-multinode-upgrades/2a547a7/logs/subnode-2/var/log/cluster/corosync.log.txt.gz, I don't see any mention of pacemaker starting any containerized galera on its own.
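
The kind of log check described above can be sketched as follows, assuming a local copy of the log file (plain or gzip-compressed). The helper name `lines_mentioning` is hypothetical; the only string taken from this report is the container name `mysql_data_ownership`.

```python
import gzip
from typing import List


def lines_mentioning(log_path: str, needle: str) -> List[str]:
    """Return the lines of a log file that contain needle."""
    # Pick the opener from the file extension; both accept text mode.
    opener = gzip.open if log_path.endswith(".gz") else open
    with opener(log_path, "rt", encoding="utf-8", errors="replace") as log:
        return [line.rstrip("\n") for line in log if needle in line]
```

An empty result for `lines_mentioning("messages.txt.gz", "mysql_data_ownership")` would match the observation that the ownership container was never started.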

So I think this upgrade job is not doing the thing it is intended to? If it's a multinode upgrade, it should use the containerized pacemaker settings, i.e. the contents of docker-ha.yaml should be passed to the deploy command.

Emilien Macchi (emilienm) wrote :

Damien, the command used to upgrade is the following:

openstack overcloud deploy --templates tripleo-heat-templates --libvirt-type qemu --timeout 80 -e /home/zuul/cloud-names.yaml -e /home/zuul/tripleo-heat-templates/environments/deployed-server-environment.yaml -e /home/zuul/tripleo-heat-templates/environments/deployed-server-bootstrap-environment-centos.yaml --overcloud-ssh-user zuul -e /home/zuul/tripleo-heat-templates/ci/environments/multinode.yaml -e /home/zuul/tripleo-heat-templates/environments/low-memory-usage.yaml -e /opt/stack/new/tripleo-ci/test-environments/worker-config.yaml -e /home/zuul/tripleo-heat-templates/environments/debug.yaml --validation-errors-nonfatal --roles-file /home/zuul/overcloud_roles.yaml --compute-scale 0 -e tripleo-heat-templates/environments/docker.yaml -e tripleo-heat-templates/ci/environments/multinode-containers.yaml -e tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml -e /home/zuul/containers-default-parameters.yaml -e /home/zuul/overcloud-repo.yaml

Source: http://logs.openstack.org/25/500625/15/check/legacy-tripleo-ci-centos-7-containers-multinode-upgrades/2a547a7/logs/undercloud/home/zuul/overcloud_upgrade_console.log.txt.gz#_2017-10-18_15_14_36
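
For illustration, a minimal sketch of how to check which environment files a deploy command passes, for example whether docker-ha.yaml is among them. The helper names `environment_files` and `uses_environment` are hypothetical, and the sketch only looks at the `-e` flag as it appears in the command above.

```python
import os
import shlex
from typing import List


def environment_files(command: str) -> List[str]:
    """Return the arguments that follow each -e flag, in order."""
    args = shlex.split(command)
    return [args[i + 1] for i, arg in enumerate(args[:-1]) if arg == "-e"]


def uses_environment(command: str, basename: str) -> bool:
    """True if any -e file in the command has the given file name."""
    return any(
        os.path.basename(path) == basename
        for path in environment_files(command)
    )
```

Applied to the command above, `uses_environment(cmd, "docker-ha.yaml")` is false, while `uses_environment(cmd, "multinode-containers.yaml")` is true.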

Damien Ciabrini (dciabrin) wrote :

Thanks Emilien, it seems like tripleo-heat-templates/ci/environments/multinode-containers.yaml is meant to configure the proper containerized pacemaker services.

So I'm not really sure why I don't see any call to the new containerized services. I need to replicate the deployment locally to investigate more.

Changed in tripleo:
milestone: queens-1 → queens-2

Wondering whether that could be related[1] or not at all. Triggering a new job to check if it helps, now that it has merged.

[1] https://bugs.launchpad.net/tripleo/+bug/1730349

Wrong branch, cherry pick and add depends-on.

Fix proposed to branch: master
Review: https://review.openstack.org/518578

Changed in tripleo:
assignee: nobody → Sofer Athlan-Guyot (sofer-athlan-guyot)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/518578
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=f48e11ee2715277bd1fe6c32580cc41a0de6ce4e
Submitter: Zuul
Branch: master

commit f48e11ee2715277bd1fe6c32580cc41a0de6ce4e
Author: Sofer Athlan-Guyot <email address hidden>
Date: Wed Nov 8 17:46:35 2017 +0100

    Make sure /var/lib/mysql rights are setup correctly.

    If you do an upgrade on then bootstrap[1] is not run, so you have to
    make sure the permission are setup right every time.

    This is duplicating what is happening in the pacemaker mysql template[2]

    Partial-Bug: #1724636

    [1] https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/database/mysql.yaml#L128
    [2] https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/pacemaker/database/mysql.yaml#L162

    Change-Id: Ib224dd10361171dfd579867be35a2c67a71fd9d5

Reviewed: https://review.openstack.org/518579
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9ca5d97476da2d394fbf2485f7e6855fbbbac693
Submitter: Zuul
Branch: stable/pike

commit 9ca5d97476da2d394fbf2485f7e6855fbbbac693
Author: Sofer Athlan-Guyot <email address hidden>
Date: Wed Nov 8 17:46:35 2017 +0100

    Make sure /var/lib/mysql rights are setup correctly.

    If you do an upgrade on then bootstrap[1] is not run, so you have to
    make sure the permission are setup right every time.

    This is duplicating what is happening in the pacemaker mysql template[2]

    Partial-Bug: #1724636

    [1] https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/database/mysql.yaml#L128
    [2] https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/pacemaker/database/mysql.yaml#L162

    Change-Id: Ib224dd10361171dfd579867be35a2c67a71fd9d5
    (cherry picked from commit f48e11ee2715277bd1fe6c32580cc41a0de6ce4e)

tags: added: in-stable-pike
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Jose Luis Franco (jfrancoa) wrote :

Is this still happening? I guess it's still open because the submitted patch was tagged Partial-Bug, so merging it didn't close the bug. @Sofer, is there anything missing on this, or could we just close it?

Changed in tripleo:
milestone: rocky-1 → rocky-2

This has been closed by fixing the CI workflow. The wrong containers were used.

Changed in tripleo:
status: In Progress → Fix Released