Ocata -> Pike pacemaker enabled overcloud fails during db_syncs because /var/lib/mysql/ is not accessible by mysql

Bug #1701485 reported by Marius Cornea
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: (none shown)

Bug Description

Ocata -> Pike pacemaker enabled overcloud upgrade fails during db_syncs because /var/lib/mysql/ is not accessible by mysql:

We can see that mysql is running as the galera resource, but the ownership of /var/lib/mysql/ has been changed, so when the db_syncs run the mysql process cannot access the databases:

[root@overcloud-controller-0 mysql]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.16-10.el7-94ff4df) - partition with quorum
Last updated: Fri Jun 30 08:49:38 2017
Last change: Thu Jun 29 22:31:14 2017 by root via cibadmin on overcloud-controller-0

1 node configured
15 resources configured

              *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

Online: [ overcloud-controller-0 ]

Full list of resources:

 Master/Slave Set: galera-master [galera] (unmanaged)
     galera (ocf::heartbeat:galera): Master overcloud-controller-0 (unmanaged)
 Clone Set: rabbitmq-clone [rabbitmq] (unmanaged)
     Stopped: [ overcloud-controller-0 ]
 Master/Slave Set: redis-master [redis] (unmanaged)
     redis (ocf::heartbeat:redis): Master overcloud-controller-0 (unmanaged)
 ip-192.168.0.15 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (unmanaged)
 ip-172.16.18.25 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (unmanaged)
 ip-10.0.0.18 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (unmanaged)
 ip-10.0.0.16 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (unmanaged)
 ip-10.0.0.143 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (unmanaged)
 ip-10.0.1.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (unmanaged)
 Clone Set: haproxy-clone [haproxy] (unmanaged)
     haproxy (systemd:haproxy): Started overcloud-controller-0 (unmanaged)
 openstack-cinder-volume (systemd:openstack-cinder-volume): Started overcloud-controller-0 (unmanaged)
 Docker container: rabbitmq-bundle [192.168.0.1:8787/tripleoupstream/centos-binary-rabbitmq:latest] (unmanaged)
   rabbitmq-bundle-docker-0 (ocf::heartbeat:docker): Stopped (unmanaged)
 Docker container: galera-bundle [192.168.0.1:8787/tripleoupstream/centos-binary-mariadb:latest] (unmanaged)
   galera-bundle-docker-0 (ocf::heartbeat:docker): Stopped (unmanaged)
 Docker container: redis-bundle [192.168.0.1:8787/tripleoupstream/centos-binary-redis:latest] (unmanaged)
   redis-bundle-docker-0 (ocf::heartbeat:docker): Stopped (unmanaged)
 Docker container: haproxy-bundle [192.168.0.1:8787/tripleoupstream/centos-binary-haproxy:latest] (unmanaged)
   haproxy-bundle-docker-0 (ocf::heartbeat:docker): Stopped (unmanaged)

Failed Actions:
* rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=84, status=complete, exitreason='none',
    last-rc-change='Thu Jun 29 22:24:42 2017', queued=1ms, exec=2463ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@overcloud-controller-0 mysql]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8bdafcdcc281 192.168.0.1:8787/tripleoupstream/centos-binary-keystone:latest "kolla_start" 50 minutes ago Up 50 minutes keystone
3ff5ce1a6eb8 192.168.0.1:8787/tripleoupstream/centos-binary-iscsid:latest "kolla_start" 50 minutes ago Up 50 minutes iscsid
a26243551a83 192.168.0.1:8787/tripleoupstream/centos-binary-nova-placement-api:latest "kolla_start" 50 minutes ago Up 50 minutes nova_placement
c74d04057579 192.168.0.1:8787/tripleoupstream/centos-binary-horizon:latest "kolla_start" 50 minutes ago Up 50 minutes horizon
317a4d2cc59d 192.168.0.1:8787/tripleoupstream/centos-binary-mariadb:latest "kolla_start" 10 hours ago Up 10 hours clustercheck
d2b83628167e 192.168.0.1:8787/tripleoupstream/centos-binary-mongodb:latest "kolla_start" 10 hours ago Up 10 hours mongodb
55de71d6b5d5 192.168.0.1:8787/tripleoupstream/centos-binary-memcached:latest "/bin/bash -c 'source" 10 hours ago Up 10 hours memcached

[root@overcloud-controller-0 mysql]# ls -ld /var/lib/mysql/
drwxr-xr-x. 17 42434 42434 4096 Jun 29 22:23 /var/lib/mysql/

[root@overcloud-controller-0 mysql]# docker logs --tail 2 nova_api_db_sync
DBError: (pymysql.err.InternalError) (1018, u'Can\'t read dir of \'./nova_api/\' (errno: 13 "Permission denied")') [SQL: u'SHOW FULL TABLES FROM `nova_api`']
[root@overcloud-controller-0 mysql]# docker logs --tail 2 heat_engine_db_sync
2017-06-30 07:58:55.927 11 ERROR oslo_db.sqlalchemy.exc_filters
ERROR: (pymysql.err.InternalError) (1018, u'Can\'t read dir of \'./heat/\' (errno: 13 "Permission denied")') [SQL: u'SHOW FULL TABLES FROM `heat`']

[root@overcloud-controller-0 mysql]# mysql -e 'show tables' keystone;
ERROR 1018 (HY000) at line 1: Can't read dir of './keystone/' (errno: 13 "Permission denied")
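
For anyone trying to confirm the same symptom, the ownership check can be scripted. This is only a minimal sketch: check_owner is a hypothetical helper, not part of TripleO or pacemaker, and it assumes GNU stat (coreutils). The expected value passed in is an assumption too. On the host it would normally be the uid:gid of the local mysql user, whereas the listing above shows 42434:42434.

```shell
#!/bin/sh
# check_owner DIR UID:GID   (hypothetical helper, for illustration only)
# Prints whether DIR's numeric owner matches the expected UID:GID pair.
# Returns 0 on a match, 1 on a mismatch, 2 if DIR cannot be inspected.
# Relies on GNU stat's -c option.
check_owner() {
    dir=$1
    expected=$2
    actual=$(stat -c '%u:%g' "$dir") || return 2
    if [ "$actual" = "$expected" ]; then
        echo "ok: $dir is owned by $actual"
    else
        echo "mismatch: $dir is owned by $actual, expected $expected"
        return 1
    fi
}

# Possible usage on the controller (assumes a local "mysql" user exists):
# check_owner /var/lib/mysql "$(id -u mysql):$(id -g mysql)"
```

A mismatch reported here would be consistent with the "Permission denied" (errno 13) errors above, but it does not by itself show which upgrade step changed the ownership.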

Marius Cornea (mcornea) wrote :

Attaching the sosreport.

Bogdan Dobrelya (bogdando) wrote :

This should be addressed in the scope of https://review.openstack.org/#/c/478530/

tags: added: containers
Martin André (mandre) wrote :

I'm not 100% confident this is a duplicate of bug #1697917. Here the issue seems to be related to the ownership of /var/lib/mysql/.

Normally, the uid/gid are "fixed" with https://github.com/openstack/tripleo-heat-templates/blob/0de13ab13d327eebe307df36c9d02f9792521fdb/docker/services/pacemaker/database/mysql.yaml#L110-L119

Is it possible the upgrade from O to P takes a different path?

Changed in tripleo:
status: New → Triaged
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/480202

Marios Andreou (marios-b) wrote :

Note: a related problem is discussed at https://bugzilla.redhat.com/show_bug.cgi?id=1466745

Changed in tripleo:
importance: Critical → High
Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/480202
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=cdc3477b7780c751807fe18c3767762189cf03b4
Submitter: Jenkins
Branch: master

commit cdc3477b7780c751807fe18c3767762189cf03b4
Author: marios <email address hidden>
Date: Tue Jul 4 16:52:26 2017 +0300

    Remove non-containerized pacemaker resources on upgrade

    Adds upgrade_tasks to remove the pacemaker resources using the
    ansible-pacemaker module.

    Resources are disabled and removed in step2 (called only on
    bootstrap node) and then the cluster stop is moved to step3

    The existing systemd/service call is kept but only to disable
    services after they are disabled/deleted from the cluster.

    Related-Bug: 1701485
    Co-Authored-By: Damien Ciabrini <email address hidden>
    Change-Id: Ia597d240ea5834c50a8f6c4fac0b6ed417b8535c

Changed in tripleo:
milestone: pike-3 → pike-rc1
Changed in tripleo:
milestone: pike-rc1 → pike-rc2
Changed in tripleo:
milestone: pike-rc2 → queens-1
Changed in tripleo:
milestone: queens-1 → queens-2
Changed in tripleo:
status: In Progress → Fix Committed
Changed in tripleo:
status: Fix Committed → Fix Released