mysql_init_bundle fails to start on Ussuri

Bug #1913472 reported by Harry Kominos
This bug affects 2 people
Affects: tripleo
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

We have a TripleO Ussuri deployment that was deployed around May.
Until now the cloud has been operating fine and we have been able to expand without issues:
3 controllers, 34 computes, and external Ceph.
Yesterday we tried to add a few new compute nodes, but the installation does not get further than step 2,

with an error in the Ansible log:

2021-01-26 23:22:41,647 p=972858 u=stack n=ansible | fatal: [controller-0]: FAILED! => changed=false msg: '[''mysql_init_bundle''] failed to start, check logs in /var/log/containers/stdouts/'

I don't really see anything in the mysql_init log.

The only thing I see is that it seems to be failing on

bash-4.4# puppet apply --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation,galera_ready,mysql_database,mysql_grant,mysql_user -e 'noop_resource('\''package'\''); include tripleo::profile::base::pacemaker;include tripleo::profile::pacemaker::database::mysql_bundle'

but when I run it manually within the container it does not seem to return an exit code and just hangs. (I don't know if I am running it correctly, though.)
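To stop a manual run like this from blocking the shell indefinitely, the command can be wrapped in GNU timeout, so a hang surfaces as exit status 124 instead of an open-ended wait. This is a hypothetical sketch; the sleep 10 below stands in for the full puppet apply invocation above:

```shell
#!/bin/sh
# Hypothetical sketch: run a possibly-hanging command under `timeout`.
# GNU timeout kills the child and returns 124 when the limit is exceeded.
run_with_timeout() {
  timeout "$1" sh -c "$2"
  rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "command hung: killed after ${1}s"
  else
    echo "command exited with code $rc"
  fi
}

# `sleep 10` stands in for the full `puppet apply ...` command above;
# on a controller you would pass that command string with a limit of a
# few minutes instead of 2 seconds.
run_with_timeout 2 "sleep 10"
```

With a generous limit, a 124 result distinguishes "puppet apply hangs" from "puppet apply fails with a real exit code", which is useful information for the bug report.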

The overcloud command looks like this:

openstack overcloud deploy --templates ~/templates --stack-only \
  -e /home/stack/containers-prepare-parameter.yaml \
  -e /home/stack/templates/node-info.yaml \
  -r /home/stack/templates/roles_data.yaml \
  -n /home/stack/templates/network_data.yaml \
  -e /home/stack/templates/environments/network-environment-OVS.yaml \
  -e /home/stack/templates/environments/network-isolation.yaml \
  -e /home/stack/templates/environments/ceph-ansible/ceph-ansible-external.yaml \
  -e /home/stack/templates/ceph-config.yaml \
  -e /home/stack/templates/environments/docker-ha.yaml \
  -e /home/stack/templates/environments/ssl/enable-tls.yaml \
  -e /home/stack/templates/environments/ssl/inject-trust-anchor-hiera.yaml \
  -e /home/stack/templates/environments/ssl/inject-trust-anchor.yaml \
  -e /home/stack/templates/environments/ssl/tls-endpoints-public-dns.yaml \
  -e /home/stack/templates/environments/predictable-placement/custom-domain.yaml \
  -e /home/stack/templates/cloudname.yaml \
  -e /home/stack/templates/environments/manila-cephfsnative-config.yaml \
  -e /home/stack/templates/environments/ceph-ansible/ceph-mds.yaml \
  -e /home/stack/templates/manila-cephfsnative-config.yaml \
  -e /home/stack/templates/environments/enable-legacy-telemetry.yaml \
  -e /home/stack/templates/environments/neutron-ovs-dvr.yaml \
  -e /home/stack/templates/environments/services/octavia.yaml \
  -e /home/stack/templates/overcloud_dashboard_hardening.yaml \
  -e /home/stack/templates/novafixes.yaml \
  --timeout 1500

Revision history for this message
Harry Kominos (hkominos) wrote :

Also attaching the last lines from ansible.log

Revision history for this message
Harry Kominos (hkominos) wrote :

Also attaching mysql init log file

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Check the running containers. Usually when you hit this, there will be running db sync containers. If you check the logs for those, you will see that they cannot communicate with the database. Please check your VIP and network configurations.
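A hedged sketch of that check (the db_sync name filter matches TripleO's usual container naming, e.g. nova_db_sync, but verify against your own podman ps output):

```shell
#!/bin/sh
# Sketch: from `podman ps --format '{{.Names}}'` output, pick db-sync
# containers whose logs should be checked for database connection errors.
pick_db_sync() {
  grep -i 'db_sync' || true
}

# On a real controller (hypothetical usage):
#   podman ps --format '{{.Names}}' | pick_db_sync | while read -r name; do
#     echo "== last 50 log lines for $name =="
#     podman logs --tail 50 "$name"
#   done

# Canned demonstration of the filter (note swift_rsync does not match):
printf 'nova_db_sync\nswift_rsync\nkeystone_db_sync\n' | pick_db_sync
```

If any matched container's logs show repeated connection errors to the database VIP, that points at the network/VIP problem rather than at mysql_init_bundle itself.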

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Also check your podman version. It should be 1.6.x. If you have 2.0.x, you need to enable container-tools:2.0 and downgrade podman.
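The version gate Alex describes can be sketched as a small, hypothetical helper (the remediation message simply repeats his advice; the actual downgrade procedure should be checked against the distro's container-tools module documentation):

```shell
#!/bin/sh
# Hypothetical helper: decide from a podman version string whether the
# installed podman matches the 1.6.x series expected here.
check_podman_version() {
  case "$1" in
    1.6.*) echo "OK: podman $1" ;;
    *)     echo "MISMATCH: podman $1, enable container-tools:2.0 and downgrade" ;;
  esac
}

# On a real host (hypothetical usage):
#   check_podman_version "$(podman --version | awk '{print $3}')"
check_podman_version "1.6.4"
check_podman_version "2.0.5"
```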

Revision history for this message
Harry Kominos (hkominos) wrote :

podman --version
podman version 1.6.4

No db sync containers are running (or have run in a while):

[root@controller-0 heat-admin]# podman ps | grep sync
fd7f9e4c5a94  under-ussuri02.ctlplane.example.com:8787/tripleou/centos-binary-swift-object:current-tripleo  kolla_start  6 months ago  Up 29 hours ago  swift_rsync

https://paste.centos.org/view/934608fe

The VIP is pingable from the controller, pcs reports no issues, and the overcloud is working fine.
I will do a more thorough networking check tomorrow as per your instructions.
Is there any other log that might be of interest?

The only thing I might be seeing is in /var/log/containers/mysql/mysqld:
https://paste.centos.org/view/1fe4cdf9

Are these warnings about connection drops relevant?
Any other pointers would be helpful.

Revision history for this message
Vasileios Baousis (bbaous) wrote :

We work together with the reporter.

Further to the above actions, we performed the following:

1. The same container can be started/stopped on the other controllers without any problem. The command executes and returns cleanly (the puppet configuration seems to be the same).

Correct execution (controller-1):
puppet apply --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation,galera_ready,mysql_database,mysql_grant,mysql_user -e 'noop_resource('\''package'\''); include tripleo::profile::base::pacemaker;include tripleo::profile::pacemaker::database::mysql_bundle'
...
..

2021-02-06T14:00:30.782993941+00:00 stdout F Info: Loading facts
2021-02-06T14:00:30.783025041+00:00 stdout F Info: Loading facts
2021-02-06T14:00:30.783054016+00:00 stdout F Info: Loading facts
2021-02-06T14:00:38.312934917+00:00 stderr F Warning: Found multiple default providers for service: swiftinit, base, pacemaker, pacemaker_xml; using swiftinit
2021-02-06T14:00:39.492883048+00:00 stderr F Warning: /etc/puppet/hiera.yaml: Use of 'hiera.yaml' version 3 is deprecated. It should be converted to version 5
2021-02-06T14:00:39.492883048+00:00 stderr F (file: /etc/puppet/hiera.yaml)
2021-02-06T14:00:39.493806071+00:00 stderr F Warning: Undefined variable '::deploy_config_name';
2021-02-06T14:00:39.493806071+00:00 stderr F (file & line not available)
2021-02-06T14:00:39.524991159+00:00 stderr F Warning: The function 'hiera' is deprecated in favor of using 'lookup'. See https://puppet.com/docs/puppet/6.14/deprecated_language.html
2021-02-06T14:00:39.524991159+00:00 stderr F (file & line not available)
2021-02-06T14:00:39.818677831+00:00 stderr F Warning: This method is deprecated, please use match expressions with Stdlib::Compat::Array instead. They are described at https://docs.puppet.com/puppet/latest/reference/lang_data_type.html#match-expressions. at ["/etc/puppet/modules/tripleo/manifests/profile/pacemaker/database/mysql_bundle.pp", 214]:["unknown", 1]
2021-02-06T14:00:39.818677831+00:00 stderr F (location: /etc/puppet/modules/stdlib/lib/puppet/functions/deprecation.rb:34:in `deprecation')
2021-02-06T14:00:39.968930710+00:00 stdout F Notice: Compiled catalog for controller-1.example.com in environment production in 0.50 seconds
2021-02-06T14:00:40.043804586+00:00 stdout F Info: Applying configuration version '1612620039'
2021-02-06T14:00:40.160360183+00:00 stdout F Notice: Applied catalog in 0.12 seconds
2021-02-06T14:00:40.160892547+00:00 stdout F Changes:
2021-02-06T14:00:40.160892547+00:00 stdout F Events:
2021-02-06T14:00:40.160892547+00:00 stdout F Resources:
2021-02-06T14:00:40.160892547+00:00 stdout F Skipped: 24
2021-02-06T14:00:40.160892547+00:00 stdout F Total: 31
2021-02-06T14:00:40.160892547+00:00 stdout F Time:
2021-02-06T14:00:40.160892547+00:00 stdout F File line: 0.00
2021-02-06T14:00:40.160892547+00:00 stdout F File: 0.00
2021-0...


Revision history for this message
Harry Kominos (hkominos) wrote :

So it appears that the issue is with crm_node -l within the container: that process hangs, and the container therefore waits forever.
If I kill the process (which might not be the right thing to do anyway), the container exits properly.
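One hedged way to confirm that hang without leaving a stuck process behind is to probe crm_node -l under a time limit (this sketch assumes crm_node and GNU timeout are available in the container; the 10-second limit is arbitrary):

```shell
#!/bin/sh
# Sketch: probe `crm_node -l` under a time limit; exit status 124 from
# timeout means the command hung and was killed, anything else is a
# real exit code from crm_node.
probe_crm_node() {
  if ! command -v crm_node >/dev/null 2>&1; then
    echo "crm_node not installed here"
    return 0
  fi
  timeout 10 crm_node -l
  rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "crm_node -l hung (killed after 10s)"
  else
    echo "crm_node -l exited with code $rc"
  fi
}

probe_crm_node
```

Running this inside the mysql_init_bundle container on the affected controller, versus a healthy one, should show the hang reproducibly without having to kill processes by hand.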
