mixed deployment fails resources.ControllerServicesBaseDeployment_Step2 with status code: 6

Bug #1623673 reported by Adriano Petrich
This bug affects 1 person
Affects  Status   Importance  Assigned to      Milestone
tripleo  Invalid  High        Adriano Petrich

Bug Description

This is a mixed deployment bug.

I installed a Mitaka undercloud (UC) and overcloud (OC) using tripleo-quickstart, then upgraded only the undercloud to master.

All of that works fine. To test whether we can still operate on the overcloud, I run a stack update that makes a simple DNS change on the overcloud:

openstack overcloud deploy --templates tripleo-heat-templates \
    -e tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
    -e tripleo-heat-templates/environments/network-isolation.yaml \
    -e tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
    -e ~/network-environment.yaml \
    -e update-heat-stack.yaml

That fails with:

| ControllerNodesPostDeployment | 709aa15f-add2-4c89-a0f2-a249d61ae9f0 | OS::TripleO::ControllerPostDeployment | UPDATE_FAILED | 2016-09-14T20:28:31 |

Running resource-show on that resource gets me:

+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes | {} |
| creation_time | 2016-09-14T19:04:31 |
| description | |
| links | http://192.0.2.1:8004/v1/bea425713ccc4bfb9cf31095475655e1/stacks/overcloud/44c103f3-4392-4cad-88e0-7926b023fd9c/resources/ControllerNodesPostDeployment (self) |
| | http://192.0.2.1:8004/v1/bea425713ccc4bfb9cf31095475655e1/stacks/overcloud/44c103f3-4392-4cad-88e0-7926b023fd9c (stack) |
| | http://192.0.2.1:8004/v1/bea425713ccc4bfb9cf31095475655e1/stacks/overcloud-ControllerNodesPostDeployment-toa4hdclch6i/709aa15f-add2-4c89-a0f2-a249d61ae9f0 (nested) |
| logical_resource_id | ControllerNodesPostDeployment |
| physical_resource_id | 709aa15f-add2-4c89-a0f2-a249d61ae9f0 |
| required_by | BlockStorageNodesPostDeployment |
| | CephStorageNodesPostDeployment |
| resource_name | ControllerNodesPostDeployment |
| resource_status | UPDATE_FAILED |
| resource_status_reason | resources.ControllerNodesPostDeployment: resources.ControllerServicesBaseDeployment_Step2: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6 |
| resource_type | OS::TripleO::ControllerPostDeployment |
| updated_time | 2016-09-14T20:28:31 |
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
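
A note on the status code itself: if the failing step runs `puppet apply --detailed-exitcodes` (an assumption; this report does not show the exact command the deployment hook runs), then 6 is one of Puppet's documented detailed exit codes rather than an arbitrary error number. A minimal Python sketch of that mapping, with illustrative helper names that do not come from TripleO:

```python
# Documented meanings of `puppet apply --detailed-exitcodes` return values.
# Assumption: the failing deployment step invokes Puppet with that flag.
PUPPET_DETAILED_EXIT_CODES = {
    0: "no changes, no failures",
    1: "run failed",
    2: "changes applied, no failures",
    4: "no changes, some resources failed",
    6: "changes applied and some resources failed",
}


def describe_puppet_exit(code):
    """Return a human-readable meaning for a detailed Puppet exit code."""
    return PUPPET_DETAILED_EXIT_CODES.get(code, "unknown exit code")


def is_failure(code):
    """Only 0 (nothing to do) and 2 (changes applied cleanly) are successes."""
    return code not in (0, 2)


if __name__ == "__main__":
    print(6, "->", describe_puppet_exit(6), "| failure:", is_failure(6))
```

Read this way, code 6 would mean the catalog applied some changes but at least one resource failed, which points at a single failing resource on the node rather than a catalog that could not run at all.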

#### files associated with that call

[stack@undercloud ~]$ cat network-environment.yaml
# note the EC2MetadataIp: value should match the value in
# /home/stack/undercloud.conf for key local_ip
# # EC2MetadataIp: "local_ip"
#
# #ControlPlaneSubnetCidr: key name must match the keyname found in
# #/usr/share/openstack-tripleo-heat-templates/network/config/$type/node.yaml

parameter_defaults:
  InternalApiNetCidr: 172.16.20.0/24
  StorageNetCidr: 172.16.21.0/24
  TenantNetCidr: 172.16.22.0/24
  ExternalNetCidr: 172.16.23.0/24
  InternalApiAllocationPools: [{'start': '172.16.20.10', 'end': '172.16.20.100'}]
  StorageAllocationPools: [{'start': '172.16.21.10', 'end': '172.16.21.100'}]
  TenantAllocationPools: [{'start': '172.16.22.10', 'end': '172.16.22.100'}]
  ExternalAllocationPools: [{'start': '172.16.23.110', 'end': '172.16.23.150'}]
  ExternalInterfaceDefaultRoute: 172.16.23.1
  NeutronExternalNetworkBridge: "''"
  ControlPlaneSubnetCidr: "24"
  ControlPlaneDefaultRoute: 192.0.2.1
  EC2MetadataIp: 192.0.2.1
  DnsServers: ["192.168.23.1", "8.8.8.8",]

[stack@undercloud ~]$ cat update-dnsserver.yaml
heat_template_version: 2014-10-16
description: 'Update resolv.conf'
parameters:
  server:
    type: string

resources:

  NameServerConfig:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      config: |
        #!/bin/sh
        echo "nameserver 8.8.8.8" >> /etc/resolv.conf

  NameServerDeployment:
    type: OS::Heat::SoftwareDeployment
    properties:
      name: NameServerDeployment
      config: {get_resource: NameServerConfig}
      server: {get_param: server}

outputs:
  deploy_stdout:
    value: "None"

[stack@undercloud ~]$ cat update-heat-stack.yaml
resource_registry:
  OS::TripleO::NodeExtraConfig: update-dnsserver.yaml

Changed in tripleo:
milestone: none → newton-rc2
Changed in tripleo:
status: New → Triaged
importance: Undecided → High
Changed in tripleo:
importance: High → Critical
Changed in tripleo:
assignee: nobody → Adriano Petrich (apetrich)
Emilien Macchi (emilienm) wrote :

Can you paste your Puppet logs from the node where it fails?

Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

Shouldn't DNS updates be handled via the DnsServers parameter?

Changed in tripleo:
milestone: newton-rc2 → ocata-1
tags: added: newton-backport-potential
Michele Baldessari (michele) wrote :

Where do the templates in "openstack overcloud deploy --templates tripleo-heat-templates" come from? I ask because if they are the newton ones, then this breakage is really expected, because we are reasserting the state via the new heat templates but the puppet modules on the overcloud are still the older ones.

That is the reason we noop the postdeploy in the upgrade steps before the convergence step:
https://github.com/openstack/tripleo-heat-templates/blob/master/environments/major-upgrade-pacemaker-init.yaml#L6

Changed in tripleo:
milestone: ocata-1 → newton-rc3
Changed in tripleo:
importance: Critical → High
Adriano Petrich (apetrich) wrote :

@jaosorior It is just a simple stack update to verify that the undercloud can still perform operations on the overcloud. That is how we validated mixed deployments in the past.

@emilienm /var/log/puppet is empty, I think as per https://bugs.launchpad.net/tripleo/+bug/1536009

Running with --debug didn't show much more than this:

2016-10-03 20:29:00Z [overcloud-ControllerNodesPostDeployment-cpwwpitdzlp3-ControllerServicesBaseDeployment_Step2-harlz5gwmnqh.0]: SIGNAL_IN_PROGRESS Signal: deployment 7da9b595-8975-48eb-8ddd-24f7198f4ebe failed (6)
2016-10-03 20:29:01Z [overcloud-ControllerNodesPostDeployment-cpwwpitdzlp3-ControllerServicesBaseDeployment_Step2-harlz5gwmnqh.0]: CREATE_FAILED Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2016-10-03 20:29:01Z [overcloud-ControllerNodesPostDeployment-cpwwpitdzlp3-ControllerServicesBaseDeployment_Step2-harlz5gwmnqh]: UPDATE_FAILED Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2016-10-03 20:29:02Z [overcloud-ControllerNodesPostDeployment-cpwwpitdzlp3.ControllerServicesBaseDeployment_Step2]: UPDATE_FAILED resources.ControllerServicesBaseDeployment_Step2: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2016-10-03 20:29:02Z [overcloud-ControllerNodesPostDeployment-cpwwpitdzlp3]: UPDATE_FAILED resources.ControllerServicesBaseDeployment_Step2: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2016-10-03 20:29:03Z [ControllerNodesPostDeployment]: UPDATE_FAILED resources.ControllerNodesPostDeployment: resources.ControllerServicesBaseDeployment_Step2: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2016-10-03 20:29:03Z [overcloud]: UPDATE_FAILED resources.ControllerNodesPostDeployment: resources.ControllerServicesBaseDeployment_Step2: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
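
The same failure is repeated once per nesting level above, so the first `*_FAILED` event is the one closest to the node. A rough Python sketch for pulling that event and its exit code out of output like this; the regular expressions are inferred from the lines pasted here and are not an official Heat event format, and the function names are illustrative:

```python
import re

# Pattern inferred from the event lines in this report (not an official format):
#   <date> <time>Z [<resource>]: <STATUS> <reason>
EVENT_RE = re.compile(
    r"^(?P<time>\S+ \S+) \[(?P<resource>[^\]]+)\]: (?P<status>[A-Z_]+)\s+(?P<reason>.*)$"
)
CODE_RE = re.compile(r"non-zero status code: (\d+)")

# Three lines copied from the log above.
SAMPLE = [
    "2016-10-03 20:29:00Z [overcloud-ControllerNodesPostDeployment-cpwwpitdzlp3-ControllerServicesBaseDeployment_Step2-harlz5gwmnqh.0]: SIGNAL_IN_PROGRESS Signal: deployment 7da9b595-8975-48eb-8ddd-24f7198f4ebe failed (6)",
    "2016-10-03 20:29:01Z [overcloud-ControllerNodesPostDeployment-cpwwpitdzlp3-ControllerServicesBaseDeployment_Step2-harlz5gwmnqh.0]: CREATE_FAILED Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6",
    "2016-10-03 20:29:03Z [overcloud]: UPDATE_FAILED resources.ControllerNodesPostDeployment: resources.ControllerServicesBaseDeployment_Step2: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6",
]


def parse_event(line):
    """Split one event line into fields; return None if it does not match."""
    match = EVENT_RE.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    code = CODE_RE.search(event["reason"])
    event["exit_code"] = int(code.group(1)) if code else None
    return event


def first_failure(lines):
    """Return the earliest event whose status ends in _FAILED, or None."""
    for line in lines:
        event = parse_event(line)
        if event is not None and event["status"].endswith("_FAILED"):
            return event
    return None


if __name__ == "__main__":
    failure = first_failure(SAMPLE)
    print(failure["resource"], failure["status"], failure["exit_code"])
```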

 Stack overcloud UPDATE_FAILED

Heat Stack update failed.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 387, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 59, in run
    return self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 1099, in take_action
    self._deploy_tripleo_heat_templates(stack, parsed_args)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 438, in _deploy_tripleo_heat_templates
    parsed_args.timeout, env)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 448, in _try_overcloud_deploy_with_compat_yaml
    tht_root, env)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 277, ...

Adriano Petrich (apetrich) wrote :

Here is the controller journal log for os-collect-config

Oct 03 19:13:34 overcloud-controller-0 os-collect-config[2833]: dib-run-parts Mon Oct 3 19:13:34 UTC 2016 Running /usr/libexec/os-refresh-config/configure.d/55-heat-config
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,168] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,168] (heat-config) [WARNING] Skipping config 31f59329-7c70-4cb8-9ad1-ac4462c862a2, already deployed
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,169] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/31f59329-7c70-4cb8-9ad1-ac4462c862a2.json
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,169] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,169] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,169] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,170] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,170] (heat-config) [WARNING] Skipping config 0a2941cb-6ef6-45d2-bbb3-212898bf4760, already deployed
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,170] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/0a2941cb-6ef6-45d2-bbb3-212898bf4760.json
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,170] (heat-config) [WARNING] Skipping config 53efc0ea-1146-428a-80ab-f0b99f94af12, already deployed
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,170] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/53efc0ea-1146-428a-80ab-f0b99f94af12.json
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,170] (heat-config) [WARNING] Skipping config d234590a-0c19-489a-a80b-fb66c1b7d6d5, already deployed
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,171] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/d234590a-0c19-489a-a80b-fb66c1b7d6d5.json
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,171] (heat-config) [WARNING] Skipping config aeb26b58-39e0-42a3-acca-65a7f6201368, already deployed
Oct 03 19:13:35 overcloud-controller-0 os-collect-config[2833]: [2016-10-03 19:13:35,171] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/aeb26b58-39e0-...

Adriano Petrich (apetrich) wrote :

So the mariadb in the controller did not stop properly:

[root@overcloud-controller-0 ~]# systemctl stop mariadb
[root@overcloud-controller-0 ~]# ps axf | grep -i mysql
 7266 pts/0 S+ 0:00 \_ grep --color=auto -i mysql
12874 ? S 0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://
13350 ? Sl 1:21 \_ /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep_on=ON --wsrep_provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm:// --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1
[root@overcloud-controller-0 ~]# killall mysqld
[root@overcloud-controller-0 ~]# ps axf | grep -i mysql
 9032 pts/0 S+ 0:00 \_ grep --color=auto -i mysql
[root@overcloud-controller-0 ~]# systemctl stop mariadb
Broadcast message from systemd-journald@overcloud-controller-0 (Mon 2016-10-03 21:19:11 UTC):

haproxy[14091]: proxy mysql has no server available!

[root@overcloud-controller-0 ~]# systemctl start mariadb
[root@overcloud-controller-0 ~]#

If I then run that Puppet manifest (.pp) again, I get:
Notice: Compiled catalog for overcloud-controller-0.localdomain in environment production in 12.39 seconds
Notice: /Stage[main]/Mysql::Server::Install/Package[mysql-server]/ensure: created
Notice: /Stage[main]/Redis::Service/Service[redis]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Swift::Client/Package[swiftclient]/ensure: created
Notice: /Stage[main]/Keystone::Deps/Anchor[keystone::service::end]: Triggered 'refresh' from 1 events
Notice: Finished catalog run in 6.31 seconds

Adriano Petrich (apetrich) wrote :

Confirming that the stack update worked after that:

| bf9b0623-dbd0-42b1-9754-9dc27de34d0d | overcloud | UPDATE_COMPLETE | 2016-10-03T17:35:33Z | 2016-10-03T21:30:03Z |

So the workaround, run on the controller, is:

systemctl stop mariadb
killall mysqld
systemctl start mariadb
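
For anyone who wants to script those three steps, here is a small Python sketch. It is only an illustration of the sequence above: the function and variable names are made up, it defaults to a dry run that prints nothing and changes nothing, and it has not been run against a real controller.

```python
import subprocess

# The three workaround commands from this comment, in order.
# The second field says whether a non-zero exit should abort the sequence;
# killall exits non-zero when no matching process exists, so it is tolerated.
WORKAROUND_STEPS = [
    (["systemctl", "stop", "mariadb"], True),
    (["killall", "mysqld"], False),
    (["systemctl", "start", "mariadb"], True),
]


def run_workaround(dry_run=True):
    """Run (or, by default, only list) the workaround commands in order."""
    issued = []
    for argv, must_succeed in WORKAROUND_STEPS:
        if not dry_run:
            # Needs root on the controller; raises CalledProcessError on failure
            # for the steps marked must_succeed.
            subprocess.run(argv, check=must_succeed)
        issued.append(" ".join(argv))
    return issued


if __name__ == "__main__":
    for command in run_workaround(dry_run=True):
        print(command)
```

Calling `run_workaround(dry_run=False)` would actually stop and restart the database on that node, so it is only meant for a test environment like this one.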

Adriano Petrich (apetrich) wrote :

I don't know if this is an actual tripleo bug that needs to be fixed or an env/quickstart problem.

Some thoughts on what I've seen:

This is what I saw when I logged in to the controller.

systemd reports mariadb as disabled and inactive:

[root@overcloud-controller-0 ~]# systemctl status mariadb
● mariadb.service - MariaDB 10.1 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

But mysqld was running:

[root@overcloud-controller-0 ~]# ps axf | grep mysql
14408 pts/0 S+ 0:00 \_ grep --color=auto mysql
13040 ? S 0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://
13516 ? Sl 0:31 \_ /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep_on=ON --wsrep_provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm:// --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1

When I killed it with killall mysqld, this message was broadcast:

Broadcast message from systemd-journald@overcloud-controller-0 (Tue 2016-10-04 09:40:53 UTC):

haproxy[12372]: proxy mysql has no server available!

A few minutes later mysqld was running again with a different PID, but the mariadb service was still not running, as seen here:

[root@overcloud-controller-0 ~]# ps axf | grep mysql
27269 pts/0 S+ 0:00 \_ grep --color=auto mysql
17030 ? S 0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://
18153 ? Sl 0:03 \_ /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep_on=ON --wsrep_provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm:// --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=20b9f29a-8a0d-11e6-a3e4-f3afe97b260a:5245
21842 ? Ssl 0:00 /usr/libexec/mysqld --basedir=/usr

So something is clearly starting mysqld. mariadb cannot start while mysqld is running, and mariadb is started by ControllerServicesBaseDeployment_Step2. If mariadb cannot start, that step fails and the stack update fails with it.
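
That reasoning can be written down as a small check. The Python sketch below is only an illustration: the function name and return labels are made up, and it takes its inputs as plain strings (for example the output of `systemctl is-active mariadb` and the command column of `ps`) instead of querying the node itself.

```python
def diagnose_mariadb(unit_state, process_cmdlines):
    """Classify the situation described in this comment.

    unit_state: text such as 'active' or 'inactive', e.g. from
                `systemctl is-active mariadb`.
    process_cmdlines: command lines of running processes, one per entry.
    """
    # A process counts as mysqld only if its executable path ends in /mysqld,
    # so `grep ... mysql` and the mysqld_safe wrapper (run via /bin/sh) are ignored.
    mysqld_running = any(
        cmd.split()[0].endswith("/mysqld")
        for cmd in process_cmdlines
        if cmd.strip()
    )
    unit_active = unit_state.strip() == "active"

    if mysqld_running and not unit_active:
        # The state seen here: the unit is dead but something else started mysqld.
        return "conflict"
    if unit_active:
        return "managed"
    return "stopped"


if __name__ == "__main__":
    observed = [
        "/bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf",
        "/usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr",
    ]
    print(diagnose_mariadb("inactive", observed))
```

A "conflict" result corresponds to the case where the Step2 deployment would be expected to fail, because the mariadb unit cannot start while the stray mysqld is still running.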

I don't know where to go from here.

Adriano Petrich (apetrich) wrote :

Might be env/image specific. Setting to invalid

Changed in tripleo:
status: Triaged → Invalid
Steven Hardy (shardy)
Changed in tripleo:
milestone: newton-rc3 → none