replace controller fail in build containers in new controller

Bug #1756138 reported by SharonBarak
This bug affects 2 people
Affects: tripleo
Status: Expired
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

#queens-latest

Hi,
When replacing a failed controller (controller-2 was removed), we hit the following issues
after the new controller was added to ironic/nova etc. and the failed one was removed:

1 - pcs keeps the dead containers of the removed controller in the bundle resources (rabbitmq-bundle, galera-bundle, haproxy-bundle):

 Docker container set: rabbitmq-bundle [172.31.255.1:8787/cbis/centos-binary-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1
   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped

I assume those should have been removed (the *-2 replicas were on the removed controller). I could not remove them manually, either by editing cib.xml or with pcs resource bundle update.

2 - During the update, the following errors are triggered:

overcloud.AllNodesDeploySteps.ControllerDeployment_Step2.3:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: ffc0b7c2-2bae-40cb-8815-9263ac810f15
  status: CREATE_FAILED
  status_reason: |
    Error: resources[3]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "+ puppet apply --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,tripleo::firewall::rule,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ip,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation -e 'include ::tripleo::profile::base::pacemaker; include ::tripleo::profile::pacemaker::haproxy_bundle'",
            " with Stdlib::Compat::Hash. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/tripleo/manifests/firewall/rule.pp\", 134]:",
            "Warning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications."
        ]
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/5f0e11da-8d21-48bd-8167-df39259a5190_playbook.retry

    PLAY RECAP *********************************************************************
    localhost : ok=6 changed=2 unreachable=0 failed=1

    (truncated, view all with --long)
  deploy_stderr: |

Any advice?

Changed in tripleo:
milestone: none → rocky-1
importance: Undecided → Medium
status: New → Triaged
Michele Baldessari (michele) wrote :
SharonBarak (sharonbarak) wrote :

Yes.
The issue is triggered after the new controller is added manually to pacemaker:

puppet apply --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,tripleo::firewall::rule,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ip,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation -e 'include ::tripleo::profile::base::pacemaker; include ::tripleo::profile::pacemaker::haproxy_bundle'

The above is triggered on the newly added controller (step 2); I am not sure what currently triggers it.
Options:
1 - The rabbit container on the new controller exits with "Error: Failed to apply catalog: Command is still failing after 180 seconds expired!", which looks the same as https://bugs.launchpad.net/tripleo/+bug/1723095

2 - At the step where it fails, the new controller has not yet started haproxy, galera, etc. It is not clear to me
whether pcs should start new containers on the new controller with the index of the new controller (the stopped containers that ran on the old controller are still in the cluster as Stopped):
 Docker container set: rabbitmq-bundle [172.31.255.1:8787/cbis/centos-binary-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1
   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped

 Docker container set: galera-bundle [172.31.255.1:8787/cbis/centos-binary-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master overcloud-controller-0
   galera-bundle-1 (ocf::heartbeat:galera): Master overcloud-controller-1
   galera-bundle-2 (ocf::heartbeat:galera): Stopped
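
A minimal sketch for listing the leftover replicas, assuming the pcs status text has exactly the layout pasted above. The function name stopped_bundle_replicas is hypothetical and not part of pcs; it only parses text and changes nothing in the cluster.

```python
import re

# Matches replica lines such as:
#   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped
#   galera-bundle-0 (ocf::heartbeat:galera): Master overcloud-controller-0
REPLICA_LINE = re.compile(
    r"^\s*(?P<replica>\S+-bundle-\d+)\s+\((?P<agent>[^)]+)\):\s+"
    r"(?P<state>\S+)(?:\s+(?P<node>\S+))?\s*$"
)


def stopped_bundle_replicas(status_text):
    """Return the names of bundle replicas reported as Stopped (hypothetical helper)."""
    stopped = []
    for line in status_text.splitlines():
        match = REPLICA_LINE.match(line)
        if match and match.group("state") == "Stopped":
            stopped.append(match.group("replica"))
    return stopped


# Sample copied from the status output in this report.
SAMPLE = """\
 Docker container set: rabbitmq-bundle [172.31.255.1:8787/cbis/centos-binary-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1
   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped

 Docker container set: galera-bundle [172.31.255.1:8787/cbis/centos-binary-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master overcloud-controller-0
   galera-bundle-1 (ocf::heartbeat:galera): Master overcloud-controller-1
   galera-bundle-2 (ocf::heartbeat:galera): Stopped
"""

if __name__ == "__main__":
    for name in stopped_bundle_replicas(SAMPLE):
        print(name)
```

Matching on the state column rather than on the replica index keeps the sketch independent of which controller was removed.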

In addition, the cluster logs on the new controller show:
overcloud-controller-3 crmd: error: do_lrm_invoke: no lrmd connection for remote node rabbitmq-bundle-2 found on cluster node overcloud-controller-3. Can not process request.
overcloud-controller-3 crmd: error: do_lrm_invoke: no lrmd connection for remote node redis-bundle-2 found on cluster node overcloud-controller-3. Can not process request.
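
To see which bundle replicas the new node is being asked to handle, the errors above can be reduced to (remote node, cluster node) pairs. This is a sketch that assumes the log lines look exactly like the two quoted above and that cluster node names are short hostnames without dots; the helper name missing_lrmd_connections is hypothetical.

```python
import re

# Matches the message part of lines such as:
#   ... crmd: error: do_lrm_invoke: no lrmd connection for remote node
#   rabbitmq-bundle-2 found on cluster node overcloud-controller-3. Can not process request.
NO_LRMD = re.compile(
    r"no lrmd connection for remote node (?P<remote>\S+) "
    r"found on cluster node (?P<node>[^\s.]+)"
)


def missing_lrmd_connections(log_text):
    """Return (remote_node, cluster_node) pairs from 'no lrmd connection' errors."""
    return [(m.group("remote"), m.group("node")) for m in NO_LRMD.finditer(log_text)]


# Sample copied from the log lines in this report.
LOG_SAMPLE = (
    "overcloud-controller-3 crmd: error: do_lrm_invoke: no lrmd connection for remote node "
    "rabbitmq-bundle-2 found on cluster node overcloud-controller-3. Can not process request.\n"
    "overcloud-controller-3 crmd: error: do_lrm_invoke: no lrmd connection for remote node "
    "redis-bundle-2 found on cluster node overcloud-controller-3. Can not process request.\n"
)

if __name__ == "__main__":
    for remote, node in missing_lrmd_connections(LOG_SAMPLE):
        print(remote, node)
```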

It is not clear to me how it should behave. (I once tried changing the replica count for the bundles to 2, and I could then proceed with the update.)
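
The replica change mentioned above might be expressed roughly as below. This is a sketch of what was tried, not a fix: it assumes the pcs syntax "pcs resource bundle update <bundle id> container replicas=<n>", which should be checked against the installed pcs version, and the helper name bundle_replicas_command is hypothetical. The code only builds the argument list and does not run anything.

```python
def bundle_replicas_command(bundle_id, replicas):
    """Build (but do not run) a pcs argv that sets a bundle's replica count.

    Assumption: the installed pcs accepts
    'pcs resource bundle update <bundle id> container replicas=<n>'.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return [
        "pcs", "resource", "bundle", "update",
        bundle_id, "container", f"replicas={replicas}",
    ]


if __name__ == "__main__":
    # Bundle names taken from the status output in this report.
    for bundle in ("rabbitmq-bundle", "galera-bundle"):
        print(" ".join(bundle_replicas_command(bundle, 2)))
```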

(We run queens, taken two weeks before the latest.)

Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
milestone: stein-3 → train-1
Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
milestone: train-2 → train-3
Changed in tripleo:
milestone: train-3 → ussuri-1
Changed in tripleo:
milestone: ussuri-1 → ussuri-2
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Incomplete
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in tripleo:
milestone: victoria-3 → wallaby-1
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Marios Andreou (marios-b) wrote :

This is an automated action. Bug status has been set to 'Incomplete' and target milestone has been removed due to inactivity. If you disagree please re-set these values and reach out to us on freenode #tripleo

Changed in tripleo:
milestone: wallaby-3 → none
Launchpad Janitor (janitor) wrote :

[Expired for tripleo because there has been no activity for 60 days.]

Changed in tripleo:
status: Incomplete → Expired