Deployment failing: task rabbitmq timeout on secondary controllers

Bug #1661004 reported by Raymond Maika
This bug affects 3 people

Affects             Status         Importance  Assigned to  Milestone
Fuel for OpenStack  Fix Committed  High        Ivan Suzdal
Newton              Fix Released   High        Ivan Suzdal
Ocata               Fix Committed  High        Ivan Suzdal

Bug Description

The task rabbitmq/rabbitmq.pp times out on both secondary controllers while waiting for the step: (/Stage[main]/Rabbitmq::Service/Service[rabbitmq-server]/ensure) ensure changed 'stopped' to 'running'

The OCF resource is created, but it does not start, even when started manually. The rabbitmq task fails, and deployment is stopped on all nodes other than the secondary controllers, which move into an error state.

Steps to reproduce:
- Create environment
- Add 3 controllers and any other set of nodes
- Deploy environment

Expected results:
- rabbitmq-server is added as an ocf resource to the pacemaker cluster and environment is deployed

Actual results:
- rabbitmq-server ocf resources are created, but do not start on any secondary controllers

Reproducibility:
- This happens on every environment I deploy with any version of Fuel 10+

Impact:
- Unable to deploy any environment

Description of the environment:
- Fuel 10.0
- Newton on Ubuntu 16.04
- KVM
- Neutron with tunneling segmentation
- Ceph for all storage types

Oleksiy Molchanov (omolchanov) wrote :

Please attach diagnostic snapshot, marking as Incomplete.

Changed in fuel:
status: New → Incomplete
Raymond Maika (raymond.maika) wrote :

Uploading diagnostic snapshot in split archive, upload was timing out.

Raymond Maika (raymond.maika) wrote :

Part 3/3 of snapshot.

Changed in fuel:
status: Incomplete → New
Serg Melikyan (smelikyan) wrote :

Marked as Confirmed because we see the same failures in OPNFV. Raising priority because without HA the release is broken.

Changed in fuel:
importance: Undecided → High
status: New → Confirmed
assignee: nobody → Fuel for Openstack (fuel)
milestone: none → 10.x-updates
Changed in fuel:
assignee: Fuel for Openstack (fuel) → MOS Oslo (mos-oslo)
milestone: 10.x-updates → 10.1
Dmitry Mescheryakov (dmitrymex) wrote :

In the unpacked snapshot, unpack logs-node-13-10.130.0.8.tar.gz and run the following grep: http://paste.openstack.org/show/601088/

The results show that Pacemaker detects no free space on node-13 in the partitions /, /var/log, and /var/lib/mysql. It reports 1 MB, 3 MB, and 3 MB of free space, respectively. We have Pacemaker tuned to shut down all resources if at least one partition has less than 512 MB of free space. That is the reason RabbitMQ does not start.

At the same time, the unpacked config.tar.gz contains the output of the "df -m" command for node-13 in the file scripts/cluster-2/node-13-10.130.0.8/df-m. It shows that / had ~1 TB of free space and that /var/log and /var/lib both had >= 3 TB of free space. So it seems we have a bug in the free-space calculation logic, and I am continuing to investigate it.
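
The rule described above can be sketched as follows. This is only an illustration, not Pacemaker's code: the function name, the dictionary layout, and the round "df -m" figures (taken loosely from the "~1 TB" and ">= 3 TB" statements) are assumptions made for the example. Only the 512 MB threshold and the 1/3/3 MB readings come from the comment.

```python
# Minimal sketch of the health rule described in the comment above.
MIN_FREE_MB = 512  # threshold quoted in the comment


def partitions_below_threshold(free_mb, min_free_mb=MIN_FREE_MB):
    """Return the mount points whose free space is under the threshold."""
    return sorted(m for m, free in free_mb.items() if free < min_free_mb)


# Free space (MB) that Pacemaker reported for node-13, per the logs.
reported = {"/": 1, "/var/log": 3, "/var/lib/mysql": 3}

# Illustrative round numbers standing in for the df -m output
# (about 1 TB and about 3 TB); the exact values are not in the thread.
from_df = {"/": 1_000_000, "/var/log": 3_000_000, "/var/lib": 3_000_000}

print(partitions_below_threshold(reported))  # every partition trips the rule
print(partitions_below_threshold(from_df))   # nothing trips the rule
```

With the reported values every partition is flagged, which would stop all resources. With the df values none is flagged, which is why the two sources cannot both be right.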

Dmitry Mescheryakov (dmitrymex) wrote :

The problem is clear - once we moved to Xenial in 10.0, we stopped packaging our own Pacemaker and started using the upstream one. But it is missing the fix for bug https://bugs.launchpad.net/fuel/+bug/1524729, which we did not push upstream, and hence we face the problem.

Packaging team, could you please package Pacemaker for 10.0 and add patch https://review.fuel-infra.org/#/c/14563 to it?

Workaround for users - besides patching Pacemaker (a single bash script), you can use disks smaller than 1 TB for the controllers.
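
Why disk size matters is not spelled out in this thread. One failure mode that would fit both the 1 MB / 3 MB readings and the sub-1 TB workaround is a size parser that reads human-readable values such as "1.1T" and does not scale a unit suffix it does not recognise. The sketch below illustrates that class of bug only, under that assumption. It is not the SysInfo script, and both function names and the unit table are hypothetical.

```python
import re

# Hypothetical parsers for human-readable sizes such as "512M" or "1.1T".
_SIZE = re.compile(r"^(\d+(?:\.\d+)?)([KMGT]?)$")
_TO_MB = {"K": 1 / 1024, "M": 1, "": 1, "G": 1024, "T": 1024 * 1024}


def parse_mb_missing_t(text):
    """Buggy variant: the 'T' suffix is unknown, so the value is not scaled."""
    number, unit = _SIZE.match(text).groups()
    known = {k: v for k, v in _TO_MB.items() if k != "T"}
    return int(float(number) * known.get(unit, 1))


def parse_mb(text):
    """Variant that knows every suffix in the table, including 'T'."""
    number, unit = _SIZE.match(text).groups()
    return int(float(number) * _TO_MB[unit])


for sample in ("900G", "1.1T", "3.0T"):
    print(sample, parse_mb_missing_t(sample), parse_mb(sample))
```

In this sketch sizes below the terabyte range are parsed identically by both variants, while terabyte values collapse to single digits in the buggy one and fall under a 512 MB threshold.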

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → MOS Packaging Team (mos-packaging)
Raymond Maika (raymond.maika) wrote :

Dmitry - is there a way to apply the patch to the pacemaker package manually, either before or after building a Fuel ISO? I haven't spent much time with the packages in Fuel before, so I'm not sure where to find pacemaker/extra/resources. Or is the workaround just to edit the SysInfo script on each slave node after it has been pushed there?

Dmitry Mescheryakov (dmitrymex) wrote :

Raymond, if you can repackage the pacemaker deb package, do that. Then put the package in a dedicated mirror and add that mirror to the environment with high priority.

If you, like me, are unfamiliar with packaging but are fine with manual steps during deployment, the following hack _might_ work:
 * Deploy the env and wait for the deployment to fail
 * On each controller, patch the SysInfo script
 * For each controller, execute
     crm node status-attr <controller FQDN> delete "#health_disk"
 * Verify with 'pcs resource' that resources are beginning to start on all controllers
 * Redeploy the environment

I am not sure this approach works, because I don't know whether Fuel completely reinstalls the controllers during redeployment or continues from the last failed step. Only in the latter case does Pacemaker stay patched.
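
The per-controller "crm node status-attr" step from the list above can be scripted. The following is a minimal sketch, assuming it runs on a node where the crm shell is installed. The function names and the example FQDNs are hypothetical; only the crm command line itself is taken from the comment. It defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess


def health_disk_reset_commands(controller_fqdns):
    """Build one 'crm node status-attr <fqdn> delete #health_disk' argv per node."""
    return [
        ["crm", "node", "status-attr", fqdn, "delete", "#health_disk"]
        for fqdn in controller_fqdns
    ]


def reset_health_disk(controller_fqdns, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    for argv in health_disk_reset_commands(controller_fqdns):
        if dry_run:
            # shlex.join quotes "#health_disk" so the line is safe to paste into a shell.
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


# Hypothetical FQDNs; replace them with the real controller names.
reset_health_disk(["node-1.example.com", "node-2.example.com"])
```

Passing dry_run=False runs the commands directly. Because the arguments are passed as a list, the "#" in the attribute name needs no shell quoting in that mode.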

Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to fuel-infra/jeepyb-config (master)

Related fix proposed to branch: master
Change author: Ivan Suzdal <email address hidden>
Review: https://review.fuel-infra.org/31616

Changed in fuel:
status: Confirmed → In Progress
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to fuel-infra/zuul-layouts (master)

Related fix proposed to branch: master
Change author: Ivan Suzdal <email address hidden>
Review: https://review.fuel-infra.org/31618

Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to fuel-infra/jeepyb-config (master)

Reviewed: https://review.fuel-infra.org/31616
Submitter: Andrey Nikitin <email address hidden>
Branch: master

Commit: 8cc2ea22e52f3ca063d5718d70989b76b189a301
Author: Ivan Suzdal <email address hidden>
Date: Tue Mar 7 12:06:21 2017

Add new project:

  - packages/xenial/pacemaker

Change-Id: I1e443b28bf7d2f80866ce21dc6a418aabc6a1375
Related-Bug: #1661004

Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to fuel-infra/zuul-layouts (master)

Reviewed: https://review.fuel-infra.org/31618
Submitter: Dmitry Burmistrov <email address hidden>
Branch: master

Commit: 4cc38d4d855afa1be0999bf04f1157f472fa2e79
Author: Ivan Suzdal <email address hidden>
Date: Tue Mar 7 12:08:25 2017

Add new project to zuul layout:

  - packages/xenial/pacemaker

Change-Id: I59a9aa4b3ed10b7bc39360dd35b380731681c0c2
Related-Bug: #1661004

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/xenial/pacemaker (master)

Fix proposed to branch: master
Change author: Ivan Suzdal <email address hidden>
Review: https://review.fuel-infra.org/31622

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/xenial/pacemaker (master)

Reviewed: https://review.fuel-infra.org/31622
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: 3b1f8835f7e35f3c6e7be065d1f87f617981071c
Author: Ivan Suzdal <email address hidden>
Date: Tue Mar 7 15:44:35 2017

Add pacemaker package

Change-Id: I1dd1d420cac61fbbb334a251bf55b573338df6e8
Closes-Bug: #1661004

Changed in fuel:
status: In Progress → Fix Committed
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/xenial/pacemaker (10.0/newton)

Fix proposed to branch: 10.0/newton
Change author: Ivan Suzdal <email address hidden>
Review: https://review.fuel-infra.org/31708

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/xenial/pacemaker (10.0/newton)

Reviewed: https://review.fuel-infra.org/31708
Submitter: Pkgs Jenkins <email address hidden>
Branch: 10.0/newton

Commit: 57f06364f224446cc077addc97daed26ff92a346
Author: Ivan Suzdal <email address hidden>
Date: Thu Mar 9 09:16:30 2017

Add pacemaker package

Change-Id: I1dd1d420cac61fbbb334a251bf55b573338df6e8
Closes-Bug: #1661004
(cherry picked from commit 3b1f8835f7e35f3c6e7be065d1f87f617981071c)

Dmitry Mescheryakov (dmitrymex) wrote :

I have verified that the latest installation of 10.0 includes the fix https://review.fuel-infra.org/#/c/14563/ by checking the file /usr/lib/ocf/resource.d/pacemaker/SysInfo. Installed pacemaker packages:

pacemaker/unknown,now 1.1.14-2~u16.04+mos1 amd64 [installed,upgradable to: 1.1.14-2ubuntu1.1]
pacemaker-cli-utils/unknown,now 1.1.14-2~u16.04+mos1 amd64 [installed,upgradable to: 1.1.14-2ubuntu1.1]
pacemaker-common/unknown,now 1.1.14-2~u16.04+mos1 all [installed,upgradable to: 1.1.14-2ubuntu1.1]
pacemaker-resource-agents/unknown,now 1.1.14-2~u16.04+mos1 all [installed,upgradable to: 1.1.14-2ubuntu1.1]

Hence marking the bug as verified.
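
A quick way to repeat the package part of this check on other nodes is to parse the package listing. The sketch below assumes that a "+mos" marker in the version string identifies the rebuilt package, which is inferred from the listing above and not stated in the thread. The function names are hypothetical. It is a weaker check than inspecting the SysInfo file itself, as was done here.

```python
def parse_apt_list_line(line):
    """Split an 'apt list'-style line into (package, version)."""
    name_and_suite, version = line.split()[:2]
    return name_and_suite.split("/")[0], version


def packages_without_marker(lines, marker="+mos"):
    """Return packages whose version lacks the marker (assumed to flag the rebuild)."""
    parsed = (parse_apt_list_line(line) for line in lines if line.strip())
    return [pkg for pkg, version in parsed if marker not in version]


# Lines copied from the listing in the comment above.
listing = [
    "pacemaker/unknown,now 1.1.14-2~u16.04+mos1 amd64 [installed,upgradable to: 1.1.14-2ubuntu1.1]",
    "pacemaker-cli-utils/unknown,now 1.1.14-2~u16.04+mos1 amd64 [installed,upgradable to: 1.1.14-2ubuntu1.1]",
    "pacemaker-common/unknown,now 1.1.14-2~u16.04+mos1 all [installed,upgradable to: 1.1.14-2ubuntu1.1]",
    "pacemaker-resource-agents/unknown,now 1.1.14-2~u16.04+mos1 all [installed,upgradable to: 1.1.14-2ubuntu1.1]",
]

print(packages_without_marker(listing))
```

An empty result means every listed package carries the marker. Any name in the result points at a package that still has a version without it.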
