Deployment fails because controllers are overloaded (CPU): Failed to execute hook 'ceilometer-radosgw-user'. puppet timeout error: execution expired

Bug #1543718 reported by Artem Panchenko
This bug affects 2 people
Affects             Status         Importance  Assigned to        Milestone
Fuel for OpenStack  Invalid        Critical    Oleksiy Molchanov
8.0.x               Fix Committed  Critical    Oleksiy Molchanov
Mitaka              Invalid        Critical    Oleksiy Molchanov

Bug Description

System test 'huge_ha_neutron_tun_ceph_ceilometer_rados' failed because the controllers were overloaded during deployment. The puppet task 'ceilometer-radosgw-user' failed with a timeout:

2016-02-09 00:03:11 ERROR [803] Error running RPC method granular_deploy: Failed to execute hook 'ceilometer-radosgw-user' Puppet run failed. Check puppet logs for details
---
uids:
- '1'
- '3'
- '2'
parameters:
  puppet_modules: /etc/puppet/modules
  puppet_manifest: /etc/puppet/modules/osnailyfacter/modular/ceilometer/radosgw_user.pp
  timeout: 300
  cwd: /
priority: 1900
fail_on_error: true
type: puppet
id: ceilometer-radosgw-user
, trace:
["/usr/share/gems/gems/astute-8.0.0/lib/astute/nailgun_hooks.rb:64:in `block in process'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/nailgun_hooks.rb:26:in `each'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/nailgun_hooks.rb:26:in `process'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/deployment_engine/granular_deployment.rb:233:in `post_deployment_actions'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/deployment_engine.rb:75:in `deploy'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/orchestrator.rb:216:in `deploy_cluster'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/orchestrator.rb:52:in `granular_deploy'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/dispatcher.rb:92:in `granular_deploy'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/server.rb:189:in `dispatch_message'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/server.rb:146:in `block in dispatch'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/task_queue.rb:64:in `call'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/task_queue.rb:64:in `block in each'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/task_queue.rb:56:in `each'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/task_queue.rb:56:in `each'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/server.rb:144:in `each_with_index'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/server.rb:144:in `dispatch'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/server.rb:123:in `block in perform_main_job'"]
2016-02-09 00:03:11 ERROR [803] b97d40c3-8081-42d1-9178-d978e510c5c4: puppet timeout error: execution expired
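
When comparing several failed runs, it can help to pull the failed hook name and the timeout message out of the Astute log automatically. Below is a minimal Python sketch, assuming the log lines keep the format quoted above. The function name `find_failed_hooks` and both regular expressions are illustrative and are not part of Astute.

```python
import re

# Assumption: both patterns are derived only from the two ERROR lines
# quoted in this report; other Astute versions may word them differently.
HOOK_RE = re.compile(r"Failed to execute hook '([^']+)'")
TIMEOUT_RE = re.compile(r"puppet timeout error: (.+)$")


def find_failed_hooks(log_text):
    """Return (hook names, timeout messages) found in an Astute log excerpt."""
    hooks, timeouts = [], []
    for line in log_text.splitlines():
        hook = HOOK_RE.search(line)
        if hook:
            hooks.append(hook.group(1))
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            timeouts.append(timeout.group(1).strip())
    return hooks, timeouts


# The two ERROR lines from this report, used as sample input.
sample = (
    "2016-02-09 00:03:11 ERROR [803] Error running RPC method granular_deploy: "
    "Failed to execute hook 'ceilometer-radosgw-user' Puppet run failed. "
    "Check puppet logs for details\n"
    "2016-02-09 00:03:11 ERROR [803] b97d40c3-8081-42d1-9178-d978e510c5c4: "
    "puppet timeout error: execution expired\n"
)

hooks, timeouts = find_failed_hooks(sample)
print(hooks, timeouts)
```

For the excerpt above this yields the hook 'ceilometer-radosgw-user' and the message 'execution expired', which matches the 300-second `timeout` set in the hook parameters.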

Here is part of the atop logs from one of the controllers:

http://paste.openstack.org/show/486469/
http://paste.openstack.org/show/486471/
http://paste.openstack.org/show/486475/

As you can see, it was overloaded: puppet, ceilometer and rabbitmq used all of the CPU resources. A lot of swap (~50%) was also in use, and RAM was mostly consumed by OpenStack services (nova, neutron, heat).

Here are the hardware characteristics of the VMs used for controller nodes:

root@node-1:~# grep -P 'processor|model name|^\s*$' /proc/cpuinfo
processor : 0
model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz

processor : 1
model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz

root@node-1:~# free -m
             total       used       free     shared    buffers     cached
Mem:          3009       2903        106         24          5        106
-/+ buffers/cache:       2791        218
Swap:         3071       1306       1765
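
To put numbers on the memory pressure in the `free -m` snapshot above, the usage ratios can be computed directly. This is a rough Python sketch, assuming the older procps output layout shown here (with a "-/+ buffers/cache" row). The helper names `parse_free_m` and `percent` are hypothetical.

```python
def parse_free_m(output):
    """Parse old-style `free -m` output into a dict of integer MiB values."""
    usage = {}
    for line in output.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # header row has no colon
        label = label.strip()
        fields = [int(v) for v in rest.split()]
        if label in ("Mem", "Swap"):
            usage[label] = {"total": fields[0], "used": fields[1], "free": fields[2]}
        elif label == "-/+ buffers/cache":
            # Memory used/free once buffers and page cache are excluded.
            usage["apps"] = {"used": fields[0], "free": fields[1]}
    return usage


def percent(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1)


# Values copied from the snapshot in this report.
snapshot = """\
             total       used       free     shared    buffers     cached
Mem:          3009       2903        106         24          5        106
-/+ buffers/cache:       2791        218
Swap:         3071       1306       1765
"""

usage = parse_free_m(snapshot)
swap_pct = percent(usage["Swap"]["used"], usage["Swap"]["total"])
apps_pct = percent(usage["apps"]["used"], usage["Mem"]["total"])
print("swap used: %s%%, RAM used by applications: %s%%" % (swap_pct, apps_pct))
```

For this particular snapshot the sketch gives roughly 42.5% of swap in use and roughly 92.8% of RAM held by applications (excluding buffers and cache), so the node had very little headroom when the snapshot was taken.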

In this test the controller nodes have an additional 'ceph-osd' role and ceilometer is enabled. I think we have to increase the RAM/CPU values for VMs in such tests, but we need confirmation from deployment engineers that lack of resources is the root cause of the deployment failure.

Artem Panchenko (apanchenko-8) wrote :

The diagnostic snapshot doesn't contain remote logs (see bug #1541390), so I am attaching an archive with the full /var/log folder: https://drive.google.com/file/d/0BzaZINLQ8-xkb1NyQ2FRTFZKajA/view?usp=sharing

Bogdan Dobrelya (bogdando) wrote :

The test covers an unsupported case, "controller nodes have an additional 'ceph-osd' role"; see https://docs.mirantis.com/openstack/fuel/fuel-6.1/release-notes.html#storage-technologies-issues. Note that this should be added to the reference architecture, or at least reappear in the 7.0 release notes; it is not there now!

Bogdan Dobrelya (bogdando) wrote :

This should be addressed in the docs bug https://bugs.launchpad.net/fuel/+bug/1543963

tags: added: ceph
Artem Panchenko (apanchenko-8) wrote :

@Bogdan,

>The test covers unsupported case "controller nodes have an additional 'ceph-osd' role"

From the docs you mentioned I see that it's just "not recommended", but in practice it's currently fully supported by Fuel, because I can assign the 'ceph-osd' role to controllers via GUI/CLI/API without any limitations.

Anyway, I don't understand how assigning the 'ceph-osd' role to controllers affects the deployment process. The cluster wasn't under load (since it wasn't operational yet), and according to the atop logs Ceph didn't use a lot of resources.

Bogdan Dobrelya (bogdando) wrote :

The case is not recommended because it cannot be fixed: there is no magic spell that makes control plane, heavy storage and/or compute workloads coexist on a single node. We should make this clearly visible in the docs. This is a known limitation; folks may put OSDs on controllers at their own risk. Won't fix for Fuel.

Bug Checker Bot (bug-checker) wrote : Autochecker

(This check was performed automatically)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

version

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Dmitry Pyzhov (dpyzhov) wrote :

The bug was marked as 'New' because of Launchpad issues. Moving it back to the "Won't Fix" state.

Changed in fuel:
status: New → Won't Fix
no longer affects: fuel/mitaka
Andrii Petrenko (aplsms) wrote :

Colleagues,
I need to reopen this bug because one of our big customers hit exactly this bug.

After applying MU2 and redeploying the cluster, the deployment failed.

Deployment has failed. Method granular_deploy. Failed to execute hook 'ceilometer-radosgw-user' Puppet run failed. Check puppet logs for details
---
uids:
- '62'
- '63'
- '58'
parameters:
  puppet_modules: /etc/puppet/modules
  puppet_manifest: /etc/puppet/modules/osnailyfacter/modular/ceilometer/radosgw_user.pp
  timeout: 300
  cwd: /
priority: 1900
fail_on_error: true
type: puppet
id: ceilometer-radosgw-user
. Inspect Astute logs for the details

Nodes 62, 63 and 58 are the controllers.

I can provide logs on demand.

Changed in fuel:
status: Won't Fix → New
tags: added: customer-found support
Changed in fuel:
importance: Undecided → Critical
Changed in fuel:
assignee: Fuel Library (Deprecated) (fuel-library) → Fuel Sustaining (fuel-sustaining-team)
milestone: 9.0 → 10.0
status: New → Confirmed
Oleksiy Molchanov (omolchanov) wrote :

Andrii, please share a diagnostic snapshot.

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Andrii Petrenko (aplsms)
Andrii Petrenko (aplsms) wrote :

Done. Links shared via Slack.

Andrii Petrenko (aplsms) wrote :

The customer confirmed:
 Ceph OSD nodes and controllers are separate.

Andrii Petrenko (aplsms)
Changed in fuel:
assignee: Andrii Petrenko (aplsms) → Oleksiy Molchanov (omolchanov)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/357842

Oleksiy Molchanov (omolchanov) wrote :

9 and 10 are not affected.

Changed in fuel:
status: Confirmed → Invalid
Russell Holloway (russell-holloway) wrote :

Is there a workaround for this until a release is made available? I've also hit this issue.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/8.0)

Change abandoned by Oleksiy Molchanov (<email address hidden>) on branch: stable/8.0
Review: https://review.openstack.org/357842
