When scaling up, new computes should be disabled by default

Bug #1398817 reported by Mike Scherbakov
This bug affects 2 people
Affects              Status          Importance   Assigned to                 Milestone
Fuel for OpenStack   Fix Committed   High         Bogdan Dobrelya
6.0.x                Won't Fix       High         Fuel Library (Deprecated)

Bug Description

enable_new_services should be False in nova.conf when you scale the env (add more computes to an already deployed environment).
For details on how it works, see: https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L507-L508
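
To make the mechanism concrete, here is a minimal Python sketch of the behavior this option controls. It is a simplified model, not Nova's actual code: the `Service` class and the `register_service` function are hypothetical names introduced for illustration. The only assumption is that a service registered while `enable_new_services` is false starts out disabled.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical stand-in for a row in Nova's services table."""
    host: str
    binary: str
    disabled: bool = False


def register_service(host: str, binary: str, enable_new_services: bool) -> Service:
    """Model the registration of a newly started service.

    Assumption: when enable_new_services is False, the new service is
    recorded as disabled, so the scheduler does not place work on it.
    """
    service = Service(host=host, binary=binary)
    if not enable_new_services:
        service.disabled = True
    return service
```

With the option set to false, a freshly added compute would stay out of scheduling until an operator enables it explicitly.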

The story is the following:
When we deploy a new cluster, we want all services to be started, and we want an operational environment right after it is deployed. Then, before running production workloads, we test it with the HealthCheck feature of Fuel, create a test tenant, and run some test workloads first. Then we start running production workloads.

After some time, the user needs to scale up by adding more compute hosts. If you simply add new computes with Fuel and deploy, they are automatically registered in the Nova DB. If a user starts a new VM at that time, it will VERY likely land on the new host, since it is the least loaded one. Needless to say, the new host might not be ready to accept production load. Before moving any production load onto it, the cloud administrator has to ensure that the new compute is ready for it. How to do that is another story.

So, a simple approach for now would be:
1) enable computes when you deploy a new environment
2) disable new computes when you add them to an already deployed environment
3) update the Operations Guide: deploy the new compute, check that it is ready for production, and manually enable the compute via the nova-manage command on a controller host.

Please consider the same approach for other OpenStack projects where possible. Config option for Cinder: https://github.com/openstack/cinder/blob/master/cinder/db/api.py#L58

Mike Scherbakov (mihgen)
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
tags: added: customer-found
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I think it makes sense to backport this to 6.x as well

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Astute Team (fuel-astute)
status: New → Triaged
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Looks like enable_new_services should be set to *False* in the *pre deploy* hook in Astute orchestration. Otherwise we will end up with all computes disabled after the very first deployment is done. Also, it could be reasonable to set it back to *True* in the post deploy hook, so the only job left would be to manually enable the new services.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Well, I'm thinking in a wrong way :) Looks like we will get all computes disabled anyway...

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Although comment #3 could be relevant if we introduced some new 'scale' action in the orchestration logic. The standard deploy action should not manipulate enable_new_services.

Revision history for this message
Mike Scherbakov (mihgen) wrote :

Bogdan, sorry, looks like I was not clear enough:
> when we deploy new cluster, we want all services to be started
> for now would be disable new computes by default - ONLY when you add new computes to existing already deployed env. When you deploy fresh env, computes should be enabled.
I'll fix description to make it more clear.

Please also see this email thread for reference: http://<email address hidden>/msg41238.html
We need similar thing for adding new Cinder nodes: https://github.com/openstack/cinder/blob/master/cinder/db/api.py#L58

description: updated
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I suppose we can fix it using granular deployment. We can deploy every new env with 'enable_new_services' set to false. After the deploy we will set this option to true. We are working on it at the moment.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Vladimir, could you please update the current status of this issue?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

If it is resolved as part of the granular deployment feature, please provide references to the patches.

Revision history for this message
Dima Shulyak (dshulyak) wrote :

I see two options here:

1. Based on cluster.status == 'operational', disable compute services directly in the puppet manifests that deploy them
2. Based on the same status, run a separate task on all computes that are deployed after the cluster is operational (it can be part of the compute group, or a post_deployment task)

I will change the assignee of this task, because either of these variants is doable without any changes to astute
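
Both options hinge on the same check, which can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and it assumes that a cluster status of 'operational' means the environment has already been deployed, so any compute deployed now is part of a scale-up. The example status value "new" for a fresh environment is likewise an assumption.

```python
def new_compute_should_start_disabled(cluster_status: str) -> bool:
    """Decide whether a compute being deployed should start disabled.

    Assumption: 'operational' means the environment is already deployed,
    so this deployment is a scale-up and the new compute must not take
    production load until an operator has checked it.
    """
    return cluster_status == "operational"


# Hypothetical usage: a fresh environment keeps computes enabled,
# while a scale-up of an operational one disables the new computes.
fresh_env_disabled = new_compute_should_start_disabled("new")
scale_up_disabled = new_compute_should_start_disabled("operational")
```

Whether this check lives in the puppet manifests (option 1) or in a separate task (option 2) does not change the decision itself.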

Revision history for this message
Dima Shulyak (dshulyak) wrote :

If it needs to be fixed in 6.0.1 as well, there should be a separate change in osnailyfacter/site.pp

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Fuel Library Team (fuel-library)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This bug should be tracked as a blueprint due to the changes to the deployment tasks and some implications. For example, nova host-aggregates could be used to achieve the desired behavior, but that would require an admin user context for the related pre-/post-deployment tasks, see http://docs.openstack.org/havana/config-reference/content/host-aggregates.html

Also, enable_new_services could be used, but that would still imply an additional deployment task and end-user, developer and docs impacts, so it cannot be tracked as a bug.

Note that due to the deployment changes and granular deploy requirements, this improvement cannot be backported to 6.0.

Changed in fuel:
status: Triaged → Invalid
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I moved it back to Confirmed for 6.1, as a quick fix for 6.1 could be possible (w/o nova host aggregates)

Changed in fuel:
status: Invalid → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

related upstream bug https://bugs.launchpad.net/nova/+bug/1426332

Since we cannot use the enable_new_services option yet, the workaround for the 6.1 release should be based on
nova-manage service disable/enable
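
As an illustration of that workaround, the following Python sketch builds the command line for disabling or enabling a service on a given host. The helper name is hypothetical, and the `--host` and `--service` flags are an assumption based on the nova-manage releases of that period; check `nova-manage service disable --help` on the controller before relying on them.

```python
from typing import List


def nova_manage_service_cmd(action: str, host: str,
                            service: str = "nova-compute") -> List[str]:
    """Build a nova-manage command line (hypothetical helper).

    Assumption: nova-manage accepts 'service enable|disable' with
    --host and --service, as in the releases contemporary to this bug.
    """
    if action not in ("enable", "disable"):
        raise ValueError("action must be 'enable' or 'disable'")
    return ["nova-manage", "service", action,
            "--host", host, "--service", service]


# Example: the command an operator could run on a controller
# once the new compute has been verified.
enable_cmd = nova_manage_service_cmd("enable", "node-5")
```

The resulting list can be passed to `subprocess.run` on a controller host; it is only built here, not executed.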

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/161664

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Deployment orchestration has no ability to trigger actions on the controller(s) after every new compute is launched. That means we should drop the enable_new_services parameter, issue nova service-disable locally on the computes at the deployment stage, and enable them again at post deploy.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I tested the workaround for 6.1 https://review.openstack.org/#/c/161664/ and the results are good: new compute services are disabled within a few seconds of starting http://paste.openstack.org/show/190436/

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Updated the fix to follow a simpler approach:
 - deploy nova-compute and cinder-volume services stopped and disabled;
 - start and enable them again in a post-deploy hook
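
The two steps above can be modeled with a short Python simulation. It is a sketch of the ordering only, not the fuel-library implementation: `ServiceState`, `deploy`, `post_deploy_hook` and `schedulable_hosts` are hypothetical names, and the model assumes a scheduler only considers services that are both running and enabled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ServiceState:
    """Hypothetical state of one nova-compute or cinder-volume service."""
    running: bool = False
    enabled: bool = False


def deploy(hosts: List[str]) -> Dict[str, ServiceState]:
    """Deployment stage: services are installed but left stopped and disabled."""
    return {host: ServiceState() for host in hosts}


def post_deploy_hook(services: Dict[str, ServiceState]) -> None:
    """Post-deploy stage: start the services and enable them again."""
    for state in services.values():
        state.running = True
        state.enabled = True


def schedulable_hosts(services: Dict[str, ServiceState]) -> List[str]:
    """Hosts a scheduler could use (assumption: running and enabled)."""
    return sorted(h for h, s in services.items() if s.running and s.enabled)
```

In this model no host is schedulable between `deploy` and `post_deploy_hook`, which is the window the fix is meant to protect.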

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/161664
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=ca54e745d5096fd1eaf9beb4ed654e8c24542c5e
Submitter: Jenkins
Branch: master

commit ca54e745d5096fd1eaf9beb4ed654e8c24542c5e
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Feb 23 16:29:48 2015 +0100

    Manage new compute/cinder services state

    * When deploying or scaling an Openstack environment,
      disable all of the nova-compute/cinder-volume services and use
      a separate post-deploy tasks to re-enable them back.
      That should be done to prohibit nova/cinder schedulers to assign
      tasks for compute/cinder nodes until the deployment or scaling
      is finished.
      (Note, that nova-computes fix may be re-implemented later with
      host-aggregates)

    * Add enable_volumes parameter (default true) for openstack::cinder

    DocImpact
    Related blueprint disable-new-computes
    Closes-bug: #1398817

    Change-Id: Ia63d043753693360a008ec89924cdcdd93c007f3
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed