"nova-manage cell_v2 discover_hosts" failed, causing ncc status to temporarily go into error status

Bug #1720846 reported by Jason Hobbs
This bug affects 2 people
Affects: OpenStack Nova Cloud Controller Charm
Status: Confirmed
Importance: High
Assigned to: Corey Bryant
Milestone: (none)

Bug Description

In a deploy of openstack-base, the n-c-c unit temporarily went into error state:

http://paste.ubuntu.com/25661755/

2017-09-30 10:35:40 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] error: hook failed: "cloud-compute-relation-changed"

imo, charms shouldn't enter an error state if they can recover on their own. To an external observer (juju wait in this case) this is indistinguishable from a fatal error, because there is no way to know how long to wait for the error to correct itself.

I've attached a crashdump from the test run.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Ryan Beisner (1chb1n) wrote :

FYI, we disable failed hook retries in our CI tests. We believe that the charm should never enter a failed/error state. So any time the application unit workload state is error, we see that as a bug.

As charmers, we don't want to rely on juju's retries to save us. It's nice that it will/might, but we don't lean on that by design.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

2017-09-30 10:35:38 INFO juju-log cloud-compute:27: Adding hosts to cell.
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed Traceback (most recent call last):
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-0/charm/hooks/cloud-compute-relation-changed", line 1214, in <module>
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed main()
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-0/charm/hooks/cloud-compute-relation-changed", line 1208, in main
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed hooks.execute(sys.argv)
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-0/charm/hooks/charmhelpers/core/hookenv.py", line 768, in execute
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed self._hooks[hook_name]()
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-0/charm/hooks/cloud-compute-relation-changed", line 706, in compute_changed
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed add_hosts_to_cell()
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-0/charm/hooks/nova_cc_utils.py", line 817, in add_hosts_to_cell
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed subprocess.check_output(cmd)
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed File "/usr/lib/python2.7/subprocess.py", line 574, in check_output
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed raise CalledProcessError(retcode, cmd, output=output)
2017-09-30 10:35:40 DEBUG cloud-compute-relation-changed subprocess.CalledProcessError: Command '['nova-manage', 'cell_v2', 'discover_hosts']' returned non-zero exit status 1
2017-09-30 10:35:40 ERROR juju.worker.uniter.operation runhook.go:107 hook "cloud-compute-relation-changed" failed: exit status 1

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Having said that, we've not encountered this error in our deploys of openstack-base onto Dell R610s via MAAS 2.2.2 and Juju 2.2.4. How frequently does this occur, and how reproducible is it?

Revision history for this message
Corey Bryant (corey.bryant) wrote :

It seems like there's a race somewhere; whether that's in nova or the charm, I'm not sure. It'd be great to capture why the command is failing ('nova-manage cell_v2 discover_hosts'). Would you be able to run this command manually next time you hit this error?

An alternative to running the discover_hosts subcommand is to set [scheduler]/discover_hosts_in_cells_interval in nova.conf (see nova/doc/source/user/cells.rst). We actually do both in the nova-cc charm because we don't know if all the compute nodes are up when we run discover_hosts. Perhaps we could drop the subcommand and just rely on the nova.conf setting?
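
For reference, a minimal sketch of that nova.conf alternative (the interval value here is illustrative, not a recommendation):

    [scheduler]
    # Any value > 0 enables nova-scheduler's periodic host discovery
    # (interval in seconds); -1 disables it.
    discover_hosts_in_cells_interval = 300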

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Triaging as medium since this doesn't always occur.

Changed in charm-nova-cloud-controller:
status: New → In Progress
status: In Progress → Triaged
importance: Undecided → Medium
Revision history for this message
Corey Bryant (corey.bryant) wrote :

After looking at the code again, I'm hesitant to drop add_hosts_to_cell, at least for the upgrade path. We could probably get away with not calling it on new installs. However, it's useful on upgrades from non-cellsv2 to cellsv2, where existing hosts are discovered by 'nova-manage cell_v2 discover_hosts' and existing instances are mapped to cells right after that with a call to 'nova-manage cell_v2 map_instances'.
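
A minimal sketch of that upgrade-time sequence, assuming the cell1 uuid has already been looked up (the uuid variable below is a placeholder; the charm's actual helpers differ):

    import subprocess

    # Placeholder: the charm obtains this from 'nova-manage cell_v2 list_cells'.
    cell1_uuid = '<cell1-uuid>'

    # Discover existing compute hosts and create their host mappings.
    subprocess.check_output(['nova-manage', 'cell_v2', 'discover_hosts'])

    # Map pre-existing instances into cell1 right after discovery.
    subprocess.check_output(['nova-manage', 'cell_v2', 'map_instances',
                             '--cell_uuid', cell1_uuid])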

Revision history for this message
Nobuto Murata (nobuto) wrote :

I don't know if it's helpful, but I saw multiple deployments which had `nova-manage cell_v2` errors. In those cases, I found that the nova-cloud-controller charm was issuing the nova-manage command before having MySQL information in nova.conf.
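
A purely illustrative guard of that shape, assuming nova.conf is read directly (the charm's real readiness gate uses relation data instead; see the add_hosts_to_cell checks quoted later in this bug):

    import configparser

    def db_configured(conf='/etc/nova/nova.conf'):
        # Skip nova-manage until [database]/connection is present in nova.conf.
        # interpolation=None avoids choking on '%' in connection URLs.
        cp = configparser.ConfigParser(interpolation=None)
        cp.read(conf)
        return bool(cp.get('database', 'connection', fallback=''))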

Chris Gregan (cgregan)
tags: added: cpe-onsite
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Re-triaging high, as it is affecting field deployments.

Changed in charm-nova-cloud-controller:
importance: Medium → High
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I think I see what is going on here. I'll propose a fix.

Changed in charm-nova-cloud-controller:
assignee: nobody → Corey Bryant (corey.bryant)
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Thanks Nobuto, that was a helpful comment.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (master)

Fix proposed to branch: master
Review: https://review.openstack.org/509900

Changed in charm-nova-cloud-controller:
status: Triaged → In Progress
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I can't see the crashdump right now, but it looks like what's happening here is that a pre-ocata release is deployed (non cells_v2) and then the compute-changed hook attempts to run a cells_v2 command.

In migrate_nova_databases, the pre-ocata path is taken, the cells v2 DBs aren't set up (which is correct behavior for pre-ocata), and finalize_migrate_nova_databases() is called [0]. Then in the compute-changed hook, add_hosts_to_cell() is called within an is_db_initialized() check [1]. is_db_initialized() returns true because finalize_migrate_nova_databases() has been called. However, since the release is pre-ocata, add_hosts_to_cell() likely fails because the cells v2 DBs don't exist.

[0] http://paste.ubuntu.com/25683260/
[1] http://paste.ubuntu.com/25683266/

Revision history for this message
Corey Bryant (corey.bryant) wrote :

My previous theory is invalid. We already check if >= ocata in add_hosts_to_cell() before doing anything. In fact, we check that all of the following are true, ensuring the DBs are already set up, prior to running cell_v2 discover_hosts:

    if (CompareOpenStackReleases(os_release('nova-common')) >= 'ocata' and
            is_relation_made('amqp', 'password') and
            is_relation_made('shared-db', 'novaapi_password') and
            is_relation_made('shared-db', 'novacell0_password') and
            is_relation_made('shared-db', 'nova_password')):

Here's the whole function: http://paste.ubuntu.com/25685460/

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Additionally, we only run discover_hosts if list_cells returns 'cell1' (the one cell that the charms deploy). This means that 'cell1' is already in the database, and therefore the database should be completely ready to run discover_hosts at this point.
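
A sketch of that gate, assuming a simple substring check on the list_cells output (the charm's actual parsing differs):

    import subprocess

    def cell1_registered():
        # Only proceed once 'cell1' shows up in the cell listing.
        out = subprocess.check_output(['nova-manage', 'cell_v2', 'list_cells'])
        return b'cell1' in out

    if cell1_registered():
        subprocess.check_output(['nova-manage', 'cell_v2', 'discover_hosts'])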

Unfortunately I'm at a loss for how to move forward until we can recreate this and get the failing output from 'nova-manage cell_v2 discover_hosts'.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

discover_hosts and its return codes are described here:
https://github.com/openstack/nova/blob/master/doc/source/cli/nova-manage.rst

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Marking as incomplete until we can recreate or get failing output from 'nova-manage cell_v2 discover_hosts'.

Changed in charm-nova-cloud-controller:
status: In Progress → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1720846] Re: "nova-manage cell_v2 discover_hosts" failed, causing ncc status to temporarily go into error status

Corey, can we add logging of nova-manage cell_v2 discover_hosts to the charm?

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I think adding --verbose to the call would make sense for the future and should send more verbose output to /var/log/nova/nova-manage.log.
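
For reference, the call in question would then be `nova-manage cell_v2 discover_hosts --verbose`.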

I think I was able to recreate the issue. While deploying, I ran the following on the nova-cloud-controller unit:

while true; do
    echo "sudo nova-manage cell_v2 list_cells"
    sudo nova-manage cell_v2 list_cells
    rc=$?
    echo "rc=$rc"
    if [ $rc -eq 0 ]; then
        echo "sudo nova-manage cell_v2 discover_hosts"
        sudo nova-manage cell_v2 discover_hosts
        rc=$?
        echo "rc=$rc"
    fi
done

Results are in the attached file. Note I trimmed repeated output, such as the repeated 'No sql_connection parameter is established' errors that occur before nova.conf is set up.

You'll notice there's a window between when list_cells starts returning cell1 and when the database is actually set up and ready for the discover_hosts call. This must be the error that is being hit.

Changed in charm-nova-cloud-controller:
status: Incomplete → Confirmed
status: Confirmed → Triaged
Changed in charm-nova-cloud-controller:
status: Triaged → In Progress
Revision history for this message
Corey Bryant (corey.bryant) wrote :

My theory in comment #19 is invalid. The script was running in a different process than the charm and the window I was seeing was actually during the charm's execution of 'nova-manage db sync'. I confirmed that the nova-manage commands do in fact run synchronously. So there is no race in the nova-manage commands as far as I can tell.

I'm going to add failure verbosity to nova-cc so that we can capture the real issue next time on first failure. Then you should be able to see a more informative traceback in the juju log.
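
A minimal sketch of that verbosity change, assuming charmhelpers' log helper (names and the exact message are illustrative, not the charm's final code):

    import subprocess

    from charmhelpers.core.hookenv import ERROR, log

    cmd = ['nova-manage', 'cell_v2', 'discover_hosts', '--verbose']
    try:
        subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        # Log nova-manage's own output so the first failure is actionable.
        log('discover_hosts failed: {}'.format(e.output), level=ERROR)
        raise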

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-nova-cloud-controller (master)

Reviewed: https://review.openstack.org/509900
Committed: https://git.openstack.org/cgit/openstack/charm-nova-cloud-controller/commit/?id=3b873932be005eb6ec2b979fe72a92ece4fe905b
Submitter: Zuul
Branch: master

commit 3b873932be005eb6ec2b979fe72a92ece4fe905b
Author: Corey Bryant <email address hidden>
Date: Fri Oct 6 17:37:04 2017 +0000

    Add nova-manage failure verbosity and clean up Cells V2 code

    Update all nova-manage commands to use subprocess.check_output()
    and log subprocess.CalledProcessError.output on failure. This
    will help capture nova-manage error details on first failure.

    Specify cell1 uuid on discover_hosts call. This doesn't change
    behavior; it is just more explicit, and useful if we move to
    multiple cells in the future.

    Introduce an is_cellv2_init_ready() function that uses contexts to
    check if cells_v2 is ready to initialize. This cleans up the
    corresponding TODOs.

    Move checks for cell v2 init readiness to update_cell_db_if_ready(),
    also cleaning up corresponding TODOs.

    Change-Id: I313edce84d3d249031e020a4fbb4baf216c01ddb
    Related-Bug: 1720846
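
(For reference, the cell-scoped call described in the commit message takes the form `nova-manage cell_v2 discover_hosts --cell_uuid <uuid>`, where <uuid> is cell1's uuid.)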

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hi Jason,

The change above has landed and will allow you to gather more details the next time this issue occurs. Please let me know if it occurs again and I'll take a look.

Thanks,
Corey

Changed in charm-nova-cloud-controller:
status: In Progress → Confirmed
status: Confirmed → Triaged
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hi Jason,

It seems that this hasn't resurfaced in a while. With the last change committed to add failure verbosity, I'm going to mark this as Incomplete for now. Please move it back to New and give me a ping if you hit this again.

Thanks!
Corey

Changed in charm-nova-cloud-controller:
status: Triaged → Incomplete
Revision history for this message
Frode Nordahl (fnordahl) wrote :

I just hit this in a charm gate job:
2020-07-23 10:37:15 ERROR juju-log cloud-compute:22: Cell1 discover_hosts failed
b"Getting computes from cell 'cell1': ef421f8f-5182-4014-bb1f-b9a93ebec5a5\nChecking host mapping for compute host 'juju-e09d65-zaza-c69af9d26505-8.project.serverstack': 8b52ad0c-4cf8-4c7d-984b-3283de75be4f\nCreating host mapping for compute host 'juju-e09d65-zaza-c69af9d26505-8.project.serverstack': 8b52ad0c-4cf8-4c7d-984b-3283de75be4f\nChecking host mapping for compute host 'juju-e09d65-zaza-c69af9d26505-6.project.serverstack': 54d39de2-671b-4c46-b31e-4e80202e1c27\nCreating host mapping for compute host 'juju-e09d65-zaza-c69af9d26505-6.project.serverstack': 54d39de2-671b-4c46-b31e-4e80202e1c27\nERROR: Duplicate host mapping was encountered. This command should be run once after all compute hosts have been deployed and should not be run in parallel. When run in parallel, the commands will collide with each other trying to map the same hosts in the database at the same time. Error: Host 'juju-e09d65-zaza-c69af9d26505-6.project.serverstack' mapping already exists\n"
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed Traceback (most recent call last):
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-0/charm/hooks/cloud-compute-relation-changed", line 1383, in <module>
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed main()
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-0/charm/hooks/cloud-compute-relation-changed", line 1375, in main
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed hooks.execute(sys.argv)
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-0/charm/charmhelpers/core/hookenv.py", line 945, in execute
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed self._hooks[hook_name]()
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-0/charm/hooks/cloud-compute-relation-changed", line 674, in cloud_compute_relation_changed
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed add_hosts_to_cell_when_ready()
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-0/charm/hooks/cloud-compute-relation-changed", line 689, in add_hosts_to_cell_when_ready
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed ncc_utils.add_hosts_to_cell()
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-0/charm/hooks/nova_cc_utils.py", line 954, in add_hosts_to_cell
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed subprocess.check_output(cmd)
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed **kwargs).stdout
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed File "/usr/lib/python3.6/subprocess.py", line 438, in run
2020-07-23 10:37:15 DEBUG cloud-compute-relation-changed ...

Changed in charm-nova-cloud-controller:
status: Incomplete → New
Changed in charm-nova-cloud-controller:
status: New → Confirmed
Revision history for this message
Eric Chen (eric-chen) wrote :

I think I encountered the same issue on a CI server.

https://openstack-ci-reports.ubuntu.com/artifacts/f81/807668/1/check/hirsute-wallaby-bluestore/f814a43/index.html

In the crashdump, see nova-cloud-controller_0/var/log/juju/unit-nova-cloud-controller-0.log:

2021-09-08 02:17:10 ERROR juju-log cloud-compute:27: Cell1 discover_hosts failed
b"Getting computes from cell 'cell1': 67c4cadf-077f-4555-8fbf-764393be6010\nChecking host mapping for compute host 'juju-157e45-zaza-5e177bef34cd-18.project.serverstack': 0fb96ad6-3c68-4199-8d5c-6e53b4418f2b\nCreating host mapping for compute host 'juju-157e45-zaza-5e177bef34cd-18.project.serverstack': 0fb96ad6-3c68-4199-8d5c-6e53b4418f2b\nERROR: Duplicate host mapping was encountered. This command should be run once after all compute hosts have been deployed and should not be run in parallel. When run in parallel, the commands will collide with each other trying to map the same hosts in the database at the same time. Error: Host 'juju-157e45-zaza-5e177bef34cd-18.project.serverstack' mapping already exists\n"
