upgrade-non-controller.sh getting stuck cleaning up openstack-nova-compute package

Bug #1630580 reported by Harry Rybacki
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Michele Baldessari
Milestone: newton-rc3

Bug Description

Attempting to do a major upgrade (Liberty -> Mitaka). During `upgrade-non-controller.sh` execution, the process seems to get stuck during the yum update on a specific package:

  + echo 'get compute uuid and update compute'
  get compute uuid and update compute
  ++ nova list
  ++ grep compute
  ++ awk '{ print $2; }'
  + for compute in '$(nova list | grep compute | awk '\''{ print $2; }'\'')'
  + echo 'Run upgrade on 629e3a10-0896-4eff-81cd-c856adc6c951'
  Run upgrade on 629e3a10-0896-4eff-81cd-c856adc6c951
  + /bin/upgrade-non-controller.sh --upgrade 629e3a10-0896-4eff-81cd-c856adc6c951
  Tue Oct 4 19:23:50 UTC 2016 upgrade-non-controller.sh nova node 629e3a10-0896-4eff-81cd-c856adc6c951 found with status ACTIVE
  Tue Oct 4 19:23:53 UTC 2016 upgrade-non-controller.sh checking for upgrade script /root/tripleo_upgrade_node.sh on node 629e3a10-0896-4eff-81cd-c856adc6c951 (192.0.2.7)
  Warning: Permanently added '192.0.2.7' (ECDSA) to the list of known hosts.^M
  Tue Oct 4 19:23:54 UTC 2016 upgrade-non-controller.sh upgrade script /root/tripleo_upgrade_node.sh found on node 629e3a10-0896-4eff-81cd-c856adc6c951 (192.0.2.7)
  Tue Oct 4 19:23:57 UTC 2016 upgrade-non-controller.sh Executing /root/tripleo_upgrade_node.sh on 192.0.2.7
  Loaded plugins: fastestmirror, priorities
  Determining fastest mirrors
   * base: mirror.cogentco.com
   * extras: mirror.net.cen.ct.gov
   * updates: mirror.vcu.edu
  771 packages excluded due to repository priority protections
  Package python-zaqarclient is obsoleted by python2-zaqarclient, trying to install python2-zaqarclient-1.0.1-0.20160913004709.b040bbb.el7.centos.noarch instead
  Resolving Dependencies

  <snip>

  Cleanup : 1:python-networking-cisco-2.1.2-0.20160826004345.e4b8a 188/308
  Cleanup : 1:openstack-nova-api-12.0.5-0.20160923003400.f680157.e 189/308
  Cleanup : 1:openstack-nova-compute-12.0.5-0.20160923003400.f6801 190/308

Deployed via TripleO-Quickstart, mimicking a ci.centos periodic job[1] that is getting killed while stuck at the same point of the upgrade. Logs were not collected because Jenkins killed the job before collection, so it is not known whether it is getting stuck cleaning up the same package.

[1] - https://ci.centos.org/view/rdo/view/tripleo-periodic/job/tripleo-quickstart-upgrade-major-liberty-to-mitaka/137/

Revision history for this message
Michele Baldessari (michele) wrote :

Have you checked that rabbit/galera on the controllers are correctly up and running when this script runs on the compute nodes?

openstack-nova-compute gets upgraded, and somewhere along the way the rpm %post scriptlet runs a "systemctl try-restart openstack-nova-compute". Since the service was running it gets restarted, but the restart takes an unusually long time when rabbit on the controller is not reachable.

I have seen this phenomenon happen quite a few times, so I thought I'd mention it here.
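
A quick sanity check for that scenario, from the compute node, is something along these lines (a sketch only; substitute your controller/VIP addresses, the 172.16.2.4 below is just an example):

  # 5672 is rabbitmq, 3306 is galera/mysql; both must answer from the compute node
  for host in 172.16.2.4; do
      for port in 5672 3306; do
          timeout 5 bash -c "cat < /dev/null > /dev/tcp/${host}/${port}" \
              && echo "${host}:${port} reachable" \
              || echo "${host}:${port} NOT reachable"
      done
  done

  # And on a controller, the cluster view of both services:
  pcs status | grep -E 'rabbitmq|galera'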

Revision history for this message
Michele Baldessari (michele) wrote :

There might be another scenario that kicks in. Namely this:
From /var/log/nova/nova-compute.log:
2016-10-10 11:25:55.580 28911 WARNING nova.conductor.api [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...

tags: added: upgrade
Changed in tripleo:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Michele Baldessari (michele) wrote :

So I reproduced this on an M->N upgrade when upgrading the compute nodes:
2016-10-10 11:12:15.417 28911 WARNING oslo_config.cfg [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Option "rabbit_virtual_host" from group "oslo_messaging_rabbit" is deprecated for removal. Its value may be silently ignored in the future.
2016-10-10 11:12:15.417 28911 WARNING oslo_config.cfg [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Option "rabbit_hosts" from group "oslo_messaging_rabbit" is deprecated for removal. Its value may be silently ignored in the future.
2016-10-10 11:12:15.418 28911 WARNING oslo_config.cfg [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Option "rabbit_userid" from group "oslo_messaging_rabbit" is deprecated for removal. Its value may be silently ignored in the future.
2016-10-10 11:12:15.418 28911 WARNING oslo_config.cfg [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Option "rabbit_password" from group "oslo_messaging_rabbit" is deprecated for removal. Its value may be silently ignored in the future.
2016-10-10 11:12:25.493 28911 WARNING nova.conductor.api [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...
....
2016-10-10 12:10:55.791 28911 WARNING nova.conductor.api [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...

After 1 hour I just killed the "systemctl try-restart openstack-nova-compute" process and the upgrade proceeded normally.
What I found particularly odd is that starting openstack-nova-conductor by hand in another terminal on the compute node did not unblock the situation.
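
For reference, the stuck restart can be spotted and cleared roughly like this (a sketch of the manual workaround, not a proper fix; the service still needs a clean restart once the conductor is reachable):

  # Find the rpm %post scriptlet hanging on the restart...
  pgrep -af 'try-restart openstack-nova-compute'
  # ...and kill it so the yum transaction can continue
  pkill -f 'try-restart openstack-nova-compute'
  # Once the controllers/conductor are healthy again:
  systemctl restart openstack-nova-compute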

Revision history for this message
Michele Baldessari (michele) wrote :

So the reason for this failure in my case *seems* to be the following:
[root@overcloud-novacompute-0 nova]# grep -E "^connection" /etc/nova/nova.conf
[root@overcloud-novacompute-0 nova]#

And so nova-conductor could never really start properly:
2016-10-10 13:10:55.167 21150 WARNING oslo_db.sqlalchemy.engines [req-d76e0767-9f8d-49e8-89b9-9c31a8ad8498 - - - - -] SQL connection failed. -82 attempts left.
2016-10-10 13:11:05.171 21150 WARNING oslo_db.sqlalchemy.engines [req-d76e0767-9f8d-49e8-89b9-9c31a8ad8498 - - - - -] SQL connection failed. -83 attempts left.
2016-10-10 13:11:15.178 21150 WARNING oslo_db.sqlalchemy.engines [req-d76e0767-9f8d-49e8-89b9-9c31a8ad8498 - - - - -] SQL connection failed. -84 attempts left.

Not sure why the connection string was empty. Hiera seems to have the proper keys:
[root@overcloud-novacompute-0 nova]# hiera nova::database_connection
mysql+pymysql://nova:TqAmAKYCzgnn8adb3ZFcbFtZT@172.16.2.4/nova
[root@overcloud-novacompute-0 nova]# hiera nova::api_database_connection
mysql+pymysql://nova_api:TqAmAKYCzgnn8adb3ZFcbFtZT@172.16.2.4/nova_api
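
A minimal sketch of cross-checking the two and reconciling them by hand (illustrative only, not what the upgrade scripts do; it assumes crudini is installed and the usual nova.conf section/key layout):

  # What hiera thinks the connection strings should be
  hiera nova::database_connection
  hiera nova::api_database_connection

  # What nova.conf actually carries (empty here, which is the problem)
  crudini --get /etc/nova/nova.conf database connection || echo "<unset>"

  # One-off manual fix, then restart the conductor on the node that runs it
  crudini --set /etc/nova/nova.conf database connection "$(hiera nova::database_connection)"
  crudini --set /etc/nova/nova.conf api_database connection "$(hiera nova::api_database_connection)"
  systemctl restart openstack-nova-conductor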

Revision history for this message
Michele Baldessari (michele) wrote :

Ok, so the reason for this failure (for the M->N upgrade case at least; it does not cover the L->M case, which is likely due to rabbit being down?) is that the upgrade scripts have a bug: openstack-nova-conductor does not actually get started on the controllers, and so openstack-nova-compute on the compute nodes waits forever for the conductor to show up in the rabbitmq queues.

Changed in tripleo:
importance: High → Critical
milestone: none → newton-rc3
assignee: nobody → Michele Baldessari (michele)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/384667

Changed in tripleo:
status: Confirmed → In Progress
tags: added: newton-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/384912

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/384667
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=30a570a7f461a4c4411d9cde67d8de9042952016
Submitter: Jenkins
Branch: master

commit 30a570a7f461a4c4411d9cde67d8de9042952016
Author: Michele Baldessari <email address hidden>
Date: Mon Oct 10 21:18:26 2016 +0200

    Actually start the systemd services in step3 of the major-upgrade step

    We have the following function in the upgrade process after we updated
    the packages and called the db-sync commands:
    services=$(services_to_migrate)
    ...
    for service in $(services); do
        manage_systemd_service start "${service%%-clone}"
        check_resource_systemd "${service%%-clone}" started 600
    done

    The above is broken because $services contains a list of services to
    start, so $(services) will return gibberish and the for loop will never
    execute anything.

    One of the symptoms for this is the openstack-nova-compute service not
    restarting on the compute nodes during the yum -y upgrade. The reason
    for this is that during the service restart, nova-compute waits for
    nova-conductor to show up in the rabbitmq queues, which cannot happen
    since the service was actually never started.

    Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3
    Closes-Bug: #1630580
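
In other words, the loop iterates over the output of a command named "services" (which does not exist) instead of over the $services variable, so nothing is ever started. The corrected loop is presumably along these lines (a sketch; see the review above for the exact change):

    services=$(services_to_migrate)
    ...
    for service in $services; do
        manage_systemd_service start "${service%%-clone}"
        check_resource_systemd "${service%%-clone}" started 600
    done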

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
Harry Rybacki (hrybacki-h) wrote :

@michele - how do we want to track the bug for Liberty now that we have a fix released for Newton?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/newton)

Reviewed: https://review.openstack.org/384912
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=c1eedd1825d2cb7c4ff17674a7f8ef63c2534174
Submitter: Jenkins
Branch: stable/newton

commit c1eedd1825d2cb7c4ff17674a7f8ef63c2534174
Author: Michele Baldessari <email address hidden>
Date: Mon Oct 10 21:18:26 2016 +0200

    Actually start the systemd services in step3 of the major-upgrade step

    We have the following function in the upgrade process after we updated
    the packages and called the db-sync commands:
    services=$(services_to_migrate)
    ...
    for service in $(services); do
        manage_systemd_service start "${service%%-clone}"
        check_resource_systemd "${service%%-clone}" started 600
    done

    The above is broken because $services contains a list of services to
    start, so $(services) will return gibberish and the for loop will never
    execute anything.

    One of the symptoms for this is the openstack-nova-compute service not
    restarting on the compute nodes during the yum -y upgrade. The reason
    for this is that during the service restart, nova-compute waits for
    nova-conductor to show up in the rabbitmq queues, which cannot happen
    since the service was actually never started.

    Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3
    Closes-Bug: #1630580
    (cherry picked from commit 30a570a7f461a4c4411d9cde67d8de9042952016)

tags: added: in-stable-newton
Revision history for this message
Michele Baldessari (michele) wrote :

@Harry do we have (or can we get) any logs for your issue? I still suspect something went wrong with rabbit in your case (the same code we fixed in newton does not exist in L->M), but I'd need some confirmation.

Revision history for this message
Harry Rybacki (hrybacki-h) wrote :

@Michele, we sure can. We've got the problem reproducing in CI via one of our periodic jobs[1] -- console output. However, the failed overcloud deploy is botching log collection.

I'll run the job locally, fetch logs, and add another comment. Are there any logs/commands that would specifically help you?

Revision history for this message
Harry Rybacki (hrybacki-h) wrote :
Revision history for this message
Michele Baldessari (michele) wrote :

Thanks Harry. Ideally we could get sosreports from the 3 controllers and the compute node that is stuck. If that is too heavy to get, we can probably confirm/dispel my theory with /var/log from the 3 controllers and the compute node, plus a "pcs status" from one controller.
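
Roughly, the commands that would produce that (a sketch; run as root and adjust to your node names):

  # On each controller and on the stuck compute node:
  sosreport --batch --all-logs
  # or, as a lighter-weight alternative:
  tar czf /tmp/$(hostname)-var-log.tar.gz /var/log

  # From any one controller:
  pcs status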

Revision history for this message
Harry Rybacki (hrybacki-h) wrote :

Michele, I've tar'd a lot of info from the undercloud, compute, and control (no HA setup) nodes on my fileserver[1] -- they are pretty large. If you have issues reaching it let me know and I'll find another way to host them for you.

[1] - http://file.rdu.redhat.com/hrybacki/lp_1630580/

Revision history for this message
Adriano Petrich (apetrich) wrote :

So I can reproduce this on an L->M deployment, but I'm seeing a failure before the non-controller update.

The 'execute major upgrade' controller step is failing, but the quickstart script does not check for that before doing the compute upgrade.

2016-10-12 18:06:19 [0]: CREATE_FAILED Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
Deployment failed: Heat Stack update failed.
2016-10-12 18:06:20 [overcloud-UpdateWorkflow-s2oodedta44w-ControllerPacemakerUpgradeDeployment_Step2-rpqeo34d6xra]: CREATE_FAILED Resource CREATE failed: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-10-12 18:06:22 [ControllerPacemakerUpgradeDeployment_Step2]: CREATE_FAILED Error: resources.ControllerPacemakerUpgradeDeployment_Step2.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-10-12 18:06:23 [overcloud-UpdateWorkflow-s2oodedta44w]: UPDATE_FAILED Error: resources.ControllerPacemakerUpgradeDeployment_Step2.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-10-12 18:06:23 [0]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:24 [UpdateWorkflow]: UPDATE_FAILED resources.UpdateWorkflow: Error: resources.ControllerPacemakerUpgradeDeployment_Step2.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-10-12 18:06:24 [0]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:25 [overcloud]: UPDATE_FAILED resources.UpdateWorkflow: Error: resources.ControllerPacemakerUpgradeDeployment_Step2.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-10-12 18:06:25 [0]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:26 [0]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:27 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:28 [0]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:29 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:30 [0]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_FAILED

Looking at the controller logs I can see:

File "/usr/bin/neutron-db-manage", line 10, in <module>
sys.exit(main())
File "/usr/lib/python2.7/site-packages/neutron/db/migration/cli.py", line 750, in main
return_val |= bool(CONF.command.func(config, CONF.command.name))
File "/usr/lib/python2.7/site-packages/neutron/db/migration/cli.py", line 224, in do_upgrade
run_sanity_checks(config, revision)
File "/usr/lib/python2.7/site-packages/neutron/db/migration/cli.py", line 732, in run_sanity_checks
script_dir.run_env()
File "/usr/lib/python2.7/site-packages/alembic/script/base.py", line 407, in run_env
util.load_python_file(self.dir, 'env.py')
File "/usr/lib/python2.7/site-packages/alembic/util/pyfiles.py", line 93, in load_python_file
module = load_module_py(module_id, path)
File "/usr/lib/python2.7/site-packages/alembic/util/compat.py", line 79, in load_module_py
mod = imp.load_source(module_id, path, fp)
File "/usr/lib/python2.7/site-packages/networking_cisco/db/migration/alembic_migrations/env.py", line 28, in <module>
from networking_...


Revision history for this message
Adriano Petrich (apetrich) wrote :

I'm pretty sure what is missing is this: https://review.openstack.org/#/c/385413/

However, I don't see a stable/mitaka branch for it.

the one installed is:
Name        : python-networking-cisco
Arch        : noarch
Epoch       : 1
Version     : 3.2.1
Release     : 0.20161006005841.b210da8.el7.centos
Size        : 6.2 M
Repo        : installed
From repo   : delorean
Summary     : Cisco OpenStack Neutron driver
URL         : https://pypi.python.org/pypi/networking-cisco
License     : ASL 2.0
Description : This package contains Cisco networking driver for OpenStack Neutron.

and does not have that patch
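
A quick way to check whether an installed build carries a given change (a sketch; the changelog text depends on how the package was built, so grepping the shipped source for the problematic import is the more reliable check):

  rpm -q python-networking-cisco
  rpm -q --changelog python-networking-cisco | head -n 30
  # The import that ends up breaking neutron-db-manage (visible in the fuller traceback below):
  grep -n 'from neutron.tests.common import base' \
      /usr/lib/python2.7/site-packages/networking_cisco/backwards_compatibility.py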

Revision history for this message
Michele Baldessari (michele) wrote :

Thanks Harry, I got the files.

You have the same issue Adriano is seeing:
messages:Oct 12 14:38:43 localhost os-collect-config: te-packages/neutron/db/migration/cli.py\", line 750, in main\n return_val |= bool(CONF.command.func(config, CONF.command.name))\n File \"/usr/lib/python2.7/site-packages/neutron/db/migration/cli.py\", line 224, in do_upgrade\n run_sanity_checks(config, revision)\n File \"/usr/lib/python2.7/site-packages/neutron/db/migration/cli.py\", line 732, in run_sanity_checks\n script_dir.run_env()\n File \"/usr/lib/python2.7/site-packages/alembic/script/base.py\", line 407, in run_env\n util.load_python_file(self.dir, 'env.py')\n File \"/usr/lib/python2.7/site-packages/alembic/util/pyfiles.py\", line 93, in load_python_file\n module = load_module_py(module_id, path)\n File \"/usr/lib/python2.7/site-packages/alembic/util/compat.py\", line 79, in load_module_py\n mod = imp.load_source(module_id, path, fp)\n File \"/usr/lib/python2.7/site-packages/networking_cisco/db/migration/alembic_migrations/env.py\", line 28, in <module>\n from networking_cisco.db.migration.models import head # noqa\n File \"/usr/lib/python2.7/site-packages/networking_cisco/db/migration/models/head.py\", line 19, in <module>\n from networking_cisco.plugins.cisco.db.l3 import ( # noqa\n File \"/usr/lib/python2.7/site-packages/networking_cisco/plugins/cisco/db/l3/ha_db.py\", line 38, in <module>\n from networking_cisco import backwards_compatibility as bc_attr\n File \"/usr/lib/python2.7/site-packages/networking_cisco/backwards_compatibility.py\", line 17, in <module>\n from neutron.tests.common import base\n File \"/usr/lib/python2.7/site-packages/neutron/tests/common/base.py\", line 17, in <module>\n from oslo_db.sqlalchemy import test_base\n File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/test_base.py\", line 23, in <module>\n raise NameError('Oslotest is not installed. Please add oslotest in your'\nNameError: Oslotest is not installed. Please add oslotest in your test-requirements\n", "deploy_status_code": 1}
messages:Oct 12 14:38:43 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/networking_cisco/db/migration/alembic_migrations/env.py", line 28, in <module>
messages:Oct 12 14:38:43 localhost os-collect-config: from networking_cisco.db.migration.models import head # noqa
messages:Oct 12 14:38:43 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/networking_cisco/db/migration/models/head.py", line 19, in <module>
messages:Oct 12 14:38:43 localhost os-collect-config: from networking_cisco.plugins.cisco.db.l3 import ( # noqa
messages:Oct 12 14:38:43 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/networking_cisco/plugins/cisco/db/l3/ha_db.py", line 38, in <module>
messages:Oct 12 14:38:43 localhost os-collect-config: from networking_cisco import backwards_compatibility as bc_attr
messages:Oct 12 14:38:43 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/networking_cisco/backwards_compatibility.py", line 17, in <module>

Adriano is going to file a new separate bug to track the needed change in RDO.
Let me know if you have an...


Revision history for this message
mathieu bultel (mat-bultel) wrote :

Emilien proposed a fix that has been merged:
https://bugs.launchpad.net/tripleo/+bug/1631012

So the networking-cisco issue is resolved.

I'm testing the compute upgrade atm, to see if something goes wrong.

Btw, we need to move the compute upgrade out of the if statement in the major upgrade script in the ansible role. Just check the overcloud stack status after the controller upgrade and, if it's ok, upgrade the computes.
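
Something along these lines in the role (a sketch; it assumes the openstack client's heat plugin is available on the undercloud, older setups may need the heat CLI instead, and the stack name/status string may differ per environment):

  # Only touch the computes if the controller upgrade left the stack healthy
  status=$(openstack stack show overcloud -f value -c stack_status)
  if [ "${status}" = "UPDATE_COMPLETE" ]; then
      upgrade-non-controller.sh --upgrade <compute-uuid>
  else
      echo "Controller upgrade failed (stack status: ${status}), skipping compute upgrade" >&2
      exit 1
  fi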

Revision history for this message
mathieu bultel (mat-bultel) wrote :
Revision history for this message
Adriano Petrich (apetrich) wrote :

I've moved this into a different bug https://bugs.launchpad.net/tripleo/+bug/1632988

Revision history for this message
Michele Baldessari (michele) wrote :

Harry, I subscribed you to the cisco rpm bug that is affecting you: https://bugs.launchpad.net/tripleo/+bug/1632988

Revision history for this message
mathieu bultel (mat-bultel) wrote :

I got a successful upgrade.
So I think we can close this bug, since the review above resolves the issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 5.0.0.0rc3

This issue was fixed in the openstack/tripleo-heat-templates 5.0.0.0rc3 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 5.0.0

This issue was fixed in the openstack/tripleo-heat-templates 5.0.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 6.0.0.0b1

This issue was fixed in the openstack/tripleo-heat-templates 6.0.0.0b1 development milestone.
