upgrade-non-controller.sh getting stuck cleaning up openstack-nova-compute package

Bug #1630580 reported by Harry Rybacki
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Michele Baldessari
Milestone: newton-rc3

Bug Description

Attempting to do a major upgrade (Liberty -> Mitaka). During `upgrade-non-controller.sh` execution, the process seems to get stuck during the yum update on a specific package:

  + echo 'get compute uuid and update compute'
  get compute uuid and update compute
  ++ nova list
  ++ grep compute
  ++ awk '{ print $2; }'
  + for compute in '$(nova list | grep compute | awk '\''{ print $2; }'\'')'
  + echo 'Run upgrade on 629e3a10-0896-4eff-81cd-c856adc6c951'
  Run upgrade on 629e3a10-0896-4eff-81cd-c856adc6c951
  + /bin/upgrade-non-controller.sh --upgrade 629e3a10-0896-4eff-81cd-c856adc6c951
  Tue Oct 4 19:23:50 UTC 2016 upgrade-non-controller.sh nova node 629e3a10-0896-4eff-81cd-c856adc6c951 found with status ACTIVE
  Tue Oct 4 19:23:53 UTC 2016 upgrade-non-controller.sh checking for upgrade script /root/tripleo_upgrade_node.sh on node 629e3a10-0896-4eff-81cd-c856adc6c951 (192.0.2.7)
  Warning: Permanently added '192.0.2.7' (ECDSA) to the list of known hosts.^M
  Tue Oct 4 19:23:54 UTC 2016 upgrade-non-controller.sh upgrade script /root/tripleo_upgrade_node.sh found on node 629e3a10-0896-4eff-81cd-c856adc6c951 (192.0.2.7)
  Tue Oct 4 19:23:57 UTC 2016 upgrade-non-controller.sh Executing /root/tripleo_upgrade_node.sh on 192.0.2.7
  Loaded plugins: fastestmirror, priorities
  Determining fastest mirrors
   * base: mirror.cogentco.com
   * extras: mirror.net.cen.ct.gov
   * updates: mirror.vcu.edu
  771 packages excluded due to repository priority protections
  Package python-zaqarclient is obsoleted by python2-zaqarclient, trying to install python2-zaqarclient-1.0.1-0.20160913004709.b040bbb.el7.centos.noarch instead
  Resolving Dependencies

  <snip>

  Cleanup : 1:python-networking-cisco-2.1.2-0.20160826004345.e4b8a 188/308
  Cleanup : 1:openstack-nova-api-12.0.5-0.20160923003400.f680157.e 189/308
  Cleanup : 1:openstack-nova-compute-12.0.5-0.20160923003400.f6801 190/308

Deployed via TripleO-Quickstart, mimicking a ci.centos periodic job[1] that is getting killed while stuck at the same point of the upgrade. Logs were not collected because Jenkins killed the job before collection, so it is not known whether it is getting stuck cleaning up the same package.

[1] - https://ci.centos.org/view/rdo/view/tripleo-periodic/job/tripleo-quickstart-upgrade-major-liberty-to-mitaka/137/

Revision history for this message
Michele Baldessari (michele) wrote :

Have you checked that rabbit/galera on the controllers are correctly up and running when this script runs on the compute nodes?

openstack-nova-compute gets upgraded, and somewhere along the way the rpm %post scriptlet runs a "systemctl try-restart openstack-nova-compute". Since the service was running it gets restarted, but the restart takes an unusually long time when rabbit on the controller is not reachable.

I have seen this phenomenon happen quite a few times, so I thought I'd mention it here.
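
A quick sanity check for that scenario, from the compute node, is something along these lines (a sketch only; substitute your controller/VIP addresses, the 172.16.2.4 below is just an example):

  # 5672 is rabbitmq, 3306 is galera/mysql; both must answer from the compute node
  for host in 172.16.2.4; do
      for port in 5672 3306; do
          timeout 5 bash -c "cat < /dev/null > /dev/tcp/${host}/${port}" \
              && echo "${host}:${port} reachable" \
              || echo "${host}:${port} NOT reachable"
      done
  done

  # And on a controller, the cluster view of both services:
  pcs status | grep -E 'rabbitmq|galera'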

Revision history for this message
Michele Baldessari (michele) wrote :

There might be another scenario that kicks in. Namely this:
From /var/log/nova/nova-compute.log:
2016-10-10 11:25:55.580 28911 WARNING nova.conductor.api [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...

tags: added: upgrade
Changed in tripleo:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Michele Baldessari (michele) wrote :

So I reproduced this on an M->N upgrade when upgrading the compute nodes:
2016-10-10 11:12:15.417 28911 WARNING oslo_config.cfg [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Option "rabbit_virtual_host" from group "oslo_messaging_rabbit" is deprecated for removal. Its value may be silently ignored in the future.
2016-10-10 11:12:15.417 28911 WARNING oslo_config.cfg [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Option "rabbit_hosts" from group "oslo_messaging_rabbit" is deprecated for removal. Its value may be silently ignored in the future.
2016-10-10 11:12:15.418 28911 WARNING oslo_config.cfg [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Option "rabbit_userid" from group "oslo_messaging_rabbit" is deprecated for removal. Its value may be silently ignored in the future.
2016-10-10 11:12:15.418 28911 WARNING oslo_config.cfg [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Option "rabbit_password" from group "oslo_messaging_rabbit" is deprecated for removal. Its value may be silently ignored in the future.
2016-10-10 11:12:25.493 28911 WARNING nova.conductor.api [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...
....
2016-10-10 12:10:55.791 28911 WARNING nova.conductor.api [req-8bde1808-4b3d-48d3-80e7-e5a077ad5297 - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...

After 1 hour I just killed the "systemctl try-restart openstack-nova-compute" process and the upgrade proceeded normally.
What I found particularly odd is that starting openstack-nova-conductor by hand in another terminal on the compute node did not unblock the situation.
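
For reference, the stuck restart can be spotted and cleared roughly like this (a sketch of the manual workaround, not a proper fix; the service still needs a clean restart once the conductor is reachable):

  # Find the rpm %post scriptlet hanging on the restart...
  pgrep -af 'try-restart openstack-nova-compute'
  # ...and kill it so the yum transaction can continue
  pkill -f 'try-restart openstack-nova-compute'
  # Once the controllers/conductor are healthy again:
  systemctl restart openstack-nova-compute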

Revision history for this message
Michele Baldessari (michele) wrote :

So the reason for this failure in my case *seems* to be the following:
[root@overcloud-novacompute-0 nova]# grep -E "^connection" /etc/nova/nova.conf
[root@overcloud-novacompute-0 nova]#

And so nova-conductor could never really start properly:
2016-10-10 13:10:55.167 21150 WARNING oslo_db.sqlalchemy.engines [req-d76e0767-9f8d-49e8-89b9-9c31a8ad8498 - - - - -] SQL connection failed. -82 attempts left.
2016-10-10 13:11:05.171 21150 WARNING oslo_db.sqlalchemy.engines [req-d76e0767-9f8d-49e8-89b9-9c31a8ad8498 - - - - -] SQL connection failed. -83 attempts left.
2016-10-10 13:11:15.178 21150 WARNING oslo_db.sqlalchemy.engines [req-d76e0767-9f8d-49e8-89b9-9c31a8ad8498 - - - - -] SQL connection failed. -84 attempts left.

Not sure why the connection string was empty. Hiera seems to have the proper keys:
[root@overcloud-novacompute-0 nova]# hiera nova::database_connection
mysql+pymysql://nova:TqAmAKYCzgnn8adb3ZFcbFtZT@172.16.2.4/nova
[root@overcloud-novacompute-0 nova]# hiera nova::api_database_connection
mysql+pymysql://nova_api:TqAmAKYCzgnn8adb3ZFcbFtZT@172.16.2.4/nova_api
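
A minimal sketch of cross-checking the two and reconciling them by hand (illustrative only, not what the upgrade scripts do; it assumes crudini is installed and the usual nova.conf section/key layout):

  # What hiera thinks the connection strings should be
  hiera nova::database_connection
  hiera nova::api_database_connection

  # What nova.conf actually carries (empty here, which is the problem)
  crudini --get /etc/nova/nova.conf database connection || echo "<unset>"

  # One-off manual fix, then restart the conductor on the node that runs it
  crudini --set /etc/nova/nova.conf database connection "$(hiera nova::database_connection)"
  crudini --set /etc/nova/nova.conf api_database connection "$(hiera nova::api_database_connection)"
  systemctl restart openstack-nova-conductor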

Revision history for this message
Michele Baldessari (michele) wrote :

Ok, so the reason for this failure (for the M->N upgrade case at least; it does not cover the L->M case, which is likely due to rabbit being down?) is that the upgrade scripts have a bug: openstack-nova-conductor does not actually get started on the controllers, and so openstack-nova-compute on the compute nodes waits forever for the conductor to show up in the rabbitmq queues.

Changed in tripleo:
importance: High → Critical
milestone: none → newton-rc3
assignee: nobody → Michele Baldessari (michele)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/384667

Changed in tripleo:
status: Confirmed → In Progress
tags: added: newton-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/384912

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/384667
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=30a570a7f461a4c4411d9cde67d8de9042952016
Submitter: Jenkins
Branch: master

commit 30a570a7f461a4c4411d9cde67d8de9042952016
Author: Michele Baldessari <email address hidden>
Date: Mon Oct 10 21:18:26 2016 +0200

    Actually start the systemd services in step3 of the major-upgrade step

    We have the following function in the upgrade process after we updated
    the packages and called the db-sync commands:
    services=$(services_to_migrate)
    ...
    for service in $(services); do
        manage_systemd_service start "${service%%-clone}"
        check_resource_systemd "${service%%-clone}" started 600
    done

    The above is broken because $services contains a list of services to
    start, so $(services) will return gibberish and the for loop will never
    execute anything.

    One of the symptoms for this is the openstack-nova-compute service not
    restarting on the compute nodes during the yum -y upgrade. The reason
    for this is that during the service restart, nova-compute waits for
    nova-conductor to show up in the rabbitmq queues, which cannot happen
    since the service was actually never started.

    Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3
    Closes-Bug: #1630580
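
In other words, the loop iterates over the output of a command named "services" (which does not exist) instead of over the $services variable, so nothing is ever started. The corrected loop is presumably along these lines (a sketch; see the review above for the exact change):

    services=$(services_to_migrate)
    ...
    for service in $services; do
        manage_systemd_service start "${service%%-clone}"
        check_resource_systemd "${service%%-clone}" started 600
    done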

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
Harry Rybacki (hrybacki-h) wrote :

@michele - how do we want to track the bug for Liberty now that we have a fix released for Newton?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/newton)

Reviewed: https://review.openstack.org/384912
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=c1eedd1825d2cb7c4ff17674a7f8ef63c2534174
Submitter: Jenkins
Branch: stable/newton

commit c1eedd1825d2cb7c4ff17674a7f8ef63c2534174
Author: Michele Baldessari <email address hidden>
Date: Mon Oct 10 21:18:26 2016 +0200

    Actually start the systemd services in step3 of the major-upgrade step

    We have the following function in the upgrade process after we updated
    the packages and called the db-sync commands:
    services=$(services_to_migrate)
    ...
    for service in $(services); do
        manage_systemd_service start "${service%%-clone}"
        check_resource_systemd "${service%%-clone}" started 600
    done

    The above is broken because $services contains a list of services to
    start, so $(services) will return gibberish and the for loop will never
    execute anything.

    One of the symptoms for this is the openstack-nova-compute service not
    restarting on the compute nodes during the yum -y upgrade. The reason
    for this is that during the service restart, nova-compute waits for
    nova-conductor to show up in the rabbitmq queues, which cannot happen
    since the service was actually never started.

    Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3
    Closes-Bug: #1630580
    (cherry picked from commit 30a570a7f461a4c4411d9cde67d8de9042952016)

tags: added: in-stable-newton
Revision history for this message
Michele Baldessari (michele) wrote :

@Harry do we have (or can we get) any logs for your issue? I still suspect something went wrong with rabbit in your case (the same code we fixed in newton does not exist in L->M), but I'd need some confirmation.

Revision history for this message
Harry Rybacki (hrybacki-h) wrote :

@Michele, we sure can. We've got the problem reproducing in CI via one of our periodic jobs[1] -- console output. However, the failed overcloud deploy is botching log collection.

I'll run the job locally, fetch logs, and add another comment. Are there any logs/commands that would specifically help you?

Revision history for this message
Harry Rybacki (hrybacki-h) wrote :
Revision history for this message
Michele Baldessari (michele) wrote :

Thanks Harry. Ideally we could get sosreports from the 3 controllers and the compute node that is stuck. If that is too heavy to get, we can probably confirm/dispel my theory with /var/log from the 3 controllers and the compute node, plus a "pcs status" from one controller.
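
Roughly, the commands that would produce that (a sketch; run as root and adjust to your node names):

  # On each controller and on the stuck compute node:
  sosreport --batch --all-logs
  # or, as a lighter-weight alternative:
  tar czf /tmp/$(hostname)-var-log.tar.gz /var/log

  # From any one controller:
  pcs status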

Revision history for this message
Harry Rybacki (hrybacki-h) wrote :

Michele, I've tar'd a lot of info from the undercloud, compute, and control (no HA setup) nodes on my fileserver[1] -- they are pretty large. If you have issues reaching it let me know and I'll find another way to host them for you.

[1] - http://file.rdu.redhat.com/hrybacki/lp_1630580/

Revision history for this message
Adriano Petrich (apetrich) wrote :

So I can reproduce this on an L->M deployment, but I'm seeing a failure before the non-controller update.

The 'execute major upgrade' controller step is failing, but the quickstart script does not check for that before doing the compute upgrade.

2016-10-12 18:06:19 [0]: CREATE_FAILED Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
Deployment failed: Heat Stack update failed.
2016-10-12 18:06:20 [overcloud-UpdateWorkflow-s2oodedta44w-ControllerPacemakerUpgradeDeployment_Step2-rpqeo34d6xra]: CREATE_FAILED Resource CREATE failed: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-10-12 18:06:22 [ControllerPacemakerUpgradeDeployment_Step2]: CREATE_FAILED Error: resources.ControllerPacemakerUpgradeDeployment_Step2.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-10-12 18:06:23 [overcloud-UpdateWorkflow-s2oodedta44w]: UPDATE_FAILED Error: resources.ControllerPacemakerUpgradeDeployment_Step2.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-10-12 18:06:23 [0]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:24 [UpdateWorkflow]: UPDATE_FAILED resources.UpdateWorkflow: Error: resources.ControllerPacemakerUpgradeDeployment_Step2.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-10-12 18:06:24 [0]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:25 [overcloud]: UPDATE_FAILED resources.UpdateWorkflow: Error: resources.ControllerPacemakerUpgradeDeployment_Step2.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-10-12 18:06:25 [0]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:26 [0]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:27 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:28 [0]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:29 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
2016-10-12 18:06:30 [0]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_FAILED

Looking at the controller logs I can see:

File "/usr/bin/neutron-db-manage", line 10, in <module>
sys.exit(main())
File "/usr/lib/python2.7/site-packages/neutron/db/migration/cli.py", line 750, in main
return_val |= bool(CONF.command.func(config, CONF.command.name))
File "/usr/lib/python2.7/site-packages/neutron/db/migration/cli.py", line 224, in do_upgrade
run_sanity_checks(config, revision)
File "/usr/lib/python2.7/site-packages/neutron/db/migration/cli.py", line 732, in run_sanity_checks
script_dir.run_env()
File "/usr/lib/python2.7/site-packages/alembic/script/base.py", line 407, in run_env
util.load_python_file(self.dir, 'env.py')
File "/usr/lib/python2.7/site-packages/alembic/util/pyfiles.py", line 93, in load_python_file
module = load_module_py(module_id, path)
File "/usr/lib/python2.7/site-packages/alembic/util/compat.py", line 79, in load_module_py
mod = imp.load_source(module_id, path, fp)
File "/usr/lib/python2.7/site-packages/networking_cisco/db/migration/alembic_migrations/env.py", line 28, in <module>
from networking_...


Revision history for this message
Adriano Petrich (apetrich) wrote :

I'm pretty sure what is missing is this: https://review.openstack.org/#/c/385413/

However, I don't see a stable/mitaka branch for it.

the one installed is:
Name        : python-networking-cisco
Arch        : noarch
Epoch       : 1
Version     : 3.2.1
Release     : 0.20161006005841.b210da8.el7.centos
Size        : 6.2 M
Repo        : installed
From repo   : delorean
Summary     : Cisco OpenStack Neutron driver
URL         : https://pypi.python.org/pypi/networking-cisco
License     : ASL 2.0
Description : This package contains Cisco networking driver for OpenStack Neutron.

and does not have that patch
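
A quick way to check whether an installed build carries a given change (a sketch; the changelog text depends on how the package was built, so grepping the shipped source for the problematic import is the more reliable check):

  rpm -q python-networking-cisco
  rpm -q --changelog python-networking-cisco | head -n 30
  # The import that ends up breaking neutron-db-manage (visible in the fuller traceback below):
  grep -n 'from neutron.tests.common import base' \
      /usr/lib/python2.7/site-packages/networking_cisco/backwards_compatibility.py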

Revision history for this message
Michele Baldessari (michele) wrote :

Thanks Harry, I got the files.

You have the same issue Adriano is seeing:
messages:Oct 12 14:38:43 localhost os-collect-config: te-packages/neutron/db/migration/cli.py\", line 750, in main\n return_val |= bool(CONF.command.func(config, CONF.command.name))\n File \"/usr/lib/python2.7/site-packages/neutron/db/migration/cli.py\", line 224, in do_upgrade\n run_sanity_checks(config, revision)\n File \"/usr/lib/python2.7/site-packages/neutron/db/migration/cli.py\", line 732, in run_sanity_checks\n script_dir.run_env()\n File \"/usr/lib/python2.7/site-packages/alembic/script/base.py\", line 407, in run_env\n util.load_python_file(self.dir, 'env.py')\n File \"/usr/lib/python2.7/site-packages/alembic/util/pyfiles.py\", line 93, in load_python_file\n module = load_module_py(module_id, path)\n File \"/usr/lib/python2.7/site-packages/alembic/util/compat.py\", line 79, in load_module_py\n mod = imp.load_source(module_id, path, fp)\n File \"/usr/lib/python2.7/site-packages/networking_cisco/db/migration/alembic_migrations/env.py\", line 28, in <module>\n from networking_cisco.db.migration.models import head # noqa\n File \"/usr/lib/python2.7/site-packages/networking_cisco/db/migration/models/head.py\", line 19, in <module>\n from networking_cisco.plugins.cisco.db.l3 import ( # noqa\n File \"/usr/lib/python2.7/site-packages/networking_cisco/plugins/cisco/db/l3/ha_db.py\", line 38, in <module>\n from networking_cisco import backwards_compatibility as bc_attr\n File \"/usr/lib/python2.7/site-packages/networking_cisco/backwards_compatibility.py\", line 17, in <module>\n from neutron.tests.common import base\n File \"/usr/lib/python2.7/site-packages/neutron/tests/common/base.py\", line 17, in <module>\n from oslo_db.sqlalchemy import test_base\n File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/test_base.py\", line 23, in <module>\n raise NameError('Oslotest is not installed. Please add oslotest in your'\nNameError: Oslotest is not installed. Please add oslotest in your test-requirements\n", "deploy_status_code": 1}
messages:Oct 12 14:38:43 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/networking_cisco/db/migration/alembic_migrations/env.py", line 28, in <module>
messages:Oct 12 14:38:43 localhost os-collect-config: from networking_cisco.db.migration.models import head # noqa
messages:Oct 12 14:38:43 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/networking_cisco/db/migration/models/head.py", line 19, in <module>
messages:Oct 12 14:38:43 localhost os-collect-config: from networking_cisco.plugins.cisco.db.l3 import ( # noqa
messages:Oct 12 14:38:43 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/networking_cisco/plugins/cisco/db/l3/ha_db.py", line 38, in <module>
messages:Oct 12 14:38:43 localhost os-collect-config: from networking_cisco import backwards_compatibility as bc_attr
messages:Oct 12 14:38:43 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/networking_cisco/backwards_compatibility.py", line 17, in <module>

Adriano is going to file a new separate bug to track the needed change in RDO.
Let me know if you have an...


Revision history for this message
mathieu bultel (mat-bultel) wrote :

Emilien proposed a fix that has been merged:
https://bugs.launchpad.net/tripleo/+bug/1631012

So the networking-cisco issue is resolved.

I'm testing the compute upgrade atm, to see if something goes wrong.

Btw, we need to move the compute upgrade out of the if statement in the major upgrade script in the ansible role. Just check the overcloud stack status after the controller upgrade and, if it's ok, upgrade the computes.
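
Something along these lines in the role (a sketch; it assumes the openstack client's heat plugin is available on the undercloud, older setups may need the heat CLI instead, and the stack name/status string may differ per environment):

  # Only touch the computes if the controller upgrade left the stack healthy
  status=$(openstack stack show overcloud -f value -c stack_status)
  if [ "${status}" = "UPDATE_COMPLETE" ]; then
      upgrade-non-controller.sh --upgrade <compute-uuid>
  else
      echo "Controller upgrade failed (stack status: ${status}), skipping compute upgrade" >&2
      exit 1
  fi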

Revision history for this message
mathieu bultel (mat-bultel) wrote :
Revision history for this message
Adriano Petrich (apetrich) wrote :

I've moved this into a different bug https://bugs.launchpad.net/tripleo/+bug/1632988

Revision history for this message
Michele Baldessari (michele) wrote :

Harry, I subscribed you to the cisco rpm bug that is affecting you: https://bugs.launchpad.net/tripleo/+bug/1632988

Revision history for this message
mathieu bultel (mat-bultel) wrote :

I got a successful upgrade.
So I think we can close this bug, since the review above resolves the issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 5.0.0.0rc3

This issue was fixed in the openstack/tripleo-heat-templates 5.0.0.0rc3 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 5.0.0

This issue was fixed in the openstack/tripleo-heat-templates 5.0.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 6.0.0.0b1

This issue was fixed in the openstack/tripleo-heat-templates 6.0.0.0b1 development milestone.
