Standard: Unable to configure worker/controller-1 node after ansible deployment

Bug #1829260 reported by Yosief Gebremariam
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Ovidiu Poncea

Bug Description

Brief Description
-----------------

In a 2+3 StarlingX deployment, controller-0 was booted and configured using Ansible. The remaining nodes (controller-1 and 3 workers) were booted and all became online. Attempting to configure the worker and controller-1 nodes failed because of missing sysinv inventory data for these nodes.

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | controller-1 | controller  | locked         | disabled    | online       |
| 3  | compute-0    | worker      | locked         | disabled    | online       |
| 4  | compute-1    | worker      | locked         | disabled    | online       |
| 5  | compute-2    | worker      | locked         | disabled    | online       |
+----+--------------+-------------+----------------+-------------+--------------+

[wrsroot@controller-0 ~(keystone_admin)]$ system host-disk-list controller-1

[wrsroot@controller-0 ~(keystone_admin)]$ system host-disk-list compute-0

[wrsroot@controller-0 log(keystone_admin)]$ system host-if-list controller-1

Severity
--------
Critical

Steps to Reproduce
------------------
1. Attempt a Standard 2+2 or 2+3 deployment using Ansible
2. Boot controller-0 and configure it using Ansible
3. Configure the controller-0 interfaces (mgmt and cluster_host networks)
4. Unlock controller-0
5. Boot the remaining nodes
6. Attempt to configure the worker and controller-1 nodes

Expected Behavior
------------------
Deployment successful

Actual Behavior
----------------
Configuring the worker and standby controller nodes failed because of missing inventory data for these nodes

Reproducibility
---------------
100% reproducible.

System Configuration
--------------------
Multi-node system, Dedicated storage BM

Branch/Pull Time/Commit
-----------------------

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190515T013000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="101"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-15 01:30:00 +0000"

Timestamp/Logs
--------------
The following error is logged in /var/log/sysinv.log:

2019-05-15 11:38:25.339 89701 ERROR sysinv.openstack.common.rpc.common [req-80477dd9-7c38-47f3-b755-a3881b75261b None None] ['Traceback (most recent call last):\n', ' File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/amqp.py", line 438, in _process_data\n **args)\n', ' File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', ' File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 2957, in idisk_update_by_ihost\n ihost = self.dbapi.ihost_get(ihost_uuid)\n', ' File "/usr/lib64/python2.7/site-packages/sysinv/objects/__init__.py", line 100, in wrapper\n result = fn(*args, **kwargs)\n', ' File "/usr/lib64/python2.7/site-packages/sysinv/db/sqlalchemy/api.py", line 1239, in ihost_get\n return self._host_get(server)\n', ' File "/usr/lib64/python2.7/site-packages/sysinv/db/sqlalchemy/api.py", line 1218, in _host_get\n return query.one()\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2814, in one\n ret = self.one_or_none()\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2784, in one_or_none\n ret = list(self)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2855, in __iter__\n return self._execute_and_instances(context)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2876, in _execute_and_instances\n close_with_result=True)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2885, in _get_bind_args\n **kw\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2867, in _connection_from_session\n conn = self.session.connection(**kw)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 998, in connection\n execution_options=execution_options)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1005, in _connection_for_bind\n conn = engine.contextual_connect(**kw)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2112, in contextual_connect\n self._wrap_pool_connect(self.pool.connect, None),\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2151, in _wrap_pool_connect\n e, dialect, self)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1461, in _handle_dbapi_exception_noconnection\n util.raise_from_cause(newraise, exc_info)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause\n reraise(type(exception), exception, tb=exc_tb, cause=cause)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect\n return fn()\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 387, in connect\n return _ConnectionFairy._checkout(self)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 766, in _checkout\n fairy = _ConnectionRecord.checkout(pool)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 521, in checkout\n rec.checkin()\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__\n compat.reraise(exc_type, exc_value, exc_tb)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 518, in checkout\n dbapi_connection = rec.get_connection()\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 607, in get_connection\n 
self.__connect()\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 651, in __connect\n connection = pool._invoke_creator(self)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 105, in connect\n return dialect.connect(*cargs, **cparams)\n', ' File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 393, in connect\n return self.dbapi.connect(*cargs, **cparams)\n', ' File "/usr/lib64/python2.7/site-packages/psycopg2/__init__.py", line 164, in connect\n conn = _connect(dsn, connection_factory=connection_factory, async=async)\n', ' File "/usr/lib/python2.7/site-packages/eventlet/support/psycopg2_patcher.py", line 46, in eventlet_wait_callback\n state = conn.poll()\n', 'DBConnectionError: (psycopg2.OperationalError) could not connect to server: Connection refused\n\tIs the server running on host "localhost" (127.0.0.1) and accepting\n\tTCP/IP connections on port 5432?\n\n']
2019-05-15 11:38:25.343 80939 ERROR sysinv.agent.manager [-] Sysinv Agent exception updating idisk conductor.
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager Traceback (most recent call last):
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/agent/manager.py", line 1204, in agent_audit
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager idisk)
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/conductor/rpcapi.py", line 352, in idisk_update_by_ihost
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager idisk_dict_array=idisk_dict_array))
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/proxy.py", line 121, in call
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager result = rpc.call(context, real_topic, msg, timeout)
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/__init__.py", line 141, in call
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager return _get_impl().call(CONF, context, topic, msg, timeout)
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/impl_kombu.py", line 799, in call
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager rpc_amqp.get_connection_pool(conf, Connection))
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/amqp.py", line 621, in call
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager rv = list(rv)
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/amqp.py", line 570, in __iter__
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager raise result # pylint: disable=raising-bad-type
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager RemoteError: Remote error: DBConnectionError (psycopg2.OperationalError) could not connect to server: Connection refused
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager Is the server running on host "localhost" (127.0.0.1) and accepting
2019-05-15 11:38:25.343 80939 TRACE sysinv.agent.manager TCP/IP connections on port 5432?
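The "Connection refused" at the end of this traceback means that, at the time of the agent audit, nothing was accepting TCP connections on 127.0.0.1:5432, the postgres backend that sysinv's DB layer connects to via psycopg2. A minimal illustrative reachability check (not part of the original report, and not a sysinv tool) would be:

import socket

# Probe the postgres port that the sysinv DB layer fails to reach in the log above.
# Hypothetical diagnostic sketch, for illustration only.
def postgres_reachable(host="127.0.0.1", port=5432, timeout=2.0):
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except socket.error:
        # "Connection refused" in the traceback corresponds to ECONNREFUSED here.
        return False

print(postgres_reachable())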

Ghada Khalil (gkhalil)
tags: added: stx.config
Changed in starlingx:
importance: Undecided → High
assignee: nobody → Tee Ngo (teewrs)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Tee to triage as she is the prime for Ansible. She can re-route if needed.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; high priority as Ansible deployment is intended to be the default configuration engine for StarlingX.

Based on input from the test team, adding the stx.sanity tag as well.

tags: added: stx.sanity
tags: added: stx.2.0
Revision history for this message
Ghada Khalil (gkhalil) wrote :
Changed in starlingx:
assignee: Tee Ngo (teewrs) → Ovidiu Poncea (ovidiu.poncea)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/658391
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=9720932899b69287871a419422880f04d618286f
Submitter: Zuul
Branch: master

commit 9720932899b69287871a419422880f04d618286f
Author: Ovidiu Poncea <email address hidden>
Date: Fri May 10 17:46:27 2019 +0300

    Fix missing reboot flag for config uuid on unlock

    Due to a limitation in config uuid functionality, on first unlock
    of controller-0 the node remains config-out-of-date, as we lose
    the reboot flag.

    Example output after unlock:
    $ system host-show controller-0 | grep config
    | config_applied | 62228cc1-e5da-4f2e-a3c3-c468e9a46fb5 |
    | config_status | Config out-of-date |
    | config_target | e2228cc1-e5da-4f2e-a3c3-c468e9a46fb5 |

    The reboot flag is:
    CONFIG_REBOOT_REQUIRED = (1 << 127)

    We set config_target through sysinv and config_applied
    through puppet once manifests have been applied. If the reboot
    flag in config_target is set but not in config_applied, we are
    "Config out-of-date".

    On host-unlock or runtime manifest apply we set config_uuid in
    hieradata to e.g.:
    platform::config::params::config_uuid: \
       62228cc1-e5da-4f2e-a3c3-c468e9a46fb5

    Then, after runtime manifest apply or after reboot, sysinv-agent
    takes this value and updates config_applied.

    A config uuid with the reboot flag is passed to puppet ONLY when
    the host is unlocked (which makes sense, as this is when we do
    the reboot). Runtime manifests don't pass the reboot flag to
    puppet (they are runtime applies; the reboot flag has to remain).
    So, in our case, the flag is correctly set at unlock, but then
    sysinv does a runtime manifest apply and resets the config uuid
    to a value without the reboot flag. The reboot flag is therefore
    no longer set, which is why even after unlock we still have
    Config out-of-date.

    To fix the issue we generate a new config_uuid with the reboot
    flag set and we properly send it to puppet as the last operation
    we attempt before reboot.

    Change-Id: I12865d45f4456de81d72689f799441531a444bea
    Closes-Bug: #1828271
    Closes-Bug: #1829004
    Closes-Bug: #1829260
    Signed-off-by: Ovidiu Poncea <email address hidden>

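As an aside, not part of the bug record, the reboot-flag mechanism the commit describes can be illustrated with a minimal Python sketch: the config uuid is treated as a 128-bit value whose top bit is CONFIG_REBOOT_REQUIRED, and a host stays "Config out-of-date" while config_target carries that bit but config_applied does not. The helper names below are hypothetical and are not the sysinv implementation:

import uuid

# CONFIG_REBOOT_REQUIRED = (1 << 127): the top bit of the 128-bit config uuid,
# as quoted in the commit message above.
CONFIG_REBOOT_REQUIRED = 1 << 127

def set_reboot_flag(config_uuid):
    # Return a copy of config_uuid with the reboot-required bit set
    # (hypothetical helper, for illustration only).
    return str(uuid.UUID(int=uuid.UUID(config_uuid).int | CONFIG_REBOOT_REQUIRED))

def reboot_pending(config_applied, config_target):
    # "Config out-of-date": the target uuid still carries the reboot bit
    # that the applied uuid lacks.
    applied = uuid.UUID(config_applied).int
    target = uuid.UUID(config_target).int
    return bool(target & CONFIG_REBOOT_REQUIRED) and not (applied & CONFIG_REBOOT_REQUIRED)

# With the uuids from the commit's example output, config_target ("e2228...")
# carries the flag while config_applied ("62228...") does not:
print(reboot_pending("62228cc1-e5da-4f2e-a3c3-c468e9a46fb5",
                     "e2228cc1-e5da-4f2e-a3c3-c468e9a46fb5"))  # True

Under that reading, the fix in the commit generates a flagged config uuid and sends it to puppet as the last operation before the reboot, so the flag survives later runtime manifest applies.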
Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Yosief Gebremariam (ygebrema) wrote :

The issue was not observed in recent standard and storage system installs, which confirms that the issue has been resolved.

Changed in starlingx:
status: Fix Released → Confirmed
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Confirmed → Fix Released
tags: removed: stx.retestneeded