Backup & Restore: controller failed to unlock - Manifest apply timeout

Bug #1979851 reported by Heitor Matsui
This bug affects 1 person

Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Heitor Matsui

Bug Description

Brief Description
-----------------
After running the Ansible restore playbook, the controller failed to unlock.

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
- Install a simplex system with WRCP 22.06
- Run the backup Ansible playbook from controller-0
- Install a clean WRCP image on the system with wipedisk=false
- Run the restore Ansible playbook locally with the backup file saved above
- Unlock controller-0 (a representative command sequence is sketched below)
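
The commands below are a representative sketch of this sequence. The playbook paths follow the stock StarlingX layout, but the extra-vars and the backup filename are illustrative placeholders that depend on the release and the local setup:

# On controller-0: back up the platform (passwords passed as illustrative extra-vars)
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml \
    -e "ansible_become_pass=<sysadmin-password> admin_password=<admin-password>"

# After reinstalling with wipedisk=false: restore locally from the saved backup archive
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
    -e "initial_backup_dir=/home/sysadmin backup_filename=<platform-backup>.tgz \
        ansible_become_pass=<sysadmin-password> admin_password=<admin-password>"

# Then attempt the unlock that fails in this report
system host-unlock controller-0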

Expected Behavior
------------------
Controller-0 unlocks successfully after the restore

Actual Behavior
----------------
Controller-0 fails to unlock after the restore

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
master 2022-06-22

Last Pass
---------
N/A

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | locked         | disabled    | online       |
+----+--------------+-------------+----------------+-------------+--------------+

[sysadmin@controller-0 ~(keystone_admin)]$ system --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0
+------------------------+-----------------------------------------------------------------+
| Property | Value |
+------------------------+-----------------------------------------------------------------+
| action | none |
| administrative | locked |
| availability | online |
| bm_ip | None |
| bm_type | none |
| bm_username | None |
| boot_device | /dev/disk/by-path/pci-0000:00:1f.2-ata-5.0 |
| capabilities | {u'is_max_cpu_configurable': u'configurable', u'stor_function': |
| | u'monitor', u'Personality': u'Controller-Active'} |
| clock_synchronization | ntp |
| config_applied | 43208253-9d30-42b8-bee6-c5aa8cd541cf |
| config_status | None |
| config_target | 43208253-9d30-42b8-bee6-c5aa8cd541cf |
| console | ttyS0,115200n8 |
| created_at | 2022-06-22T20:41:02.251373+00:00 |
| device_image_update | None |
| hostname | controller-0 |
| id | 1 |
| install_output | text |
| install_state | None |
| install_state_info | None |
| inv_state | inventoried |
| invprovision | provisioned |
| location | {} |
| max_cpu_mhz_allowed | 2600 |
| max_cpu_mhz_configured | None |
| mgmt_ip | 192.168.204.2 |
| mgmt_mac | 00:00:00:00:00:00 |
| operational | disabled |
| personality | controller |
| reboot_needed | False |
| reserved | False |
| rootfs_device | /dev/disk/by-path/pci-0000:00:1f.2-ata-5.0 |
| serialid | None |
| software_load | 22.06 |
| subfunction_avail | offline |
| subfunction_oper | disabled |
| subfunctions | controller,worker |
| task | Manifest apply timeout ; Unlock to retry |
| tboot | false |
| ttys_dcd | None |
| updated_at | 2022-06-23T01:06:49.700171+00:00 |
| uptime | 4346 |
| uuid | 12991d90-755b-47de-b1c9-f57be736914e |
| vim_progress_status | services-enabled |
+------------------------+-----------------------------------------------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+----------+-------------------------------------------+------------------+-------------------+-----------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+----------+-------------------------------------------+------------------+-------------------+-----------+
| cert-manager | 22.06-37 | cert-manager-fluxcd-manifests | fluxcd-manifests | restore-requested | completed |
| nginx-ingress-controller | 22.06-37 | nginx-ingress-controller-fluxcd-manifests | fluxcd-manifests | restore-requested | completed |
| oidc-auth-apps | 22.06-71 | oidc-auth-apps-fluxcd-manifests | fluxcd-manifests | uploaded | completed |
| platform-integ-apps | 22.06-55 | platform-integ-apps-fluxcd-manifests | fluxcd-manifests | restore-requested | completed |
+--------------------------+----------+-------------------------------------------+------------------+-------------------+-----------+
[sysadmin@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id: 8a14acce-a15a-4739-b95b-5e8dcb2d69fe
    health: HEALTH_WARN
            13 slow ops, oldest one blocked for 4259 sec, mon.controller-0 has slow ops

  services:
    mon: 1 daemons, quorum controller-0 (age 72m)
    mgr: controller-0(active, since 72m)
    mds: 1 up:standby
    osd: 1 osds: 0 up, 0 in

  data:
    pools: 0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage: 0 B used, 0 B / 0 B avail
    pgs:

After running the restore Ansible playbook, the controller ended up in the offline state; even so, the bootstrap finished without error, as the Ansible logs show:

2022-06-23 10:32:44,559 p=11433 u=sysadmin | TASK [restore-platform/restore-more-data : Check controller-0 is in online state] *********************************************************************************************************************************
2022-06-23 10:32:44,559 p=11433 u=sysadmin | Thursday 23 June 2022 10:32:44 +0000 (0:00:00.273) 0:32:15.257 *********
2022-06-23 10:32:47,011 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (30 retries left).
2022-06-23 10:32:59,263 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (29 retries left).
2022-06-23 10:33:11,431 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (28 retries left).
2022-06-23 10:33:23,608 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (27 retries left).
2022-06-23 10:33:35,808 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (26 retries left).
2022-06-23 10:33:48,009 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (25 retries left).
2022-06-23 10:34:00,189 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (24 retries left).
2022-06-23 10:34:12,546 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (23 retries left).
2022-06-23 10:34:24,724 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (22 retries left).
2022-06-23 10:34:36,882 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (21 retries left).
2022-06-23 10:34:49,085 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (20 retries left).
2022-06-23 10:35:01,415 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (19 retries left).
2022-06-23 10:35:13,592 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (18 retries left).
2022-06-23 10:35:25,830 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (17 retries left).
2022-06-23 10:35:38,003 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (16 retries left).
2022-06-23 10:35:50,238 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (15 retries left).
2022-06-23 10:36:02,408 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (14 retries left).
2022-06-23 10:36:14,581 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (13 retries left).
2022-06-23 10:36:26,860 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (12 retries left).
2022-06-23 10:36:39,041 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (11 retries left).
2022-06-23 10:36:51,246 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (10 retries left).
2022-06-23 10:37:03,555 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (9 retries left).
2022-06-23 10:37:15,759 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (8 retries left).
2022-06-23 10:37:27,965 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (7 retries left).
2022-06-23 10:37:40,168 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (6 retries left).
2022-06-23 10:37:52,412 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (5 retries left).
2022-06-23 10:38:04,693 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (4 retries left).
2022-06-23 10:38:17,020 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (3 retries left).
2022-06-23 10:38:29,234 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (2 retries left).
2022-06-23 10:38:41,418 p=11433 u=sysadmin | FAILED - RETRYING: Check controller-0 is in online state (1 retries left).
2022-06-23 10:38:53,632 p=11433 u=sysadmin | changed: [localhost]
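
The task retrying above is waiting for controller-0 to report as online; it passes on its final attempt. To watch the same availability field by hand while the playbook runs, one way (purely illustrative) is:

# Check the availability value that the playbook task is polling for
source /etc/platform/openrc
system host-show controller-0 | grep -w availability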

mtcAgent.log
2022-06-24T12:39:25.564 [180405.00070] controller-0 mtcAgent --- nodeBase.cpp ( 699) log_adminAction : Info : controller-0 Add Action
2022-06-24T12:39:25.564 [180405.00071] controller-0 mtcAgent tok tokenUtil.cpp ( 697) tokenUtil_new_token : Info : controller-0 Requesting Authentication Token
2022-06-24T12:39:25.564 [180405.00072] controller-0 mtcAgent tok tokenUtil.cpp ( 118) tokenUtil_get_first : Info : controller-0 waiting on token request completion loop:0
2022-06-24T12:39:45.566 [180405.00073] controller-0 mtcAgent tok tokenUtil.cpp ( 118) tokenUtil_get_first : Info : controller-0 waiting on token request completion loop:20
2022-06-24T12:39:55.567 [180405.00074] controller-0 mtcAgent --- httpUtil.cpp ( 503) httpUtil_status : Warn : controller-0 failed to maintain connection to '127.0.0.1:5000' for 'controller-0 tokenUtil_new_token 'get new''
2022-06-24T12:39:55.567 [180405.00075] controller-0 mtcAgent --- httpUtil.cpp ( 564) httpUtil_handler :Error : controller-0 tokenUtil_new_token 'get new' Request Timeout (30)
2022-06-24T12:39:55.567 [180405.00076] controller-0 mtcAgent tok tokenUtil.cpp ( 455) tokenUtil_handler :Error : controller-0 Token Request Failed - Error Code (16)
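
These entries show mtcAgent timing out while requesting a token from Keystone on 127.0.0.1:5000, which matches the fix description below: the service process is up but not yet able to serve requests. A quick manual way to check whether Keystone is actually answering (illustrative; the URL assumes the local identity endpoint on port 5000):

# Does the identity API answer at all? A ready Keystone returns HTTP 200 for the version document
curl -sS -o /dev/null -w "%{http_code}\n" http://127.0.0.1:5000/v3/

# End-to-end check that token issuance works
source /etc/platform/openrc
openstack token issue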

Test Activity
-------------
Regression Testing

Workaround
----------
N/A

description: updated
Changed in starlingx:
assignee: nobody → Heitor Matsui (heitormatsui)
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/847619
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/6d369461311fa0e3e2745855f441e7124ae6a604
Submitter: "Zuul (22348)"
Branch: master

commit 6d369461311fa0e3e2745855f441e7124ae6a604
Author: Heitor Matsui <email address hidden>
Date: Fri Jun 24 18:16:57 2022 -0300

    Wait for services to come up

    In a restore scenario, during the restore-more-data stage, the
    restarted services may come up but still be unable to serve
    requests; this, specifically for keystone, causes mtcAgent to fail
    with a Token Request Failed error when it is executed in the next
    step.

    This commit adds a wait time right after services are restarted,
    so that services are given enough time to be up and ready to serve
    requests.

    Test Plan:
    PASS: run AIO-SX backup and restore successfully (while monitoring
          keystone ability to serve requests after the service restart
          and observing that it took some time after the process came
          up for it to be able to reply token requests)

    Closes-bug: 1979851
    Change-Id: I81f199f8e2f9ba2116350908ac54b1070df8027c
    Signed-off-by: Heitor Matsui <email address hidden>
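
Per the commit message, the merged change inserts a wait right after the services are restarted. An alternative way to express the same intent as an active readiness check against the Keystone endpoint is the shell sketch below (illustrative only, not what the playbook does; endpoint, interval and timeout are assumptions):

# Poll the Keystone identity endpoint until it answers, giving up after ~2 minutes
for _ in $(seq 1 24); do
    if curl -sf -o /dev/null http://127.0.0.1:5000/v3/; then
        echo "keystone is serving requests"
        break
    fi
    sleep 5
done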

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.update