Subcloud install playbook failed to ping OAM interface of subclouds

Bug #1999543 reported by Li Zhu
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Li Zhu

Bug Description

Brief Description
------------------
Failed to ping OAM interface of subclouds in the install playbook.

Failure:
TASK [common/prepare-env : Fail if host is unreachable] ************************
Wednesday 23 November 2022 01:43:16 +0000 (0:00:10.177) 1:08:56.357 ****
skipping: [subcloudXXX] => (item=PING <ip> 56 data bytes)
skipping: [subcloudXXX] => (item=)
skipping: [subcloudXXX] => (item=--- <ip> ping statistics ---)
failed: [subcloudXXX] (item=1 packets transmitted, 0 received, 100% packet loss, time 0ms) => changed=false
  ansible_loop_var: item
  item: 1 packets transmitted, 0 received, 100% packet loss, time 0ms
  msg: Host <ip> is unreachable!

PLAY RECAP

The install completed at 01:43:05 and the connectivity check failed at 01:43:16, roughly 10 seconds later.
The daemon logs show that the server was rebooted at 01:43:05. The OAM interface seems to have taken longer than that to become available. The step "[Waiting 9000 seconds for port 22 become open on <ip>]" also checks whether the interface is available, so the OAM interface connection may simply not have been stable for a few seconds. I was able to connect to the host after the failure.
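
To illustrate the difference between a single ping and a check that tolerates a few seconds of instability, here is a minimal Python sketch of a polling port check. It is not the playbook's code: the function name `wait_for_port` and its `timeout` and `interval` parameters are hypothetical and chosen only for this example.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to (host, port) succeeds or timeout expires.

    Returns True if the port accepted a connection in time, False otherwise.
    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection resolves the address, so IPv4 and IPv6 both work.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, timed out, or unreachable: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A one-shot check fails on the first lost packet, while a loop like this keeps retrying, so a short outage right after reboot would not abort the run as long as the port opens before the deadline.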

Severity
--------
<Critical: System/Feature is not usable after the defect>

Steps to Reproduce
------------------
Run remote subcloud install

Expected Behavior
-----------------
The subcloud deployment should complete successfully

Actual Behavior
---------------
The subcloud installed successfully; however, the OAM interface was not stable for a few seconds, and the install playbook then failed to ping it.

Reproducibility
---------------
Intermittent

System Configuration
------------------
DC lab

SW_VERSION="22.12"
BUILD_ID="2022-11-22_01-25-31"

Timestamp/Logs
---------------
daemon logs: reboot system

/var/log/daemon.log:2022-11-23T01:43:05.980 localhost systemd[1]: info Starting LSB: Execute the kexec -e command to reboot system...

ansible logs: - subcloudXXX

2022-11-23-00:34:19 Executing playbook command: ['ansible-playbook', '/usr/share/ansible/stx-ansible/playbooks/install.yml', '-i', '/var/opt/dc/ansible/subcloudXXX_inventory.yml', '--limit', 'subcloudXXX', '-e', '@/var/opt/dc/ansible/subcloudXXX/install_values.yml']

PLAY [Install Playbook] ********************************************************

TASK [set_fact] ****************************************************************
...
changed: [subcloudXXX]

TASK [common/prepare-env : Fail if host is unreachable] ************************
Wednesday 23 November 2022 01:43:16 +0000 (0:00:10.177) 1:08:56.357 ****
skipping: [subcloudXXX] => (item=PING <ip> 56 data bytes)
skipping: [subcloudXXX] => (item=)
skipping: [subcloudXXX] => (item=--- <ip> ping statistics ---)
failed: [subcloudXXX] (item=1 packets transmitted, 0 received, 100% packet loss, time 0ms) => changed=false
  ansible_loop_var: item
  item: 1 packets transmitted, 0 received, 100% packet loss, time 0ms
  msg: Host 2620:10a:a001:d41::198 is unreachable!

PLAY RECAP *********************************************************************
subcloudXXX : ok=21 changed=9 unreachable=0 failed=1 skipped=13 rescued=0 ignored=0
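
The log above shows the task looping over each line of the ping output and failing on the statistics line. The task's actual condition is not shown in this report, so the following Python sketch only illustrates one way such a per-line check could work. The names `PING_STATS` and `host_reachable` are hypothetical.

```python
import re

# Matches the start of the ping summary line seen in the log, e.g.
# "1 packets transmitted, 0 received, 100% packet loss, time 0ms"
PING_STATS = re.compile(r"(?P<sent>\d+) packets transmitted, (?P<received>\d+) received")


def host_reachable(ping_output_lines):
    """Return True if the ping summary line reports at least one reply.

    Lines other than the summary line are ignored. If no summary line is
    found, the host is treated as unreachable.
    """
    for line in ping_output_lines:
        match = PING_STATS.search(line)
        if match:
            return int(match.group("received")) > 0
    return False
```

With only one packet sent, a single lost reply is enough to report the host as unreachable, which is why a brief interruption on the OAM interface fails the whole run.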

Alarms
--------
n/a

Test Activity
-----------------
Feature Testing

Workaround
-----------------
Resume the subcloud deployment to bootstrap playbook phase

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/867544
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/a2af4aa40d8ff6eb8a034da63cca66ed295c4e4a
Submitter: "Zuul (22348)"
Branch: master

commit a2af4aa40d8ff6eb8a034da63cca66ed295c4e4a
Author: Li Zhu <email address hidden>
Date: Tue Dec 13 12:01:32 2022 -0500

    Replace ping with wait_for ssh port open in connectivity check

    Subclouds installation occasionally failed due to the failure of ping
    OAM interface of subclouds in the install playbook. The OAM interface
    seems to have taken longer time to become available than ssh port open.
    To check the ssh connection, it's better to use "wait_for" module to
    check the ssh port directly.

    Test plan:
    Verify successful batch subcloud deployment in a scale lab.

    Close-Bug: 1999543

    Signed-off-by: Li Zhu <email address hidden>
    Change-Id: Ib25e9efe36564bce0ad7da3624203ae7821bab89

Ghada Khalil (gkhalil) wrote :

Marking as Fix Released. The LP was not automatically updated because there is a typo in the "Closes-Bug" in the commit msg (Close-Bug instead of Closes-Bug)
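
A commit-message check along the following lines could catch this kind of typo before merge. This is only a sketch: the pattern actually used by the tracker integration is not shown in this report, so the keyword list (`Closes-Bug`, `Partial-Bug`, `Related-Bug`) is an assumption based on the usual OpenStack commit-message conventions, and `find_bug_trailers` is a hypothetical helper.

```python
import re

# Assumption: only these trailer keywords are recognized. The real tooling's
# pattern is not shown in this report.
BUG_TRAILER = re.compile(
    r"^[ \t]*(Closes-Bug|Partial-Bug|Related-Bug):[ \t]*#?(\d+)[ \t]*$",
    re.MULTILINE,
)


def find_bug_trailers(commit_message):
    """Return (keyword, bug_number) pairs for every recognized trailer line."""
    return [
        (m.group(1), int(m.group(2)))
        for m in BUG_TRAILER.finditer(commit_message)
    ]
```

Under this assumed pattern, a message containing "Close-Bug: 1999543" yields no matches, whereas "Closes-Bug: 1999543" yields one, so an empty result would flag the commit for review.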

Changed in starlingx:
status: New → Fix Released
assignee: nobody → Li Zhu (lzhu1)
tags: added: stx.8.0 stx.distcloud
Changed in starlingx:
importance: Undecided → Medium