ansible: service endpoint reconfiguration timeout

Bug #1831118 reported by Bob Church
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  High        Al Bailey

Bug Description

Brief Description
-----------------
On virtual AIO-SX and 2+2 installs, I regularly see the bootstrap playbook fail with:

TASK [persist-config : Wait for service endpoints reconfiguration to complete] ***************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "elapsed": 180, "msg": "Timeout waiting for service endpoints reconfiguration to complete"}

I suspect that this is related to the relative load/processing power of the host system.

Severity
--------
Major: System/Feature is usable but only after applying the workaround described below

Steps to Reproduce
------------------
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/bootstrap/bootstrap.yml -e "ansible_become_pass=Li69nux* admin_password=Li69nux* system_mode=duplex"

Expected Behavior
-----------------
PLAY RECAP *********************************************************************
localhost : ok=190 changed=121 unreachable=0 failed=0

Actual Behavior
---------------
PLAY RECAP *********************************************************************
localhost : ok=99 changed=34 unreachable=0 failed=1

Reproducibility
---------------
I'm using two hosts for virtual installs. The failure occurs 100% of the time on one host and intermittently on the other.

System Configuration
--------------------
AIO-SX and 2+2 installs

Branch/Pull Time/Commit
-----------------------
Private build of StarlingX master on 5/29

Last Pass
---------
Worked consistently on builds prior to 5/25

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer Testing

Workaround
----------
# Bump the timeout after install and before running ansible
controller-0:~$ grep timeout /usr/share/ansible/stx-ansible/playbooks/bootstrap/roles/persist-config/tasks/main.yml
        timeout: 180
controller-0:~$ sudo sed -i 's/180/360/g' /usr/share/ansible/stx-ansible/playbooks/bootstrap/roles/persist-config/tasks/main.yml
controller-0:~$ grep timeout /usr/share/ansible/stx-ansible/playbooks/bootstrap/roles/persist-config/tasks/main.yml
        timeout: 360
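
A narrower variant of the same workaround, as a sketch only: the `s/180/360/g` expression rewrites every 180 in the file, so anchoring on the `timeout:` key is a little safer. `bump_timeout` is a hypothetical helper name, and GNU sed's `-i` option is assumed, as in the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of StarlingX): change only lines of the form
# "timeout: <old>" to "timeout: <new>", keeping their indentation.
# Any other occurrence of <old> in the file is left untouched.
# Usage: bump_timeout <file> <old> <new>   (run with write access to <file>)
bump_timeout() {
    file=$1
    old=$2
    new=$3
    # \1 re-emits the captured indentation and "timeout:" prefix.
    sed -i "s/^\([[:space:]]*timeout:[[:space:]]*\)$old\$/\1$new/" "$file"
}
```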

The timeout belongs to the following task in that file:
    # If this is initial play or replay with management and/or oam network config change, must
    # wait for the keystone endpoint runtime manifest to complete and restart
    # sysinv agent and api.
    - name: Wait for service endpoints reconfiguration to complete
      wait_for:
        path: /etc/platform/.service_endpoint_reconfigured
        state: present
        timeout: 180
        msg: Timeout waiting for service endpoints reconfiguration to complete
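
For reference, the task above amounts to polling for the flag file until a deadline, which is why a slow host fails with "elapsed": 180. A rough shell equivalent, assuming a one-second polling interval (the default for wait_for's `sleep` option); `wait_for_path` is a hypothetical helper, not part of the playbook:

```shell
#!/bin/sh
# Sketch of the wait_for behaviour: check for a path once per second until it
# exists or the timeout (in seconds) is reached.
# Returns 0 when the path appears, 1 on timeout.
wait_for_path() {
    path=$1
    timeout=$2
    elapsed=0
    while [ ! -e "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "Timeout waiting for $path after ${elapsed}s" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "$path present after ${elapsed}s"
}
```

Calling `wait_for_path /etc/platform/.service_endpoint_reconfigured 360` mirrors the task with the doubled timeout.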

Ghada Khalil (gkhalil) wrote :

Marking as release gating; initial system config can fail. Related to ansible deployment.

Changed in starlingx:
assignee: nobody → Al Bailey (albailey1974)
importance: Undecided → High
status: New → Triaged
tags: added: stx.2.0 stx.config

Al Bailey (albailey1974) wrote :

I will double the timeout from 180 to 360.

In Bob's testing, the reconfiguration took close to 180 seconds, so 360 leaves a reasonable margin.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/662289

Changed in starlingx:
status: Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/662289
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=a57b17b4366d4a458273f843a3aeee5468aba5c7
Submitter: Zuul
Branch: master

commit a57b17b4366d4a458273f843a3aeee5468aba5c7
Author: Al Bailey <email address hidden>
Date: Thu May 30 12:27:56 2019 -0500

    Increase the ansible timeout for service endpoint reconfiguration

    In virtual environments the time for service endpoint
    reconfiguration to complete was occasionally exceeding
    3 minutes and timing out.

    This fix simply doubles the timeout.

    Change-Id: Iea44e42fb99de7f9b27c03f88103b619714c0850
    Closes-Bug: #1831118
    Signed-off-by: Al Bailey <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released