OAM Floating IP wasn't accessible after the initial controller swact

Bug #1850092 reported by Venkata Veldanda on 2019-10-28
Affects: StarlingX
Importance: Medium
Assigned to: John Kung

Bug Description

Brief Description
-----------------

Initially controller-0 was ACTIVE.
After the OAM floating IP was changed, the new floating IP was not accessible until controller-0 was made ACTIVE again. The procedure after the OAM floating IP change was: lock/unlock the standby (controller-1), then swact controller-0 to make controller-1 ACTIVE, then lock and unlock the new standby controller (controller-0). At this stage I would expect the new floating IP to be accessible even with controller-1 as the ACTIVE controller, but it was not. The new floating IP became accessible only after controller-0 was made ACTIVE again.

The behavior is the same whether the change is made through the GUI or the CLI.

Severity
--------
Major - can't access the floating IP

Steps to Reproduce
-------------------
1) Controller-0 ACTIVE and Controller-1 STANDBY
2) Change the OAM Floating IP through GUI or CLI
[NOTE] On my standard lab I swapped the two addresses: the controller-0 fixed IP became the new floating IP, and the old floating IP became the controller-0 fixed IP.
2) Lock & Unlock STANDBY controller-1
3) Swact controller-0 to make controller-1 ACTIVE
4) Lock & Unlock STANDBY controller-0
5) Expect the new floating IP to be accessible via GUI or CLI
[NOTE] Ping to the new floating IP works, but neither SSH nor HTTP works.
6) Performed an additional lock/unlock of controller-0 (the new floating IP was still not accessible)
7) Swact controller-1 back to make controller-0 ACTIVE
8) The new floating IP is now accessible
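
The "ping works but SSH does not" symptom in step 5 can be checked with a small script. The following is a minimal sketch, not part of the original report: the function name `read_ssh_banner` and the timeout value are assumptions chosen for illustration. It goes one step beyond a plain TCP connect and tries to read the SSH identification line, because in this bug the connection was accepted and then reset. It also simplifies by looking only at the first line the server sends.

```python
import socket
from typing import Optional


def read_ssh_banner(host: str, port: int = 22, timeout: float = 5.0) -> Optional[str]:
    """Return the server's SSH identification line, or None if it cannot be read.

    None covers every failure seen from the client side: connection refused,
    timeout, connection reset by peer, or a peer that closes without sending
    an "SSH-" line.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)
    except OSError:
        # Includes ConnectionResetError, ConnectionRefusedError and timeouts.
        return None
    first_line = data.split(b"\n", 1)[0].strip()
    if not first_line.startswith(b"SSH-"):
        return None
    return first_line.decode("ascii", errors="replace")


# Example use against the floating IP from this report (run manually):
#   banner = read_ssh_banner("128.224.151.244")
#   print("SSH usable" if banner else "SSH not usable")
```

Run after each step of the procedure, a check like this would show at which point the floating IP starts answering SSH again, independently of ICMP reachability.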

Expected Behavior
------------------
The new floating IP should be accessible at step [5] above, while controller-1 is ACTIVE.

Actual Behavior
------------------
The new floating IP became accessible only after step [7] above.

Reproducibility
------------------
Tested once so far

System Configuration
------------------
standard with dedicated storage

Branch/Pull Time/Commit
-----------------------
STX-BUILD-ID = 2019-10-17_20-00-00

Last Pass
---------
Not verified on previous builds

Timestamp/Logs
--------------
1. Logs attached

2. Brief timeline

# The IPs were assigned as below *before* the start of this test
128.224.151.243 (floating)
128.224.151.244 (c0) - Active
128.224.150.205 (c1) - Standby

# Changed to (from GUI)
128.224.151.243 (c0)
128.224.151.244 (floating)
128.224.150.205 (c1) - Standby
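
As a sanity check on the address swap above, here is a minimal sketch (not from the original report) that confirms both assignments stay inside the OAM subnet 128.224.150.0/23 shown in the logs and that only the floating and c0 addresses change. The helper names `validate_oam` and `changed_roles` are hypothetical.

```python
import ipaddress

# OAM subnet as reported in the horizon log and the oam-modify output.
OAM_SUBNET = ipaddress.ip_network("128.224.150.0/23")

before = {"floating": "128.224.151.243", "c0": "128.224.151.244", "c1": "128.224.150.205"}
after = {"floating": "128.224.151.244", "c0": "128.224.151.243", "c1": "128.224.150.205"}


def validate_oam(addrs, subnet=OAM_SUBNET):
    """Return a list of problems: addresses outside the subnet or used twice."""
    problems = []
    seen = {}
    for role, text in addrs.items():
        ip = ipaddress.ip_address(text)
        if ip not in subnet:
            problems.append(f"{role} {ip} is outside {subnet}")
        if ip in seen:
            problems.append(f"{role} and {seen[ip]} both use {ip}")
        seen[ip] = role
    return problems


def changed_roles(old, new):
    """Return the sorted roles whose address differs between two assignments."""
    return sorted(role for role in old if old[role] != new.get(role))


print(validate_oam(before), validate_oam(after), changed_roles(before, after))
```

Both assignments are valid on their own, so the swap itself is a legitimate configuration; the problem reported here is about when the new floating address starts being served.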

# From horizon.log
2019-10-25 05:38:22,572 [INFO] horizon.operation_log: [admin e893c6b3cc7a4efd97dd0e24fb6a8382] [admin 4d55c5f11318463f855c2f1a93a54626] [POST /admin/system_config/update_coam_table/ 200] parameters:[{"uuid": "8104824e-168d-4dad-93ac-76a2c4e7b3e0", "EXTERNAL_OAM_SUBNET": "128.224.150.0/23", "EXTERNAL_OAM_FLOATING_ADDRESS": "128.224.151.244", "EXTERNAL_OAM_GATEWAY_ADDRESS": "128.224.150.1", "EXTERNAL_OAM_1_ADDRESS": "128.224.150.205", "csrfmiddlewaretoken": "gL1PT3LGZT1HymR1CNDlYFEFz81LcpZmW4vHge5Kj1jjgqDssnCxt9D6Q8vodlLd", "EXTERNAL_OAM_0_ADDRESS": "128.224.151.243"}] message:[success: OAM configuration was successfully updated. ]

Locked & Unlocked C1
2019-10-25 05:40:11,788 [INFO] starlingx_dashboard.dashboards.admin.inventory.tables: Unlocked Host: "controller-1"

Swact to make C1 active
2019-10-25 05:49:17,415 [INFO] starlingx_dashboard.dashboards.admin.inventory.tables: Swact Initiated Host: "controller-0"

Locked C0 - [NOTE] At this time C1 went to config_out_of_date again and the new floating IP was NOT accessible.
I then logged in to C1 (the new active controller) via its original IP (128.224.150.205) and issued a system host-unlock on controller-0. controller-0 unlocked successfully, but the new floating IP 128.224.151.244 was still not accessible. However, 128.224.151.244 is reachable (ping works, but HTTPS and SSH do not).

[2019-10-25 11:56.51] ~
[VVeldand.blr-vveldand-l1] ➤ ssh sysadmin@128.224.151.244
ssh_exchange_identification: read: Connection reset by peer


[2019-10-25 11:58.21] ~
[VVeldand.blr-vveldand-l1] ➤ ping 128.224.151.244

Pinging 128.224.151.244 with 32 bytes of data:
Reply from 128.224.151.244: bytes=32 time=543ms TTL=58
Reply from 128.224.151.244: bytes=32 time=526ms TTL=58
Reply from 128.224.151.244: bytes=32 time=462ms TTL=58

However, the GUI does show the new floating IP.
No specific alarms related to config_out_of_date were noted at this time.

However, an alarm for an "openstack" application apply failure had been set:
| 750.002 | Application Apply Failure | k8s_application=stx- | major | 2019-10-25T05: |
| | | openstack | | 50:12.486414 |

[sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-list
+----------+------------------------------------------------------------------+-----------------------+----------+----------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+------------------------------------------------------------------+-----------------------+----------+----------------+
| 750.002 | Application Apply Failure | k8s_application=stx- | major | 2019-10-25T05: |
| | | openstack | | 50:12.486414 |
| | | | | |
| 200.001 | compute-0 was administratively locked to take it out-of-service. | host=compute-0 | warning | 2019-10-23T19: |
| | | | | 31:12.233918 |

GUI Shows the new IP
<attached the image>

# Additional Lock/Unlock performed on controller-0
# Then I swacted from controller-1 back to controller-0 (c0 becomes the new active controller)
2019-10-25T06:31:25.000 controller-1 -sh: info HISTORY: PID=454012 UID=42425 system host-swact controller-1

Success: Then I was able to connect to both SSH & Horizon using the new Floating IP 128.224.151.244

To start the test with the CLI, I reassigned the original floating IP as below
[sysadmin@controller-0 ~(keystone_admin)]$ system oam-modify oam_floating_ip=128.224.151.243 oam_c0_ip=128.224.151.244
+-----------------+--------------------------------------+
| Property | Value |
+-----------------+--------------------------------------+
| created_at | 2019-10-21T18:53:16.482342+00:00 |
| isystem_uuid | 36ee8ab9-b906-4bef-b298-9ebe667f94f3 |
| oam_c0_ip | 128.224.151.244 |
| oam_c1_ip | 128.224.150.205 |
| oam_floating_ip | 128.224.151.243 |
| oam_gateway_ip | 128.224.150.1 |
| oam_subnet | 128.224.150.0/23 |
| updated_at | None |
| uuid | 8104824e-168d-4dad-93ac-76a2c4e7b3e0 |
+-----------------+--------------------------------------+

[2019-10-25 13:15.16] ~
[VVeldand.blr-vveldand-l1] ➤ ssh sysadmin@128.224.151.243
ssh_exchange_identification: read: Connection reset by peer

## Only after making controller-0 ACTIVE again did the new floating IP become accessible.

description: updated

Logs attached

Ghada Khalil (gkhalil) wrote :

What was the status of the config out-of-date alarm during this procedure? When was it cleared?

tags: added: stx.config

# The config out of date alarm on controller-1 CLEARED after STEP 2 at the below timestamp as expected.
2019-10-25T05:47:26.000 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-1 Configuration is out-of-date.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-7_12.host=controller-1", "severity" : "major", "state" : "clear", "timestamp" : "2019-10-25 05:47:26.888152" }
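
For anyone scanning the attached logs for similar events, here is a minimal sketch of pulling the alarm state out of an fmManager line like the one above. It assumes the line carries a single JSON object after the log prefix, as in this sample; the function name `parse_fm_event` is hypothetical.

```python
import json

SAMPLE = (
    '2019-10-25T05:47:26.000 controller-0 fmManager: info '
    '{ "event_log_id" : "250.001", '
    '"reason_text" : "controller-1 Configuration is out-of-date.", '
    '"entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-7_12.host=controller-1", '
    '"severity" : "major", "state" : "clear", '
    '"timestamp" : "2019-10-25 05:47:26.888152" }'
)


def parse_fm_event(line: str) -> dict:
    """Decode the JSON object embedded in an fmManager log line."""
    start = line.index("{")
    end = line.rindex("}") + 1
    return json.loads(line[start:end])


event = parse_fm_event(SAMPLE)
# The host is the last component of the entity instance id in this sample.
host = event["entity_instance_id"].rsplit("host=", 1)[1]
print(event["event_log_id"], host, event["state"], event["timestamp"])
```

Filtering the log for `event_log_id` 250.001 in this way would list every set and clear of the config out-of-date alarm with its host and timestamp.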

# SWACT Performed per STEP 3 at the below timestamp
2019-10-25 05:49:17,415 [INFO] starlingx_dashboard.dashboards.admin.inventory.tables: Swact Initiated Host: "controller-0"

# The problem described was visible from the timestamp "2019-10-25 05:49:17" until approximately "2019-10-25T06:32", when step 7 was performed (that is, from after step 4 to before step 7).
# However, no config out-of-date alarms were seen during this interval (i.e. while the problem was visible)
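
The intervals in this timeline can be worked out from the timestamps quoted in this report. The sketch below assumes the three log sources (fmManager event, horizon log, controller-1 shell history) share the same clock and time zone, which the report does not state explicitly; the variable names are illustrative.

```python
from datetime import datetime

# Config out-of-date alarm on controller-1 cleared (fmManager event timestamp).
alarm_cleared = datetime.strptime("2019-10-25 05:47:26.888152", "%Y-%m-%d %H:%M:%S.%f")
# Swact from controller-0 to controller-1 initiated (horizon log).
swact_to_c1 = datetime.strptime("2019-10-25 05:49:17,415", "%Y-%m-%d %H:%M:%S,%f")
# Swact back to controller-0 issued (controller-1 shell history).
swact_back = datetime.strptime("2019-10-25T06:31:25.000", "%Y-%m-%dT%H:%M:%S.%f")

clear_to_swact = swact_to_c1 - alarm_cleared
problem_window = swact_back - swact_to_c1

print("alarm clear to swact:", clear_to_swact)
print("swact to swact back:", problem_window)
```

Under that assumption, the swact happened a little under two minutes after the alarm cleared, and the new floating IP was unusable for roughly 42 minutes until the swact back.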

Ghada Khalil (gkhalil) wrote :

stx.3.0 / medium priority - needs further investigation

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → John Kung (john-kung)
tags: added: stx.3.0