OAM Floating IP wasn't accessible after the initial controller swact

Bug #1850092 reported by Venkata Veldanda
Affects: StarlingX
Status: Won't Fix
Importance: Low
Assigned to: John Kung

Bug Description

Brief Description
-----------------

Initially, controller-0 was ACTIVE.
After the OAM floating IP was changed, the new floating IP was not accessible until controller-0 was made ACTIVE again. The normal procedure after an OAM floating IP change is to lock/unlock the standby (controller-1) and then swact controller-0 so that controller-1 becomes ACTIVE. I then locked and unlocked the new standby controller (controller-0). At this stage I expected the new floating IP to be accessible even with controller-1 as the ACTIVE controller, but it was not. The new floating IP became accessible only after I swacted back so that controller-0 was ACTIVE again.

The behavior is the same whether the change is made through the GUI or the CLI.

Severity
--------
Major - can't access the floating IP

Steps to Reproduce
-------------------
Precondition: controller-0 ACTIVE and controller-1 STANDBY
1) Change the OAM floating IP through the GUI or CLI
[NOTE] On my standard lab I swapped the two addresses: the controller-0 fixed IP became the new floating IP, and the old floating IP became the controller-0 fixed IP.
2) Lock & unlock STANDBY controller-1
3) Swact controller-0 to make controller-1 ACTIVE
4) Lock & unlock STANDBY controller-0
5) Expect the new floating IP to be accessible via GUI or CLI
[NOTE] Ping to the new floating IP works, but neither SSH nor HTTP works.
6) Performed an additional lock/unlock of controller-0 (the new floating IP was still not accessible)
7) Swact controller-1 to make controller-0 ACTIVE again
8) The new floating IP is now accessible
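
The CLI path through the steps above, up to the lock/unlock of controller-0, might look like the following sketch. It only prints the commands (a dry run) so they can be reviewed first. The `system oam-modify`, `system host-unlock`, and `system host-swact` forms follow the commands quoted elsewhere in this report; `system host-lock` and the helper name `repro_commands` are additions made here for illustration. In practice each lock/unlock has to complete before the next command is issued, and after the swact the remaining commands have to be run on controller-1.

```shell
#!/bin/sh
# Dry run: print the CLI sequence for the reproduction, nothing is executed.
# Arguments: <new floating IP> <new controller-0 IP>
# repro_commands is a hypothetical helper name, not part of StarlingX.
repro_commands() {
    new_floating_ip=$1
    new_c0_ip=$2
    cat <<EOF
system oam-modify oam_floating_ip=${new_floating_ip} oam_c0_ip=${new_c0_ip}
system host-lock controller-1
system host-unlock controller-1
system host-swact controller-0
system host-lock controller-0
system host-unlock controller-0
EOF
}

# Addresses taken from the timeline in this report.
repro_commands 128.224.151.244 128.224.151.243
```

After the last command the new floating IP would be expected to answer on SSH and Horizon; in this report it did not.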

Expected Behavior
------------------
The new floating IP should be accessible at step [5] above.

Actual Behavior
------------------
The new floating IP was accessible only after step [7] above.

Reproducibility
------------------
Tested once so far

System Configuration
------------------
standard with dedicated storage

Branch/Pull Time/Commit
-----------------------
STX-BUILD-ID = 2019-10-17_20-00-00

Last Pass
---------
Didn't verify on the previous builds

Timestamp/Logs
--------------
1. Logs attached

2. Brief timeline

# The IPs were assigned as below *before* the start of this test
128.224.151.243 (floating)
128.224.151.244 (c0) - Active
128.224.150.205 (c1) - Standby

# Changed to (from GUI)
128.224.151.243 (c0)
128.224.151.244 (floating)
128.224.150.205 (c1) - Standby

# From horizon.log
2019-10-25 05:38:22,572 [INFO] horizon.operation_log: [admin e893c6b3cc7a4efd97dd0e24fb6a8382] [admin 4d55c5f11318463f855c2f1a93a54626] [POST /admin/system_config/update_coam_table/ 200] parameters:[{"uuid": "8104824e-168d-4dad-93ac-76a2c4e7b3e0", "EXTERNAL_OAM_SUBNET": "128.224.150.0/23", "EXTERNAL_OAM_FLOATING_ADDRESS": "128.224.151.244", "EXTERNAL_OAM_GATEWAY_ADDRESS": "128.224.150.1", "EXTERNAL_OAM_1_ADDRESS": "128.224.150.205", "csrfmiddlewaretoken": "gL1PT3LGZT1HymR1CNDlYFEFz81LcpZmW4vHge5Kj1jjgqDssnCxt9D6Q8vodlLd", "EXTERNAL_OAM_0_ADDRESS": "128.224.151.243"}] message:[success: OAM configuration was successfully updated. ]

Locked & Unlocked C1
2019-10-25 05:40:11,788 [INFO] starlingx_dashboard.dashboards.admin.inventory.tables: Unlocked Host: "controller-1"

Swact to make C1 active
2019-10-25 05:49:17,415 [INFO] starlingx_dashboard.dashboards.admin.inventory.tables: Swact Initiated Host: "controller-0"

Locked C0 - [NOTE] At this time C1 went to config_out_of_date again and the new floating IP was NOT accessible.
I then logged in to C1 (the new Active) through its own IP (128.224.150.205) and issued a system host-unlock on controller-0. controller-0 unlocked successfully, but the new floating IP 128.224.151.244 was still not accessible. The address itself is reachable (ping works), but HTTPS and SSH do not work.

[2019-10-25 11:56.51] ~
[VVeldand.blr-vveldand-l1] ➤ ssh sysadmin@128.224.151.244
ssh_exchange_identification: read: Connection reset by peer


[2019-10-25 11:58.21] ~
[VVeldand.blr-vveldand-l1] ➤ ping 128.224.151.244

Pinging 128.224.151.244 with 32 bytes of data:
Reply from 128.224.151.244: bytes=32 time=543ms TTL=58
Reply from 128.224.151.244: bytes=32 time=526ms TTL=58
Reply from 128.224.151.244: bytes=32 time=462ms TTL=58
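
The symptom shown here (ICMP replies while the SSH exchange is reset) can be narrowed down by probing each layer separately. The sketch below is one way to do that from a Linux host, assuming iputils `ping`, OpenSSH `ssh-keyscan`, and `curl` are installed. The helper names `check`, `ssh_answers`, and `probe_oam` are hypothetical, and the default URL is a guess that should be adjusted to the scheme and port Horizon actually uses in the deployment.

```shell
#!/bin/sh
# Run a command quietly and report whether it succeeded.
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "${label}: ok"
    else
        echo "${label}: FAILED"
    fi
}

# True when sshd hands back at least one host key, i.e. the SSH
# exchange gets past the point where "Connection reset by peer" occurs.
ssh_answers() {
    ssh-keyscan -T 5 "$1" 2>/dev/null | grep -q .
}

# Probe one address at three layers: ICMP, SSH, HTTP(S).
# The URL default is an assumption; pass the real Horizon URL as $2.
probe_oam() {
    host=$1
    url=${2:-https://${host}/}
    check "icmp ${host}" ping -c 3 -W 2 "$host"
    check "ssh ${host}" ssh_answers "$host"
    check "http ${url}" curl -k -sS -m 5 -o /dev/null "$url"
}

# Only probe when an address is given on the command line.
if [ "$#" -gt 0 ]; then
    probe_oam "$@"
fi
```

Running it once against the floating IP and once against the active controller's own IP would show whether only the floating address is affected.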

However on GUI, I see the new floating IP is reflected.
No specific alarms related to config_out_of_date were noted at this time

However, an alarm for an "openstack" application failure was "set":
| 750.002  | Application Apply Failure                                        | k8s_application=stx-  | major    | 2019-10-25T05: |
|          |                                                                  | openstack             |          | 50:12.486414   |

[sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-list
+----------+------------------------------------------------------------------+-----------------------+----------+----------------+
| Alarm ID | Reason Text                                                      | Entity ID             | Severity | Time Stamp     |
+----------+------------------------------------------------------------------+-----------------------+----------+----------------+
| 750.002  | Application Apply Failure                                        | k8s_application=stx-  | major    | 2019-10-25T05: |
|          |                                                                  | openstack             |          | 50:12.486414   |
|          |                                                                  |                       |          |                |
| 200.001  | compute-0 was administratively locked to take it out-of-service. | host=compute-0        | warning  | 2019-10-23T19: |
|          |                                                                  |                       |          | 31:12.233918   |

GUI Shows the new IP
<attached the image>

# Additional Lock/Unlock performed on controller-0
# Then, I swacted back controller-1 to controller-0 (c0 will be the new Active Controller)
2019-10-25T06:31:25.000 controller-1 -sh: info HISTORY: PID=454012 UID=42425 system host-swact controller-1

Success: Then I was able to connect to both SSH & Horizon using the new Floating IP 128.224.151.244

To start the test with CLI, I assigned back the original floating IP as below
[sysadmin@controller-0 ~(keystone_admin)]$ system oam-modify oam_floating_ip=128.224.151.243 oam_c0_ip=128.224.151.244
+-----------------+--------------------------------------+
| Property | Value |
+-----------------+--------------------------------------+
| created_at | 2019-10-21T18:53:16.482342+00:00 |
| isystem_uuid | 36ee8ab9-b906-4bef-b298-9ebe667f94f3 |
| oam_c0_ip | 128.224.151.244 |
| oam_c1_ip | 128.224.150.205 |
| oam_floating_ip | 128.224.151.243 |
| oam_gateway_ip | 128.224.150.1 |
| oam_subnet | 128.224.150.0/23 |
| updated_at | None |
| uuid | 8104824e-168d-4dad-93ac-76a2c4e7b3e0 |
+-----------------+--------------------------------------+

[2019-10-25 13:15.16] ~
[VVeldand.blr-vveldand-l1] ➤ ssh sysadmin@128.224.151.243
ssh_exchange_identification: read: Connection reset by peer

## The new floating IP became accessible only after controller-0 was made ACTIVE again.

Tags: stx.config
Venkata Veldanda (venkata.veldanda) wrote :

Logs attached

Ghada Khalil (gkhalil) wrote :

What was the status of the config out-of-date alarm during this procedure? When was it cleared?

tags: added: stx.config
Venkata Veldanda (venkata.veldanda) wrote :

# The config out of date alarm on controller-1 CLEARED after STEP 2 at the below timestamp as expected.
2019-10-25T05:47:26.000 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-1 Configuration is out-of-date.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-7_12.host=controller-1", "severity" : "major", "state" : "clear", "timestamp" : "2019-10-25 05:47:26.888152" }

# SWACT Performed per STEP 3 at the below timestamp
2019-10-25 05:49:17,415 [INFO] starlingx_dashboard.dashboards.admin.inventory.tables: Swact Initiated Host: "controller-0"

# The problem was visible from "2019-10-25 05:49:17" until approximately "2019-10-25T06:32", when step 7 was performed (that is, after step 4 and before step 7).
# However, "no" config out-of-date alarms were seen during that interval (i.e. while the problem was visible).
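
To answer the "when was it cleared" question from a log rather than by eye, the set/clear events for alarm 250.001 can be pulled out of fmManager log lines. This is a minimal sketch that assumes the lines have exactly the shape of the one quoted above (keys written as `"state" : "clear"`, with `state` appearing before `timestamp`). The function name `config_alarm_events` is hypothetical, and the log file location is not stated in this report, so the input is read from stdin.

```shell
#!/bin/sh
# Print "<timestamp> <state>" for every 250.001 (config out-of-date)
# event found on stdin. Assumes the fmManager line layout quoted above.
config_alarm_events() {
    grep '"event_log_id" : "250.001"' |
    sed -n 's/.*"state" : "\([a-z]*\)".*"timestamp" : "\([^"]*\)".*/\2 \1/p'
}
```

Fed the line quoted above, it would print the clear event with its `2019-10-25 05:47:26.888152` timestamp; an interval with no output means no 250.001 events were logged in that input.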

Ghada Khalil (gkhalil) wrote :

stx.3.0 / medium priority - needs further investigation

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → John Kung (john-kung)
tags: added: stx.3.0
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving unresolved medium priority bugs (< 100 days OR recently reproduced) from stx.3.0 to stx.4.0

tags: added: stx.4.0
removed: stx.3.0
Ghada Khalil (gkhalil) wrote :

Lowering the priority as this issue was reported only once many months ago. Changing the oam ip is a fairly common test-case and it's been run since then.

tags: removed: stx.4.0
Changed in starlingx:
importance: Medium → Low
Ramaswamy Subramanian (rsubrama) wrote :

No progress on this bug for more than 2 years. Candidate for closure.

If there is no update, this issue is targeted to be closed as 'Won't Fix' in 2 weeks.

Ramaswamy Subramanian (rsubrama) wrote :

Changing the status to 'Won't Fix' as there is no activity.

Changed in starlingx:
status: Triaged → Won't Fix