Unable to connect to router after host reboot

Bug #1830108 reported by Yang Liu
This bug affects 2 people
Affects: StarlingX
Status: Won't Fix
Importance: High
Assigned to: Joseph Richard

Bug Description

Brief Description
-----------------
The router gateway IP is unreachable after a host reboots.

Severity
--------
Major

Steps to Reproduce
------------------
Note that steps 2 and 3 are operations that may be related to the failure, but the exact steps to reproduce are unknown.

1. The following VM, which uses the router, is reachable at first.
[2019-05-22 08:42:31,736] 262 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.247.67'
PING 192.168.247.67 (192.168.247.67) 56(84) bytes of data.
64 bytes from 192.168.247.67: icmp_seq=1 ttl=63 time=0.608 ms

| be9d3256-a33a-4878-b857-84097fb7c8aa | tenant2-vm-1 | ACTIVE | tenant2-mgmt-net=192.168.247.67; tenant2-net0=172.18.0.234 | | flavor-default-size2 |

2. Force reboot active controller
[2019-05-22 08:49:27,127] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-05-22 08:49:27,127] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

3. Lock/unlock a compute host
[2019-05-22 09:16:58,318] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-2'

4. The router interface is unreachable at the following time:
[2019-05-22 09:27:28,961] 262 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.247.83'
PING 192.168.247.83 (192.168.247.83) 56(84) bytes of data.
From 192.168.47.1 icmp_seq=1 Destination Host Unreachable

| b0a6890e-312b-4311-a71e-8a432046f5af | tenant2-tis-centos-guest_vifs-2 | ACTIVE | tenant2-mgmt-net=192.168.247.83 | | dedicated |
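The pass/fail distinction between steps 1 and 4 can be sketched as a small check on the `ping` output. This is only an illustration: the helper name `ping_reachable` is hypothetical and is not part of the test framework that produced the logs above. The two sample outputs are copied from steps 1 and 4.

```python
import re


def ping_reachable(output: str) -> bool:
    """Return True if at least one echo reply appears in `ping` output."""
    # A reply line looks like:
    #   64 bytes from 192.168.247.67: icmp_seq=1 ttl=63 time=0.608 ms
    # An unreachable target produces "Destination Host Unreachable" instead.
    return re.search(r"\d+ bytes from \S+: icmp_seq=\d+", output) is not None


# Output seen in step 1 (VM reachable).
step1_output = """PING 192.168.247.67 (192.168.247.67) 56(84) bytes of data.
64 bytes from 192.168.247.67: icmp_seq=1 ttl=63 time=0.608 ms"""

# Output seen in step 4 (router interface unreachable).
step4_output = """PING 192.168.247.83 (192.168.247.83) 56(84) bytes of data.
From 192.168.47.1 icmp_seq=1 Destination Host Unreachable"""

print(ping_reachable(step1_output))
print(ping_reachable(step4_output))
```

The check keys on reply lines rather than on the exit status, so it works on captured log text as well as on live output.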

Expected Behavior
------------------
4. router gateway and VM should be reachable

Actual Behavior
----------------
4. router gateway and VM are no longer reachable

Reproducibility
---------------
Intermittent

System Configuration
--------------------
2+3 system
Lab-name: wcp99-103

Branch/Pull Time/Commit
-----------------------
stx master as of "2019-05-21_14-14-17"

Last Pass
---------
Not sure. The same test passed on 2019-05-15_18-01-07; however, this issue is intermittent.

Timestamp/Logs
--------------
See Steps to Reproduce for time stamps of each step.

The following VIM log shows that router scheduling failed briefly. This is likely because compute-2 (the original router host) was being locked/unlocked, so the error could be expected. The router is moved to compute-1 later on, but the router interfaces are not reachable even after the router is rescheduled successfully.

2019-05-22T09:16:42.064 controller-1 VIM_Thread[249490] INFO _network_rebalance.py.1118 Triggering L3 Agent reschedule for disabled l3 agent host: compute-2
2019-05-22T09:17:18.652 controller-1 VIM_Thread[249490] ERROR _task.py.200 Task(add_router_to_agent) work (add_router_to_agent) timed out, id=344.
2019-05-22T09:17:18.653 controller-1 VIM_Thread[249490] ERROR nfvi_network_api.py.1281 Neutron add-router-to-agent failed, operation did not complete, agent_id=65d5cb77-45de-44ad-8f5c-646e851bbbb4 router_id=500a3531-32c2-4b9e-9ccc-1a54092088d8
2019-05-22T09:17:18.653 controller-1 VIM_Thread[249490] WARNING _network_rebalance.py.536 Unable to add router to l3 agent, response = {'completed': False, 'reason': '', 'result-data': ''}
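To make the timing in the excerpt above easier to read, here is a minimal sketch that splits VIM log lines into fields and measures the gap between the reschedule trigger and the task timeout. The field layout (timestamp, host, thread, level, source, message) is inferred from these few lines only and is an assumption, not a documented format. `parse_vim_line` is a hypothetical helper.

```python
import re
from datetime import datetime

# Assumed layout: timestamp, host, thread, level, source, message.
LINE_RE = re.compile(r"^(\S+) (\S+) (\S+) (\w+) (\S+) (.*)$")


def parse_vim_line(line):
    """Split one VIM log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts, host, thread, level, source, message = m.groups()
    return {
        "time": datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f"),
        "host": host,
        "thread": thread,
        "level": level,
        "source": source,
        "message": message,
    }


# Three of the lines from the excerpt, copied verbatim.
LOG = """
2019-05-22T09:16:42.064 controller-1 VIM_Thread[249490] INFO _network_rebalance.py.1118 Triggering L3 Agent reschedule for disabled l3 agent host: compute-2
2019-05-22T09:17:18.652 controller-1 VIM_Thread[249490] ERROR _task.py.200 Task(add_router_to_agent) work (add_router_to_agent) timed out, id=344.
2019-05-22T09:17:18.653 controller-1 VIM_Thread[249490] WARNING _network_rebalance.py.536 Unable to add router to l3 agent, response = {'completed': False, 'reason': '', 'result-data': ''}
"""

records = [r for r in map(parse_vim_line, LOG.splitlines()) if r]
problems = [r for r in records if r["level"] in ("ERROR", "WARNING")]

# Seconds between the reschedule trigger and the add_router_to_agent timeout.
elapsed = (records[1]["time"] - records[0]["time"]).total_seconds()
print(f"{len(problems)} problem lines, timeout after {elapsed:.3f} s")
```

For these lines the gap between the trigger and the timeout is about 36.6 seconds.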

[2019-05-22 09:30:33,609] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------+-----------+-------------------+-------+-------+------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+------------+-----------+-------------------+-------+-------+------------------+
| 65d5cb77-45de-44ad-8f5c-646e851bbbb4 | L3 agent | compute-1 | nova | :-) | UP | neutron-l3-agent |
+--------------------------------------+------------+-----------+-------------------+-------+-------+------------------+
controller-1:~$
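The agent listing above can be cross-checked against the VIM log: the agent ID in the failed add-router-to-agent message is the same ID that is later listed on compute-1. A minimal sketch of that lookup follows. `parse_table` is a hypothetical helper for '|'-delimited CLI tables, and the command that produced the table is not shown in this report.

```python
def parse_table(text):
    """Parse a '|'-delimited CLI table into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +---+ borders and shell prompts
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


# Table copied from the output above.
TABLE = """
+--------------------------------------+------------+-----------+-------------------+-------+-------+------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+------------+-----------+-------------------+-------+-------+------------------+
| 65d5cb77-45de-44ad-8f5c-646e851bbbb4 | L3 agent | compute-1 | nova | :-) | UP | neutron-l3-agent |
+--------------------------------------+------------+-----------+-------------------+-------+-------+------------------+
controller-1:~$
"""

# agent_id from the "Neutron add-router-to-agent failed" line in the VIM log.
FAILED_AGENT_ID = "65d5cb77-45de-44ad-8f5c-646e851bbbb4"

agents = parse_table(TABLE)
match = [a for a in agents if a["ID"] == FAILED_AGENT_ID]
for agent in match:
    print(agent["Host"], agent["State"])
```

A match with State UP fits the observation that the router was eventually rescheduled, so the unreachable interfaces are not explained by the agent being down.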

Test Activity
-------------
Sanity

Peng Peng (ppeng) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
tags: added: stx.sanity
Yang Liu (yliu12) wrote :

vswitch is ovs-dpdk

Yang Liu (yliu12)
description: updated
Ghada Khalil (gkhalil) wrote :

Kevin Smith confirmed in discussion that this doesn't appear to be a VIM router rescheduling issue. It appears that the neutron l3 agent/neutron server is misbehaving.

Given that stx is currently using an older stein snapshot of neutron in a temporary fork, I'd like to see a re-test done once the docker images pointing to latest stein are built as part of the cengn May 27 evening build.

For reference, this commit updated the stx-neutron image build to point to the upstream neutron repo:
https://review.opendev.org/#/c/661015/

Changed in starlingx:
status: New → Incomplete
tags: added: stx.distro.openstack
Ghada Khalil (gkhalil) wrote :

Assigning back to the reporter to monitor for a re-occurrence

Changed in starlingx:
assignee: nobody → Yang Liu (yliu12)
Ghada Khalil (gkhalil) wrote :

I am now suspecting that this is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1835807

tags: added: stx.networking
Ghada Khalil (gkhalil)
tags: added: stx.2.0
removed: stx.distro.openstack
Changed in starlingx:
importance: Undecided → High
Ghada Khalil (gkhalil) wrote :

I'll mark this as a duplicate of https://bugs.launchpad.net/starlingx/+bug/1835807 which is stx.2.0 gating.
@Yang, if you see a re-occurrence, please contact Joseph and add the notes in the above bug.

Changed in starlingx:
assignee: Yang Liu (yliu12) → Joseph Richard (josephrichard)
Yang Liu (yliu12) wrote :

Closing, as the same issue has not been seen for some time. Will open a new LP if we encounter a similar issue again.

tags: removed: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Marking as Won't Fix to match the duplicate LP.

Changed in starlingx:
status: Incomplete → Won't Fix