Service Health Check doesn't recover after failure

Bug #1724945 reported by Senthilnathan Murugappan
Affects: Juniper Openstack (status tracked in Trunk)
    R4.1:  Fix Committed, Importance: High, Assigned to: Hari Prasad Killi
    Trunk: Fix Committed, Importance: High, Assigned to: Hari Prasad Killi

Bug Description

version: 4.1.0.0-1

If the health check fails and then restores, the interface never becomes active again.
After the failure, the probes never reach the VM tap interface, and the discard counter keeps increasing; this is believed to be the reason it does not recover after the restore.
Workaround: disassociate and re-associate the SHC (service health check) on the VMI.

Also observed that HC probes do not wait for the configured delay seconds between retries upon failure (not sure if this is by design).
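
For reference, a minimal sketch (hypothetical, not Contrail code) of what a TCP health-check probe with a per-retry delay is expected to do; the target address matches the tcpdump below, and the retry/delay values are assumed:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* One TCP connect attempt to ip:port; returns 0 on success, -1 on failure. */
static int probe_once(const char *ip, int port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    inet_pton(AF_INET, ip, &sa.sin_addr);

    int rc = connect(fd, (struct sockaddr *)&sa, sizeof(sa));
    close(fd);
    return rc == 0 ? 0 : -1;
}

int main(void)
{
    const char *target = "12.1.1.3";        /* probe destination from the tcpdump below */
    int port = 80, retries = 3, delay = 3;  /* assumed configured values */

    for (int i = 1; i <= retries; i++) {
        if (probe_once(target, port) == 0) {
            printf("probe %d: success\n", i);
            return 0;
        }
        printf("probe %d: failed, waiting %d s before retry\n", i, delay);
        sleep((unsigned)delay);  /* the wait the report says is being skipped */
    }
    return 1;
}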

Tcpdump of the probe retry interval (note that the two SYN retries below are only about 13 ms apart, with no configured delay between them):
12:27:40.377930 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 12.1.1.2 (00:00:5e:00:01:00) tell 12.1.1.2, length 28
12:27:40.427450 IP (tos 0x0, ttl 63, id 37879, offset 0, flags [DF], proto TCP (6), length 60)
    12.1.1.2.43546 > 12.1.1.3.80: Flags [S], cksum 0x1a35 (incorrect -> 0x7d1c), seq 1556438834, win 29200, options [mss 1460,sackOK,TS val 195415237 ecr 0,nop,wscale 7], length 0
12:27:40.427697 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    12.1.1.3.80 > 12.1.1.2.43546: Flags [R.], cksum 0x2f67 (correct), seq 0, ack 1556438835, win 0, length 0
12:27:40.440191 IP (tos 0x0, ttl 63, id 28092, offset 0, flags [DF], proto TCP (6), length 60)
    12.1.1.2.43548 > 12.1.1.3.80: Flags [S], cksum 0x1a35 (incorrect -> 0xf805), seq 2677318004, win 29200, options [mss 1460,sackOK,TS val 195415240 ecr 0,nop,wscale 7], length 0
12:27:40.440352 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    12.1.1.3.80 > 12.1.1.2.43548: Flags [R.], cksum 0xaa53 (correct), seq 0, ack 2677318005, win 0, length 0
12:27:40.443326 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 12.1.1.1 (00:00:5e:00:01:00) tell 12.1.1.1, length 28

Flow details:
root@5b4s10:~# flow -l | grep -2 508368
 SPort 55171, TTL 0, Sinfo 0.0.0.0)

   187416<=>508368 12.1.1.3:80 6 (2->0)
                         12.1.1.2:43484
(Gen: 1, K(nh):23, Action:N(SD), Flags:, TCP:Sr, QOS:-1, S(nh):23, Stats:0/0,
--
 SPort 61408, TTL 0, Sinfo 0.0.0.0)

   508368<=>187416 5.5.5.133:43484 6 (0->2)
                         169.254.255.254:80
(Gen: 10, K(nh):5, Action:N(SD), Flags:, TCP:S, QOS:-1, S(nh):10, Stats:3/180,

root@5b4s10:~# flow --get 508368
Flow Index: 508368
Flow Generation ID: 10
Reverse Flow Index: 187416
VRF: 0
Destination VRF: 2
Flow Source: [5.5.5.133]:43484
Flow Destination: [169.254.255.254]:80
Flow Protocol: TCP
Flow Action: NAT: SourceNAT, DestinationNAT,
                              NAT(Source, Destination): [12.1.1.2]:43484, [12.1.1.3]:80
Expected Source: NextHop(Index, VRF, Type): 10, 1, RECEIVE
                              Ingress Interface(Index, VRF, OS): vif0/1, 0, vhost0
                              Interface Statistics(Out, In, Errors): 2975385, 2140815, 0
Source Information: VRF: 0
                              Layer 3 Route Information
                              Matching Route: 5.5.5.133/32
                              NextHop(Index, VRF, Type): 10, 1, RECEIVE
                              Ingress Interface(Index, VRF, OS): vif0/1, 0, vhost0
                              Interface Statistics(Out, In, Errors): 2975385, 2140815, 0
Destination Information: VRF: 2
                              Layer 3 Route Information
                              Matching Route: 12.1.1.0/24

Flow Flags:
TCP FLAGS: SYN,
UDP Source Port: 56500

Flow Statistics: 3/180
System Wide Packet Drops: 2045631
                              Reverse Path Failures: 6
                              Flow Block Drops: 0

The dropstats output sampled a few seconds apart shows the Discards counter incrementing:
root@5b4s10:~# dropstats | grep -v ' 0$' | grep -v '^$' | grep -v Cloned | grep -v Duplicated
Discards 446
Invalid NH 3
No L2 Route 10
root@5b4s10:~# dropstats | grep -v ' 0$' | grep -v '^$' | grep -v Cloned | grep -v Duplicated
Discards 449
Invalid NH 3
No L2 Route 10
root@5b4s10:~# dropstats | grep -v ' 0$' | grep -v '^$' | grep -v Cloned | grep -v Duplicated
Discards 452
Invalid NH 3
No L2 Route 10

Tags: vrouter
tags: added: blocker
Hari Prasad Killi (haripk) wrote:

The issue occurs when policy is enabled on the vhost interface (in the latest 4.1 builds, policy on the vhost interface is disabled by default, so this problem will not be seen). The health check packets need to be made to go through in this scenario.

Hari Prasad Killi (haripk) wrote:

Removing the blocker tag, as this happens only when policy is enabled on the vhost interface.

Fix: in the default VRF, when the destination address is 169.254.*, vrouter will not do a flow lookup, irrespective of whether policy on vhost is enabled or disabled.
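
A minimal sketch of that check (hypothetical names, not the actual vrouter change): skip the flow lookup in the default VRF (VRF 0) when the destination falls inside the 169.254.0.0/16 metadata subnet.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: the fix described above amounts to skipping the flow
 * lookup in the default VRF when the destination is a metadata address. */
static bool skip_flow_lookup(unsigned int vrf, uint32_t dst_ip_host_order)
{
    const uint32_t METADATA_NET  = 0xA9FE0000u; /* 169.254.0.0 */
    const uint32_t METADATA_MASK = 0xFFFF0000u; /* /16 */

    return vrf == 0 &&
           (dst_ip_host_order & METADATA_MASK) == METADATA_NET;
}

int main(void)
{
    /* 169.254.255.254, the metadata service address from the flow dump above */
    printf("%d\n", skip_flow_lookup(0, 0xA9FEFFFEu));  /* prints 1: bypass flow lookup */
    /* 12.1.1.3, the VM address, in a non-default VRF */
    printf("%d\n", skip_flow_lookup(2, 0x0C010103u));  /* prints 0: normal processing */
    return 0;
}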

tags: removed: blocker
OpenContrail Admin (ci-admin-f) wrote: [Review update] R4.1

Review in progress for https://review.opencontrail.org/37492
Submitter: Divakar Dharanalakota (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote: A change has been merged

Reviewed: https://review.opencontrail.org/37492
Committed: http://github.com/Juniper/contrail-vrouter/commit/29edfac69b6eea42406c24c50dfec993ea4b73ff
Submitter: Zuul (<email address hidden>)
Branch: R4.1

commit 29edfac69b6eea42406c24c50dfec993ea4b73ff
Author: Divakar D <email address hidden>
Date: Tue Nov 14 11:49:55 2017 +0530

Disable flow processing for metadata subnet on Vhost0

Currently the health check is broken if policy is enabled on the Vhost
interface. Health check packets are destined to metadata IPs. The
metadata routes point to VMIs in the Vhost VRF, while regular VM routes
are present in the VN's VRF. If a VM is detected unreachable, HC
withdraws the routes in the VRF corresponding to the VM, but the
metadata routes are not withdrawn, so HC packets should still be routed
using these metadata routes even when the VM routes are withdrawn. If
Vhost has policy enabled, flow processing happens earlier than the
route lookup, and the destination metadata IPs are NATed to the VM's
IP. These NATed IPs would be looked up in the VN's VRF, which results
in a drop nexthop as the routes are withdrawn.

The solution is not to NAT the packet till the route lookup is complete
for these metadata IPs if the policy is enabled. The required flow
processing would still be completed if the nexthop is marked policy
enabled.

Change-Id: I144c36faf39b062026316a067e912eed5a2fa792
closes-bug: #1724945
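
In other words, the fix reorders the vhost receive path so metadata-destined packets hit the route lookup before any NAT. A compilable sketch of that ordering, with hypothetical names that do not exist in the vrouter source:

#include <stdbool.h>
#include <stdint.h>

struct packet;                                  /* opaque in this sketch */
struct nexthop { bool policy_enabled; };

enum action { FORWARDED, DROPPED };

/* 169.254.0.0/16 metadata subnet check; address in host byte order. */
static bool is_metadata_dst(uint32_t dst)
{
    return (dst & 0xFFFF0000u) == 0xA9FE0000u;
}

/* Stubs standing in for the real datapath steps. */
static enum action flow_process(struct packet *pkt) { (void)pkt; return FORWARDED; }
static enum action forward_pkt(struct packet *pkt, struct nexthop *nh) { (void)pkt; (void)nh; return FORWARDED; }
static struct nexthop *route_lookup(uint32_t dst) { (void)dst; static struct nexthop nh = { true }; return &nh; }

static enum action vhost_input(struct packet *pkt, uint32_t dst_ip, bool vhost_policy)
{
    /* Before the fix: with policy on vhost, flow processing (and NAT) ran
     * first for every packet, so metadata IPs were NATed to the VM IP and
     * looked up in the VN VRF, where the withdrawn route gave a drop NH. */
    if (vhost_policy && !is_metadata_dst(dst_ip))
        return flow_process(pkt);

    /* After the fix: metadata destinations reach the route lookup un-NATed,
     * so the never-withdrawn metadata route in the vhost VRF is used. */
    struct nexthop *nh = route_lookup(dst_ip);

    /* Flow processing still runs when the resolved nexthop itself is
     * marked policy enabled, as the commit message notes. */
    if (nh->policy_enabled)
        return flow_process(pkt);

    return forward_pkt(pkt, nh);
}

int main(void)
{
    /* A probe to 169.254.255.254 (the metadata address in the flow dump
     * above) with policy enabled on vhost now takes the route-first path. */
    return vhost_input((struct packet *)0, 0xA9FEFFFEu, true) == FORWARDED ? 0 : 1;
}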

OpenContrail Admin (ci-admin-f) wrote: [Review update] master

Review in progress for https://review.opencontrail.org/37588
Submitter: Divakar Dharanalakota (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote: A change has been merged

Reviewed: https://review.opencontrail.org/37588
Committed: http://github.com/Juniper/contrail-vrouter/commit/529470a8423f4db8ebd6338d461527c6245be46a
Submitter: Zuul (<email address hidden>)
Branch: master

commit 529470a8423f4db8ebd6338d461527c6245be46a
Author: Divakar D <email address hidden>
Date: Tue Nov 14 11:49:55 2017 +0530

Disable flow processing for metadata subnet on Vhost0

Currently the health check is broken if policy is enabled on the Vhost
interface. Health check packets are destined to metadata IPs. The
metadata routes point to VMIs in the Vhost VRF, while regular VM routes
are present in the VN's VRF. If a VM is detected unreachable, HC
withdraws the routes in the VRF corresponding to the VM, but the
metadata routes are not withdrawn, so HC packets should still be routed
using these metadata routes even when the VM routes are withdrawn. If
Vhost has policy enabled, flow processing happens earlier than the
route lookup, and the destination metadata IPs are NATed to the VM's
IP. These NATed IPs would be looked up in the VN's VRF, which results
in a drop nexthop as the routes are withdrawn.

The solution is not to NAT the packet till the route lookup is complete
for these metadata IPs if the policy is enabled. The required flow
processing would still be completed if the nexthop is marked policy
enabled.

Change-Id: I144c36faf39b062026316a067e912eed5a2fa792
closes-bug: #1724945
