Pods crashing due to calico not routing traffic correctly on the hosts

Bug #1962599 reported by Camille Rodriguez
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Calico Charm
Incomplete
Critical
Unassigned

Bug Description

My coredns, kube-state-metrics, and dashboard-metrics-scraper pods are in CrashLoopBackOff. Each pod is unable to reach its own liveness probe. The pods that are not crashing at the moment are the ones that do not have a liveness/readiness probe. rp_filter is set to 2 on all interfaces on the masters/workers, and ignore-loose-rpf is set to true. The setup is Calico with VXLAN=Always and IPIP=Never. All kubernetes-master and kubernetes-worker units were deployed on the same subnet.
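
For reference, the rp_filter mode can be checked per interface straight from /proc (a read-only sketch; interface names will differ on the masters/workers):

```shell
# Print the reverse-path filter mode for every interface.
# 0 = no filtering, 1 = strict (RFC 3704), 2 = loose.
for f in /proc/sys/net/ipv4/conf/*/rp_filter; do
    printf '%s = %s\n' "$f" "$(cat "$f")"
done
```

Note that the kernel uses the maximum of the `all` and per-interface values as the effective mode, which is why setting `net.ipv4.conf.all.rp_filter=2` affects every interface.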

Logs : https://pastebin.canonical.com/p/4S86bQndMh/
calico-kube-controller logs: https://pastebin.canonical.com/p/N44WBdMY6z/
k8s-worker calico felix log: https://paste.ubuntu.com/p/B5gPvxSTct/
k8s-control-01 calico felix log: https://paste.ubuntu.com/p/np6JQmQkTB/
k8s-control-02 calico felix log: https://paste.ubuntu.com/p/DPj53f3h8C/
k8s-control-03 calico felix current log: https://paste.ubuntu.com/p/cGDwbgfQCf/

I didn't see any errors in the logs; I searched for fatal, error, rp_filter, etc.

No proxy in the environment.

Investigation showed that a pod isn't able to ping its host. With tcpdump, we see the ping traffic arrive on the host, but the kernel does not process it: https://pastebin.canonical.com/p/bjPJdhtz8K/. If we start the ping from the worker towards the container instead, there is still no reply, but we see both the request and the reply in the tcpdump: https://pastebin.canonical.com/p/8h3cGHNdzK/

From a k8s worker:
iptables : https://paste.ubuntu.com/p/XTZ2MH2MFx/
ip route: https://paste.ubuntu.com/p/M6rxDXJmNQ/
net.ipv4.conf.all.forwarding = 1

I tested a deployment without calico, replacing it by flannel, and the networking works as expected. https://pastebin.canonical.com/p/Krp8vwp6s7/

This is affecting a current deployment.

Tags: cpe-onsite
Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

Issue has been confirmed with cynerva and addyess from the kubernetes team last Friday

Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

Subscribed field critical, since this bug prevents functional networking in a charmed kubernetes deployment

Revision history for this message
George Kraft (cynerva) wrote :

While troubleshooting this, we observed that ICMP packets going from the pod -> kubernetes-worker made it through the iptables PREROUTING chains, where the packet was ultimately accepted, but then the packet never went to INPUT. It looked like something in the kernel outside of iptables was filtering the packets. This only occurred with packets from the container sent to the kubernetes-worker unit's bondM interface; the container had no issues pinging kubernetes-worker IPs attached to other interfaces.
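
One way to confirm that reverse-path filtering, rather than iptables, is eating the packets is martian logging. A sketch, read-only as written (the write and dmesg steps need root):

```shell
# Current martian-logging setting (0 = off, 1 = log reverse-path drops).
cat /proc/sys/net/ipv4/conf/all/log_martians
# To enable (root):  sysctl -w net.ipv4.conf.all.log_martians=1
# Then watch for "martian source" lines in the kernel log:
#   dmesg | grep -i martian
```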

With Flannel, traffic from the worker to the pod uses the source IP belonging to the cni0 interface instead of the default interface. I suspect that is why the issue only occurs with Calico.

I'm marking this as Incomplete for now because we are unable to reproduce. We will need more information to find the underlying issue.

Changed in charm-calico:
importance: Undecided → Critical
status: New → Incomplete
Revision history for this message
George Kraft (cynerva) wrote :

With the Flannel deployment, it would be helpful if you can try pinging the kubernetes-worker unit's bondM interface IP from a pod. That would help us confirm if the underlying pod -> bondM traffic is still being dropped.

Otherwise, we may need you to reproduce the deployment with Calico so we can investigate further.

I would like to look deeper into how the interfaces are configured. Please attach or otherwise send us output of the following commands on a kubernetes-worker unit:

sysctl -a
ip addr
ip route

Alternatively, if you can give us access to a failing environment, we can do further investigation ourselves. Please let us know how you would like to proceed.

Revision history for this message
George Kraft (cynerva) wrote :

I got access to the failing environment (thanks!) and was able to track down the issue. Packets from cali* to the bondM subnet are dropped due to a combination of IP policy routing rules and reverse path filtering (rp_filter).

Policy routing rules: https://paste.ubuntu.com/p/xh7k7bS92P/

Alternate routing table: https://paste.ubuntu.com/p/kbfCHZtF2R/

Essentially, all outbound traffic with source IP in the bondM subnet is routed out to the bondM interface. This bypasses the Calico routes to the cali* and vxlan.calico interfaces that only exist in the main routing table.

Inbound traffic from the pod to the bondM subnet is dropped because there is no reverse path leading back to the cali* interface it arrives on. Confusingly, this occurs even though the initial traffic from the worker to the pod is sent out the cali* interface with a source IP from the bondM subnet; the policy routing rules apparently do not apply to that outbound traffic because the kernel chooses the bondM source address *after* it has chosen the route.
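
The interplay can be inspected read-only on a worker; a sketch (on an affected host, the bondM-subnet rule from the paste above appears ahead of the main-table lookup):

```shell
# Policy routing rules, lowest priority number first. A rule such as
#   "from <bondM-subnet> lookup <table>"
# diverts replies sourced from bondM away from the main table.
ip rule show
# The cali* and vxlan.calico routes exist only in the main table, so
# traffic steered to another table never sees them.
ip route show table main
```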

============
Workaround 1
============

One way to prevent the failure of kubelet->pod readiness checks, while leaving the underlying networking the same, is to configure Calico's Felix with a DeviceRouteSourceAddress. This sets a source address hint in the cali* routes, avoiding the bondM subnet entirely. However, pods will still be unable to reach addresses on the bondM subnet.

Example for one worker: https://paste.ubuntu.com/p/qRyKQzKrv4/

This would need to be repeated once for each worker.
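
For reference, the same per-node Felix setting can be written as a FelixConfiguration resource and applied with `calicoctl apply -f` (a sketch; the node name and address below are placeholders, and the charm may overwrite Felix settings it manages):

```yaml
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: node.k8s-worker-0        # placeholder: "node." + the Calico node name
spec:
  # Source-address hint for the cali* routes; pick a worker address
  # that is NOT in the bondM subnet.
  deviceRouteSourceAddress: 192.168.20.11   # placeholder
```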

============
Workaround 2
============

You can create a new policy routing rule to direct traffic to the pod subnet back to the main routing table.

Example: https://paste.ubuntu.com/p/6yYbYqrZZ5/

However, this command will not survive a host reboot, so it may need to be configured in MAAS to make it permanent.
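
Since MAAS renders host networking through netplan, one way to persist the rule is a `routing-policy` entry on the bond (a sketch; the interface name, pod subnet, and priority are placeholders for this environment):

```yaml
network:
  version: 2
  bonds:
    bondM:                        # placeholder bond name
      routing-policy:
        # Equivalent of: ip rule add to 10.128.0.0/16 lookup main
        - to: 10.128.0.0/16       # pod (Calico) subnet
          table: 254              # 254 is the main routing table
          priority: 50
```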

===============
Recommended fix
===============

Field indicates that the policy routing rules come from MAAS configuration and that they may be able to fix this with advanced policy routing configuration outside of the Calico charm.

I don't think any fixes are needed in the Calico charm. If needed, we could look into making it easier to configure Felix's DeviceRouteSourceAddress, however I think we should only pursue this if there's a clear use case for making Calico work while keeping the bondM subnet isolated from the pod network.

Revision history for this message
George Kraft (cynerva) wrote :

Given that workarounds exist, I recommend downgrading this from Field Critical.

I will leave this as Incomplete until we hear back from field on whether or not a fix in the Calico charm is still desired.

Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

Thank you, George, for all the help! Removing field critical. I'm still unsure which part of the deployment added the routing rule for bondM; I think it's part of the default configuration from MAAS. A permanent workaround is to use the advanced-policy-routing charm to add the correct routing to the Calico network. I believe this should be added by default to our deployment templates on bare metal. Example:

With 192.168.20.0/24 as the OAM network and 10.128.0.0/16 as the Calico network.

  advanced-policy-routing:
    charm: cs:advanced-routing
    options:
      action-managed-update: False
      enable-advanced-routing: True
      advanced-routing-config: |
        [ {
            "type": "table",
            "table": "SF1"
        }, {
            "type": "route",
            "default_route": true,
            "gateway": "10.147.254.1",
            "table": "SF1"
        }, {
            "type": "rule",
            "from-net": "192.168.20.0/24",
            "to-net": "10.128.0.0/16",
            "priority": 0
        }, {
            "type": "rule",
            "from-net": "10.128.0.0/16",
            "table": "SF1",
            "priority": 100
        } ]

Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

To add this bug to the fce-templates, I need to switch this bug to private.

information type: Public → Private
information type: Private → Public