ovn pods stuck in CrashLoopBackOff with "Illegal instruction (core dumped)"

Bug #1989363 reported by Bas de Bruijne
This bug affects 1 person

Affects          Status    Importance  Assigned to  Milestone
Kube OVN Charm   Invalid   High        Unassigned

Bug Description

In testrun https://solutions.qa.canonical.com/testruns/testRun/9b189314-b023-4944-a598-4b5905b1039a, a 1.25 release test on jammy/AWS with kube_ovn, all k8s control-plane nodes and one kube-ovn unit are stuck waiting:

```
kubernetes-control-plane/0* waiting idle 12 3.88.150.106 6443/tcp Waiting for 3 kube-system pods to start
  containerd/3 active idle 3.88.150.106 Container runtime available
  filebeat/16 active idle 3.88.150.106 Filebeat ready.
  kube-ovn/3 waiting idle 3.88.150.106 Waiting to retry configuring Kube-OVN
  ntp/16 active idle 3.88.150.106 123/udp chrony: Ready
  telegraf/16 active idle 3.88.150.106 9103/tcp Monitoring kubernetes-control-plane/0 (source version/commit 76901fd)
kubernetes-control-plane/1 waiting idle 13 54.167.245.12 6443/tcp Waiting for 3 kube-system pods to start
  containerd/4 active idle 54.167.245.12 Container runtime available
  filebeat/19 active idle 54.167.245.12 Filebeat ready.
  kube-ovn/4 active idle 54.167.245.12
  ntp/19 active idle 54.167.245.12 123/udp chrony: Ready
  telegraf/19 active idle 54.167.245.12 9103/tcp Monitoring kubernetes-control-plane/1 (source version/commit 76901fd)
```

In the kube-system-ovs-ovn-gc6kk-openvswitch.log I see some errors:
```
kube-system-ovs-ovn-gc6kk-openvswitch.log:parm: log_ecn_error:Log packets received with corrupted ECN (bool)
kube-system-ovs-ovn-gc6kk-openvswitch.log-filename: /lib/modules/5.15.0-1019-aws/kernel/net/ipv4/netfilter/ip_tables.ko
kube-system-ovs-ovn-gc6kk-openvswitch.log-alias: ipt_icmp
kube-system-ovs-ovn-gc6kk-openvswitch.log-description: IPv4 packet filter
kube-system-ovs-ovn-gc6kk-openvswitch.log-author: Netfilter Core Team <email address hidden>
kube-system-ovs-ovn-gc6kk-openvswitch.log-license: GPL
kube-system-ovs-ovn-gc6kk-openvswitch.log-srcversion: 766BC3192DA4B163CBE393B
kube-system-ovs-ovn-gc6kk-openvswitch.log-depends: x_tables
kube-system-ovs-ovn-gc6kk-openvswitch.log-retpoline: Y
kube-system-ovs-ovn-gc6kk-openvswitch.log-intree: Y
kube-system-ovs-ovn-gc6kk-openvswitch.log-name: ip_tables
kube-system-ovs-ovn-gc6kk-openvswitch.log-vermagic: 5.15.0-1019-aws SMP mod_unload modversions
kube-system-ovs-ovn-gc6kk-openvswitch.log-sig_id: PKCS#7
kube-system-ovs-ovn-gc6kk-openvswitch.log-signer: Build time autogenerated kernel key
kube-system-ovs-ovn-gc6kk-openvswitch.log-sig_key: 4E:2B:42:19:F5:56:93:8B:20:A1:FC:3B:04:F8:B7:5A:8C:4C:32:A0
kube-system-ovs-ovn-gc6kk-openvswitch.log-sig_hashalgo: sha512
kube-system-ovs-ovn-gc6kk-openvswitch.log-signature: ...
```

But otherwise nothing stands out. Looking at the SKU (https://solutions.qa.canonical.com/skus/sku/fkb-undertest-kubernetes-jammy-aws-kube_ovn-kubernetes_release), we have already had some passing runs with the same configs.

Link to crashdumps:
https://oil-jenkins.canonical.com/artifacts/9b189314-b023-4944-a598-4b5905b1039a/index.html

Revision history for this message
George Kraft (cynerva) wrote :

It looks like kube-ovn/3 is stuck waiting for ovn-central pods to come up, and those pods are failing with a rather cryptic "Illegal instruction (core dumped)" message. From the ovn-central-5b6c868f-pvz4w pod:

 * ovn-northd is not running
 * ovnnb_db is not running
 * ovnsb_db is not running
timeout: the monitored command dumped core
timeout: the monitored command dumped core
Illegal instruction (core dumped)
Illegal instruction (core dumped)
 * Joining /etc/ovn/ovnnb_db.db to cluster
 * Starting ovsdb-nb
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
 * Waiting for to come up
Illegal instruction (core dumped)
Illegal instruction (core dumped)
 * Joining /etc/ovn/ovnsb_db.db to cluster
Illegal instruction (core dumped)
 * Starting ovsdb-sb
Illegal instruction (core dumped)
Illegal instruction (core dumped)
 * Waiting for to come up
 * OVN Northbound DB is not running
/kube-ovn/start-db.sh: line 268: 174 Illegal instruction (core dumped) ovs-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb-server/memory-trim-on-compaction on
 * ovn-northd is not running
 * ovnnb_db is not running
 * ovnsb_db is not running
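
For reference, these container logs can be pulled directly with kubectl; a minimal sketch (the pod name is taken from the message above and will differ per deployment):

```
# List the kube-system pods to spot the ones in CrashLoopBackOff:
kubectl -n kube-system get pods -o wide
# Dump logs from the failing ovn-central pod (--previous shows the crashed container):
kubectl -n kube-system logs ovn-central-5b6c868f-pvz4w --previous
```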

Revision history for this message
George Kraft (cynerva) wrote :

It looks like upstream fixed this by building and releasing separate kube-ovn amd64 images with AVX-512 instructions disabled. These images are published under a separate tag, such as kubeovn/kube-ovn:v1.10.4-no-avx512.

We might need to sync the -no-avx512 images to rocks.canonical.com and expose a config option to override the kube-ovn image URI.
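
A rough sketch of what mirroring the upstream tag might look like; the rocks.canonical.com destination path below is purely illustrative, not the real repository layout:

```
# Pull the upstream no-AVX-512 amd64 image and retag it for an internal registry.
docker pull kubeovn/kube-ovn:v1.10.4-no-avx512
docker tag kubeovn/kube-ovn:v1.10.4-no-avx512 rocks.canonical.com/cdk/kubeovn/kube-ovn:v1.10.4-no-avx512
docker push rocks.canonical.com/cdk/kubeovn/kube-ovn:v1.10.4-no-avx512
```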

However, I'm unable to confirm whether this will fix the issue, as I cannot reproduce it. I tried deploying on AWS, in region us-east-1, with series jammy and machine constraints matching those from the crashdump, but the kube-ovn pods came up fine. The crashdump is missing the EC2 instance type and /proc/cpuinfo, so I can't verify the availability of AVX-512 instructions in your deployment.
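
For anyone hitting this, AVX-512 support can be checked directly from the CPU flags on the affected machine, for example:

```
# Any avx512* flag in /proc/cpuinfo means the CPU supports AVX-512;
# no output means the instructions are unavailable.
grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u
```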

Revision history for this message
George Kraft (cynerva) wrote :

Actually, looking deeper into the crashdump logs, I found that the kubernetes-control-plane machines had instance-type=m3.large. If I add instance-type=m3.large to my deploy constraints, I can reproduce the issue.

When I deploy Charmed Kubernetes with default constraints, both kubernetes-control-plane and kubernetes-worker come up with instance-type=r5dn.large, which does support AVX-512.

For now, I recommend working around this by constraining your kubernetes-control-plane and kubernetes-worker units to AWS instance types like r5dn.large that support AVX-512 instructions.
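
For example, the workaround can be applied per application with juju constraints (note that constraints only affect machines provisioned after they are set):

```
# Pin both applications to an AVX-512-capable instance type:
juju set-constraints kubernetes-control-plane instance-type=r5dn.large
juju set-constraints kubernetes-worker instance-type=r5dn.large
```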

no longer affects: charm-kubernetes-master
Changed in charm-kube-ovn:
importance: Undecided → High
status: New → Triaged
summary: - [1.25/beta] kubernetes-control-plane stuck waiting: "Waiting for 3 kube-system pods to start"
         + ovn pods stuck in CrashLoopBackOff with "Illegal instruction (core dumped)"
Revision history for this message
George Kraft (cynerva) wrote :

Per team meeting, we're interested in having the kube-ovn charm "just work" on older AMD64 machines that don't support AVX-512. A couple options come to mind:

1. Check CPU capabilities at deploy time and select the -no-avx512 image if needed
2. Publish our own kube-ovn multi-arch image that uses the -no-avx512 image for amd64

I think option 1 could get complicated as we have to check CPU capabilities across both kubernetes-control-plane and kubernetes-worker applications. I would recommend option 2 as a simpler, safer alternative.
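
For illustration only, the kind of runtime check option 1 implies might look like the shell fragment below (not actual charm code; the tags are those of the upstream releases mentioned above):

```
# Select the -no-avx512 image tag when the CPU lacks the avx512f flag.
if grep -q avx512f /proc/cpuinfo; then
    KUBE_OVN_TAG="v1.10.4"
else
    KUBE_OVN_TAG="v1.10.4-no-avx512"
fi
echo "selected kube-ovn image tag: ${KUBE_OVN_TAG}"
```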

Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

Thanks for looking into this. It is weird that we have had some passing runs on this SKU, though I suppose the instance hardware that AWS provides could be inconsistent.

Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

Marking as invalid due to inactivity

Changed in charm-kube-ovn:
status: Triaged → Invalid