New oom-killer related crash for low RAM UC devices
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
snapd | Fix Committed | High | Zygmunt Krynicki | | |
Bug Description
Checkbox18 is getting killed by the oom-killer during test sessions when the
new version of snapd (2.63) is in use on very low RAM (<500 MB) Ubuntu Core
devices. This behaviour is reproducible (100% of the runs are killed with the
new snapd version) and doesn't happen with the previous one (2.62).
The Checkbox snap is installed in devmode, which triggers various audit messages
and warnings, but we have discovered that this new version of snapd produces
far more messages/complaints in the journal.
Namely, the line that seems to be most represented in these logs has the form:
```
core18-memory audit[2114]: SECCOMP ... syscall=NNN
```
This line doesn't appear in the snapd 2.62 log; a similar line of the form
```
core18-memory audit[3698]: AVC ...
```
is present, but an order of magnitude less frequently.
Here is a table of the syscalls and their meanings (truncated to those with counts above 1k):
ID COUNT MEANING
4 36313 stat
257 23271 openat
5 20478 fstat
3 18008 close
0 14508 read
8 9872 lseek
228 7541 clock_gettime
16 4109 ioctl
9 4071 mmap
202 3417 futex
10 2981 mprotect
21 2375 access
78 2084 getdents
6 1975 lstat
13 1914 rt_sigaction
11 1666 munmap
83 1446 mkdir
12 1526 brk
96 1176 gettimeofday
267 1108 readlinkat
For reference, these are the counts with the previous version:
ID COUNT
41 42
116 30
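Counts like the ones in the two tables above can be derived from the journal by tallying the `syscall=NNN` field of each SECCOMP audit line. The following is a minimal sketch, not the script that was actually used for this report: the function names are hypothetical, the name mapping assumes x86_64 syscall numbers, and only a few IDs from the table are included. Only `audit[...]: SECCOMP` lines are counted, so the matching `kernel: audit: type=1326` lines are not counted a second time.

```python
import re
from collections import Counter

# Assumption: x86_64 syscall numbers; only a few IDs from the table are listed.
SYSCALL_NAMES = {0: "read", 3: "close", 4: "stat", 5: "fstat", 257: "openat"}

SYSCALL_RE = re.compile(r"\bsyscall=(\d+)\b")


def count_seccomp_syscalls(lines):
    """Count syscall numbers found in SECCOMP audit lines."""
    counts = Counter()
    for line in lines:
        # Skip AVC lines and the kernel-side "audit: type=1326" duplicates.
        if "SECCOMP" not in line:
            continue
        match = SYSCALL_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def format_table(counts):
    """Render the counts in the same ID / COUNT / MEANING layout as above."""
    rows = ["ID COUNT MEANING"]
    for number, count in counts.most_common():
        rows.append(f"{number} {count} {SYSCALL_NAMES.get(number, 'unknown')}")
    return "\n".join(rows)
```

With the journal saved to a text file (for example via `journalctl --no-pager`), `print(format_table(count_seccomp_syscalls(open(path))))` would print the table, where `path` is whatever file the journal was saved to.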
While Checkbox has started but is "idle" (waiting for a test to run), the
following are logged over and over:
```
May 30 14:22:11 core18-memory audit[2373]: SECCOMP auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2373 comm="python3" exe="/usr/
May 30 14:22:11 core18-memory kernel: audit: type=1326 audit(171707893
```
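To put a number on "logged over and over", the SECCOMP lines can be grouped by their one-second timestamp and by the `comm` field. This is a rough sketch that assumes the syslog-style line layout shown in the excerpt above (timestamp, hostname, `audit[PID]: SECCOMP ...`); `seccomp_rate` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumes lines shaped like the excerpt above:
# "May 30 14:22:11 core18-memory audit[2373]: SECCOMP ... comm="python3" ..."
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ audit\[\d+\]: SECCOMP\b"
)
COMM_RE = re.compile(r'\bcomm="([^"]*)"')


def seccomp_rate(lines):
    """Return (events per timestamp second, events per comm) as Counters."""
    per_second = Counter()
    per_comm = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not an audit[...] SECCOMP line
        per_second[match.group(1)] += 1
        comm = COMM_RE.search(line)
        if comm:
            per_comm[comm.group(1)] += 1
    return per_second, per_comm
```

The busiest second is then `per_second.most_common(1)`, which gives a quick way to compare an idle Checkbox on the two snapd versions.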
To work around this issue we have tried several combinations of the following:
- Creating a very large (4 GB) swap file. It is used, but the oom-killer
is still invoked when ~20% of it is occupied (the system freezes at this occupancy
and recovers after the oom-killer is done)
- Forcing vm.overcommit_
- Enabling RateLimitInterv
possibly not done properly)
The only things that seem to work are:
- Using the older version of snapd
- Using a device with more RAM (1 GB works fine)
- Increasing the RAM when running in a VM (same machine, more RAM)
We have created a shared drive directory with the logs of a run crashing with
this issue on the new version, and of one passing bootstrap with the old
version. You can find them here: https:/
As an example, the following run was failing with this new issue on the new version of
snapd (this log is basically Checkbox crashing and restarting):
- http://
Whereas here, after downgrading snapd, it seems to work fine:
- http://
Steps to reproduce:
- Provision a machine (or VM) with any UC operating system (we tested uc20 and uc18) with less than 500 MB of RAM and the latest version of snapd
- Install checkbox18 (or 20) and checkbox uc18/uc20
- sudo snap install checkbox18
- sudo snap install checkbox --channel=uc18 --devmode
- Run `sudo checkbox.
- Checkbox should be killed by the oom-killer soon after
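Before running the steps above, it can help to confirm that the machine really is in the affected range. A small sketch, assuming the standard `MemTotal:` line of `/proc/meminfo` (reported in kB); the helper names and the 500 MiB default threshold are illustrative, taken from the figure quoted in this report:

```python
def mem_total_mib(meminfo_text):
    """Extract MemTotal from /proc/meminfo content and convert kB to MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemTotal not found")


def is_low_ram(meminfo_text, threshold_mib=500):
    """True if total RAM is below the threshold used in this report."""
    return mem_total_mib(meminfo_text) < threshold_mib
```

On the device, `is_low_ram(open("/proc/meminfo").read())` should return `True` for a setup expected to reproduce the crash.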
description: | updated |
Changed in snapd: | |
assignee: | nobody → Zygmunt Krynicki (zyga) |
Changed in snapd: | |
status: | In Progress → Fix Committed |
I've filed a jira task https://warthogs.atlassian.net/browse/SNAPDENG-22672 for internal scheduling.