Non-leader units encounter an errored update-status hook after snapd cycling and mysqld OOM

Bug #1938731 reported by John Lettman
Affects: MySQL InnoDB Cluster Charm
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

I encountered a strange condition on my deployment: snapd times out, which causes numerous failed systemd restarts of snapd and eventually leads to an OOM kill of mysqld. When mysqld returns, it appears the update-status hook is invoked and fails.
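For reference, the errored hook is visible from the Juju client; something like the following (unit name taken from my deployment) shows the failed update-status hook and the surrounding unit log:

# workload/agent status; the unit reports hook failed: "update-status"
juju status mysql-innodb-cluster
# replay the unit's log history and filter for the failing hook
juju debug-log --replay --include unit-mysql-innodb-cluster-0 | grep update-status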

The condition starts with a watchdog timeout of snapd:

Aug 2 10:15:58 juju-399412-jplettman-10 systemd[1]: snapd.service: Watchdog timeout (limit 5min)!
Aug 2 10:15:58 juju-399412-jplettman-10 systemd[1]: snapd.service: Killing process 601501 (snapd) with signal SIGABRT.
Aug 2 10:15:59 juju-399412-jplettman-10 snapd[601501]: SIGABRT: abort
Aug 2 10:15:59 juju-399412-jplettman-10 snapd[601501]: PC=0x563af1dfdf61 m=0 sigcode=0
Aug 2 10:15:59 juju-399412-jplettman-10 snapd[601501]: goroutine 0 [idle]:

After this, systemd continuously attempts to restart snapd; each attempt fails with a timeout:

Aug 2 10:56:11 juju-399412-jplettman-10 systemd[1]: snapd.service: Scheduled restart job, restart counter is at 27.
Aug 2 10:56:11 juju-399412-jplettman-10 systemd[1]: Stopped Snap Daemon.
Aug 2 10:56:11 juju-399412-jplettman-10 systemd[1]: Starting Snap Daemon...
Aug 2 10:56:12 juju-399412-jplettman-10 snapd[880346]: AppArmor status: apparmor is enabled and all features are available
Aug 2 10:57:41 juju-399412-jplettman-10 systemd[1]: snapd.service: start operation timed out. Terminating.
Aug 2 10:57:41 juju-399412-jplettman-10 systemd[1]: snapd.service: Failed with result 'timeout'.
Aug 2 10:57:41 juju-399412-jplettman-10 systemd[1]: Failed to start Snap Daemon.

This continues until the kernel OOM killer is invoked against mysqld:

Aug 2 11:45:12 juju-399412-jplettman-10 kernel: [312144.522225] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/mysql.service,task=mysqld,pid=601613,uid=113
Aug 2 11:45:12 juju-399412-jplettman-10 kernel: [312144.522405] Out of memory: Killed process 601613 (mysqld) total-vm:3012028kB, anon-rss:1693768kB, file-rss:0kB, shmem-rss:0kB, UID:113 pgtables:3768kB oom_score_adj:0
Aug 2 11:45:13 juju-399412-jplettman-10 kernel: [312145.006217] oom_reaper: reaped process 601613 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Aug 2 11:45:14 juju-399412-jplettman-10 systemd[1]: mysql.service: Main process exited, code=killed, status=9/KILL
Aug 2 11:45:14 juju-399412-jplettman-10 systemd[1]: mysql.service: Failed with result 'signal'.
Aug 2 11:45:14 juju-399412-jplettman-10 systemd[1]: mysql.service: Scheduled restart job, restart counter is at 3.
Aug 2 11:45:14 juju-399412-jplettman-10 systemd[1]: Stopped MySQL Community Server.
Aug 2 11:45:14 juju-399412-jplettman-10 systemd[1]: Starting MySQL Community Server...

Around the time mysqld returns, the charm receives an address change and runs through the update-status hook:

2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:584 got address change for mysql-innodb-cluster/0: ok=true, hashes=[9eb7bc8515447e6e385acf5629579fb0500749ced743743bb6090fa15179ebfb]
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:552 got config change for mysql-innodb-cluster/0: ok=true, hashes=[17c4efe67eca78b7e5acdd8f084776f9ed7a47f709b4b9718e48caa217303cb5]
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:563 got trust config change for mysql-innodb-cluster/0: ok=true, hashes=[aeddf8dcd2b7e0ac0c37321c9d8645dff25799f0d355a7d57fccc94a3c9956b2]
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:633 got update status interval change for mysql-innodb-cluster/0: ok=true
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:511 got application change for mysql-innodb-cluster/0
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:623 got storage change for mysql-innodb-cluster/0: [] ok=true
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:501 got unit change for mysql-innodb-cluster/0
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:574 got upgrade series change
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:724 no upgrade series in progress, reinitializing local upgrade series state
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:613 got relations change for mysql-innodb-cluster/0: ok=true
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:595 got leader settings change for mysql-innodb-cluster/0: ok=true
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:605 got action change for mysql-innodb-cluster/0: [] ok=true
2021-08-02 11:45:29 INFO juju.worker.uniter resolver.go:144 awaiting error resolution for "update-status" hook
2021-08-02 11:45:29 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] error: hook failed: "update-status"
2021-08-02 11:45:29 DEBUG juju.worker.dependency engine.go:564 "metric-collect" manifold worker started at 2021-08-02 11:45:29.443523599 +0000 UTC
2021-08-02 11:49:33 DEBUG juju.worker.uniter.remotestate watcher.go:676 update status timer triggered for mysql-innodb-cluster/0
2021-08-02 11:49:33 INFO juju.worker.uniter resolver.go:144 awaiting error resolution for "update-status" hook

Afterwards, the hook remains in a failed state. When both services are up, the hook proceeds successfully:

root@juju-399412-jplettman-10:/var/lib/juju/agents/unit-mysql-innodb-cluster-0/charm# (systemctl status snapd; systemctl status mysql) | grep -B2 "Active:"
● snapd.service - Snap Daemon
     Loaded: loaded (/lib/systemd/system/snapd.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-08-02 11:45:21 UTC; 8h ago
--
● mysql.service - MySQL Community Server
     Loaded: loaded (/lib/systemd/system/mysql.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-08-02 11:45:31 UTC; 8h ago
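
If the unit stays stuck in the error state, a failed hook can also be retried manually from the Juju client; a rough sketch (model config may retry it automatically as well, so this is only what I would do by hand):

# retry the failed update-status hook on the affected unit
juju resolved mysql-innodb-cluster/0
# or mark it resolved without re-running the hook
juju resolved --no-retry mysql-innodb-cluster/0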

This occurred using the openstack-charmers stable bundle on ServerStack:

https://github.com/openstack-charmers/openstack-bundles/blob/master/stable/openstack-base/bundle.yaml

The instance for mysql-innodb-cluster uses m1.small with 2 GiB RAM, 20 GiB disk, and 1 vCPU.
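
Should a larger flavor be needed, a sketch of two ways to give the application more memory (the values are illustrative and not something I have validated against this bundle):

# deploy-time machine constraint
juju deploy mysql-innodb-cluster -n 3 --constraints "mem=4G"
# or via a bundle overlay (overlay.yaml):
#   applications:
#     mysql-innodb-cluster:
#       constraints: mem=4G
juju deploy ./bundle.yaml --overlay ./overlay.yaml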

Revision history for this message
John Lettman (jplettman) wrote :

Attached: juju-crashdump (other machines removed for size)

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

If the OOM killer was activated, was snapd failing because the machine was out of memory? I.e. I wonder whether 2 GiB is enough to run MySQL in clustered mode. I think 4 GiB is the minimum for a clustered server, with 2 GiB for a standalone (although it's hard to verify that against the Oracle docs).
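
One way to check whether snapd itself was also being OOM-killed (assuming the kernel log from the incident is still on the machine) would be:

# list every OOM-killer event and which process was killed
grep -iE 'out of memory|oom-kill' /var/log/kern.log
# or, if the machine has not been rebooted since the incident:
dmesg -T | grep -iE 'out of memory|oom-kill'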

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Could you please retest on a 4 GiB instance?

Changed in charm-mysql-innodb-cluster:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for charm-mysql-innodb-cluster because there has been no activity for 60 days.]

Changed in charm-mysql-innodb-cluster:
status: Incomplete → Expired