Non-leader units encounter an errored update-status hook after snapd cycling and mysqld OOM

Bug #1938731 reported by John Lettman
Affects: MySQL InnoDB Cluster Charm
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

I encountered a strange condition on my deployment: snapd times out, which causes numerous failed systemd restarts of snapd and eventually leads to an OOM kill of mysqld. When mysqld returns, it appears the update-status hook is invoked and fails.
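For reference, the errored hook is visible from the Juju client; something like the following (unit name taken from my deployment) shows the failed update-status hook and the surrounding unit log:

# workload/agent status; the unit reports hook failed: "update-status"
juju status mysql-innodb-cluster
# replay the unit's log history and filter for the failing hook
juju debug-log --replay --include unit-mysql-innodb-cluster-0 | grep update-status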

The condition starts with a watchdog timeout of snapd:

Aug 2 10:15:58 juju-399412-jplettman-10 systemd[1]: snapd.service: Watchdog timeout (limit 5min)!
Aug 2 10:15:58 juju-399412-jplettman-10 systemd[1]: snapd.service: Killing process 601501 (snapd) with signal SIGABRT.
Aug 2 10:15:59 juju-399412-jplettman-10 snapd[601501]: SIGABRT: abort
Aug 2 10:15:59 juju-399412-jplettman-10 snapd[601501]: PC=0x563af1dfdf61 m=0 sigcode=0
Aug 2 10:15:59 juju-399412-jplettman-10 snapd[601501]: goroutine 0 [idle]:

After this, systemd continuously attempts to restart snapd; each attempt fails with a timeout:

Aug 2 10:56:11 juju-399412-jplettman-10 systemd[1]: snapd.service: Scheduled restart job, restart counter is at 27.
Aug 2 10:56:11 juju-399412-jplettman-10 systemd[1]: Stopped Snap Daemon.
Aug 2 10:56:11 juju-399412-jplettman-10 systemd[1]: Starting Snap Daemon...
Aug 2 10:56:12 juju-399412-jplettman-10 snapd[880346]: AppArmor status: apparmor is enabled and all features are available
Aug 2 10:57:41 juju-399412-jplettman-10 systemd[1]: snapd.service: start operation timed out. Terminating.
Aug 2 10:57:41 juju-399412-jplettman-10 systemd[1]: snapd.service: Failed with result 'timeout'.
Aug 2 10:57:41 juju-399412-jplettman-10 systemd[1]: Failed to start Snap Daemon.

This continues until the kernel OOM killer is invoked against mysqld:

Aug 2 11:45:12 juju-399412-jplettman-10 kernel: [312144.522225] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/mysql.service,task=mysqld,pid=601613,uid=113
Aug 2 11:45:12 juju-399412-jplettman-10 kernel: [312144.522405] Out of memory: Killed process 601613 (mysqld) total-vm:3012028kB, anon-rss:1693768kB, file-rss:0kB, shmem-rss:0kB, UID:113 pgtables:3768kB oom_score_adj:0
Aug 2 11:45:13 juju-399412-jplettman-10 kernel: [312145.006217] oom_reaper: reaped process 601613 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Aug 2 11:45:14 juju-399412-jplettman-10 systemd[1]: mysql.service: Main process exited, code=killed, status=9/KILL
Aug 2 11:45:14 juju-399412-jplettman-10 systemd[1]: mysql.service: Failed with result 'signal'.
Aug 2 11:45:14 juju-399412-jplettman-10 systemd[1]: mysql.service: Scheduled restart job, restart counter is at 3.
Aug 2 11:45:14 juju-399412-jplettman-10 systemd[1]: Stopped MySQL Community Server.
Aug 2 11:45:14 juju-399412-jplettman-10 systemd[1]: Starting MySQL Community Server...

Around the time mysqld returns, the charm receives an address change and runs through the update-status hook:

2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:584 got address change for mysql-innodb-cluster/0: ok=true, hashes=[9eb7bc8515447e6e385acf5629579fb0500749ced743743bb6090fa15179ebfb]
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:552 got config change for mysql-innodb-cluster/0: ok=true, hashes=[17c4efe67eca78b7e5acdd8f084776f9ed7a47f709b4b9718e48caa217303cb5]
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:563 got trust config change for mysql-innodb-cluster/0: ok=true, hashes=[aeddf8dcd2b7e0ac0c37321c9d8645dff25799f0d355a7d57fccc94a3c9956b2]
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:633 got update status interval change for mysql-innodb-cluster/0: ok=true
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:511 got application change for mysql-innodb-cluster/0
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:623 got storage change for mysql-innodb-cluster/0: [] ok=true
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:501 got unit change for mysql-innodb-cluster/0
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:574 got upgrade series change
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:724 no upgrade series in progress, reinitializing local upgrade series state
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:613 got relations change for mysql-innodb-cluster/0: ok=true
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:595 got leader settings change for mysql-innodb-cluster/0: ok=true
2021-08-02 11:45:29 DEBUG juju.worker.uniter.remotestate watcher.go:605 got action change for mysql-innodb-cluster/0: [] ok=true
2021-08-02 11:45:29 INFO juju.worker.uniter resolver.go:144 awaiting error resolution for "update-status" hook
2021-08-02 11:45:29 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] error: hook failed: "update-status"
2021-08-02 11:45:29 DEBUG juju.worker.dependency engine.go:564 "metric-collect" manifold worker started at 2021-08-02 11:45:29.443523599 +0000 UTC
2021-08-02 11:49:33 DEBUG juju.worker.uniter.remotestate watcher.go:676 update status timer triggered for mysql-innodb-cluster/0
2021-08-02 11:49:33 INFO juju.worker.uniter resolver.go:144 awaiting error resolution for "update-status" hook

Afterwards, the hook remains in a failed state. When both services are up, the hook proceeds successfully:

root@juju-399412-jplettman-10:/var/lib/juju/agents/unit-mysql-innodb-cluster-0/charm# (systemctl status snapd; systemctl status mysql) | grep -B2 "Active:"
● snapd.service - Snap Daemon
     Loaded: loaded (/lib/systemd/system/snapd.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-08-02 11:45:21 UTC; 8h ago
--
● mysql.service - MySQL Community Server
     Loaded: loaded (/lib/systemd/system/mysql.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-08-02 11:45:31 UTC; 8h ago
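
If the unit stays stuck in the error state, a failed hook can also be retried manually from the Juju client; a rough sketch (model config may retry it automatically as well, so this is only what I would do by hand):

# retry the failed update-status hook on the affected unit
juju resolved mysql-innodb-cluster/0
# or mark it resolved without re-running the hook
juju resolved --no-retry mysql-innodb-cluster/0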

This occurred using the openstack-charmers stable bundle on ServerStack:

https://github.com/openstack-charmers/openstack-bundles/blob/master/stable/openstack-base/bundle.yaml

The instance for mysql-innodb-cluster uses m1.small with 2 GiB RAM, 20 GiB disk, and 1 vCPU.
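
Should a larger flavor be needed, a sketch of two ways to give the application more memory (the values are illustrative and not something I have validated against this bundle):

# deploy-time machine constraint
juju deploy mysql-innodb-cluster -n 3 --constraints "mem=4G"
# or via a bundle overlay (overlay.yaml):
#   applications:
#     mysql-innodb-cluster:
#       constraints: mem=4G
juju deploy ./bundle.yaml --overlay ./overlay.yaml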

Revision history for this message
John Lettman (jplettman) wrote :

Attached: juju-crashdump (other machines removed for size)

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

If the OOM killer was activated, was snapd failing because the machine was out of memory? I.e. I wonder whether 2 GiB is enough to run MySQL in clustered mode. I think 4 GiB is the minimum for a clustered server, with 2 GiB for a standalone (although it's hard to verify that against the Oracle docs).
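
One way to check whether snapd itself was also being OOM-killed (assuming the kernel log from the incident is still on the machine) would be:

# list every OOM-killer event and which process was killed
grep -iE 'out of memory|oom-kill' /var/log/kern.log
# or, if the machine has not been rebooted since the incident:
dmesg -T | grep -iE 'out of memory|oom-kill'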

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Could you please retest on a 4 GiB instance?

Changed in charm-mysql-innodb-cluster:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for charm-mysql-innodb-cluster because there has been no activity for 60 days.]

Changed in charm-mysql-innodb-cluster:
status: Incomplete → Expired