MySQL pods being restarted by possible OOM killer?

Bug #2051915 reported by Matt Verran
This bug affects 2 people
Affects: OpenStack Snap
Status: Fix Committed
Importance: Critical
Assigned to: Guillaume Boutry

Bug Description

Once OpenStack is up and running, even if left unused, the MySQL instances will go offline into maintenance.

The issue appears to be the OOM killer; note the exit code 137 in the container status below (a quick way to confirm this is sketched after the output).

  mysql:
    Container ID: containerd://9662e1935ad36b0ddb41b55c58d70f9cf9e242734c3613c046a261b3acc14b4c
    Image: registry.jujucharms.com/charm/62ptdfbrjpw3n9tcnswjpart30jauc6wc5wbi/mysql-image@sha256:b0d2e028aa86173918c319a0728a1124b6f025985bc6e9774f32d30bbfc96722
    Image ID: registry.jujucharms.com/charm/62ptdfbrjpw3n9tcnswjpart30jauc6wc5wbi/mysql-image@sha256:b0d2e028aa86173918c319a0728a1124b6f025985bc6e9774f32d30bbfc96722
    Port: <none>
    Host Port: <none>
    Command:
      /charm/bin/pebble
    Args:
      run
      --create-dirs
      --hold
      --http
      :38813
      --verbose
    State: Running
      Started: Tue, 30 Jan 2024 14:34:07 +0000
    Last State: Terminated
      Reason: Error
      Exit Code: 137
      Started: Tue, 30 Jan 2024 14:27:20 +0000
      Finished: Tue, 30 Jan 2024 14:34:05 +0000
    Ready: True
    Restart Count: 74
    Limits:
      memory: 2Gi
    Requests:
      memory: 2Gi
    Liveness: http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Readiness: http-get http://:38813/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
    Environment:
      JUJU_CONTAINER_NAME: mysql
      PEBBLE_SOCKET: /charm/container/pebble.socket
    Mounts:
      /charm/bin/pebble from charm-data (ro,path="charm/bin/pebble")
      /charm/container from charm-data (rw,path="charm/containers/mysql")
      /var/lib/mysql from nova-mysql-database-666e6aec (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hfl6x (ro)
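
A quick way to cross-check whether the kernel OOM killer was actually involved (a sketch; the pod name and the openstack namespace are the ones from this report, and the commands are run on the MicroK8s host):
```
# How Kubernetes recorded the last termination of the mysql container.
# A cgroup OOM kill of the container's main process usually shows
# Reason: OOMKilled; if only a child process (e.g. mysqld under pebble)
# was killed, it can still show a plain Error with exit code 137.
kubectl -n openstack describe pod nova-mysql-0 | grep -A 8 'Last State'

# The kernel logs the OOM kill itself, so the host journal is the more
# reliable signal.
journalctl -k | grep -iE 'oom-kill|out of memory'
```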

It seems this happens around the time the scheduled process (Juju-based?) runs log rotation and similar tasks. It's possible some of the noise in the logs (locale and MySQL password plugin messages) exacerbates this. I have seen the OOM killer take out containers when the file cache contributes extra weight to the container; the other solution may be to log to the backing store?

Observed on openstack 2023.2, snap revision 375, from the 2023.2/edge channel.

Updating mysql and mysql-router to 120/edge and 93/edge respectively doesn't help, despite a health-check fix.

Matt Verran (mv-2112) wrote :

Using a quick and dirty 'kubectl edit statefulset' on the mysql config to bump the limit from 2Gi to 4Gi clears the OOM restarts with exit code 137; however, general instability continues,

e.g:-
openstack cinder-mysql-0 2/2 Running 0 32m
openstack placement-mysql-0 2/2 Running 2 (6m50s ago) 32m
openstack horizon-mysql-0 2/2 Running 1 (6m49s ago) 31m
openstack nova-mysql-0 2/2 Running 1 (2m1s ago) 32m
openstack neutron-mysql-0 2/2 Running 2 (118s ago) 30m
openstack keystone-mysql-0 2/2 Running 1 (118s ago) 35m
openstack glance-mysql-0 2/2 Running 2 (117s ago) 31m

Querying placement-mysql-0 shows an exit code of 0 and 2 restarts, so something else is still triggering this.

    State: Running
      Started: Fri, 08 Feb 2024 18:51:13 +0000
    Last State: Terminated
      Reason: Completed
      Exit Code: 0
      Started: Fri, 08 Feb 2024 18:45:05 +0000
      Finished: Fri, 08 Feb 2024 18:51:13 +0000
    Ready: True
    Restart Count: 2
    Limits:
      memory: 4Gi
    Requests:
      memory: 4Gi
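
As an aside, a one-liner like the following summarises restart counts and last exit codes for the mysql containers across the namespace, which helps separate OOM kills (exit 137) from clean exits (0). A sketch, assuming the openstack namespace and the container name mysql from the describe output above:
```
kubectl -n openstack get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[?(@.name=="mysql")].restartCount}{"\t"}{.status.containerStatuses[?(@.name=="mysql")].lastState.terminated.exitCode}{"\t"}{.status.containerStatuses[?(@.name=="mysql")].lastState.terminated.reason}{"\n"}{end}'
```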

Guillaume Boutry (gboutry) wrote :

A different way to manage MySQL memory has been committed (and released to 2023.2/edge).

Basically, memory constraints are now set within the MySQL configuration rather than as Kubernetes limits, so it shouldn't be OOM-killed anymore.

Changed in snap-openstack:
assignee: nobody → Guillaume Boutry (gboutry)
importance: Undecided → Critical
status: New → Fix Committed
Matt Verran (mv-2112) wrote (last edit ):

If this is in 2023.2/edge (390) then unfortunately I'm still seeing issues. It's definitely less frequent, but still there.

NAME READY STATUS RESTARTS AGE
-----------------------8<--------------------8<-----------------------
horizon-0 2/2 Running 1 (81m ago) 117m
octavia-0 4/4 Running 1 (79m ago) 80m
keystone-mysql-0 2/2 Running 3 (27m ago) 118m
cinder-mysql-0 2/2 Running 3 (23m ago) 118m
neutron-mysql-0 2/2 Running 3 (17m ago) 118m
horizon-mysql-0 2/2 Running 5 (15m ago) 118m
placement-mysql-0 2/2 Running 4 (23m ago) 118m
magnum-mysql-0 2/2 Running 3 (12m ago) 89m
designate-mysql-0 2/2 Running 3 (11m ago) 86m
nova-mysql-0 2/2 Running 3 (7m10s ago) 118m
heat-mysql-0 2/2 Running 4 (6m11s ago) 98m
octavia-mysql-0 2/2 Running 3 (2m13s ago) 81m
barbican-mysql-0 2/2 Running 4 (70s ago) 93m
glance-mysql-0 1/2 Running 6 (11s ago) 118m

kubectl describe pod glance-mysql-0 -n openstack

  mysql:
    Container ID: containerd://360f8aa94fe905107622234457cb7f4a6b9dd738a960b047e4b86756fcfa49cd
    Image: registry.jujucharms.com/charm/62ptdfbrjpw3n9tcnswjpart30jauc6wc5wbi/mysql-image@sha256:b0d2e028aa86173918c319a0728a1124b6f025985bc6e9774f32d30bbfc96722
    Image ID: registry.jujucharms.com/charm/62ptdfbrjpw3n9tcnswjpart30jauc6wc5wbi/mysql-image@sha256:b0d2e028aa86173918c319a0728a1124b6f025985bc6e9774f32d30bbfc96722
    Port: <none>
    Host Port: <none>
    Command:
      /charm/bin/pebble
    Args:
      run
      --create-dirs
      --hold
      --http
      :38813
      --verbose
    State: Running
      Started: Mon, 12 Feb 2024 11:01:06 +0000
    Last State: Terminated
      Reason: Error
      Exit Code: 137
      Started: Mon, 12 Feb 2024 10:50:07 +0000
      Finished: Mon, 12 Feb 2024 11:01:06 +0000
    Ready: True

Matt Verran (mv-2112) wrote :

@gboutry, re-reading what you mentioned above, do you mean it's user-configurable in the manifest?

If so, I'm guessing we are now using profile-limit-memory as documented at https://charmhub.io/mysql-k8s/configure#profile-limit-memory. Is there a sample config available until the documentation is updated at https://microstack.run/docs, please?

# mysql-k8s:
# channel: 8.0/candidate
# revision: null
# config: null

Guillaume Boutry (gboutry) wrote :

The DPE team has found a possible candidate for an issue they have that is similar to yours:

https://bugs.launchpad.net/juju/+bug/2052517

A fix will come, but it has to bubble up through the stack. I believe your issue is not a memory-constraint issue.

But if you still want to increase the memory limit:

The manifest feature was merged to edge very recently, and has not been documented yet.

Basically, the following manifest should let you set the memory limit:
```
software:
  mysql-k8s:
    config:
      profile-limit-memory: 4096
```
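
To apply it, save the snippet to a file and pass it when bootstrapping, roughly as below (the exact option name may vary between snap revisions, so check the output of sunbeam cluster bootstrap --help):
```
# Save the manifest above as manifest.yaml, then bootstrap with it
sunbeam cluster bootstrap --manifest manifest.yaml
```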

N.B.: as the removal of the memory constraint has only been merged in edge, if you apply this to an old deployment you will still have the memory constraints in the pod definition. You can check that with kubectl describe; you should see NO memory constraints. If some are there, either remove them with an edit or re-deploy the whole Sunbeam deployment.
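
For example, something like the following should show no memory entries under limits/requests once the constraints are gone (the statefulset and container names are just the ones from the output earlier in this bug):
```
kubectl -n openstack get statefulset nova-mysql \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="mysql")].resources}'
```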

Carlos Manso (cmanso) wrote :

If it is of any help, I'm having the same issue using 2023.2. I only have the test VM running, and the host has plenty of memory (62 GB).

I have attached the description of the pod and its logs.

cmanso@proxima:~$ snap list
Name Version Rev Tracking Publisher Notes
core18 20231027 2812 latest/stable canonical✓ base
core20 20240111 2182 latest/stable canonical✓ base
core22 20240111 1122 latest/stable canonical✓ base
juju 3.2.4 25443 3.2/stable canonical✓ -
juju-db 4.4.18 160 4.4/stable juju-qa -
lxd 5.0.3-ffb17cf 27037 5.0/stable/… canonical✓ -
microceph 18.2.0+snap3f8909a69a 862 latest/stable canonical✓ held
microk8s v1.28.7 6532 1.28-strict/stable canonical✓ -
openstack 2023.2 335 2023.2/stable canonical✓ -
openstack-hypervisor 2023.2 123 2023.2/stable canonical✓ -
snapd 2.61.1 20671 latest/stable canonical✓ snapd

Matt Verran (mv-2112) wrote :

A quick update on my findings since @cmanso chimed in:

I agree with @gboutry that memory is one cause, and there now seems to be a control to help with this. I also agree that the health-check tunables are the most likely remaining cause, given the change I've seen after applying the memory settings. What I'm still seeing is that even after easing the health checks it's still happening.

That said, if you follow the DPE link @gboutry posted you'll see a fix is in, so I'm going to try a fresh install to see whether this is fixed once and for all, given the lock issue in Pebble.

I still believe, from an enterprise system-tuning standpoint, that having all of these log rotations etc. running at the same interval is a bad thing (see https://bugs.launchpad.net/snap-openstack/+bug/2051692); this is why we apply random time offsets to cron jobs. I suspect the Pebble lock bug is ultimately uncovered by this synchronous running of jobs.
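
For illustration only, staggering a host-level cron job is usually just a random sleep in front of the job body. The rotation discussed in this bug is driven by the charm/Pebble rather than a host crontab, so this is only a sketch of the general technique:
```
# /etc/cron.d/example-logrotate (illustrative name and schedule)
SHELL=/bin/bash
# Sleep a random 0-299 seconds first so identical jobs on many units
# do not all fire at exactly the same moment.
17 3 * * * root sleep $((RANDOM % 300)); /usr/sbin/logrotate /etc/logrotate.conf
```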

Matt Verran (mv-2112) wrote :

As of 2023.2 (432) there appears to be stability with a simple bootstrap, having watched multiple MySQL jobs run through. Normally I would have seen 3-5 restarts by now.

NAME READY STATUS RESTARTS AGE
modeloperator-7bcf6c9fc6-6kjg2 1/1 Running 0 28m
certificate-authority-0 1/1 Running 0 26m
horizon-mysql-router-0 2/2 Running 0 26m
cinder-ceph-mysql-router-0 2/2 Running 0 26m
rabbitmq-0 2/2 Running 0 26m
keystone-mysql-router-0 2/2 Running 0 26m
glance-mysql-router-0 2/2 Running 0 26m
cinder-ceph-0 2/2 Running 0 26m
cinder-mysql-router-0 2/2 Running 0 26m
placement-mysql-router-0 2/2 Running 0 25m
nova-api-mysql-router-0 2/2 Running 0 25m
nova-mysql-router-0 2/2 Running 0 25m
nova-cell-mysql-router-0 2/2 Running 0 25m
neutron-mysql-router-0 2/2 Running 0 25m
ovn-relay-0 2/2 Running 0 25m
placement-0 2/2 Running 0 25m
cinder-0 3/3 Running 0 25m
glance-0 2/2 Running 0 25m
neutron-0 2/2 Running 0 25m
ovn-central-0 4/4 Running 0 24m
traefik-0 2/2 Running 0 26m
neutron-mysql-0 2/2 Running 0 25m
placement-mysql-0 2/2 Running 0 25m
horizon-mysql-0 2/2 Running 0 25m
keystone-mysql-0 2/2 Running 0 25m
cinder-mysql-0 2/2 Running 0 25m
glance-mysql-0 2/2 Running 0 25m
nova-mysql-0 2/2 Running 0 25m
traefik-public-0 2/2 Running 0 26m
keystone-0 2/2 Running 0 26m
horizon-0 2/2 Running 0 26m
nova-0 4/4 Running 0 24m

I will continue on to enable DNS, CaaS, etc., which normally triggers chaos.

Matt Verran (mv-2112) wrote :

Enabling more services does still seem to trigger the issue. It is considerably reduced, but it can still cause problems when deploying (running sunbeam configure, for example, may fail when certain DBs, such as glance or keystone, are offline).

NAME READY STATUS RESTARTS AGE
modeloperator-7bcf6c9fc6-6kjg2 1/1 Running 0 83m
certificate-authority-0 1/1 Running 0 81m
horizon-mysql-router-0 2/2 Running 0 81m
cinder-ceph-mysql-router-0 2/2 Running 0 81m
rabbitmq-0 2/2 Running 0 81m
keystone-mysql-router-0 2/2 Running 0 81m
glance-mysql-router-0 2/2 Running 0 80m
cinder-ceph-0 2/2 Running 0 81m
cinder-mysql-router-0 2/2 Running 0 80m
placement-mysql-router-0 2/2 Running 0 80m
nova-api-mysql-router-0 2/2 Running 0 80m
nova-mysql-router-0 2/2 Running 0 80m
nova-cell-mysql-router-0 2/2 Running 0 79m
ovn-relay-0 2/2 Running 0 80m
placement-0 2/2 Running 0 80m
cinder-0 3/3 Running 0 80m
glance-0 2/2 Running 0 80m
neutron-0 2/2 Running 0 79m
ovn-central-0 4/4 Running 0 79m
traefik-0 2/2 Running 0 81m
traefik-public-0 2/2 Running 0 81m
keystone-0 2/2 Running 0 81m
horizon-0 2/2 Running 0 81m
nova-0 4/4 Running 0 79m
vault-0 2/2 Running 0 52m
barbican-mysql-router-0 2/2 Running 0 50m
barbican-0 3/3 Running 0 50m
heat-mysql-router-0 2/2 Running 0 46m
heat-0 4/4 Running 0 46m
magnum-mysql-router-0 2/2 Running 0 43m
neutron-mysql-router-0 2/2 Running 1 (42m ago) 79m
magnum-0 3/3 Running 0 42m
designate-mysql-router-0 2/2 Running 0 36m
bind-0 2/2 Running 0 36m
designate-0 2/2 Running 0 36m
octavia-mysql-router-0 2/2 Running 0 31m
octavia-0 4/4 Running 0 31m
heat-mysql-0 2/2 Running 2 (18m ago) 46m
glance-mysql-0 2/2 Running 2 (18m ago) 79m
octavia-mysql-0 2/2 Running 1 (10m ago) 31m
cinder-mysql-0 2/2 Running 4...


Carlos Manso (cmanso) wrote (last edit ):

I tried applying the solution shown in the link @gboutry provided: I modified the StatefulSet to increase the liveness delay to 300s and the timeout to 2s, and increased the memory to 4 GB, but I still see the issue.

Anyway, if the issue were the liveness probe, Kubernetes shouldn't report an OOM error as the exit cause, right?

    Command:
      /charm/bin/pebble
    Args:
      run
      --create-dirs
      --hold
      --http
      :38813
      --verbose
    State: Running
      Started: Wed, 06 Mar 2024 11:31:12 +0000
    Last State: Terminated
      Reason: Error
      Exit Code: 137
      Started: Wed, 06 Mar 2024 08:42:07 +0000
      Finished: Wed, 06 Mar 2024 11:31:12 +0000
    Ready: True
    Restart Count: 5
    Limits:
      memory: 4Gi
    Requests:
      memory: 4Gi
    Liveness: http-get http://:38813/v1/health%3Flevel=alive delay=300s timeout=2s period=5s #success=1 #failure=1
    Readiness: http-get http://:38813/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
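
For reference, the change above can also be made with a strategic-merge patch rather than a manual edit. A sketch only: glance-mysql is just an example name, and Juju owns these StatefulSets, so the change may be reverted on the next charm refresh:
```
kubectl -n openstack patch statefulset glance-mysql --type strategic -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"mysql","livenessProbe":{"initialDelaySeconds":300,"timeoutSeconds":2}}]}}}}'
```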

Matt Verran (mv-2112) wrote :

For single-node installs, I noticed that @james-page had just one MySQL instance some time ago. It looks like adding --topology single --database single when bootstrapping the sunbeam cluster can help with this.

This is currently undocumented at https://microstack.run/docs.
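
For reference, on a fresh single-node install that would look roughly like this (the flags are the ones mentioned above; check the sunbeam help output for the exact subcommand spelling on your snap revision):
```
# One shared MySQL instance for all services on a single-node install
sunbeam cluster bootstrap --topology single --database single
```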

Carlos Manso (cmanso) wrote :

Unfortunately, I still have the same issue after adding --topology single --database single to the bootstrap, although for the moment I think there have been fewer restarts of the mysql-0 pod.
