MySQL pods being restarted by possible OOM killer?

Bug #2051915 reported by Matt Verran
This bug affects 2 people
Affects: OpenStack Snap
Status: Fix Committed
Importance: Critical
Assigned to: Guillaume Boutry

Bug Description

Once OpenStack is up and running, even if left unused, the MySQL instances will go offline into maintenance.

The issue appears to be the OOM killer; note the exit code 137 in the container status below (a quick way to confirm this is sketched after the output).

  mysql:
    Container ID: containerd://9662e1935ad36b0ddb41b55c58d70f9cf9e242734c3613c046a261b3acc14b4c
    Image: registry.jujucharms.com/charm/62ptdfbrjpw3n9tcnswjpart30jauc6wc5wbi/mysql-image@sha256:b0d2e028aa86173918c319a0728a1124b6f025985bc6e9774f32d30bbfc96722
    Image ID: registry.jujucharms.com/charm/62ptdfbrjpw3n9tcnswjpart30jauc6wc5wbi/mysql-image@sha256:b0d2e028aa86173918c319a0728a1124b6f025985bc6e9774f32d30bbfc96722
    Port: <none>
    Host Port: <none>
    Command:
      /charm/bin/pebble
    Args:
      run
      --create-dirs
      --hold
      --http
      :38813
      --verbose
    State: Running
      Started: Tue, 30 Jan 2024 14:34:07 +0000
    Last State: Terminated
      Reason: Error
      Exit Code: 137
      Started: Tue, 30 Jan 2024 14:27:20 +0000
      Finished: Tue, 30 Jan 2024 14:34:05 +0000
    Ready: True
    Restart Count: 74
    Limits:
      memory: 2Gi
    Requests:
      memory: 2Gi
    Liveness: http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Readiness: http-get http://:38813/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
    Environment:
      JUJU_CONTAINER_NAME: mysql
      PEBBLE_SOCKET: /charm/container/pebble.socket
    Mounts:
      /charm/bin/pebble from charm-data (ro,path="charm/bin/pebble")
      /charm/container from charm-data (rw,path="charm/containers/mysql")
      /var/lib/mysql from nova-mysql-database-666e6aec (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hfl6x (ro)
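
A quick way to cross-check whether the kernel OOM killer was actually involved (a sketch; the pod name and the openstack namespace are the ones from this report, and the commands are run on the MicroK8s host):
```
# How Kubernetes recorded the last termination of the mysql container.
# A cgroup OOM kill of the container's main process usually shows
# Reason: OOMKilled; if only a child process (e.g. mysqld under pebble)
# was killed, it can still show a plain Error with exit code 137.
kubectl -n openstack describe pod nova-mysql-0 | grep -A 8 'Last State'

# The kernel logs the OOM kill itself, so the host journal is the more
# reliable signal.
journalctl -k | grep -iE 'oom-kill|out of memory'
```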

It seems this happens around the time the scheduled process (Juju-based?) runs log rotation and similar tasks. It's possible some of the noise in the logs (locale and MySQL password plugin messages) exacerbates this. I have seen the OOM killer take out containers when the file cache contributes extra weight to the container; the other solution may be to log to the backing store?

Observed on openstack 2023.2, snap revision 375, from the 2023.2/edge channel.

Updating mysql and mysql-router to 120/edge and 93/edge respectively doesn't help, despite a health-check fix.

Matt Verran (mv-2112) wrote :

Using a quick and dirty 'kubectl edit statefulset' on the mysql config to bump the limit from 2Gi to 4Gi clears the OOM restarts with exit code 137; however, general instability continues,

e.g:-
openstack cinder-mysql-0 2/2 Running 0 32m
openstack placement-mysql-0 2/2 Running 2 (6m50s ago) 32m
openstack horizon-mysql-0 2/2 Running 1 (6m49s ago) 31m
openstack nova-mysql-0 2/2 Running 1 (2m1s ago) 32m
openstack neutron-mysql-0 2/2 Running 2 (118s ago) 30m
openstack keystone-mysql-0 2/2 Running 1 (118s ago) 35m
openstack glance-mysql-0 2/2 Running 2 (117s ago) 31m

Querying placement-mysql-0 shows an exit code of 0 and 2 restarts, so something else is still triggering this.

    State: Running
      Started: Fri, 08 Feb 2024 18:51:13 +0000
    Last State: Terminated
      Reason: Completed
      Exit Code: 0
      Started: Fri, 08 Feb 2024 18:45:05 +0000
      Finished: Fri, 08 Feb 2024 18:51:13 +0000
    Ready: True
    Restart Count: 2
    Limits:
      memory: 4Gi
    Requests:
      memory: 4Gi
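
As an aside, a one-liner like the following summarises restart counts and last exit codes for the mysql containers across the namespace, which helps separate OOM kills (exit 137) from clean exits (0). A sketch, assuming the openstack namespace and the container name mysql from the describe output above:
```
kubectl -n openstack get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[?(@.name=="mysql")].restartCount}{"\t"}{.status.containerStatuses[?(@.name=="mysql")].lastState.terminated.exitCode}{"\t"}{.status.containerStatuses[?(@.name=="mysql")].lastState.terminated.reason}{"\n"}{end}'
```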

Guillaume Boutry (gboutry) wrote :

A different way to manage MySQL memory has been committed (and released to 2023.2/edge).

Basically, memory constraints are now set within the MySQL configuration rather than as Kubernetes limits, so it shouldn't be OOM-killed anymore.

Changed in snap-openstack:
assignee: nobody → Guillaume Boutry (gboutry)
importance: Undecided → Critical
status: New → Fix Committed
Matt Verran (mv-2112) wrote (last edit ):

If this is in 2023.2/edge (390) then unfortunately I'm still seeing issues. It's definitely less frequent, but still there.

NAME READY STATUS RESTARTS AGE
-----------------------8<--------------------8<-----------------------
horizon-0 2/2 Running 1 (81m ago) 117m
octavia-0 4/4 Running 1 (79m ago) 80m
keystone-mysql-0 2/2 Running 3 (27m ago) 118m
cinder-mysql-0 2/2 Running 3 (23m ago) 118m
neutron-mysql-0 2/2 Running 3 (17m ago) 118m
horizon-mysql-0 2/2 Running 5 (15m ago) 118m
placement-mysql-0 2/2 Running 4 (23m ago) 118m
magnum-mysql-0 2/2 Running 3 (12m ago) 89m
designate-mysql-0 2/2 Running 3 (11m ago) 86m
nova-mysql-0 2/2 Running 3 (7m10s ago) 118m
heat-mysql-0 2/2 Running 4 (6m11s ago) 98m
octavia-mysql-0 2/2 Running 3 (2m13s ago) 81m
barbican-mysql-0 2/2 Running 4 (70s ago) 93m
glance-mysql-0 1/2 Running 6 (11s ago) 118m

kubectl describe pod glance-mysql-0 -n openstack

  mysql:
    Container ID: containerd://360f8aa94fe905107622234457cb7f4a6b9dd738a960b047e4b86756fcfa49cd
    Image: registry.jujucharms.com/charm/62ptdfbrjpw3n9tcnswjpart30jauc6wc5wbi/mysql-image@sha256:b0d2e028aa86173918c319a0728a1124b6f025985bc6e9774f32d30bbfc96722
    Image ID: registry.jujucharms.com/charm/62ptdfbrjpw3n9tcnswjpart30jauc6wc5wbi/mysql-image@sha256:b0d2e028aa86173918c319a0728a1124b6f025985bc6e9774f32d30bbfc96722
    Port: <none>
    Host Port: <none>
    Command:
      /charm/bin/pebble
    Args:
      run
      --create-dirs
      --hold
      --http
      :38813
      --verbose
    State: Running
      Started: Mon, 12 Feb 2024 11:01:06 +0000
    Last State: Terminated
      Reason: Error
      Exit Code: 137
      Started: Mon, 12 Feb 2024 10:50:07 +0000
      Finished: Mon, 12 Feb 2024 11:01:06 +0000
    Ready: True

Matt Verran (mv-2112) wrote :

@gboutry, re-reading what you mentioned above, do you mean it's user-configurable in the manifest?

If so, I'm guessing we are now using profile-limit-memory as documented at https://charmhub.io/mysql-k8s/configure#profile-limit-memory. Is there a sample config available until the documentation is updated at https://microstack.run/docs, please?

# mysql-k8s:
# channel: 8.0/candidate
# revision: null
# config: null

Guillaume Boutry (gboutry) wrote :

The DPE team has found a possible candidate for an issue they have that is similar to yours:

https://bugs.launchpad.net/juju/+bug/2052517

A fix will come, but it has to bubble up through the stack. I believe your issue is not a memory-constraint issue.

But if you still want to increase the memory limit:

The manifest feature was merged to edge very recently, and has not been documented yet.

Basically, the following manifest should let you set the memory limit:
```
software:
  mysql-k8s:
    config:
      profile-limit-memory: 4096
```
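
To apply it, save the snippet to a file and pass it when bootstrapping, roughly as below (the exact option name may vary between snap revisions, so check the output of sunbeam cluster bootstrap --help):
```
# Save the manifest above as manifest.yaml, then bootstrap with it
sunbeam cluster bootstrap --manifest manifest.yaml
```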

N.B.: as the removal of the memory constraint has only been merged in edge, if you apply this to an old deployment you will still have the memory constraints in the pod definition. You can check that with kubectl describe; you should see NO memory constraints. If some are there, either remove them with an edit or re-deploy the whole Sunbeam deployment.
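
For example, something like the following should show no memory entries under limits/requests once the constraints are gone (the statefulset and container names are just the ones from the output earlier in this bug):
```
kubectl -n openstack get statefulset nova-mysql \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="mysql")].resources}'
```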

Carlos Manso (cmanso) wrote :

If it is of any help, I'm having the same issue using 2023.2. I only have the test VM running, and the host has plenty of memory (62 GB).

I have attached the description of the pod and its logs.

cmanso@proxima:~$ snap list
Name Version Rev Tracking Publisher Notes
core18 20231027 2812 latest/stable canonical✓ base
core20 20240111 2182 latest/stable canonical✓ base
core22 20240111 1122 latest/stable canonical✓ base
juju 3.2.4 25443 3.2/stable canonical✓ -
juju-db 4.4.18 160 4.4/stable juju-qa -
lxd 5.0.3-ffb17cf 27037 5.0/stable/… canonical✓ -
microceph 18.2.0+snap3f8909a69a 862 latest/stable canonical✓ held
microk8s v1.28.7 6532 1.28-strict/stable canonical✓ -
openstack 2023.2 335 2023.2/stable canonical✓ -
openstack-hypervisor 2023.2 123 2023.2/stable canonical✓ -
snapd 2.61.1 20671 latest/stable canonical✓ snapd

Matt Verran (mv-2112) wrote :

A quick update on my findings since @cmanso chimed in:

I agree with @gboutry that memory is one cause, and there now seems to be a control to help with this. I also agree that the health-check tunables are the most likely remaining cause, given the change I've seen after applying the memory settings. What I'm still seeing is that even after easing the health checks it's still happening.

That said, if you follow the DPE link @gboutry posted you'll see a fix is in, so I'm going to try a fresh install to see whether this is fixed once and for all, given the lock issue in Pebble.

I still believe, from an enterprise system-tuning standpoint, that having all of these log rotations etc. running at the same interval is a bad thing (see https://bugs.launchpad.net/snap-openstack/+bug/2051692); this is why we apply random time offsets to cron jobs. I suspect the Pebble lock bug is ultimately uncovered by this synchronous running of jobs.
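
For illustration only, staggering a host-level cron job is usually just a random sleep in front of the job body. The rotation discussed in this bug is driven by the charm/Pebble rather than a host crontab, so this is only a sketch of the general technique:
```
# /etc/cron.d/example-logrotate (illustrative name and schedule)
SHELL=/bin/bash
# Sleep a random 0-299 seconds first so identical jobs on many units
# do not all fire at exactly the same moment.
17 3 * * * root sleep $((RANDOM % 300)); /usr/sbin/logrotate /etc/logrotate.conf
```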

Matt Verran (mv-2112) wrote :

As of 2023.2 (432) there appears to be stability with a simple bootstrap, having watched multiple MySQL jobs run through. Normally I would have seen 3-5 restarts by now.

NAME READY STATUS RESTARTS AGE
modeloperator-7bcf6c9fc6-6kjg2 1/1 Running 0 28m
certificate-authority-0 1/1 Running 0 26m
horizon-mysql-router-0 2/2 Running 0 26m
cinder-ceph-mysql-router-0 2/2 Running 0 26m
rabbitmq-0 2/2 Running 0 26m
keystone-mysql-router-0 2/2 Running 0 26m
glance-mysql-router-0 2/2 Running 0 26m
cinder-ceph-0 2/2 Running 0 26m
cinder-mysql-router-0 2/2 Running 0 26m
placement-mysql-router-0 2/2 Running 0 25m
nova-api-mysql-router-0 2/2 Running 0 25m
nova-mysql-router-0 2/2 Running 0 25m
nova-cell-mysql-router-0 2/2 Running 0 25m
neutron-mysql-router-0 2/2 Running 0 25m
ovn-relay-0 2/2 Running 0 25m
placement-0 2/2 Running 0 25m
cinder-0 3/3 Running 0 25m
glance-0 2/2 Running 0 25m
neutron-0 2/2 Running 0 25m
ovn-central-0 4/4 Running 0 24m
traefik-0 2/2 Running 0 26m
neutron-mysql-0 2/2 Running 0 25m
placement-mysql-0 2/2 Running 0 25m
horizon-mysql-0 2/2 Running 0 25m
keystone-mysql-0 2/2 Running 0 25m
cinder-mysql-0 2/2 Running 0 25m
glance-mysql-0 2/2 Running 0 25m
nova-mysql-0 2/2 Running 0 25m
traefik-public-0 2/2 Running 0 26m
keystone-0 2/2 Running 0 26m
horizon-0 2/2 Running 0 26m
nova-0 4/4 Running 0 24m

I will continue on to enable DNS, CaaS, etc., which normally triggers chaos.

Matt Verran (mv-2112) wrote :

Enabling more services does still seem to trigger the issue. It is considerably reduced, but it can still cause problems when deploying (running sunbeam configure, for example, may fail when certain DBs, such as glance or keystone, are offline).

NAME READY STATUS RESTARTS AGE
modeloperator-7bcf6c9fc6-6kjg2 1/1 Running 0 83m
certificate-authority-0 1/1 Running 0 81m
horizon-mysql-router-0 2/2 Running 0 81m
cinder-ceph-mysql-router-0 2/2 Running 0 81m
rabbitmq-0 2/2 Running 0 81m
keystone-mysql-router-0 2/2 Running 0 81m
glance-mysql-router-0 2/2 Running 0 80m
cinder-ceph-0 2/2 Running 0 81m
cinder-mysql-router-0 2/2 Running 0 80m
placement-mysql-router-0 2/2 Running 0 80m
nova-api-mysql-router-0 2/2 Running 0 80m
nova-mysql-router-0 2/2 Running 0 80m
nova-cell-mysql-router-0 2/2 Running 0 79m
ovn-relay-0 2/2 Running 0 80m
placement-0 2/2 Running 0 80m
cinder-0 3/3 Running 0 80m
glance-0 2/2 Running 0 80m
neutron-0 2/2 Running 0 79m
ovn-central-0 4/4 Running 0 79m
traefik-0 2/2 Running 0 81m
traefik-public-0 2/2 Running 0 81m
keystone-0 2/2 Running 0 81m
horizon-0 2/2 Running 0 81m
nova-0 4/4 Running 0 79m
vault-0 2/2 Running 0 52m
barbican-mysql-router-0 2/2 Running 0 50m
barbican-0 3/3 Running 0 50m
heat-mysql-router-0 2/2 Running 0 46m
heat-0 4/4 Running 0 46m
magnum-mysql-router-0 2/2 Running 0 43m
neutron-mysql-router-0 2/2 Running 1 (42m ago) 79m
magnum-0 3/3 Running 0 42m
designate-mysql-router-0 2/2 Running 0 36m
bind-0 2/2 Running 0 36m
designate-0 2/2 Running 0 36m
octavia-mysql-router-0 2/2 Running 0 31m
octavia-0 4/4 Running 0 31m
heat-mysql-0 2/2 Running 2 (18m ago) 46m
glance-mysql-0 2/2 Running 2 (18m ago) 79m
octavia-mysql-0 2/2 Running 1 (10m ago) 31m
cinder-mysql-0 2/2 Running 4...


Carlos Manso (cmanso) wrote (last edit ):

I tried applying the solution shown in the link @gboutry provided: I modified the StatefulSet to increase the liveness delay to 300s and the timeout to 2s, and increased the memory to 4 GB, but I still see the issue.

Anyway, if the issue were the liveness probe, Kubernetes shouldn't report an OOM error as the exit cause, right?

    Command:
      /charm/bin/pebble
    Args:
      run
      --create-dirs
      --hold
      --http
      :38813
      --verbose
    State: Running
      Started: Wed, 06 Mar 2024 11:31:12 +0000
    Last State: Terminated
      Reason: Error
      Exit Code: 137
      Started: Wed, 06 Mar 2024 08:42:07 +0000
      Finished: Wed, 06 Mar 2024 11:31:12 +0000
    Ready: True
    Restart Count: 5
    Limits:
      memory: 4Gi
    Requests:
      memory: 4Gi
    Liveness: http-get http://:38813/v1/health%3Flevel=alive delay=300s timeout=2s period=5s #success=1 #failure=1
    Readiness: http-get http://:38813/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
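
For reference, the change above can also be made with a strategic-merge patch rather than a manual edit. A sketch only: glance-mysql is just an example name, and Juju owns these StatefulSets, so the change may be reverted on the next charm refresh:
```
kubectl -n openstack patch statefulset glance-mysql --type strategic -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"mysql","livenessProbe":{"initialDelaySeconds":300,"timeoutSeconds":2}}]}}}}'
```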

Matt Verran (mv-2112) wrote :

For single-node installs, I noticed that @james-page had just one MySQL instance some time ago. It looks like adding --topology single --database single when bootstrapping the sunbeam cluster can help with this.

This is currently undocumented at https://microstack.run/docs.
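
For reference, on a fresh single-node install that would look roughly like this (the flags are the ones mentioned above; check the sunbeam help output for the exact subcommand spelling on your snap revision):
```
# One shared MySQL instance for all services on a single-node install
sunbeam cluster bootstrap --topology single --database single
```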

Carlos Manso (cmanso) wrote :

Unfortunately, I still have the same issue after adding --topology single --database single to the bootstrap, although for the moment I think there have been fewer restarts of the mysql-0 pod.
