failing metrics cause 500 error

Bug #1996204 reported by Marian Gasparovic
This bug affects 1 person

Affects  Status        Importance  Assigned to          Milestone
MAAS     Fix Released  High        Alexsander de Souza
3.3      Fix Released  High        Alexsander de Souza

Bug Description

MAAS 3.2.6
This is a long-running MAAS with tens of deployments daily. Today every juju deploy, and also every cleanup, resulted in an error containing `Exception: 'i' format requires -2147483648 <= number <= 2147483647`

Here is the first occurrence:

2022-11-10 06:21:27 maasserver: [error] ################################ Exception: 'i' format requires -2147483648 <= number <= 2147483647 ################################
2022-11-10 06:21:27 maasserver: [error] Traceback (most recent call last):
  File "/snap/maas/23947/lib/python3.8/site-packages/maasserver/utils/views.py", line 243, in handle_uncaught_exception
    raise exc from exc.__cause__
  File "/snap/maas/23947/lib/python3.8/site-packages/maasserver/utils/views.py", line 309, in get_response
    response = django_get_response(request)
  File "/snap/maas/23947/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 75, in get_response
    response = self._middleware_chain(request)
  File "/snap/maas/23947/lib/python3.8/site-packages/maasserver/prometheus/middleware.py", line 55, in __call__
    self._process_metrics(request, response, latency, latencies)
  File "/snap/maas/23947/lib/python3.8/site-packages/maasserver/prometheus/middleware.py", line 66, in _process_metrics
    self.prometheus_metrics.update(
  File "/snap/maas/23947/lib/python3.8/site-packages/provisioningserver/prometheus/utils.py", line 80, in update
    metric = metric.labels(**all_labels)
  File "/snap/maas/23947/usr/lib/python3/dist-packages/prometheus_client/metrics.py", line 154, in labels
    self._metrics[labelvalues] = self.__class__(
  File "/snap/maas/23947/usr/lib/python3/dist-packages/prometheus_client/metrics.py", line 491, in __init__
    super(Histogram, self).__init__(
  File "/snap/maas/23947/usr/lib/python3/dist-packages/prometheus_client/metrics.py", line 102, in __init__
    self._metric_init()
  File "/snap/maas/23947/usr/lib/python3/dist-packages/prometheus_client/metrics.py", line 521, in _metric_init
    self._buckets.append(values.ValueClass(
  File "/snap/maas/23947/usr/lib/python3/dist-packages/prometheus_client/values.py", line 49, in __init__
    self.__reset()
  File "/snap/maas/23947/usr/lib/python3/dist-packages/prometheus_client/values.py", line 66, in __reset
    self._value = self._file.read_value(self._key)
  File "/snap/maas/23947/usr/lib/python3/dist-packages/prometheus_client/mmap_dict.py", line 120, in read_value
    self._init_value(key)
  File "/snap/maas/23947/usr/lib/python3/dist-packages/prometheus_client/mmap_dict.py", line 106, in _init_value
    _pack_integer(self._m, 0, self._used)
  File "/snap/maas/23947/usr/lib/python3/dist-packages/prometheus_client/mmap_dict.py", line 22, in _pack_integer
    data[pos:pos + 4] = _pack_integer_func(value)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

2022-11-10 06:21:27 regiond: [info] 127.0.0.1 POST /MAAS/api/2.0/nodes/gadc83/interfaces/115002/?op=link_subnet HTTP/1.1 --> 500 INTERNAL_SERVER_ERROR (referrer: -; agent: Go-http-client/1.1)
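
For reference, the `struct.error` at the bottom of the traceback can be reproduced in isolation. The sketch below assumes, based only on the traceback, that `_pack_integer` packs the `_used` value with the signed 32-bit `'i'` format. `fits_int32` is a hypothetical helper written for this illustration and is not part of prometheus_client.

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the upper bound quoted in the error


def fits_int32(value: int) -> bool:
    """Return True if value can be packed with struct's signed 'i' format."""
    try:
        struct.pack("i", value)
    except struct.error:
        return False
    return True


if __name__ == "__main__":
    print(fits_int32(INT32_MAX))      # largest value that still packs
    print(fits_int32(INT32_MAX + 1))  # one past the limit raises struct.error
```

If that reading is right, once the packed value passes 2147483647 every request that needs a new label combination would fail the same way, which would be consistent with the 500 responses logged above.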

After discussion with the MAAS team, it was determined that the metrics DB is probably corrupted and that a service restart should help.

I will report whether the restart helped, but this failure should be handled more gracefully and should not break MAAS.

Revision history for this message
Marian Gasparovic (marosg) wrote :

Just an update: the problem is gone after a restart.

Changed in maas:
importance: Undecided → High
status: New → Triaged
milestone: none → 3.4.0
Revision history for this message
Alexander Balderson (asbalderson) wrote :

We bumped into this again after about 2 months of uptime. Maybe the long uptime of the service produced a number outside of the range?

Additionally, I was unable to view or update the config for the Prometheus endpoint because every request returned the same range error.

Revision history for this message
Marian Gasparovic (marosg) wrote :

Before restart

$ maas root maas get-config name=prometheus_enabled
'i' format requires -2147483648 <= number <= 2147483647

After restart

$ maas root maas get-config name=prometheus_enabled
Success.
Machine-readable output follows:
false

Note that Prometheus was not enabled.

Revision history for this message
Alberto Donato (ack) wrote :

Does this still happen with 3.3?

Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Marian Gasparovic (marosg) wrote :

We are still running 3.2.7 as our long-running MAAS. We hit it again today, and a MAAS service restart was needed to resolve it.

Changed in maas:
status: Incomplete → New
Revision history for this message
Bill Wear (billwear) wrote :

Let me acknowledge your concerns about the error cropping up in the 3.2.7 version. Good to know that; obviously frustrating for you. It looks like we tried to fix this in the 3.3 version, and with your workflow, I know it's tough to upgrade on a whim. That leaves us in kind of a weird place. I'm not sure how to proceed, precisely, so I'm going to take a chance and move this back to "Incomplete", just so you can elaborate more on your thoughts regarding the path forward.

Given that we're *thinking* we fixed it in 3.3, how would you suggest we proceed? We want to make this work for you. I'm just not sure what the right answer might be. Open to your thoughts, here, but is there some way you could do a quick test of the 3.3 version, in a controlled, temporary environment? Maybe in a side-by-side configuration? I can only imagine your challenges on a daily basis, but I think the results of that might help us move forward. We're more than willing to work closely to get this resolved -- again, just not real clear on how to get there.

Changed in maas:
status: New → Incomplete
Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

1) Was it installed from DEBs or the snap?

2) What is the size of /var/lib/maas/prometheus (deb)? (Extracting this from the snap is not so trivial; ask me if needed.)

3) Does this MAAS deploy VMs or just bare metal?

4) How often do you use the REST API?
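
For question 2, a minimal sketch of one way to measure the directory. The path is the deb location named above; the snap location is not given in this report, so it is not guessed here. `directory_size_bytes` is a hypothetical helper, not a MAAS tool.

```python
import os


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so linked files are not counted.
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


if __name__ == "__main__":
    # Deb install path from the question above; returns 0 if it does not exist.
    print(directory_size_bytes("/var/lib/maas/prometheus"))
```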

Revision history for this message
Marian Gasparovic (marosg) wrote :

1. snap, 3.2/stable
2. I will ping you about that
3. both VMs and baremetal
4. several times a day, querying available machines, creating and deleting resource pools and assigning machines to them for each test run. Deploys are called exclusively from Juju.

Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

We should not use the operation URL as a metric label; in long-lived systems this leads to an explosion in the number of histograms in the DB.
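
To illustrate the point, here is a rough sketch of how label choice affects the number of histograms. It is not the actual MAAS change: `url_label`, `op_label`, `count_series`, and the sample traffic are all hypothetical, and the URLs are only modelled on the request logged in the bug description.

```python
from urllib.parse import parse_qs, urlsplit


def url_label(method: str, url: str) -> tuple:
    """Unbounded: every distinct URL becomes its own label value."""
    return (method, url)


def op_label(method: str, url: str) -> tuple:
    """Bounded: keep only the HTTP method and the `op` query parameter."""
    query = parse_qs(urlsplit(url).query)
    return (method, query.get("op", [""])[0])


def count_series(label_fn, requests) -> int:
    """Number of distinct label sets, i.e. histograms that would be created."""
    return len({label_fn(method, url) for method, url in requests})


# Hypothetical traffic: the same operation against 1000 different interfaces.
REQUESTS = [
    ("POST", f"/MAAS/api/2.0/nodes/node{i}/interfaces/{i}/?op=link_subnet")
    for i in range(1000)
]

if __name__ == "__main__":
    print(count_series(url_label, REQUESTS))  # one histogram per URL
    print(count_series(op_label, REQUESTS))   # one histogram per operation
```

With the full URL as a label, the number of histograms grows with every new node or interface ID seen, so it is unbounded over the lifetime of the service. With a fixed set of operation names it stays constant.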

Changed in maas:
status: Incomplete → Triaged
assignee: nobody → Alexsander de Souza (alexsander-souza)
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → New
Changed in maas:
status: New → In Progress
Changed in maas:
status: In Progress → Fix Committed
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.0-beta3
Alberto Donato (ack)
Changed in maas:
status: Fix Committed → Fix Released
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Fix is backported to 3.3.x
