Deployment calls time out

Bug #2043997 reported by Marian Gasparovic
This bug affects 2 people
Affects: MAAS
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: 3.5.x

Bug Description

Since Sat Nov 18th around 8 PM GMT we have seen CPU utilization on our MAAS go to almost 100%.
Prometheus shows two machines in the deploying state from then until the snap restart on the 20th at 2 PM GMT.
However, no machines were actually in the deploying state, and all our deployments ended with

ServerError: 401 Unauthorized (Authorization Error: 'Expired timestamp: given 1700471633 and now 1700472533 has a greater difference than threshold 300')
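(For context, a quick arithmetic check on that error: the "given" and "now" timestamps differ by 900 seconds, three times the 300-second threshold, i.e. the timestamp in the signed request was about 15 minutes older than the server's clock by the time it was checked.)

echo $(( 1700472533 - 1700471633 ))   # 900 seconds, vs. the allowed threshold of 300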

Attaching a screenshot from Grafana.

Logs will be added tomorrow.

Tags: cdo-qa

Marian Gasparovic (marosg) wrote:

Alberto Donato (ack) wrote:

Can you please detail which MAAS version (and what type of install) this is?

Changed in maas:
status: New → Incomplete

Marian Gasparovic (marosg) wrote:

Hi Alberto,
it is a snap install, 3.3/edge, 3.3.5-13213-g.e2fb46e1a.

Adding a sosreport as well.

Changed in maas:
status: Incomplete → New

Abdullah (motjuste) wrote:

All runs that we believe are related to this bug can be found here: https://solutions.qa.canonical.com/bugs/2043997

There were many failures we cannot explain during the time range when MAAS's CPU usage was so high.

Marian Gasparovic (marosg) wrote (last edit):

I noticed there are many "[warn] Next IP address to allocate from 'oam (10.246.164.0/22)' has been observed previously:..." messages in maas.log during the problematic period (Nov 18, 19, 20).

grep "Next IP address to allocate" maas.log | sed "s/T.*//" | uniq -c
    352 2023-10-13
   2011 2023-10-14
   1908 2023-10-15
   1731 2023-10-16
   1571 2023-10-17
   1509 2023-10-18
   1453 2023-10-19
   1140 2023-10-20
    455 2023-10-21
    462 2023-10-22
    434 2023-10-23
    268 2023-10-24
    610 2023-10-25
    862 2023-10-26
   1018 2023-10-27
    793 2023-10-28
    731 2023-10-29
   1103 2023-10-30
   1017 2023-10-31
    982 2023-11-01
    778 2023-11-02
   1137 2023-11-03
   1127 2023-11-04
     98 2023-11-05
    695 2023-11-06
   1179 2023-11-07
   1360 2023-11-08
   1173 2023-11-09
   1263 2023-11-10
   1160 2023-11-11
   1378 2023-11-12
   1454 2023-11-13
    937 2023-11-14
    512 2023-11-15
   1128 2023-11-16
    995 2023-11-17
 504694 2023-11-18
1675764 2023-11-19
 891664 2023-11-20
    335 2023-11-21

The messages became excessive after 15:00 on Nov 18:
     20 2023-11-18T00
     57 2023-11-18T01
     49 2023-11-18T02
      3 2023-11-18T03
     20 2023-11-18T04
    222 2023-11-18T05
     95 2023-11-18T06
    232 2023-11-18T10
    911 2023-11-18T11
    279 2023-11-18T12
    111 2023-11-18T13
    331 2023-11-18T14
  28568 2023-11-18T15
  42623 2023-11-18T16
  40828 2023-11-18T17
  45518 2023-11-18T18
  63902 2023-11-18T19
  69211 2023-11-18T20
  70637 2023-11-18T21
  71009 2023-11-18T22
  70068 2023-11-18T23
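
(For reference, the per-hour counts above can presumably be reproduced with a variant of the same pipeline; the exact command is not shown here, so this is only a sketch assuming the standard maas.log timestamp format.)

grep "Next IP address to allocate" maas.log | grep "2023-11-18T" | sed "s/:.*//" | uniq -c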

Marian Gasparovic (marosg) wrote:

The last successful deploy was `2023-11-18T20:10:15.766946+00:00 maas-lab1 maas.node: [info] juju-29-2-7: Status transition from DEPLOYING to DEPLOYED`
All subsequent deployments ended with FAILED DEPLOYMENT.

20:10 UTC roughly corresponds to the point in the metrics from which two machines are wrongly reported as being in the deploying state for the following two days.

Alberto Donato (ack) wrote:

Did anything specific happen in that time frame (e.g. a bigger-than-usual number of deployments around the same time)?

Those messages seem to point to a very busy MAAS, where machines have been deployed and released many times and the available IPs are almost exhausted, so MAAS tries to reuse addresses that have been seen in the past but should now be unused.

Changed in maas:
status: New → Incomplete

Marian Gasparovic (marosg) wrote:

There was nothing exceptional going on, just regular testing during the weekend. I check available IPs from time to time, and the lowest I saw was 30% of IPs available. Yes, MAAS is busy; we easily redeploy over 100 machines every day.
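
(Side note, a rough sketch only: per-subnet IP usage can be checked from the MAAS CLI roughly as below; the "admin" profile name, the jq post-processing, and the availability of the statistics op in this MAAS version are assumptions, not something used in this report.)

maas admin subnets read | jq -r '.[] | "\(.id)  \(.cidr)  \(.name)"'   # list subnet IDs, CIDRs and names
maas admin subnet statistics <subnet-id>   # reports available vs. used addresses for one subnet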

Changed in maas:
status: Incomplete → New

Jerzy Husakowski (jhusakowski) wrote:

Additional information:
- This is a long-running MAAS used to deploy labs. Single host, region+rack+postgres on the same bare metal machine.
- All its CPUs were saturated when the issue occurred
- MAAS was installed a few days/weeks prior to the issue
- This failure was seen only once and persisted for two days (weekend)
- All runs during that period failed
- On Monday, MAAS snap was restarted
- After the restart, the "Next IP address to allocate" warnings stopped

The environment is still up, and logs are available for further investigation.

Björn Tillenius (bjornt) wrote:

Marking it as Triaged for now, since there's not much we can do at the moment. It's an issue that we need to address, and for now we have all the information we need.

Changed in maas:
status: New → Triaged
milestone: none → 3.5.x
importance: Undecided → Medium