MAAS should notify users when boot image storage space is low

Bug #1660440 reported by Mike Pontillo
Affects: MAAS
Status: Invalid
Importance: Wishlist
Assigned to: Unassigned

Bug Description

MAAS depends on OS boot images being available. These images are stored in a couple of different places in a MAAS cloud:

 - On the PostgreSQL database server.
 - Decompressed and ready-to-boot on each rack controller.

MAAS should notify users when either of these storage areas is reaching a critical threshold.

See also:

    bug #1660439 (should vacuum the database automatically)
    bug #1660418 (uploading a custom image does not remove the old image)
    bug #1459876 (original issue)

description: updated
Changed in maas:
status: New → Triaged
importance: Undecided → High
tags: added: notifications
Changed in maas:
milestone: none → 2.2.0rc2
importance: High → Wishlist
assignee: nobody → Lee Trager (ltrager)
Lee Trager (ltrager) wrote :

Since the rack and region can be run separately, we'll need a warning for both. This can be checked as images are being imported on the region and the rack. But what constitutes low disk space? We could check the percentage of free disk space, or base the threshold on the size of the images we currently have.

Another idea I've had for a while is to show the actual disk usage over the API and UI. But this raises two questions:
1. What happens when /var/lib/maas and /var/lib/postgresql are on different partitions?
2. Do we do this with an RPC call or roll and store the data?

Gavin Panella (allenap) wrote :

> 1. What happens when /var/lib/maas and /var/lib/postgresql are on
> different partitions?

We should check and warn for both. We can detect this with:

  >>> import os
  >>> # Two paths are on the same filesystem if their device IDs match.
  >>> same_fs = (
  ...     os.stat("/var/lib/maas").st_dev ==
  ...     os.stat("/var/lib/postgresql").st_dev)

Free space on a device can be queried with os.statvfs.
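
As a minimal sketch of that query (the helper names `free_bytes` and `total_bytes` are made up for illustration and are not MAAS functions), the free and total space of the filesystem holding a path could be read like this:

```python
import os


def free_bytes(path):
    """Return the bytes available to unprivileged processes on the
    filesystem that contains `path`."""
    st = os.statvfs(path)
    # f_bavail is the number of blocks available to non-root users;
    # f_frsize is the unit in which block counts are expressed.
    return st.f_bavail * st.f_frsize


def total_bytes(path):
    """Return the total size in bytes of the filesystem containing `path`."""
    st = os.statvfs(path)
    return st.f_blocks * st.f_frsize
```

Using `f_bavail` rather than `f_bfree` is a design choice here: it excludes blocks reserved for root, so it is the more conservative figure for a warning.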

To calculate disk usage we may find that forking `du` is quickest:

  $ du --block-size=1 --one-file-system --summarize /var/lib/maas
  475709440 /var/lib/maas
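
A rough sketch of forking `du` from Python with the flags shown above follows. The function names are hypothetical, and the sketch assumes GNU `du` is on the PATH and prints a single "size, whitespace, path" line:

```python
import subprocess


def parse_du_output(output):
    """Parse the single 'SIZE PATH' line printed by `du --summarize`."""
    # Split on the first run of whitespace only, so paths that contain
    # spaces stay intact.
    size, path = output.strip().split(None, 1)
    return int(size), path


def disk_usage_bytes(path):
    """Return the disk usage of `path` in bytes, as reported by `du`."""
    # check_output raises CalledProcessError if du exits non-zero, for
    # example when part of the tree is unreadable.
    output = subprocess.check_output(
        ["du", "--block-size=1", "--one-file-system", "--summarize", path],
        universal_newlines=True)
    return parse_du_output(output)[0]
```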

FWIW, we should use:

  SELECT setting FROM pg_settings WHERE name = 'data_directory';

instead of hard-coding /var/lib/postgresql.
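
One way this lookup might be wrapped, assuming the caller already holds an open DB-API cursor connected to the MAAS database (the function name is hypothetical):

```python
def postgres_data_directory(cursor):
    """Return PostgreSQL's data directory, queried through `cursor`.

    `cursor` is assumed to be any open DB-API 2.0 cursor on the
    PostgreSQL server in question.
    """
    cursor.execute(
        "SELECT setting FROM pg_settings WHERE name = 'data_directory'")
    row = cursor.fetchone()
    if row is None:
        # The row may be hidden if the connecting role lacks permission
        # to view this setting.
        raise LookupError("data_directory is not visible to this role")
    return row[0]
```

The returned path is only meaningful on the host that runs PostgreSQL, so it can be passed to os.stat or os.statvfs only when the database is local.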

Also interesting: http://postgresguide.com/tips/disk-usage.html shows
how to check disk usage from within the database. It doesn't appear
possible to get information on free space, but it can report on disk
usage per database, per table, or per index. Not /that/ useful in the
context of this bug but worth being aware of.

> 2. Do we do this with an RPC call or roll and store the data?

The calls, even a fork out to `du` once the disk cache is warm, are all
quick and cheap, so calculating this on-demand is okay.

Rules can be evaluated region-side, like:

  * Is free disk space < 50% of the current size of the database?
      ==> Show a warning.

  * Is free disk space > 65% of the current size of the database?
      ==> Remove the warning.

This is for a region. No RPC needed there, but the logic can be plugged
in just the same given a GetDiskUsage RPC or a get_disk_usage function.
A rack would have different rules, a region+rack might be different too.
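
The two region rules above form a simple hysteresis: the warning turns on below one threshold and off only above a higher one, so it does not flicker when free space hovers near a single cut-off. A sketch of that logic, with hypothetical names and the 50%/65% figures taken from the rules above:

```python
WARN_BELOW = 0.50   # show the warning when free < 50% of database size
CLEAR_ABOVE = 0.65  # remove it only once free > 65% of database size


def update_warning(warning_shown, free_bytes, database_bytes):
    """Return whether the low-disk-space warning should be shown now.

    `warning_shown` is the previous state; between the two thresholds
    the previous state is kept unchanged.
    """
    if free_bytes < WARN_BELOW * database_bytes:
        return True
    if free_bytes > CLEAR_ABOVE * database_bytes:
        return False
    return warning_shown
```

The inputs could come from a local function on the region or from an RPC reply from a rack; the rule itself does not need to know which.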

This is enough for a first version.

Later we can try to anticipate usage and warn if there will not be
enough room to, say, download images. Doing this accurately will be more
complex than it at first appears.

Historical logs could also be added, but at a later date. Doing that
will blur the line between MAAS and a purpose-built monitoring tool. Try
to do too much and the user's perception of MAAS will change from a
helpful and capable provisioning tool to a worst-of-class monitoring
tool. We could avoid this by implementing instead a plugin for one of
those purpose-built tools, for when richer monitoring is desired. This
can be guided by a specific customer need.

Mike Pontillo (mpontillo) wrote :

We should also detect/handle the case where the database is running on a remote system (HA).

Changed in maas:
milestone: 2.2.0rc2 → 2.2.0rc3
Changed in maas:
milestone: 2.2.0rc3 → 2.3.0
Andres Rodriguez (andreserl) wrote :

FWIW, not only the region but also the rack controller should surface that it is running out of space:

2017-07-20 05:05:55 sstreams: [info] maas:v2:download/maas:boot:centos:amd64:generic:centos7image26: to_add=['20170720'] to_remove=[]
2017-07-20 05:05:55 sstreams: [info] maas:v2:download/maas:boot:centos:amd64:generic:centos7image28: to_add=['20170720'] to_remove=[]
2017-07-20 05:06:05 twisted.internet.defer: [critical] Unhandled error in Deferred:
2017-07-20 05:06:05 twisted.internet.defer: [critical]

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 434, in errback
    self._startRunCallbacks(fail)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
    self._runCallbacks()
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1184, in gotResult
    _inlineCallbacks(r, g, deferred)
--- <exception caught here> ---
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/provisioningserver/rpc/boot_images.py", line 145, in _import_boot_images
    imported = yield deferToThread(_run_import, sources, **proxies)
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 246, in inContext
    result = inContext.theWork()
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 262, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 232, in wrapper
    result = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/provisioningserver/rpc/boot_images.py", line 119, in _run_import
    imported = boot_resources.import_images(sources)
  File "/usr/lib/python3/dist-packages/provisioningserver/import_images/boot_resources.py", line 322, in import_images
    sources, storage, product_mapping)
  File "/usr/lib/python3/dist-packages/provisioningserver/import_images/download_resources.py", line 397, in download_all_boot_resources
    keyring_file=source.get('keyring')),
  File "/usr/lib/python3/dist-packages/provisioningserver/import_images/download_resources.py", line 343, in download_boot_resources
    writer.sync(reader, rpath)
  File "/usr/lib/python3/dist-packages/simplestreams/mirrors/__init__.py", line 91, in sync
    return self.sync_index(reader, path, data, content)
  File "/usr/lib/python3/dist-packages/simplestreams/mirrors/__init__.py", line 254, in sync...


Changed in maas:
importance: Wishlist → Critical
Andres Rodriguez (andreserl) wrote :

We should also handle this exception gracefully.
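
The traceback above is truncated, so the exception type that ends it is not shown. Assuming the failure surfaces as an OSError with errno ENOSPC, a wrapper around the import could turn it into a readable message instead of an unhandled error. All names below are hypothetical and are not part of provisioningserver:

```python
import errno


def describe_import_failure(error):
    """Return a user-facing message for an out-of-space error, else None."""
    if isinstance(error, OSError) and error.errno == errno.ENOSPC:
        return "Boot image import failed: no space left on device (%s)." % (
            error.filename or "unknown path")
    return None


def run_import_safely(import_func, *args):
    """Call `import_func` and return a (result, message) pair.

    Only out-of-space errors are converted into a message; any other
    exception is re-raised unchanged.
    """
    try:
        return import_func(*args), None
    except OSError as error:
        message = describe_import_failure(error)
        if message is None:
            raise
        return None, message
```

The message could then feed whatever notification mechanism is used for the low-disk-space warning.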

tags: added: error-surface
Changed in maas:
assignee: Lee Trager (ltrager) → nobody
Changed in maas:
milestone: 2.3.0 → 2.3.x
Changed in maas:
importance: Critical → Wishlist
milestone: 2.3.x → next
Changed in maas:
milestone: next → 2.5.x
Adam Collard (adam-collard) wrote :

This bug has not seen any activity in the last 6 months, so it is being automatically closed.

If you are still experiencing this issue, please feel free to re-open.

MAAS Team

Changed in maas:
status: Triaged → Invalid