[2.0rc2] RackController.get_image_sync_status causes huge load on regiond process

Bug #1604465 reported by Blake Rouse on 2016-07-19
This bug affects 2 people
Affects    Status  Importance  Assigned to    Milestone
MAAS               Critical    Blake Rouse
MAAS 2.0           Critical    Blake Rouse

Bug Description

The call in handlers/controller.py to get the image sync status causes huge load on the regiond process. The call stack looks like this:

ControllerHandler.check_images
node.get_image_sync_status()
BootResource.objects.boot_images_are_in_sync(boot_images) <- This is where it gets slow.

The reason it is slow is the way MAAS calculates the size of each largefile in the database. It does this by opening each largeobject, seeking to the end, and reading the position to get the size. This was fine in 1.9, when the cluster page was not polled like it is today in MAAS 2.0.

    @property
    def size(self):
        """Size of content."""
        with self.content.open('rb') as stream:
            stream.seek(0, os.SEEK_END)
            size = stream.tell()
        return size

This needs to be converted to a field in the database holding the current size of the largefile, instead of calculating it from the largeobject. This will speed up the BootResource.objects.boot_images_are_in_sync method. Additionally, a more advanced SQL query could be used to make fewer calls when checking whether the images are in sync.
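The idea above can be sketched as follows. This is a minimal illustration, not the actual MAAS model: `size_by_seek` mirrors the slow property shown earlier, and `LargeFileRecord` is a hypothetical stand-in for a row that persists the size as a plain integer column, computed once at write time so reads never open the largeobject.

```python
import io
import os


def size_by_seek(stream):
    # Old approach: open the large object and seek to the end on
    # every size lookup, hitting the database each time.
    stream.seek(0, os.SEEK_END)
    return stream.tell()


class LargeFileRecord:
    """Hypothetical sketch of persisting the size next to the content:
    compute it once when the content is written, so checking sync
    status becomes a cheap column read."""

    def __init__(self, content: bytes):
        self.content = content
        self.size = len(content)  # stored as an integer field on the row


record = LargeFileRecord(b"x" * 4096)
assert record.size == size_by_seek(io.BytesIO(record.content))
```

With the size stored as a column, boot_images_are_in_sync could compare sizes with a single aggregate query rather than seeking through every largeobject per poll.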

This is currently affecting OIL; to alleviate the issue, the following workaround was applied:

class RackController(Controller):

    ...blah...

    def get_image_sync_status(self, boot_images=None):
        return 'unknown'
        ... the original function...


tags: added: oil
Larry Michel (lmic) wrote:

Adding the top output showing the load:

ubuntu@maas2-production:~$ top

top - 22:09:40 up 1 day, 17:51, 1 user, load average: 5.61, 5.84, 5.63
Tasks: 71 total, 2 running, 69 sleeping, 0 stopped, 0 zombie
%Cpu(s): 38.2 us, 5.4 sy, 0.0 ni, 54.6 id, 0.0 wa, 0.0 hi, 1.8 si, 0.0 st
KiB Mem : 16300676 total, 13767796 free, 1751532 used, 781348 buff/cache
KiB Swap: 16645628 total, 15951660 free, 693968 used. 13767796 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21436 maas 20 0 1221024 249004 8400 S 111.3 1.5 3813:45 twistd3
21438 maas 20 0 1810808 262696 8556 S 111.3 1.6 4895:17 twistd3
21429 maas 20 0 1223360 245912 8384 S 107.3 1.5 3464:28 twistd3
21434 maas 20 0 1147128 247360 8512 S 70.7 1.5 1884:17 twistd3
25142 postgres 20 0 295304 21944 19888 S 19.3 0.1 0:00.58 postgres
24964 postgres 20 0 296604 102552 99740 S 11.0 0.6 0:01.97 postgres
25125 postgres 20 0 295304 21940 19888 S 8.3 0.1 0:00.85 postgres
25126 postgres 20 0 295304 21936 19884 S 8.3 0.1 0:00.84 postgres
25137 postgres 20 0 295304 21936 19888 R 8.3 0.1 0:00.73 postgres
25139 postgres 20 0 295304 21936 19884 S 8.3 0.1 0:00.50 postgres
25140 postgres 20 0 295304 18584 16584 S 8.3 0.1 0:00.43 postgres
25154 postgres 20 0 295304 18588 16592 S 6.3 0.1 0:00.19 postgres

Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
status: Triaged → In Progress
milestone: 2.0.0 → 2.1.0
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
milestone: 2.0.1 → none
tags: added: auto-sanity tpe-lab
tags: added: taipei-lab
removed: tpe-lab
