Boot images disappear

Bug #1387133 reported by Adam Collard on 2014-10-29
This bug affects 1 person
Affects   Importance   Assigned to
MAAS      Critical     Blake Rouse
1.7       Critical     Blake Rouse

Bug Description

$ dpkg-query -W maas
maas 1.7.0~beta8+bzr3273-0ubuntu1~trusty1

Region and cluster controller are on the same machine.

At approximately 04:25 MAAS suddenly "loses" the boot images.

Oct 29 03:45:03 comfy maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 29 04:05:03 comfy maas.import-images: [INFO] Started importing boot images.
Oct 29 04:05:03 comfy maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 29 04:25:03 comfy maas.import-images: [INFO] Started importing boot images.
Oct 29 04:25:03 comfy maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 29 04:25:35 comfy maas.import-images: [INFO] Started importing boot images.
Oct 29 04:25:35 comfy maas.import-images: [WARNING] Finished importing boot images, the region does not have any boot images available.
Oct 29 04:45:03 comfy maas.import-images: [INFO] Started importing boot images.
Oct 29 04:45:03 comfy maas.import-images: [WARNING] Finished importing boot images, the region does not have any boot images available.
Oct 29 04:50:03 comfy maas.import-images: [INFO] Started importing boot images.
Oct 29 04:50:03 comfy maas.import-images: [WARNING] Finished importing boot images, the region does not have any boot images available.
Oct 29 04:55:03 comfy maas.import-images: [INFO] Started importing boot images.

At this point the UI shows an alert saying that no images have been imported: "Boot image import process not started."

This is at least the second time this has been seen, on different MAAS servers, both running 1.7 betas.

Adam Collard (adam-collard) wrote :

The timing of the warnings from PostgreSQL makes me think they might be a clue.

Christian Reis (kiko) wrote :

From the database log:

2014-10-29 04:25:23 CDT LOG: checkpoints are occurring too frequently (4 seconds apart)
2014-10-29 04:25:23 CDT HINT: Consider increasing the configuration parameter "checkpoint_segments".
2014-10-29 04:25:25 CDT LOG: checkpoints are occurring too frequently (2 seconds apart)
2014-10-29 04:25:25 CDT HINT: Consider increasing the configuration parameter "checkpoint_segments".

And then:

2014-10-29 04:45:03 CDT ERROR: duplicate key value violates unique constraint "maasserver_componenterror_component_key"
2014-10-29 04:45:03 CDT DETAIL: Key (component)=(maas-import-pxe-files script) already exists.
2014-10-29 04:45:03 CDT STATEMENT: INSERT INTO "maasserver_componenterror" ("created", "updated", "component", "error") VALUES ('2014-10-29 04:45:03.433921', '2014-10-29 04:45:03.433921', 'maas-import-pxe-files script', 'Boot image import process not started. Nodes will not be able to provision without boot images. Visit the boot images page to start the import.') RETURNING "maasserver_componenterror"."id"

(and then the same every 5 minutes after that)

Changed in maas:
importance: Undecided → Critical
milestone: none → 1.7.0
status: New → Confirmed
Christian Reis (kiko) wrote :

Also worthy of note is that an image was imported an hour prior to the houdini event:

Oct 29 03:25:17 comfy maas.import-images: [INFO] Installing boot images snapshot /var/lib/maas/boot-resources/snapshot-20141029-082517

Raphaël Badin (rvb) wrote :

"checkpoints are occurring too frequently": this, I think, is only indicative of a very heavy write-intensive write operation being in progress. The WAL fills up quickly and thus checkpoints are frequent.

Adam Collard (adam-collard) wrote :

All postgresql logs (for kiko)

On 29 October 2014 12:38, Christian Reis <email address hidden> wrote:
> 2014-10-29 04:45:03 CDT STATEMENT: INSERT INTO "maasserver_componenterror" ("created", "updated", "component", "error") VALUES ('2014-10-29 04:45:03.433921', '2014-10-29 04:45:03.433921', 'maas-import-pxe-files script', 'Boot image import process not started. Nodes will not be able to provision without boot images. Visit the boot images page to start the import.') RETURNING "maasserver_componenterror"."id"

This almost looks like a race might be occurring in
maasserver.component.register_persistent_error(), which is what is
used to register the error when boot images are missing. That function
removes any existing error message for a component (in this case
"maas-import-pxe-files script") before adding a new one; you'd only
see this — I think — if there were two requests trying to do that at
the same time.

Note that the code that registers / removes the error is in the
middleware, so it gets run with every request ATM. I'm not sure, then,
whether this is a symptom of the Houdini, or a side-effect.
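
To make the suspected race concrete, here is a minimal sketch of a delete-then-insert registration hitting a unique constraint. It is not the MAAS code: the table name, the helper functions, and the use of an in-memory SQLite database are all assumptions for illustration, and the two "requests" are interleaved by hand on a single connection rather than run as real concurrent PostgreSQL transactions.

```python
import sqlite3

COMPONENT = "maas-import-pxe-files script"


def make_db():
    # Hypothetical stand-in for the maasserver_componenterror table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE componenterror (component TEXT UNIQUE, error TEXT)"
    )
    return conn


def discard_error(conn, component):
    # Step 1 of the registration: remove any existing error.
    conn.execute(
        "DELETE FROM componenterror WHERE component = ?", (component,)
    )


def insert_error(conn, component, error):
    # Step 2 of the registration: add the new error.
    conn.execute(
        "INSERT INTO componenterror (component, error) VALUES (?, ?)",
        (component, error),
    )


def sequential_registration():
    # Two requests that run one after the other never collide.
    conn = make_db()
    for error in ("from request A", "from request B"):
        discard_error(conn, COMPONENT)
        insert_error(conn, COMPONENT, error)
    return conn.execute("SELECT error FROM componenterror").fetchall()


def interleaved_registration():
    # Both requests finish their DELETE before either INSERT runs,
    # so the second INSERT finds the row the first one just created.
    conn = make_db()
    discard_error(conn, COMPONENT)  # request A
    discard_error(conn, COMPONENT)  # request B
    insert_error(conn, COMPONENT, "from request A")
    try:
        insert_error(conn, COMPONENT, "from request B")
    except sqlite3.IntegrityError:
        return True  # duplicate key, as in the PostgreSQL log
    return False
```

If the race is real, the duplicate-key error would be the visible trace of two requests overlapping in this way; it does not by itself explain why the images went missing.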

Gavin Panella (allenap) wrote :

> "checkpoints are occurring too frequently": this, I think, is only
> indicative of a very heavy write-intensive write operation being in
> progress. The WAL fills up quickly and thus checkpoints are frequent.

I've filed this as bug 1387266. It's not a defect, but it is a
distraction to end users. It /may/ also be causing some performance
degradation.

Christian Reis (kiko) on 2014-10-29
Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
Julian Edwards (julian-edwards) wrote :

On Wednesday 29 Oct 2014 14:46:50 you wrote:
> This almost looks like a race might be occurring in
> maasserver.component.register_persistent_error(), which is what is
> used to register the error when boot images are missing. That function
> removes any existing error message for a component (in this case
> "maas-import-pxe-files script") before adding a new one; you'd only
> see this — I think — if there were two requests trying to do that at
> the same time.

This is pretty worrying if that is the case; it might mean that two appserver
threads are trying to download images at the same time, though I thought we
had locks to prevent that. Anyway, just a thought.
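
For illustration only, a guard of the kind mentioned above could look like the sketch below. The names are hypothetical and this is not how MAAS implements its locking; an in-process threading.Lock only protects threads within one process, so separate appserver processes would need a shared lock (for example one held in the database).

```python
import threading

# Hypothetical module-level lock guarding the image import.
_import_lock = threading.Lock()


def try_import(do_import):
    """Run do_import() unless another import is already in progress.

    Returns True if the import ran, False if it was skipped because
    the lock was already held.
    """
    if not _import_lock.acquire(blocking=False):
        return False
    try:
        do_import()
        return True
    finally:
        _import_lock.release()
```

With a guard like this, a second import attempt that starts while the first is still running returns immediately instead of running alongside it.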

Changed in maas:
status: Confirmed → Triaged
Changed in maas:
status: Triaged → In Progress
milestone: 1.7.0 → none
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released

Hello Adam, or anyone else affected,

Accepted maas into utopic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/maas/1.7.5+bzr3369-0ubuntu1~14.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
Andres Rodriguez (andreserl) wrote :

The fix has been verified to work on both upgrade and fresh install, and has been QA'd. Marking verification-done.

tags: added: verification-done
removed: verification-needed