[2.3, b1] pod commission failed with "Ephemeral operating system ubuntu xenial is unavailable"

Bug #1750891 reported by Jason Hobbs
Affects: MAAS
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Our CI hit a failure a couple of times last night where commissioning pods failed with the error message "Ephemeral operating system ubuntu xenial is unavailable":

Command failed: pod compose 1 hostname=juju-1 cores=8 memory=32768 storage=100 zone=1
Ephemeral operating system ubuntu xenial is unavailable.

This occurred at 06:07:35.

However, we know from 'rack-controller list-boot-images' output, run shortly before the pod composition (at 06:06:35), that the rack controllers all have the images synced:

http://paste.ubuntu.com/p/qwJdbMX5Ch/

We didn't make any changes to image selection between those two times.

I've attached full logs.

This was with maas 2.3.0 (6434-gd354690-0ubuntu1~16.04.1).

Andres Rodriguez (andreserl) wrote :

Hi Jason,

So the rack controllers download the images and then update the region, and the "cache" on the region side is updated to indicate that the images are available overall.

While the rack controllers may already have the images imported, do you have output from the region side that shows whether the images are really fully imported?

For example, does 'boot-resources read' show "synced" for all of them before this happens?

Also, are all rack controllers connected? It may be that the rack controllers are not fully connected and that is why these messages are being surfaced.

Changed in maas:
status: New → Incomplete
tags: added: pod
summary: - pod commission failed with "Ephemeral operating system ubuntu xenial is
- unavailable"
+ [2.3] pod commission failed with "Ephemeral operating system ubuntu
+ xenial is unavailable"
Jason Hobbs (jason-hobbs) wrote : Re: [2.3] pod commission failed with "Ephemeral operating system ubuntu xenial is unavailable"

rack-controller list-boot-images is what we're using - that's what Blake told us to use to make sure that rack controllers are all synced and maas is ready to deploy machines.

It asks the region controller what it thinks the status of the rack controller's images is. That output says 'synced', which means the region controller thinks the rack controller's images are synced.

boot-resources read doesn't say anything about the status of syncing images to rack controllers; it just tells us what the resources are. The field in it that says 'Synced' is a type field, not a status field.

description: updated
Changed in maas:
status: Incomplete → New
Andres Rodriguez (andreserl) wrote :

OK, so I think I may know how to reproduce this issue on a single region/rack.

 - I have a MAAS with a very slow disk, and I deleted everything under /var/lib/maas/boot-resources/*.
 - I forced an update of the images from the Images page and on the rack controller.
 - I watched /var/lib/maas/boot-resources/* as the cache and snapshot folders were created and the image download started.
 - The top-level 'Images' page showed "Step 2/2: Rack Controller(s) importing", *but* every image in the list of images reported "Synced". NOTE: this was all while the rack was re-downloading the images (I was watching /var/lib/maas/boot-resources/*).

boot-resources read shows everything as 'Synced'
list-boot-images shows synced

So, this definitely seems like a bug where the rack controller is correctly re-updating images, but it is not reporting the correct information.

tags: added: performance
Changed in maas:
importance: Undecided → High
assignee: nobody → Blake Rouse (blake-rouse)
milestone: none → 2.4.0alpha2
status: New → Triaged
milestone: 2.4.0alpha2 → none
Changed in maas:
milestone: none → 2.4.0beta1
summary: - [2.3] pod commission failed with "Ephemeral operating system ubuntu
+ [2.3, b1] pod commission failed with "Ephemeral operating system ubuntu
xenial is unavailable"
Changed in maas:
assignee: Blake Rouse (blake-rouse) → nobody
Changed in maas:
milestone: 2.4.0beta1 → 2.4.0beta2
Changed in maas:
milestone: 2.4.0beta2 → 2.4.0rc1
Changed in maas:
assignee: nobody → Lee Trager (ltrager)
Andres Rodriguez (andreserl) wrote :

I think this issue is related to, or the same as, https://bugs.launchpad.net/maas/+bug/1766370

Andres Rodriguez (andreserl) wrote :

Hi guys,

Are you still seeing this issue with MAAS 2.4 beta3?

Changed in maas:
status: Triaged → Incomplete
Andres Rodriguez (andreserl) wrote :

(Marking this as Incomplete until we can reproduce it with beta3, as it could have been related to https://bugs.launchpad.net/maas/+bug/1766370)

Changed in maas:
status: Incomplete → Triaged
Changed in maas:
milestone: 2.4.0rc1 → 2.4.0rc2
Lee Trager (ltrager) wrote :

@andreserl - I looked into your reproduction and the behavior you see is actually expected. list-boot-images on both the UI and API calls list_boot_images() in src/provisioningserver/rpc/boot_images.py. The result is cached to reduce IO, and the cache is updated after image sync. Deleting the images doesn't invalidate the cache, so it returns the same set of images as before you deleted them.
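
To illustrate the caching behavior described above, here is a minimal sketch. It is not MAAS code: the class CachedImageLister, its methods, and the demo() helper are hypothetical names, and the only assumption taken from this comment is that the listing is cached and refreshed only after a sync.

```python
import os
import tempfile


class CachedImageLister:
    """Hypothetical stand-in for a cached image listing (not MAAS code)."""

    def __init__(self, root):
        self.root = root
        self._cache = None

    def _scan(self):
        # Expensive path: read the image directory from disk.
        return sorted(os.listdir(self.root))

    def list_images(self):
        # Serve from the cache once it is populated, to avoid disk IO.
        if self._cache is None:
            self._cache = self._scan()
        return list(self._cache)

    def on_sync_finished(self):
        # Assumed to be the only point where the cache is refreshed.
        self._cache = self._scan()


def demo():
    with tempfile.TemporaryDirectory() as root:
        open(os.path.join(root, "xenial"), "w").close()
        lister = CachedImageLister(root)
        before = lister.list_images()
        # Delete the image behind the cache's back.
        os.remove(os.path.join(root, "xenial"))
        stale = lister.list_images()
        # Only a completed sync makes the listing reflect the disk again.
        lister.on_sync_finished()
        fresh = lister.list_images()
        return before, stale, fresh
```

In this sketch the second listing still reports the deleted image, which matches the "Synced" output seen while the rack was re-downloading.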

Before starting to commission, MAAS checks that all racks have the commissioning operating system. This is where the error is occurring.

@jason-hobbs - When this occurred, did you have multiple rack controllers set up? Were they all in sync? Did you delete images on any rack controller with rm -rf /var/lib/maas/boot-resources/*?

Changed in maas:
status: Triaged → Incomplete
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1750891] Re: [2.3, b1] pod commission failed with "Ephemeral operating system ubuntu xenial is unavailable"

Hi Lee,

Yes - we had multiple rack controllers set up, all in sync, and we did
not delete any images.

Thanks,
Jason


Lee Trager (ltrager) wrote :

In Node._start, MAAS double-checks that the commissioning OS is available on all rack controllers before starting. While list-boot-images told you the image was available, the check in Node._start did not agree. As part of LP:1762461 I had to rewrite this check. Now both list-boot-images and the check in Node._start use the same RPC call (ListBootImagesV2) to determine whether images are synced.

Keeping this as Incomplete since I wasn't able to reproduce it, but now that both calls use ListBootImagesV2 the error shouldn't occur.
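
As a rough sketch of why sharing one data source removes the disagreement, the example below derives both the per-rack report and the pre-start check from the same listing. This is not the MAAS implementation: the function names and the dictionary shape (rack name mapped to a list of image records with "osystem" and "release" keys) are assumptions made for illustration; only the error text is taken from this bug.

```python
def has_image(images, osystem, release):
    """Return True if any image record matches the OS and release."""
    return any(
        image["osystem"] == osystem and image["release"] == release
        for image in images
    )


def racks_missing_image(images_by_rack, osystem, release):
    """Names of racks whose listing lacks the requested image."""
    return sorted(
        rack
        for rack, images in images_by_rack.items()
        if not has_image(images, osystem, release)
    )


def check_before_start(images_by_rack, osystem, release):
    """Raise if any rack lacks the image, based on the same listing
    that a status report would display."""
    if racks_missing_image(images_by_rack, osystem, release):
        raise RuntimeError(
            "Ephemeral operating system %s %s is unavailable."
            % (osystem, release)
        )


# Hypothetical listing: one rack has the image, the other does not.
listing = {
    "rack-a": [{"osystem": "ubuntu", "release": "xenial"}],
    "rack-b": [],
}
```

Because the report and the check read the same listing, a rack cannot show as synced in one place and fail the check in the other.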

Björn Tillenius (bjornt) wrote :

Are you still seeing this failure in your CI?

Changed in maas:
assignee: Lee Trager (ltrager) → nobody
milestone: 2.4.0rc2 → none
no longer affects: maas/2.3
Changed in maas:
status: Incomplete → New
importance: High → Undecided
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired