1.9.0 Error grabbing cluster configuration lock causes importing to fail.

Bug #1494439 reported by Jeff Lane on 2015-09-10
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Critical
Gavin Panella

Bug Description

Started off as a fully up to date 1.8 MAAS server.

I upgraded it to 1.9.0 from the experimental ppa and did not reboot, it was done by doing a dist-upgrade after adding the experimental PPA.

Today, I switched sources from the releases stream to the daily stream so I could get wily images. After setting it to the daily stream, I clicked on the images tab and saw that it was updated and now showed 15.10 as an available image.

I selected 15.10 and applied the changes. The region then successfully downloaded the image and boot stuff, and it is visible here:

bladernr@critical-maas:/var/lib/maas/boot-resources$ ll snapshot-20150910-171442/ubuntu/amd64/generic/wily/daily/
total 1858372
drwxr-xr-x 2 maas maas 4096 Sep 10 13:14 ./
drwxr-xr-x 3 maas maas 4096 Sep 10 13:14 ../
-rw-r--r-- 10 maas maas 29305609 Sep 10 12:08 boot-initrd
-rw-r--r-- 10 maas maas 6748784 Sep 10 12:07 boot-kernel
-rw-r--r-- 10 maas maas 23560748 Sep 10 12:08 di-initrd
-rw-r--r-- 10 maas maas 6750712 Sep 10 12:07 di-kernel
-rw-r--r-- 10 maas maas 1468006400 Sep 10 12:07 root-image
-rw-r--r-- 10 maas root 368578444 Sep 10 12:08 root-tgz
bladernr@critical-maas:/var/lib/maas/boot-resources$ ll current
lrwxrwxrwx 1 maas maas 53 Sep 10 13:14 current -> /var/lib/maas/boot-resources/snapshot-20150910-171442/

Current also points to the correct snapshot directory.

However, while Images shows wily as available:

15.10 amd64 421.1 MB 0 Thu, 10 Sep. 2015 13:14:41
15.04 amd64 408.2 MB 0 Thu, 10 Sep. 2015 13:14:42
14.04 LTS amd64 491.8 MB 3 Thu, 10 Sep. 2015 13:14:41

the cluster is out of sync and 15.10 is NOT a valid install image for deploying a system.

Looking at the maas log, I did notice these error messages (these are pulled from maas.log):

## This has been occurring since I updated to 1.9.0
Sep 10 08:14:34 critical-maas maas.bootresources: [INFO] Importing images from source: http://maas.ubuntu
.com/images/ephemeral-v2/releases/
Sep 10 08:14:35 critical-maas maas.bootresources: [INFO] Finished importing of boot images from 1 source(
s).
Sep 10 08:14:35 critical-maas maas.import-images: [INFO] Started importing boot images.
Sep 10 08:14:35 critical-maas maas.import-images: [WARNING] I/O error while syncing boot images. If this
problem persists, verify network connectivity and disk usage.
Sep 10 08:14:35 critical-maas maas.import-images: [WARNING] Finished importing boot images, the region do
es not have any boot images available.
Sep 10 08:19:33 critical-maas maas.import-images: [INFO] Started importing boot images.
Sep 10 08:19:33 critical-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Sep 10 08:19:38 critical-maas maas.boot_image_download_service: [ERROR] Failed to download images: /run/lock/maas.L2V0Yy9tYWFzL2NsdXN0ZXJkLmNvbmY=.lock

#This is after I switched to the Daily stream this morning.
Sep 10 11:56:51 critical-maas maas.bootsources: [INFO] Updated boot sources cache.
Sep 10 11:56:51 critical-maas maas.bootresources: [INFO] Started importing of boot images from 1 source(s).
Sep 10 11:56:53 critical-maas maas.bootresources: [INFO] Importing images from source: http://maas.ubuntu.com/images/ephemeral-v2/daily/
Sep 10 11:59:33 critical-maas maas.import-images: [INFO] Started importing boot images.
Sep 10 11:59:34 critical-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Sep 10 11:59:39 critical-maas maas.boot_image_download_service: [ERROR] Failed to download images: /run/lock/maas.L2V0Yy9tYWFzL2NsdXN0ZXJkLmNvbmY=.lock
Sep 10 12:05:10 critical-maas maas.bootresources: [INFO] Finished importing of boot images from 1 source(s).
Sep 10 12:05:10 critical-maas maas.import-images: [INFO] Started importing boot images.
Sep 10 12:08:25 critical-maas maas.import-images: [INFO] Writing boot image metadata and iSCSI targets.
Sep 10 12:08:25 critical-maas maas.import-images: [INFO] Installing boot images snapshot /var/lib/maas/boot-resources/snapshot-20150910-160510
Sep 10 12:08:28 critical-maas maas.import-images: [INFO] Updating boot image iSCSI targets.
Sep 10 12:08:28 critical-maas maas.import-images: [INFO] Cleaning up old snapshots and cache.
Sep 10 12:08:28 critical-maas maas.import-images: [INFO] Finished importing boot images.
Sep 10 12:14:36 critical-maas maas.bootsources: [INFO] Updated boot sources cache.

Related branches

Jeff Lane (bladernr) wrote :
Download full text (9.1 KiB)

After writing this up, I rebooted the maas server and noticed a couple things I didnt notice before... first, in clusterd.log there is a traceback when the cluster restarted:

2015-09-10 14:21:47-0400 [-] Unhandled error in Deferred:
2015-09-10 14:21:47-0400 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
            self.run()
          File "/usr/lib/python2.7/threading.py", line 763, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 158, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 92, in _run_import
            reload_boot_images()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 57, in reload_boot_images
            CACHED_BOOT_IMAGES = tftppath.list_boot_images(config.tftp_root)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 266, in list_boot_images
            extract_image_params(path, metadata) for path in paths))
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 266, in <genexpr>
            extract_image_params(path, metadata) for path in paths))
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 180, in extract_image_params
            arch, subarch, release, label)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/osystem/custom.py", line 58, in get_xinstall_parameters
            with ClusterConfiguration.open() as config:
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/config.py", line 674, in open
            with cls.backend.open(filepath) as store:
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/config.py", line 549, in open
            with RunLock(path).wait(timeout=5.0):
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/fs.py", line 440, in wait
            raise self.NotAvailable(self.fslock.name)
        provisio...

Read more...

Jeff Lane (bladernr) wrote :

Here're the full log files (maas.log, clusterd.log, and regiond.log)

Blake Rouse (blake-rouse) wrote :

src/maasserver/drivers/osystems/custom.py:CustomOS.get_xinstall_parameters is failing to grab the cluster configuration lock to get the tftp_root path. This error is then causes another error in the websocket handler because the response is not the expected output.

summary: - 1.9.0 Cluster failed to sync after switch to daily stream and attempted
- wily import
+ 1.9.0 Error grabbing cluster configuration lock causes importing to
+ fail.
Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 1.9.0
Changed in maas:
assignee: nobody → Gavin Panella (allenap)
Gavin Panella (allenap) wrote :
Changed in maas:
status: Triaged → Fix Committed
Rod Smith (rodsmith) wrote :

I'm seeing what looks like the same bug in MAAS 1.9.0~beta2+bzr4456-0ubuntu1~trusty1. I'm attaching my log files; detailed version information below....

$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=====================================================-===================================================-============-===============================================================================
ii maas 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS server all-in-one metapackage
ii maas-cert-server 0.2.8-0~36~ubuntu14.04.1 all Ubuntu certification support files for MAAS server
ii maas-cli 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS command line API tool
ii maas-cluster-controller 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS server common files
ii maas-dhcp 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS DHCP server
ii maas-dns 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS DNS server
ii maas-proxy 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS Caching Proxy
ii maas-region-controller 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS server complete region controller
ii maas-region-controller-min 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS Server minimum region controller
ii python-django-maas 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS server Django web framework
ii python-maas-client 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS python API client
ii python-maas-provisioningserver 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS server provisioning libraries

Rod Smith (rodsmith) wrote :

Update: Rebooting caused the images to import and become available.

Jeff Lane (bladernr) wrote :

What Rod said, still seeing this in 1.9 B2... any idea when this will land in a 1.9 update?

Jeff Lane (bladernr) wrote :

I'm currently installing from lp:maas/next

Gavin Panella (allenap) wrote :

I cannot reproduce this problem so I need your help.

When you next see a NotAvailable exception in your clusterd.log, please can you do `readlink $lockpath` (where $lockpath comes from the exception). This should print a number which should correspond to the PID of your clusterd process. Can you check that's the correct PID?

I want to be sure there isn't a clusterd process still running in the background. Or maybe clusterd has died and not cleared up after itself.

Might clusterd have been killed -9 by Upstart or systemd? Can you check your Upstart/systemd logs to see if that might have happened?

Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers