1.9.0 Error grabbing cluster configuration lock causes importing to fail.

Bug #1494439 reported by Jeff Lane 
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
Critical
Gavin Panella

Bug Description

Started off as a fully up to date 1.8 MAAS server.

I upgraded it to 1.9.0 from the experimental ppa and did not reboot, it was done by doing a dist-upgrade after adding the experimental PPA.

Today, I switched sources from the releases stream to the daily stream so I could get wily images. After setting it to the daily stream, I clicked on the images tab and saw that it was updated and now showed 15.10 as an available image.

I selected 15.10 and applied the changes. The region then successfully downloaded the image and boot stuff, and it is visible here:

bladernr@critical-maas:/var/lib/maas/boot-resources$ ll snapshot-20150910-171442/ubuntu/amd64/generic/wily/daily/
total 1858372
drwxr-xr-x 2 maas maas 4096 Sep 10 13:14 ./
drwxr-xr-x 3 maas maas 4096 Sep 10 13:14 ../
-rw-r--r-- 10 maas maas 29305609 Sep 10 12:08 boot-initrd
-rw-r--r-- 10 maas maas 6748784 Sep 10 12:07 boot-kernel
-rw-r--r-- 10 maas maas 23560748 Sep 10 12:08 di-initrd
-rw-r--r-- 10 maas maas 6750712 Sep 10 12:07 di-kernel
-rw-r--r-- 10 maas maas 1468006400 Sep 10 12:07 root-image
-rw-r--r-- 10 maas root 368578444 Sep 10 12:08 root-tgz
bladernr@critical-maas:/var/lib/maas/boot-resources$ ll current
lrwxrwxrwx 1 maas maas 53 Sep 10 13:14 current -> /var/lib/maas/boot-resources/snapshot-20150910-171442/

Current also points to the correct snapshot directory.

However, while Images shows wily as available:

15.10 amd64 421.1 MB 0 Thu, 10 Sep. 2015 13:14:41
15.04 amd64 408.2 MB 0 Thu, 10 Sep. 2015 13:14:42
14.04 LTS amd64 491.8 MB 3 Thu, 10 Sep. 2015 13:14:41

the cluster is out of sync and 15.10 is NOT a valid install image for deploying a system.

Looking at the maas log, I did notice these error messages (these are pulled from maas.log):

## This has been occurring since I updated to 1.9.0
Sep 10 08:14:34 critical-maas maas.bootresources: [INFO] Importing images from source: http://maas.ubuntu
.com/images/ephemeral-v2/releases/
Sep 10 08:14:35 critical-maas maas.bootresources: [INFO] Finished importing of boot images from 1 source(
s).
Sep 10 08:14:35 critical-maas maas.import-images: [INFO] Started importing boot images.
Sep 10 08:14:35 critical-maas maas.import-images: [WARNING] I/O error while syncing boot images. If this
problem persists, verify network connectivity and disk usage.
Sep 10 08:14:35 critical-maas maas.import-images: [WARNING] Finished importing boot images, the region do
es not have any boot images available.
Sep 10 08:19:33 critical-maas maas.import-images: [INFO] Started importing boot images.
Sep 10 08:19:33 critical-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Sep 10 08:19:38 critical-maas maas.boot_image_download_service: [ERROR] Failed to download images: /run/lock/maas.L2V0Yy9tYWFzL2NsdXN0ZXJkLmNvbmY=.lock

#This is after I switched to the Daily stream this morning.
Sep 10 11:56:51 critical-maas maas.bootsources: [INFO] Updated boot sources cache.
Sep 10 11:56:51 critical-maas maas.bootresources: [INFO] Started importing of boot images from 1 source(s).
Sep 10 11:56:53 critical-maas maas.bootresources: [INFO] Importing images from source: http://maas.ubuntu.com/images/ephemeral-v2/daily/
Sep 10 11:59:33 critical-maas maas.import-images: [INFO] Started importing boot images.
Sep 10 11:59:34 critical-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Sep 10 11:59:39 critical-maas maas.boot_image_download_service: [ERROR] Failed to download images: /run/lock/maas.L2V0Yy9tYWFzL2NsdXN0ZXJkLmNvbmY=.lock
Sep 10 12:05:10 critical-maas maas.bootresources: [INFO] Finished importing of boot images from 1 source(s).
Sep 10 12:05:10 critical-maas maas.import-images: [INFO] Started importing boot images.
Sep 10 12:08:25 critical-maas maas.import-images: [INFO] Writing boot image metadata and iSCSI targets.
Sep 10 12:08:25 critical-maas maas.import-images: [INFO] Installing boot images snapshot /var/lib/maas/boot-resources/snapshot-20150910-160510
Sep 10 12:08:28 critical-maas maas.import-images: [INFO] Updating boot image iSCSI targets.
Sep 10 12:08:28 critical-maas maas.import-images: [INFO] Cleaning up old snapshots and cache.
Sep 10 12:08:28 critical-maas maas.import-images: [INFO] Finished importing boot images.
Sep 10 12:14:36 critical-maas maas.bootsources: [INFO] Updated boot sources cache.

Related branches

Revision history for this message
Jeff Lane  (bladernr) wrote :
Download full text (9.1 KiB)

After writing this up, I rebooted the maas server and noticed a couple things I didnt notice before... first, in clusterd.log there is a traceback when the cluster restarted:

2015-09-10 14:21:47-0400 [-] Unhandled error in Deferred:
2015-09-10 14:21:47-0400 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
            self.run()
          File "/usr/lib/python2.7/threading.py", line 763, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 158, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 92, in _run_import
            reload_boot_images()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 57, in reload_boot_images
            CACHED_BOOT_IMAGES = tftppath.list_boot_images(config.tftp_root)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 266, in list_boot_images
            extract_image_params(path, metadata) for path in paths))
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 266, in <genexpr>
            extract_image_params(path, metadata) for path in paths))
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 180, in extract_image_params
            arch, subarch, release, label)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/osystem/custom.py", line 58, in get_xinstall_parameters
            with ClusterConfiguration.open() as config:
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/config.py", line 674, in open
            with cls.backend.open(filepath) as store:
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/config.py", line 549, in open
            with RunLock(path).wait(timeout=5.0):
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/fs.py", line 440, in wait
            raise self.NotAvailable(self.fslock.name)
        provisio...

Read more...

Revision history for this message
Jeff Lane  (bladernr) wrote :

Here're the full log files (maas.log, clusterd.log, and regiond.log)

Revision history for this message
Blake Rouse (blake-rouse) wrote :

src/maasserver/drivers/osystems/custom.py:CustomOS.get_xinstall_parameters is failing to grab the cluster configuration lock to get the tftp_root path. This error is then causes another error in the websocket handler because the response is not the expected output.

summary: - 1.9.0 Cluster failed to sync after switch to daily stream and attempted
- wily import
+ 1.9.0 Error grabbing cluster configuration lock causes importing to
+ fail.
Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 1.9.0
Changed in maas:
assignee: nobody → Gavin Panella (allenap)
Revision history for this message
Gavin Panella (allenap) wrote :
Changed in maas:
status: Triaged → Fix Committed
Revision history for this message
Rod Smith (rodsmith) wrote :

I'm seeing what looks like the same bug in MAAS 1.9.0~beta2+bzr4456-0ubuntu1~trusty1. I'm attaching my log files; detailed version information below....

$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=====================================================-===================================================-============-===============================================================================
ii maas 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS server all-in-one metapackage
ii maas-cert-server 0.2.8-0~36~ubuntu14.04.1 all Ubuntu certification support files for MAAS server
ii maas-cli 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS command line API tool
ii maas-cluster-controller 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS server common files
ii maas-dhcp 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS DHCP server
ii maas-dns 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS DNS server
ii maas-proxy 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS Caching Proxy
ii maas-region-controller 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS server complete region controller
ii maas-region-controller-min 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS Server minimum region controller
ii python-django-maas 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS server Django web framework
ii python-maas-client 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS python API client
ii python-maas-provisioningserver 1.9.0~beta2+bzr4456-0ubuntu1~trusty1 all MAAS server provisioning libraries

Revision history for this message
Rod Smith (rodsmith) wrote :

Update: Rebooting caused the images to import and become available.

Revision history for this message
Jeff Lane  (bladernr) wrote :

What Rod said, still seeing this in 1.9 B2... any idea when this will land in a 1.9 update?

Revision history for this message
Jeff Lane  (bladernr) wrote :

I'm currently installing from lp:maas/next

Revision history for this message
Gavin Panella (allenap) wrote :

I cannot reproduce this problem so I need your help.

When you next see a NotAvailable exception in your clusterd.log, please can you do `readlink $lockpath` (where $lockpath comes from the exception). This should print a number which should correspond to the PID of your clusterd process. Can you check that's the correct PID?

I want to be sure there isn't a clusterd process still running in the background. Or maybe clusterd has died and not cleared up after itself.

Might clusterd have been killed -9 by Upstart or systemd? Can you check your Upstart/systemd logs to see if that might have happened?

Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.