Downloading images failed. provisioningserver.utils.fs.NotAvailable: /run/lock/maas.XXXX=.lock

Bug #1524007 reported by Ryan Beisner on 2015-12-08
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
High
Gavin Panella
1.9
High
Gavin Panella

Bug Description

Downloading images failed. provisioningserver.utils.fs.NotAvailable: /run/lock/maas.XXXX=.lock

maas:
  Installed: 1.9.0~rc3+bzr4525-0ubuntu1~trusty1

From cluster.log:

http://paste.ubuntu.com/13828916/

2015-12-08 17:11:51+0000 [-] Downloading images failed.
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 423, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
            self._runCallbacks()
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1155, in gotResult
            _inlineCallbacks(r, g, deferred)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1097, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/pserv_services/image_download_service.py", line 136, in maybe_start_download
            yield self._start_download()
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1097, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/pserv_services/image_download_service.py", line 123, in _start_download
            get_proxy_url("https"))
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 200, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 118, in _run_import
            reload_boot_images()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 57, in reload_boot_images
            CACHED_BOOT_IMAGES = tftppath.list_boot_images(config.tftp_root)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 266, in list_boot_images
            extract_image_params(path, metadata) for path in paths))
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 266, in <genexpr>
            extract_image_params(path, metadata) for path in paths))
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 170, in extract_image_params
            arch, subarch, release, label)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/osystem/windows.py", line 54, in get_boot_image_purposes
            with ClusterConfiguration.open() as config:
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/config.py", line 674, in open
            with cls.backend.open(filepath) as store:
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/config.py", line 549, in open
            with RunLock(path).wait(timeout=5.0):
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/fs.py", line 441, in wait
            raise self.NotAvailable(self.fslock.name)
        provisioningserver.utils.fs.NotAvailable: /run/lock/maas.L2V0Yy9tYWFzL2NsdXN0ZXJkLmNvbmY=.lock

Related branches

Mike Pontillo (mpontillo) wrote :

Is it possible that maas-clusterd was restarted while syncing images?

Did you try deleting the lock file and restarting maas-clusterd?

Mike Pontillo (mpontillo) wrote :

$ printf '%s\n' $(echo 'L2V0Yy9tYWFzL2NsdXN0ZXJkLmNvbmY=' | base64 -d)
/etc/maas/clusterd.conf

Looks like the lock is for the config file, by the way.

Ryan Beisner (1chb1n) wrote :

It is possible, as we just did MAAS upgrades. I suppose an image sync could have been underway.

Mike Pontillo (mpontillo) wrote :

Looks like we have a hard-coded timeout for that lock of 5 seconds.

Can you edit /usr/lib/python2.7/dist-pacakges/provisioningserver/config.py - on line ~549 there should be a line:

            with RunLock(path).wait(timeout=5.0):

Change that to something like 60 seconds to see if that fixes it:

            with RunLock(path).wait(timeout=60.0):

Ryan Beisner (1chb1n) wrote :

FWIW, the net effect of this, is that I seem to have a half-ready environment. Trusty deployments ultimately fail with:
"Could not find kernel image: ubuntu/amd64/hwe-t/trusty/daily/boot-kernel"

Ryan Beisner (1chb1n) wrote :

Note, the file doesn't exist: /run/lock/maas.L2V0Yy9tYWFzL2NsdXN0ZXJkLmNvbmY=.lock

Ryan Beisner (1chb1n) wrote :

I've increased that timeout to 60.0 and restarted svcs, still get the lock error.

Ryan Beisner (1chb1n) wrote :

Manually confirmed that the maas user is able to touch a file in that dir:
http://paste.ubuntu.com/13833792/

Ryan Beisner (1chb1n) wrote :
Download full text (3.2 KiB)

After removing the > 5GB Windows image per request, and setting the timeout to 60, a different trace is appearing (still with that lock file not existing):

http://paste.ubuntu.com/13833338/

2015-12-08 20:10:34+0000 [-] Unhandled error in Deferred:
2015-12-08 20:10:34+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
            self.run()
          File "/usr/lib/python2.7/threading.py", line 763, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 200, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 118, in _run_import
            reload_boot_images()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 57, in reload_boot_images
            CACHED_BOOT_IMAGES = tftppath.list_boot_images(config.tftp_root)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 266, in list_boot_images
            extract_image_params(path, metadata) for path in paths))
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 266, in <genexpr>
            extract_image_params(path, metadata) for path in paths))
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 180, in extract_image_params
            arch, subarch, release, label)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/osystem/custom.py", line 58, in get_xinstall_parameters
            with ClusterConfiguration.open() as config:
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/config.py", line 674, in open
            with cls.backend.open(filepath) as store:
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/config.py", line 549, in open
            with RunLock(path).wait(timeout=60.0):
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/fs.py", line 441, in wait
            raise self.NotAvailable(self.fslock.name...

Read more...

Ryan Beisner (1chb1n) wrote :

Although I am still getting the lock file trace periodically, my Trusty boot image appears to have made it in! I can now deploy Trusty. Please let me know if there is something we can do to help reveal the cause of the trace. Thank you.

Mike Pontillo (mpontillo) wrote :

It occurs to me that we should be able to open the configuration for reading without taking the lock.

Changed in maas:
status: New → Triaged
importance: Undecided → High
Changed in maas:
status: Triaged → Fix Committed
assignee: nobody → Gavin Panella (allenap)
milestone: none → next
Gavin Panella (allenap) wrote :

> It occurs to me that we should be able to open the configuration for
> reading without taking the lock.

In thinking about this I have stumbled on what might be the underlying
cause: there is no ordering when taking locks.

A thread will try to take the lock and then sleep for a short time if it
can't. It'll keep doing this for a few seconds after which it will give
up.

Each time the thread wakes up and finds another thread has the lock,
it's not necessarily the same thread or even the same process that's
holding the lock. It could be a different thread every time, even a
thread that has queue-barged for the lock.

Reads vastly outnumber writes, so adding an open-for-read-only option
that does not require a lock (which is safe because the configuration
file is updated atomically) would work around the problem I describe.
Concurrent writers would still be exposed to queue-barging, but so
infrequent are writes that the problem might never manifest.

Changed in maas:
status: Fix Committed → In Progress
Gavin Panella (allenap) on 2015-12-11
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
milestone: next → none
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers