Downloading images failed. provisioningserver.utils.fs.NotAvailable: /run/lock/maas.XXXX=.lock

Bug #1524007 reported by Ryan Beisner
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
High
Gavin Panella
1.9
High
Gavin Panella

Bug Description

Downloading images failed. provisioningserver.utils.fs.NotAvailable: /run/lock/maas.XXXX=.lock

maas:
  Installed: 1.9.0~rc3+bzr4525-0ubuntu1~trusty1

From cluster.log:

http://paste.ubuntu.com/13828916/

2015-12-08 17:11:51+0000 [-] Downloading images failed.
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 423, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
            self._runCallbacks()
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1155, in gotResult
            _inlineCallbacks(r, g, deferred)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1097, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/pserv_services/image_download_service.py", line 136, in maybe_start_download
            yield self._start_download()
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1097, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/pserv_services/image_download_service.py", line 123, in _start_download
            get_proxy_url("https"))
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 200, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 118, in _run_import
            reload_boot_images()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 57, in reload_boot_images
            CACHED_BOOT_IMAGES = tftppath.list_boot_images(config.tftp_root)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 266, in list_boot_images
            extract_image_params(path, metadata) for path in paths))
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 266, in <genexpr>
            extract_image_params(path, metadata) for path in paths))
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 170, in extract_image_params
            arch, subarch, release, label)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/osystem/windows.py", line 54, in get_boot_image_purposes
            with ClusterConfiguration.open() as config:
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/config.py", line 674, in open
            with cls.backend.open(filepath) as store:
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/config.py", line 549, in open
            with RunLock(path).wait(timeout=5.0):
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/fs.py", line 441, in wait
            raise self.NotAvailable(self.fslock.name)
        provisioningserver.utils.fs.NotAvailable: /run/lock/maas.L2V0Yy9tYWFzL2NsdXN0ZXJkLmNvbmY=.lock

Related branches

Revision history for this message
Mike Pontillo (mpontillo) wrote :

Is it possible that maas-clusterd was restarted while syncing images?

Did you try deleting the lock file and restarting maas-clusterd?

Revision history for this message
Mike Pontillo (mpontillo) wrote :

$ printf '%s\n' $(echo 'L2V0Yy9tYWFzL2NsdXN0ZXJkLmNvbmY=' | base64 -d)
/etc/maas/clusterd.conf

Looks like the lock is for the config file, by the way.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

It is possible, as we just did MAAS upgrades. I suppose an image sync could have been underway.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

Looks like we have a hard-coded timeout for that lock of 5 seconds.

Can you edit /usr/lib/python2.7/dist-pacakges/provisioningserver/config.py - on line ~549 there should be a line:

            with RunLock(path).wait(timeout=5.0):

Change that to something like 60 seconds to see if that fixes it:

            with RunLock(path).wait(timeout=60.0):

Revision history for this message
Ryan Beisner (1chb1n) wrote :

FWIW, the net effect of this, is that I seem to have a half-ready environment. Trusty deployments ultimately fail with:
"Could not find kernel image: ubuntu/amd64/hwe-t/trusty/daily/boot-kernel"

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Note, the file doesn't exist: /run/lock/maas.L2V0Yy9tYWFzL2NsdXN0ZXJkLmNvbmY=.lock

Revision history for this message
Ryan Beisner (1chb1n) wrote :

I've increased that timeout to 60.0 and restarted svcs, still get the lock error.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Manually confirmed that the maas user is able to touch a file in that dir:
http://paste.ubuntu.com/13833792/

Revision history for this message
Ryan Beisner (1chb1n) wrote :
Download full text (3.2 KiB)

After removing the > 5GB Windows image per request, and setting the timeout to 60, a different trace is appearing (still with that lock file not existing):

http://paste.ubuntu.com/13833338/

2015-12-08 20:10:34+0000 [-] Unhandled error in Deferred:
2015-12-08 20:10:34+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
            self.run()
          File "/usr/lib/python2.7/threading.py", line 763, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 200, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 118, in _run_import
            reload_boot_images()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 57, in reload_boot_images
            CACHED_BOOT_IMAGES = tftppath.list_boot_images(config.tftp_root)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 266, in list_boot_images
            extract_image_params(path, metadata) for path in paths))
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 266, in <genexpr>
            extract_image_params(path, metadata) for path in paths))
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/tftppath.py", line 180, in extract_image_params
            arch, subarch, release, label)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/osystem/custom.py", line 58, in get_xinstall_parameters
            with ClusterConfiguration.open() as config:
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/config.py", line 674, in open
            with cls.backend.open(filepath) as store:
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/config.py", line 549, in open
            with RunLock(path).wait(timeout=60.0):
          File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
            return self.gen.next()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/fs.py", line 441, in wait
            raise self.NotAvailable(self.fslock.name...

Read more...

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Although I am still getting the lock file trace periodically, my Trusty boot image appears to have made it in! I can now deploy Trusty. Please let me know if there is something we can do to help reveal the cause of the trace. Thank you.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

It occurs to me that we should be able to open the configuration for reading without taking the lock.

Changed in maas:
status: New → Triaged
importance: Undecided → High
Changed in maas:
status: Triaged → Fix Committed
assignee: nobody → Gavin Panella (allenap)
milestone: none → next
Revision history for this message
Gavin Panella (allenap) wrote :

> It occurs to me that we should be able to open the configuration for
> reading without taking the lock.

In thinking about this I have stumbled on what might be the underlying
cause: there is no ordering when taking locks.

A thread will try to take the lock and then sleep for a short time if it
can't. It'll keep doing this for a few seconds after which it will give
up.

Each time the thread wakes up and finds another thread has the lock,
it's not necessarily the same thread or even the same process that's
holding the lock. It could be a different thread every time, even a
thread that has queue-barged for the lock.

Reads vastly outnumber writes, so adding an open-for-read-only option
that does not require a lock (which is safe because the configuration
file is updated atomically) would work around the problem I describe.
Concurrent writers would still be exposed to queue-barging, but so
infrequent are writes that the problem might never manifest.

Changed in maas:
status: Fix Committed → In Progress
Gavin Panella (allenap)
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
milestone: next → none
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers