Caching for arbitrary images on the web/remote hosts

Bug #1532732 reported by Michael Zanetti
This bug affects 1 person
Affects: thumbnailer (Ubuntu)
Status: New
Importance: Wishlist
Assigned to: Unassigned

Bug Description

I frequently hit the use case where I query some host (which might be on my local network or on the internet) for a list of things. The result set comes with images, so I present a scrollable list of all those things to the user.

The problem is that while the images are being fetched, scrolling the list is somewhat stuttery, and items initially show empty images that are only filled with content a bit later. Scrolling the list back up, the same thing happens again: the in-QML caches we have are RAM-based and therefore can't hold tons of images. They are discarded, and scrolling back up causes them to be reloaded.

It would be great if the Thumbnailer could take care of this and fetch and cache them for the developer.

One point to think about would be timeouts. Some use cases will always present the same image again. For example, in kodimote I fetch the music artwork given by kodi. The artwork won't ever change, so the cache can be persistent. At the same time, things fetched from an online server *might* change over time. Ideally this would be configurable, so that the developer can specify a cache timeout for entries.

Michi Henning (michihenning) wrote:

Doing this would not be hard. The underlying persistent-cache-cpp thingy already handles TTL eviction. The time to live can be set on a per-thumbnail basis. So, you can just ask for a thumbnail and provide an expiry time, and the thumbnail will automatically be re-fetched once it expires. Or you can leave the timeout off, and the thumbnail will then hang around in the cache under LRU until it isn't accessed often enough and drops out of the cache completely (at which point it would be automatically re-fetched as needed).

The key you would provide for this would be the URL to the image file.
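
To make that concrete, here's a rough sketch of what the TTL/LRU behaviour looks like at the persistent-cache-cpp level, with the image URL as the key. Take the identifiers (core::PersistentStringCache, CacheDiscardPolicy::lru_ttl, the open/put/get signatures) as approximate rather than gospel, and the URL, path, and sizes as made up for illustration:

#include <core/persistent_string_cache.h>

#include <chrono>
#include <iostream>
#include <string>

int main()
{
    using namespace std::chrono;

    // lru_ttl: entries are discarded by LRU and, if they carry an
    // expiry time, are additionally refused once they have expired.
    auto cache = core::PersistentStringCache::open(
        "/tmp/demo-cache",          // on-disk location
        100 * 1024 * 1024,          // 100 MB size budget
        core::CacheDiscardPolicy::lru_ttl);

    // The key is simply the URL of the remote image.
    std::string key = "http://example.com/artwork.jpg";
    std::string thumbnail = "<scaled image bytes>";

    // Without an expiry time, the entry stays until LRU pushes it out.
    cache->put(key, thumbnail);

    // With an expiry time, get() misses once the hour is up, and the
    // caller re-fetches and re-inserts.
    cache->put(key, thumbnail, system_clock::now() + hours(1));

    if (auto value = cache->get(key))
        std::cout << "hit: " << value->size() << " bytes\n";
    else
        std::cout << "miss: re-fetch " << key << "\n";
}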

If you want to thumbnail images that you have extracted yourself (where the images do not have a URL that directly points at them), you can do that today. Doing so requires you to write them into the file system and then ask for a thumbnail with a URL that points at the local file. (That's how photos are thumbnailed, for example.) The cost of this approach is increased disk space, because the thumbnailer absolutely will not provide a thumbnail for a local image unless you can prove that you are entitled to actually read that image. In practice, this means that, as soon as the original image from which a thumbnail was generated is deleted, the thumbnailer will no longer hand out a thumbnail for that image.
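
As a sketch of that write-it-out-then-thumbnail route from the application's side, assuming the Qt client API (unity::thumbnailer::qt::Thumbnailer; class and method names quoted from memory, so treat them as approximate), with a hypothetical extractedImageBytes() standing in for wherever the image really came from:

#include <unity/thumbnailer/qt/thumbnailer-qt.h>

#include <QCoreApplication>
#include <QFile>
#include <QSize>

// Hypothetical stand-in for whatever produced the image bytes.
static QByteArray extractedImageBytes() { return QByteArray(); }

int main(int argc, char* argv[])
{
    QCoreApplication app(argc, argv);

    // 1. Write the image into the file system so that it has a path
    //    the thumbnailer can check read permission against.
    QFile file("/tmp/myapp-cover.jpg");
    file.open(QIODevice::WriteOnly);
    file.write(extractedImageBytes());
    file.close();

    // 2. Ask for a thumbnail of the local file as usual. The thumbnail
    //    is served (and cached) only while the original stays readable.
    unity::thumbnailer::qt::Thumbnailer thumbnailer;
    auto request = thumbnailer.getThumbnail(file.fileName(), QSize(256, 256));
    QObject::connect(request.data(),
                     &unity::thumbnailer::qt::Request::finished, [&]() {
                         if (request->isValid())
                             request->image().save("/tmp/myapp-thumb.png");
                         app.quit();
                     });
    return app.exec();
}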

All this would work for any file type that is currently supported. (Basically all image file formats in the known universe, plus most audio and video formats, provided the necessary gstreamer codecs are installed.)

Michi Henning (michihenning) wrote:

After discussing this with James, here are some thoughts.

Adding direct download of arbitrary remote images opens an attack/bug vector. For example, a caller could ask for http://<whatever> and it turns out that the remote server (maliciously or otherwise) is very slow, or doesn't respond at all until the request times out. Because the thumbnailer allows only a limited number of outstanding HTTP requests at a time, one application can disable access to remote images for other applications. Or the server could potentially return garbage images each and every time, each of which would result in an entry in our failure cache. That can have a negative performance impact on other apps.

But, I don't think all is lost. The pragmatic answer to the problem would be for the application to download the image (or audio file, or whatever) and simply drop it into the file system somewhere. Then pass a URL to that file to the thumbnailer, and it will do the rest (extraction, scaling, rotating, caching, etc.).

This isn't quite as convenient as having the thumbnailer do all of the work, but I think it might be workable?

I'm open to suggestions though. I agree that having this feature would be really nice. But we'd have to sort out the potential denial-of-service/reliability issues. If the actions of one application can disable access to thumbnails for other applications (or significantly reduce throughput), we have a problem.

Michael Zanetti (mzanetti) wrote:

On 12.01.2016 10:28, Michi Henning wrote:
> After discussing this with James, here are some thoughts.
>
> Adding direct download of arbitrary remote images opens an attack/bug
> vector. For example, a caller could ask for http://<whatever> and it
> turns out that the remote server (maliciously or otherwise) is very
> slow, or doesn't respond at all until the request times out. Because
> the thumbnailer allows only a limited number of outstanding HTTP
> requests at a time, one application can disable access to remote
> images for other applications. Or the server could potentially return
> garbage images each and every time, each of which would result in an
> entry in our failure cache. That can have a negative performance
> impact on other apps.
>
> But, I don't think all is lost. The pragmatic answer to the problem
> would be for the application to download the image (or audio file, or
> whatever) and simply drop it into the file system somewhere. Then pass a
> URL to that file to the thumbnailer, and it will do the rest
> (extraction, scaling, rotating, caching, etc.)
>
> This isn't quite as convenient as having the thumbnailer do all of the
> work, but I think it might be workable?

The problem with that is that in QML the actual fetching of the image is
hidden behind the scenes. Right now one uses

Image {
  source: someurl
}

and QML does the rest. I think the only way to put a cache in there is
to place it into the image provider. Putting that back on the app
developer's shoulders would probably mean that no one uses it.
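
To illustrate the point, a per-app caching image provider is roughly the following (hypothetical app-side code, not thumbnailer API; the hash-the-URL-into-the-cache-dir scheme is invented for illustration). This is the boilerplate every app developer would have to carry around:

#include <QCryptographicHash>
#include <QDir>
#include <QEventLoop>
#include <QImage>
#include <QNetworkAccessManager>
#include <QNetworkReply>
#include <QNetworkRequest>
#include <QQuickImageProvider>
#include <QStandardPaths>
#include <QUrl>

class CachingImageProvider : public QQuickImageProvider
{
public:
    CachingImageProvider()
        : QQuickImageProvider(QQuickImageProvider::Image) {}

    // requestImage() is called on a worker thread, so blocking on the
    // download here does not freeze the UI.
    QImage requestImage(const QString& id, QSize* size,
                        const QSize& /*requestedSize*/) override
    {
        QString dir = QStandardPaths::writableLocation(
            QStandardPaths::CacheLocation);
        QDir().mkpath(dir);
        QString path = dir + "/" + QCryptographicHash::hash(
            id.toUtf8(), QCryptographicHash::Sha1).toHex();

        QImage image;
        if (!image.load(path))  // cache miss: download and store on disk
        {
            QNetworkAccessManager nam;
            QEventLoop loop;
            QNetworkReply* reply = nam.get(QNetworkRequest(QUrl(id)));
            QObject::connect(reply, &QNetworkReply::finished,
                             &loop, &QEventLoop::quit);
            loop.exec();
            image.loadFromData(reply->readAll());
            reply->deleteLater();
            image.save(path, "PNG");
        }
        if (size)
            *size = image.size();
        return image;
    }
};

// Registered once: engine.addImageProvider("cached", new CachingImageProvider);
// the QML side then becomes: Image { source: "image://cached/" + someurl }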

>
> I'm open to suggestions though. I agree that having this feature would
> be really nice. But we'd have to sort out the potential denial-of-
> service/reliability issues. If the actions of one application can
> disable access to thumbnails for other applications (or significantly
> reduce throughput), we have a problem.

Hmm, so far I was assuming that the cache is per app, instead of a
common pool for every app. Similarly, I was thinking the maximum number
of concurrent downloads would be enforced on a per-process basis.
Wouldn't that make sense from a security point of view and do away with
the above problems? That said, I have no idea how realistic it would be
to change the thumbnailer to work that way.

I realize that in the music/video artwork case it totally makes sense to
keep a shared pool of data. In this case, though, I think overlap
between apps would be quite rare (as each app typically connects to one
remote service), and the gain of splitting caches/logic seems bigger
than the loss of giving up the shared data pool. Would it be possible to
use two different mechanisms: a shared one for media artwork and a split
cache for arbitrary images?

Michi Henning (michihenning) wrote:

All good points, thanks!

Currently, there is one thumbnailer service instance for all apps/scopes that are run by the user. So, it's one per user. (We could equally well have one for the entire system, but that's currently moot, seeing that the phone isn't multi-user, at least not concurrently.)

Downloads happen on the server side because dash.ubuntu.com is the only remote image source, and the results from there can be shared among all apps. So, it made sense to do it on the server side.

For your use case, downloading on the client side would probably be the way to go. We could implement this on behalf of the caller inside the client API, so there would be no extra work for customers.

But we still need to secure this somehow. If the client gives us a URL and we download on the client side on behalf of the caller, we now have to put the downloaded image somewhere in the file system. The caller then would have to provide the path name to that file as usual, and the thumbnailer would make sure that the client is allowed to read that file before handing out a thumbnail (as it does now).

But we don't want to accumulate garbage. When the thumbnail falls out of the cache, the file that was created by the client-side API needs to be removed too. Otherwise, we burden the caller with the need to explicitly remove stale files. That's a sure-fire guarantee for accumulating dead files over time, so I don't want to go there.

Or we could fetch the image on the client side, shoot it over the wire for thumbnailing and caching, and return some token to the client that identifies that thumbnail until it expires. But that makes for an awkward API too, because now the client has to garbage-collect the tokens, so we haven't really solved the problem.

Yet another option would be to make the thumbnailer embeddable. If the client wants to thumbnail arbitrary things, they run their own thumbnailer inside their own address space. That gets rid of all the security concerns immediately. The thumbnailer hasn't been designed to be embeddable, but I don't think it would be all that difficult to allow it. The client-side API is pimpl'd already, so we can probably bolt on the rest (bypassing the DBus layer underneath) without too much difficulty.
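
For a feel of what that could look like (an entirely hypothetical interface sketch, nothing like this exists today): the public class would keep its pimpl, and construction would simply choose an in-process engine instead of the DBus proxy.

#include <memory>
#include <string>

// Hypothetical interface sketch only. The point is that the pimpl'd
// public class could hide either transport behind the same calls.
class Thumbnailer
{
public:
    enum class Mode { Service, InProcess };

    explicit Thumbnailer(Mode mode);  // InProcess: no DBus, nothing shared
    ~Thumbnailer();

    // Identical call either way. In InProcess mode, download, scaling,
    // and caching all happen inside the caller's own address space, so
    // a slow or malicious server hurts only the app that asked for it.
    std::string get_thumbnail(std::string const& uri, int desired_size);

private:
    class Impl;  // DBus proxy today; in-process engine when embedded
    std::unique_ptr<Impl> p_;
};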

Opinions?
