misidentifies .html file as Perl script when it contains JavaScript "use strict"

Bug #1890716 reported by Alex A. D.
This bug affects 1 person
Affects                          Status         Importance   Assigned to   Milestone
shared-mime-info                 Fix Released   Unknown
qtbase-opensource-src (Ubuntu)   Invalid        Undecided    Unassigned
shared-mime-info (Debian)        Confirmed      Unknown
shared-mime-info (Ubuntu)        Fix Released   Low          Unassigned

Bug Description

For .html files, `xdg-mime` reports the wrong type. The culprit is the `"use strict"` phrase, which is used in JavaScript. .html files should not be identified as anything other than text/html!

STEPS TO REPRODUCE:
Run the following step by step in any folder:
1. $ echo "\"use strict\"" > index.html
2. $ xdg-mime query filetype index.html # -> application/x-perl - this should be text/html!

Platform:
Ubuntu 20.04.1 LTS (Focal Fossa)
Linux version 5.4.0-42-generic (buildd@lgw01-amd64-038) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2))

xdg-utils: 1.1.3-2ubuntu1

Alex A. D. (hinell)
description: updated
Alex A. D. (hinell)
tags: added: html xdg-mime xdg-open
tags: added: 20.04 kde kubuntu perl
Revision history for this message
Sebastien Bacher (seb128) wrote :

Thank you for your bug report. The file types are determined from the content and not only from the name. An HTML file is supposed to start with <html>.

affects: xdg-utils (Ubuntu) → shared-mime-info (Ubuntu)
Changed in shared-mime-info (Ubuntu):
importance: Undecided → Low
status: New → Invalid
Revision history for this message
Alex A. D. (hinell) wrote :

Hi Sebastien. Thank you for your reply.

I suggest you try the following to see that even a file starting with the proper tags is recognized as `application/x-perl`:

$ tee "index.html" <<eol
<!DOCTYPE html>
<html><body>use strict</body></html>
eol

$ xdg-mime query filetype index.html # -> application/x-perl - wrong type

Why did you tagged report as invalid?
I have a file with minified JS code that has `use strict` buried somewhere deep inside an otherwise correct HTML file, yet it is still recognized as a Perl script.

Because of this, tools that open the file in a browser through xdg-open do not work for me, and I'm forced to use a browser-specific launcher.

Revision history for this message
Alex A. D. (hinell) wrote :

why did you tag the report*

Bump.

Revision history for this message
Sebastien Bacher (seb128) wrote :

The new test case is different and is valid HTML now (which the initial report wasn't). Using those commands here, it returns 'text/html' as the type, though...

Could you try with another user to see if that's due to some local configuration?

Changed in shared-mime-info (Ubuntu):
status: Invalid → New
status: New → Incomplete
Revision history for this message
Alex A. D. (hinell) wrote :

I have just tried with a fresh new user. The output is identical. See the attached screenshot.

Revision history for this message
Sebastien Bacher (seb128) wrote :

In fact, the script doesn't seem to use shared-mime-info:

$ XDG_UTILS_DEBUG_LEVEL=1 xdg-mime query filetype index.html
Running mimetype --brief --dereference "/tmp/index.html"
text/html

Could you run the command with the same environment variable and share the output?

Revision history for this message
Sebastien Bacher (seb128) wrote :

The wrapper used depends on the environment. The previous one was run in a clean command-line instance; on a GNOME-based desktop:

$ XDG_UTILS_DEBUG_LEVEL=1 xdg-mime query filetype index.html
Running gio info "/tmp/index.html"
text/html
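
Since `xdg-mime` delegates to a different backend depending on the desktop environment, it can help to query each backend directly and compare the answers. The following is a minimal sketch, not part of xdg-utils: `compare_backends` is a hypothetical helper name, the three invocations are the ones shown in the debug output in this thread, and the `grep` filter on `gio info` assumes its output contains a `standard::content-type` line.

```shell
# Hypothetical helper: ask each backend named in this thread directly,
# skipping any backend that is not installed on the machine.
compare_backends() {
    f=$1
    if command -v mimetype >/dev/null 2>&1; then
        printf 'mimetype: %s\n' "$(mimetype --brief --dereference "$f" 2>/dev/null)"
    fi
    if command -v gio >/dev/null 2>&1; then
        # Assumption: gio info prints a "standard::content-type: ..." line.
        printf 'gio: %s\n' "$(gio info "$f" 2>/dev/null | grep -F 'standard::content-type')"
    fi
    if command -v kmimetypefinder5 >/dev/null 2>&1; then
        printf 'kmimetypefinder5: %s\n' "$(kmimetypefinder5 "$f" 2>/dev/null)"
    fi
    return 0
}

# Run against a file given on the command line, if any.
if [ "$#" -gt 0 ]; then
    compare_backends "$1"
fi
```

If the backends disagree for the same file, the difference lies in the backend or its database rather than in `xdg-mime` itself.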

Revision history for this message
Alex A. D. (hinell) wrote :

After running that command I got the following output:

$ ...
Running kmimetypefinder5 "/home/alex/Desktop/index.html"
application/x-perl

It seems that kmimetypefinder5 is the major culprit here. I've found another, unrelated bug report that was filed about a year ago: https://bugs.launchpad.net/ubuntu/+source/kde-cli-tools/+bug/1857824

In that case the program reports the wrong type for *.py files with correct content. Weird!
I think it's worth renaming this bug report.

summary: - xdg-mime query filetype index.html reports wrong type
+ kmimetypefinder5 *.html reports wrong type
Revision history for this message
Alex A. D. (hinell) wrote : Re: kmimetypefinder5 *.html reports wrong type

A new bug has been opened with a complete and correct explanation:
https://bugs.launchpad.net/ubuntu/+source/kde-cli-tools/+bug/1896682

Changed in shared-mime-info (Ubuntu):
status: Incomplete → Invalid
Revision history for this message
Alex A. D. (hinell) wrote :

I've reported a new bug providing a complete and correct description:
https://bugs.launchpad.net/ubuntu/+source/kde-cli-tools/+bug/1896682

This one can be safely closed.

Changed in shared-mime-info (Ubuntu):
status: Invalid → New
summary: - kmimetypefinder5 *.html reports wrong type
+ misidentifies .html file as Perl script when it contains JavaScript "use
+ strict"
Changed in shared-mime-info (Ubuntu):
status: New → Invalid
Revision history for this message
Kai Kasurinen (kai-kasurinen) wrote :

> dpkg -S /usr/bin/mimetype
libfile-mimeinfo-perl: /usr/bin/mimetype
> /usr/bin/mimetype --magic-only foo.html
foo.html: application/x-perl

Changed in shared-mime-info (Ubuntu):
status: Invalid → New
Revision history for this message
Kai Kasurinen (kai-kasurinen) wrote :
Revision history for this message
Sebastien Bacher (seb128) wrote :

> > dpkg -S /usr/bin/mimetype
> libfile-mimeinfo-perl: /usr/bin/mimetype

how is that supposed to prove it's a shared-mime-info bug?

Changed in shared-mime-info (Ubuntu):
status: New → Fix Committed
Changed in shared-mime-info (Debian):
status: Unknown → Confirmed
Revision history for this message
Alex A. D. (hinell) wrote :

With the same setup, `mimetype` outputs `text/html` for `index.html`. It seems to work correctly.

Revision history for this message
Alex A. D. (hinell) wrote :

#INTRO
After digging for a while, I've found where the issue comes from for both `.html` and `.py` (bug #1857824) files.

#SHORT
The culprit responsible for the misidentification resides in the `.xml` database, which specifies how to match MIME types against input data. It can be found here [2].

#LONG
`kmimetypefinder.cpp` [0] uses the `QMimeDatabase db` API by calling `db.mimeTypeForFile(...)`, which in turn bootstraps the `QMimeDatabasePrivate ...` XML database from an .xml file [1].

If we look carefully at the content of the Perl entry (the one aliased as `"text/x-perl"`), we see the following:

```
    <alias type="text/x-perl"/>
    <magic priority="50">
      ...
      <match value="use strict" type="string" offset="0:256"/>
      ...
    </magic>
```

Did you notice the offset attribute `"0:256"`? If we run the following two cases, we see that a file is identified as `application/x-perl` when the keywords `use strict` start within byte offsets 0..256, and as `text/html` when `use strict` starts beyond that range. Check it out:

💲 tee "index.html" <<eol ; echo -e "\n"; kmimetypefinder5 index.html
`printf "_"%.0s {1..256}`use strict
eol

application/x-perl # <- OUTPUT IS WRONG ⚠️

💲 tee "index.html" <<eol ; echo -e "\n"; kmimetypefinder5 index.html
`printf "_"%.0s {1..257}`use strict
eol

text/html # <- OUTPUT IS CORRECT!!! ✅ - Surprising, huh? 😏
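
The boundary seen in the two runs above can be reproduced without any MIME tooling. The following is a minimal sketch that mimics only the single `<match>` line quoted earlier, assuming the range `0:256` means the string may start at any byte offset from 0 to 256 inclusive (which is what the two runs suggest). It is not how Qt parses or prioritizes its database; `perl_magic_hits` and `pad` are hypothetical helper names.

```shell
# Rough approximation of:
#   <match value="use strict" type="string" offset="0:256"/>
# The needle may START at any byte offset from 0 to 256, so a hit must lie
# entirely within the first 256 + length(needle) bytes of the file.
perl_magic_hits() {   # hypothetical helper name
    needle='use strict'
    window=$((256 + ${#needle}))   # 266 bytes for this needle
    head -c "$window" "$1" | grep -qaF -- "$needle"
}

# Print N underscores (hypothetical helper, used only for the demo files).
pad() { head -c "$1" /dev/zero | tr '\0' '_'; }

tmpdir=$(mktemp -d)
{ pad 256; printf 'use strict\n'; } > "$tmpdir/hit.html"    # needle starts at offset 256
{ pad 257; printf 'use strict\n'; } > "$tmpdir/miss.html"   # needle starts at offset 257

for f in "$tmpdir/hit.html" "$tmpdir/miss.html"; do
    if perl_magic_hits "$f"; then
        echo "$(basename "$f"): rule matches"
    else
        echo "$(basename "$f"): rule does not match"
    fi
done
rm -rf "$tmpdir"
```

The sketch reports a match for `hit.html` and no match for `miss.html`, the same one-byte boundary that `kmimetypefinder5` shows above.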

#CONCLUSION
This shows that the bug comes from the QtBase MIME database, which wrongly matches a Perl keyword inside JS scripts. JS has a `'use strict'` directive that is specifically meant to be placed at the top of a script, so the keyword overlaps between the two languages. I think an appropriate bug should be opened in the QtBase bug tracker.

[0]: https://github.com/KDE/kde-cli-tools/blob/master/kmimetypefinder/kmimetypefinder.cpp
[1]: https://github.com/qt/qtbase/blob/03dfd4199deb4a0f5123fb1eead42f7e1f85e9e3/src/corelib/mimetypes/qmimedatabase.cpp#L102

[2]: https://github.com/qt/qtbase/tree/03dfd4199deb4a0f5123fb1eead42f7e1f85e9e3/src/corelib/mimetypes/mime/packages

Revision history for this message
Alex A. D. (hinell) wrote :

If you wrap the string in the proper tags you get the same result, but at a different padding length: the tags in front of the padding take up 28 characters, so 228 padding characters are enough:

tee "index.html" <<eol ; echo -e "\n"; kmimetypefinder5 index.html
<!DOCTYPE html>
<html><body>`printf "x"%.0s {1..228}`
use strict
</body></html>
eol

text/html # <- output
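
To check where `use strict` actually lands in a file like the one above, the byte offset of its first occurrence can be printed directly. This is a small sketch assuming GNU grep, where `-b` combined with `-o` reports the 0-based offset of the match itself; `first_offset` is a hypothetical helper name.

```shell
# Hypothetical helper: print the 0-based byte offset of the first
# "use strict" in a file (prints nothing if the string is absent).
first_offset() {
    grep -abo -F -- 'use strict' "$1" | head -n 1 | cut -d: -f1
}

# Rebuild the wrapped test file from the comment above.
tmp=$(mktemp)
{
    printf '<!DOCTYPE html>\n'              # 15 characters + newline = 16 bytes
    printf '<html><body>'                   # 12 bytes, 28 so far
    head -c 228 /dev/zero | tr '\0' 'x'     # 228 bytes of padding
    printf '\nuse strict\n</body></html>\n' # the leading newline is byte 256
} > "$tmp"

first_offset "$tmp"   # prints 257: 16 + 12 + 228 + 1
rm -f "$tmp"
```

An offset of 257 is one byte past the `0:256` range, which is consistent with the `text/html` result.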

Revision history for this message
Alex A. D. (hinell) wrote :

@Kai Kasurinen
>probably fixed on shared-mime-info 2.0:
>https://gitlab.freedesktop.org/xdg/shared-mime-info/-/commit/18bb7cfc6c43d710ecf60339b5dd9bd19c297cdf
Yeah, well. That only helps if they used the same database.

affects: kde-cli-tools (Ubuntu) → qtbase-opensource-src (Ubuntu)
Revision history for this message
Kai Kasurinen (kai-kasurinen) wrote :
Changed in qtbase-opensource-src (Ubuntu):
status: New → Invalid
Revision history for this message
Kai Kasurinen (kai-kasurinen) wrote :

shared-mime-info 2.0-1 Published in groovy-release on 2020-10-14

Changed in shared-mime-info (Ubuntu):
status: Fix Committed → Fix Released
Changed in shared-mime-info:
status: Unknown → Fix Released