UTF8 character encoding problem in HTML export

Bug #1756904 reported by Danny Ouellet
This bug affects 1 person
Affects: Mahara
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

- Tested in demo.mahara.org
- Client OS: Windows
- Chrome 64.0.3282.186, Firefox 58.0.2, Microsoft Edge 41.16299.248.0

Steps to reproduce

- Create a Portfolio page with UTF8 accents in Title.
- Export to HTML with "All my data"
- The URL to the page in index.html contains accents
- Look in the archive under /views/: the name of the page's folder contains undefined characters (Windows)

Tags: export
Danny Ouellet (exlame) wrote :
Danny Ouellet (exlame)
description: updated
summary: - UTF8 accents not encoded in page url in HTML export
+ UTF8 character encoding problem in HTML export
Kristina Hoeppner (kris-hoeppner) wrote :

I don't think that this is an issue because the special characters are still displayed properly when looking at the HTML page in the browser, and the links go to the correct places.

Changed in mahara:
status: New → Invalid
Kristina Hoeppner (kris-hoeppner) wrote :
Danny Ouellet (exlame) wrote :

Links work on Mac, but I tested on Windows 10 and Windows 7 and they do not work there. The HTML displays correctly, but the links don't work because of the folder names.

"é" becomes a weird symbol and "Test 18:10" becomes "Test 18_20" because Windows doesn't accept ":" in folder names.
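One possible explanation for the "weird symbol", not confirmed anywhere in this report, is that the archive stores the folder name as UTF-8 bytes and the extracting tool decodes those bytes with a legacy code page. A minimal Python sketch of that effect, assuming code page 437 as the legacy encoding:

```python
# Assumption: the exported name is stored as UTF-8 bytes, and the tool that
# unpacks it on Windows decodes those bytes with a legacy code page (CP437).
name = "é"

utf8_bytes = name.encode("utf-8")      # two bytes: 0xC3 0xA9
misread = utf8_bytes.decode("cp437")   # each byte becomes its own character

print(repr(name), "->", repr(misread))
```

If this is what happens, one accented character turns into two unrelated symbols, so the folder name on disk no longer matches the link in index.html.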

Danny Ouellet (exlame)
Changed in mahara:
status: Invalid → New
Kristina Hoeppner (kris-hoeppner) wrote :

Links to content in blocks still worked for us on Windows. Will need to take a closer look at your links.

Kristina Hoeppner (kris-hoeppner) wrote :

Danny, can you please also upload the Leap2A export for this portfolio so we can import it into our instance of Mahara?

Steven (stevens-q) wrote :

Some more testing with Danny's export:

1. The index.html page can be clicked and opened in neither Chrome on Ubuntu nor MS Edge on a Surface tablet.
2. Clicking on "Browse your file collection" and then the "Test18:10" folder opens the subfolder in Chrome on Ubuntu but not in Edge on Surface.
3. Our test HTML export does open the index.html page correctly both on Ubuntu and on Surface, but the 18:10 test folder does not open in the file collection either. Test export is attached.

Danny, having the Leap2A export from your portfolio that you created on the site would help for our local testing to check what is in Mahara and what isn't. Thank you.

Changed in mahara:
status: New → Incomplete
Danny Ouellet (exlame) wrote :
Danny Ouellet (exlame) wrote :
Danny Ouellet (exlame) wrote :

Exported a fresh Leap and HTML from demo.mahara.org

Kristina Hoeppner (kris-hoeppner) wrote :

Hello Danny,

Thank you very much for the exported files. We did a bit more testing. Here are the results, so that we know what happened when we look into it further:

The results from note #7 still hold.

We imported your Leap2A file into one of our local test instances and exported that as HTML. The result is that we can click the page title and get to the page (both in Chrome and IE Edge). In Edge, we can get to the folder collection, but then cannot click the "Test 18:10" folder. It does work fine in Chrome on Ubuntu, though.

We'll need to investigate further how we can resolve the problem as the environments behave differently.

Changed in mahara:
status: Incomplete → Confirmed
importance: Undecided → High
milestone: none → 18.10.0
Changed in mahara:
assignee: nobody → Cecilia Vela Gurovic (ceciliavg)
Robert Lyon (robertl-9) wrote :

One of the things we should cater for is the problem of invalid chars in file path names.

There are a bunch of chars that Windows doesn't like and that we should avoid using:

< (less than)
> (greater than)
: (colon - sometimes works, but is actually NTFS Alternate Data Streams)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)

see: https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names

Even though Linux can handle most of those, we are doing an export and so do not know what system the export will be imported into. It is better to make things more compatible by dealing with those chars during export and replacing them with something more useful like _ or -
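A minimal Python sketch of that replacement, covering only the characters in the list above. The function name `sanitize_name` and the default replacement `_` are assumptions for illustration; this is not Mahara's code, which is written in PHP.

```python
import re

# The characters from the list above that Windows rejects in names.
WINDOWS_FORBIDDEN = '<>:"/\\|?*'

# Character class that matches any single forbidden character.
_FORBIDDEN_RE = re.compile("[" + re.escape(WINDOWS_FORBIDDEN) + "]")


def sanitize_name(name: str, replacement: str = "_") -> str:
    """Replace each forbidden character in one file or folder name.

    Hypothetical helper: it handles a single path component, so "/" and
    "\\" are treated as forbidden characters rather than as separators.
    """
    return _FORBIDDEN_RE.sub(replacement, name)


print(sanitize_name("Test 18:10"))
```

Accented characters such as "é" are left untouched here; the sketch only addresses the forbidden-character part of the problem.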

Mahara Bot (dev-mahara) wrote : A patch has been submitted for review

Patch for "master" branch: https://reviews.mahara.org/8798

Changed in mahara:
status: Confirmed → In Progress
Cecilia Vela Gurovic (ceciliavg) wrote :
Cecilia Vela Gurovic (ceciliavg) wrote :

for testing

Cecilia Vela Gurovic (ceciliavg) wrote :

As we discussed, there are 2 ways to resolve this:
1. Translate the filenames to be included in the export, removing all symbols that could be considered invalid chars in the OS.
This is not as easy as replacing the symbols, since we could have similar names that translate to the same name, so we would have to keep track of all the names taken and find a way to choose a new one in case one is not available.

2. Do not allow chars that are invalid in Windows and Linux to be part of Mahara filenames either.
But we have to consider the files that already exist in Mahara.
If we want to rename them to remove the invalid chars, then we have the same problem as above, and some users might not like to see their files renamed, or they could think there's something wrong with them.

The patch I had started here https://reviews.mahara.org/#/c/8798/ follows solution 1, but I'd like to know what's the best solution before I abandon or continue working on it.
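The name tracking that solution 1 requires can be sketched in Python as follows. This is an illustration under stated assumptions, not the content of the patch: `allocate_export_names` is a hypothetical helper, the numeric suffix scheme is one choice among many, and names are compared case-insensitively because Windows file systems usually are.

```python
import re

# Characters that Windows rejects in file and folder names.
_FORBIDDEN_RE = re.compile(r'[<>:"/\\|?*]')


def allocate_export_names(names):
    """Return (original, export_name) pairs with unique export names.

    Hypothetical helper: each name is sanitized, and when the result is
    already taken a numeric suffix is appended until it is free.
    """
    taken = set()  # lower-cased export names already handed out
    result = []
    for original in names:
        base = _FORBIDDEN_RE.sub("_", original)
        candidate = base
        counter = 1
        while candidate.lower() in taken:
            counter += 1
            candidate = f"{base}_{counter}"
        taken.add(candidate.lower())
        result.append((original, candidate))
    return result


for original, exported in allocate_export_names(
    ["Test 18:10", "Test 18_10", "Test 18?10"]
):
    print(f"{original!r} -> {exported!r}")
```

The sketch ignores file extensions; a real implementation would probably insert the suffix before the extension so that "a:b.txt" does not become "a_b.txt_2".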

Robert Lyon (robertl-9) wrote :

Hi Cecilia

I believe option 1 is the preferred choice, because restricting the filenames in an instance of Mahara is not needed if no one exports anything. It is also not needed if users export and display on the same OS infrastructure.

It only becomes a problem when exporting from a system that can handle more things and then importing to a system that can't handle some of those things (see comment #12)

So we should only deal with this problem where it makes most sense, and that is during export. We should make our exported zipped up content compatible with more systems.

So please continue working on a fix for option 1

Changed in mahara:
milestone: 18.10.0 → 19.04.0
Changed in mahara:
milestone: 19.04.0 → 19.04.1
Changed in mahara:
milestone: 19.04.1 → none
Mahara Bot (dev-mahara) wrote :

Patch for "main" branch: https://reviews.mahara.org/12083

Changed in mahara:
milestone: none → 22.04.0
Changed in mahara:
assignee: Cecilia Vela Gurovic (ceciliavg) → nobody
Changed in mahara:
milestone: 22.04.0 → 22.10.0
Robert Lyon (robertl-9) wrote :

Maybe we should build the folder path from the IDs of the folders instead of their names, to avoid special chars breaking things

Changed in mahara:
milestone: 22.10.0 → 23.04.0
Kristina Hoeppner (kris-hoeppner) wrote :

We would need to rethink the approach, e.g. use an ID instead of the folder or file title. But it shouldn't be a number ID that can be enumerated but it would be better to use a random sequence of letters, numbers, and some special characters to make guessing the ID more difficult (see how social media sites do that). That would enhance the security as URLs can't be guessed. While that is not so important for a download because that lives locally, it is important for the online viewing.
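A minimal Python sketch of such a hard-to-guess ID, using the standard library's `secrets` module. The name `new_public_id` and the default of 16 random bytes are assumptions for illustration; nothing in this report says how Mahara would generate or store such IDs.

```python
import secrets


def new_public_id(nbytes: int = 16) -> str:
    """Return a random, URL-safe identifier (hypothetical helper).

    secrets.token_urlsafe draws from a cryptographically strong source and
    uses only letters, digits, "-" and "_", so the result is also safe as a
    file or folder name on Windows and Linux.
    """
    return secrets.token_urlsafe(nbytes)


print(new_public_id())
```

Because the ID comes from a random source rather than a counter, knowing one ID does not help in guessing another.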

Changed in mahara:
status: In Progress → Confirmed
Kristina Hoeppner (kris-hoeppner) wrote :
Robert Lyon (robertl-9) wrote :

What we need is for the path to the files to be numerical in the export file structure, while still having the actual names in the HTML markup.

This will mean adjusting the HTML export so that we can correctly create the HTML markup links / file path.
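The idea can be sketched in Python as follows: the path on disk uses only the numeric ID, and the human-readable title appears only as escaped link text. The `files/<id>/index.html` layout and the name `folder_link` are assumptions for illustration, not the actual export structure.

```python
import html
from urllib.parse import quote


def folder_link(folder_id: int, title: str) -> str:
    """Build an HTML link whose target path contains only the numeric ID.

    Hypothetical helper: the real title is kept as link text and is
    HTML-escaped, so it may contain ":", quotes, or accented characters.
    """
    href = quote(f"files/{folder_id}/index.html")
    return f'<a href="{href}">{html.escape(title)}</a>'


print(folder_link(42, 'Test 18:10 "é"'))
```

With this split, no character from the title ever reaches the file system, so the forbidden-character and encoding problems discussed above do not arise for folder names.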

Robert Lyon (robertl-9) wrote :

We will repurpose the WIP bug in gerrit to make the actual fix

Changed in mahara:
importance: High → Medium
milestone: 23.04.0 → none