Zim

RTL text turns to LTR when exported to html

Bug #303108 reported by Hezy Amiel
Affects: Zim
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

When I write Hebrew text in Zim, it is rendered correctly from right to left (RTL). However, when I export this text to HTML, the new file appears to be left-to-right (LTR).

I suspect the reason is that, with regard to RTL, Zim is relying on the good infrastructure of GNOME and Linux. HTML, on the other hand, doesn't automatically distinguish RTL text, and an RTL tag must be inserted into the parts of the text that are RTL. I can insert an RTL mark from the right-click menu in Zim, but as far as I can see it has no effect on the exported HTML.

Zim could be much more useful for RTL languages (Hebrew, Arabic, Farsi) if it automatically inserted the RTL tags into blocks of RTL text when exporting to HTML. If that is difficult to implement, at least an option to manually add these tags from within Zim would improve things.

I'm using zim 0.26

Tags: html lingua
Revision history for this message
Jaap Karssenberg (jaap.karssenberg) wrote :

I have no experience using RTL languages but a quick internet search indicates that for HTML we need to set an explicit attribute 'dir="RTL"' per page / paragraph or text span. So it seems easy to make a template that sets all text to RTL. However mixed text might be a problem. I need to check how other applications handle this.

Could you attach an example of a simple HTML page containing both RTL and LTR text?

Changed in zim:
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
dotancohen (dotancohen) wrote :

Here are two examples of almost identical pages, the only difference being the dir="rtl" declaration in one:
http://maayancohen.com/
http://meiravcohen.com/

The accepted way of choosing whether to declare RTL or LTR in mixed documents is to assume LTR for all elements, unless the first character is an RTL character. After that, it is up to the browser to render the text correctly (most browsers today do). I suppose that a simple regex would do (untested PHP, where $text is the text of the <p> element):

<?php
// $text is assumed to be UTF-8. The /u modifier makes PCRE treat the
// pattern and the subject as UTF-8, so the ranges match whole characters
// rather than single bytes; ^ anchors the test to the first character.
if ( preg_match( '/^[ا-يא-ת]/u', $text ) ) {
    $dir = "rtl";
} else {
    $dir = "ltr";
}
print "<p dir='$dir'>";
?>

Revision history for this message
Hezy Amiel (hezy) wrote :

Dotan, thanks for your help :)

Revision history for this message
Jaap Karssenberg (jaap.karssenberg) wrote :

Unfortunately that code snippet only handles Hebrew characters. Ideally I would need a more high-level way to tell whether a character is RTL in Python. So far I cannot find a good method for that.

Revision history for this message
dotancohen (dotancohen) wrote :

No, the code snippet handles Arabic and Farsi as well.

You may want to ask on the Python-il list, which is an English-language list:
http://groups.google.com/group/python-il/about
It wouldn't hurt to mention that you don't speak Hebrew or Arabic when posting, though.

Revision history for this message
Jaap Karssenberg (jaap.karssenberg) wrote :

I recognize the first two characters in the regex as Hebrew. I assume the second two map to a range including Arabic and Farsi then? What are the Unicode character numbers of these sequences?

Revision history for this message
dotancohen (dotancohen) wrote :

According to Wikipedia, these are the ranges for Arabic Unicode:
U+0600 to U+06FF
U+0750 to U+077F
U+FB50 to U+FDFF
U+FE70 to U+FEFF

And these are the ranges for Hebrew Unicode:
U+0590 to U+05FF
U+FB1D to U+FB40

I do not know for certain if this is adequate for Persian (Farsi), Ajir, and Yiddish. I have asked on the Python-il list for a possibly better way:
http://groups.google.com/group/python-il/browse_thread/thread/4e6cbe5891240984
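
As a rough illustration only, the ranges above could be turned into a Python lookup like the sketch below. The names RTL_RANGES, in_rtl_range and starts_with_rtl are made up for this example and are not part of Zim. It tests only the very first character, so leading spaces or punctuation make it answer "not RTL", and the ranges may not cover every RTL script.

```python
# Code point ranges copied from the lists above (inclusive bounds).
RTL_RANGES = [
    (0x0590, 0x05FF),  # Hebrew
    (0xFB1D, 0xFB40),  # Hebrew presentation forms, as listed above
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
]


def in_rtl_range(char):
    """Return True if char falls inside one of the ranges listed above."""
    code = ord(char)
    return any(low <= code <= high for low, high in RTL_RANGES)


def starts_with_rtl(text):
    """Hypothetical helper: look only at the very first character."""
    return bool(text) and in_rtl_range(text[0])
```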

Revision history for this message
Beni Cherniavsky (cben) wrote :

Summary of findings on Python-il list:

[1] The correct way to determine direction is described in the Unicode Bidirectional Algorithm. It's not only correct in principle - it's exactly what Gnome does, which is what you want to follow (since the whole point of this bug is mismatch between GUI and exported HTML).

In short, the Unicode algorithm says you should skip punctuation and similar neutral characters, and decide according to the first "strongly directional" character. You cannot easily approximate it with a regexp.
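
A minimal sketch of that "first strongly directional character" rule, using only the Python standard library (written for Python 3, which is an assumption; base_direction is a hypothetical name). It relies on unicodedata.bidirectional(), which returns the Unicode bidi class of a character. It is a simplification: it ignores explicit embedding and isolate control characters, so it is not a full implementation of the Unicode Bidirectional Algorithm.

```python
import unicodedata


def base_direction(text):
    """Return 'rtl', 'ltr', or None if text has no strong character."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in ("R", "AL"):  # Hebrew-type or Arabic-type letter
            return "rtl"
        if bidi_class == "L":          # strong left-to-right character
            return "ltr"
    # Only neutral or weak characters (spaces, punctuation, digits) were seen.
    return None
```

Used on the kind of strings discussed here, base_direction(' .. . שלום') gives 'rtl' because the dots and spaces are skipped, while base_direction('Hello שלום') gives 'ltr'.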

[2] Since Zim uses PyGTK, you have access to Pango, which happens to contain a function that determines the base direction according to the Unicode Bidi algorithm (in fact, it's the very function that Gnome uses):

>>> import pango
>>> a = ' .. . שלום'
>>> b = 'Hello שלום'
>>> pango.find_base_dir(a, len(a))
<enum PANGO_DIRECTION_RTL of type PangoDirection>
>>> pango.find_base_dir(b, len(b))
<enum PANGO_DIRECTION_LTR of type PangoDirection>
>>> pango.find_base_dir(a, 4)
<enum PANGO_DIRECTION_NEUTRAL of type PangoDirection>

You should emit dir="rtl" if it returns pango.DIRECTION_RTL, dir="ltr" if it returns pango.DIRECTION_LTR.

If it returns pango.DIRECTION_NEUTRAL, Gnome has some extra heuristics beyond those suggested by the Unicode standard. Since our goal here is to match Gnome's behaviour, you should repeat them:

* A neutral paragraph takes the same direction as the last paragraph before it that had a definite direction.
* If the document begins with neutral paragraph(s), they take the direction of the first following paragraph that has a definite direction.

The original mails:

[1] http://hamakor.org.il/pipermail/python-il/2009-January/000243.html
[2] http://hamakor.org.il/pipermail/python-il/2009-January/000246.html

Revision history for this message
Beni Cherniavsky (cben) wrote :

Addition about handling of neutral paragraphs:

* If the whole document is neutral, it defaults to LTR.
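
The three rules for neutral paragraphs could be sketched in Python roughly as follows. This is an illustration, not Gnome's or Zim's actual code; resolve_directions is a hypothetical name, and each paragraph is assumed to be already classified as 'rtl', 'ltr' or None (neutral).

```python
def resolve_directions(directions):
    """Replace None (neutral) entries in a list of 'rtl'/'ltr'/None values."""
    # Direction of the first paragraph with a definite direction;
    # falls back to LTR when the whole document is neutral.
    first_strong = next((d for d in directions if d is not None), "ltr")

    resolved = []
    current = first_strong  # applies to leading neutral paragraphs
    for direction in directions:
        if direction is not None:
            current = direction  # later neutrals follow this paragraph
        resolved.append(current)
    return resolved
```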

About HTML conversion: HTML bidi is hierarchical - elements inherit from their parents.
So there are several ways to assign dir= attributes that are equivalent in the direction they give to paragraphs.
But they may still affect page layout. It's better to set the direction(s) as high as possible:

* Assign the direction of the first strong paragraph in the document to the whole document body: <body dir="rtl"> or <body dir="ltr">.
* Put a <div dir="rtl/ltr"> around each group of paragraphs that should have the same direction (including trailing neutral ones).

For example:

1. neutral
2. LTR
3. neutral
4. RTL
5. RTL
6. neutral
7. LTR

is best coded as:

<body dir="ltr">
 <p>1</p>
 <div dir="ltr">
  <p>2</p>
  <p>3</p>
 </div>
 <div dir="rtl">
  <p>4</p>
  <p>5</p>
  <p>6</p>
 </div>
 <div dir="ltr">
  <p>7</p>
 </div>
</body>

If this seems over-complex, it should be noted that just assigning dir= to strongly-directioned paragraphs would already solve 95% of this bug!
Neutral paragraphs are just small border-case improvements. And the body/div stuff is mostly invisible; it would only slightly affect images and header/footer when printing, things like that...
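
For anyone who does want the body/div grouping, here is one possible Python 3 sketch of it. The function name render_body and the (text, direction) input format are assumptions made for this example, not Zim's exporter API; direction is 'rtl', 'ltr' or None for a neutral paragraph.

```python
from html import escape
from itertools import groupby


def render_body(paragraphs):
    """paragraphs: list of (text, direction) tuples; returns an HTML string."""
    strong = [d for _, d in paragraphs if d is not None]
    body_dir = strong[0] if strong else "ltr"  # all-neutral defaults to LTR
    lines = ['<body dir="%s">' % body_dir]

    # Leading neutral paragraphs simply inherit the direction of <body>.
    index = 0
    while index < len(paragraphs) and paragraphs[index][1] is None:
        lines.append(" <p>%s</p>" % escape(paragraphs[index][0]))
        index += 1

    # Every later neutral paragraph joins the direction that precedes it.
    resolved = []
    current = None
    for text, direction in paragraphs[index:]:
        current = direction or current
        resolved.append((text, current))

    # One <div> per run of consecutive paragraphs with the same direction.
    for direction, group in groupby(resolved, key=lambda item: item[1]):
        lines.append(' <div dir="%s">' % direction)
        for text, _ in group:
            lines.append("  <p>%s</p>" % escape(text))
        lines.append(" </div>")
    lines.append("</body>")
    return "\n".join(lines)
```

Fed the seven-paragraph example (neutral, LTR, neutral, RTL, RTL, neutral, LTR), this produces the same nesting as the hand-written HTML in this comment.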

Revision history for this message
Jaap Karssenberg (jaap.karssenberg) wrote :

Thanks for the pointers on how to fix this. Patch committed, will be mirrored by lp in about 4h.

I now use the Pango function to decide on the dir property per <h1> .. <h5>, <p> or <pre> element.
My test data shows OK, but it could use more extensive testing as I don't use any RTL scripts myself.

Changed in zim:
status: Incomplete → Fix Committed
Revision history for this message
Jaap Karssenberg (jaap.karssenberg) wrote :

Can someone using RTL scripts confirm this solution works correctly? Thanks!

You can get the development code by installing Bazaar and running "bzr co lp:zim". You can run the development version directly from the source, without replacing your installation, by using "./bin/zim".

Revision history for this message
dotancohen (dotancohen) wrote :

I can confirm that the current solution prints HTML that looks the same as the Zim interface looks. There are some minor issues, but they are consistent between the Zim interface and the printed HTML and are therefore not related to this bug, which can be closed as fixed.

Thanks, Jaap! I want to commend you on working hard to fix this issue which I know does not affect you directly. You have once again proven that Zim is intended for real people to use, not just to scratch the developer's itch.

Revision history for this message
Jaap Karssenberg (jaap.karssenberg) wrote :

Fixed in 0.28

Changed in zim:
status: Fix Committed → Fix Released