For years I have been trying to explain to anyone who would listen that the Unicode Bidi Algorithm has a fundamental flaw. The problem was that I did not have strong practical examples… and then along came Twitter.

My complaint is that Unicode bidi treats most characters that are not letters or numbers as neutrals and, in many cases, this is not correct. A neutral takes its direction from the surrounding strong characters or, when those disagree, from the dominant direction, e.g.:

[Arabic letter][neutral][Arabic letter] --> [right][right][right]
[Arabic letter][neutral][English letter] --> [right][right][left] if the direction is Arabic
                                          or [right][left][left] if the direction is English.
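
The three cases above can be reproduced with a deliberately simplified sketch. It assumes exactly one neutral sitting between two strong characters and ignores everything else the real algorithm does (numbers, weak types, embeddings). The function names are mine for illustration; only `unicodedata.bidirectional` comes from Python's standard library.

```python
import unicodedata

def strong_direction(ch: str) -> str:
    """Map a strong character to 'right' or 'left' using its Unicode bidi class."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in ("R", "AL"):   # Hebrew-type and Arabic-type letters
        return "right"
    if bidi_class == "L":           # Latin-type letters
        return "left"
    raise ValueError(f"{ch!r} is not a strong character (class {bidi_class!r})")

def resolve_neutral(before: str, after: str, base: str) -> str:
    """Direction taken by a single neutral between two strong characters.

    `base` is the dominant direction of the text, 'right' or 'left'.
    Simplified: the neutral follows its neighbours when they agree,
    otherwise it falls back to the dominant direction.
    """
    d_before = strong_direction(before)
    d_after = strong_direction(after)
    return d_before if d_before == d_after else base

ARABIC = "\u0628"  # ARABIC LETTER BEH, used here as a sample Arabic letter

print(resolve_neutral(ARABIC, ARABIC, "left"))  # right
print(resolve_neutral(ARABIC, "a", "right"))    # right
print(resolve_neutral(ARABIC, "a", "left"))     # left
```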

This is all fine when the algorithm is placing commas and periods in sentences. Unicode bidi also takes care of many special cases, e.g. when to treat the period (U+002E) as a full stop or as a decimal separator.

But the usage of punctuation varies over time. Arabic users may use slashes as a date separator today and switch to hyphens in the future. Then Twitter presents a whole new ball game. The ‘:’, ‘@’ and ‘#’ are redefined as textual letters which should take their direction from the word they start, e.g. #bidi or @ironymark. The recommended way to correct this is to add directional characters (e.g. the left-to-right mark, U+200E). But you cannot do this without eating into the 140-character limit, causing problems with search, or breaking the myriad of Twitter tools out there. On top of this, Twitter does not define a right-to-left interface, so users must fend for themselves. This causes the following problem:
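
To see what the algorithm has to work with, Python's standard `unicodedata` module reports the bidi class of each character. In the sketch below none of ‘:’, ‘@’ and ‘#’ has a strong class (L, R or AL): ‘@’ is a plain neutral, and ‘:’ and ‘#’ belong to classes that the algorithm ends up treating as neutral outside a numeric context. The left-to-right mark, by contrast, is strong. The second half only illustrates the cost of adding marks: the helper `add_marks`, where it places the marks, and the sample tweet are my assumptions, and characters are counted as Python code points, which may not match how Twitter counts.

```python
import re
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
STRONG_CLASSES = {"L", "R", "AL"}

def is_strong(ch: str) -> bool:
    """True if the character has a direction of its own."""
    return unicodedata.bidirectional(ch) in STRONG_CLASSES

# Bidi class of the characters Twitter has given a new meaning, plus the LRM.
for ch in (":", "@", "#", LRM):
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), is_strong(ch))

def add_marks(tweet: str) -> str:
    """Put an LRM in front of every ASCII @name and #tag.

    The placement is illustrative only; the point is what it costs.
    """
    return re.sub(r"(?<!\S)[@#][A-Za-z0-9_]+", lambda m: LRM + m.group(0), tweet)

# Sample tweet; the Arabic word is a placeholder for a real message.
tweet = "RT: @tweeter1 @tweeter2 #arabic \u0645\u0631\u062d\u0628\u0627"
marked = add_marks(tweet)

# Every mark is invisible on screen but still uses up one character.
print(len(tweet), len(marked), len(marked) - len(tweet))
```

A search for the literal text `@tweeter1` still finds a match in `marked`, but any tool that compares whole words or whole tweets now sees a different string.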

Screen shot 2009-10-29 at 00.00.40.png
from @GVinarabic.

The above entry has three main elements: the title of a blog post in Arabic, the phrase “translated by” followed by the @name of the translator, and the URL. Simple? No. Since the direction is left-to-right, the user has typed the text in the following order:

@nightS [the title in Arabic] - [translated by] - [the URL]

That is, the word that should come last when reading the tweet was typed first to make the text look correct. In another case we see:

Screen shot 2009-10-29 at 00.09.56.png

Now the user typed: [title] - @MuhammadAdel : [translated by] - [URL]

Here the order of the text was modified to force bidi to give the right visual result when the tweet fills two lines. Any future attempt to translate these tweets or search for “translated by @nightS” will fail.

If one forces the direction to right-to-left, there are other problems:

Screen shot 2009-10-29 at 08.04.14.png

The @ has been separated from @gr33ndata and put at the other end of the tweet. This is actually the thin end of the wedge. URLs can get reordered until they are unreadable, and things get nastier in more complicated tweets that are re-tweeted or refer to popular #tags, e.g.:

RT: @tweeter1 @tweeter2 #arabic [some short message in Arabic] [URL]

Here the right-to-left reader would want to see:

[URL] [some short message in Arabic] #arabic @tweeter2 @tweeter1 :RT

To get the right effect I need the following markup:

  <div dir="rtl">
    <span dir="ltr">RT</span>&rlm;:
    <span dir="ltr">@tweeter1</span>&rlm;
    <span dir="ltr">@tweeter2</span>&rlm;
    <span dir="ltr">#arabic</span>&rlm;
    [some short message in Arabic] &rlm;
    <span dir="ltr">[URL]</span>
  </div>
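
Markup of this shape can be generated mechanically from the plain tweet. The following is a minimal sketch, not a finished algorithm: the token patterns (a literal `RT`, ASCII-only @names and #tags, `http`/`https` URLs) are my guesses rather than Twitter's rules, and the function name is hypothetical. It also emits an `&rlm;` after every span, including the last one, so it does not reproduce the hand-written markup above character for character.

```python
import html
import re

# Tokens that should keep left-to-right order internally, optionally
# followed by trailing punctuation such as the ':' in "RT:".
# ASCII-only on purpose: an Arabic #tag should not be forced to LTR.
LTR_TOKEN = re.compile(r"(RT|[@#][A-Za-z0-9_]+|https?://\S+)(\W*)")

def tweet_to_rtl_html(tweet: str) -> str:
    """Wrap LTR tokens in <span dir="ltr"> and follow each with &rlm;."""
    parts = []
    for token in tweet.split():
        match = LTR_TOKEN.fullmatch(token)
        if match:
            body, trailing = match.groups()
            parts.append(
                f'<span dir="ltr">{html.escape(body)}</span>&rlm;'
                f"{html.escape(trailing)}"
            )
        else:
            # Everything else (the Arabic message) follows the RTL base.
            parts.append(html.escape(token))
    return '<div dir="rtl">' + " ".join(parts) + "</div>"

print(tweet_to_rtl_html("RT: @tweeter1 @tweeter2 #arabic"))
```

Even this small sketch has to make decisions, such as what counts as a name and where trailing punctuation belongs, that every Twitter client would need to make in the same way.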

Rewiring the bidi algorithm just for Twitter is a non-starter, so one must solve this at the display end. That can only be done by injecting spans and directional marks such as &rlm;. But we then have a rather messy collection of markup and injected direction characters, which can lead to all sorts of problems once you have to handle these in a text editor or convert between markup and plain text.
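
One of those conversion problems is easy to demonstrate. The sketch below turns such markup back into plain text by stripping tags and directional marks; the function name and the crude tag-stripping regex are my own simplifications, assumed only for illustration. Because it cannot tell an injected mark from one the user typed, it removes both.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")                          # crude tag matcher, enough for this markup
DIRECTIONAL_MARKS = str.maketrans("", "", "\u200e\u200f")  # LRM and RLM

def markup_to_plain(markup: str) -> str:
    """Strip tags, decode entities and drop every LRM/RLM."""
    text = TAG.sub("", markup)
    text = html.unescape(text)            # turns &rlm; into U+200F
    text = text.translate(DIRECTIONAL_MARKS)
    return " ".join(text.split())         # normalise whitespace

markup = (
    '<div dir="rtl"><span dir="ltr">RT</span>&rlm;: '
    '<span dir="ltr">@tweeter1</span>&rlm; '
    '<span dir="ltr">#arabic</span>&rlm;</div>'
)
print(markup_to_plain(markup))            # RT: @tweeter1 #arabic

# A mark typed by the user is lost as well: the conversion cannot
# distinguish it from the injected ones.
print(markup_to_plain("a\u200eb") == "ab")  # True
```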

This is possible, but to do it right one must implement an algorithm that can be ported to every web language and every text editor that handles Twitter, and then have it standardised across the industry.

The real solution is for a future version of HTML to define simple markup that gives better control over how spans are ordered, without forcing anyone to inject &rlm; characters. That will be the subject of a future post.