Mashing-up Bi-Di
Mash-ups are a relatively new fashion in web design - taking bits of other web sites to build up your own web page. The idea itself is not new or special - any search engine showing a small summary of a web site that it has found is a form of mash-up. Integrating an RSS feed is another. And it seems that every company and their mother has its own mash-up API. But what happens when an Arabic web site integrates content that may be Arabic or English or both? It can be hard to predict how to mark up the integrated content for the right result. Here is a good example. Google makes a heroic effort to correctly align content in its Arabic web sites - but it still gets it wrong.
For example - here is the result of a search for “Arab” in the Arabic google.ae:

As you see - the green web addresses look good, the text of the first result is right-to-left (note the “…” comes visually after the last Arabic word). Full marks for the second result - the English text is correctly ordered left-to-right (note the “…” is visually after the last English word).
But something has gone wrong with the third result. Let's look at it more closely with the HTML below…
As you can see from the mark-up, the “…” is the last character in the snippet but appears in the middle of the line. The Arabic word “مصر” is drawn in a completely different location from the place where it appears logically. “ArabChat.com” should be the first word in the title. What has happened is that Google has flagged this site as an Arabic site and allowed the Unicode BiDi algorithm to treat it as right-to-left text, which gives confusing results.
Things can get even more confusing if the text contains mixed Arabic and English, as this result from the mobile edition of Google shows:
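To make the point concrete, here is a minimal sketch in Python (my choice of language for illustration) of what an integrator could do when it already knows the direction of a snippet: escape the text and wrap it in an element with an explicit `dir` attribute, so the surrounding right-to-left page does not decide how the snippet is ordered. The function name `wrap_snippet` is hypothetical, and this is not a description of what Google actually does.

```python
from html import escape


def wrap_snippet(text, direction):
    """Wrap plain text in a <span> with an explicit base direction.

    `direction` must be "ltr" or "rtl". The text is HTML-escaped so the
    snippet cannot inject mark-up into the host page.
    """
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    return '<span dir="{}">{}</span>'.format(direction, escape(text))


# An English title containing one Arabic word, forced to left-to-right
# even if the host page is right-to-left.
html_fragment = wrap_snippet("ArabChat.com - مصر ...", "ltr")
```

With the base direction stated explicitly, the BiDi algorithm still reorders the Arabic word inside the snippet, but the snippet as a whole is laid out left-to-right. The hard part, of course, is knowing which direction to pass in.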

I defy anyone to work out what the second result actually says. It should be:
2 Arabic language - Wikipedia, the free encyclopedia - 5 Dec 2008 … Arabic (العربية al-ʿarabiyyah; less formally: عربي ʿarabi) … Standard Arabic ( - en.wikipedia.org/wiki/Arabic_language
So in a world where people are increasingly mashing-up their web pages, what is the solution for BiDi languages? As one can see, it is not always possible to know beforehand if the piece you are integrating is primarily English or Arabic. ArabChat.com is an Arabic web site, yet the snippet found was in English. And when the Unicode BiDi algorithm is left to its own devices in the wrong primary direction, the result can be unreadable.
I think there needs to be additional mark-up to handle these correctly. And this needs to be a standard approach - one that is supported by all the web companies. What I suggest is treating this in layers.
- At the lowest level there needs to be a parser to spot URLs and wrap these correctly, a parser to spot brackets and make sure the open bracket matches the direction of the close bracket, and so on.
- The next level up would be a standard way to guess if a stream of text or HTML is primarily right-to-left or left-to-right.
- And the last level is agreed standards for mash-up APIs, XML feeds and XSL transforms that define the intention of the creator of the content.
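As a rough illustration of the first two layers, here is a sketch in Python. It is only one possible approach under stated assumptions: the direction guess simply counts strong right-to-left characters against strong left-to-right ones (a heuristic I picked for the example, not an agreed standard), and the URL pattern only spots addresses starting with `http://`, `https://` or `www.`, so a bare address like en.wikipedia.org/wiki/Arabic_language would slip through. All function names are hypothetical. Bracket handling is left out.

```python
import re
import unicodedata
from html import escape

# Assumption: a deliberately simple URL pattern for illustration only.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>]+")


def guess_direction(text):
    """Guess the primary direction by majority of strong characters."""
    rtl = ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew, Arabic and similar letters
            rtl += 1
        elif bidi_class == "L":         # Latin and other LTR letters
            ltr += 1
    # Ties and text with no strong characters fall back to left-to-right.
    return "rtl" if rtl > ltr else "ltr"


def wrap_urls(text):
    """Escape text and wrap each URL in a left-to-right span."""
    parts = []
    last = 0
    for match in URL_RE.finditer(text):
        parts.append(escape(text[last:match.start()]))
        parts.append('<span dir="ltr">' + escape(match.group(0)) + "</span>")
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


def mark_up_snippet(text):
    """Combine both layers: guessed base direction plus isolated URLs."""
    return '<div dir="{}">{}</div>'.format(guess_direction(text), wrap_urls(text))
```

For example, `mark_up_snippet("Arab chat www.arabchat.com")` wraps the whole snippet in a left-to-right `div`, while a snippet made up mostly of Arabic letters would get `dir="rtl"`. A simple majority count can still guess wrong for short or evenly mixed snippets, which is why the third layer, where the content creator states the intended direction, still matters.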
I hope to cover this in my following posts. And please let me have your thoughts in the comments.

