Mashing Up Bidi - Unicode Conference Slides

At the 33rd Unicode Conference I gave a presentation about the problems I encountered in my previous post, Mashing Up Bidi. Here are the summary and slides:

Mashing-up Bi-Di

“Mash-up” is a relatively new, fashionable word on the Web - taking bits of other web sites to build up your own web page. The idea is not new or special - any search engine showing a snippet of a web site that it has found is a form of mash-up. Integrating a news or micro-blogging feed is another. And it seems that every company and their mother has its own mash-up API. But what happens when an Arabic web site integrates content that may be Arabic or English or both? The Unicode Bi-Di Algorithm can render text and numbers unreadable. URLs may become unusable or, in the worst case, direct to fraudulent sites. It can be hard to predict how to mark up the integrated content for the right result. This presentation will cover real-world issues and attempt to suggest practical solutions.

The Unicode Bi-Di Algorithm has been a great benefit for software in general. It provides a unified way for rendering mixed right-to-left (e.g. Arabic) and left-to-right (e.g. English) text across all kinds of software and devices. However, if the original direction of a piece of text is lost, applying the wrong direction may render the text unreadable. This is especially a problem on the web.

Text that appears on a web site may have passed through many stages before being rendered in its final place and can easily lose the markup that specified its direction when it was created. For example: a search engine showing a single sentence from each web site it chooses for a query; a site integrating a list of the statuses of friends on a social networking site; a blog displaying ‘trackbacks’ to its posts.

Web browsers use the Bi-Di Algorithm to order the text they display, and displaying English text in a right-to-left context, or Arabic in a left-to-right one, can make the text hard to read. Some web companies make a heroic effort to correctly align content in their Arabic web sites - but they still get it wrong. In a world where people are increasingly mashing up their web pages, what is the solution for BiDi languages?

I will try to answer this by suggesting the additional mark-up to handle such cases correctly and the kind of methods that can be employed to recognize the text’s correct direction.

  1. At the lowest level there needs to be a parser to spot URLs and wrap these correctly, and a parser to spot brackets and make sure the open bracket matches the direction of the close bracket.
  2. The next level up would be a standard way to guess if a stream of text or HTML is primarily right-to-left or left-to-right.
  3. And the last level is agreed standards for mash-up APIs, XML feeds and XSL transforms that define the intention of the creator of the content.
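
To make the second level a little more concrete, here is a minimal sketch of one possible heuristic: strip the markup, then compare the number of Arabic or Hebrew letters with the total number of letters. The function name guess_direction is made up for this example, the left-to-right fallback is an assumption, and other right-to-left scripts are ignored. It is not a standard, just an illustration of the kind of guess involved.

```php
<?php
// Hypothetical helper, not part of any standard: guess whether a piece
// of text or HTML is primarily right-to-left or left-to-right.
function guess_direction($text)
{
    // Drop markup so tag names like <div> do not count as Latin letters.
    $plain = strip_tags($text);

    // Count all letters; the /u flag makes PCRE treat the subject as UTF-8.
    $letters = preg_match_all('/\p{L}/u', $plain, $all);

    // Count only letters that belong to the Arabic or Hebrew scripts.
    // (Assumption: other right-to-left scripts are ignored in this sketch.)
    $rtl = preg_match_all('/(?=\p{L})[\p{Arabic}\p{Hebrew}]/u', $plain, $found);

    // Majority vote; ties and letterless input fall back to left-to-right.
    return ($rtl * 2 > $letters) ? 'rtl' : 'ltr';
}

echo guess_direction('ArabChat.com - chat rooms'), "\n";
echo guess_direction('<div class="x">مصر</div>'), "\n";
```

A majority vote was chosen here because a snippet from an Arabic site can still be mostly English; a first-strong-character rule would be an equally defensible starting point.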

The presentation will conclude with a proposal for a standard approach - hopefully one that can be supported by all the web sites.

Comments?

Evil Potato Head from Outer Space

Be afraid, be very afraid…

Evil potato head from outer space

Banging my head on Pipes

While preparing my Unicode presentation I thought it would be really nice to show a simple demonstration. The topic of my presentation is “Mashing Up Bidi”: just how messy it can be to mix Arabic and English content in a mashed-up web page. The idea is to take an English Twitter or RSS feed, put it in an Arabic page, and then see what needs to be done to keep it readable.

So I threw together a simple portal demo, made it right-to-left and decided to use Yahoo! Pipes to handle the manipulation of the text in the feed. Pipes has a really nice feature that lets it call your own web service to do some of the text handling.

The web service is simplicity itself. Pipes sends JSON in an HTTP POST and your service returns the modified JSON data in reply. Nothing could be easier. I was wrong. First I wrote a simple PHP service that receives the HTTP POST, calls json_decode (which is now a standard feature of PHP 5), plays with the data and then calls json_encode to return the results:

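<?php
// Pipes posts the feed as JSON in the 'data' form field.
$json = json_decode($_POST['data']);
for ($i = 0; $i < count($json->items); $i++) {
    // do the text manipulation on $json->items[$i] here
}
// Send the modified feed back to Pipes as JSON.
header('Content-Type: application/json');
print json_encode($json);
?>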

It all looked OK until I saw that URLs, HTML fields and, worse, all non-ASCII Unicode text were being converted into mojibake once they re-entered the pipe. Pipes has really nice testing features, and it lets you check the output in many ways. So I made the pipe output JSON and compared it to the output of the PHP service. Both looked the same. So maybe there was a bug in the communication between the pipe and the PHP service. I sent the POST data straight back in the PHP output, and it worked. So the comms were OK. Maybe there was a problem decoding the JSON. So I explicitly set a field to my own string containing a URL. It failed. So the JSON going in works, the converted JSON looks OK but fails, and all seems to be OK in the PHP. At this point I started to get a little frustrated.

Next step - check the data sent to the web service. Pipes also has excellent debugging features. So I returned the HTTP POST data from my PHP service, but this time inserted an open quote (”) to force Pipes to output an error showing the actual JSON data it gets back.

Now I noticed a difference. When Pipes sends JSON to a web service it does not escape it in any way - you just get raw UTF-8 characters. When PHP encodes JSON it escapes every non-ASCII character as a \uXXXX sequence, and also escapes some ASCII ones (e.g. “/” gets written as “\/” so that it can be parsed safely). However, Pipes makes no attempt to read or convert these escapes, so they just become part of the text, causing instant mojibake.
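
To see those escapes for yourself, here is a small sketch. The URL is a made-up example; the point is only that the escaped form and the raw form are the same string as far as a proper JSON parser is concerned.

```php
<?php
// The same string, once as raw UTF-8 and once the way json_encode()
// writes it by default. The URL is an invented example.
$value = 'http://example.com/مصر';

// By default the result is pure ASCII: every slash becomes \/ and
// every Arabic letter becomes a \uXXXX escape.
$escaped = json_encode($value);
echo $escaped, "\n";

// A conforming parser turns the escapes back into the original text.
// This is the step the receiving side has to perform.
var_dump(json_decode($escaped) === $value); // bool(true)
```

If the receiver skips that decoding step, the backslash sequences are shown literally, which matches the mess described above.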

So it was time to look for a more suitable JSON encoder. First, I saw that json_encode takes a parameter to control the output encoding - but not in my version of PHP - and the controls were limited to a few characters. So I used this really useful page comparing the output of various PHP encoders. Zend seemed to do the job for me.

I installed Zend, made sure it did not call PHP’s built-in encoder, and called it like this:

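<?php
require_once 'Zend/Json.php';
// Pipes posts the feed as JSON in the 'data' form field.
$json = json_decode($_POST['data']);
for ($i = 0; $i < count($json->items); $i++) {
    // do the text manipulation on $json->items[$i] here
}
header('Content-Type: application/json');
// Use Zend's own PHP encoder rather than PHP's json_encode().
Zend_Json::$useBuiltinEncoderDecoder = true;
print Zend_Json::encode($json);
?>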

Now the URLs were OK but Unicode was still messed up. Someone had ‘fixed’ Zend to insert escapes. No problem, it’s all PHP, so I edited Zend/Json/Encoder.php, commented out the line that encodes Unicode and presto - I have a working web service.
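
For anyone on a later PHP, patching a library should not be necessary. From PHP 5.4 onwards json_encode accepts the flags JSON_UNESCAPED_UNICODE and JSON_UNESCAPED_SLASHES. The sketch below assumes such a version; the helper name encode_raw and the sample data are made up, and I have not run this against Pipes itself.

```php
<?php
// Hypothetical helper for this sketch (needs PHP 5.4 or later):
// encode to JSON without \uXXXX or \/ escapes, leaving raw UTF-8.
function encode_raw($data)
{
    return json_encode($data, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
}

// Invented sample item, standing in for one entry of the feed.
echo encode_raw(array('link' => 'http://example.com/مصر')), "\n";
```

The output keeps the slashes and the Arabic letters exactly as they were typed, which is the form Pipes was sending in the first place.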

What really annoys me is that it is 2009 and you still have to jump through hoops to get anything with text just slightly more complex than plain ASCII to work. So whose fault is it? Well Yahoo! has to take a big share of the blame. They are one of the most active supporters of JSON and do not parse it properly inside of their own services.

But really it is the laxness of the JSON standard (and most other standards) when it comes to text. There should be only one way to pass a string and that is either plain Unicode or encoded ASCII. Not both.

Maren, Microsoft’s Final Insult

Now don’t get me wrong. I like Microsoft and they can come up with some really good technology. But you just have to wonder about their marketing department sometimes.

Their latest Arabic effort, Maren, is arguably a nice tool (albeit not an original idea) that intelligently converts transliterated Arabic typed on an English keyboard into real Arabic on the fly.

Then MS marketing gets hold of it and adds an introductory video that is, frankly, insulting. It starts with an inexplicable sequence of a person who cannot read chatting online with someone who cannot type. The video goes on to recommend forgetting Arabic keyboards and using Maren English for everything - Email, Word, IM. And in a really patronising way - as if to say “You poor Arabs, stop worrying about your really difficult language and use English instead”. Given that Arabs are generally touchy about the subject of American abuses to a few of their countries, one of their religions and culture in general, Microsoft marketing are simply fanning the flames, and the negative feedback has started flooding in. Here is one blogger:

Promoting Maren was not in the right way. I’m blogging in English and I chat with some friends in Franco-Arab way but I really care about Arabic Content and promoting the proper content to users, But Maren Video Demonstration didn’t show that It’s helping users to use Arabic Letters or Franco-Arab but It just says Screw Arabic letters .. Write in Roman Characters and we will convert to Arabic .

And a Twitter thread:

@Lastoadri: @Zeinobia well.. we shld preserve our language & force ourselves to write in Arabic letters. we can use yamli, t3reeb.. etc for quick things (13:45)

@Lastoadri: Maren is like forcing Arabs to type Arabic in latin letters. In few yrs we will have the new Turkish.. the video is so disrespectful. (13:44)

@Zeinobia: @Lastoadri why ?? I think it is good for those who do not understand Arabic English typing thing (13:27)

@Lastoadri: I feel the new Microsoft Maren program is like an insult for Arabs http://www.microsoft.com/middleeast/egypt/cmic/maren/ (13:23)

@afahad: after seeing Maren I am convinced us arabs must be the laziest nation in the world! If you do most of your typing in arabic, learn how to! (13:06)

I can only guess that the MS marketing team in Cairo were the same people who thought up Bob. I just hope they learn from their mistakes.

Mashing-up Bi-Di

Mash-ups are a relatively new fashion in web design - taking bits of other web sites to build up your own web page. The idea is not new or special - any search engine showing a small summary of a web site that it has found is a form of mash-up. Integrating an RSS feed is another. And it seems that every company and their mother has its own mash-up API. But what happens when an Arabic web site integrates content that may be Arabic or English or both? It can be hard to predict how to mark up the integrated content for the right result. Here is a good example. Google makes a heroic effort to correctly align content in its Arabic web sites - but they still get it wrong.

For example - here is the result of a search for “Arab” in the Arabic google.ae:

As you see - the green web addresses look good, the text of the first result is right-to-left (note the “…” comes visually after the last Arabic word). Full marks for the second result - the English text is correctly ordered left-to-right (note the “…” is visually after the last English word).

But something has gone wrong with the third result. Let’s look at it more closely, with the HTML below…

As you can see from the mark-up, the “…” is the last character in the snippet but appears in the middle of the line. The Arabic word “مصر” is drawn in a completely different location from the place it appears logically. “ArabChat.com” should be the first word in the title. What has happened is that Google has flagged this site as an Arabic site and allowed the Unicode BiDi algorithm to treat it as right-to-left text, which gives confusing results.

Things can get even more confusing if the text contains mixed Arabic and English, as this result from the mobile edition of Google shows:

I defy anyone to work out what the second result actually says. It should be:

2 Arabic language - Wikipedia, the free encyclopedia - 5 Dec 2008 … Arabic (العربية al-ʿarabiyyah; less formally: عربي ʿarabi) … Standard Arabic ( - en.wikipedia.org/wiki/Arabic_language

So in a world where people are increasingly mashing-up their web pages what is the solution for BiDi languages? As one can see it is not always possible to know beforehand if the piece you are integrating is primarily English or Arabic. ArabChat.com is an Arabic web site yet the snippet found was in English. And when the Unicode BiDi algorithm is left to its own devices in the wrong primary direction the result can be unreadable.

I think there needs to be additional mark-up to handle these correctly. And this needs to be a standard approach - one that is supported by all the web companies. What I suggest is treating this in layers.

  • At the lowest level there needs to be a parser to spot URLs and wrap these correctly, a parser to spot brackets and make sure the open bracket matches the direction of the close bracket, and so on.
  • The next level up would be a standard way to guess if a stream of text or HTML is primarily right-to-left or left-to-right.
  • And the last level is agreed standards for mash-up APIs, XML feeds and XSL transforms that define the intention of the creator of the content.
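
As a rough illustration of the lowest level, here is a minimal sketch that wraps URLs in an explicit left-to-right span so that a right-to-left page cannot reorder their parts. The function name wrap_urls and the URL pattern are assumptions made for this example. The pattern is deliberately crude: it only catches addresses that start with http:// or https://, so bare addresses like the ArabChat.com one above would need something smarter.

```php
<?php
// Hypothetical helper: wrap anything that looks like a URL in an
// explicit left-to-right span before it goes into a right-to-left page.
function wrap_urls($text)
{
    // Escape first, so the input is treated as plain text, not HTML.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Crude pattern (an assumption of this sketch): a scheme, then
    // everything up to the next whitespace character.
    return preg_replace(
        '~https?://\S+~u',
        '<span dir="ltr">$0</span>',
        $safe
    );
}

echo wrap_urls('زوروا http://www.example.com/a/b اليوم'), "\n";
```

The Arabic words on either side are left alone; only the address gets its own direction, whatever the direction of the surrounding page.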

I hope to cover this in my following posts. And please let me have your thoughts in the comments.

So what is the ‘line-height’?

Al-Jazeera is one of the more popular Arabic web sites, so any browser that claims Arabic support should be able to render it correctly. However, here is what happens on a Mac:

Three problems, each with a different cause.

1/ Jumping Content: this is the web site’s fault. They only format the site for one font and expect every browser to have a matching font of exactly the same size.

2/ Back-to-Front Brackets: This is the browser’s fault. I don’t know… every time Firefox comes up with an update they break something with Bi-Di.

3/ Lines too Close: now this is interesting. What is happening is that the web site’s CSS specifies:

line-height: 100%

Now this should not make the lines clash with each other, and if I change the line-height to “normal” everything is OK. So why is this happening? Let’s look at the W3C specification for line-height:

‘line-height’

Value:      normal | <number> | <length> | <percentage> | inherit
Initial:      normal
Applies to:      all elements
Inherited:      yes
Percentages:      refer to the font size of the element itself
Media:      visual

If the property is set on a block-level element whose content is composed of inline-level elements, it specifies the minimal height of each generated inline box.

If the property is set on an inline-level element, it specifies the exact height of each box generated by the element. (Except for inline replaced elements, where the height of the box is given by the ‘height’ property.)

Values for this property have the following meanings:
normal

Tells user agents to set the computed value to a “reasonable” value based on the font size of the element. The value has the same meaning as <number>. We recommend a computed value for ‘normal’ between 1.0 to 1.2. …

<percentage>

The computed value of the property is this percentage multiplied by the element’s computed font size. Negative values are illegal.

Firstly, the Arabic font here is Geeza, which is the standard Arabic fall-back font in Mac OS X. The line-height is based on the font size. As Geeza has lower descenders, it is clashing with the lines below. So the browser is setting the line-height assuming an English font. I can think of two problems here:

  • The CSS description of line-height is too vague, meaning that browser makers will define what 100% means based on the needs of their English-language customers.
  • CSS specifies that this should only be a minimum - but the browsers are treating it as an absolute and forcing a line height that the font will not match.

Apart from all this, fonts do a very bad job of sticking to any standard convention of line-height, so the fault goes all around.
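
The arithmetic behind the clash can be sketched in a few lines. All of the numbers below are invented for illustration; they are not measured metrics of Geeza or of any other real font. The only thing taken from the spec text quoted above is that a percentage line-height is multiplied by the element’s font size.

```php
<?php
// All numbers are made up for illustration; they are not measured
// metrics of Geeza or any other real font.
$fontSizePx    = 16;    // computed font size of the element
$lineHeightPct = 100;   // the site's "line-height: 100%"
$ascentEm      = 0.95;  // hypothetical ascent, as a fraction of the font size
$descentEm     = 0.45;  // hypothetical descent, deeper than a typical Latin font

// Per the spec text, a percentage is multiplied by the font size.
$lineBoxPx = $fontSizePx * $lineHeightPct / 100;

// Height the glyphs need from the top of the ascent to the bottom of the descent.
$glyphsPx = $fontSizePx * ($ascentEm + $descentEm);

// Anything positive here is ink that spills into the neighbouring lines.
$overlapPx = $glyphsPx - $lineBoxPx;

printf("line box %.1fpx, glyphs %.1fpx, overlap %.1fpx\n", $lineBoxPx, $glyphsPx, $overlapPx);
```

With a font whose ascent plus descent is no more than its nominal size, the overlap would be zero or negative, which is presumably why the same 100% looks fine with the font the site was designed for.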

Didacta, Hanover

Educational toy?

I am at Didacta, the big German education trade fair, this week. Having been at shows from Arabia to the UK, it is obvious that there are huge contrasts. First, the British education market and the German education market are poles apart. Here at Didacta there is stand after stand of educational engineering products on offer. Structures, mechanics, electronics - all have their own specialised tools and toys. In the UK one can only find such things tucked away among the smaller stands, as if Britain is embarrassed to admit to teaching such practical subjects. And then there is the contrast between the old and the new technology in education. The lion’s share of space is taken up by the huge book publishers that make the German curriculum texts. Each of the publisher stands is a small village in its own right, while new technology like software and computers gets just plain 20 square metre stands - obviously technology is still nowhere near a power in education here (or anywhere else for that matter). Whatever one may say about the rest of society, the computer revolution has still not even scratched the surface of education.

And finally, the picture on the right is an example of a little German humour, placed prominently in one of the display cases of a manufacturer of medical models of the human body.

Overheard at Gitex

You know this market is controlled by a mafia… If I take one to court he will just send the judge 10 prostitutes and case over.

One person complaining about how hard it was to run a business in an Arab country that will not be named.

This is typical of the attitude many Arabs have about the “freedom” for capitalism in this region. Generally, American and European companies can deal with piracy and plagiarism by getting their embassy to lean on the local government to make sure justice is done - but for medium to small local companies it is like a jungle out there. For companies that do not like to use cronyism or outright bribery it can simply be impossible. This is one of the factors that hampers innovation in the Arab world. With the economic crisis - the world needs new vibrant markets to restart growth. Arabia is critical here as the potential for development here is huge. But without good laws and their transparent application there is simply no way this will happen.

Gitex Day 3 - When France Sneezes…

All Europe catches a cold. So said Klemens Wenzel von Metternich in 1820. And the sneeze in America is being felt here in its own way. It has slowly dawned on me that several of the really big Gitex stands are simply gone. Sony used to fill a whole hall - gone. Microsoft had the largest of all the software stands - gone. i-Mate - largest of the device manufacturers’ stands - gone. Siemens - gone. Also, some of the stands were odd, such as one belonging to a chain of electronics shops in Dubai - as if they needed to fill the space. Then there is the odd embarrassing gap:

Gaps like that have not been seen at previous shows. If some of the companies pulled out, no one told the visitors, because Gitex was still buzzing with people from as far away as Kenya and Egypt. It seems to me that the big companies are expecting a storm to come and are backing off early.

In other news, Apple will be launching the iPhone in 3 Arab countries come January. Apparently Egypt has them already but they are stuck in a warehouse because the Egyptian state bans mobile GPS devices. I saw many people with a hacked iPhone, and all had a very poor Arabic implementation patched onto it for $30 extra. Unless Apple get their act together with proper Arabic support, all the phones they sell in the region will either be grey marketed or hacked, and this will seriously undermine their distribution channels and even their control of the market.

I leave you tonight with this nice photo of the tallest building in the world taken from the Gitex car park!

Gitex Day 2 - Iran strikes back

One thing you will not read about in any of the news reports on Gitex is Iran’s plans for the expo. Their approach has been quite different this year. Last year there were lots of small stands of little Iranian tech companies selling half-baked solutions that probably only would work in Iran. This year was something else: a large open-plan space with lots of sofas and the slogan “Come to the Opportunities”. Basically saying stuff the sanctions, come to Iran, and let’s do business. One exhibitor I spoke to said that he was asked to quote for 100,000 laptops. How much of this is really for trade and how much is just to annoy the Americans I don’t know.

My favorite part of the whole show is to wander round the Chinese and East Asian stands and look at the wild and wacky gadgets they come up with. There is, always, the company that makes the most blatant 3rd-rate rip-off of Apple iPhones and iPods, and the silliest web-cams that you can possibly imagine. But 1st prize for wildest and wackiest gadget goes to the makers of the “Emotion Baby” USB flash drives:

These are tiny flash drives and come in Happy white (4Gb), Surprised blue (8Gb) and Sad pink (2Gb).

More interesting was this Korean company on a tiny 3 meter-wide stand that produces a 3D television:

It is a 42″ high-definition LCD TV that works in the same way as the 3D movies you get at the cinema - it requires polarising glasses. But the results are truly, jaw-droppingly, amazing. Ladies and Gentlemen, I have seen the future and the future is 3D. Once the big film companies extract all the revenue they can get from Blu-ray, you will see, the next big thing will be 3D. In 5 to 10 years everyone will want a 3D-capable LCD television in their living room. There are currently only 2 movies available for these TVs - a rather relaxing aquarium and one Korean medieval drama. But bear in mind Pixar will be producing all their future movies in 3D. Once people get used to these films in the cinema they will want them at home.