Additional Requirements for Bidi in HTML…

… is the subject of the W3C working draft for future improvements to bidi support in HTML5 and CSS3. I formally moaned three years ago about the state of bidi in HTML, and it now looks like something will actually change. A few months ago a group was formed to formalise a proposal to the W3C about bidi, which generated the working draft. This week there was a face-to-face meeting of the most active participants in this proposal to try to bring about some consensus between the different parties. From this will come a formal proposal to the relevant standards bodies.

One problem I have with discussing a proposal is just that - it’s a proposal, not something that you can try, poke and patch up. So I have started to work on a reference implementation in Gecko to bridge that gap (more on this later).

I am starting with the most controversial issue (for me) which is the support for auto direction in HTML.

Why auto direction is so controversial..

Direction in HTML is controlled, mainly, by the dir attribute and the corresponding CSS direction property. Right now the direction is always clear from the markup. At a simple level it is either left-to-right or right-to-left (unspecified defaults to left-to-right). From any view of the markup or CSS alone you can always know what the direction of the underlying text will be.

The only complication is to do with alignment and indents. These may be on the right or left depending on the direction. But as we know the direction while reading the markup, setting up the indent and alignment is relatively trivial and can be done as the data is read.

An “automatic” direction throws a small spanner in the works because now the direction cannot be known from the markup alone. One must first pass the content text through an algorithm. As a result, alignment and indents cannot be known as the HTML is parsed. CSS does not like this uncertainty either, so much so that dir=auto will not have a CSS equivalent, which means the layout cannot be worked out from the CSS alone.

For the browser, this means a major change. Content will need to be parsed twice to set the value for direction-dependent layout. Also you cannot simply translate dir to CSS. Elements with combinations of rtl and ltr content embedded inside dir=auto would have an even harder time knowing their layout or being drawn intelligibly.

For these reasons the original specification was to strictly limit dir=auto to a single element and single paragraphs.

But wait there’s more..

At the face-to-face meeting it was suggested that if the auto detection algorithm was the standard Unicode bidi direction detection then there is no problem having embedded elements that mix rtl and ltr, as long as we follow some simple rules about what can embed and how. This was resolved to the following:

The values for dir will also include uba, auto, and normal, and the values for unicode-bidi will also include uba.

1. the default dir for all elements is normal, with the exception of block elements whose parent is uba. These inherit uba.

2. elements with dir=normal have the same resolved direction (both the internal HTML “property” used for CSS purposes and the actual CSS property) as the parent element. It also sets the unicode-bidi CSS property to normal (unless ubi is explicitly on for that element). The primary purpose for explicitly stating dir=normal is to break dir=uba inheritance from the parent.

3. dir=uba sets the resolved direction (as defined above) of the element according to the UBA applied to its textual content. The textual content is the depth-first traversal of all text nodes (even if they have an explicit dir).

4. In the application of the UBA to textual content, if the text contains no characters of the bidi classes L, AL, or R, the resolved direction of the text is inherited.

5. dir=uba sets the unicode-bidi CSS property to uba.

6. The base directionality of a UBA paragraph (which is distinct from CSS direction, which it does not have) whose containing block element has unicode-bidi:uba is set according to the paragraph’s content using the UBA. A UBA paragraph’s lines’ alignment is determined by the paragraph’s base directionality when the text-align of the containing block element is start or end.

7. To clarify, when an inline element has dir=uba, its children do not inherit dir=uba, but do inherit the resolved direction of the inline element.

8. dir=uba implies ubi by default. If ubi is explicitly off on this element, the unicode-bidi value is “uba embed”. Otherwise, unicode-bidi is “uba isolate”.

9. TBD: what happens in textarea when the user sets an explicit direction via browser UI, for all dir values.

10. auto sets the CSS direction to either ltr or rtl by a mechanism TBD.
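Rules 3 and 4 are easier to poke at as code. The following Python is only a rough sketch, assuming that "the UBA applied to its textual content" comes down to letting the first strong character (bidi class L, R or AL) decide the direction; it ignores embedding and isolate controls. The function name `resolve_uba_direction` and its `inherited` parameter are hypothetical and are not part of the proposal.

```python
import unicodedata

STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}

def resolve_uba_direction(text, inherited="ltr"):
    """Return 'ltr' or 'rtl' from the first strong character in text.

    Falls back to the inherited direction when the text holds no
    character of bidi class L, R or AL (rule 4).
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in STRONG_LTR:
            return "ltr"
        if bidi_class in STRONG_RTL:
            return "rtl"
    return inherited
```

A string of digits and punctuation has no strong character, so it keeps whatever direction it inherits, which is exactly the case rule 4 is there to cover.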

So where do we go from here

If the rules look complex, that is because they are. And this will be a major hurdle to getting such a proposal adopted. I am fairly sure that if I throw these 10 rules at the designers of Internet Explorer with no justification they are going to tell me to take a long walk off a short pier.

So here is the point. For the web programmer, it can mean the freedom to build the web without having to decide on the direction to mark up the content. And, that is a big deal. This would be another step towards the ideal that all software is striving for. Getting to the point where a user of our products will come and say “I do this and it just works”.

The problem is: will these 10 rules achieve this? I see problems. Firstly, the Unicode bidi algorithm, as detailed as it is, is far from perfect. The paragraph direction detection is very simple, in many cases too simple. Further, the HTML and CSS inheritance algorithm may have some fundamental problems. So the real question is how close to wu wei it will get us, and is it really worth the effort?

That can only be answered by experience - and this takes me back to the beginning of this post. I will make a reference implementation so we can test it, poke it and patch it.

Another reason to disable Flash

Does anyone find this worrying:

http://www.vimeo.com/9194146

It is a very cute and cool advert for Salsa and it jumps out of your screen. For a while I have been blocking Flash in my browser just because of these take-over-your-screen adverts but that was only because they annoyed me. This advert goes a few steps past annoying to creepy. What looks like a cool effect seems to me to be a security nightmare.

For this advert to work it is effectively taking a screen capture of the current browser window and replacing it with its own interface. What is to stop a rogue Flash file taking over your browser window and spoofing your clicks to other parts of the web, then extracting your passwords? Or what is to stop the script from sending a screenshot of your browser window, or maybe even your whole display, back to some unspecified server? The worst issue is that I have no control over whether to let Flash do this or not.

I prefer my web content to stay in its box. Flash is staying firmly blocked in the future.

I inadvertently became a member of the Mozilla project

Last month I reported bug 547654 in Thunderbird where scrolling caused display problems. First a QA engineer passed it to the right group, then the programmer of the affected code asked me to repeat the issue. Now the bug has grown to a full scale blocker and is actually assigned to me! I have written a patch and with any luck it will be accepted and my initiation into Mozilla will be complete.

This is what I love about true open source projects. Literally anyone can walk in off the street (or the net) and overnight become a part of a project, equal to any programmer who has been part of the project for years - as long as they are competent.

If only politics could work in the same way…

Some random posts about Buzz

Following my friends’ comments once Google dropped Buzz on an unsuspecting public…

@S Feb 10th 5:28pm I don’t get this. What’s going on? WHAT THE HELL IS GOING ON?

@S Feb 10th 5:29pm WHOA ITS LIKE FACEBOOK AND TWITTER. FWITTERBOOK!

@S Feb 10th 7:21pm It’s funny that Buzz comes about right when I started thinking about the lack of privacy in my life as a result of the blurring of the line between public/private courtesy social media etc.

@J Feb 10th 7:33pm I don’t know what I think about this Google Buzz stuff. I just refreshed gmail and there it was.

@J Feb 10th 8:16pm If you link Twitter (or for that matter, anything else that you update externally and frequently), I will stop following you. I don’t need to see Twitter in my freaking e-mail.

@H Don’t know what to think about this Buzz. First thoughts: familiar faces, what the hell, get the hell out of my inbox, I’m deluged, how am I gonna deal with this? … some friends are here too, where are the others, hey, I can write more than the 140 bloody characters! oh shit… no… I need to go see some patient, stop buzzing and go to work! oh and I’m not gonna get addicted to this… no way… imbuzzible!

@Sa a new and nifty distraction added to the several others, but this is right inside Gmail, hmm Buzz

@Je Gmail morphed into Facebook so slowly we never noticed until it was too late.

@Jalal - what the heck is this buzzer??????????
this seems to be soooo public

@A D - no sh*t, sherlock? really? and Twitter isn’t?

Musing on Unicode Data

A while back Apple asked me to improve the Arabic system font in Mac OS X (which is now available in OS X 10.6). The requirements were daunting to say the least. The font had to support the full Unicode Arabic range, presentation forms and all. This meant adding some 1500 glyphs to my font together with the relevant AAT tables for Arabic shaping, ligatures,  justification and kerning. My approach was to automate this as much as possible. So, I wrote a tool that generates all the required Arabic glyphs (about 1700 in total) from a set of about 90 basic shapes and 20 kinds of dots. But that will be the subject of another post. My point for this post is about Unicode standards.

Very technical post warning (eyes should glaze over at this point) …

For the tool to work in a generic way it relies heavily on the naming convention of Unicode characters. The tool reads the Unicode data file “ArabicShaping.txt” and for each line e.g.

068F; DAL WITH 3 DOTS ABOVE DOWNWARD; R; DAL

the name, “DAL WITH 3 DOTS ABOVE DOWNWARD”, is processed as tokens e.g.:

<ARABIC LETTER DAL>, <ABOVE>, and <THREE DOTS DOWNWARDS>

The tool then interprets these tokens and outputs an XML file for input to another program that will generate the new glyphs. The above example is composed of: U+062F (Dal) combined with a glyph named “threeDotsDown” and the combining glyph is placed above the main glyph. In XML it looks like this:

    <newGlyph name="u068f.dalWith3DotsAboveDownward" Unicode="U+068f">
        <pieceGlyph glyphRefID="dal">
            <position X="0" Y="0"></position>
        </pieceGlyph>
        <pieceGlyph glyphRefID="threeDotsDown" linkToPrior="yes">
            <position X="3" Y="5" useZones="yes"></position>
        </pieceGlyph>
    </newGlyph>

ArabicShaping.txt also specifies the joining - in this case “R” (right joining). So the tool will also look for a glyph that makes the final form of the Dal and create a similar combination.
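The first step, reading a line and breaking up the schematic name, can be sketched in a few lines of Python. This is not the tool described here, only an illustration: the function names `parse_shaping_line` and `tokenise_name` are made up, and the tokenising is far cruder than what the real tool would need (it only splits on “WITH”).

```python
def parse_shaping_line(line):
    """Split one ArabicShaping.txt data line into its four fields."""
    data = line.split("#", 1)[0]  # drop any trailing comment
    code, name, joining_type, joining_group = [f.strip() for f in data.split(";")]
    return {
        "codepoint": int(code, 16),
        "name": name,
        "joining_type": joining_type,
        "joining_group": joining_group,
    }

def tokenise_name(name):
    """Rough split of a schematic name into a base letter and modifier words."""
    base, _, modifier = name.partition(" WITH ")
    return base, modifier.split()

entry = parse_shaping_line("068F; DAL WITH 3 DOTS ABOVE DOWNWARD; R; DAL")
base, modifiers = tokenise_name(entry["name"])
```

Here `base` is the letter to start from and `modifiers` holds the words that still have to be interpreted as a dot pattern and a position.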

Finally the tool exports three more files to define the Arabic shaping rules, kerning and justification.

To get to my point:

The data files produced by the Unicode Consortium and the naming conventions used are almost good enough for this kind of machine processing but need some modification. I noticed a number of inconsistencies and some missing information that complicate this. Below are examples of the problems:

1/ Some of the names for parts of a glyph are inconsistent:

e.g. characters in the Arabic (06xx) area use the word “VERTICAL” while the Supplemental Arabic (07xx) area uses “VERTICALLY” in the same context.

compare: 067A; TEH WITH 2 DOTS VERTICAL ABOVE; D; BEH
and 076B; REH WITH 2 DOTS VERTICALLY ABOVE; R; REH

The same goes for BAR and STROKE, DOWN and DOWNWARD, etc. I have a long list.
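Until the names are made consistent, a tool has to paper over the differences itself. A minimal sketch of that workaround in Python follows; the `SYNONYMS` table is hypothetical, covers only the pairs mentioned above, and the choice of which spelling counts as canonical is arbitrary.

```python
# Hypothetical synonym table: maps each variant spelling to one canonical token.
SYNONYMS = {
    "VERTICALLY": "VERTICAL",
    "DOWNWARD": "DOWN",
    "DOWNWARDS": "DOWN",
    "STROKE": "BAR",
}

def normalise_name(name):
    """Rewrite a schematic name so that variant words collapse to one spelling."""
    return " ".join(SYNONYMS.get(word, word) for word in name.split())
```

With this, “REH WITH 2 DOTS VERTICALLY ABOVE” and a name using “VERTICAL” end up with the same modifier words, so one rule can handle both.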

2/ There were some unclear names:

e.g. The difference between 06A2; FEH WITH DOT MOVED BELOW and 06A3; FEH WITH DOT BELOW is difficult to process programmatically.

3/ The joining group is inconsistent for some of the characters:

ArabicShaping.txt has 4 fields for each listed character: Unicode, Schematic Name, Joining Type and Joining Group.

e.g.: 06C3; TEH MARBUTA GOAL; R; HAMZA ON HEH GOAL

The joining group is: HAMZA ON HEH GOAL. But when I look at the HAMZA ON HEH GOAL character, its joining group is HEH GOAL. So why not just have: 06C3; TEH MARBUTA GOAL; R; HEH GOAL ?

4/ Working out the names of combiners that are transparent to shaping is tricky:

My tool also generates the font shaping information. But for this to work I need to know what characters are combining accents and therefore transparent to shaping. However the Unicode data files make my life more difficult than it needs to be. In order to do this I would need to parse a file called DerivedJoiningType.txt, extract the Unicode values for characters with “Joining_Type=Transparent” then search UnicodeData.txt for more information on these characters. It would be much easier if ArabicShaping.txt listed accents as well.
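For what it is worth, the first half of that detour is short. The Python below is a sketch only, and it assumes the usual layout of the Unicode derived data files: a code point or a `first..last` range, a semicolon, the property value, then a `#` comment, with the value for transparent characters written as `T`. The function name `transparent_codepoints` is made up for the illustration.

```python
def transparent_codepoints(lines, wanted="T"):
    """Collect the code points whose joining type equals `wanted`."""
    found = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        span, value = [part.strip() for part in data.split(";")]
        if value != wanted:
            continue
        first, _, last = span.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        found.update(range(start, end + 1))
    return found
```

The resulting set still has to be matched against UnicodeData.txt to learn anything more about each character, which is the part that would disappear if ArabicShaping.txt listed the accents directly.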

5/ Character names across Unicode files are inconsistent:

It is hard to go from the character name in UnicodeData.txt to the character name in ArabicShaping.txt

e.g. in UnicodeData you have:
0622; ARABIC LETTER ALEF WITH MADDA ABOVE; Lo; 0; AL; 0627 0653; ; ; ; N; ARABIC LETTER MADDAH ON ALEF; ; ; ;

in ArabicShaping you have:
0622; MADDA ON ALEF; R; ALEF

For a number of reasons I use the name in ArabicShaping.txt to name the glyphs in my font. But I need some way to map the glyph name to the character name in UnicodeData.txt. I suggest that the best way is to add a field to ArabicShaping.txt that gives the full Unicode name.
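As a stop-gap, the code point itself can serve as the key between the two files. Python's standard `unicodedata` module carries the character names from the Unicode character database, so a lookup by code point gives the full name without any extra field. This is a workaround sketch, not a substitute for the suggestion above; the function name `full_unicode_name` is hypothetical.

```python
import unicodedata

def full_unicode_name(codepoint_hex):
    """Look up the formal character name for a hex code point such as '0622'."""
    return unicodedata.name(chr(int(codepoint_hex, 16)))

name = full_unicode_name("0622")  # first field of the ArabicShaping.txt line
```

For the example above this returns “ARABIC LETTER ALEF WITH MADDA ABOVE”, the name given in UnicodeData.txt.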

In conclusion

I think it is possible to address these inconsistencies by modifying one file, ArabicShaping.txt. This would make processing Arabic Unicode fonts that little bit easier and who knows - maybe there will be more fonts that fully support Arabic Unicode. Also, new characters are in the pipeline for Arabic - I hope these notes help make sure they are defined consistently.

The end of the Google generation

I first heard of Google in the late 90’s from The Scout Report, an academic mailing list reviewing all that is new in the world of research. Google was their favorite web search engine. Since then I have become part of what I can only call the Google Generation. And it was a generational phenomenon. A whole generation of web surfers that learnt to go without thinking to Google to find, well, whatever.

Last week I was at an Apple iPhone event in London and chatted to a developer who made a sailing application for the iPhone that, as one of its functions, tells you the tides anywhere in the world. Then a thought struck me - here is an application I would use just for going on a day out to the beach. Open it, see when the tide is low or high, and time my trip for low tide. More than that, here is an application that gives me the information I want immediately that would normally take several minutes of fumbling through Google and poorly organised web sites. And this is only one example out of tens of thousands of possible applications that I can find to make my life that little bit easier. Bit by bit, I am turning from a Google searcher to an iPhone App user. Already I have stopped searching Google for restaurants, films, directions, tides, weather, etc.

What I think we are witnessing now is a new generational change. There is an explosion of iPhone app creation and usage (100,000 apps, 1 billion downloads and growing), much like the explosion of users that Google experienced. And on top of that, it is coming with a rock-solid business model and without the irritation of adverts. Now a whole new generation of internet users, including me, are looking more to Apple’s App Store for information than to Google. I’ll call us the “There’s an App For That” generation.

Google cannot control the data they index and the web sites have little or no financial incentive to improve the presentation of that data. On the other hand App writers get paid directly by each user. The result? More and better applications presenting information that is already available on the web and thousands, maybe millions, of users willing and able to pay a dollar for that App. In short the Apps are getting better and the web sites remain just as poor. It is gradually getting easier for me to buy an application that gives me the information I want than searching Google.

For the future it means that web search and especially advertising sponsored web search will become irrelevant and Apple’s iPhone and App Store will become Google’s main competitor. Or maybe Apple will be the disruptive change that will push Google into the sidelines in the same way that Google pushed Yahoo! out of its way.

Most opinions I have read about why Google spent millions on its mobile OS, Android, only to give it away are around getting mobile advertising dollars to keep flowing to Google. I disagree. The mobile market is changing and it is moving away from web sites that get paid for by advertising. And the mobile App market is beginning to eat into the web search market. I believe this is a fight for survival. Browsers and publishers now have a way to pay and charge for information directly. Advertisers now have to change their relationship from the huge company that pushes the adverts onto the web to the small companies that publish the data. I do not see a place in there where Google would fit easily.

A Tale of Two Tweets
or how Twitter broke the bidi algorithm.

For years I have been trying to explain to anyone who would listen that the Unicode Bidi Algorithm has a fundamental flaw. The problem was that I did not have strong practical examples… and then along came Twitter.

My complaint is that Unicode bidi considers most characters that are not letters or numbers as neutrals and, in many cases, this is not correct. A neutral takes its direction from the surrounding strong characters or, when they disagree, from the dominant direction, e.g.:

[Arabic Letter][neutral][Arabic letter] --> [right][right][right]
[Arabic Letter][neutral][English letter] --> [right][right][left] if the direction is Arabic
                                          or [right][left][left] if the direction is English.
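The rule in this example is small enough to write down. The Python below is a toy model of just that one rule, not the Unicode Bidi Algorithm: each character is reduced to 'R' (right), 'L' (left) or 'N' (neutral), and the function name `resolve_neutrals` is made up for the illustration.

```python
def resolve_neutrals(classes, base):
    """classes: list of 'L', 'R' or 'N'; base: 'L' or 'R' (the dominant direction)."""
    resolved = list(classes)
    for i, cls in enumerate(classes):
        if cls != "N":
            continue
        # Nearest strong class on each side; the edges count as the base direction.
        before = next((c for c in reversed(classes[:i]) if c != "N"), base)
        after = next((c for c in classes[i + 1:] if c != "N"), base)
        # Agreeing neighbours win, otherwise the dominant direction decides.
        resolved[i] = before if before == after else base
    return resolved
```

Calling it with `["R", "N", "L"]` reproduces the second and third lines of the example: the neutral goes right when the base is 'R' and left when the base is 'L'.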

This is all OK if the algorithm is placing commas and periods in sentences. And, Unicode bidi also takes care of lots of special cases. e.g. when to treat the period (U+002e) as a full stop or a decimal separator.

But the usage of punctuation varies over time. Arabic users may use slashes as a date separator today and switch to hyphens in the future. Then Twitter presents a whole new ball game. The ‘:’, ‘@’ and ‘#’ are redefined as textual letters which should take the direction from the word they start, e.g. #bidi or @ironymark. The recommended way to correct this is by adding directional characters (e.g. left-to-right mark, U+200e). But you cannot do this without eating into the 140 character limit, causing problems with search or breaking the myriad of Twitter tools out there. On top of this Twitter does not define a right-to-left interface, so users must fend for themselves. This causes the following problem:

[Screenshot, 2009-10-29: a tweet from @GVinarabic]

The above entry has three main elements - a title of a blog post in Arabic, the words “translated by” followed by the @name of the translator and the URL. Simple? No. Since the direction is left-to-right the user has typed the text in the following order:

@nightS [the title in Arabic] - [translated by] - [the URL]

i.e. the last word in the tweet was written first to make the text look correct. Then in another case we see:

[Screenshot, 2009-10-29: a second tweet]

Now the user typed: [title] - @MuhammadAdel : [translated by] - [URL]

Here the order of the text was modified to force bidi to give the right visual result when the tweet fills two lines. Any future attempt to translate these tweets or search for “translated by @nightS” will fail.

If one forces the direction to right-to-left there are other problems:

[Screenshot, 2009-10-29: a tweet displayed right-to-left]

The @ has been separated from @gr33ndata and put at the other end of the tweet. This is actually the thin end of the wedge. URLs can get reordered to become unreadable and things will get nastier in more complicated tweets that are re-tweeted or refer to popular # tags. e.g.

RT: @tweeter1 @tweeter2 #arabic [some short message in Arabic] [URL]

Here the right-to-left reader would want to see:

[URL] [some short message in Arabic] #arabic @tweeter2 @tweeter1 :RT

To get the right effect I need the following markup:

  <div dir="rtl">
    <span dir="ltr">RT</span>&rlm;:
    <span dir="ltr">@tweeter1</span>&rlm;
    <span dir="ltr">@tweeter2</span>&rlm;
    <span dir="ltr">#arabic</span>&rlm;
    [some short message in Arabic] &rlm;
    <span dir="ltr">[URL]</span>
  </div>

Rewiring the bidi algorithm just for Twitter is a non-starter. So one must solve this at the display end. It can only be done by injecting spans and directional marks such as &rlm;. But we now have a rather messy collection of markup and injected direction characters, which can lead to all sorts of problems once you have to handle these in a text editor or convert between markup and plain text.
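To give a feel for what the display end would have to do, here is a rough Python sketch that produces markup of the kind shown above. It is a simplification under stated assumptions: the pattern `LTR_TOKEN` and the function `wrap_for_rtl` are hypothetical, only Latin-script @names and #tags, http(s) URLs and the literal “RT” are recognised, and an &rlm; is appended after every wrapped token rather than placed by hand.

```python
import html
import re

# Tokens that should keep their own left-to-right order inside an rtl block.
LTR_TOKEN = re.compile(r"https?://\S+|[@#][A-Za-z0-9_]+|\bRT\b")

def wrap_for_rtl(tweet):
    """Return HTML for a tweet that is to be shown in a right-to-left block."""
    parts = []
    last = 0
    for match in LTR_TOKEN.finditer(tweet):
        # Text between tokens is escaped and left for the bidi algorithm.
        parts.append(html.escape(tweet[last:match.start()]))
        parts.append('<span dir="ltr">%s</span>&rlm;' % html.escape(match.group()))
        last = match.end()
    parts.append(html.escape(tweet[last:]))
    return '<div dir="rtl">%s</div>' % "".join(parts)
```

Even this toy version shows the cost: the output is no longer the text the user typed, so anything that edits or searches it has to strip the markup again first.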

This is possible, but to do it right one must implement an algorithm that can be ported to all web languages and to all text editors that handle Twitter, then have it standardised across the industry, etc.

The real solution is, at some point, for a future HTML to define simple markup that gives better control of how spans are ordered without being forced to inject &rlm; characters. That will be the subject of a future post.

Mashing Up Bidi - Unicode Conference Slides

At the 33rd Unicode Conference I gave a presentation about the problems I encountered in my previous post, Mashing Up Bidi. Here are the summary and slides:

Mashing-up Bi-Di

Mash-up is a relatively new, fashionable word on the Web - taking bits of other web sites to build up your own web page. It is not new or special - any search engine showing a snippet of a web site that it has found is a form of mash-up. Integrating a news or micro-blogging feed is another. And it seems that every company and its mother has its own mash-up API. But what happens when an Arabic web site integrates content that may be Arabic or English or both? The Unicode Bi-Di Algorithm can render text and numbers unreadable. URL’s may become unusable or, in the worst case, direct to fraudulent sites. It can be hard to predict how to mark up the integrated content for the right result. This presentation will cover real world issues and attempt to suggest practical solutions.

The Unicode Bi-Di Algorithm has been a great benefit for software in general. It provides a unified way for rendering mixed right-to-left (e.g. Arabic) and left-to-right (e.g. English) text across all kinds of software and devices. However, if the original direction of a piece of text is lost, applying the wrong direction may render the text unreadable. This is especially a problem on the web.

Text that appears on a web site may have passed through many stages before being rendered in its final place and can easily lose the markup specifying the direction when it was initially created. For example, a search engine providing a single sentence from the web sites it chooses from a query; a site integrating a list of the statuses of friends on a social networking site; a blog displaying ‘trackbacks’ to its posts.

Web browsers use the Bi-Di Algorithm to order the text they display, and displaying English text in a right-to-left context, or Arabic in left-to-right, can make text hard to read. Some web companies make a heroic effort to correctly align content in their Arabic web sites - but they still get it wrong. In a world where people are increasingly mashing up their web pages, what is the solution for BiDi languages?

I will try to answer this by suggesting the additional mark-up to handle such cases correctly and the kind of methods that can be employed to recognize the text’s correct direction.

  1. At the lowest level there needs to be a parser to spot URL’s and wrap these correctly. A parser to spot brackets and make sure the open bracket matches the direction of the close bracket.
  2. The next level up would be a standard way to guess if a stream of text or HTML is primarily right-to-left or left-to-right.
  3. And the last level is agreed standards for Mash-up API’s, XML feeds, XSL transforms that define the intention of the creator of the content.
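As an illustration of the second level, one simple way to guess is to count the strong characters in each direction and take the majority. This is only one possible heuristic, sketched in Python under that assumption; the name `guess_direction` is made up, and HTML input would need its tags stripped first, which is not handled here.

```python
import unicodedata

def guess_direction(text, default="ltr"):
    """Guess 'ltr' or 'rtl' by counting strong characters in the text."""
    ltr = rtl = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            ltr += 1
        elif bidi_class in ("R", "AL"):
            rtl += 1
    if ltr == rtl:
        return default  # no strong characters, or a dead heat
    return "rtl" if rtl > ltr else "ltr"
```

Unlike a first-strong-character test, a majority count is not thrown by a snippet of Arabic that happens to start with a Latin @name or a URL.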

The presentation will conclude with a proposal for a standard approach - hopefully one that can be supported by all the web sites.

Comments?

Evil Potato Head from Outer Space

Be afraid, be very afraid…

[Image: evil potato head from outer space]

Banging my head on Pipes

Preparing my Unicode presentation I thought it would be really nice to show a simple demonstration. The topic of my presentation is “Mashing Up Bidi”, about just how messy it can be to mix Arabic and English content in a mashed-up web page. The idea is to take an English Twitter or RSS feed and put it in an Arabic page, then see what needs to be done to keep it readable.

So I threw together a simple portal demo, made it right-to-left and decided to use Yahoo! Pipes to handle the manipulation of the text in the feed. Pipes has a really nice feature that lets it call your own web service to do some of the text handling.

The web service is simplicity itself. Pipes sends JSON in an HTTP POST and your service returns the modified JSON data in reply. Nothing could be easier. I was wrong. First I wrote a simple PHP service that receives the HTTP POST, calls json_decode (which is now a standard feature of PHP 5), plays with the data and then calls json_encode to return the results:

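<?php
// Pipes POSTs the feed as JSON in the "data" field.
$json = json_decode($_POST['data']);
$count = count($json->items);
for ($i = 0; $i < $count; $i++) {
    // do the text manipulation here
}
header('Content-Type: application/json');
// json_encode escapes non-ASCII characters as \uXXXX by default.
print json_encode($json);
?>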

It all looked OK until I saw that URL’s, HTML fields and, worse, all non-ASCII Unicode were being converted into mojibake once they re-entered the pipe. Pipes has really nice testing features and lets you check the output in many ways. So I made the pipe output JSON and compared it to the output of the PHP service. Both looked the same. So maybe there was a bug in communication between the pipe and the PHP service. I sent the POST data right back in the PHP output and it worked, so the comms were OK. Maybe there was a problem decoding the JSON. So I explicitly set a field to my own string containing a URL. It failed. So the JSON going in works, the converted JSON looks OK but fails, and all seems to be OK in the PHP. At this point I started to get a little frustrated.

Next step - check the data sent to the web service. The debugging features of Pipes came in handy again. I returned the HTTP POST data from my PHP service, but this time inserted an open quote (”) to force Pipes to output an error with the actual JSON data it gets back.

Now I noticed a difference. When Pipes sends JSON to a web service it does not escape it in any way - you just get raw UTF-8 characters. When PHP encodes JSON it escapes all non-ASCII characters as \uXXXX sequences, along with some ASCII ones (e.g. “/” gets written as “\/” so that it can be parsed safely). However, Pipes makes no attempt to read or convert these escapes, so they just become part of the text, causing instant mojibake.
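The two spellings of the same string are easy to see side by side. The snippet below uses Python purely for illustration (the service in this post is PHP); Python's standard `json.dumps` escapes non-ASCII characters by default, much as json_encode does, although it leaves “/” alone.

```python
import json

text = "\u0645\u0631\u062d\u0628\u0627"  # an Arabic word, five letters

escaped = json.dumps({"title": text})                  # \uXXXX escapes, pure ASCII
raw = json.dumps({"title": text}, ensure_ascii=False)  # raw characters kept as-is

# Both strings are valid JSON and decode to the same value...
same = json.loads(escaped) == json.loads(raw)
# ...but a consumer that never unescapes shows the backslash sequences literally.
```

A well-behaved JSON parser treats `escaped` and `raw` as identical, so the breakage only appears when the receiving end skips the unescaping step.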

So it was time to look for a more suitable JSON encoder. First, I saw that json_encode takes a parameter to control the output encoding - but not in my version of PHP - and the controls were limited to a few characters. So I used this really useful page comparing the output of various PHP encoders. Zend seemed to do the job for me.

I installed Zend and made sure it did not call the internal encoder and called it like this:

<?php
require_once 'Zend/Json.php';
$json = json_decode($_POST['data']);
$count = count($json->items);
for ($i = 0; $i < $count; $i++) {
    // do the text manipulation here
}
header('Content-Type: application/json');
// Use Zend's own PHP encoder rather than the built-in json_encode.
Zend_Json::$useBuiltinEncoderDecoder = true;
print Zend_Json::encode($json);
?>

Now the URL’s were OK but Unicode was still messed up. Someone had ‘fixed’ Zend to insert escapes. No problem, it’s all PHP, so I edited Zend/Json/Encoder.php and commented out the line that encodes Unicode and presto - I have a working web service.

What really annoys me is that it is 2009 and you still have to jump through hoops to get anything with text just slightly more complex than plain ASCII to work. So whose fault is it? Well Yahoo! has to take a big share of the blame. They are one of the most active supporters of JSON and do not parse it properly inside of their own services.

But really it is the laxness of the JSON standard (and most other standards) when it comes to text. There should be only one way to pass a string and that is either plain Unicode or encoded ASCII. Not both.