Some random posts about Buzz


Following my friends’ comments once Google dropped Buzz on an unsuspecting public…

@S Feb 10th 5:28pm I don’t get this. What’s going on? WHAT THE HELL IS GOING ON?

@S Feb 10th 5:29pm WHOA ITS LIKE FACEBOOK AND TWITTER. FWITTERBOOK!

@S Feb 10th 7:21pm It’s funny that Buzz comes about right when I started thinking about the lack of privacy in my life as a result of the blurring of the line between public/private courtesy social media etc.

@J Feb 10th 7:33pm I don’t know what I think about this Google Buzz stuff. I just refreshed gmail and there it was.

@J Feb 10th 8:16pm If you link Twitter (or for that matter, anything else that you update externally and frequently), I will stop following you. I don’t need to see Twitter in my freaking e-mail.

@H Don’t know what to think about this Buzz. First thoughts: familiar faces, what the hell, get the hell out of my inbox, I’m deluged, how am I gonna deal with this? … some friends are here too, where are the others, hey, I can write more than the 140 bloody characters! oh shit… no… I need to go see some patient, stop buzzing and go to work! oh and I’m not gonna get addicted to this… no way… imbuzzible!

@Sa a new and nifty distraction added to the several others, but this is right inside Gmail, hmm Buzz

@Je Gmail morphed into Facebook so slowly we never noticed until it was too late.

@Jalal - what the heck is this buzzer??????????
this seems to be soooo public

@A D - no sh*t, sherlock? really? and Twitter isn’t?

Musing on Unicode Data

A while back Apple asked me to improve the Arabic system font in Mac OS X (which is now available in OS X 10.6). The requirements were daunting, to say the least. The font had to support the full Unicode Arabic range, presentation forms and all. This meant adding some 1500 glyphs to my font together with the relevant AAT tables for Arabic shaping, ligatures, justification and kerning. My approach was to automate this as much as possible. So I wrote a tool that generates all the required Arabic glyphs (about 1700 in total) from a set of about 90 basic shapes and 20 kinds of dots. But that will be the subject of another post. This post is about the Unicode data files.

Very technical post warning (eyes should glaze over at this point) …

For the tool to work in a generic way it relies heavily on the naming convention of Unicode characters. The tool reads the Unicode data file “ArabicShaping.txt” and for each line e.g.

068F; DAL WITH 3 DOTS ABOVE DOWNWARD; R; DAL

the name, “DAL WITH 3 DOTS ABOVE DOWNWARD”, is processed as tokens e.g.:

<ARABIC LETTER DAL>, <ABOVE>, and <THREE DOTS DOWNWARDS>
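As a rough sketch of this first step in Python (not the tool's actual code, which is not shown in this post; the function names are hypothetical): split a data line into its four fields, then split the schematic name into the base letter and the modifier words. It assumes names follow a "BASE WITH MODIFIERS" pattern.

```python
def parse_shaping_line(line):
    """Split one ArabicShaping.txt data line into its four fields."""
    code, name, joining_type, joining_group = [f.strip() for f in line.split(";")]
    return int(code, 16), name, joining_type, joining_group


def split_name(name):
    """Split a schematic name into (base letter, modifier words).

    Assumes the 'BASE WITH MODIFIERS' pattern; a name without WITH
    is returned with an empty modifier list.
    """
    base, _, modifiers = name.partition(" WITH ")
    return base, modifiers.split()


code, name, jtype, jgroup = parse_shaping_line(
    "068F; DAL WITH 3 DOTS ABOVE DOWNWARD; R; DAL")
print(hex(code), split_name(name), jtype, jgroup)
```

Turning the modifier words into tokens such as "three dots downwards" would still need a lookup table on top of this.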

The tool then interprets these tokens and outputs an XML file for input to another program that will generate the new glyphs. The above example is composed of: U+062F (Dal) combined with a glyph named “threeDotsDown” and the combining glyph is placed above the main glyph. In XML it looks like this:

    <newGlyph name="u068f.dalWith3DotsAboveDownward" Unicode="U+068f">
        <pieceGlyph glyphRefID="dal">
            <position X="0" Y="0"></position>
        </pieceGlyph>
        <pieceGlyph glyphRefID="threeDotsDown" linkToPrior="yes">
            <position X="3" Y="5" useZones="yes"></position>
        </pieceGlyph>
    </newGlyph>
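For illustration only, here is one way such an element could be produced in Python with the standard `xml.etree.ElementTree` module. The helper `new_glyph` and its parameters are hypothetical; only the element and attribute names come from the XML above.

```python
import xml.etree.ElementTree as ET


def new_glyph(name, codepoint, base, mark, x, y):
    """Build a <newGlyph> element: a base glyph plus one positioned mark."""
    glyph = ET.Element("newGlyph", name=name, Unicode=f"U+{codepoint:04x}")
    piece = ET.SubElement(glyph, "pieceGlyph", glyphRefID=base)
    ET.SubElement(piece, "position", X="0", Y="0")
    piece = ET.SubElement(glyph, "pieceGlyph", glyphRefID=mark, linkToPrior="yes")
    ET.SubElement(piece, "position", X=str(x), Y=str(y), useZones="yes")
    return glyph


glyph = new_glyph("u068f.dalWith3DotsAboveDownward", 0x068F,
                  "dal", "threeDotsDown", 3, 5)
print(ET.tostring(glyph, encoding="unicode"))
```

ElementTree writes the empty `position` elements in self-closing form, which is equivalent XML.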

ArabicShaping.txt also specifies the joining type - in this case “R” (right joining). So the tool will also look for a glyph that makes the final form of the Dal and create a similar combination.

Finally the tool exports three more files to define the Arabic shaping rules, kerning and justification.

To get to my point:

The data files produced by the Unicode Consortium and the naming conventions used are almost good enough for this kind of machine processing but need some modification. I noticed a number of inconsistencies and some missing information that complicate this. Below are examples of the problems:

1/ Some of the names for parts of a glyph are inconsistent:

e.g. characters in the Arabic (06xx) area use the word “VERTICAL” while the Supplemental Arabic (07xx) area uses “VERTICALLY” in the same context.

compare: 067A; TEH WITH 2 DOTS VERTICAL ABOVE; D; BEH
and 076B; REH WITH 2 DOTS VERTICALLY ABOVE; R; REH

the same for BAR and STROKE and DOWN and DOWNWARD, etc, … I have a long list.
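A workaround on the tool side is a synonym table applied before tokenising. The sketch below is in Python and the table is a hypothetical, partial one built only from the pairs mentioned above.

```python
# Hypothetical synonym table: map each variant spelling to one canonical word.
SYNONYMS = {
    "VERTICALLY": "VERTICAL",
    "DOWNWARD": "DOWN",
    "DOWNWARDS": "DOWN",
    "STROKE": "BAR",
}


def normalise(name):
    """Rewrite a schematic name using one canonical word per concept."""
    return " ".join(SYNONYMS.get(word, word) for word in name.split())


print(normalise("REH WITH 2 DOTS VERTICALLY ABOVE"))
```

This only hides the problem in my own code, of course; every other tool reading the file has to keep the same table.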

2/ There were some unclear names:

e.g. the difference between 06A2; FEH WITH DOT MOVED BELOW and 06A3; FEH WITH DOT BELOW is difficult to process programmatically.

3/ The joining group is inconsistent for some of the characters:

ArabicShaping.txt has 4 fields for each listed character: Unicode code point, schematic name, joining type and joining group.

e.g.: 06C3; TEH MARBUTA GOAL; R; HAMZA ON HEH GOAL

The joining group is: HAMZA ON HEH GOAL. But when I look at the HAMZA ON HEH GOAL character, its joining group is HEH GOAL. So why not just have: 06C3; TEH MARBUTA GOAL; R; HEH GOAL ?

4/ Working out the names of combiners that are transparent to shaping is tricky:

My tool also generates the font shaping information. But for this to work I need to know what characters are combining accents and therefore transparent to shaping. However the Unicode data files make my life more difficult than it needs to be. In order to do this I would need to parse a file called DerivedJoiningType.txt, extract the Unicode values for characters with “Joining_Type=Transparent” then search UnicodeData.txt for more information on these characters. It would be much easier if ArabicShaping.txt listed accents as well.
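To show what that extra step involves, here is a minimal Python sketch. It assumes each data line of DerivedJoiningType.txt has the form "code point or range ; value # comment" and that the Transparent value may be written either as "T" or in full. The sample lines are illustrative, not copied from the real file.

```python
def transparent_code_points(lines):
    """Collect code points whose Joining_Type is Transparent."""
    found = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        span, value = [part.strip() for part in data.split(";")]
        if value not in ("T", "Transparent"):
            continue
        first, _, last = span.partition("..")
        found.update(range(int(first, 16), int(last or first, 16) + 1))
    return found


# Illustrative lines in the assumed format.
sample = [
    "# Joining_Type=Transparent",
    "064B..0652    ; T # Mn   ARABIC FATHATAN..ARABIC SUKUN",
    "0670          ; T # Mn   ARABIC LETTER SUPERSCRIPT ALEF",
    "0640          ; C # Lm   ARABIC TATWEEL",
]
marks = transparent_code_points(sample)
print(sorted(hex(cp) for cp in marks))
```

The resulting set still has to be looked up in UnicodeData.txt to get names for the glyphs, which is the second half of the chore.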

5/ Character names across Unicode files are inconsistent:

It is hard to go from the character name in UnicodeData.txt to the character name in ArabicShaping.txt.

e.g. in UnicodeData you have:
0622; ARABIC LETTER ALEF WITH MADDA ABOVE; Lo; 0; AL; 0627 0653; ; ; ; N; ARABIC LETTER MADDAH ON ALEF; ; ; ;

in ArabicShaping you have:
0622; MADDA ON ALEF; R; ALEF

For a number of reasons I use the name in ArabicShaping.txt to name the glyphs in my font. But I need some way to map the glyph name to the character name in UnicodeData.txt. I suggest that the best way is to add a field to ArabicShaping.txt that gives the full Unicode name.
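Until such a field exists, the code point is the only reliable join key between the two files. A small Python sketch of that join, using the standard `unicodedata` module (which carries the UnicodeData.txt names) in place of parsing the file by hand:

```python
import unicodedata

shaping_line = "0622; MADDA ON ALEF; R; ALEF"
code, short_name, joining_type, joining_group = [
    field.strip() for field in shaping_line.split(";")]

# Look the full character name up by code point, not by name.
full_name = unicodedata.name(chr(int(code, 16)))
print(f"{code}; {short_name}; {full_name}")
```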

In conclusion

I think it is possible to address these inconsistencies by modifying one file, ArabicShaping.txt. This would make processing Arabic Unicode fonts that little bit easier and, who knows, maybe there will be more fonts that fully support Arabic Unicode. Also, new characters are in the pipeline for Arabic - I hope these notes help make sure they are defined consistently.

The end of the Google generation

I first heard of Google in the late 90’s from The Scout Report, an academic mailing list reviewing all that is new in the world of research. Google was their favorite web search engine. Since then I have become part of what I can only call the Google Generation. And it was a generational phenomenon. A whole generation of web surfers that learnt to go without thinking to Google to find, well, whatever.

Last week I was at an Apple iPhone event in London and chatted to a developer who made a sailing application for the iPhone that, as one of its functions, tells you the tides anywhere in the world. Then a thought struck me - here is an application I would use just for going on a day out to the beach. Open it, see when the tide is low or high, and time my trip for low tide. More than that, here is an application that gives me the information I want immediately that would normally take several minutes of fumbling through Google and poorly organised web sites. And this is only one example out of tens of thousands of possible applications that I can find to make my life that little bit easier. Bit by bit, I am turning from a Google searcher to an iPhone App user. Already I have stopped searching Google for restaurants, films, directions, tides, weather, etc.

What I think we are witnessing now is a new generational change. There is an explosion of iPhone app creation and usage (100,000 apps, 1 billion downloads and growing), much like the explosion of users that Google experienced. And on top of that, it is coming with a rock-solid business model and without the irritation of adverts. Now a whole new generation of internet users, including me, are looking more to Apple’s App Store for information than to Google. I’ll call us the “There’s an App For That” generation.

Google cannot control the data they index and the web sites have little or no financial incentive to improve the presentation of that data. On the other hand App writers get paid directly by each user. The result? More and better applications presenting information that is already available on the web and thousands, maybe millions, of users willing and able to pay a dollar for that App. In short the Apps are getting better and the web sites remain just as poor. It is gradually getting easier for me to buy an application that gives me the information I want than to search Google.

For the future it means that web search and especially advertising sponsored web search will become irrelevant and Apple’s iPhone and App Store will become Google’s main competitor. Or maybe Apple will be the disruptive change that will push Google into the sidelines in the same way that Google pushed Yahoo! out of its way.

Most opinions I have read about why Google spent millions on its mobile OS, Android, only to give it away are around getting mobile advertising dollars to keep flowing to Google. I disagree. The mobile market is changing and it is moving away from web sites that get paid for by advertising. And the mobile App market is beginning to eat into the web search market. I believe this is a fight for survival. Browsers and publishers now have a way to pay and charge for information directly. Advertisers now have to change their relationship from the huge company that pushes the adverts onto the web to the small companies that publish the data. I do not see a place in there where Google would fit easily.

A Tale of Two Tweets
or how Twitter broke the bidi algorithm.

For years I have been trying to explain to anyone who would listen that the Unicode Bidi Algorithm has a fundamental flaw. The problem was that I did not have strong practical examples… and then along came Twitter.

My complaint is that Unicode bidi considers most characters that are not letters or numbers as neutrals and, in many cases, this is not correct. A neutral takes its direction from the surrounding strong character or the dominant direction. e.g.:

[Arabic Letter][neutral][Arabic letter] --> [right][right][right]
[Arabic Letter][neutral][English letter] --> [right][right][left] if the direction is Arabic
                                          or [right][left][left] if the direction is English.
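The rule in these examples can be written down in a few lines of Python. This is a deliberately simplified sketch: it knows only strong left ("L"), strong right ("R") and neutral ("N") classes, and ignores numbers, weak types and embedding levels, all of which the real algorithm handles.

```python
def resolve_neutrals(classes, base):
    """Resolve 'N' entries in a list of 'L'/'R'/'N' direction classes.

    A run of neutrals takes the direction of its neighbours when they
    agree, otherwise the base direction. The start and end of the text
    count as having the base direction.
    """
    resolved = list(classes)
    i = 0
    while i < len(resolved):
        if resolved[i] != "N":
            i += 1
            continue
        j = i
        while j < len(resolved) and resolved[j] == "N":
            j += 1
        before = resolved[i - 1] if i > 0 else base
        after = resolved[j] if j < len(resolved) else base
        resolved[i:j] = [before if before == after else base] * (j - i)
        i = j
    return resolved


print(resolve_neutrals(["R", "N", "L"], base="R"))
print(resolve_neutrals(["R", "N", "L"], base="L"))
```

Note that nothing in this rule looks at what the neutral character is, only at what surrounds it.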

This is all OK if the algorithm is placing commas and periods in sentences. And, Unicode bidi also takes care of lots of special cases. e.g. when to treat the period (U+002e) as a full stop or a decimal separator.

But the usage of punctuation varies over time. Arabic users may use slashes as a date separator today and switch to hyphens in the future. Then Twitter presents a whole new ball game. The ‘:’, ‘@’ and ‘#’ are redefined as textual letters which should take the direction from the word they start, e.g. #bidi or @ironymark. The recommended way to correct this is by adding directional characters (e.g. left-to-right mark, U+200e). But you cannot do this without eating into the 140 character limit, causing problems with search or breaking the myriad of Twitter tools out there. On top of this Twitter does not define a right-to-left interface, so users must fend for themselves. This causes the following problem:

Screen shot 2009-10-29 at 00.00.40.png
from @GVinarabic.

The above entry has three main elements - a title of a blog post in Arabic, the words “translated by” followed by the @name of the translator and the URL. Simple? No. Since the direction is left-to-right the user has typed the text in the following order:

@nightS [the title in Arabic] - [translated by] - [the URL]

i.e. the last word in the tweet was written first to make the text look correct. Then in another case we see:

Screen shot 2009-10-29 at 00.09.56.png

Now the user typed: [title] - @MuhammadAdel : [translated by] - [URL]

Here the order of the text was modified to force bidi to give the right visual result when the tweet fills two lines. Any future attempt to translate these tweets or search for “translated by @nightS” will fail.

If one forces the direction to right-to-left there are other problems:

Screen shot 2009-10-29 at 08.04.14.png

The @ has been separated from @gr33ndata and put at the other end of the tweet. This is actually the thin end of the wedge. URLs can get reordered to become unreadable and things will get nastier in more complicated tweets that are re-tweeted or refer to popular # tags. e.g.

RT: @tweeter1 @tweeter2 #arabic [some short message in Arabic] [URL]

Here the right-to-left reader would want to see:

[URL] [some short message in Arabic] #arabic @tweeter2 @tweeter1 :RT

To get the right effect I need the following markup:

  <div dir="rtl">
    <span dir="ltr">RT</span>&rlm;:
    <span dir="ltr">@tweeter1</span>&rlm;
    <span dir="ltr">@tweeter2</span>&rlm;
    <span dir="ltr">#arabic</span>&rlm;
    [some short message in Arabic] &rlm;
    <span dir="ltr">[URL]</span>
  </div>
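Markup like this can be generated mechanically. Here is a minimal Python sketch (the function names are hypothetical) that wraps every token without right-to-left letters in its own left-to-right span, keeps trailing punctuation outside the span, and follows each span with `&rlm;`. Unlike the hand-written markup above it also adds `&rlm;` after the final URL, which is harmless.

```python
import html
import unicodedata


def has_rtl(text):
    """True if the text contains a strong right-to-left character."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def mark_up_tweet(tweet):
    """Wrap each left-to-right token of a tweet in its own LTR span."""
    parts = []
    for token in tweet.split():
        core = token.rstrip(":,;.!?")
        if has_rtl(token) or not core:
            parts.append(html.escape(token))
            continue
        tail = token[len(core):]   # trailing punctuation stays outside the span
        parts.append(f'<span dir="ltr">{html.escape(core)}</span>&rlm;{tail}')
    return '<div dir="rtl">' + " ".join(parts) + "</div>"


print(mark_up_tweet("RT: @tweeter1 #arabic مصر http://example.com/x"))
```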

Rewiring the bidi algorithm just for Twitter is a non-starter. So one must solve this at the display end. It can only be done by injecting spans and right-to-left marks. But we now have a rather messy collection of markup and injected direction characters which can lead to all sorts of problems once you have to handle these in a text editor or convert between markup and plain text.

This is possible, but to do it right one must implement an algorithm that can be ported to every web language and every text editor that handles Twitter, then have it standardised across the industry, etc.

The real solution is, at some point, for a future HTML to define simple markup that gives better control of how spans are ordered without being forced to inject &rlm; characters. That will be the subject of a future post.

Mashing Up Bidi - Unicode Conference Slides

At the 33rd Unicode Conference I gave a presentation around the problems I encountered in my previous post Mashing Up Bidi. Here are the summary and slides:

Mashing-up Bi-Di

“Mash-up” is a relatively new, fashionable word on the Web - taking bits of other web sites to build up your own web page. It is not new or special - any search engine showing a snippet of a web site that it has found is a form of mash-up. Integrating a news or micro-blogging feed is another. And it seems that every company and their mother has its own mash-up API. But what happens when you have an Arabic web-site integrate content that may be Arabic or English or both? The Unicode Bi-Di Algorithm can render text and numbers unreadable. URLs may become unusable or, in the worst case, direct to fraudulent sites. It can be hard to predict how to mark-up the integrated content for the right result. This presentation will cover real world issues and attempt to suggest practical solutions.

The Unicode Bi-Di Algorithm has been a great benefit for software in general. It provides a unified way for rendering mixed right-to-left (e.g. Arabic) and left-to-right (e.g. English) text across all kinds of software and devices. However, if the original direction of a piece of text is lost, applying the wrong direction may render the text unreadable. This is especially a problem on the web.

Text that appears on a web site may have passed through many stages before being rendered in its final place and can easily lose the markup specifying the direction when it was initially created. For example, a search engine providing a single sentence from the web sites it chooses for a query; a site integrating a list of the statuses of friends on a social networking site; a blog displaying ‘trackbacks’ to its posts.

Web browsers use the Bi-Di Algorithm to order the text they display, and displaying English text in a right-to-left context, or Arabic in a left-to-right one, can make the text hard to read. Some web companies make a heroic effort to correctly align content in their Arabic web sites - but they still get it wrong. In a world where people are increasingly mashing up their web pages, what is the solution for BiDi languages?

I will try to answer this by suggesting the additional mark-up to handle such cases correctly and the kind of methods that can be employed to recognize the text’s correct direction.

  1. At the lowest level there needs to be a parser to spot URLs and wrap these correctly. A parser to spot brackets and make sure the open bracket matches the direction of the close bracket.
  2. The next level up would be a standard way to guess if a stream of text or HTML is primarily right-to-left or left-to-right.
  3. And the last level is agreed standards for Mash-up APIs, XML feeds, XSL transforms that define the intention of the creator of the content.
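To make the second level concrete, here is one possible heuristic sketched in Python: count the strong right-to-left and left-to-right characters and pick the majority. This is an assumption about what such a standard might do, not an existing one, and it expects plain text (HTML tags would have to be stripped first).

```python
import unicodedata


def guess_direction(text):
    """Guess 'rtl' or 'ltr' by counting strong directional characters."""
    rtl = ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            rtl += 1
        elif bidi_class == "L":
            ltr += 1
    # Ties and text with no strong characters fall back to left-to-right.
    return "rtl" if rtl > ltr else "ltr"


print(guess_direction("Arabic language - Wikipedia"))
print(guess_direction("اللغة العربية - Wikipedia"))
```

A majority count is only one choice; taking the direction of the first strong character is another, and the two can disagree on mixed text.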

The presentation will conclude with a proposal for a standard approach - hopefully one that can be supported by all the web sites.

Comments?

Evil Potato Head from Outer Space

Be afraid, be very afraid…

Evil potato head from outer space

Banging my head on Pipes

Preparing my Unicode presentation I thought it would be really nice to show a simple demonstration. The topic of my presentation is “Mashing Up Bidi”, about just how messy it can be to mix Arabic and English content in a mashed up web page. The idea is to take an English Twitter or RSS feed and put it in an Arabic page, then see what needs to be done to keep it readable.

So I threw together a simple portal demo, made it right-to-left and decided to use Yahoo! Pipes to handle the manipulation of the text in the feed. Pipes has a really nice feature that lets it call your own web service to do some of the text handling.

The web service is simplicity itself. Pipes sends JSON in an HTTP POST and your service returns the modified JSON data in reply. Nothing could be easier, I thought. I was wrong. First I wrote a simple PHP service that receives the HTTP POST, calls json_decode (which is now a standard feature of PHP 5), plays with the data and then calls json_encode to return the results:

<?php
// Pipes posts the feed as JSON in the "data" form field.
$json = json_decode($_POST['data']);
foreach ($json->items as $item) {
    // do the text manipulation here
}
header('Content-Type: application/json');
print json_encode($json);
?>

It all looked OK until I saw that URLs, HTML fields and, worse, all non-ASCII Unicode were being converted into mojibake once they re-entered the pipe. Pipes has really nice testing features - and it lets you check the output in many ways. So I made the pipe output JSON, and compared it to the output of the PHP service. Both looked the same. So maybe there was a bug in the communication between the pipe and the PHP service. I sent the POST data straight back in the PHP output and it worked. So the comms were OK. Maybe there was a problem decoding the JSON. So I explicitly set a field to my own string containing a URL. It failed. So the JSON going in works, the converted JSON looks OK but fails, and all seems to be OK in the PHP. At this point I started to get a little frustrated.

Next step - check the data sent to the web service. One of the really nice features of Pipes is its excellent debugging. So I returned the HTTP POST data in my PHP service, but this time inserted an open quote (”) to force Pipes to output an error with the actual JSON data it gets back.

Now I noticed a difference. When Pipes sends JSON to a web service it does not escape it in any way - you just get raw UTF-8 characters. When PHP encodes JSON it escapes slashes and all non-ASCII characters (e.g. “/” gets written as “\/”, and non-ASCII characters as “\uXXXX” sequences, so that the result can be parsed safely). However, Pipes makes no attempt to read or convert these escapes, so they just become part of the text, causing instant mojibake.
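The two styles are easy to see side by side. This short sketch uses Python's standard `json` module rather than PHP (the sample item is made up) to show that the escaped and raw forms are different text but decode to the same data in a parser that handles escapes:

```python
import json

item = {"title": "مصر", "link": "http://example.com/a"}

escaped = json.dumps(item)                    # non-ASCII written as \uXXXX
raw = json.dumps(item, ensure_ascii=False)    # raw Unicode characters kept

assert "\\u0645" in escaped and "مصر" in raw
# A conforming parser sees the same data either way, including "\/".
assert json.loads(escaped) == json.loads(raw) == item
assert json.loads('"http:\\/\\/example.com"') == "http://example.com"
```

A consumer that skips the unescaping step would treat the backslash sequences as literal text.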

So, time to look for a more suitable JSON encoder. First, I saw that json_encode takes a parameter to control the output encoding - but not in my version of PHP - and the controls were limited to a few characters. So I used this really useful page comparing the output of various PHP encoders. Zend seemed to do the job for me.

I installed Zend, made sure it used its own encoder rather than PHP’s internal one, and called it like this:

<?php
require_once 'Zend/Json.php';
$json = json_decode($_POST['data']);
foreach ($json->items as $item) {
    // do the text manipulation here
}
header('Content-Type: application/json');
// Use Zend's own PHP encoder instead of the native json_encode().
Zend_Json::$useBuiltinEncoderDecoder = true;
print Zend_Json::encode($json);
?>

Now the URLs were OK but Unicode was still messed up. Someone had ‘fixed’ Zend to insert escapes. No problem, it’s all PHP, so I edited Zend/Json/Encoder.php, commented out the line that encodes Unicode and presto - I have a working web service.

What really annoys me is that it is 2009 and you still have to jump through hoops to get anything with text just slightly more complex than plain ASCII to work. So whose fault is it? Well Yahoo! has to take a big share of the blame. They are one of the most active supporters of JSON and do not parse it properly inside of their own services.

But really it is the laxness of the JSON standard (and most other standards) when it comes to text. There should be only one way to pass a string and that is either plain Unicode or encoded ASCII. Not both.

Maren, Microsoft’s Final Insult

Now don’t get me wrong. I like Microsoft and they can come up with some really good technology. But you just have to wonder about their marketing department sometimes.

Their latest Arabic effort, Maren, is arguably a nice tool (albeit not an original idea) that intelligently converts transliterated Arabic typed on an English keyboard into real Arabic on the fly.

Then MS marketing gets hold of it and adds an introductory video that is, frankly, insulting. It starts with an inexplicable sequence of a person who cannot read chatting online with someone who cannot type. The video goes on to recommend forgetting Arabic keyboards and using Maren English for everything - Email, Word, IM. And in a really patronising way - as if to say “You poor Arabs, stop worrying about your really difficult language and use English instead”. Given that Arabs are generally touchy about the subject of American abuses to a few of their countries, one of their religions and culture in general, Microsoft marketing are simply fanning flames and the negative feedback starts flooding in. Here is one blogger:

Promoting Maren was not in the right way. I’m blogging in English and I chat with some friends in Franco-Arab way but I really care about Arabic Content and promoting the proper content to users, But Maren Video Demonstration didn’t show that It’s helping users to use Arabic Letters or Franco-Arab but It just says Screw Arabic letters .. Write in Roman Characters and we will convert to Arabic .

And a Twitter thread:

@Lastoadri: @Zeinobia well.. we shld preserve our language & force ourselves to write in Arabic letters. we can use yamli, t3reeb.. etc for quick things (13:45)

@Lastoadri: Maren is like forcing Arabs to type Arabic in latin letters. In few yrs we will have the new Turkish.. the video is so disrespectful. (13:44)

@Zeinobia: @Lastoadri why ?? I think it is good for those who do not understand Arabic English typing thing (13:27)

@Lastoadri: I feel the new Microsoft Maren program is like an insult for Arabs http://www.microsoft.com/middleeast/egypt/cmic/maren/ (13:23)

@afahad: after seeing Maren I am convinced us arabs must be the laziest nation in the world! If you do most of your typing in arabic, learn how to! (13:06)

I can only guess that the MS marketing team in Cairo were the same people who thought up Bob. I just hope they learn from their mistakes.

Mashing-up Bi-Di

Mash-ups are a relatively new fashion in web design - taking bits of other web sites to build up your own web page. It is not new or special - any search engine showing a small summary of a web site that it has found is a form of mash-up. Integrating an RSS feed is another. And it seems that every company and their mother has its own mash-up API. But what happens when you have an Arabic web-site integrate content that may be Arabic or English or both? It can be hard to predict how to mark-up the integrated content for the right result. Here is a good example. Google makes a heroic effort to correctly align content in its Arabic web sites - but they still get it wrong.

For example - here is the result of a search for “Arab” in the Arabic google.ae:

As you see - the green web addresses look good, the text of the first result is right-to-left (note the “…” comes visually after the last Arabic word). Full marks for the second result - the English text is correctly ordered left-to-right (note the “…” is visually after the last English word).

But something has gone wrong with the third result. Let’s look at it more closely with the HTML below…

As you can see from the mark-up, the “…” is the last character in the snippet but appears in the middle of the line. The Arabic word “مصر” is drawn in a completely different location from the place where it appears logically. “ArabChat.com” should be the first word in the title. What has happened is that Google has flagged this site as an Arabic site and allowed the Unicode BiDi algorithm to treat it as right-to-left text, which gives confusing results.

Things can get even more confusing if the text contains mixed Arabic and English, as this result from the mobile edition of Google shows:

I defy anyone to work out what the second result actually says. It should be:

2 Arabic language - Wikipedia, the free encyclopedia - 5 Dec 2008 … Arabic (العربية al-ʿarabiyyah; less formally: عربي ʿarabi) … Standard Arabic ( - en.wikipedia.org/wiki/Arabic_language

So in a world where people are increasingly mashing-up their web pages what is the solution for BiDi languages? As one can see it is not always possible to know beforehand if the piece you are integrating is primarily English or Arabic. ArabChat.com is an Arabic web site yet the snippet found was in English. And when the Unicode BiDi algorithm is left to its own devices in the wrong primary direction the result can be unreadable.

I think there needs to be additional mark-up to handle these correctly. And this needs to be a standard approach  - one that is supported by all the web companies. What I suggest is treating this in layers.

  • At the lowest level there needs to be a parser to spot URLs and wrap these correctly. A parser to spot brackets and make sure the open bracket matches the direction of the close bracket. etc.
  • The next level up would be a standard way to guess if a stream of text or HTML is primarily right-to-left or left-to-right.
  • And the last level is agreed standards for Mash-up APIs, XML feeds, XSL transforms that define the intention of the creator of the content.

I hope to cover this in my following posts. And please let me have your thoughts in the comments.

So what is the ‘line-height’?

Al-Jazeera is one of the more popular Arabic web pages, so any browser that claims Arabic support should be able to render it correctly. However, here is what happens on a Mac (click on the picture to see it full size):

Three problems each with a different reason.

1/ Jumping Content: this is the web site’s fault. They only format the site for one font and expect every browser to have a matching font of exactly the same size.

2/ Back-to-Front Brackets: This is the browser’s fault. I don’t know… every time Firefox comes up with an update they break something with Bi-Di.

3/ Lines too Close: now this is interesting. What is happening is that the web site’s CSS specifies:

line-height: 100%

Now this should not make the lines clash with each other, and if I change the line-height to “normal” everything is OK. So why is this happening? Let’s look at the W3C specification for line-height:

‘line-height’

Value:      normal | <number> | <length> | <percentage> | inherit
Initial:      normal
Applies to:      all elements
Inherited:      yes
Percentages:      refer to the font size of the element itself
Media:      visual

If the property is set on a block-level element whose content is composed of inline-level elements, it specifies the minimal height of each generated inline box.

If the property is set on an inline-level element, it specifies the exact height of each box generated by the element. (Except for inline replaced elements, where the height of the box is given by the ‘height’ property.)

Values for this property have the following meanings:
normal

Tells user agents to set the computed value to a “reasonable” value based on the font size of the element. The value has the same meaning as <number>. We recommend a computed value for ‘normal’ between 1.0 to 1.2. …

<percentage>

The computed value of the property is this percentage multiplied by the element’s computed font size. Negative values are illegal.

Firstly, the Arabic font here is Geeza which is the standard Arabic fall-back font in Mac OS X. The line-height is based on the font size. As Geeza has lower descenders it is clashing with the lines below. So the browser is setting the line-height assuming an English font. I can think of two problems here:

  • The CSS description of line-height is too vague, meaning that browser makers will define what 100% means based on the needs of their English-language customers.
  • The other is that CSS specifies that this should only be a minimum - but the browsers are treating it as an absolute and forcing a line height that the font will not fit.
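The arithmetic behind the clash is simple. In this Python sketch the font metrics are made-up figures for illustration (they are not Geeza's real metrics), and the function name is hypothetical: a percentage line-height is multiplied by the font size, and anything the glyphs need beyond that spills into the neighbouring lines.

```python
def line_box_overlap(font_size_px, line_height_pct, ascent_em, descent_em):
    """Return how many pixels of glyph extent spill outside one line box.

    ascent_em and descent_em are the font's extents above and below the
    baseline as fractions of the font size (hypothetical figures below).
    """
    line_height = font_size_px * line_height_pct / 100
    glyph_extent = font_size_px * (ascent_em + descent_em)
    return max(0.0, glyph_extent - line_height)


# Made-up metrics: a font that fits within 1em, and a font with deeper
# descenders that needs about 1.3em.
print(line_box_overlap(16, 100, 0.8, 0.2))
print(line_box_overlap(16, 100, 0.85, 0.45))
```

With the first set of figures 100% is just enough; with the second, each line spills a few pixels into the next.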

Apart from all this, fonts do a very bad job of sticking to any standard convention for line-height, so the blame goes all around.