At the 33rd Unicode Conference I gave a presentation about the problems I encountered in my previous post, Mashing Up Bidi. Here are the summary and slides:

Mashing-up Bi-Di

"Mash-up" is a relatively new, fashionable word on the Web: taking bits of other web sites to build up your own web page. The idea is not new or special - any search engine showing a snippet of a web site that it has found is a form of mash-up. Integrating a news or micro-blogging feed is another. And it seems that every company and their mother has its own mash-up API. But what happens when an Arabic web site integrates content that may be Arabic or English or both? The Unicode Bi-Di Algorithm can render text and numbers unreadable. URLs may become unusable or, in the worst case, direct to fraudulent sites. It can be hard to predict how to mark up the integrated content for the right result. This presentation will cover real-world issues and attempt to suggest practical solutions.

The Unicode Bi-Di Algorithm has been a great benefit for software in general. It provides a unified way for rendering mixed right-to-left (e.g. Arabic) and left-to-right (e.g. English) text across all kinds of software and devices. However, if the original direction of a piece of text is lost, applying the wrong direction may render the text unreadable. This is especially a problem on the web.

Text that appears on a web site may have passed through many stages before being rendered in its final place, and it can easily lose the markup that specified its direction when it was initially created. Examples include a search engine showing a single sentence from each web site it chooses for a query, a site integrating a list of the statuses of friends on a social networking site, and a blog displaying ‘trackbacks’ to its posts.

Web browsers use the Bi-Di Algorithm to order the text they display, and displaying English text in a right-to-left context, or Arabic in a left-to-right one, can make the text hard to read. Some web companies make a heroic effort to correctly align content on their Arabic web sites - but they still get it wrong. In a world where people are increasingly mashing up their web pages, what is the solution for Bi-Di languages?

I will try to answer this by suggesting the additional mark-up to handle such cases correctly and the kind of methods that can be employed to recognize the text’s correct direction.

  1. At the lowest level there needs to be a parser that spots URLs and wraps them correctly, and that spots brackets and makes sure the open bracket matches the direction of the close bracket.
  2. The next level up would be a standard way to guess if a stream of text or HTML is primarily right-to-left or left-to-right.
  3. And the last level is a set of agreed standards for mash-up APIs, XML feeds and XSL transforms that record the intention of the creator of the content.
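
To make the first two levels a little more concrete, here is a minimal sketch in Python. It is not the approach proposed in the talk, only one possible illustration: the function names `guess_direction` and `wrap_snippet` and the `URL_PATTERN` regular expression are hypothetical, the URL matching is deliberately naive, and the direction guess is a simple majority count of strong characters (using the bidi classes reported by the standard `unicodedata` module) rather than any agreed standard.

```python
import html
import re
import unicodedata

# Bidi classes of "strong" characters in the Unicode Bi-Di Algorithm.
RTL_CLASSES = {"R", "AL"}   # Hebrew-style and Arabic-style letters
LTR_CLASSES = {"L"}         # Latin and other left-to-right letters

# Deliberately naive URL pattern; an assumption for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def guess_direction(text, default="ltr"):
    """Guess 'rtl' or 'ltr' by counting strong characters outside URLs."""
    # URLs are Latin even inside Arabic text, so leave them out of the count.
    stripped = URL_PATTERN.sub(" ", text)
    rtl = ltr = 0
    for ch in stripped:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            rtl += 1
        elif bidi_class in LTR_CLASSES:
            ltr += 1
    if rtl == ltr:
        # No strong characters, or a tie: fall back to the caller's default.
        return default
    return "rtl" if rtl > ltr else "ltr"


def wrap_snippet(text):
    """Return an HTML fragment with an explicit direction on the snippet,
    and with every URL wrapped in its own left-to-right span."""
    parts = []
    last = 0
    for match in URL_PATTERN.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append('<span dir="ltr">' + html.escape(match.group(0)) + "</span>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return '<span dir="%s">%s</span>' % (guess_direction(text), "".join(parts))


if __name__ == "__main__":
    print(wrap_snippet("see http://example.com"))
    print(guess_direction("مرحبا http://example.com/a/long/path"))
```

Counting only the characters outside URLs means a long Latin URL does not outvote a short Arabic sentence. A real implementation would also need to handle brackets, numbers and nested markup, which this sketch ignores.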

The presentation will conclude with a proposal for a standard approach - hopefully one that can be supported by all the web sites.

Comments?