A while back Apple asked me to improve the Arabic system font in Mac OS X (the result is now available in OS X 10.6). The requirements were daunting, to say the least. The font had to support the full Unicode Arabic range, presentation forms and all. This meant adding some 1500 glyphs to my font, together with the relevant AAT tables for Arabic shaping, ligatures, justification and kerning. My approach was to automate this as much as possible, so I wrote a tool that generates all the required Arabic glyphs (about 1700 in total) from a set of about 90 basic shapes and 20 kinds of dots. But that will be the subject of another post. This post is about the Unicode data files and naming conventions the tool depends on.
Very technical post warning (eyes should glaze over at this point) …
For the tool to work in a generic way it relies heavily on the naming convention of Unicode characters. The tool reads the Unicode data file “ArabicShaping.txt” and for each line e.g.
068F; DAL WITH 3 DOTS ABOVE DOWNWARD; R; DAL
the name, “DAL WITH 3 DOTS ABOVE DOWNWARD”, is processed as tokens e.g.:
<ARABIC LETTER DAL>, <ABOVE>, and <THREE DOTS DOWNWARDS>
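To make the tokenising step concrete, here is a minimal Python sketch of it. It is not the tool's actual code: the function names, the `NUMBER_WORDS` table and the `POSITIONS` set are hypothetical, and it only covers names of the simple "X WITH ..." form.

```python
# Sketch only: NUMBER_WORDS and POSITIONS are a hypothetical, partial vocabulary.
NUMBER_WORDS = {"2": "TWO", "3": "THREE", "4": "FOUR"}
POSITIONS = {"ABOVE", "BELOW"}


def parse_shaping_line(line):
    """Split one ArabicShaping.txt record into its four fields."""
    code, name, joining_type, joining_group = [f.strip() for f in line.split(";")]
    return int(code, 16), name, joining_type, joining_group


def tokenize(name):
    """Break a schematic name into base letter, position and mark tokens."""
    base, _, modifier = name.partition(" WITH ")
    tokens = ["ARABIC LETTER " + base]
    if not modifier:
        return tokens  # a plain letter such as "DAL" has no marks
    words = modifier.split()
    position = [w for w in words if w in POSITIONS]
    # Everything that is not a position word describes the mark itself.
    mark = [NUMBER_WORDS.get(w, w) for w in words if w not in POSITIONS]
    return tokens + position + [" ".join(mark)]


record = parse_shaping_line("068F; DAL WITH 3 DOTS ABOVE DOWNWARD; R; DAL")
tokens = tokenize(record[1])
```

For the sample line, `tokens` comes out as the base letter, the position, and the mark description, in that order.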
The tool then interprets these tokens and outputs an XML file for input to another program that will generate the new glyphs. The above example is composed of: U+062F (Dal) combined with a glyph named “threeDotsDown” and the combining glyph is placed above the main glyph. In XML it looks like this:
<newGlyph name="u068f.dalWith3DotsAboveDownward" Unicode="U+068f">
<pieceGlyph glyphRefID="dal">
<position X="0" Y="0"></position>
</pieceGlyph>
<pieceGlyph glyphRefID="threeDotsDown" linkToPrior="yes">
<position X="3" Y="5" useZones="yes"></position>
</pieceGlyph>
</newGlyph>
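As a rough illustration of how such a record could be written out, the sketch below builds the same structure with Python's standard `xml.etree.ElementTree`. The element and attribute names are taken from the sample above; the function `build_new_glyph` and its parameters are hypothetical, and the mark offsets are simply passed in rather than computed.

```python
import xml.etree.ElementTree as ET


def build_new_glyph(name, codepoint, base_glyph, mark_glyph, mark_x, mark_y):
    """Build a <newGlyph> element from a base glyph and one combining piece."""
    new_glyph = ET.Element(
        "newGlyph", {"name": name, "Unicode": "U+%04x" % codepoint}
    )
    # The base shape sits at the origin.
    base = ET.SubElement(new_glyph, "pieceGlyph", {"glyphRefID": base_glyph})
    ET.SubElement(base, "position", {"X": "0", "Y": "0"})
    # The mark is linked to the prior piece and positioned relative to it.
    mark = ET.SubElement(
        new_glyph, "pieceGlyph", {"glyphRefID": mark_glyph, "linkToPrior": "yes"}
    )
    ET.SubElement(
        mark, "position", {"X": str(mark_x), "Y": str(mark_y), "useZones": "yes"}
    )
    return new_glyph


glyph = build_new_glyph(
    "u068f.dalWith3DotsAboveDownward", 0x068F, "dal", "threeDotsDown", 3, 5
)
xml_text = ET.tostring(glyph, encoding="unicode")
```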
ArabicShaping.txt also specifies the joining - in this case “R” (right joining). So the tool will also look for a glyph that makes the final form of the Dal and creates a similar combination.
Finally the tool exports three more files to define the Arabic shaping rules, kerning and justification.
To get to my point:
The data files produced by the Unicode Consortium and the naming conventions used are almost good enough for this kind of machine processing, but they need some modification. I noticed a number of inconsistencies and some missing information that complicate this. Below are examples of the problems:
1/ Some of the names for parts of a glyph are inconsistent:
e.g. characters in the Arabic (06xx) area use the word “VERTICAL” while the Supplemental Arabic (07xx) area uses “VERTICALLY” in the same context.
compare: 067A; TEH WITH 2 DOTS VERTICAL ABOVE; D; BEH
and 076B; REH WITH 2 DOTS VERTICALLY ABOVE; R; REH
the same for BAR and STROKE and DOWN and DOWNWARD, etc, … I have a long list.
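One way a tool can work around this is to fold the variants onto a single spelling before tokenising. The sketch below assumes a hand-maintained table; `CANONICAL` and `normalize_name` are hypothetical names, the choice of which spelling counts as canonical is arbitrary, and the table only covers the pairs mentioned above.

```python
# Hypothetical synonym table: variant spelling -> spelling chosen as canonical.
CANONICAL = {
    "VERTICALLY": "VERTICAL",
    "DOWNWARD": "DOWN",
    "DOWNWARDS": "DOWN",
    "STROKE": "BAR",
}


def normalize_name(name):
    """Replace each known variant word in a schematic name with its canonical form."""
    return " ".join(CANONICAL.get(word, word) for word in name.split())


arabic = normalize_name("TEH WITH 2 DOTS VERTICAL ABOVE")
supplement = normalize_name("REH WITH 2 DOTS VERTICALLY ABOVE")
```

After normalising, both names describe the dots with the same words, so one rule can handle both blocks.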
2/ There were some unclear names:
e.g. the difference between 06A2; FEH WITH DOT MOVED BELOW and 06A3; FEH WITH DOT BELOW is difficult to process programmatically.
3/ The joining group is inconsistent for some of the characters:
ArabicShaping.txt has 4 fields for each listed character: Unicode - Schematic Name - Joining Type - and Joining Group.
e.g.: 06C3; TEH MARBUTA GOAL; R; HAMZA ON HEH GOAL
The joining group is: HAMZA ON HEH GOAL. But when I look at the HAMZA ON HEH GOAL character, its joining group is HEH GOAL. So why not just have: 06C3; TEH MARBUTA GOAL; R; HEH GOAL ?
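A check for this kind of indirection is easy to sketch. The Python below flags any record whose joining group names another listed character that itself belongs to a different group. The function name is hypothetical, and the three sample records are written out by hand in the file's four-field format; the code points and joining types for HEH GOAL and HAMZA ON HEH GOAL are my assumption, not copied from the file.

```python
# Hand-written sample in ArabicShaping.txt format (first two records are assumed).
JOINING_SAMPLE = """\
06C1; HEH GOAL; D; HEH GOAL
06C2; HAMZA ON HEH GOAL; D; HEH GOAL
06C3; TEH MARBUTA GOAL; R; HAMZA ON HEH GOAL
"""


def indirect_joining_groups(text):
    """Return records whose joining group is a character in some other group."""
    records = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, name, _joining_type, group = [f.strip() for f in line.split(";")]
        records[name] = (code, group)
    findings = []
    for name, (code, group) in records.items():
        # Flag groups that resolve one step further to a different group.
        if group in records and records[group][1] != group:
            findings.append((code, name, group, records[group][1]))
    return findings


findings = indirect_joining_groups(JOINING_SAMPLE)
```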
4/ Working out the names of combiners that are transparent to shaping is tricky:
My tool also generates the font shaping information. But for this to work I need to know what characters are combining accents and therefore transparent to shaping. However the Unicode data files make my life more difficult than it needs to be. In order to do this I would need to parse a file called DerivedJoiningType.txt, extract the Unicode values for characters with “Joining_Type=Transparent” then search UnicodeData.txt for more information on these characters. It would be much easier if ArabicShaping.txt listed accents as well.
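For reference, the two-step lookup described above might look something like this in Python. This is a sketch under assumptions: the sample lines are illustrative and assume the usual UCD layout of a code point or range, a semicolon, then the value (the sketch accepts either "T" or "Transparent"), and Python's built-in `unicodedata` module stands in for searching UnicodeData.txt.

```python
import unicodedata

# Illustrative lines in the assumed DerivedJoiningType.txt layout.
DERIVED_SAMPLE = """\
# Joining_Type=Transparent
064B..0652    ; T
0670          ; T
0628          ; D
"""


def transparent_codepoints(text):
    """Collect code points whose joining type is Transparent."""
    found = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # strip comments
        if not data:
            continue
        span, value = [f.strip() for f in data.split(";")]
        if value not in ("T", "Transparent"):
            continue
        # A span is either a single code point or a "first..last" range.
        first, _, last = span.partition("..")
        found.extend(range(int(first, 16), int(last or first, 16) + 1))
    return found


transparent = transparent_codepoints(DERIVED_SAMPLE)
# Second step: look up the full character name for each code point.
transparent_names = {cp: unicodedata.name(chr(cp)) for cp in transparent}
```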
5/ Character names across Unicode files are inconsistent:
It is hard to go from the character name in UnicodeData.txt to the character name in ArabicShaping.txt.
e.g. in UnicodeData you have:
0622; ARABIC LETTER ALEF WITH MADDA ABOVE; Lo; 0; AL; 0627 0653; ; ; ; N; ARABIC LETTER MADDAH ON ALEF; ; ; ;
in ArabicShaping you have:
0622; MADDA ON ALEF; R; ALEF
For a number of reasons I use the name in ArabicShaping.txt to name the glyphs in my font. But I need some way to map the glyph name to the character name in UnicodeData.txt. I suggest that the best way is to add a field to ArabicShaping.txt that gives the full Unicode name.
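Until such a field exists, the only reliable link between the two names is the code point. A minimal Python sketch of that join follows; it uses the standard `unicodedata` module as a stand-in for parsing UnicodeData.txt, and the function name `full_unicode_name` is hypothetical.

```python
import unicodedata


def full_unicode_name(shaping_line):
    """Map an ArabicShaping.txt record to its full character name via the code point."""
    code = shaping_line.split(";")[0].strip()
    return unicodedata.name(chr(int(code, 16)))


full_name = full_unicode_name("0622; MADDA ON ALEF; R; ALEF")
```

Here the schematic name "MADDA ON ALEF" never enters the lookup at all, which is exactly the problem: the names themselves cannot be matched, only the code points.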
In conclusion
I think it is possible to address these inconsistencies by modifying one file, ArabicShaping.txt. This would make processing Arabic Unicode fonts that little bit easier and, who knows, maybe there will be more fonts that fully support Arabic Unicode. Also, new characters are in the pipeline for Arabic. I hope these notes help make sure they are defined consistently.