Banging my head on Pipes
While preparing my Unicode presentation, I thought it would be really nice to show a simple demonstration. The topic of my presentation is “Mashing Up Bidi”: just how messy it can be to mix Arabic and English content in a mashed-up web page. The idea is to take an English Twitter or RSS feed, put it in an Arabic page, and then see what needs to be done to keep it readable.
So I threw together a simple portal demo, made it right-to-left and decided to use Yahoo! Pipes to handle the manipulation of the text in the feed. Pipes has a really nice feature that lets it call your own web service to do some of the text handling.
The web service is simplicity itself. Pipes sends JSON in an HTTP POST and your service returns the modified JSON data in reply. Nothing could be easier. I was wrong. First I wrote a simple PHP service that receives the HTTP POST, calls json_decode (which is now a standard feature of PHP 5), plays with the data and then calls json_encode to return the results:
<?php
$json = json_decode($_POST['data']);
for ($i = 0; $i < count($json->items); $i++) {
    // do the text manipulation here
}
header('Content-Type: application/json');
print json_encode($json);
?>
It all looked OK until I saw that URLs, HTML fields and, worse, all non-ASCII Unicode were being converted into mojibake once they re-entered the pipe. Pipes has really nice testing features and lets you check the output in many ways. So I made the pipe output JSON and compared it to the output of the PHP service. Both looked the same. So maybe there was a bug in the communication between the pipe and the PHP service. I sent the POST data right back unchanged in the PHP output, and it worked. So the comms were OK. Maybe there was a problem decoding the JSON. So I explicitly set a field to my own string containing a URL. It failed. So the JSON going in works, the re-encoded JSON looks OK but fails, and all seems to be OK in the PHP. At this point I started to get a little frustrated.
Next step - check the data sent to the web service. One of the really nice features of Pipes is its excellent debugging. So I returned the HTTP POST data from my PHP service again, but this time inserted an unmatched quote (") to force Pipes to output an error showing the actual JSON data it gets back.
Now I noticed a difference. When Pipes sends JSON to a web service it does not escape it in any way - you just get raw UTF-8 characters. When PHP encodes JSON it escapes all non-ASCII characters as \uXXXX sequences, and it also escapes some ASCII ones (e.g. “/” gets written as “\/”) so that the result can be parsed safely. However, Pipes makes no attempt to read or convert these escapes, so they just become part of the text, causing instant mojibake.
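To make the difference concrete, here is a minimal sketch of what json_encode does by default. The sample item is made up for illustration (it is not taken from my feed), and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical sample item: a URL plus an Arabic title, as UTF-8 text.
$item = array('link' => 'http://example.com/feed', 'title' => 'مرحبا');

// With no options, json_encode escapes "/" and every non-ASCII character.
$encoded = json_encode($item);
print $encoded . "\n";
// prints {"link":"http:\/\/example.com\/feed","title":"\u0645\u0631\u062d\u0628\u0627"}

// A conforming JSON parser turns the escapes back into the original text.
$decoded = json_decode($encoded, true);
print $decoded['title'] . "\n";
// prints مرحبا
```

Both forms are valid JSON for the same data, so a consumer that honours the escapes sees no difference. The trouble only starts when the consumer treats the backslash sequences as literal text.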
So it was time to look for a more suitable JSON encoder. First, I saw that json_encode takes a parameter to control the output encoding - but not in my version of PHP - and the controls were limited to a few characters. So I used this really useful page comparing the output of various PHP encoders. Zend seemed to do the job for me.
I installed Zend, made sure it did not hand the work off to PHP's internal json_encode, and called it like this:
<?php
require_once 'Zend/Json.php';
$json = json_decode($_POST['data']);
for ($i = 0; $i < count($json->items); $i++) {
    // do the text manipulation here
}
header('Content-Type: application/json');
// true makes Zend_Json use its own PHP encoder instead of PHP's json_encode
Zend_Json::$useBuiltinEncoderDecoder = true;
print Zend_Json::encode($json);
?>
Now the URLs were OK but Unicode was still messed up. Someone had ‘fixed’ Zend to insert escapes. No problem, it’s all PHP, so I edited Zend/Json/Encoder.php, commented out the line that encodes Unicode, and presto - I have a working web service.
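A note for anyone reading this with a newer PHP: from PHP 5.4 onwards, json_encode accepts the flags JSON_UNESCAPED_UNICODE and JSON_UNESCAPED_SLASHES, which should give the same raw output without patching Zend. The sketch below assumes such a PHP version; encode_raw() is a hypothetical helper name, the sample item is made up, and whether Pipes accepts the result is an assumption rather than something tested here.

```php
<?php
// Assumes PHP 5.4 or later, where these json_encode flags exist.
// encode_raw() is a hypothetical helper name used only for this sketch.
function encode_raw($value)
{
    // Leave "/" and non-ASCII characters as they are instead of escaping them.
    return json_encode($value, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
}

// Hypothetical sample item, assuming the file is saved as UTF-8.
$item = array('link' => 'http://example.com/feed', 'title' => 'مرحبا');
print encode_raw($item) . "\n";
// prints {"link":"http://example.com/feed","title":"مرحبا"}
```

Quotes, backslashes and control characters are still escaped with these flags, so the output remains valid JSON.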
What really annoys me is that it is 2009 and you still have to jump through hoops to get anything with text just slightly more complex than plain ASCII to work. So whose fault is it? Well, Yahoo! has to take a big share of the blame. They are one of the most active supporters of JSON, yet they do not parse it properly inside their own services.
But really the blame lies with the laxness of the JSON standard (and most other standards) when it comes to text. There should be only one way to pass a string, and that is either plain Unicode or encoded ASCII. Not both.

Thanks so much for the final clue to this puzzle! I’ve been banging my head against this for a few days now and found your post describing the same problems I was having with the Zend JSON encoder.
For anyone who is curious, this is the line that needs to be commented out:
$string = self::encodeUnicodeString($string);