PHP & JavaScript: UTF-16 to UTF-8



Recently I’ve been working on a PHP script that has to process a bunch of XML files (in this case imsmanifest files), but a few of them weren’t being parsed successfully.


The problem was soon quite clear: some of the files had been encoded using UTF-16, which wasn’t playing nicely with PHP. To solve this I’ve written a function that attempts to detect whether a string is encoded using UTF-16 (little endian or big endian) by looking at its first two bytes, and then converts it to the slightly more PHP-friendly UTF-8. All the complicated stuff is copied from these JavaScript functions for converting between UTF-8 and UTF-16.


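function utf16_to_utf8($str) {
    // Too short to carry a 2-byte BOM: return unchanged.
    if (strlen($str) < 2) {
        return $str;
    }
    $c0 = ord($str[0]);
    $c1 = ord($str[1]);

    // Detect byte order from the BOM: FE FF = big endian, FF FE = little endian.
    if ($c0 == 0xFE && $c1 == 0xFF) {
        $be = true;
    } else if ($c0 == 0xFF && $c1 == 0xFE) {
        $be = false;
    } else {
        return $str;
    }

    $str = substr($str, 2);
    $len = strlen($str);
    $dec = '';
    // Walk the string one 16-bit code unit at a time; a trailing odd byte is ignored.
    // Surrogate pairs are encoded unit by unit, so characters above U+FFFF
    // do not come out as valid UTF-8.
    for ($i = 0; $i + 1 < $len; $i += 2) {
        $c = ($be) ? ord($str[$i]) << 8 | ord($str[$i + 1]) :
                ord($str[$i + 1]) << 8 | ord($str[$i]);
        if ($c >= 0x0001 && $c <= 0x007F) {
            // U+0001..U+007F: one byte.
            $dec .= chr($c);
        } else if ($c > 0x07FF) {
            // U+0800..U+FFFF: three bytes.
            $dec .= chr(0xE0 | (($c >> 12) & 0x0F));
            $dec .= chr(0x80 | (($c >>  6) & 0x3F));
            $dec .= chr(0x80 | (($c >>  0) & 0x3F));
        } else {
            // U+0080..U+07FF: two bytes (U+0000 also lands here, as C0 80).
            $dec .= chr(0xC0 | (($c >>  6) & 0x1F));
            $dec .= chr(0x80 | (($c >>  0) & 0x3F));
        }
    }
    return $dec;
}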

Note that this only does something if the string starts with a BOM; otherwise the string is assumed not to be UTF-16 and is returned unmodified.
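For comparison, here is a minimal sketch of the same BOM check with the conversion handed off to PHP’s mbstring extension. It assumes mbstring is installed, and the name utf16_bom_to_utf8 is made up for this example. Because mb_convert_encoding decodes surrogate pairs, characters above U+FFFF should also come out as proper four-byte UTF-8.

```php
<?php
// Hypothetical helper (not part of the original function): the BOM check is
// the same, but the byte shuffling is delegated to mbstring.
function utf16_bom_to_utf8($str) {
    // Too short to contain a 2-byte BOM.
    if (strlen($str) < 2) {
        return $str;
    }
    $bom = substr($str, 0, 2);
    if ($bom === "\xFE\xFF") {
        $from = 'UTF-16BE';   // big endian
    } elseif ($bom === "\xFF\xFE") {
        $from = 'UTF-16LE';   // little endian
    } else {
        return $str;          // no BOM: assume it is not UTF-16
    }
    // Strip the BOM, then convert the remaining bytes to UTF-8.
    return mb_convert_encoding(substr($str, 2), 'UTF-8', $from);
}

// "hi" as little-endian UTF-16 with a BOM.
$le = "\xFF\xFE" . "h\x00i\x00";
echo utf16_bom_to_utf8($le), "\n";   // prints "hi"
```

The trade-off is an extra extension dependency in exchange for not maintaining the bit twiddling by hand.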


Hopefully someone will find this useful. If anyone can see any problems with it, please point them out; at the moment it seems to be working for me.






