PHP & JavaScript: UTF-16 to UTF-8


The following is excerpted from an English-language blog; the original text reads:

Recently I’ve been working on a PHP script that has to process a bunch of XML files (in this case imsmanifest files); however, a few of them weren’t being parsed successfully.

The problem was soon quite clear: some of the files had been encoded as UTF-16, which wasn’t playing nicely with PHP. To solve this I’ve written a function that attempts to detect whether a string is encoded as UTF-16 (little endian or big endian) and then converts it to the slightly more PHP-friendly UTF-8. All the complicated stuff is copied from these JavaScript functions for converting between UTF-8 and UTF-16.

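function utf16_to_utf8($str) {
    // A BOM takes two bytes; anything shorter cannot be BOM-marked UTF-16.
    if (strlen($str) < 2) {
        return $str;
    }
    $c0 = ord($str[0]);
    $c1 = ord($str[1]);

    if ($c0 == 0xFE && $c1 == 0xFF) {
        $be = true;   // FE FF: big endian
    } elseif ($c0 == 0xFF && $c1 == 0xFE) {
        $be = false;  // FF FE: little endian
    } else {
        return $str;  // no BOM: assume the string is not UTF-16
    }

    $str = (string) substr($str, 2);
    $len = strlen($str) & ~1;  // ignore a dangling odd byte at the end
    $dec = '';
    for ($i = 0; $i < $len; $i += 2) {
        $c = $be ? (ord($str[$i]) << 8) | ord($str[$i + 1])
                 : (ord($str[$i + 1]) << 8) | ord($str[$i]);
        // A high surrogate followed by a low surrogate encodes one code point
        // above U+FFFF.
        if ($c >= 0xD800 && $c <= 0xDBFF && $i + 3 < $len) {
            $d = $be ? (ord($str[$i + 2]) << 8) | ord($str[$i + 3])
                     : (ord($str[$i + 3]) << 8) | ord($str[$i + 2]);
            if ($d >= 0xDC00 && $d <= 0xDFFF) {
                $c = 0x10000 + (($c - 0xD800) << 10) + ($d - 0xDC00);
                $i += 2;
            }
        }
        if ($c >= 0xD800 && $c <= 0xDFFF) {
            $c = 0xFFFD;  // unpaired surrogate: substitute U+FFFD
        }
        if ($c <= 0x007F) {
            $dec .= chr($c);  // includes U+0000, emitted as a single NUL byte
        } elseif ($c <= 0x07FF) {
            $dec .= chr(0xC0 | (($c >>  6) & 0x1F));
            $dec .= chr(0x80 | (($c >>  0) & 0x3F));
        } elseif ($c <= 0xFFFF) {
            $dec .= chr(0xE0 | (($c >> 12) & 0x0F));
            $dec .= chr(0x80 | (($c >>  6) & 0x3F));
            $dec .= chr(0x80 | (($c >>  0) & 0x3F));
        } else {
            $dec .= chr(0xF0 | (($c >> 18) & 0x07));
            $dec .= chr(0x80 | (($c >> 12) & 0x3F));
            $dec .= chr(0x80 | (($c >>  6) & 0x3F));
            $dec .= chr(0x80 | (($c >>  0) & 0x3F));
        }
    }
    return $dec;
}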
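
The bit shifting in the loop is easier to check in isolation. Below is a minimal sketch, not part of the original post, that encodes a single code point as UTF-8 and combines a UTF-16 surrogate pair into one code point. The helper names code_point_to_utf8 and surrogates_to_code_point are made up for illustration.

```php
<?php
// Hypothetical helper (not from the original post): encode one Unicode
// code point as a UTF-8 byte string.
function code_point_to_utf8($cp) {
    if ($cp <= 0x7F) {
        return chr($cp);                           // 1 byte: 0xxxxxxx
    }
    if ($cp <= 0x7FF) {
        return chr(0xC0 | ($cp >> 6))              // 2 bytes: 110xxxxx 10xxxxxx
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        return chr(0xE0 | ($cp >> 12))             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))                 // 4 bytes: 11110xxx and three 10xxxxxx
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: combine a UTF-16 high/low surrogate pair into the
// code point it represents.
function surrogates_to_code_point($high, $low) {
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

echo bin2hex(code_point_to_utf8(0x41)), "\n";    // 41
echo bin2hex(code_point_to_utf8(0xE9)), "\n";    // c3a9
echo bin2hex(code_point_to_utf8(0x4E2D)), "\n";  // e4b8ad
$cp = surrogates_to_code_point(0xD83D, 0xDE00);  // U+1F600
echo bin2hex(code_point_to_utf8($cp)), "\n";     // f09f9880
```

Characters outside the Basic Multilingual Plane arrive in UTF-16 as two 16-bit units, so a converter that treats every unit separately would need the pairing step shown here before it can produce the 4-byte form.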

Note that this only does something if the string has a BOM; otherwise the string is assumed not to be UTF-16 and is returned unmodified.
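
For XML input specifically, a missing BOM does not have to end the story: a document that starts with an XML declaration begins with the characters "<?", which look different in each byte order. The sketch below is an assumption-laden illustration, not part of the original post. The helper name sniff_utf16_xml is made up, and it only recognises the BOM and the declaration prefix, so it does not tell UTF-16 apart from UTF-32.

```php
<?php
// Hypothetical helper (not from the original post): guess whether an XML
// string is UTF-16 from its first bytes. Returns 'UTF-16BE', 'UTF-16LE',
// or null when no UTF-16 signature is recognised.
function sniff_utf16_xml($str) {
    $head = (string) substr($str, 0, 4);
    if (strncmp($head, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';   // BOM, big endian
    }
    if (strncmp($head, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';   // BOM, little endian
    }
    if ($head === "\x00<\x00?") {
        return 'UTF-16BE';   // declaration prefix without a BOM, big endian
    }
    if ($head === "<\x00?\x00") {
        return 'UTF-16LE';   // declaration prefix without a BOM, little endian
    }
    return null;
}

echo sniff_utf16_xml("\xFF\xFE<\x00?\x00") ?? 'none', "\n";  // UTF-16LE
echo sniff_utf16_xml("\x00<\x00?\x00x") ?? 'none', "\n";     // UTF-16BE
echo sniff_utf16_xml('<?xml version="1.0"') ?? 'none', "\n"; // none
```

If the sniffed result is UTF-16 and the string lacks a BOM, one option would be to prepend the matching BOM ("\xFE\xFF" or "\xFF\xFE") before passing the string to utf16_to_utf8.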

I don’t know, but hopefully someone might find this useful. If anyone can see any problems with it, please point them out; at the moment, though, it seems to be working for me.


One comment on “PHP & JavaScript: UTF-16 to UTF-8”

  1. yexingzhe said:

    PHP is actually unfriendly to UTF-16 -,-
