On UTF-8 and UTF-16

hbghlyj · 2022-5-15 04:33

The first line of this page's head is

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


So this page's character set is UTF-8.

The tab character (Character Tabulation) can be written as a decimal (dec) or hexadecimal (hex) character reference:

Dec         &#9;
Hex         &#x9;

'd' can be written in dec or hex (its code in binary is shown as well):

Bin         1100100
Dec         &#100;
Hex         &#x64;

'ⅆ' can be written in dec or hex:

Dec         &#8518;
Hex         &#x2146;

'𝕕' can be written in dec or hex:

Dec         &#120149;
Hex         &#x1d555;

'❤️' can be written in dec or hex (it is a sequence of two code points):

Dec         &#10084;&#65039;
Hex         &#x2764;&#xfe0f;
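The dec and hex forms above can be generated mechanically from a string's code points. A minimal sketch in JavaScript, where toCharRefs is a hypothetical helper name introduced here only for illustration:

```javascript
// Hypothetical helper: turn each code point of a string into an HTML
// numeric character reference (decimal by default, hex on request).
function toCharRefs(str, hex = false) {
  // Spreading a string iterates by code point, not by UTF-16 code unit,
  // so a character outside the BMP yields one reference, not two.
  return [...str]
    .map((ch) => {
      const cp = ch.codePointAt(0);
      return hex ? `&#x${cp.toString(16)};` : `&#${cp};`;
    })
    .join("");
}

console.log(toCharRefs("\u{1D555}"));       // &#120149;
console.log(toCharRefs("\u{1D555}", true)); // &#x1d555;
console.log(toCharRefs("\u2764\uFE0F"));    // &#10084;&#65039;
```

Iterating by code point matters here: looping over UTF-16 code units instead would emit the two surrogates of 𝕕 separately, which HTML does not accept as character references.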

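The last two are multi-byte cases: 𝕕 (U+1D555) takes 4 bytes in UTF-8, and ❤️ is a sequence of two code points. On this forum the MySQL character set is utf8, which stores at most 3 bytes per character, so a 4-byte character cannot be stored in the database directly; you have to enable the HTML option and write the corresponding character reference instead. In fact, if such a character is typed directly into a post, the post content is truncated from that character onward and everything after it is lost.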

If you are storing this in a MySQL database you need to specify the charset as utf8mb4, and not just utf8, or it will cause corruptions such as this.
References:
stackoverflow.com/questions/47187165/convert- … ji-in-html-using-php
gist.github.com/joni/3760795
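Whether a string is affected can be checked by looking for code points above U+FFFF, since those are exactly the ones that take 4 bytes in UTF-8. A rough sketch, where needsUtf8mb4 is a hypothetical helper name and not part of any MySQL API:

```javascript
// Hypothetical helper: true if the string contains a code point that
// needs 4 bytes in UTF-8 (anything above U+FFFF), which MySQL's 3-byte
// utf8 character set cannot store.
function needsUtf8mb4(str) {
  for (const ch of str) {            // for...of iterates by code point
    if (ch.codePointAt(0) > 0xFFFF) return true;
  }
  return false;
}

console.log(needsUtf8mb4("d"));          // false
console.log(needsUtf8mb4("\u2146"));     // false (3 bytes in UTF-8)
console.log(needsUtf8mb4("\u{1D555}"));  // true  (4 bytes in UTF-8)
console.log(new TextEncoder().encode("\u{1D555}").length); // 4
```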

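🐶 (&#x1F436;) written as a surrogate pair:
�� (rendered as replacement characters, because HTML does not accept surrogate code points as character references)
In general, UTF-16 writes the code points from U+10000 to U+10FFFF as surrogate pairs: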

lead \ trail	DC00	DC01	…	DFFF
D800	10000	10001	…	103FF
D801	10400	10401	…	107FF
⋮	⋮	⋮	⋱	⋮
DBFF	10FC00	10FC01	…	10FFFF
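Every cell of the table follows from one formula: code point = 0x10000 + (lead - 0xD800) × 0x400 + (trail - 0xDC00). A small sketch that reproduces the corner cells, with fromSurrogatePair as an illustrative name:

```javascript
// Illustrative helper: combine a lead (high) and trail (low) surrogate
// into the code point they encode.
function fromSurrogatePair(lead, trail) {
  if (lead < 0xD800 || lead > 0xDBFF || trail < 0xDC00 || trail > 0xDFFF) {
    throw new RangeError("not a valid lead/trail surrogate pair");
  }
  return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
}

console.log(fromSurrogatePair(0xD800, 0xDC00).toString(16)); // 10000
console.log(fromSurrogatePair(0xD801, 0xDC00).toString(16)); // 10400
console.log(fromSurrogatePair(0xDBFF, 0xDFFF).toString(16)); // 10ffff
console.log(fromSurrogatePair(0xD83D, 0xDC36).toString(16)); // 1f436
```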

JavaScript supports UTF-16 surrogate pairs:
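A short illustration, using the 🐶 example from above (the variable name is arbitrary): a string literal written with two \u escapes for the surrogates is the same string as the single character.

```javascript
// JavaScript strings are sequences of UTF-16 code units, so a character
// outside the BMP occupies two units (a surrogate pair).
const dog = "\uD83D\uDC36";

console.log(dog === "\u{1F436}");                 // true
console.log(dog.length);                          // 2 (code units)
console.log([...dog].length);                     // 1 (code point)
console.log(dog.codePointAt(0).toString(16));     // 1f436
console.log(dog.charCodeAt(0).toString(16));      // d83d
console.log(dog.charCodeAt(1).toString(16));      // dc36
```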

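In the following example, neither 2764 nor FE0F lies in the range D800-DFFF, yet it still decodes. That is because no surrogate pair is involved: U+2764 (Heavy Black Heart) and U+FE0F (Variation Selector-16) are two ordinary code points in the Basic Multilingual Plane, and the second one only requests the emoji presentation of the first.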
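A quick check of the ❤️ case in JavaScript (a sketch; the names heart and isSurrogate are illustrative): both code units are ordinary BMP code points, so the string is two separate characters and not one surrogate pair.

```javascript
const heart = "\u2764\uFE0F";

// A code unit is a surrogate only if it falls in D800-DFFF.
const isSurrogate = (unit) => unit >= 0xD800 && unit <= 0xDFFF;

console.log(heart.length);       // 2 (code units)
console.log([...heart].length);  // 2 (code points, so no pairing happened)
console.log(isSurrogate(heart.charCodeAt(0))); // false
console.log(isSurrogate(heart.charCodeAt(1))); // false
console.log([...heart].map((ch) => ch.codePointAt(0).toString(16)).join(" ")); // 2764 fe0f
```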

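Following Wikipedia, plus my own understanding, 𝕕 (U+1D555, the fourth example above) can be encoded as a surrogate pair by the following calculation:
First subtract 0x10000 from its hex code 1D555 to get D555, which in binary is 1101010101010101 (pad with leading zeros to 20 bits). Then split it into two halves:
Upper value: 0000110101 (the upper 10 bits)
Lower value: 0101010101 (the lower 10 bits)
Finally, add the upper value to D800 to form the high (lead) surrogate, and add the lower value to DC00 to form the low (trail) surrogate.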

const offset = 0x1D555 - 0x10000;   // 0xD555, fits in 20 bits
const upper = offset >> 10;         // upper 10 bits: 0x35
const lower = offset & 0x3FF;       // lower 10 bits: 0x155
(upper + 0xD800).toString(16);      // "d835"
(lower + 0xDC00).toString(16);      // "dd55"

This gives the surrogate pair "\uD835\uDD55".
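The same calculation can be wrapped into a general function and checked against the built-in String.fromCodePoint. This is only a sketch, and toSurrogatePair is a name made up for this post:

```javascript
// Made-up helper: split a code point in U+10000..U+10FFFF into its
// UTF-16 lead and trail surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point is not in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;        // 20-bit value
  const lead = 0xD800 + (offset >> 10);      // upper 10 bits
  const trail = 0xDC00 + (offset & 0x3FF);   // lower 10 bits
  return [lead, trail];
}

const [lead, trail] = toSurrogatePair(0x1D555);
console.log(lead.toString(16), trail.toString(16)); // d835 dd55
console.log(String.fromCharCode(lead, trail) === String.fromCodePoint(0x1D555)); // true
```

Bit shifts avoid a pitfall of slicing the binary string: for an offset below 0x400 the binary string is shorter than 10 digits, so slicing off "the upper part" yields an empty string.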

hbghlyj · 2022-5-15 04:59

stackoverflow.com/questions/44565859/how-does … uble-byte-characters

UTF-8 is designed to be able to unambiguously identify the type of each byte in a text stream:

1-byte codes (all and only the ASCII characters) start with a 0
Leading bytes of 2-byte codes start with two 1s followed by a 0 (i.e. 110)
Leading bytes of 3-byte codes start with three 1s followed by a 0 (i.e. 1110)
Leading bytes of 4-byte codes start with four 1s followed by a 0 (i.e. 11110)
Continuation bytes (of all multi-byte codes) start with a single 1 followed by a 0 (i.e. 10)
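These prefixes can be observed directly on the examples from the first post. A minimal sketch, assuming a runtime with the standard TextEncoder (browsers and current Node.js); classifyUtf8Byte is an illustrative name:

```javascript
// Illustrative helper: classify a single UTF-8 byte by its leading bits.
function classifyUtf8Byte(byte) {
  if ((byte & 0x80) === 0x00) return "ascii";        // 0xxxxxxx
  if ((byte & 0xC0) === 0x80) return "continuation"; // 10xxxxxx
  if ((byte & 0xE0) === 0xC0) return "lead-2";       // 110xxxxx
  if ((byte & 0xF0) === 0xE0) return "lead-3";       // 1110xxxx
  if ((byte & 0xF8) === 0xF0) return "lead-4";       // 11110xxx
  return "invalid";                                  // 11111xxx never occurs in UTF-8
}

// Print each byte of a string in hex with its classification.
for (const text of ["d", "\u2146", "\u{1D555}"]) {
  const bytes = [...new TextEncoder().encode(text)];
  console.log(bytes.map((b) => `${b.toString(16)}:${classifyUtf8Byte(b)}`).join(" "));
}
// 64:ascii
// e2:lead-3 85:continuation 86:continuation
// f0:lead-4 9d:continuation 95:continuation 95:continuation
```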

hbghlyj · 2022-5-15 05:57

é has two equivalent representations:

&#xE9;
&#x65;&#x0301;

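The second representation is the letter e followed by the combining acute accent ◌́ (U+0301).
See en.wikipedia.org/wiki/Unicode_equivalence#Com … ecomposed_characters
In JavaScript, the String method normalize() converts a string to a Unicode normalization form.
See developer.mozilla.org/en-US/docs/Web/JavaScri … cts/String/normalize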
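For the é example, the two representations compare as different strings until they are normalized to the same form. A short sketch (variable names are arbitrary):

```javascript
const precomposed = "\u00E9";   // é as a single code point
const decomposed = "e\u0301";   // e followed by combining acute accent

console.log(precomposed === decomposed);                  // false
console.log(precomposed.length, decomposed.length);       // 1 2

// NFC composes, NFD decomposes; after either one the strings match.
console.log(decomposed.normalize("NFC") === precomposed); // true
console.log(precomposed.normalize("NFD") === decomposed); // true
```

Called with no argument, normalize() uses NFC.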

hbghlyj · 2022-5-31 00:53

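Unicode for the under-bracket:
It is not displayed in the element inspector, but the innerHTML of that <mo></mo> is in fact this under-bracket:
A stretched variant of the bracket: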
