The CMap table (character mapping table) in PDF

hbghlyj · 2023-6-28 21:43

For testing, generate a single-page PDF with lualatex:

document_4CBF_8811.pdf (2.98 KB)
The font is FandolSong, and the page contains a single character, “龚” (texlive.net).
View the contents of page 1 of the PDF:

import fitz  # PyMuPDF

doc = fitz.open("document_4CBF_8811.pdf")
page = doc[0]
# A page can have several content streams; print each one as text.
for xref in page.get_contents():
    stream = doc.xref_stream(xref).decode()
    print(stream)


Output:

BT
/F23 9.96264 Tf
1 0 0 1 148.712 707.125 Tm [<0701>]TJ
1 0 0 1 303.133 139.255 Tm [<0012>]TJ
ET


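This PDF code is a sequence of text-rendering instructions. Line by line:
BT: begins a text object; the operators that follow specify text rendering.
Tf: sets the font and font size. /F23 refers to the font resource named "F23", and 9.96264 is the font size.
Tm: sets the text matrix. Here $\pmatrix{1&0\\0&1\\148.712&707.125}$ is a translation to (148.712, 707.125).
TJ: shows text. Here it holds a single character code, 0701. The angle brackets <...> mark the string as hexadecimal.
Tm: as before, sets a new matrix, giving the translation for the next text-showing operation.
TJ: as before, shows the string 0012, written in hexadecimal.
ET: ends the text object; the text-rendering instructions are complete.
In short, this code sets the font and font size, sets a translation, shows the text “龚” (code 0701), sets a new translation, and (with the font and size unchanged) shows the text “1” (code 0012).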
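
To make the walkthrough above concrete, here is a minimal sketch that pulls the font, the translations and the hexadecimal strings out of the stream using only regular expressions. It assumes the stream has exactly the simple shape printed above (a real content stream needs a proper tokenizer) and that every character code is two bytes, i.e. four hex digits. The names `STREAM` and `parse_text_ops` are made up for this illustration.

```python
import re

# The content stream printed above, pasted in as a string.
STREAM = """BT
/F23 9.96264 Tf
1 0 0 1 148.712 707.125 Tm [<0701>]TJ
1 0 0 1 303.133 139.255 Tm [<0012>]TJ
ET"""

NUM = r"-?\d+(?:\.\d+)?"  # a plain decimal number

def parse_text_ops(stream):
    """Return (font name, font size, [(x, y, [hex codes]), ...])."""
    font = re.search(rf"/(\S+)\s+({NUM})\s+Tf", stream)
    # Tm takes six numbers a b c d e f. With a = d = 1 and b = c = 0
    # the matrix is a pure translation to (e, f).
    pattern = (rf"({NUM})\s+({NUM})\s+({NUM})\s+({NUM})\s+({NUM})\s+({NUM})"
               r"\s+Tm\s*\[<([0-9A-Fa-f]+)>\]\s*TJ")
    shows = []
    for m in re.finditer(pattern, stream):
        a, b, c, d, e, f = (float(v) for v in m.groups()[:6])
        hexstr = m.group(7)
        # Assumption: two-byte codes, so split into groups of four hex digits.
        codes = [hexstr[i:i + 4] for i in range(0, len(hexstr), 4)]
        shows.append((e, f, codes))
    return font.group(1), float(font.group(2)), shows

name, size, shows = parse_text_ops(STREAM)
print(name, size)
for x, y, codes in shows:
    print(x, y, codes)
```

For this stream the sketch reports font F23 at size 9.96264 and two text-showing operations, one per Tm/TJ pair.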

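Q: Where does the 1 come from? Only one character was typed.
A: The 1 is the page number. If you do not want a page number, remove it in TeX with \thispagestyle{empty}.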

hbghlyj · 2023-6-28 22:00

Last edited by hbghlyj 2023-6-28 23:57
Q: Why is the code for “龚” 0701? Why is the code for “1” 0012?
A: Use PyMuPDF to look at the CMap table of FandolSong:

import fitz
doc = fitz.open("document_4CBF_8811.pdf")
page = doc[0]
font = 'FandolSong'

def get_key(xref, key):
    # xref_get_key returns (type, value); take the object number from the value.
    return int(doc.xref_get_key(xref, key)[1].split()[0].replace("[", ""))

for font_tuple in page.get_fonts():
    # font_tuple[3] is the base font name; [-1] also copes with names that have no subset prefix.
    if font_tuple[3].split("+", maxsplit=1)[-1].startswith(font):
        for line in doc.xref_stream(get_key(font_tuple[0], 'ToUnicode')).decode().splitlines():
            print(line)


Output:
%!PS-Adobe-3.0 Resource-CMap
%%DocumentNeededResources: ProcSet (CIDInit)
%%IncludeResource: ProcSet (CIDInit)
%%BeginResource: CMap (TeX-NSHQXR-FandolSong-Regular-0)
%%Title: (TeX-NSHQXR-FandolSong-Regular-0 TeX NSHQXR-FandolSong-Regular 0)
%%Version: 1.000
%%EndComments
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (TeX)
/Ordering (NSHQXR-FandolSong-Regular)
/Supplement 0
>> def
/CMapName /TeX-Identity-NSHQXR-FandolSong-Regular def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
0 beginbfrange
endbfrange
2 beginbfchar
<0012> <0031>
<0701> <9F9A>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
%%EndResource
%%EOF

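Here the beginbfchar...endbfchar section is the mapping from CID to Unicode code point.
The figure on the right shows the characters for these two code points: U+0031 is “1” and U+9F9A is “龚”.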
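
As a cross-check, the bfchar section can be applied by hand to the two strings from the page content stream. The sketch below is self-contained: it pastes the relevant lines of the CMap above into a string instead of reading the PDF. `parse_bfchar` and `decode_hex` are hypothetical helper names, not PyMuPDF functions, and the code assumes two-byte codes, which matches the `<0000> <FFFF>` codespace range.

```python
import re

# The bfchar section of the ToUnicode CMap printed above.
CMAP_TEXT = """2 beginbfchar
<0012> <0031>
<0701> <9F9A>
endbfchar"""

def parse_bfchar(cmap_text):
    """Map each source code (upper-case hex) to its Unicode string."""
    table = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Destination values in a ToUnicode CMap are UTF-16BE.
            table[src.upper()] = bytes.fromhex(dst).decode("utf-16-be")
    return table

def decode_hex(hexstr, table):
    """Decode a PDF hex string, assuming four hex digits per character code."""
    codes = [hexstr[i:i + 4].upper() for i in range(0, len(hexstr), 4)]
    return "".join(table.get(code, "?") for code in codes)

table = parse_bfchar(CMAP_TEXT)
print(decode_hex("0701", table))  # the character shown by the first TJ
print(decode_hex("0012", table))  # the character shown by the second TJ
```

Decoding the destination with UTF-16BE rather than `chr(int(..., 16))` is a design choice here: it also copes with entries whose destination is longer than one code unit.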

See also:
Editing text in PDF
Editing CMap / ToUnicode to achieve correct character mapping when extracting text
How are Embedded CMAP tables defined in a PDF File?
hand-modify-pdf.md
Inside the PDF File Format
PDF reference
Developing with PDF by Leonard Rosenthol

hbghlyj · 2023-6-29 05:56

In this PDF the letters A-Za-z are set in LMRoman10.
Changing line 4 of the script in post #2 to font='LMRoman10' extracts its CMap table in the same way:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /BYPSOO+LMRoman10-Regular-UTF16 def
/CMapType 2 def
/CIDSystemInfo <<
  /Registry (Adobe)
  /Ordering (UCS)
  /Supplement 0
>> def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
59 beginbfchar
<001B> <0041>
<001C> <0061>
<0022> <0042>
<0023> <0062>
<0028> <005B>
<0029> <005D>
<002B> <0063>
<002C> <003A>
<002E> <0044>
<002F> <0064>
<0032> <0065>
<0033> <0038>
<0036> <0046>
<0037> <0066>
<0038> <0035>
<0039> <0034>
<003A> <0047>
<003B> <0067>
<003E> <0048>
<003F> <0068>
<0040> <002D>
<0041> <0049>
<0042> <0069>
<0046> <006B>
<0074> <0078>
<0076> <0079>
<0077> <005A>
<0078> <007A>
<0079> <0030>
<00C7> <2022>
<0189> <00AF>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
The following script looks up the CIDs for a string:

def repl(search):
    # Unicode character -> CID (hex), copied by hand from the bfchar table above.
    cid_table = {
        '\u0041': '001B',
        '\u0061': '001C',
        '\u0042': '0022',
        '\u0062': '0023',
        '\u005B': '0028',
        '\u005D': '0029',
        '\u0063': '002B',
        '\u003A': '002C',
        '\u0044': '002E',
        '\u0064': '002F',
        '\u0065': '0032',
        '\u0038': '0033',
        '\u0046': '0036',
        '\u0066': '0037',
        '\u0035': '0038',
        '\u0034': '0039',
        '\u0047': '003A',
        '\u0067': '003B',
        '\u0048': '003E',
        '\u0068': '003F',
        '\u002D': '0040',
        '\u0049': '0041',
        '\u0069': '0042',
        '\u006B': '0046',
        '\u0078': '0074',
        '\u0079': '0076',
        '\u005A': '0077',
        '\u007A': '0078',
        '\u0030': '0079',
        '\u2022': '00C7',
        '\u00AF': '0189',
    }
    # Characters missing from the table are passed through unchanged.
    return ''.join([cid_table.get(x, x) for x in search])

print(repl('hy'))


Output:
003F0076

hbghlyj · 2023-6-29 06:56

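Post #3 looks up the CMap table of a single font.
The following script merges the CMap tables of all fonts on the page at index 7 (doc[7]) into one dictionary and looks up the CIDs: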

import fitz  # PyMuPDF

doc = fitz.open("waiweifen.pdf")
page = doc[7]
char_to_cid = {}  # Unicode character -> CID as a hex string

def get_key(xref, key):
    return int(doc.xref_get_key(xref, key)[1].split()[0].replace("[", ""))

for font_tuple in page.get_fonts():
    try:
        in_bfchar = False
        for line in doc.xref_stream(get_key(font_tuple[0], 'ToUnicode')).decode().splitlines():
            if 'beginbfchar' in line:
                in_bfchar = True
                continue
            if 'endbfchar' in line:
                in_bfchar = False
                continue
            if in_bfchar:
                cid = line.split()[0][1:-1]
                char = chr(int(line.split()[1][1:-1], 16))
                char_to_cid[char] = cid
    except Exception:
        # Skip fonts whose ToUnicode stream is missing or cannot be parsed.
        continue

def replace(search):
    # Characters missing from the table are passed through unchanged.
    return ''.join([char_to_cid.get(x, x) for x in search])

print(replace('hy'))


Output:
003F0076

hbghlyj · 2023-7-1 01:18

Conversely, the Unicode character can be looked up from the CID, which is needed when decoding hexadecimal strings in a PDF:

import fitz  # PyMuPDF

doc = fitz.open("waiweifen.pdf")
page = doc[7]
cid_to_char = {}  # CID as a hex string -> Unicode character

def get_key(xref, key):
    return int(doc.xref_get_key(xref, key)[1].split()[0].replace("[", ""))

for font_tuple in page.get_fonts():
    try:
        in_bfchar = False
        for line in doc.xref_stream(get_key(font_tuple[0], 'ToUnicode')).decode().splitlines():
            if 'beginbfchar' in line:
                in_bfchar = True
                continue
            if 'endbfchar' in line:
                in_bfchar = False
                continue
            if in_bfchar:
                cid = line.split()[0][1:-1]
                char = chr(int(line.split()[1][1:-1], 16))
                cid_to_char[cid] = char
    except Exception:
        # Skip fonts whose ToUnicode stream is missing or cannot be parsed.
        continue

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def replace(search):
    # Split the hex string into 4-digit CIDs; unknown CIDs are passed through unchanged.
    return ''.join([cid_to_char.get(x, x) for x in chunks(search.upper(), 4)])

print(replace('0048007400620054003f0076'))


Output:
lxsphy
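
The scripts above read only the bfchar sections. A ToUnicode CMap may also contain bfrange sections, where `<lo> <hi> <dst>` maps a run of consecutive codes to consecutive Unicode values and `<lo> <hi> [<d1> <d2> ...]` lists each destination explicitly. The following is a rough sketch of how such entries could be expanded into the same CID-to-character dictionary. The sample CMap text and the names `parse_bfrange` and `utf16` are invented for illustration and do not come from waiweifen.pdf; the first form is simplified to incrementing the last character of the destination.

```python
import re

# Invented sample, NOT taken from any of the PDFs in this thread.
SAMPLE = """2 beginbfrange
<0041> <0043> <0061>
<0050> <0051> [<0078> <00660069>]
endbfrange"""

HEX = r"<([0-9A-Fa-f]+)>"

def utf16(hexstr):
    """Decode a ToUnicode destination, which is UTF-16BE."""
    return bytes.fromhex(hexstr).decode("utf-16-be")

def parse_bfrange(cmap_text):
    """Expand bfrange entries into {CID hex string: Unicode string}."""
    table = {}
    entry = rf"{HEX}\s*{HEX}\s*(?:{HEX}|\[(.*?)\])"
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, start, array in re.findall(entry, block, re.S):
            width = len(lo)  # keep the CID keys as wide as in the source
            first, last = int(lo, 16), int(hi, 16)
            if start:
                # Form 1: the destination counts up together with the code.
                # Simplification: only the last character is incremented.
                base = utf16(start)
                for offset in range(last - first + 1):
                    char = base[:-1] + chr(ord(base[-1]) + offset)
                    table[f"{first + offset:0{width}X}"] = char
            else:
                # Form 2: one explicit destination per code in the range.
                for offset, dst in enumerate(re.findall(HEX, array)):
                    table[f"{first + offset:0{width}X}"] = utf16(dst)
    return table

ranges = parse_bfrange(SAMPLE)
print(ranges)
```

The resulting dictionary could be merged into the one built from bfchar before calling `replace`, so that codes covered only by a range are decoded as well.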
