Filling in a missing glyph in a PDF

hbghlyj · 2023-6-29 04:12

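Last edited by hbghlyj 2023-6-30 19:44

Taking the reference list on page 8 of waiweifen.pdf as an example, I found two technical errors:
In the link of [3], the ~ was interpreted by TeX as a space; the author forgot to write it as \~{}.
In [4], the character “昇” is missing from the FandolSong font. Related GitHub issue: “有个别汉字不能显示的问题，比如‘琦，旻’” (some Chinese characters, such as 琦 and 旻, cannot be displayed).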

The missing glyph is rendered as “龚$\bbox[1pt, border: 1px solid]{\text{F}}$著”.
Copying the text out of the PDF gives “龚�著”.

Screenshot 2023-06-28 at 21-28-48 waiweifen.pdf.png

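If you have the source, you can fix it with the method given in the issue. If you do not, you can patch the PDF with PyMuPDF:
First insert /china-ss (Song) into the page. Song is a built-in CJK base font, so no font file is embedded and the PDF does not grow in size. Then make a replacement in the content stream of page 8:
Split the string “龚�著” into two pieces, “龚” and “著”, keeping the original font and size, and insert “昇” in Song between them.
Concretely:
0701>]TJ first closes the string “龚”; Tf switches to Song; Ts sets the text rise to $-0.6$ so that the baselines line up; TJ draws “昇”; Tf switches back to FandolSong; Ts resets the text rise to 0; and finally [< reopens the array so that the rest of the original TJ stays syntactically valid.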

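import fitz  # PyMuPDF

doc = fitz.open("waiweifen.pdf")
page = doc[7]  # page 8 (pages are indexed from 0)
xref = page.get_contents()[0]  # xref of the page's first content stream
page.insert_font(fontname="china-ss")  # built-in CJK font, referenced as /china-ss
# Close the TJ array after 0701 ("龚"), draw "昇" (6607) in /china-ss, then switch back to /F3.
patched = doc.xref_stream(xref).replace(
    b"07010000",
    b"0701>]TJ -0.6 Ts/china-ss 10.5 Tf[<6607>]TJ /F3 10.5 Tf 0 Ts[<",
)
doc.update_stream(xref, patched)
doc.save("waiweifen1.pdf", garbage=4, deflate=True, clean=True)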
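
The same splice can be tried on a plain byte string, without opening a PDF. What follows is a minimal sketch, not the exact code used above: `splice_glyph` and the sample stream are made up for illustration, and the CIDs 06f5 and 082a are placeholders. Only 0701, 0000, 6607, /F3 and /china-ss come from the post.

```python
def splice_glyph(stream, prev_cid, missing_cid, glyph_hex,
                 new_font, old_font, size, rise):
    """Replace prev_cid+missing_cid with operators that draw glyph_hex in new_font."""
    patch = (
        prev_cid + b">]TJ "                             # close the string after the previous glyph
        + b"%g Ts /%b %g Tf " % (rise, new_font, size)  # set text rise, switch to the new font
        + b"[<%b>]TJ " % glyph_hex                      # draw the replacement glyph
        + b"/%b %g Tf 0 Ts [<" % (old_font, size)       # restore font and rise, reopen the array
    )
    # Replace only the first occurrence of the two adjacent CIDs.
    return stream.replace(prev_cid + missing_cid, patch, 1)

# Hypothetical content-stream fragment; 06f5 and 082a are placeholder CIDs.
sample = b"BT /F3 10.5 Tf [<06f507010000082a>]TJ ET"
patched = splice_glyph(sample, b"0701", b"0000", b"6607",
                       b"china-ss", b"F3", 10.5, -0.6)
print(patched.decode())
```

A plain byte replacement does not know where one 4-digit CID ends and the next begins, so on a real stream it is worth checking that the pattern occurs exactly once before replacing it.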


Result:

Screenshot 2023-06-28 at 21-00-27 output.pdf.png

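Reference: PDF standard, section 9.3, Text State Parameters and Operators
While writing the code, I used the CMap and located the missing glyph by looking up the character before it.
Searching for 0701 in the output of print(doc.xref_stream(page.get_contents()[0]).decode()) showed that the missing glyph is 0000.
Since the missing glyph copies out as �, one can infer that the CMap maps 0000 to FFFD.
A small lesson learned: next time there is no need to look up the preceding character; just search for 0000 directly, since that is the missing glyph.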
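
The inference about 0000 and FFFD can be checked against the font's ToUnicode CMap. Below is a rough sketch that parses only the `beginbfchar` sections of a CMap given as text. The sample CMap is invented (only the pairing of 0701 with 龚 and of 0000 with FFFD follows the post; 082A is a placeholder CID for 著), and `parse_bfchar` and `decode_cids` are hypothetical helper names. Real CMaps may also use `beginbfrange`, which this sketch ignores.

```python
import re

# Invented ToUnicode CMap fragment for illustration only.
SAMPLE_CMAP = """
3 beginbfchar
<0000> <FFFD>
<0701> <9F9A>
<082A> <8457>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {CID hex: Unicode hex} for every pair in the bfchar sections."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for cid, uni in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            mapping[cid.upper()] = uni.upper()
    return mapping

def decode_cids(hex_string, mapping):
    """Decode a hex string of 2-byte CIDs; unmapped CIDs become U+FFFD."""
    chars = []
    for i in range(0, len(hex_string), 4):
        uni = mapping.get(hex_string[i:i + 4].upper(), "FFFD")
        chars.append(chr(int(uni, 16)))
    return "".join(chars)

mapping = parse_bfchar(SAMPLE_CMAP)
# CIDs whose Unicode value is the replacement character U+FFFD.
missing = [cid for cid, uni in mapping.items() if uni == "FFFD"]
print(missing)
print(decode_cids("07010000082a", mapping))
```

On a real file the CMap text would come from the stream that the font's /ToUnicode entry points to; how to reach that stream is left out here.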

hbghlyj · 2023-7-1 02:46

Last edited by hbghlyj 2023-7-1 03:06

hbghlyj posted on 2023-6-29 04:12
In the link of [3], the ~ was interpreted by TeX as a space; the author forgot to write it as \~{}.

Fixing the ~ (CID: 001f)

import fitz
doc = fitz.open("waiweifen.pdf")
page = doc[7]
xref = page.get_contents()[0]
def get_key(xref,key):
    return int(doc.xref_get_key(xref,key)[1].split()[0].replace("[", ""))
doc.update_stream(xref,doc.xref_stream(xref).replace(b"0066>-332<0048",b"0066001f0048"))
Resources = get_key(page.xref,"Resources")
doc.update_object(Resources, doc.xref_object(Resources).replace(" /F9 12 0 R\n",""))
page.insert_font("F9",fontfile="D:/MiKTeX/fonts/opentype/public/lm/lmroman10-regular.otf")
### For debugging:
# print(doc.xref_stream(page.get_contents()[0]).decode())
# print(doc.xref_object(Resources))
doc.subset_fonts()
doc.save("output.pdf", garbage=4, deflate=True, clean=True)


Result:

The link now works.
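
Why the replacement works: inside a TJ array, a number between two strings is a position adjustment, expressed in thousandths of a unit of text space and subtracted from the current position, so -332 moves the next glyph to the right and shows up as a gap where the ~ should be. As a rough illustration, the width of that gap can be worked out as below. `tj_gap_pt` is a hypothetical helper, the 10.5 pt size is an assumption (the post does not state the size used with F9), and horizontal scaling is taken to be 100%.

```python
def tj_gap_pt(adjustment, font_size_pt):
    """Horizontal shift, in points, caused by one TJ adjustment number.

    The number is subtracted from the current position in units of
    1/1000 of the font size, so a negative value moves to the right.
    """
    return -adjustment / 1000 * font_size_pt

# Assumed font size of 10.5 pt; the real size for F9 is not given in the post.
gap = tj_gap_pt(-332, 10.5)
print(round(gap, 3))
```

Replacing >-332< with 001f joins the two hex strings into one and draws the ~ glyph in place of the gap.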

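The original file is 1.4 MB. Without doc.subset_fonts() the output file is 1.5 MB (because the whole LMRoman10 font is embedded); with doc.subset_fonts() it comes to 1.4 MB (the size barely grows, because only one character, ~, was added to the font).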

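Line 9 of the code (the doc.update_object call) removes the reference to F9 from Resources, and line 10 (the page.insert_font call) then inserts lmroman10-regular.otf under the name F9.
Without line 9, the insert_font on line 10 has no effect, because the name given to a newly inserted font must not be the same as that of a font already present.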
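
The exact string removed on line 9 depends on the object number (12 here) and on how the Resources object happens to be printed. A slightly more tolerant variant is sketched below with a regular expression. `drop_font_entry` is a hypothetical helper and the sample object text is invented, so treat it as an illustration rather than as what doc.xref_object returns for this file.

```python
import re

def drop_font_entry(obj_src, font_name):
    """Remove one '/Name N G R' reference from a PDF object's source text."""
    # \s+ after the name keeps /F9 from matching longer names such as /F90.
    pattern = r"\s*/%s\s+\d+\s+\d+\s+R" % re.escape(font_name)
    return re.sub(pattern, "", obj_src, count=1)

# Invented Resources text; real output formatting may differ.
SAMPLE_RESOURCES = "<<\n  /Font <<\n    /F3 9 0 R\n    /F9 12 0 R\n  >>\n>>"
cleaned = drop_font_entry(SAMPLE_RESOURCES, "F9")
print(cleaned)
```

The cleaned text would then be written back with doc.update_object(Resources, cleaned), as line 9 of the code above does with its own replacement.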
