
Extracting italic text

hbghlyj Posted at 2023-4-24 05:35:02
Create a small example:
    \documentclass[border={0pt 2pt 0pt 2pt}]{standalone}
    \begin{document}
    aaa
    \textit{bbb}
    ccc
    \end{document}

Try PyPDF2's page.extract_text() on the resulting PDF.
This only extracts the plain text aaa bbb ccc; the formatting information is lost.

 Author| hbghlyj Posted at 2023-4-24 05:47:57
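The Document properties dialog in SumatraPDF shows that the LaTeX-generated document contains two fonts:
CMR10(Type1;embedded)
CMTI10(Type1;embedded)
CMTI10 should be the italic font. Can text be extracted by font? Extracting only the CMTI10 text would give exactly the italic text.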
    from PyPDF2 import PdfReader

    reader = PdfReader('example.pdf')
    page = reader.pages[0]

    # extract_text() calls this visitor for each piece of text it encounters.
    def visitor_body(text, cm, tm, fontDict, fontSize):
        if text:
            print(fontDict['/BaseFont'])

    page.extract_text(visitor_text=visitor_body)

The output is
/DCZPBI+CMR10
/GUKUYZ+CMTI10
'aaa bbb ccc'
What do DCZPBI and GUKUYZ mean here?
Adding a check that the font name contains "CMTI", the code becomes
    from PyPDF2 import PdfReader

    reader = PdfReader('example.pdf')
    page = reader.pages[0]

    # Print only the text whose font name contains "CMTI" (Computer Modern text italic).
    def visitor_body(text, cm, tm, fontDict, fontSize):
        if text and 'CMTI' in fontDict['/BaseFont']:
            print(text)

    page.extract_text(visitor_text=visitor_body)

The output is bbb
It works.
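A small variation on the snippet above, as a sketch only: instead of printing, the visitor collects the matching pieces into a list so they can be reused. The names `make_font_filter` and `extract_by_font` are made up for this illustration and are not part of PyPDF2; the visitor signature is the one already used above. Matching on the font name is a heuristic that happens to suit Computer Modern, where the italic face is called CMTI.

```python
from typing import Callable, List, Tuple


def make_font_filter(needle: str) -> Tuple[Callable, List[str]]:
    """Return (visitor, chunks); the visitor appends matching text to chunks."""
    chunks: List[str] = []

    def visitor(text, cm, tm, font_dict, font_size):
        # Skip empty pieces, and pieces for which no font dictionary is passed.
        if not text or font_dict is None:
            return
        base_font = str(font_dict["/BaseFont"]) if "/BaseFont" in font_dict else ""
        if needle in base_font:
            chunks.append(text)

    return visitor, chunks


def extract_by_font(pdf_path: str, needle: str = "CMTI", page_index: int = 0) -> str:
    """Hypothetical helper: text on one page whose font name contains needle."""
    # Imported here so the filter itself can be tried without PyPDF2 installed.
    from PyPDF2 import PdfReader

    visitor, chunks = make_font_filter(needle)
    PdfReader(pdf_path).pages[page_index].extract_text(visitor_text=visitor)
    return "".join(chunks)
```

With the example document from the first post, `extract_by_font('example.pdf')` should return the italic piece, possibly with surrounding whitespace depending on how the text is split.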

 Author| hbghlyj Posted at 2023-4-24 06:22:03
    from PyPDF2 import PdfReader

    reader = PdfReader("example.pdf")
    page = reader.pages[0]

    # Print the font name next to each piece of text.
    def visitor_body(text, cm, tm, fontDict, fontSize):
        if text:
            print(fontDict['/BaseFont'], text)

    page.extract_text(visitor_text=visitor_body)
This prints the font of each text object.
For example
    \documentclass{standalone}
    \begin{document}
    $\gamma^2+\theta^2=\omega^2$
    \end{document}

Output
/HWJIZZ+CMMI10 γ
/RKDUWA+CMR7 2
/RFLZJB+CMR10 +
/HWJIZZ+CMMI10 θ
/RKDUWA+CMR7 2
/RFLZJB+CMR10 =
/HWJIZZ+CMMI10 ω
/RKDUWA+CMR7 2
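To make a listing like this easier to read, the pieces can be grouped by font rather than printed one by one. The following is only a sketch: `make_font_grouper` is a name invented here, the keys are the /BaseFont values exactly as reported, and whitespace-only pieces are skipped as a design choice of this sketch.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple


def make_font_grouper() -> Tuple[Callable, DefaultDict[str, List[str]]]:
    """Return (visitor, groups); groups maps each /BaseFont to its text pieces."""
    groups: DefaultDict[str, List[str]] = defaultdict(list)

    def visitor(text, cm, tm, font_dict, font_size):
        # Ignore whitespace-only pieces and pieces without a font dictionary.
        if font_dict is None or not text.strip():
            return
        if "/BaseFont" in font_dict:
            groups[str(font_dict["/BaseFont"])].append(text)

    return visitor, groups


if __name__ == "__main__":
    # Assumes example.pdf from the post above is in the current directory.
    from PyPDF2 import PdfReader

    visitor, groups = make_font_grouper()
    PdfReader("example.pdf").pages[0].extract_text(visitor_text=visitor)
    for font, pieces in groups.items():
        print(font, pieces)
```

For the formula above this would put the Greek letters under the CMMI10 entry, the exponents under CMR7, and the operators under CMR10.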
I don't know what the HWJIZZ prefix stands for. In an EPS file, the font definition contains similar lines:
    %%BeginResource: font RKDUWA+CMR7
    %!PS-AdobeFont-1.1: CMR7 1.0
    %%CreationDate: 1991 Aug 20 16:39:21

    /FontName /RKDUWA+CMR7 def

Perhaps it is a randomly generated identifier for each /FontName?
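For what it's worth, the PDF specification (ISO 32000-1, section on font subsets) describes this prefix as a subset tag: when only some glyphs of a font are embedded, the font name is prefixed with six uppercase letters and a plus sign. The letters are arbitrary, but different subsets in the same file must use different tags. Under that reading, the real font name is whatever follows the plus sign. A minimal sketch for stripping the tag (`strip_subset_tag` is a name made up here, not a PyPDF2 function):

```python
import re

# Six uppercase letters followed by "+", at the start of the name.
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(base_font: str) -> str:
    """Drop the leading slash and the subset tag, if present."""
    name = str(base_font).lstrip("/")
    return _SUBSET_TAG.sub("", name)


if __name__ == "__main__":
    for raw in ("/DCZPBI+CMR10", "/GUKUYZ+CMTI10", "/RKDUWA+CMR7"):
        print(raw, "->", strip_subset_tag(raw))
```

Comparing the stripped name (for example with `== "CMTI10"`) is a little stricter than a substring test, since the tag letters themselves can no longer cause an accidental match.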

Leisure Math Forum