
Extracting images from a PDF

hbghlyj Posted 2022-8-31 05:01
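How can all images be extracted from a PDF document at their native resolution and in their original format? (That is, extract TIFF as TIFF, JPEG as JPEG, and so on, without resampling.) The position of the source images on the page does not matter.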

Solution in Python
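One option is the PyMuPDF module, which works out of the box and is fast, although the usual Pixmap-based recipe saves every image as a .png file. The code below instead uses PyPDF2 together with Pillow and only looks at the first page: JPEG (/DCTDecode) and JPEG 2000 (/JPXDecode) streams are written out unchanged, while /FlateDecode images are rebuilt from raw pixel data and saved as .png.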
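from PIL import Image
from PyPDF2 import PdfReader


def extract_image(pdf_file_path):
    """Save every image XObject found on the first page of a PDF."""
    reader = PdfReader(pdf_file_path)
    page = reader.pages[0]
    # get_object()/get_data() replace the deprecated camelCase
    # getObject()/getData(), which do not pair with PdfReader in newer PyPDF2.
    x_object = page["/Resources"]["/XObject"].get_object()

    for name in x_object:
        image = x_object[name]
        if image["/Subtype"] != "/Image":
            continue

        size = (int(image["/Width"]), int(image["/Height"]))
        data = image.get_data()
        color_space = image["/ColorSpace"]
        if color_space == "/DeviceRGB":
            mode = "RGB"
        elif color_space == "/DeviceGray":
            mode = "L"  # 8-bit grayscale
        else:
            mode = "P"  # fallback from the original post; no palette is applied

        image_filter = image["/Filter"]
        base_name = name[1:]  # drop the leading "/" of the PDF name
        if image_filter == "/FlateDecode":
            # Decompressed raw pixels (assumes 8 bits per component): re-encode as PNG.
            Image.frombytes(mode, size, data).save(base_name + ".png")
        elif image_filter == "/DCTDecode":
            # JPEG stream: write the bytes unchanged.
            with open(base_name + ".jpg", "wb") as f:
                f.write(data)
        elif image_filter == "/JPXDecode":
            # JPEG 2000 stream: write the bytes unchanged.
            with open(base_name + ".jp2", "wb") as f:
                f.write(data)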
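
The filter checks above compare /Filter against a single name, but a PDF stream may also carry an array of filter names applied in sequence, or no /Filter at all. Below is a minimal sketch of picking an output extension under that assumption. The helper name extension_for_filter and the mapping table are hypothetical and not part of PyPDF2, and anything that is not JPEG or JPEG 2000 is simply treated as raw pixels to be re-encoded as PNG, which is a simplification.

```python
# Hypothetical helper (not a PyPDF2 API): choose a file extension from /Filter.
PASSTHROUGH_EXTENSIONS = {
    "/DCTDecode": ".jpg",  # JPEG data, can be written unchanged
    "/JPXDecode": ".jp2",  # JPEG 2000 data, can be written unchanged
}


def extension_for_filter(filter_value):
    """Return (extension, passthrough) for a /Filter entry.

    filter_value may be None, a single name such as "/DCTDecode",
    or a list of names such as ["/FlateDecode", "/DCTDecode"].
    passthrough is True when the decoded bytes are already a complete
    image file and can be written to disk as they are.
    """
    if filter_value is None:
        filters = []
    elif isinstance(filter_value, str):
        filters = [filter_value]
    else:
        filters = [str(f) for f in filter_value]

    # Filters are undone in list order, so the last entry describes the
    # innermost encoding of the image data.
    last = filters[-1] if filters else None
    if last in PASSTHROUGH_EXTENSIONS:
        return PASSTHROUGH_EXTENSIONS[last], True
    # Assumption: everything else is raw pixel data to re-encode as PNG.
    return ".png", False
```

In the loop above, this could stand in for the chain of string comparisons: call extension_for_filter on the image's /Filter entry, write the data unchanged when passthrough is true, and fall back to Image.frombytes otherwise.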
