Extracting images from a PDF

hbghlyj · 2022-8-31 05:01

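How can I extract all images from a PDF document at their native resolution and in their original format (that is, extract TIFF as TIFF, JPEG as JPEG, and so on, without resampling), regardless of where each image is placed on the page?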

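Solution in Python
You can use PyPDF2 together with Pillow. JPEG (/DCTDecode) and JPEG 2000 (/JPXDecode) streams are written to disk unchanged, so they keep their original format and resolution; Flate-compressed images are decoded and saved as .png files.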

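from PIL import Image
from PyPDF2 import PdfReader


def extract_image(pdf_file_path):
    """Save every image XObject on every page to the current directory."""
    reader = PdfReader(pdf_file_path)
    for page_number, page in enumerate(reader.pages, start=1):
        resources = page["/Resources"] if "/Resources" in page else {}
        if "/XObject" not in resources:
            continue  # no images on this page
        x_objects = resources["/XObject"].get_object()

        for name in x_objects:
            image = x_objects[name]
            if image["/Subtype"] != "/Image":
                continue
            # Prefix with the page number so names from different pages do not collide.
            stem = f"page{page_number}_{name[1:]}"
            image_filter = image.get("/Filter")
            data = image.get_data()

            if image_filter == "/FlateDecode":
                # Raw pixel data: rebuild the image and save it losslessly as PNG.
                size = (image["/Width"], image["/Height"])
                color_space = image.get("/ColorSpace")
                if color_space == "/DeviceRGB":
                    mode = "RGB"
                elif color_space == "/DeviceGray":
                    mode = "L"
                else:
                    mode = "P"  # other color spaces (e.g. indexed) are not fully handled
                Image.frombytes(mode, size, data).save(stem + ".png")
            elif image_filter == "/DCTDecode":
                # JPEG stream: write the bytes unchanged.
                with open(stem + ".jpg", "wb") as f:
                    f.write(data)
            elif image_filter == "/JPXDecode":
                # JPEG 2000 stream: write the bytes unchanged.
                with open(stem + ".jp2", "wb") as f:
                    f.write(data)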
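
The /Filter checks in the code above compare against a single name, but a PDF stream may also carry a list of filters or no filter at all. As a minimal sketch, the choice of output extension could be factored into a small helper. The name extension_for_filter and the PASSTHROUGH_EXTENSIONS table are hypothetical and not part of PyPDF2; the helper also assumes that any earlier filters in a chain have already been decoded before the bytes are written.

```python
# Hypothetical helper (not part of PyPDF2): choose an output extension
# from the /Filter entry of an image stream.
PASSTHROUGH_EXTENSIONS = {
    "/DCTDecode": ".jpg",  # JPEG data
    "/JPXDecode": ".jp2",  # JPEG 2000 data
}


def extension_for_filter(filter_entry):
    """Return ".jpg" or ".jp2" for pass-through formats, otherwise ".png".

    filter_entry may be None (no filter), a single name such as
    "/DCTDecode", or a list of names such as ["/FlateDecode", "/DCTDecode"].
    """
    if filter_entry is None:
        filters = []
    elif isinstance(filter_entry, str):
        filters = [filter_entry]
    else:
        filters = list(filter_entry)

    # Filters are applied in the order listed when decoding, so the last
    # one names the image encoding that remains after the others are undone.
    if filters and filters[-1] in PASSTHROUGH_EXTENSIONS:
        return PASSTHROUGH_EXTENSIONS[filters[-1]]
    # Everything else is treated as raw pixels to be re-encoded as PNG.
    return ".png"
```

For example, extension_for_filter(["/FlateDecode", "/DCTDecode"]) gives ".jpg", while a missing filter falls back to ".png".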
