How to extract text from a SCANNED PDF in Python

I have the following code, which splits a PDF into images. The idea is to extract a specific page and then obtain the text from that image.

# extract the embedded JPEG images from a PDF

# path is the location of the PDF file (defined elsewhere)
with open(path, "rb") as f:
    pdf = f.read()

# the file is read as bytes, so every search pattern must be bytes too
startmark = b"\xff\xd8"  # JPEG start-of-image marker
startfix = 0
endmark = b"\xff\xd9"  # JPEG end-of-image marker
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    # the JPEG must start within 20 bytes of the "stream" keyword
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("end of stream not found!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("end of JPG not found!")

    istart += startfix
    iend += endfix

    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend
print("done")

The images I get from that script look something like this:

Thanks for any suggestions or comments!

    
asked by Diego Avila 04.06.2018 at 19:52

1 answer

I'm adding the answer in case it helps someone. A scanned PDF is just a set of images, so to extract the text we first have to convert the PDF to images and then convert those images to text.

Converting the PDF to an image:

# convert the PDF to an image
from wand.image import Image

# set the resolution used to render the page
with Image(filename=path_absoluta, resolution=400) as img:
    # set the width and height
    img.resize(1850, 1850)
    img.save(filename="media/normas/temp.jpg")

As you can see, this uses Wand, where path_absoluta is the path of the desired PDF together with the page number to convert to an image.
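
To make the "path plus page number" part concrete, here is a minimal sketch. It assumes Wand hands the filename to ImageMagick, which accepts a zero-based page index in square brackets (for example document.pdf[0]). The helper names page_spec and pdf_page_to_jpg are hypothetical and are not part of Wand.

```python
def page_spec(pdf_path, page_index):
    """Build an ImageMagick-style path that selects one page (zero-based)."""
    if page_index < 0:
        raise ValueError("page_index must be zero or greater")
    return "%s[%d]" % (pdf_path, page_index)


def pdf_page_to_jpg(pdf_path, page_index, out_path, resolution=400):
    """Render a single PDF page to a JPEG file (hypothetical helper)."""
    # Imported here so page_spec can be used without Wand installed.
    from wand.image import Image

    # Assumption: the "[n]" suffix limits reading to page n of the PDF.
    with Image(filename=page_spec(pdf_path, page_index),
               resolution=resolution) as img:
        img.save(filename=out_path)
```

With this, path_absoluta in the snippet above would correspond to something like page_spec("document.pdf", 0) for the first page.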

Extracting the text from the image:

import pytesseract
from PIL import ImageEnhance, ImageFilter
from PIL import Image as Img

im = Img.open("media/normas/temp.jpg")
# apply a median filter to reduce noise
im = im.filter(ImageFilter.MedianFilter())

enhancer = ImageEnhance.Contrast(im)
# boost the contrast to improve the image -> text conversion
im = enhancer.enhance(15)
# convert to a 1-bit black and white image
im = im.convert('1')
im.save('sample.jpeg')

# run OCR with the Spanish language data
text = pytesseract.image_to_string(Img.open('sample.jpeg'), lang='spa')

Here I took the previous image (temp.jpg) and used Pillow's filters to produce an enhanced version of it, then extracted the text from that enhanced image. With that, the whole page is available as text.
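
As a possible variation, the same steps can be wrapped in functions, and the enhanced image can be passed straight to pytesseract instead of being saved as sample.jpeg first. This is only a sketch: the function names preprocess and image_to_text are hypothetical, and the contrast factor of 15 is simply carried over from the code above, so it may need tuning for other scans.

```python
def preprocess(im, contrast=15):
    """Denoise, boost contrast, and binarize a PIL image for OCR."""
    from PIL import ImageEnhance, ImageFilter

    im = im.filter(ImageFilter.MedianFilter())        # reduce speckle noise
    im = ImageEnhance.Contrast(im).enhance(contrast)  # exaggerate contrast
    return im.convert("1")                            # 1-bit black and white


def image_to_text(image_path, lang="spa"):
    """Run OCR on an image file and return the recognized text."""
    import pytesseract
    from PIL import Image as Img

    with Img.open(image_path) as im:
        # image_to_string accepts a PIL image, so no temporary file is needed
        return pytesseract.image_to_string(preprocess(im), lang=lang)
```

For example, text = image_to_text("media/normas/temp.jpg") would replace the second snippet, assuming Tesseract and its Spanish language data are installed.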

Links:

Wand

Pillow

    
answered by 21.06.2018 / 22:20