Search within a pdf with Python

Question

I am reading an encrypted PDF, but when I search its text for the items listed below, nothing is found. I suspect my regular expression is wrong.

import PyPDF2
import re

file = open('imagen.pdf', 'rb')
pdfreader = PyPDF2.PdfFileReader(file)
if pdfreader.isEncrypted:
    pdfreader.decrypt('')
    pageobj = pdfreader.getPage(0)
    pdftext = pageobj.extractText()
    tipo1 = re.match(r'(([a-zA-Z]{1,4})[0-9]{1,5}())', pdftext)

The regular expression has to find the following:

VGE07011_004.IFD  
VGH50052_007.IFD  
VIE01039.012  
VTGE0037   
Vie01025_001.IFD

python python-3.x pdf

asked by Memo 30.05.2017 at 16:51

1 answer

score 1 · Accepted Answer

To start, re.match only matches at the very beginning of the string, as the documentation points out:

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
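A small sketch of that behaviour, using a made-up two-line string and a shortened version of the pattern (both are illustrative assumptions, not taken from the PDF):

```python
import re

# Hypothetical sample: the code we want is on the second line.
sample = "header line\nVGE07011_004.IFD\n"
pattern = r'[a-zA-Z]{1,4}[0-9]{1,5}'

# re.match is anchored at position 0 of the whole string,
# even when re.MULTILINE is passed.
at_start = re.match(pattern, sample, re.MULTILINE)

# re.search scans forward and returns the first match anywhere.
first_hit = re.search(pattern, sample)

print(at_start)           # None
print(first_hit.group())  # VGE07011
```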

So your pattern is only found if it sits at the very start of the string. You could split the text and check it line by line, but it is simpler to use re.findall. I also modified the pattern, because as written it would not match what you want:

import re

text = """
VGE07011_004.IFD
Something else I do not want to match
VGH50052_007.IFD
VIE01039.012
VTGE0037
Vie01025_001.IFD
"""
# 1 to 4 letters +
# 1 to 5 digits +
# one or more letters, digits, underscores or dots,
# up to the next character that is none of those
regex = r'[a-zA-Z]{1,4}[0-9]{1,5}[a-zA-Z_0-9.]+'

for m in re.findall(regex, text):
    print(m)

The output:

VGE07011_004.IFD
VGH50052_007.IFD
VIE01039.012
VTGE0037
Vie01025_001.IFD
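Putting this together with the code from the question might look roughly like the sketch below. It assumes the legacy PyPDF2 API used in the question (PdfFileReader, isEncrypted, decrypt, getPage, extractText), which newer PyPDF2 releases no longer provide; the helper names find_codes and codes_on_first_page are hypothetical. Note that the extraction is moved out of the if block, so it also runs when the PDF is not encrypted.

```python
import re

# Same pattern as in the answer above.
CODE_RE = re.compile(r'[a-zA-Z]{1,4}[0-9]{1,5}[a-zA-Z_0-9.]+')


def find_codes(text):
    """Return every code-like token found anywhere in text."""
    return CODE_RE.findall(text)


def codes_on_first_page(path, password=''):
    """Hypothetical helper: extract page 0 of a PDF and search it."""
    # Imported here so find_codes can be used without PyPDF2 installed.
    import PyPDF2

    with open(path, 'rb') as fh:
        reader = PyPDF2.PdfFileReader(fh)
        # Decrypt only when needed; extraction happens in both cases.
        if reader.isEncrypted:
            reader.decrypt(password)
        page_text = reader.getPage(0).extractText()
    return find_codes(page_text)
```

With the file from the question, this would be called as codes_on_first_page('imagen.pdf'). Whether the codes are actually found still depends on how cleanly extractText() recovers the page text.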