Search within a PDF with Python


I am reading an encrypted PDF, and when I search for the items listed below, nothing is found. I suspect my regular expression is wrong.

import PyPDF2
import re

file = open('imagen.pdf', 'rb')
pdfreader = PyPDF2.PdfFileReader(file)
if pdfreader.isEncrypted:
    pdfreader.decrypt('')
    pageobj = pdfreader.getPage(0)
    pdftext = pageobj.extractText()
    tipo1 = re.match(r'(([a-zA-Z]{1,4})[0-9]{1,5}())', pdftext)

The regular expression has to find the following:

VGE07011_004.IFD  
VGH50052_007.IFD  
VIE01039.012  
VTGE0037   
Vie01025_001.IFD
    
asked by Memo 30.05.2017 at 18:51

1 answer


To start, note what the documentation says about re.match:

  

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
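
A small sketch of that behavior may help. The two-line sample string and the variable names are made up for illustration; only the `re` calls themselves come from the standard library:

```python
import re

# Illustrative sample: the code we want is on the second line.
sample = "first line\nVGE07011_004.IFD\n"
pattern = r'[a-zA-Z]{1,4}[0-9]{1,5}'

# re.match only tries position 0 of the whole string, even with re.MULTILINE.
at_start = re.match(pattern, sample, re.MULTILINE)
print(at_start)            # None

# re.search scans forward and returns the first hit anywhere in the string.
first_hit = re.search(pattern, sample)
print(first_hit.group())   # VGE07011

# re.findall returns every non-overlapping hit as a list of strings.
all_hits = re.findall(pattern, sample)
print(all_hits)            # ['VGE07011']
```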

So re.match only finds the pattern at the very beginning of the string. You could split the text and check it line by line, but a simpler option is re.findall. I also modified the pattern, because as I understand it the original would not match what you want:

import re

text = """
VGE07011_004.IFD
Something else I do not want to match
VGH50052_007.IFD
VIE01039.012
VTGE0037
Vie01025_001.IFD
"""
# 1 to 4 letters, followed by
# 1 to 5 digits, followed by
# one or more letters, digits, underscores, or periods
# (the match stops at the first character outside that set)
regex = r'[a-zA-Z]{1,4}[0-9]{1,5}[a-zA-Z_0-9.]+'

for m in re.findall(regex, text):
    print(m)

The output:

VGE07011_004.IFD
VGH50052_007.IFD
VIE01039.012
VTGE0037
Vie01025_001.IFD
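
To tie this back to the PDF in the question, here is a rough sketch, not a tested solution. It assumes the same legacy PyPDF2 interface the question uses (`PdfFileReader`, `isEncrypted`, `decrypt`, `getPage`, `extractText`), which newer PyPDF2 releases have renamed. The helper names `find_codes` and `codes_from_pdf` are hypothetical. Note that the page text is extracted whether or not the file is encrypted, whereas the question's code only extracts inside the `if` branch:

```python
import re

# Same pattern as above, compiled once for reuse.
CODE_PATTERN = re.compile(r'[a-zA-Z]{1,4}[0-9]{1,5}[a-zA-Z_0-9.]+')


def find_codes(text):
    """Return every substring of text that matches the code pattern."""
    return CODE_PATTERN.findall(text)


def codes_from_pdf(path, page_number=0, password=''):
    """Extract one page's text and return the codes found in it.

    Assumes the legacy PyPDF2 API used in the question.
    """
    # Imported here so find_codes works even without PyPDF2 installed.
    import PyPDF2

    with open(path, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        if reader.isEncrypted:
            reader.decrypt(password)
        # Extract the text regardless of whether the file was encrypted.
        page_text = reader.getPage(page_number).extractText()
    return find_codes(page_text)
```

With the file from the question this would be called as `codes_from_pdf('imagen.pdf')`. How well it works depends on what `extractText()` actually returns for that PDF, so it is worth printing the extracted text first.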
    
answered by 30.05.2017 / 20:27