I need to extract text from PDFs, for which I use pdftotext or pdftohtml depending on the case. However, I have found a PDF on which my script fails. The text in the PDF is not stored as an image, but after a little research I realized that if I select the text with the mouse, copy it, and paste it anywhere, I get garbage instead of the actual characters.
Does anyone know why this happens? I thought that text in a PDF was stored as ordinary text characters. I have tried converting the PDF to images and running OCR on them, and it works, but the result is not good enough: the OCR errors make the output unusable for my script.
The PDF that gives me problems is this . The text of the order copies correctly, but the text of all the attachments turns into garbage.
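
Since only part of the document is affected, one option might be to detect the broken pages automatically and send only those through OCR. The following is a minimal sketch of that idea, not a tested solution: the function names and the 0.3 threshold are hypothetical, and the heuristic only catches output made of control, private-use, unassigned, or replacement characters. It would miss a font that maps glyphs to wrong but otherwise valid letters. It assumes pdftotext is on the PATH.

```python
import subprocess
import unicodedata


def garbage_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like extraction debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = sum(
        1
        for c in chars
        # Unicode categories starting with "C" cover control, format,
        # private-use, surrogate, and unassigned code points.
        if c == "\ufffd" or unicodedata.category(c).startswith("C")
    )
    return suspicious / len(chars)


def looks_like_garbage(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical cutoff; tune it against pages known to be good or bad."""
    return garbage_ratio(text) > threshold


def extract_page(pdf_path: str, page: int) -> str:
    """Run pdftotext on a single page and return its text (written to stdout via '-')."""
    result = subprocess.run(
        ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

A script could call `extract_page(path, n)` for each page, keep the text when `looks_like_garbage` returns False, and fall back to OCR for that page otherwise, so that OCR errors are limited to the attachments.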