Convert a PDF to a TXT file in C [closed]

-2

My idea is to read a PDF and then create a TXT file, inserting the PDF's text line by line, without images or anything like that. I want to do it purely in C (not C++), and I cannot find any library that is reasonably easy to use, since I do not consider myself an expert on the subject, far from it.

I hope you can help me.

    
asked by WhySoBizarreCode 21.07.2018 at 23:07

1 answer

0

This is what a PDF actually looks like if you inspect its contents with a text editor (I have included only part of it):

%PDF-1.4
%¿÷¢þ
1 0 obj
<< /Pages 3 0 R /Type /Catalog >>
endobj
2 0 obj
<< /CreationDate (D:20180722014716+02'00') /Creator () /Producer <feff0051007400200035002e00310031002e0031> /Title () >>
endobj
3 0 obj
<< /Count 1 /Kids [ 4 0 R ] /ProcSet [ /PDF /Text /ImageB /ImageC ] /Type /Pages >>
endobj
4 0 obj
<< /Annots 5 0 R /Contents 6 0 R /MediaBox [ 0 0 595 842 ] /Parent 3 0 R /Resources 7 0 R /Type /Page >>
endobj
5 0 obj
[ ]
endobj
6 0 obj
<< /Length 58692 >>
stream
/GSa gs /CSp cs /CSp CS
0.060000000 0 0 -0.060000000 10.0199999 831.980000 cm
q q
Q
Q q
q
Q
q
12.5000000 0 0 12.5000000 0 0 cm
/CSp cs 0 0 0 scn
/GSa gs
0 0 0 SCN
1 w 0 J 2 M 0 j []0  d
Q
Q q
0 0 m
9574.99984 0 l
9574.99984 13687.1471 l
0 13687.1471 l
0 0 l
h
W* n
q
10.0052244 0 0 10.0052244 0 0 cm
/CSp cs 0 0 0 scn
/GSa gs
0 0 0 SCN
1 w 0 J 2 M 0 j []0  d
Q
q

Basically, what you have to do is write code that understands this structure and can extract the text from it.
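
To give a rough idea of what such code involves, here is a minimal sketch in C that pulls the text out of a page's content stream. The names `read_literal` and `extract_tj_text` are made up for this example. It assumes the stream is already uncompressed (many real PDFs compress their streams, for example with /FlateDecode, and those would have to be inflated first) and it only handles literal strings followed by the `Tj` operator. Hex strings, `TJ` arrays, octal escapes and font encodings are all ignored, so treat it as a starting point rather than a working converter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode one PDF literal string starting at s[i] == '('.
 * Appends the decoded bytes to out (never past cap - 1) and returns
 * the index just after the closing ')'. Octal escapes are not handled. */
static size_t read_literal(const char *s, size_t n, size_t i,
                           char *out, size_t cap, size_t *len)
{
    int depth = 0;
    for (; i < n; i++) {
        char c = s[i];
        if (c == '\\' && i + 1 < n) {
            c = s[++i];                 /* escaped character */
            switch (c) {
            case 'n': c = '\n'; break;
            case 'r': c = '\r'; break;
            case 't': c = '\t'; break;
            case 'b': c = '\b'; break;
            case 'f': c = '\f'; break;
            default: break;             /* \( \) \\ keep the character */
            }
        } else if (c == '(') {
            if (depth++ == 0)
                continue;               /* skip the outer '(' */
        } else if (c == ')') {
            if (--depth == 0)
                return i + 1;           /* end of the string */
        }
        if (*len + 1 < cap)
            out[(*len)++] = c;
    }
    return i;                           /* unterminated: end of input */
}

/* Copy the operand of every "(...) Tj" found in the content stream s
 * (n bytes) into out, one per line. Returns the number of bytes written,
 * not counting the terminating NUL. */
size_t extract_tj_text(const char *s, size_t n, char *out, size_t cap)
{
    size_t len = 0, i = 0;
    assert(s != NULL && out != NULL && cap > 0);
    while (i < n) {
        if (s[i] != '(') {
            i++;
            continue;
        }
        size_t start = len;
        i = read_literal(s, n, i, out, cap, &len);
        size_t j = i;
        while (j < n && (s[j] == ' ' || s[j] == '\t' ||
                         s[j] == '\r' || s[j] == '\n'))
            j++;
        if (j + 2 <= n && memcmp(s + j, "Tj", 2) == 0) {
            if (len + 1 < cap)
                out[len++] = '\n';      /* keep this string */
            i = j + 2;
        } else {
            len = start;                /* not shown by Tj: discard it */
        }
    }
    out[len] = '\0';
    return len;
}
```

Called on a stream such as `BT (Hello) Tj ET`, the function writes `Hello` followed by a newline into `out`. Writing the result to a .txt file is then an ordinary `fopen`/`fputs` job.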

The proper way to do this would be to read the standard, in order to understand how these files are structured (and therefore how they are created and read).
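
As a first small step in that direction, the sketch below reads the version from the `%PDF-1.4` header line and counts the `N G obj` lines that open each object in the excerpt above. `pdf_version` and `count_objects` are made-up names, and the code assumes the file has been loaded into a NUL-terminated buffer, which is not safe for a real PDF containing binary streams. As far as I know, a real reader would locate objects through the cross-reference table at the end of the file instead of scanning line by line.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Read the version from a "%PDF-M.N" header.
 * Returns 1 on success and 0 if the buffer does not start with one. */
int pdf_version(const char *buf, int *major, int *minor)
{
    assert(buf != NULL && major != NULL && minor != NULL);
    return sscanf(buf, "%%PDF-%d.%d", major, minor) == 2;
}

/* Count the lines of the form "N G obj", which open an indirect object. */
size_t count_objects(const char *buf)
{
    size_t count = 0;
    const char *line = buf;
    assert(buf != NULL);
    while (line != NULL && *line != '\0') {
        int num, gen, used = 0;
        /* %n is only stored when the literal "obj" has matched too. */
        if (isdigit((unsigned char)*line) &&
            sscanf(line, "%d %d obj%n", &num, &gen, &used) == 2 &&
            used > 0)
            count++;
        line = strchr(line, '\n');      /* move on to the next line */
        if (line != NULL)
            line++;
    }
    return count;
}
```

For the header shown above, `pdf_version` would report major 1 and minor 4.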

But if you want to avoid that, I did a search and it seems you can find ready-made examples that extract the text. Here is one of them; you can download the code and inspect it.

    
answered by 22.07.2018 / 02:07