This is what a PDF actually looks like if you open it in a text editor (I've only included part of it):
%PDF-1.4
%¿÷¢þ
1 0 obj
<< /Pages 3 0 R /Type /Catalog >>
endobj
2 0 obj
<< /CreationDate (D:20180722014716+02'00') /Creator () /Producer <feff0051007400200035002e00310031002e0031> /Title () >>
endobj
3 0 obj
<< /Count 1 /Kids [ 4 0 R ] /ProcSet [ /PDF /Text /ImageB /ImageC ] /Type /Pages >>
endobj
4 0 obj
<< /Annots 5 0 R /Contents 6 0 R /MediaBox [ 0 0 595 842 ] /Parent 3 0 R /Resources 7 0 R /Type /Page >>
endobj
5 0 obj
[ ]
endobj
6 0 obj
<< /Length 58692 >>
stream
/GSa gs /CSp cs /CSp CS
0.060000000 0 0 -0.060000000 10.0199999 831.980000 cm
q q
Q
Q q
q
Q
q
12.5000000 0 0 12.5000000 0 0 cm
/CSp cs 0 0 0 scn
/GSa gs
0 0 0 SCN
1 w 0 J 2 M 0 j []0 d
Q
Q q
0 0 m
9574.99984 0 l
9574.99984 13687.1471 l
0 13687.1471 l
0 0 l
h
W* n
q
10.0052244 0 0 10.0052244 0 0 cm
/CSp cs 0 0 0 scn
/GSa gs
0 0 0 SCN
1 w 0 J 2 M 0 j []0 d
Q
q
Basically, what you have to do is write code that understands this structure and can extract the text from it.
The proper way to do that would be to read the standard, in order to understand how these files are structured (and therefore how they are created and how they are read).
But if you want to avoid that, I did a search and it seems there are ready-made examples for extracting the text. Here is one of them; you can download the code and inspect it.
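To give an idea of what "understanding this structure" involves, here is a minimal sketch in Python. It is not a real PDF parser. It assumes the content streams are uncompressed (as in the dump above, where the stream dictionary has no /Filter entry) and that text is drawn with the `Tj` and `TJ` operators using literal `( ... )` strings. The names `iter_streams`, `extract_text` and `sample` are made up for this illustration, and the sample data is hand-written, not taken from the file above.

```python
import re

# "N G obj ... endobj" (assumption: well-formed file, no "endobj" inside binary data)
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.S)
# Payload between "stream" and "endstream"
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.S)
# Literal string "( ... )" with backslash escapes, no nested parentheses
STR_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.S)
# Text-showing operators: "(...) Tj" or "[ ... ] TJ"
SHOW_RE = re.compile(
    rb"(\((?:\\.|[^\\()])*\))\s*Tj|(\[(?:\\.|[^\\\]])*\])\s*TJ", re.S
)
ESC_RE = re.compile(rb"\\([0-7]{1,3}|.)", re.S)
SIMPLE_ESCAPES = {b"n": b"\n", b"r": b"\r", b"t": b"\t",
                  b"b": b"\b", b"f": b"\f", b"\n": b""}


def unescape(raw: bytes) -> bytes:
    """Resolve backslash escapes inside a PDF literal string."""
    def repl(match):
        esc = match.group(1)
        if esc[:1] in b"01234567":          # octal character code
            return bytes([int(esc, 8) & 0xFF])
        return SIMPLE_ESCAPES.get(esc, esc)  # \( \) \\ map to themselves
    return ESC_RE.sub(repl, raw)


def iter_streams(pdf: bytes):
    """Yield (object number, raw stream bytes) for objects that have a stream."""
    for obj in OBJ_RE.finditer(pdf):
        stream = STREAM_RE.search(obj.group(3))
        if stream:
            yield int(obj.group(1)), stream.group(1)


def extract_text(content: bytes) -> list[str]:
    """Collect the string operands of Tj / TJ operators in a content stream."""
    chunks = []
    for show in SHOW_RE.finditer(content):
        operand = show.group(1) or show.group(2)
        pieces = [unescape(s) for s in STR_RE.findall(operand)]
        # Assumption: one byte per character; real fonts may use other encodings
        chunks.append(b"".join(pieces).decode("latin-1"))
    return chunks


# Hand-written sample object, only for illustration
sample = b"""4 0 obj
<< /Length 60 >>
stream
BT /F1 12 Tf 72 720 Td (Hello \\(PDF\\)) Tj
[(Wor) -20 (ld)] TJ ET
endstream
endobj
"""

for number, data in iter_streams(sample):
    print(number, extract_text(data))
```

Keep in mind that most real PDFs compress their streams (typically /Filter /FlateDecode, which you would have to inflate first, for example with `zlib.decompress`), may use hex strings `<...>`, and often use fonts whose byte values do not map directly to readable characters. That is exactly why reading the standard or using an existing library is the safer route.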