I need a PHP search engine for PDF files

The idea is a simple search form that returns the PDFs in a directory whose contents match the requested term. The search has to look inside text-based PDFs. I generally do everything in PHP, but if I have to use JavaScript, that is no problem. I did something similar with PHP's strpos on a plain .txt file, but the same approach does not work on a PDF because the file is binary.

Well, I almost have what I want, except that the search engine now fails to find some words. The code looks like this:

 <meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no">


<form action="" method="post">
    <input type="text" name="buscar">
    <input type="submit" name="Submit" value="Buscar">
</form>


<?php




if(isset($_POST['Submit'])) {



include_once('class.pdf2text.php');

$directorio = opendir("./pdf"); // "pdf" directory under the current path
while ($archivo = readdir($directorio)) // fetch one file, then the next, and so on
{



        $url = 'pdf/'.$archivo;

$a = new PDF2Text();
$a->setFilename($url);
$a->decodePDF();
$pdf = utf8_encode($a->output());


$larCharsNoAble = array("Ñ","á","é","í","ó","ú","Á","É","Í","Ó","Ú","ñ","À","Ã","Ì","Ò","Ù","Ù","à ","è","ì","ò","ù","ç","Ç","â","ê","î","ô","û","Â","Ê","ÃŽ","Ô","Û","ü","ö","Ö","ï","ä","«","Ò","Ã","Ä","Ë");
$larCharsAble = array("N","a","e","i","o","u","A","E","I","O","U","n","N","A","E","I","O","U","a","e","i","o","u","c","C","a","e","i","o","u","A","E","I","O","U","u","o","O","i","a","e","U","I","A","E");
$texto = str_replace($larCharsNoAble, $larCharsAble, $pdf);






$cadena_solicitada   = $_POST['buscar'];
$larCharsNoAble = array("Ñ","á","é","í","ó","ú","Á","É","Í","Ó","Ú","ñ","À","Ã","Ì","Ò","Ù","Ù","à ","è","ì","ò","ù","ç","Ç","â","ê","î","ô","û","Â","Ê","ÃŽ","Ô","Û","ü","ö","Ö","ï","ä","«","Ò","Ã","Ä","Ë");
$larCharsAble = array("N","a","e","i","o","u","A","E","I","O","U","n","N","A","E","I","O","U","a","e","i","o","u","c","C","a","e","i","o","u","A","E","I","O","U","u","o","O","i","a","e","U","I","A","E");
$post = str_replace($larCharsNoAble, $larCharsAble, $cadena_solicitada);


$posicion_coincidencia = stripos($texto, $post);




if ($posicion_coincidencia == true) {
    echo 'Se ha encontrado "'.$post.'"" en el archivo <a href="'.$url.'">'.$archivo.'</a><br>';
} 


 }

}


?>
    
asked by Pablo 20.01.2017 at 10:14

1 answer

If you need to access the text content of a PDF document, you have several ways to do it.

One of them is to use the PDF2Text class (class.pdf2text.php).

Usage:

include('class.pdf2text.php');
$a = new PDF2Text();
$a->setFilename('test.pdf');
$a->decodePDF();
echo $a->output();

Another option is to call a CLI tool (pdftotext, for example) using system() or, better yet, the backtick operator:

// The trailing "-" makes pdftotext write the text to stdout instead of document.txt
$texto = `pdftotext -raw document.pdf -`;
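
As a minimal sketch of the CLI approach, assuming the pdftotext binary (shipped with poppler-utils or Xpdf) is installed and on the PATH of a Unix-like server: the helper name pdf_a_texto() is made up for illustration. escapeshellarg() protects file names that contain spaces or shell metacharacters, and the final "-" asks pdftotext to print to standard output rather than create a .txt file next to the PDF.

```php
<?php
// Hypothetical helper: extract the text of one PDF with the pdftotext CLI tool.
// Assumes pdftotext is installed and reachable through the PATH.
function pdf_a_texto(string $ruta): ?string
{
    if (!is_file($ruta)) {
        return null; // nothing to convert
    }
    // The final "-" sends the extracted text to stdout instead of a .txt file.
    // "2>/dev/null" discards error messages (assumes a Unix-like shell).
    $comando = 'pdftotext -raw ' . escapeshellarg($ruta) . ' - 2>/dev/null';
    $salida = shell_exec($comando);
    // shell_exec() does not return a string when the command fails or prints nothing.
    return is_string($salida) ? $salida : null;
}
```

It could then be called as pdf_a_texto(__DIR__ . '/pdf/document.pdf'), with a null result meaning that no text could be extracted.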

Edit: I tried the class I recommended above and it does not work very well (pdftotext, on the other hand, works perfectly), so I recommend a different library, pdfparser, which can be installed through Composer with, for example, php composer.phar require smalot/pdfparser.

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no">
    <title>Buscador de PDFs</title>
  </head>

  <body>
    <form action="<?= htmlspecialchars($_SERVER['PHP_SELF']) ?>" method="post">
      <input type="text" name="buscar" placeholder="Palabras a buscar"
             value="<?= isset($_POST['buscar'])?htmlspecialchars($_POST['buscar']):'' ?>" />
      <input type="submit" value="Buscar" />
    </form>
<?php
if (isset($_POST['buscar'])) {
  include 'vendor/autoload.php';
  /* Set 'es_ES.UTF-8' or any other available UTF-8 locale to replace the
    default one, POSIX, which does not work correctly. Beware of the
    'C.UTF-8' locale: it does not convert punctuation such as exclamation marks. */
  setlocale(LC_CTYPE, 'es_ES.UTF-8', 'es.UTF-8', 'C.UTF-8');
  $post = strtolower(iconv(mb_detect_encoding($_POST['buscar'], 'utf-8,iso-8859-15'), 'ASCII//TRANSLIT', $_POST['buscar']));
  echo "<p>Buscando la/s palabra/s '", htmlspecialchars($post), "'</p>\n";
  /* "pdf" subdirectory inside the current path */
  $directorio = opendir(__DIR__ . '/pdf');
  /* Fetch the files one after another */
  while ($archivo = readdir($directorio)) {
    /* If the file does not have a '.pdf' extension, skip to the next one */
    if (!preg_match('/\.pdf$/i', $archivo)) {
      continue;
    }
    /* Get the contents of the PDF file */
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile(__DIR__ . '/pdf/' . $archivo);
    $texto = $pdf->getText();
    /* Normalize the contents of the PDF document the same way as the query */
    $texto = strtolower(iconv(mb_detect_encoding($texto, 'utf-8,iso-8859-15'), 'ASCII//TRANSLIT', $texto));
    echo "<p>Probando en el archivo '", htmlspecialchars($archivo), "': '" . htmlspecialchars($texto) . "'</p>\n";
    $posicion_coincidencia = strpos($texto, $post);
    if ($posicion_coincidencia !== false) {
      echo 'Se ha encontrado "', htmlspecialchars($post), '" en el archivo <a href="pdf/', rawurlencode($archivo), '">',
        htmlspecialchars($archivo), '</a><br>';
    } 
  }
}
?>
  </body>
</html>
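
The listing above checks the result of strpos() with !== false, whereas the code in the question uses == true. A small sketch of why that matters: strpos() returns the integer offset of the match, and an offset of 0 (a match at the very start of the text) compares loosely equal to false, so such matches are silently skipped. The helper name contiene() and the sample text are made up for illustration.

```php
<?php
// Hypothetical helper: true when $aguja occurs anywhere in $pajar.
function contiene(string $pajar, string $aguja): bool
{
    // strpos() returns an int offset, or false when there is no match.
    // Offset 0 is falsy, so the comparison has to be strict.
    return strpos($pajar, $aguja) !== false;
}

$texto = 'factura de enero';

var_dump(strpos($texto, 'factura') == true); // bool(false): offset 0 is treated as "not found"
var_dump(contiene($texto, 'factura'));       // bool(true)
var_dump(contiene($texto, 'enero'));         // bool(true)
var_dump(contiene($texto, 'febrero'));       // bool(false)
```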

Please note that correctly configuring the system locale is essential for iconv's //TRANSLIT conversion to work.
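
To make that dependency easier to spot, here is a hedged sketch of the normalization step as a reusable function. The name normalizar() and the UTF-8 fallback are assumptions, not part of the script above. setlocale() returns false when none of the requested locales is installed, which is worth checking: without a suitable locale, //TRANSLIT typically produces "?" in place of the unaccented letters on glibc-based systems. The sketch needs the mbstring and iconv extensions.

```php
<?php
// Hypothetical helper mirroring the normalization used in the search script:
// transliterate to ASCII and lower-case, so query and document compare alike.
function normalizar(string $texto): string
{
    $codificacion = mb_detect_encoding($texto, 'UTF-8, ISO-8859-15', true);
    if ($codificacion === false) {
        $codificacion = 'UTF-8'; // assumption: fall back to UTF-8 when detection fails
    }
    $ascii = @iconv($codificacion, 'ASCII//TRANSLIT', $texto);
    // iconv() returns false on failure; fall back to the untouched input.
    return strtolower($ascii === false ? $texto : $ascii);
}

// setlocale() returns the locale it selected, or false if none is available.
$local = setlocale(LC_CTYPE, 'es_ES.UTF-8', 'es.UTF-8', 'C.UTF-8');
if ($local === false) {
    echo "No UTF-8 locale available; transliteration may be unreliable\n";
}

echo normalizar('HOLA Mundo'), "\n"; // pure ASCII input is only lower-cased
```

Applying the same function to both the search term and the extracted PDF text keeps the two sides of the comparison consistent.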

    
answered 20.01.2017 at 10:29