Extract text between two words

4

As the title says, I want to extract the text between two words. My idea was to use sed to substitute the part I am looking for and then extract it with grep and cut, but every attempt has failed completely. Let me give some context.

I have this text with a lot of code above and below:

<div class="item">
    <div class="imagens">
        <a href="http://sitio.php">
            <img src="https://image.jpg" alt="texto" width="100%" height="100%"/></a>
        <span class="imdb"><b><b class="icon-star"></b></b> 7.2</span>
    </div>
    <span class="text">texto</span>
    <span class="fecha">2016</span>
</div>
<div class="item">
    <div class="imagens">
        <a href="http://sitio.php">
            <img src="https://image.jpg" alt="texto" width="100%" height="100%"/></a>
        <span class="imdb"><b><b class="icon-star"></b></b> 7.2</span>
    </div>
    <span class="text">texto</span>
    <span class="fecha">2015</span>
</div>

Example: I pass these two strings as parameters, " <div class="item"><div class="imagens"> " and " </div> ", so that the output is text like this:

<a href="http://sitio.php">
        <img src="https://image.jpg" alt="texto" width="100%" height="100%"/></a>
    <span class="imdb"><b><b class="icon-star"></b></b> 7.2</span>
</div>
<span class="text">texto</span>
<span class="fecha">2016</span>

EDIT

I think @Ivan Botero's answer comes closest to solving my problem, but I still cannot select up to the second </div>.

The tags that should not appear in the output are:

<div class="item">
    <div class="imagens">

and the second:

</div>

That is because the first one is in the middle of the code, right after the closing </span> tag. Any help, please?

    
asked by juan 25.01.2017 at 04:41

3 answers

1

In the end I managed to write a bash script that does what I needed. Thanks, everyone, for the help!

As I said at the start, it would have been great to do this with sed, but it does not seem possible because the text </div> appears twice and I do not think sed can work with that. My solution is as follows:

#!/bin/bash
# -*- coding: utf-8 -*-
# encontrar: 0 = outside an item, 1 = item opened, 3 = inside <div class="imagens">
# primera_fuera: used to skip the <div class="imagens"> line itself
encontrar=0
primera_fuera=0
# read strips leading/trailing whitespace, so indented tags compare equal
while read -r texto; do
    if [ "$encontrar" -eq 0 ] && [ "$texto" = '<div class="item">' ]; then
        encontrar=1
    fi

    if [ "$encontrar" -eq 1 ] && [ "$texto" = '<div class="imagens">' ]; then
        encontrar=3
        primera_fuera=1
    fi

    if [ "$encontrar" -gt 2 ] && [ "$primera_fuera" -eq 0 ]; then
        if [ "$texto" != '</div>' ]; then
            # Output of the text we are looking for
            printf '%s\n' "$texto"
        fi
    else
        primera_fuera=$((primera_fuera - 1))
    fi

    if [ "$encontrar" -gt 1 ] && [ "$texto" = '</div>' ]; then
        echo ""
        encontrar=$((encontrar - 1))
        if [ "$encontrar" -eq 1 ]; then
            encontrar=0
        fi
    fi
done < index.html
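
For comparison, the depth-tracking idea in the script above can be sketched more compactly with awk. This is only a sketch under assumptions: every tag sits on its own line, as in the sample, and the function name extraer_items is made up for illustration. Unlike the script above, it keeps the inner </div> and the two <span> lines and drops only the two opening tags and the outer </div>, which is what the expected output in the question shows.

```shell
# Hypothetical helper: print the contents of each <div class="item"> block,
# without the two opening tags and without the outer closing </div>.
extraer_items() {
    awk '
        # t is a trimmed copy of the line, used only for comparisons
        { t = $0; sub(/^[ \t]+/, "", t); sub(/[ \t\r]+$/, "", t) }
        !dentro && t == "<div class=\"item\">" { dentro = 1; nivel = 1; next }
        !dentro { next }
        # the inner opening tag increases the depth but is not printed
        t == "<div class=\"imagens\">" { nivel++; next }
        t ~ /^<div/ { nivel++ }
        t == "</div>" {
            nivel--
            # depth 0 means this was the outer </div>: end of the item
            if (nivel == 0) { dentro = 0; print ""; next }
        }
        { print }
    ' "$1"
}

# Usage: extraer_items index.html
```

Each item is followed by an empty line so that consecutive items stay separated in the output.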
    
answered by 03.02.2017 at 23:22
2

Greetings. I have been looking at what you need and, since you say that bash can be used, I wrote a script that I hope helps with your problem.

script.sh

#!/bin/bash

# Parameters
INICIO=$1
FINAL=$2
ARCHIVO=$3

# Escape every "/" in INICIO and FINAL as "\/"
E_INICIO="${INICIO//\//\\/}"
E_FINAL="${FINAL//\//\\/}"

# Expression to search for: print from INICIO to FINAL, then quit
EXPRESION="/^$E_INICIO/,/^$E_FINAL/{p;/$E_FINAL/q}"

sed -n "$EXPRESION" "$ARCHIVO"

It takes three parameters: the start string, the end string, and the file. It then escapes the two strings (that is, it replaces / with \/) so that they can be used in the regular expression that is passed to the sed command.
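
To make the escaping step concrete, here is a small sketch of the same replacement done with sed itself, which also works in a plain sh. The function name escapar_barras is made up for illustration, and only the / character is handled. Other regex metacharacters such as . or * are left untouched.

```shell
# Hypothetical helper (not part of script.sh): escape every "/" in a string.
escapar_barras() {
    # s:/:\\/:g uses ":" as the delimiter, so "/" needs no escaping in the command itself
    printf '%s\n' "$1" | sed 's:/:\\/:g'
}

escapar_barras '</div>'    # prints <\/div>
```

The escaped string can then be placed between the / delimiters of a sed address without ending the expression early.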

I hope it helps.

Here is an example of how it works:

archivo.html

<div class="item">
<div class="imagens">
    <a href="http://sitio.php">
        <img src="https://image.jpg" alt="texto" width="100%" height="100%"/></a>
    <span class="imdb"><b><b class="icon-star"></b></b> 7.2</span>
</div>
<span class="text">texto</span>
<span class="fecha">2016</span>
</div>
<div class="item">
<div class="imagens">
    <a href="http://sitio.php">
        <img src="https://image.jpg" alt="texto" width="100%" height="100%"/></a>
    <span class="imdb"><b><b class="icon-star"></b></b> 7.2</span>
</div>
<span class="text">texto</span>
<span class="fecha">2015</span>
</div>

Console

bash script.sh '<div class="item">' '</div>' archivo.html

Result

<div class="item">
<div class="imagens">
    <a href="http://sitio.php">
        <img src="https://image.jpg" alt="texto" width="100%" height="100%"/></a>
    <span class="imdb"><b><b class="icon-star"></b></b> 7.2</span>
</div>
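
The result stops at the first </div> because the inner match that triggers q is not anchored to the start of the line. If the outer tags start at column 0 and the inner ones are indented, as in the sample from the question, a sed-only variant might get further. The sketch below assumes exactly that layout (it would not work on the unindented example file above), and the function name extraer_por_sangria is made up for illustration.

```shell
# Hypothetical helper: relies on the outer <div class="item"> and its closing
# </div> starting at column 0 while the inner tags are indented.
extraer_por_sangria() {
    sed -n '/^<div class="item">/,/^<\/div>/{
/^<div class="item">/d
/^[[:space:]]*<div class="imagens">/d
/^<\/div>/d
p
}' "$1"
}

# Usage: extraer_por_sangria index.html
```

The range ends only at an unindented </div>, so the indented inner </div> is printed along with the two <span> lines, while the three unwanted tag lines are deleted before the p command runs.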
    
answered by 26.01.2017 at 16:33
1

The error you are getting:

SyntaxError: Non-ASCII character '\xc3' in file test.py on line 7, but no encoding declared; see link for detail

occurs because your script contains non-ASCII characters (the one you took from the other question has accented letters). You must declare the character encoding that Python will use to read the file. This applies to Python 2.7 and earlier; from Python 3 onward it is not necessary.

# coding: utf-8
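
If you want to see which lines contain the offending characters, something like the following sketch may help. It assumes a grep that supports the -a option (GNU and BSD grep do), and the function name lineas_no_ascii is made up for illustration.

```shell
# Hypothetical helper: print the numbers of the lines that contain bytes outside
# printable ASCII and whitespace (this includes the bytes of accented letters in UTF-8).
lineas_no_ascii() {
    # LC_ALL=C makes grep work byte by byte; -a treats the file as text
    LC_ALL=C grep -an '[^[:print:][:space:]]' "$1" | cut -d: -f1
}

# Usage: lineas_no_ascii test.py
```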
    
answered by 25.01.2017 at 13:03