Modify links that contain an image [closed]

-1

I'm trying to find all the images ( <img> ) in some HTML that are wrapped in a link ( <a> ).

The expression I have come up with, which works more or less well, is:

'/<a.*?href=\"(.*?)\".*?>(<img.*?src=\"(.*?)\".*?>)<\/a>/'

The problem is that it also matches input of the form <a .. /> <a..><img ../></a> . That is, if other links come first, the match includes them, whereas what I want is to match exactly one link with an image inside.
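
A minimal reproduction of that over-match, as a sketch only. The sample HTML and its URLs are made up for illustration:

```php
<?php
// Hypothetical sample: a plain text link followed by a link that wraps an image
$html = '<a href="one.html">text</a> <a href="pic.png"><img src="pic.png" /></a>';

$pattern = '/<a.*?href=\"(.*?)\".*?>(<img.*?src=\"(.*?)\".*?>)<\/a>/';
preg_match_all($pattern, $html, $matches);

// The lazy .*? still has to stretch until it reaches an <img>,
// so the match starts at the FIRST <a and swallows the text link too
echo $matches[0][0], "\n"; // the whole sample string, both links included
echo $matches[1][0], "\n"; // one.html (href of the wrong link)
echo $matches[3][0], "\n"; // pic.png
```

The captured href belongs to the first link, not to the link that actually contains the image.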

I am working in PHP with preg_match_all , because I need not only to find these links but also to modify them. I need to get the URL of the link and the URL of the image so that, if they match, I can remove the link.

Editing again to make this clearer: the HTML is stored in a database (WordPress) and I cannot use a parser because it is not always valid HTML. Also, this is not code I intend to keep; I will use it once to make some changes and never again.

    
asked by Quidi90 24.07.2017 at 09:52

4 answers

2

You should not use regular expressions to process HTML. With an expression like yours, a small change in the HTML is enough to make the regex fail: an extra space, a change in the tag's attributes, a comment, or a more complex structure can defeat even a gigantic regex. A very advanced expression might handle almost every case, but you could nearly always find an odd input that breaks it. It would also take an expert every time you want to modify it.

Processing HTML with the DOM is very easy, and it is the tool designed for the job.


If we have an HTML like the following:

// Sample HTML
$html = '
        <a href="https://i.stack.imgur.com/mOJ0a.png">
            <span>Enlace a la misma URL de la imagen</span>
            <img src="https://i.stack.imgur.com/mOJ0a.png" />
        </a>

        <span>Imagen independiente precedida por un </span>
        <a href="https://i.stack.imgur.com/mOJ0a.png">enlace</a>
        <img src="https://i.stack.imgur.com/mOJ0a.png" />

        <a href="./">
            <span>Enlace a una URL diferente que la imagen</span>
            <img src="https://i.stack.imgur.com/mOJ0a.png" />
        </a>
';

The DOM is simply generated as follows:

// Wrap the fragment in a <body> element
$html = "<body>$html</body>";

// Build the DOM
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_COMPACT | LIBXML_HTML_NOIMPLIED | LIBXML_NONET | LIBXML_HTML_NODEFDTD);
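
Since the question mentions that the stored HTML is not always valid, it may help to collect libxml's parse warnings instead of letting PHP print them. This is a sketch under assumptions: the malformed fragment in $fragment is made up, and how many warnings libxml reports for it may vary between versions.

```php
<?php
// Hypothetical malformed fragment: unclosed <p>, stray </span>
$fragment = '<p>Unclosed paragraph <a href="pic.png"><img src="pic.png"></a><div>text</span></div>';

// Collect parse errors internally instead of emitting PHP warnings
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML("<body>$fragment</body>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NONET);

// Inspect (or simply discard) whatever the parser complained about
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The parser recovers from the errors and the image is still reachable
$imgCount = $dom->getElementsByTagName('img')->length;
echo $imgCount, " image(s) found, ", count($errors), " parse message(s)\n";
```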

And we can get all the links within the DOM with:

// Get all the links
$a_nodelist = $dom->getElementsByTagName('a');

Then loop over each one, checking whether it contains an image:

// Loop over a snapshot of the list: getElementsByTagName() returns a live
// list, and replacing nodes while iterating over it can skip elements
foreach (iterator_to_array($a_nodelist) as $enlace) {
    // Get the first image inside the link
    $img = $enlace->getElementsByTagName('img')->item(0);
    if ($img) { // the link contains an image
        // Compare the link URL with the image URL
        $urlEnlace = $enlace->getAttribute('href');
        $urlImagen = $img->getAttribute('src');
        if ($urlEnlace === $urlImagen) {
            // Same URL: replace the link with the image
            $enlace->parentNode->replaceChild($img, $enlace);
        }
    }
}

Here, $enlace->parentNode->replaceChild($img, $enlace); replaces a link whose image has the same URL with just the image.
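
As a possible variation, an XPath query can select only the links that contain an image, so the loop never visits plain text links. This is a sketch, not a tested drop-in replacement; the sample markup is made up for illustration.

```php
<?php
// Hypothetical sample: the first link points at its own image, the second does not
$html = '<body><a href="pic.png"><img src="pic.png"></a><a href="./"><img src="pic.png"></a></body>';

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

// Select only <a> elements that have an <img> descendant
foreach (iterator_to_array($xpath->query('//a[.//img]')) as $link) {
    // First image inside this particular link
    $img = $xpath->query('.//img', $link)->item(0);
    if ($link->getAttribute('href') === $img->getAttribute('src')) {
        // Same URL: keep the image, drop the link around it
        $link->parentNode->replaceChild($img, $link);
    }
}

$result = $dom->saveHTML();
echo $result;
```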

And, finally, we print the result:

// Print the result
echo $dom->saveHTML();


Result:

<body>
        <img src="https://i.stack.imgur.com/mOJ0a.png">

        <span>Imagen independiente precedida por un </span>
        <a href="https://i.stack.imgur.com/mOJ0a.png">enlace</a>
        <img src="https://i.stack.imgur.com/mOJ0a.png">

        <a href="./">
            <span>Enlace a una URL diferente que la imagen</span>
            <img src="https://i.stack.imgur.com/mOJ0a.png">
        </a>
</body>
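
If the output has to go back into the database without the <body> wrapper, one option is to serialize only the children of <body>. A sketch, assuming a made-up fragment in place of the real post content:

```php
<?php
// Hypothetical fragment standing in for the processed post content
$dom = new DOMDocument;
$dom->loadHTML('<body><p>One</p><img src="pic.png"></body>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Serialize each child of <body> separately, leaving out the wrapper itself
$body = $dom->getElementsByTagName('body')->item(0);
$inner = '';
foreach ($body->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

echo $inner;
```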


Demo:

See the demo at 3v4l.org

    
answered 24.07.2017 at 11:30
1

I know you ask

answered 24.07.2017 at 10:32
1

You can not parse [X]HTML with regular expressions, because HTML can not be parsed with regex. Regex is not a tool that can be used to correctly parse HTML. As I have already answered in many HTML-and-regex questions, the use of regex will not allow you to process HTML. Regular expressions are a tool that is not sophisticated enough to understand the constructs used by HTML. HTML is not a regular language and, therefore, can not be parsed by regular expressions. Regular expressions are not equipped to break HTML down into its meaningful parts. I have said it many times, but it does not seem to get through. Even irregularly enhanced regular expressions such as those used by Perl are not up to the task of processing HTML correctly. You will not be able to do it. HTML is a language of sufficient complexity that it can not be parsed with regular expressions. Not even Julio Iglesias can parse HTML with regular expressions. Every time you try to parse HTML with regular expressions, a Russian hacker hacks the webapp and an unholy child weeps the blood of virgins. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage and the sacrifice of children. The <center> cannot hold, it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind. If you parse HTML with regex you are giving in to Them and their blasphemous ways, which doom us all to inhuman toil for the One whose Name can not be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquefy the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
The parsers of HTML based on rege̿̔̉x- are the cancer that is killing Stack Overflow it is too late it is too late we can not save ourselves the transgression of a child ensures that the regex will consume all living tissue (except for HTML that can not be consumed, as prophesied) dear sir, help us survive this scourge using regex to parse HTML has condemned mankind to an eternity of torture and security holes using rege x as an HTML processing tool creates a gap between this world and the fearsome realm of unstable entities (like SGML entities, but more corrupt) a simple view zo to the world of parsers reg < b> ex for HTML would transpose immediatly to the consciousness of the p rogrammers has a scream without pause, appears , the pestilent inf ection of regex dev will pray your parser of HT ML, your application and your existence forever as a mere Visual Basic or worse it comes not luc hes it it comes v̡im̡ie̶ne, ̕h̵u radiaci͞n destr going҉ all luminescence, the HTML tags < b> filtering out of ̡tu ͟s eyes̸ ̛ like líq gone d oloroso, the song of parsear expression re onesgulares will exti nguir the voices of the man mor < b> such of the sp / b> era I can see it can ͚̖͔̙ver e s beautiful or the end extinguishing the lies of men TO DO EŚ͖̩͇̗̪̏̈TÁ ALL LOST EST A PE RDI DO e l pon̷ and he vie ne he v he has vien e the íco r pe rme a to do M I FACE M I FACE ᵒh two n o o NO NOO̼ OR ON Θ for l os án * ̶͑̾̾ ̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆ul͖͉̗̩̳̟̍ͫͥͨo s ͎ n o are rè̑ͧ̌aͨl̘̝̙ͤ̾̆es ZA̡͊͠͝LGΌ ESͮ҉̯͈͕̹̘ ALȳ̳ Ë͖̉L ͠P̯͍̭O̚ N̐Y̡ Ȩ̬̩̾͛ͪ̈͘L ͧ̾ͬ ̶̧̨̹̭̯ViENȆ̴̟̟͙̞ͩ͌͝ "

Have you tried using an XML parser?

    
answered 24.07.2017 at 10:56
1

If you want to parse it:

You could use an external library, " Simple HTML DOM Parser ", although it is normally used for scraping (at least, that is what I have used it for).

// Load the library (adjust the path to wherever simple_html_dom.php is installed)
require_once 'simple_html_dom.php';

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
$articles = [];
foreach($html->find('div.article') as $article) {
    $item['title']     = $article->find('div.title', 0)->plaintext;
    $item['intro']    = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

I imagine (I have not tested it) that it would also work if you start from a string like the following:

$html = "<div>...<a><img></img></a>..</div>";

The official documentation gives this example:

// Create a DOM object from a string
$html = str_get_html('<html><body>Hello!</body></html>');
    
answered 24.07.2017 at 10:54