Get the values of each cell of an HTML table with String.Split in vb.Net

3

Background: I'm building an application in which the user copies part of a web page's source code (a table, to be exact) so that I can store its data in a database. My plan was to split the string that the user pastes into a TextBox into an array and then insert the data in a loop. The problem is that I can't find the regular expression that "removes" everything between < and > so that I can split on it.

Here is an example of the string the user would enter:

<tbody><tr id="timetableBlocks-23-01-18" style="display: block;"><td style="width: 135px;">23/01/2018</td><td style="width: 135px;">759.655</td><td style="width: 135px;">46.705</td><td style="width: 135px;">42.724</td><td style="width: 135px;">224.863</td><td style="width: 135px;">76.364</td><td style="width: 135px;">171.784</td><td style="width: 135px;">0</td><td style="width: 135px;">197.215</td></tr><tr id="timetableBlocks-22-01-18" style="display: block;"><td style="width: 135px;">22/01/2018</td><td style="width: 135px;">553.995</td><td style="width: 135px;">42.573</td><td style="width: 135px;">194.736</td><td style="width: 135px;">26.671</td><td style="width: 135px;">221.950</td><td style="width: 135px;">13.780</td><td style="width: 135px;">0</td><td style="width: 135px;">54.285</td></tr></tbody>

So the goal is, so to speak, to strip the HTML tags and keep only the text they enclose. This is the code I'm using:

Dim DatosBruto As String = TextBox1.Text

If DatosBruto Like "<tbody>*</tbody>" Then
    Label7.Text = "the data was entered correctly"

    Dim pattern As String = "(<*>)" ' <-- this is where it fails
    Dim DatosPartido() As String = Regex.Split(DatosBruto, pattern, RegexOptions.IgnoreCase)

    Label1.Text = DatosPartido(1)
    Label2.Text = DatosPartido(2)
    Label3.Text = DatosPartido(3)
    Label4.Text = DatosPartido(4)
    Label5.Text = DatosPartido(5)
    Label6.Text = DatosPartido(6)
End If

The regular expression shown is simply the last one I tried, and it obviously doesn't work. Any idea what it should be?

Right now the results make no sense (it splits wherever it likes). For now I'm putting the values in labels until I get what I want. The goal would be to get something like:

label1 = 23/01/2018, 
label2 = 759.655, 
label3 = 46.705, 
label4 = 42.724, 
label5 = 224.863, 
label6 = 76.364
...
    
asked by CharlieMR 26.04.2018 at 11:15

1 answer

4

Solution with regex (not recommended)

A simple way is to build a pattern that matches:

  • any amount of whitespace: \s*
  • the character that opens the tag: <
  • any number of characters other than > : [^>]*
  • the character that closes the tag: >

and all of this repeated one or more times, so it goes inside a non-capturing group: (?: ... )+

That is, the regex that matches one or more consecutive tags would be:

(?:\s*<[^>]*>)+

Because the string starts and ends with tags, Regex.Split will return empty elements at the beginning and end of the array, which you have to filter out afterwards.
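As a minimal sketch of how that pattern could be used with Regex.Split, assuming a shortened version of the string from the question (the module name and the shortened input are only for illustration):

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module RegexSplitSketch
    Sub Main()
        ' Assumption: shortened sample with the same shape as the string in the question
        Dim DatosBruto As String = "<tbody><tr><td style=""width: 135px;"">23/01/2018</td><td>759.655</td></tr></tbody>"

        ' One or more consecutive tags, each optionally preceded by whitespace
        Dim pattern As String = "(?:\s*<[^>]*>)+"

        ' Split on every run of tags
        Dim partes() As String = Regex.Split(DatosBruto, pattern)

        ' Drop the empty elements produced by the leading and trailing tags
        Dim DatosPartido() As String = partes.Where(Function(s) s.Length > 0).ToArray()

        For Each dato In DatosPartido
            Console.WriteLine(dato)
        Next
    End Sub
End Module
```

For comparison, the pattern in the question, "(<*>)", fails for two reasons. The * applies to the < character, so it matches "zero or more <, then >", which in this string means every lone >. And because the pattern is wrapped in a capturing group, Regex.Split also returns each captured > as an element of the array.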

However, regex is not the right tool for parsing HTML. There are countless cases where it fails: even with a gigantic regex, some corner of HTML syntax will always break it. To do this properly, you should parse the markup into a DOM (Document Object Model).

  • What happens if there is a ">" inside an attribute of a tag?
  • What happens if there are <!-- comments --> in the HTML?
  • What if there is a CDATA section?
  • Are you sure that your HTML can't be a little more complex than you expected?
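To make the first of those cases concrete, here is a small sketch that runs the same pattern over a cell whose attribute contains a ">". The input string is invented for illustration and is not taken from the page in the question:

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module RegexFailureSketch
    Sub Main()
        ' Hypothetical input: valid HTML with a ">" inside the title attribute
        Dim entrada As String = "<td title=""a > b"">1</td>"

        Dim pattern As String = "(?:\s*<[^>]*>)+"
        Dim partes() As String = Regex.Split(entrada, pattern).Where(Function(s) s.Length > 0).ToArray()

        ' Brackets make leading whitespace in the result visible
        For Each parte In partes
            Console.WriteLine("[" & parte & "]")
        Next
    End Sub
End Module
```

The regex treats the ">" inside the attribute as the end of the tag, so the only element left is [ b">1] rather than the cell value 1.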

There are several DOM parsers available for .NET. A widely used one is HTML Agility Pack.


With DOM (using HTML Agility Pack)

It's much simpler when you use a tool designed for the job. This way the HTML syntax is parsed correctly, and each node of the HTML becomes an object that you can select.

First, we declare the basics:

Imports System
Imports System.Xml
Imports HtmlAgilityPack
Dim DatosBruto As String = "<tbody><tr id=""timetableBlocks-23-01-18"" style=""display: block;""><td style=""width: 135px;"">23/01/2018</td><td style=""width: 135px;"">759.655</td><td style=""width: 135px;"">46.705</td><td style=""width: 135px;"">42.724</td><td style=""width: 135px;"">224.863</td><td style=""width: 135px;"">76.364</td><td style=""width: 135px;"">171.784</td><td style=""width: 135px;"">0</td><td style=""width: 135px;"">197.215</td></tr><tr id=""timetableBlocks-22-01-18"" style=""display: block;""><td style=""width: 135px;"">22/01/2018</td><td style=""width: 135px;"">553.995</td><td style=""width: 135px;"">42.573</td><td style=""width: 135px;"">194.736</td><td style=""width: 135px;"">26.671</td><td style=""width: 135px;"">221.950</td><td style=""width: 135px;"">13.780</td><td style=""width: 135px;"">0</td><td style=""width: 135px;"">54.285</td></tr></tbody>"

Next, we build the DOM (an HtmlDocument object) and load the string into it with LoadHtml(), which parses its structure.

Then we pass the XPath selector //tbody/tr/td (a tbody tag anywhere in the document, a tr that is its direct child, and a td that is a direct child of that tr) to SelectNodes() (http://html-agility-pack.net/select-nodes), which returns a collection that we can iterate over to get the value of each cell.

Code

'build the DOM
Dim html = New HtmlDocument()
html.LoadHtml(DatosBruto) 'passing your string with the HTML

For Each cell In html.DocumentNode.SelectNodes("//tbody/tr/td")  'XPath that selects every cell
    'print the text of each cell to the console
    'adapt this to wherever you want to show the result
    Console.WriteLine(cell.InnerText)
Next

Result

23/01/2018
759.655
46.705
42.724
224.863
76.364
171.784
0
197.215
22/01/2018
553.995
42.573
194.736
26.671
221.950
13.780
0
54.285
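If you need the values grouped by row (for example, to fill Label1 to Label6 or to insert one database record per row), a possible variation is sketched below. The names TableRowsSketch and LeerFilas are hypothetical helpers, not part of HTML Agility Pack, and the input is a shortened version of your string. The sketch also guards against SelectNodes returning Nothing, which is what it does when the XPath matches no node.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports HtmlAgilityPack

Module TableRowsSketch
    ' Hypothetical helper: returns one String array per <tr>, one element per <td>
    Function LeerFilas(htmlTexto As String) As List(Of String())
        Dim filas As New List(Of String())()
        Dim doc As New HtmlDocument()
        doc.LoadHtml(htmlTexto)

        ' SelectNodes returns Nothing when the XPath matches no node
        Dim nodosFila = doc.DocumentNode.SelectNodes("//tbody/tr")
        If nodosFila Is Nothing Then Return filas

        For Each fila In nodosFila
            ' Relative XPath: only the td children of this row
            Dim celdas = fila.SelectNodes("td")
            If celdas Is Nothing Then Continue For
            filas.Add(celdas.Select(Function(c) c.InnerText.Trim()).ToArray())
        Next
        Return filas
    End Function

    Sub Main()
        ' Assumption: shortened sample with the same shape as the string in the question
        Dim filas = LeerFilas("<tbody><tr><td>23/01/2018</td><td>759.655</td></tr><tr><td>22/01/2018</td><td>553.995</td></tr></tbody>")
        For Each fila In filas
            Console.WriteLine(String.Join(", ", fila))
        Next
    End Sub
End Module
```

With the rows separated like this, fila(0) is the date and fila(1) onwards are the numeric columns, so assigning them to labels or to database parameters becomes a matter of indexing.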


    
answered by 26.04.2018 / 12:22