Apply multithreading (parallelism) to my scraping code in C#


I need help speeding up a scraper I have written. Right now it does everything I want, but without threads. What I need is that, while one page is being processed, other threads move ahead with the remaining pages.

Here is my code and my attempt at using Thread.

// Step 1: get the URL of a product listing page
string FirstUrl = InputTextBox.Text;

// Step 2: download and parse the HTML document
var doc = GetHtmlDoc(FirstUrl);

// Total number of pages to visit (the last pagination number)
num = GetNumberofPages(doc);

// Product URLs on this page, plus the URL of the next page
Tuple<List<string>, string> myVal = GetAllhrefs(doc);
string NPage = myVal.Item2;
LinkProduct = myVal.Item1;

// Start walking through the pages
for (int i = 0; i < num; i++)
{
  // The first page is already loaded, so it can be scraped right away
  if (i == 0)
  {
    // This is where I try to use a thread, but I get no results (without the thread it works)
    // Local copy, so the lambda does not observe later reassignments of LinkProduct
    List<string> firstLinks = LinkProduct;
    Thread hiloNuevo = new Thread(() => EnterAllUrl(firstLinks));
    hiloNuevo.Start();
  }
  else // Later pages: fetch the page, collect its links, and do the same
  {
    // i is zero-based, so the page number is i + 1. The for loop already
    // increments i; an extra i++ here would skip every other page.
    string newPage = "https://www.testscrape.com" + (i + 1);
    var docw = GetHtmlDoc(newPage);
    Tuple<List<string>, string> test = GetAllhrefs(docw);
    string NPage2 = test.Item2;
    List<string> pageLinks = test.Item1;
    LinkProduct = pageLinks;
    // The thread again, this time with its own local list
    Thread hiloNuevo2 = new Thread(() => EnterAllUrl(pageLinks));
    hiloNuevo2.Start();

    NPage = NPage2;
  }
}


// This is the method I use to process each product
private void EnterAllUrl(List<string> LinkProduct)
{
  for (int j = 0; j < LinkProduct.Count; j++)
  {
    string Linking = LinkProduct[j];
    // Use the variable itself; the string literal "Linking" is not a URL
    string proxlin = Linking;
    var DataInfo = GetHtmlDoc(proxlin);
    showData(DataInfo);
  }
}

Could you help me with this? Where should I make changes to speed up the process?

    
asked by Hans 01.05.2018 at 17:06

1 answer


I think your problem can be solved more easily with Parallel.ForEach. It works like a regular foreach loop, but it splits the collection you iterate over and runs the iterations on several threads.

The only requirement is that the input must be a collection. You could build a collection containing every page to visit and, in your case, the code would look something like this (assuming you have a collection of strings called paginas that holds the pages):

Parallel.ForEach(paginas, (pagina) =>
{
    // ...
    // your code goes here
    // ...
});
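
To make the outline above more concrete, here is a minimal sketch of how the page collection could be built and consumed. It rests on several assumptions: FetchProductLinks is a hypothetical stand-in for the asker's GetHtmlDoc and GetAllhrefs (it only simulates a download), the page URLs follow the "https://www.testscrape.com" + number pattern from the question, and the limit of 4 concurrent iterations is an arbitrary choice. Results go into a ConcurrentBag because several iterations add to it at the same time.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelScrapeSketch
{
    // Hypothetical stand-in for GetHtmlDoc + GetAllhrefs: returns the product
    // links found on one listing page. Thread.Sleep simulates network latency.
    public static List<string> FetchProductLinks(string pageUrl)
    {
        Thread.Sleep(50);
        return new List<string> { pageUrl + "/product-a", pageUrl + "/product-b" };
    }

    public static List<string> ScrapeAll(int totalPages)
    {
        // Build the collection of page URLs up front (pages 1..totalPages).
        List<string> paginas = Enumerable.Range(1, totalPages)
            .Select(n => "https://www.testscrape.com" + n)
            .ToList();

        // Thread-safe collection: several iterations add to it concurrently.
        var results = new ConcurrentBag<string>();

        // Cap concurrency so the target site is not flooded (4 is arbitrary).
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

        Parallel.ForEach(paginas, options, pagina =>
        {
            foreach (string link in FetchProductLinks(pagina))
            {
                results.Add(link);
            }
        });

        // ConcurrentBag has no defined order, so sort for a stable result.
        return results.OrderBy(link => link, StringComparer.Ordinal).ToList();
    }

    public static void Main()
    {
        List<string> links = ScrapeAll(5);
        Console.WriteLine(links.Count + " product links collected");
    }
}
```

Keep in mind that Parallel.ForEach blocks the calling thread until every iteration has finished. If showData updates form controls (the question does not show it, so this is a guess), that work has to be marshalled back to the UI thread, for example with Control.Invoke, instead of being done directly inside the loop body.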

I hope this helps.

Regards

    
answered 10.07.2018 at 08:46