Find values from a dictionary in a JSON and add them


I have two txt files. One is a dictionary of term and value pairs; the other contains JSON that simulates a list of tweets.

I have to look up the dictionary terms in the JSON and, whenever a term is found, add its value to a variable that I previously initialized to 0.

I already have the first part, with this code:

Sentimiento = open("Sentimientos.txt")
Valores = {}
for linea in Sentimiento:
    termino, valor = linea.split("\t")
    Valores[termino] = int(valor)

I can also list the terms found in the JSON with this code:

Tweets = open("salida_tweets.txt",'r')
for i,linea in enumerate(Tweets):
    for Sentimiento in Valores.keys():
        if Sentimiento in linea:
            print("Se ha encontrado {} en el tweet de la linea {}".format(Sentimiento,i))

What I really need is to go through Tweets looking for any term from the first txt file (Sentimientos). If a term is found, its value should be returned; if not, 0 should be shown for that tweet.

At the end, the sum of the values of all the terms found should be returned.

Any idea where to start? I suppose I must modify the code I have, add an if/else, and then use numpy to add up the values. Is that right?

    
asked by Evm 23.02.2018 at 13:48

1 answer


If I have understood your question correctly, the following code should do it. I have renamed some of your variables to follow the usual Python conventions, under which an initial capital letter is reserved for class names (this convention and others are specified in PEP 8, of which there is an unofficial Spanish translation).

# Load the dictionary of feelings: one "term<TAB>value" entry per line
valores = {}
with open("Sentimientos.txt") as sentimientos:
    for linea in sentimientos:
        termino, valor = linea.split("\t")
        valores[termino] = int(valor)

# For each tweet, add up the values of the feelings it contains
with open("salida_tweets.txt", 'r') as tweets:
    for i, linea in enumerate(tweets):
        total = 0
        for sentimiento, valor in valores.items():
            if sentimiento in linea:
                print("Se ha encontrado {} en el tweet de la linea {} (valor={})"
                      .format(sentimiento, i, valor))
                total += valor
        print("El tweet de la línea {} tiene un valor de {}".format(i, total))

This code calculates, for each tweet, the sum of the values of all the feelings found in it, which I think is what you asked for.
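If a grand total across all tweets is also wanted, as the question mentions, a running sum is enough and numpy is not needed. The following is only a minimal sketch: the helper name `sumar_sentimientos` and the in-memory sample data are hypothetical stand-ins for the two files.

```python
def sumar_sentimientos(lineas, valores):
    """Return (per-line totals, grand total) for an iterable of text lines.

    Hypothetical helper, not part of the original answer.
    """
    totales = []
    for linea in lineas:
        # Add the value of every term that occurs in this line (0 if none do)
        totales.append(sum(valor for termino, valor in valores.items()
                           if termino in linea))
    return totales, sum(totales)


# Made-up sample data standing in for Sentimientos.txt and salida_tweets.txt
valores = {"good": 3, "bad": -2}
lineas = ["a good day", "nothing here", "good but bad"]
totales, total_global = sumar_sentimientos(lineas, valores)
print(totales, total_global)   # [3, 0, 1] 4
```

A tweet with no matching term simply contributes 0, so no explicit if/else branch is required.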

Update

Now that the OP has provided a sample of the contents of the salida_tweets.txt file, it turns out that the file holds one tweet per line, but each tweet is a JSON structure, not a plain text string.

I copy here part of the content provided by the OP:

{"delete":{"status":{"id":294512601600258048,"id_str":"294512601600258048","user_id":90681582,"user_id_str":"90681582"},"timestamp_ms":"1410368494083"}}
{"created_at":"Wed Sep 10 17:01:33 +0000 2014","id":509748524897292288,"id_str":"509748524897292288","text":"@Brenamae_ I WHALE SLAP YOUR FIN AND TELL YOU ONE LAST TIME: GO AWHALE","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":509748106015948800,"in_reply_to_status_id_str":"509748106015948800","in_reply_to_user_id":242563886,"in_reply_to_user_id_str":"242563886","in_reply_to_screen_name":"Brenamae_","user":{"id":175160659,"id_str":"175160659","name":"Butterfly","screen_name":"VanessaLilyWan","location":"Canada, Montreal","url":"http:\/\/instagram.com\/vanessalilywan","description":"British youtubers. 'Nuff said.","protected":false,"verified":false,"followers_count":118,"friends_count":180,"listed_count":2,"favourites_count":319,"statuses_count":10221,"created_at":"Thu Aug 05 20:03:16 +0000 2010","utc_offset":-36000,"time_zone":"Hawaii","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"B2DFDA","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme13\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme13\/bg.gif","profile_background_tile":false,"profile_link_color":"93A644","profile_sidebar_border_color":"EEEEEE","profile_sidebar_fill_color":"FFFFFF","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/470701406245376000\/2aXDrauR_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/470701406245376000\/2aXDrauR_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/175160659\/1404361640","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"trends":[],"urls":[],"user_mentions":[{"screen_name":"Brenamae_","name":"I-G-G-Bye","id":242563886,"id_str":"242563886","indices":[0,10]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"medium","lang":"en","timestamp_ms":"1410368493668"}
{"delete":{"status":{"id":204951917716189185,"id_str":"204951917716189185","user_id":496152394,"user_id_str":"496152394"},"timestamp_ms":"1410368494071"}}

Many of the lines in this example do not look like "true" tweets, since they do not contain the "text" field. In fact, the only line that looks like a true tweet is the one that starts with {"created_at"... The others seem more like deletion actions.

With this new information, I do not think the initial approach of searching each whole line for certain words (feelings) is the most suitable. Suppose, for example, that one of the keywords is "time". That word appears in every line, because every entry's JSON records when it was issued in a field called "timestamp_ms". But I understand that only tweets that use the word "time" in the message itself should count, not those where it merely appears somewhere in the complete JSON.
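The false positive is easy to reproduce. In this small sketch the JSON line is a shortened, made-up entry in the same shape as the sample file, not a line copied from it:

```python
import json

# Shortened, made-up line in the same shape as the sample file
linea = '{"delete": {"status": {"id": 1}, "timestamp_ms": "1410368494083"}}'

# Searching the raw line: "time" matches inside the key "timestamp_ms"
print("time" in linea)                   # True

# Searching only the message text: this entry has no "text" field at all
data = json.loads(linea)
print("time" in data.get("text", ""))    # False
```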

On the other hand, the code as written did not take into account that a feeling must be found even if it is written in capital letters in the tweet. For example, the only tweet that contains text (the second line of the example) reads:

@Brenamae_ I WHALE SLAP YOUR FIN AND TELL YOU ONE LAST TIME: GO AWHALE

which is entirely in capital letters (and, by coincidence, uses the word TIME that I mentioned above).
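A quick check of why the case matters, using the tweet text from the sample. This is only an illustrative sketch:

```python
texto = "@Brenamae_ I WHALE SLAP YOUR FIN AND TELL YOU ONE LAST TIME: GO AWHALE"

print("time" in texto)           # False: the comparison is case-sensitive
print("time" in texto.lower())   # True once the text is lowercased
```

Note that `in` is a substring test, so a term such as "whale" would also match inside "awhale".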

Therefore, in my opinion, a correct way to approach the problem would be:

  • Read each line of the tweets file.
  • Parse the JSON contained in that line to obtain a Python dictionary.
  • Check whether that dictionary contains the "text" field. If not, ignore the line, as it is not a "true" tweet.
  • Keep the "text" field, convert it to lowercase, and search it for feelings to compute the corresponding scores.

All this is done by the following code, in which I have supplied the contents of some sample files as strings, so that anyone can try it without having the files. To make it work on files instead of strings, just replace each io.StringIO() with an open() of the corresponding file.

    contenido_tweets = r'''
    {"delete":{"status":{"id":294512601600258048,"id_str":"294512601600258048","user_id":90681582,"user_id_str":"90681582"},"timestamp_ms":"1410368494083"}}
    {"created_at":"Wed Sep 10 17:01:33 +0000 2014","id":509748524897292288,"id_str":"509748524897292288","text":"@Brenamae_ I WHALE SLAP YOUR FIN AND TELL YOU ONE LAST TIME: GO AWHALE","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":509748106015948800,"in_reply_to_status_id_str":"509748106015948800","in_reply_to_user_id":242563886,"in_reply_to_user_id_str":"242563886","in_reply_to_screen_name":"Brenamae_","user":{"id":175160659,"id_str":"175160659","name":"Butterfly","screen_name":"VanessaLilyWan","location":"Canada, Montreal","url":"http:\/\/instagram.com\/vanessalilywan","description":"British youtubers. 'Nuff said.","protected":false,"verified":false,"followers_count":118,"friends_count":180,"listed_count":2,"favourites_count":319,"statuses_count":10221,"created_at":"Thu Aug 05 20:03:16 +0000 2010","utc_offset":-36000,"time_zone":"Hawaii","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"B2DFDA","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme13\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme13\/bg.gif","profile_background_tile":false,"profile_link_color":"93A644","profile_sidebar_border_color":"EEEEEE","profile_sidebar_fill_color":"FFFFFF","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/470701406245376000\/2aXDrauR_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/470701406245376000\/2aXDrauR_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/175160659\/1404361640","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"trends":[],"urls":[],"user_mentions":[{"screen_name":"Brenamae_","name":"I-G-G-Bye","id":242563886,"id_str":"242563886","indices":[0,10]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"medium","lang":"en","timestamp_ms":"1410368493668"}
    {"delete":{"status":{"id":204951917716189185,"id_str":"204951917716189185","user_id":496152394,"user_id_str":"496152394"},"timestamp_ms":"1410368494071"}}
    {"delete":{"status":{"id":509733211497193473,"id_str":"509733211497193473","user_id":2328935617,"user_id_str":"2328935617"},"timestamp_ms":"1410368494165"}}
    '''
    
    contenido_sentimientos = '''
    time\t5
    slap\t2
    whale\t3
    '''
    
    # ------------------------
    import io
    import json
    
    sentimiento = io.StringIO(contenido_sentimientos)
    valores = {}
    for linea in sentimiento:
        linea = linea.strip()
        if not linea:
            continue      # Skip blank lines
        termino, valor = linea.split("\t")
        valores[termino.lower()] = int(valor)
    
    tweets = io.StringIO(contenido_tweets)
    for i, linea in enumerate(tweets):
        total = 0
        linea = linea.strip()
        if not linea:
            continue     # Skip empty lines

        # Parse the JSON on this line into a Python dictionary
        data = json.loads(linea)
        if "text" not in data:
            continue     # Skip entries that are not tweets (no "text" field)
        for sentimiento, valor in valores.items():
            if sentimiento in data["text"].lower():
                print("Se ha encontrado {} en el tweet de la linea {} (valor={})"
                      .format(sentimiento, i, valor))
                total += valor
        print("El tweet de la línea {} tiene un valor de {}".format(i, total))
    

The result that appears on the screen is:

    Se ha encontrado time en el tweet de la linea 2 (valor=5)
    Se ha encontrado slap en el tweet de la linea 2 (valor=2)
    Se ha encontrado whale en el tweet de la linea 2 (valor=3)
    El tweet de la línea 2 tiene un valor de 10
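
To run the same logic on real files, one possible arrangement is to wrap each step in a function that takes a path. This is only a sketch under assumptions: the function names `cargar_valores` and `puntuar_tweets` are hypothetical, and the demo writes made-up content to temporary files so that it can run without the OP's data.

```python
import json
import os
import tempfile


def cargar_valores(ruta):
    """Read 'term<TAB>value' lines into a dict with lowercase keys."""
    valores = {}
    with open(ruta, encoding="utf-8") as f:
        for linea in f:
            linea = linea.strip()
            if not linea:
                continue          # skip blank lines
            termino, valor = linea.split("\t")
            valores[termino.lower()] = int(valor)
    return valores


def puntuar_tweets(ruta, valores):
    """Yield (line number, score) for every line that holds a real tweet."""
    with open(ruta, encoding="utf-8") as f:
        for i, linea in enumerate(f):
            linea = linea.strip()
            if not linea:
                continue          # skip blank lines
            data = json.loads(linea)
            if "text" not in data:
                continue          # e.g. a "delete" notice
            texto = data["text"].lower()
            yield i, sum(v for t, v in valores.items() if t in texto)


# Demo on temporary files with made-up content; with real data, pass the
# paths of Sentimientos.txt and salida_tweets.txt instead.
with tempfile.TemporaryDirectory() as tmp:
    ruta_s = os.path.join(tmp, "sentimientos.txt")
    ruta_t = os.path.join(tmp, "tweets.txt")
    with open(ruta_s, "w", encoding="utf-8") as f:
        f.write("time\t5\nslap\t2\n")
    with open(ruta_t, "w", encoding="utf-8") as f:
        f.write('{"delete": {"status": {"id": 1}}}\n')
        f.write('{"text": "ONE LAST TIME"}\n')
    resultados = list(puntuar_tweets(ruta_t, cargar_valores(ruta_s)))

print(resultados)   # [(1, 5)]
```

Adding up the second element of each pair gives the overall total for the file.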

        
answered by 23.02.2018 at 16:30