Check for repeated tweets in MongoDB


I have a method that uses Twython to save tweets in MongoDB, as described in my question "Maintaining a mongodb with tweets that match a given tag":

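def getSearchTagTwitter(hashtag):
    db = connexMongoDB()
    t = loginTwython()
    try:
        # assumed: the original call is truncated; Twython's search() with the hashtag as query
        search = t.search(q=hashtag, count=100)
        data = search['statuses']
        for tweet in data:
            try:
                # assumed: the try body is missing; an insert is the likely operation
                db.twittersearch.insert_one(tweet)
            except:
                db.twittersearch.update_one({"id_str": tweet['id_str']}, tweet)
    except Exception:
        print("Error searching for hashtag")
        time.sleep(60*15)  # 15 minutes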

I suspect it does not work correctly, so I want to check, from the MongoDB shell and/or from Python, that no id_str value is repeated. I tried the following, but it does not work for me:


Edit: To simplify the question: from Python, how can I check that an existing MongoDB database contains no duplicates? I currently connect with pymongo, and I can see that the collection was created.

2 answers


To write to your MongoDB collection you are using id_str:

# ... 
db.twittersearch.update_one({"id_str": tweet['id_str']}, tweet) 

But in the query you are using the wrong field name, str_id (instead of id_str):


The correct version would be:


Unless, of course, it is just a typing or copy-paste error.
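
To illustrate the field-name point, here is a minimal sketch, assuming a pymongo (3.7 or later) collection object such as db.twittersearch. The helper name count_by_tweet_id is hypothetical and not part of the original code. It counts the documents whose id_str equals a given value, so a result above 1 means that tweet is stored more than once.

```python
def count_by_tweet_id(collection, tweet_id):
    """Return how many stored documents carry this tweet id.

    `collection` is assumed to be a pymongo Collection (for example
    db.twittersearch). The filter uses the field name exactly as it is
    written when saving: "id_str", not "str_id".
    """
    # count_documents is available in pymongo 3.7 and later
    return collection.count_documents({"id_str": str(tweet_id)})
```

A filter on a field that does not exist, such as str_id, matches no documents at all, which can easily be mistaken for "no duplicates".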

Update after the edit

I wrote a simple script to reproduce your case, using the hashtag python and fetching only 10 tweets:

# -*- coding: utf-8 -*-
from pymongo import MongoClient
from twython import Twython

client = MongoClient('localhost', 27017)
db = client.test

CONSUMER_KEY = 'xxxxxxxxxxxxxxxxxx'
CONSUMER_SECRET = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx'

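def get_tweets(hashtag='wtf'):
    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET)
    # assumed: the original call is truncated; Twython's search() with the hashtag as query
    search = twitter.search(q=hashtag, count=10)
    data = search['statuses']
    for tweet in data:
        try:
            # assumed: the try body is missing; the error message points to an insert
            db.twittersearch.insert_one(tweet)
        except Exception as e:
            print("Error inserting: %s" % e)
            db.twittersearch.update_one({'id_str': tweet['id_str']}, tweet)

if __name__ == '__main__':
    # assumed: the call is missing; the text says the hashtag python was used
    get_tweets('python')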
It gives me no problems. Here are some test queries run in the MongoDB console:

> db.twittersearch.find({}, {"id_str": 1, "_id": 0})
{ "id_str" : "700315462568120320" }
{ "id_str" : "700315461850804224" }
{ "id_str" : "700315438169747457" }
{ "id_str" : "700315421900148736" }
{ "id_str" : "700315421887619076" }
{ "id_str" : "700315350299049988" }
{ "id_str" : "700315332838301698" }
{ "id_str" : "700315321689833473" }
{ "id_str" : "700315301594796032" }
{ "id_str" : "700315293177008128" }

> db.twittersearch.find({"id_str": {$in: ["700315461850804224"]}})

I think the problem lies elsewhere; maybe there is something else in your code that we are missing.
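
A check for repeated values can also be run from Python over the whole collection. Below is a minimal sketch, assuming a pymongo collection object; the function name find_duplicate_ids is made up for illustration. It groups the documents by id_str with an aggregation pipeline and keeps only the groups that contain more than one document.

```python
def find_duplicate_ids(collection, field="id_str"):
    """Return {value: count} for every value of `field` stored more than once.

    `collection` is assumed to be a pymongo Collection, e.g. db.twittersearch.
    """
    pipeline = [
        # one group per distinct value of the field, counting its documents
        {"$group": {"_id": "$" + field, "count": {"$sum": 1}}},
        # keep only the values that appear more than once
        {"$match": {"count": {"$gt": 1}}},
    ]
    return {doc["_id"]: doc["count"] for doc in collection.aggregate(pipeline)}
```

An empty dictionary means that no id_str is repeated; otherwise each key is a repeated tweet id and each value is the number of copies stored.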

I am adding another solution that I found: use update with upsert set to True. If a document with the same id_str already exists it is overwritten, and if it does not exist a new one is created.

db.twittersearch.update({'id_str': tweet['id_str']}, tweet, upsert=True)
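
For reference, here is a sketch of the same idea with the newer pymongo API, where replace_one with upsert=True plays the role of update(..., upsert=True) for whole-document replacement. The helper names ensure_unique_id_index and save_tweet are hypothetical, and the unique index is an optional addition that is not part of the answer above.

```python
def ensure_unique_id_index(collection):
    """Ask MongoDB to reject two documents with the same id_str.

    Assumption: the collection holds no duplicates yet; otherwise the
    index creation itself fails and the duplicates must be removed first.
    """
    return collection.create_index("id_str", unique=True)


def save_tweet(collection, tweet):
    """Insert the tweet, or overwrite the stored copy with the same id_str.

    Returns True when a new document was created, False when an existing
    one was replaced.
    """
    result = collection.replace_one(
        {"id_str": tweet["id_str"]},  # match on the tweet id
        tweet,                        # full replacement document
        upsert=True,                  # insert when nothing matches
    )
    # upserted_id is None when an existing document was replaced
    return result.upserted_id is not None
```

With this approach the try/except around the insert is no longer needed, since a single call covers both the new and the already stored case.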
