Find values from a list in a csv by column Python

Question

Find values from a list in a csv by column Python

Navigation

#1 by (1 votes)

1

What I want to do is identify which site is missing a date from the list called dates. To be more clear I leave an example with data:

fechas = ['1/1/2000','2/1/2000','3/1/2000','4/1/2000', '5/1/2000']

file:

Sitio,Fecha,Importe
a.com,1/1/2000,2
a.com,2/1/2000,3
a.com,3/1/2000,1
a.com,4/1/2000,5

In this case I would miss the 5/1/2000 date so I would like you to print the site name ('a.com ej') and the missing date

import pandas as pd

df = pd.read_csv('file.csv')
df.sort_values(by=['Sitio','Fecha'])

fechas = df.drop_duplicates(['Fecha'])
fechas = list(fechas['Fecha'])

python

asked by Martin Bouhier 19.02.2018 в 17:46

source

1 answer

Date__range inclusive in Django the sql statement in postgres, is the prefix "public" necessary?

score 1 · Accepted Answer

Let's see if I understood it correctly. Let's suppose that this is the content of the CSV:

contenido = """
Sitio,Fecha,Importe
a.com,1/1/2000,2
a.com,2/1/2000,3
a.com,3/1/2000,1
a.com,4/1/2000,5
b.com,1/1/2000,2
b.com,2/1/2000,3
b.com,3/1/2000,1
b.com,5/1/2000,5
"""

In this case the set of possible dates would be ['1/1/2000', '2/1/2000', '3/1/2000', '4/1/2000', '5/1/2000'] , and the site a.com lacks 5/1/2000 , while b.com lacks 4/1/2000 .

I start with a code equal to the one you provide in the question (but instead of reading from a file I read it from the previous string):

import pandas as pd
import io

df = pd.read_csv(io.StringIO(contenido))
df.sort_values(by=['Sitio','Fecha'])

fechas = df.drop_duplicates(['Fecha'])
fechas = list(fechas['Fecha'])

The following code detects which date is missing in each site. It's a bit tricky, but I can not think of a simpler way:

grupo = df.groupby("Sitio")
for site, dates in grupo.Fecha.unique().to_dict().items():
    diff = set(fechas) - set(dates)
    if diff:
        print("Al sitio {} le faltan las fechas {}".format(
               site, diff))

Exit:

Al sitio a.com le faltan las fechas {'5/1/2000'}
Al sitio b.com le faltan las fechas {'4/1/2000'}

How it works

pd.groupby("Sitio") groups all the entries together with those corresponding to the same site. On the resulting group I do a grupo.Fecha.unique() that returns a dataframe in which each row is a different site, and in the column you have the list of (unique) dates that appear for that site.

I convert that dataframe to a dictionary to iterate more easily about it, and I'm simply comparing the set of dates of each site with the set of "global" dates, making the difference of both sets and showing those that give a difference not empty.

Update

Before the additional request to generate a CSV with the result, one option may be the following.

Instead of printing sites with missing dates, we keep them in a list. Each item in the list will be a dictionary with two keys: "Site" and "Missing_date", which will contain respectively the name of the site and a string that concatenates the missing dates.

That structure is easy to convert into DataFrame pandas, and hence it's trivial to dump it as CSV. That is, to accumulate the results in a list of dictionaries:

grupo = df.groupby("Sitio")
resultado = []
for site, dates in grupo.Fecha.unique().to_dict().items():
    diff = set(fechas) - set(dates)
    if diff:
        resultado.append(dict(Sitio=site, 
                              Faltan_fechas=",".join(diff)))

To convert it to Dataframe:

df_result = pd.DataFrame(resultado, 
                         columns=["Sitio", "Faltan_fechas"]
                        ).sort_values(by="Sitio")

To dump it to CSV:

df_result.to_csv("resultado.csv", index=False)

In this case the file would contain:

Sitio,Faltan_fechas
a.com,5/1/2000
b.com,4/1/2000

In case there was a site missing several dates, for example c.com , this would be overturned:

Sitio,Faltan_fechas
a.com,5/1/2000
b.com,4/1/2000
c.com,"5/1/2000,4/1/2000,2/1/2000"

This CSV has only two columns, as you can see, but in the second there may be a string separated by commas with several dates (which are only one cell because it is enclosed in quotes)

Update update

To generate an exit in which the sites appear repeated as many times as dates are missing, this is an option:

grupo = df.groupby("Sitio")

resultado = []
for site, dates in grupo.Fecha.unique().to_dict().items():
    diff = set(fechas) - set(dates)
    if diff:
        for fecha in sorted(diff):
            resultado.append(dict(Sitio=site, Faltan_fechas=fecha))

The generation of the dataframe and the csv is as before. Example of result:

Sitio,Faltan_Fechas
a.com,5/1/2000
b.com,4/1/2000
c.com,2/1/2000
c.com,4/1/2000
c.com,5/1/2000