Let's see if I understood it correctly. Let's suppose that this is the content of the CSV:
contenido = """
Sitio,Fecha,Importe
a.com,1/1/2000,2
a.com,2/1/2000,3
a.com,3/1/2000,1
a.com,4/1/2000,5
b.com,1/1/2000,2
b.com,2/1/2000,3
b.com,3/1/2000,1
b.com,5/1/2000,5
"""
In this case the set of possible dates would be ['1/1/2000', '2/1/2000', '3/1/2000', '4/1/2000', '5/1/2000']
, and the site a.com
lacks 5/1/2000
, while b.com
lacks 4/1/2000
.
I start with a code equal to the one you provide in the question (but instead of reading from a file I read it from the previous string):
import pandas as pd
import io
df = pd.read_csv(io.StringIO(contenido))
df.sort_values(by=['Sitio','Fecha'])
fechas = df.drop_duplicates(['Fecha'])
fechas = list(fechas['Fecha'])
The following code detects which date is missing in each site. It's a bit tricky, but I can not think of a simpler way:
grupo = df.groupby("Sitio")
for site, dates in grupo.Fecha.unique().to_dict().items():
diff = set(fechas) - set(dates)
if diff:
print("Al sitio {} le faltan las fechas {}".format(
site, diff))
Exit:
Al sitio a.com le faltan las fechas {'5/1/2000'}
Al sitio b.com le faltan las fechas {'4/1/2000'}
How it works
pd.groupby("Sitio")
groups all the entries together with those corresponding to the same site. On the resulting group I do a grupo.Fecha.unique()
that returns a dataframe in which each row is a different site, and in the column you have the list of (unique) dates that appear for that site.
I convert that dataframe to a dictionary to iterate more easily about it, and I'm simply comparing the set of dates of each site with the set of "global" dates, making the difference of both sets and showing those that give a difference not empty.
Update
Before the additional request to generate a CSV with the result, one option may be the following.
Instead of printing sites with missing dates, we keep them in a list. Each item in the list will be a dictionary with two keys: "Site" and "Missing_date", which will contain respectively the name of the site and a string that concatenates the missing dates.
That structure is easy to convert into DataFrame pandas, and hence it's trivial to dump it as CSV. That is, to accumulate the results in a list of dictionaries:
grupo = df.groupby("Sitio")
resultado = []
for site, dates in grupo.Fecha.unique().to_dict().items():
diff = set(fechas) - set(dates)
if diff:
resultado.append(dict(Sitio=site,
Faltan_fechas=",".join(diff)))
To convert it to Dataframe:
df_result = pd.DataFrame(resultado,
columns=["Sitio", "Faltan_fechas"]
).sort_values(by="Sitio")
To dump it to CSV:
df_result.to_csv("resultado.csv", index=False)
In this case the file would contain:
Sitio,Faltan_fechas
a.com,5/1/2000
b.com,4/1/2000
In case there was a site missing several dates, for example c.com
, this would be overturned:
Sitio,Faltan_fechas
a.com,5/1/2000
b.com,4/1/2000
c.com,"5/1/2000,4/1/2000,2/1/2000"
This CSV has only two columns, as you can see, but in the second there may be a string separated by commas with several dates (which are only one cell because it is enclosed in quotes)
Update update
To generate an exit in which the sites appear repeated as many times as dates are missing, this is an option:
grupo = df.groupby("Sitio")
resultado = []
for site, dates in grupo.Fecha.unique().to_dict().items():
diff = set(fechas) - set(dates)
if diff:
for fecha in sorted(diff):
resultado.append(dict(Sitio=site, Faltan_fechas=fecha))
The generation of the dataframe and the csv is as before. Example of result:
Sitio,Faltan_Fechas
a.com,5/1/2000
b.com,4/1/2000
c.com,2/1/2000
c.com,4/1/2000
c.com,5/1/2000