How to calculate the time difference between two clients

2

How can I calculate the time difference between a client's actions?

I want to calculate the total time elapsed for each user, up to the point where the records switch to the next user. For example: from user 1 to user 2, 2 days, 4 hours and 5 seconds elapsed.

This is the dataframe:

import pandas as pd

df = pd.DataFrame({'usuario': [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
                   'Fecha': ['1990-01-02','1990-01-03','1990-01-04','1990-01-05','1990-01-08',
                             '1990-01-09','1990-01-10','1990-01-11','1990-01-12','1990-01-15',
                             '1990-01-16','1990-01-17','1990-01-18','1990-01-19','1990-01-22',
                             '1990-01-23','1990-01-24','1990-01-25','1990-01-26','1990-01-29'],
                   'Hora': ['10:10:00','10:11:00','10:12:00','10:13:00','10:10:00',
                            '10:10:00','10:17:00','10:14:00','11:14:00','12:14:00',
                            '10:10:00','10:20:00','14:22:00','15:22:00','16:22:00',
                            '10:10:00','00:00:00','00:00:00','00:00:00','00:00:00']})

Each row also has an event for the user (purchase, product view, etc.). How can I count these events per user? The events are, for example: entering the page, viewing a product, being on the main page, being on the purchase page, and buying a product. Each event is in a different row, and I want to know how many times the user performed each action. For example, user 1 entered the main page 2 times, viewed a product 2 times and bought 1 time. (I do not have the events separated out.)

    
asked by jlemosmusi 22.11.2018 at 20:25

2 answers

3

The first step is to add a new column to the dataframe (which I'll call timestamp) containing the concatenation of your Fecha and Hora columns, converted to datetime values. Once we have those, we can do arithmetic with dates, such as subtracting them to get the time elapsed between them.

df["timestamp"] = pd.to_datetime(df.Fecha + " " + df.Hora)
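If the raw strings carry stray whitespace or you want parsing to fail loudly on malformed values, a slightly more defensive variant may help. This is only a sketch: the two-row `demo` frame is made up for illustration, and the explicit `format` string assumes the `YYYY-MM-DD` and `HH:MM:SS` layout shown in the question.

```python
import pandas as pd

# Hypothetical two-row frame; the second date has a trailing space on purpose
demo = pd.DataFrame({
    "Fecha": ["1990-01-22", "1990-01-23 "],
    "Hora": ["16:22:00", "10:10:00"],
})

# Strip whitespace from both parts before joining them with a single space
combined = demo["Fecha"].str.strip() + " " + demo["Hora"].str.strip()

# An explicit format makes pandas raise on values that do not match it
demo["timestamp"] = pd.to_datetime(combined, format="%Y-%m-%d %H:%M:%S")

# Subtracting two timestamps yields a Timedelta
elapsed = demo["timestamp"].iloc[1] - demo["timestamp"].iloc[0]
```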

Now we can use df.groupby() to group the rows of the table by whatever criterion we choose; here we will group by user. The result of that grouping is an object that is not very useful on its own, because we cannot even display it, but on which we can perform further operations:

g = df.groupby("usuario")

The g grouping will have two groups (one per user). We can compute things within each group (for example, sum numbers or take averages). In this case we want to take the timestamp field of each group and subtract its minimum value from its maximum: those correspond to the first and last events, so the difference is the time elapsed between them.

Pandas does not provide a specific function for this, but we can write one very easily. The s parameter is a dataframe holding one group, and the function returns a single value that summarizes that group.

def delta(s):
  return s.timestamp.max()-s.timestamp.min()

Just apply this function to the g grouping to get what you wanted:

>>> g.apply(delta)
usuario
1   13 days 02:04:00
2   12 days 13:50:00
dtype: timedelta64[ns]
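As a possible alternative to apply(), the same span can be obtained by selecting the timestamp column from the grouping and subtracting the built-in min() from max(). The sketch below also shows groupby().diff(), which gives the time since each user's previous event, in case that is closer to what the question meant. The four-row `demo` frame and the `since_previous` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical small frame, not the question's data
demo = pd.DataFrame({
    "usuario": [1, 1, 2, 2],
    "Fecha": ["1990-01-02", "1990-01-03", "1990-01-16", "1990-01-18"],
    "Hora": ["10:10:00", "10:11:00", "10:10:00", "14:22:00"],
})
demo["timestamp"] = pd.to_datetime(demo["Fecha"] + " " + demo["Hora"])

# Group only the timestamp column by user
g = demo.groupby("usuario")["timestamp"]

# Span between the first and last event of each user
span = g.max() - g.min()

# Time since the same user's previous event (NaT for each user's first row)
demo["since_previous"] = g.diff()
```

Here `span` is a Series indexed by usuario, in the same shape as the result of g.apply(delta).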

Bonus

(hahaha)

Since you do not give data for this part, I will add a column of random events as a demo:

import random
random.seed(1)  # For reproducibility of the example
df["evento"] = [random.choice(["entra", "sale", "ve", "compra"]) 
                for i in range(len(df))]

This part is even easier than the previous one! Again we use groupby(), this time grouping by user and by event. All we want from each group is a count of its elements, and pandas already provides .count() for that. Therefore:

>>> df.groupby(["usuario", "evento"]).count()
                Fecha  Hora
usuario evento             
1       compra      4     4
        entra       3     3
        sale        2     2
        ve          1     1
2       compra      4     4
        entra       3     3
        sale        1     1
        ve          2     2

Here you can see that user 1 enters 3 times, buys 4 times (that's consumerism!), etc. The column titles "Fecha" and "Hora" are irrelevant, since those columns no longer contain dates or times but the event count, and both obviously hold the same number, so we can keep just one:

>>> df.groupby(["usuario", "evento"]).count()["Fecha"]
usuario  evento
1        compra    4
         entra     3
         sale      2
         ve        1
2        compra    4
         entra     3
         sale      1
         ve        2
Name: Fecha, dtype: int64
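A possible shortcut, offered as a sketch: .size() counts the rows in each group directly, so there is no need to pick an arbitrary column afterwards. The `demo` frame below is hypothetical and only there to keep the example self-contained.

```python
import pandas as pd

# Hypothetical events, not the random ones generated above
demo = pd.DataFrame({
    "usuario": [1, 1, 1, 2, 2],
    "evento": ["entra", "entra", "compra", "ve", "compra"],
})

# One count per (usuario, evento) pair, returned as a Series with a MultiIndex
counts = demo.groupby(["usuario", "evento"]).size()

# Individual counts can be looked up with a (usuario, evento) tuple
entradas_usuario_1 = counts.loc[(1, "entra")]
```

Unlike .count(), .size() includes rows with missing values, which matters only if the data has NaN entries.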
    
answered by 22.11.2018 / 20:57
0

Hello, how are you? You are right, it is clearer with examples. My idea is to go from this:

df2 = pd.DataFrame({'user': [1,1,1,2,2,2,3,3,3,3,3],
                    'event': ["I buy", "I buy", "viopag", "viopag", "entropag", "I buy",
                              "entropag", "I buy", "went out", "viopag", "I buy"]})

to this:

df3 = pd.DataFrame({'user': [1,2,3],
                    'I buy': [2,1,2],
                    'viopag': [1,1,1],
                    'entropag': [0,1,1],
                    'went out': [0,0,1]})
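One way to get from the long table to the wide one, sketched here with pd.crosstab, which builds a table of counts with one row per user and one column per event and fills absent combinations with 0. The data is copied from df2 in this post; the names `table` and `wide` are just illustrative.

```python
import pandas as pd

# Long format: one row per (user, event) occurrence, copied from the post
df2 = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    "event": ["I buy", "I buy", "viopag", "viopag", "entropag", "I buy",
              "entropag", "I buy", "went out", "viopag", "I buy"],
})

# Count occurrences of each event per user; missing pairs become 0
table = pd.crosstab(df2["user"], df2["event"])

# Turn the user index back into an ordinary column, as in the desired layout
wide = table.reset_index()
wide.columns.name = None  # drop the leftover "event" label on the columns
```

The columns come out in sorted order rather than in the order written in the desired layout; `wide[["user", "I buy", "viopag", "entropag", "went out"]]` would reorder them if that matters.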

    
answered by 24.11.2018 at 04:10