Why does my program compute a different mean and variance than my calculator?


I have two NumPy arrays, follow_dismiss_i and follow_dismiss_display_i. In each of them the first column holds counts and the second column holds indexes.

I have written a program that calculates:

  • SUM_follow_dismiss and SUM_follow_dismiss_display, which are the sums of the first columns (the counts) of follow_dismiss_i and follow_dismiss_display_i respectively.

  • An array called m_i, obtained by dividing each count (first column) of follow_dismiss_i by the matching count of follow_dismiss_display_i, where rows are matched by index (second column). If an index exists in follow_dismiss_display_i but not in follow_dismiss_i, that index gets the value 0.0 in m_i.

  • The variance of the array m_i .

I also calculate the mean, m, but I get 0.517134831461, as you can see in the output of my code, while my calculator gives 0.63567076 .

I am trying to understand why these values differ, and whether there is a simpler way to do this.

This is my code:

#!/usr/bin/python
#
# Small script for some stats
#

import traceback
import psycopg2
import numpy as np
import pandas as pd

# you may need the following arrays, which are shown in the output:

print "follow_dismiss_i"
print follow_dismiss_i
print "SUM_follow_dismiss"
print SUM_follow_dismiss
print "follow_dismiss_display_i"
print follow_dismiss_display_i
print "SUM_follow_dismiss_display"
print SUM_follow_dismiss_display

m = float(SUM_follow_dismiss)/ SUM_follow_dismiss_display
print ("\nmean m")
print m

m_i=[]

print "\nvariance"
for j in range(len(follow_dismiss_display_i)):
    new = []
    found = 0
    for i in range(len(follow_dismiss_i)):
        if follow_dismiss_display_i[j,1]==follow_dismiss_i[i,1]:
            new.append(follow_dismiss_display_i[j,1])
            new.append(float(follow_dismiss_i[i,0])/follow_dismiss_display_i[j,0])
            m_i.append(new)
            found=1         
            break
    if found == 0:
        new.append(follow_dismiss_display_i[j,1])
        new.append(0.0)
        m_i.append(new)
test = np.array(m_i)
print test[:,1]
variance_eclipse = np.var(test[:,1])

print variance_eclipse

Here is the output in case you need it to reproduce the program with the same data:

follow_dismiss_i
[[505  13]
 [ 14  54]
 [ 70  68]
 [ 21 150]
 [ 36 152]
 [ 62 156]
 [ 59 158]
 [120 160]
 [ 53 161]
 [150 162]
 [  3 169]
 [  1 171]
 [ 60 172]
 [  1 177]
 [126 179]
 [ 41 185]
 [239 189]
 [163 190]
 [ 26 216]
 [ 42 223]
 [  1 272]
 [  2 286]
 [  5 289]
 [  1 292]
 [  2 294]
 [  6 296]
 [ 25 306]
 [  7 312]]
SUM_follow_dismiss
1841
follow_dismiss_display_i
[[986  13]
 [ 20  54]
 [484  68]
 [ 57 150]
 [ 44 152]
 [ 95 156]
 [ 89 158]
 [144 160]
 [ 58 161]
 [383 162]
 [  3 169]
 [  2 171]
 [125 172]
 [  1 177]
 [147 179]
 [ 61 185]
 [325 189]
 [334 190]
 [ 46 216]
 [ 71 223]
 [  1 272]
 [  2 276]
 [  9 286]
 [  5 289]
 [  1 292]
 [  2 294]
 [ 10 296]
 [ 27 306]
 [ 16 312]
 [ 12 315]]
SUM_follow_dismiss_display
3560

mean
0.517134831461

variance
[ 0.51217039  0.7         0.1446281   0.36842105  0.81818182  0.65263158
  0.66292135  0.83333333  0.9137931   0.39164491  1.          0.5         0.48
  1.          0.85714286  0.67213115  0.73538462  0.48802395  0.56521739
  0.5915493   1.          0.          0.22222222  1.          1.          1.
  0.6         0.92592593  0.4375      0.        ]
0.0858073520518
    
asked by ThePassenger 30.05.2017 at 16:39

1 answer


The mean should not differ this much between Python and a calculator. Any difference should be no larger than the precision the calculator works with.

The m that you calculate is the sum of all the counts (column 0) of follow_dismiss_i divided by the sum of all the counts of follow_dismiss_display_i: 1841/3560 = 0.517134831461. That is the result you get both in Python and on a calculator.

I don't fully understand what this 'mean' is supposed to represent the way you compute it. My guess is that what you actually want is the mean of test[:, 1], which gives 0.635, rather than the ratio of the two sums, although this is only a supposition.
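To make the difference between the two quantities concrete, here is a minimal sketch (not part of the original code) that recomputes both from the data in the question. The dictionaries `dismiss` and `display` and the other variable names are my own, chosen for illustration. It also shows that the ratio of the sums equals a mean of the per-index ratios weighted by the counts of follow_dismiss_display_i, which is why it differs from the plain mean:

```python
import numpy as np

# Counts keyed by index, copied from the two arrays in the question.
dismiss = {13: 505, 54: 14, 68: 70, 150: 21, 152: 36, 156: 62, 158: 59,
           160: 120, 161: 53, 162: 150, 169: 3, 171: 1, 172: 60, 177: 1,
           179: 126, 185: 41, 189: 239, 190: 163, 216: 26, 223: 42,
           272: 1, 286: 2, 289: 5, 292: 1, 294: 2, 296: 6, 306: 25, 312: 7}
display = {13: 986, 54: 20, 68: 484, 150: 57, 152: 44, 156: 95, 158: 89,
           160: 144, 161: 58, 162: 383, 169: 3, 171: 2, 172: 125, 177: 1,
           179: 147, 185: 61, 189: 325, 190: 334, 216: 46, 223: 71,
           272: 1, 276: 2, 286: 9, 289: 5, 292: 1, 294: 2, 296: 10,
           306: 27, 312: 16, 315: 12}

# One entry per index of `display`; a missing index in `dismiss` counts as 0.
idx = sorted(display)
shown = np.array([display[i] for i in idx], dtype=float)
dismissed = np.array([dismiss.get(i, 0) for i in idx], dtype=float)

ratios = dismissed / shown                 # same values as test[:, 1]

ratio_of_sums = dismissed.sum() / shown.sum()           # the m in the question
mean_of_ratios = ratios.mean()                          # the calculator value
weighted_mean = np.average(ratios, weights=shown)       # equals ratio_of_sums

print("ratio of sums:  %.12f" % ratio_of_sums)
print("mean of ratios: %.12f" % mean_of_ratios)
print("weighted mean:  %.12f" % weighted_mean)
```

Indexes with large counts, such as 13 and 68, have low ratios, so they pull the weighted value down to about 0.517, while in the plain mean every index counts the same.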

Apart from that, for this kind of calculation you should use Pandas (you already import it, after all). It simplifies things a lot. For example, to do what your nested for loops do, you only need pandas.DataFrame.merge . We tell it to match rows on the index column (parameter on ) and to keep only the indexes of follow_dismiss_display_i (parameter how = 'left' , meaning that the rows of the first DataFrame are the ones kept). Indexes that have no match in follow_dismiss_i get a NaN count, which we can turn into 0.0 without problems when we do the division:

#!/usr/bin/python

import numpy as np
import pandas as pd


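# Wrap each array in a DataFrame so the columns can be addressed by name
a = pd.DataFrame(follow_dismiss_display_i, columns = ('counts', 'indx'))
b = pd.DataFrame(follow_dismiss_i, columns = ('counts', 'indx'))

SUM_follow_dismiss_display = a['counts'].sum()
SUM_follow_dismiss = b['counts'].sum()

# Left merge on 'indx': every row of a is kept; the counts of a become
# 'counts_x' and those of b become 'counts_y' (NaN where b has no match)
c = pd.merge(a, b, how = 'left', on= 'indx')
# Per-index ratio; fill_value=0.0 replaces the missing counts with 0.0
c['div'] = c['counts_y'].div(c['counts_x'], fill_value=0.0)
test = c[['indx', 'div']].values
# Mean and variance of the per-index ratios
m = np.mean(test[:,1])
variance_eclipse = np.var(test[:,1])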

print 'SUM_follow_dismiss_display: ', SUM_follow_dismiss_display
print 'SUM_follow_dismiss: ', SUM_follow_dismiss
print 'test[:,1]: \n', test[:,1]
print 'mean: ', m
print 'variance: ', variance_eclipse
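If the merge step is hard to picture, here is a small sketch of it on made-up data. The three-row and two-row frames below are hypothetical, not taken from the question. It shows the `counts_x` / `counts_y` columns that the merge creates, and how the missing match ends up as 0.0 after the division:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: index 2 appears in `a` but not in `b`.
a = pd.DataFrame({'counts': [10, 4, 8], 'indx': [1, 2, 3]})
b = pd.DataFrame({'counts': [5, 2], 'indx': [1, 3]})

# Left merge: all rows of `a` are kept. Both frames have a 'counts' column,
# so pandas renames them to 'counts_x' (from a) and 'counts_y' (from b).
c = pd.merge(a, b, how='left', on='indx')

# Index 2 has no partner in `b`, so its 'counts_y' is NaN at this point.
missing_before = c['counts_y'].isnull().tolist()

# fill_value=0.0 treats that NaN as 0.0 before dividing.
c['div'] = c['counts_y'].div(c['counts_x'], fill_value=0.0)

print(c)
```

The row for index 2 gets a `div` of 0.0, which is the same result the `found == 0` branch of the nested loops produces.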

Using the input data you give:

follow_dismiss_i= np.array([[505,  13],
                            [ 14,  54],
                            [ 70,  68],
                            [ 21, 150],
                            [ 36, 152],
                            [ 62, 156],
                            [ 59, 158],
                            [120, 160],
                            [ 53, 161],
                            [150, 162],
                            [  3, 169],
                            [  1, 171],
                            [ 60, 172],
                            [  1, 177],
                            [126, 179],
                            [ 41, 185],
                            [239, 189],
                            [163, 190],
                            [ 26, 216],
                            [ 42, 223],
                            [  1, 272],
                            [  2, 286],
                            [  5, 289],
                            [  1, 292],
                            [  2, 294],
                            [  6, 296],
                            [ 25, 306],
                            [  7, 312]])


follow_dismiss_display_i = np.array([ [986, 13],
                                      [ 20, 54],
                                      [484, 68],
                                      [ 57, 150],
                                      [ 44, 152],
                                      [ 95, 156],
                                      [ 89, 158],
                                      [144, 160],
                                      [ 58, 161],
                                      [383, 162],
                                      [  3, 169],
                                      [  2, 171],
                                      [125, 172],
                                      [  1, 177],
                                      [147, 179],
                                      [ 61, 185],
                                      [325, 189],
                                      [334, 190],
                                      [ 46, 216],
                                      [ 71, 223],
                                      [  1, 272],
                                      [  2, 276],
                                      [  9, 286],
                                      [  5, 289],
                                      [  1, 292],
                                      [  2, 294],
                                      [ 10, 296],
                                      [ 27, 306],
                                      [ 16, 312],
                                      [ 12, 315]])

We get:

SUM_follow_dismiss_display:  3560
SUM_follow_dismiss:  1841
test[:,1]: 
[ 0.51217039  0.7         0.1446281   0.36842105  0.81818182  0.65263158
  0.66292135  0.83333333  0.9137931   0.39164491  1.          0.5         0.48
  1.          0.85714286  0.67213115  0.73538462  0.48802395  0.56521739
  0.5915493   1.          0.          0.22222222  1.          1.          1.
  0.6         0.92592593  0.4375      0.        ]
mean:  0.635760767848
variance:  0.0858073520518
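One more thing to keep in mind when comparing the variance with a calculator, although the question does not say this is happening here: by default np.var computes the population variance (it divides by N), while many calculators also offer the sample variance (dividing by N - 1). A minimal sketch with made-up numbers, where `x` is just an illustrative array:

```python
import numpy as np

# Hypothetical values, only to show the two conventions.
x = np.array([1.0, 2.0, 3.0, 4.0])

population_var = np.var(x)        # ddof=0 by default: divides by N
sample_var = np.var(x, ddof=1)    # divides by N - 1

print("population variance: %s" % population_var)
print("sample variance:     %s" % sample_var)
```

If the calculator result is slightly larger than what Python prints, checking which of the two conventions it uses is a reasonable first step.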
    
answered by 30.05.2017 / 21:27