Create an array of groups and ids related to groups


I want to create an array, a dictionary or a DataFrame (the form doesn't matter) that maps each group to the ids of the subscribers belonging to it.

The ids are the index of a DataFrame, side_subscriber.index, which prints as:

Int64Index([160, 161, 296, 306, 365, 386, 471], dtype='int64', name=u'subscriber_id')

The groups are in a numpy.ndarray called indexResultat:

[1 1 0 0 1 1 1]

I tried the following, without knowing how to initialize the array that groups the ids by group:

kernelGroup = []
i = 0
for idx in indexResultat:
    print "idx : ",idx
    i = i+1
    print kernelGroup
    for kernel in kernelGroup:
        print "kernel : ",kernel
        if idx == kernel:
            print "we have the group ",kernel 
            print kernel
            # add the id
            kernelGroup = kernelGroup[kernel].append(side_subscriber.index[idx])
            break
    # we don't have the group
    print "we don't have the group", idx
    #kernelGroup = kernelGroup.append(kernelGroup,[idx,side_subscriber.index[idx]])
    kernelGroup = kernelGroup.append([idx,side_subscriber.index[i]])

print kernelGroup      

And I get:

idx :  1
[]
we don't have the group 1
idx :  1
None

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-64-a0add6c15d78> in <module>()
      5     i = i+1
      6     print kernelGroup
----> 7     for kernel in kernelGroup:
      8         print "kernel : ",kernel
      9         if idx == kernel:

TypeError: 'NoneType' object is not iterable

The output I expect is:

{0:[296, 306], 1:[160, 161, 365, 386, 471]}

I know that this function does more or less what I want to do:

def cluster_points(X, mu):
    clusters  = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]])) \
                    for i in enumerate(mu)], key=lambda t:t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters
    
asked by ThePassenger 30.06.2017 at 17:12

1 answer


Depending on how much efficiency you need, you can do this in many ways (with Pandas, NumPy or standard Python only). A very simple one uses Pandas and DataFrame.groupby:

import pandas as pd
import numpy as np

# Simulate your source data
df = pd.DataFrame(index=[160, 161, 296, 306, 365, 386, 471])
grupos = np.array([1, 1, 0, 0, 1, 1, 1])

res = pd.DataFrame({'ids': df.index, 'grupos': grupos})
res = res.groupby('grupos')['ids'].apply(np.array).to_frame('ids')

This gives:

>>> res

                             ids
grupos                           
0                      [296, 306]
1       [160, 161, 365, 386, 471]

The ids column contains NumPy arrays.
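
For the standard-Python route mentioned above, a minimal sketch could look like the following. It assumes the ids and groups are available as two sequences of equal length (here hard-coded from the question's sample data) and produces the dictionary the question asks for. It also avoids the cause of the TypeError in the question: list.append modifies the list in place and returns None, so assigning its result back to kernelGroup replaces the list with None.

```python
from collections import defaultdict

# Sample data copied from the question
ids = [160, 161, 296, 306, 365, 386, 471]
grupos = [1, 1, 0, 0, 1, 1, 1]

# defaultdict(list) creates an empty list the first time a group is seen
clusters = defaultdict(list)
for group, subscriber_id in zip(grupos, ids):
    # append works in place and returns None, so its result is not assigned
    clusters[group].append(subscriber_id)

# Convert to a plain dict once the grouping is done
clusters = dict(clusters)
print(clusters)
```

With the real data, zip(indexResultat, side_subscriber.index) should play the role of zip(grupos, ids).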

If you need more efficiency, you have to go down one level and use NumPy: sort the ids array using grupos as the key, then slice it at the positions where the group changes.
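
One possible way to write the NumPy version is sketched below. It is only an illustration of the sort-and-slice idea, not a benchmarked implementation: the variable names are made up for the example, and the choice of np.argsort, np.unique and np.split is an assumption about how to carry out the slicing.

```python
import numpy as np

# Sample data copied from the question
ids = np.array([160, 161, 296, 306, 365, 386, 471])
grupos = np.array([1, 1, 0, 0, 1, 1, 1])

# Stable sort so that ids keep their original order inside each group
order = np.argsort(grupos, kind="stable")
sorted_groups = grupos[order]
sorted_ids = ids[order]

# In the sorted labels, the first occurrence of each label marks a slice boundary
labels, starts = np.unique(sorted_groups, return_index=True)

# Split at every boundary except the first one (which is always position 0)
chunks = np.split(sorted_ids, starts[1:])

# Build the dictionary, converting NumPy types to plain Python ones
clusters = {int(label): chunk.tolist() for label, chunk in zip(labels, chunks)}
print(clusters)
```

If the arrays are wanted instead of lists, the chunks can be kept as they are by dropping the tolist() call.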

    
answered by 30.06.2017 / 18:16