How to fill the columns of a matrix A that are in a matrix B and missing in A?

1

I have two matrices P and Q and I want to do a matrix calculation P*Q.T . Then I need the same number of columns. P have m_p subscriber and n_p hashtags and m_q with m_p advertisements (which we call "eclipse") per Q . Both have id to differentiate subscriber, eclipse and hashtag. However, when a user has no relations with a hashtag, the hastag does not appear in the P columns.

So how do you fill in the columns of P with those of Q missing on hashtag_id (but with 0)?

Here is how P and Q were constructed:

    try:
        conn = psycopg2.connect(**params)
        cur = conn.cursor()
            cur = conn.cursor()
            cur.execute("""
    SELECT COUNT(swipe.eclipse_id), subscriber_hashtag.hashtag_id,subscriber_hashtag.subscriber_id  FROM subscriber_hashtag
      -- join para que las publicidades/eclipses que gusta un usuarios están vinculadas con las de la tabla de correspondencia con los hashtag
      INNER JOIN eclipse_hashtag ON eclipse_hashtag.hashtag_id = subscriber_hashtag.hashtag_id
      -- join para que los usuarios  están vinculadas con los de la tabla de correspondencia con los hashtag
      LEFT OUTER JOIN swipe ON subscriber_hashtag.subscriber_id = swipe.subscriber_id
      -- recobremos los "me gusta"
      WHERE swipe.state= 3 OR swipe.state = 6 or swipe.state=9
        GROUP BY subscriber_hashtag.hashtag_id,subscriber_hashtag.subscriber_id
          ORDER BY subscriber_hashtag.hashtag_id,subscriber_hashtag.subscriber_id DESC;
          """)
            p = cur.fetchall()
            listado_p = [{"count": elem[0], "hashtag_id": elem[1], "subscriber_id": elem[2]} for elem in p]
            print(listado_p[0]['count'])

            cur.execute("""
    SELECT COUNT(eclipse_hashtag.eclipse_id), eclipse_hashtag.hashtag_id,eclipse_hashtag.eclipse_id FROM eclipse_hashtag
        GROUP BY eclipse_hashtag.hashtag_id, eclipse_hashtag.eclipse_id
          ORDER BY eclipse_hashtag.hashtag_id,eclipse_hashtag.eclipse_id ASC;
        """)
            q = cur.fetchall()
            listado_q = [{"count": elem[0], "hashtag_id": elem[1], "eclipse_id": elem[2]} for elem in q]
            print(listado_q[0]['count'])

# recuperamos los valores brutos
df_p = pd.DataFrame(listado_p)
df_q = pd.DataFrame(listado_q)

# sustituimos NaN con 0
df_p['hashtag_id'] = df_p['hashtag_id'].fillna(0).astype(int)
df_q['hashtag_id'] = df_q['hashtag_id'].fillna(0).astype(int)

# creamos matrices :
df_p2 = df_p.pivot(index='subscriber_id', columns='hashtag_id', values='count')
df_q2 = df_q.pivot(index='eclipse_id', columns='hashtag_id', values='count')

P = np.array(P)
Q = np.array(Q)

And the dimensions of the matrices:

print "(subscriber,hastag) : ",df_p2.shape
print "(eclipse,hashtag) : ",df_q2.shape

(subscriber,hastag) :  (21, 157)
(eclipse,hashtag) :  (618, 364)

For the minimum and verifiable example, we can use:

--- df_p ---
    count  hashtag_id  subscriber_id
0      22         321            172
1      44         321            161
2      25         322            172
3       6         323            172
4      32         325            172
5      26         328            161
6      48         329            172
7      96         329            161
8      14         335            172

and

    ---  df_q ---
      count  eclipse_id  hashtag_id
0         1        6521         321
1         1        6606         321
2         1        6609         321
3         1        6617         321
4         1        6649         321
5         1        6911         321
6         1        6914         321
7         1        7001         321
8         1        7004         321
9         1        7120         321
10        1        9534         321
11        1       10442         321
12        1       10452         321
13        1       10458         321
14        1       10462         321
15        1       10479         321
16        1       10490         321
17        1       10506         321
18        1       10523         321
19        1       10573         321
20        1       10593         321
21        1       10711         321
22        1        6521         322
23        1        6606         322
24        1        6609         322
...     ...         ...         ...
2182      1       11175        2272
2183      1       11176        2272
2184      1       11177        2272
2185      1       11178        2272
2186      1       11179        2272
2187      1       11180        2272
2188      1       11181        2272
2189      1       11229        2470
2190      1       11230        2470
2191      1       11231        2470
2192      1       11232        2470
2193      1       11233        2470
2194      1       11234        2470
2195      1       11235        2470
2196      1       11236        2470
2197      1       11237        2470
2198      1       11238        2470
2199      1       11239        2470
2200      1       11240        2470
2201      1       11241        2470
2202      1       11242        2470
2203      1       11243        2470
2204      1       11244        2470
2205      1       11245        2470
2206      1       11246        2470
2207      1       11247        2470
2208      1       11248        2470
2209      1       11249        2533
2210      1       11250        2533
2211      1       11251        2533

We can see that missing% in df_p

Here it is: an underlying question. How to do two reindication? One on the columns and one on the lines? Here is missing in df_p subcriber_id of df_n .

--- df_n ---
    count  eclipse_id  subscriber_id
0       1     11265.0            150
1      12     10453.0            150
2       1     11267.0            150
3       1     11266.0            150
4       1     11270.0            156
5       1     11267.0            156
6      17     10453.0            156
7       1     11266.0            156
8       1     11269.0            161
9       1         NaN            161

I have done:

df_p2_r = df_p2.reindex(index = df_n2.index)

but this cancels the previous reindication: df_p2_r = df_p2.reindex(columns = df_q2.columns)

    
asked by ThePassenger 07.06.2017 в 15:59
source

1 answer

1

If I understand you well, your problem is that df_q2 has columns ( 'hashtag_id' ) that do not exist in df_p2 . You need to add these columns and fill them with zeros so that both dataframes have the same number of columns.

If we assume that all columns of df_p2 exist in df_q2 , but that there are columns of df_q2 that are not in df_p2 , then all you need is a reindexing of the df_p2 columns using the of df_q2 to solve it:

df_p2 = df_p2.reindex(columns = df_q2.columns)

Creating an example we can see it:

import numpy as np
import pandas as pd

listado_q = [{"count": 21, "hashtag_id":321 , "eclipse_id":306},
             {"count": 23, "hashtag_id":400 , "eclipse_id":306},
             {"count": 35, "hashtag_id":321 , "eclipse_id":330},
             {"count": 95, "hashtag_id":421 , "eclipse_id":403},
             {"count": 21, "hashtag_id":841 , "eclipse_id":515},
             {"count": 14, "hashtag_id":351 , "eclipse_id":684},
             {"count": 13, "hashtag_id":985 , "eclipse_id":801}]


listado_p = [{"count": 17, "hashtag_id":321 , "subscriber_id":142},
             {"count": 13, "hashtag_id":321 , "subscriber_id":345},
             {"count": 19, "hashtag_id":421 , "subscriber_id":467},
             {"count": 23, "hashtag_id":841 , "subscriber_id":175}]


# recuperamos los valores brutos
df_p = pd.DataFrame(listado_p)
df_q = pd.DataFrame(listado_q)

# creamos matrices :
df_p2 = df_p.pivot(index='subscriber_id', columns='hashtag_id', values='count')
df_q2 = df_q.pivot(index='eclipse_id', columns='hashtag_id', values='count')


df_p2_r = df_p2.reindex(columns = df_q2.columns) # <<< Reindexamos

df_p2.fillna(0, inplace=True)
df_p2_r.fillna(0, inplace=True)
df_q2.fillna(0, inplace=True)


print '-'*50, '\ndf_p2 original\n', '-'*50
print df_p2, '\n'
print '-'*50, '\ndf_p2 reindexado\n', '-'*50
print df_p2_r, '\n'
print '-'*50, '\ndf_p2\n', '-'*50
print(df_q2), '\n'

Exit:

-------------------------------------------------- 
df_p2 original
--------------------------------------------------
hashtag_id      321   421   841
subscriber_id                  
142            17.0   0.0   0.0
175             0.0   0.0  23.0
345            13.0   0.0   0.0
467             0.0  19.0   0.0 

-------------------------------------------------- 
df_p2 reindexado
--------------------------------------------------
hashtag_id      321  351  400   421   841  985
subscriber_id                                 
142            17.0  0.0  0.0   0.0   0.0  0.0
175             0.0  0.0  0.0   0.0  23.0  0.0
345            13.0  0.0  0.0   0.0   0.0  0.0
467             0.0  0.0  0.0  19.0   0.0  0.0 

-------------------------------------------------- 
df_p2
--------------------------------------------------
hashtag_id   321   351   400   421   841   985
eclipse_id                                    
306         21.0   0.0  23.0   0.0   0.0   0.0
330         35.0   0.0   0.0   0.0   0.0   0.0
403          0.0   0.0   0.0  95.0   0.0   0.0
515          0.0   0.0   0.0   0.0  21.0   0.0
684          0.0  14.0   0.0   0.0   0.0   0.0
801          0.0   0.0   0.0   0.0   0.0  13.0

I do not know if this solves your problem or not. If this is not your problem try to express it better and, above all, it would be very useful for you to add a minimum, complete and verifiable example , since we can not reproduce the problem when the query data comes to the database. Just create a reproducible example that will fill your problem and the output you expect as I did in this answer.

    
answered by 07.06.2017 / 21:08
source