I have two matrices P
and Q
and I want to do a matrix calculation P*Q.T
. Then I need the same number of columns. P
have m_p subscriber and n_p hashtags and m_q with m_p advertisements (which we call "eclipse") per Q
. Both have id to differentiate subscriber, eclipse and hashtag. However, when a user has no relations with a hashtag, the hastag does not appear in the P
columns.
So how do you fill in the columns of P
with those of Q
missing on hashtag_id
(but with 0)?
Here is how P
and Q
were constructed:
try:
conn = psycopg2.connect(**params)
cur = conn.cursor()
cur = conn.cursor()
cur.execute("""
SELECT COUNT(swipe.eclipse_id), subscriber_hashtag.hashtag_id,subscriber_hashtag.subscriber_id FROM subscriber_hashtag
-- join para que las publicidades/eclipses que gusta un usuarios están vinculadas con las de la tabla de correspondencia con los hashtag
INNER JOIN eclipse_hashtag ON eclipse_hashtag.hashtag_id = subscriber_hashtag.hashtag_id
-- join para que los usuarios están vinculadas con los de la tabla de correspondencia con los hashtag
LEFT OUTER JOIN swipe ON subscriber_hashtag.subscriber_id = swipe.subscriber_id
-- recobremos los "me gusta"
WHERE swipe.state= 3 OR swipe.state = 6 or swipe.state=9
GROUP BY subscriber_hashtag.hashtag_id,subscriber_hashtag.subscriber_id
ORDER BY subscriber_hashtag.hashtag_id,subscriber_hashtag.subscriber_id DESC;
""")
p = cur.fetchall()
listado_p = [{"count": elem[0], "hashtag_id": elem[1], "subscriber_id": elem[2]} for elem in p]
print(listado_p[0]['count'])
cur.execute("""
SELECT COUNT(eclipse_hashtag.eclipse_id), eclipse_hashtag.hashtag_id,eclipse_hashtag.eclipse_id FROM eclipse_hashtag
GROUP BY eclipse_hashtag.hashtag_id, eclipse_hashtag.eclipse_id
ORDER BY eclipse_hashtag.hashtag_id,eclipse_hashtag.eclipse_id ASC;
""")
q = cur.fetchall()
listado_q = [{"count": elem[0], "hashtag_id": elem[1], "eclipse_id": elem[2]} for elem in q]
print(listado_q[0]['count'])
# recuperamos los valores brutos
df_p = pd.DataFrame(listado_p)
df_q = pd.DataFrame(listado_q)
# sustituimos NaN con 0
df_p['hashtag_id'] = df_p['hashtag_id'].fillna(0).astype(int)
df_q['hashtag_id'] = df_q['hashtag_id'].fillna(0).astype(int)
# creamos matrices :
df_p2 = df_p.pivot(index='subscriber_id', columns='hashtag_id', values='count')
df_q2 = df_q.pivot(index='eclipse_id', columns='hashtag_id', values='count')
P = np.array(P)
Q = np.array(Q)
And the dimensions of the matrices:
print "(subscriber,hastag) : ",df_p2.shape
print "(eclipse,hashtag) : ",df_q2.shape
(subscriber,hastag) : (21, 157)
(eclipse,hashtag) : (618, 364)
For the minimum and verifiable example, we can use:
--- df_p ---
count hashtag_id subscriber_id
0 22 321 172
1 44 321 161
2 25 322 172
3 6 323 172
4 32 325 172
5 26 328 161
6 48 329 172
7 96 329 161
8 14 335 172
and
--- df_q ---
count eclipse_id hashtag_id
0 1 6521 321
1 1 6606 321
2 1 6609 321
3 1 6617 321
4 1 6649 321
5 1 6911 321
6 1 6914 321
7 1 7001 321
8 1 7004 321
9 1 7120 321
10 1 9534 321
11 1 10442 321
12 1 10452 321
13 1 10458 321
14 1 10462 321
15 1 10479 321
16 1 10490 321
17 1 10506 321
18 1 10523 321
19 1 10573 321
20 1 10593 321
21 1 10711 321
22 1 6521 322
23 1 6606 322
24 1 6609 322
... ... ... ...
2182 1 11175 2272
2183 1 11176 2272
2184 1 11177 2272
2185 1 11178 2272
2186 1 11179 2272
2187 1 11180 2272
2188 1 11181 2272
2189 1 11229 2470
2190 1 11230 2470
2191 1 11231 2470
2192 1 11232 2470
2193 1 11233 2470
2194 1 11234 2470
2195 1 11235 2470
2196 1 11236 2470
2197 1 11237 2470
2198 1 11238 2470
2199 1 11239 2470
2200 1 11240 2470
2201 1 11241 2470
2202 1 11242 2470
2203 1 11243 2470
2204 1 11244 2470
2205 1 11245 2470
2206 1 11246 2470
2207 1 11247 2470
2208 1 11248 2470
2209 1 11249 2533
2210 1 11250 2533
2211 1 11251 2533
We can see that missing% in df_p
Here it is: an underlying question. How to do two reindication? One on the columns and one on the lines? Here is missing in df_p
subcriber_id
of df_n
.
--- df_n ---
count eclipse_id subscriber_id
0 1 11265.0 150
1 12 10453.0 150
2 1 11267.0 150
3 1 11266.0 150
4 1 11270.0 156
5 1 11267.0 156
6 17 10453.0 156
7 1 11266.0 156
8 1 11269.0 161
9 1 NaN 161
I have done:
df_p2_r = df_p2.reindex(index = df_n2.index)
but this cancels the previous reindication: df_p2_r = df_p2.reindex(columns = df_q2.columns)