Compare data shared in two columns in Python

Question

I compared two columns in Python because I wanted to know which values in column 1 (col1) also appear in column 2 (col2).

I used the following:

set(df['col1']).intersection(set(df['col2']))

The result looks like this:

set([u'TSC22D4', u'HSPB1', u'RARRES2', u'GNG11', u'PTPRZ1', u'MCM7', u'MEST', u'PON2', u'PTN', u'RHEB', u'PEG10', u'GRM3'])

Is the u caused by intersection? The data set is very large, so I can only inspect it with tools like these.

python

asked by Jan 01.11.2018 at 20:11

1 answer

score 0 · Answer 1

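The u indicates that the value is a Unicode string, a string type that can hold characters from any alphabet, as opposed to ASCII, which only covers the unaccented letters of the English alphabet. The u is not part of your data and intersection did not add it; it is only how Python 2 writes such a string.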

What is happening is that you are directly printing a variable of type "set", which is the result of the operation. Python lets you show almost any variable on the screen, because it always tries to convert the value to a readable form for output. In this case it shows the same syntax you could use to re-create the result set from a list: set(), and inside the parentheses, between square brackets, all the elements of the set separated by commas. Since each element here is a Unicode string, it is shown with the syntax you would use to write it in a Python program, that is, u'contenido' .
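As a minimal sketch of that difference (the value `gen` is a hypothetical sample, not taken from your DataFrame), the following compares the "source code" form that a set uses for its elements with the plain form that print() uses. The u prefix in the first output appears in Python 2 only; Python 3 shows the same string without it.

```python
# Hypothetical sample value standing in for one element of the result set.
gen = u'HSPB1'

# repr() returns the "Python source" form of a value. This is the form
# a set uses for each of its elements when the set itself is displayed.
print(repr(gen))   # Python 2: u'HSPB1'   Python 3: 'HSPB1'

# print() on the string itself uses the readable form: no quotes, no prefix.
print(gen)         # HSPB1

# Printing the container shows repr() of every element inside it.
resultado = set([gen])
print(resultado)
```

In both versions the stored text is exactly HSPB1; only the way it is displayed changes.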

If you simply want the content, without quotes, without the u and without commas, do not print the whole set directly. Instead, iterate over it and print each item one by one. For example:

resultado = set(df['col1']).intersection(set(df['col2']))
for elemento in resultado:
    print(elemento)

In this case, every elemento that you pass to print() is a Unicode string, but print() knows how to display these strings without resorting to their "Python representation", as happened with the set. You will see them "normal", without quotes or u .

You can also use join() to convert the result set into a single string in which the elements are separated by commas (or by any separator you prefer). Like this:

cadena = ", ".join(resultado)

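In your case cadena would end up containing something like the following (a set has no defined order, so the elements may come out in a different order):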

'MCM7, GNG11, PON2, RARRES2, HSPB1, GRM3, PTPRZ1, RHEB, PTN, MEST, PEG10, TSC22D4'
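If you need the same order on every run, one option is to sort the elements before joining them. The sketch below uses two short hypothetical lists, `col1` and `col2`, in place of df['col1'] and df['col2'], so that it runs without pandas; the gene names that are not in your output are invented for illustration.

```python
# Hypothetical stand-ins for df['col1'] and df['col2'].
col1 = [u'HSPB1', u'MCM7', u'PTN', u'ACTB']
col2 = [u'PTN', u'GAPDH', u'HSPB1']

# intersection() accepts any iterable, so only the first column
# has to be converted to a set explicitly.
resultado = set(col1).intersection(col2)

# sorted() returns a list in alphabetical order, which makes the
# joined string reproducible from one run to the next.
cadena = ", ".join(sorted(resultado))
print(cadena)  # HSPB1, PTN
```

Because intersection() takes any iterable, wrapping the second column in set() is not required, although doing so is harmless.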