The .encode() method of Python Unicode strings lets you specify which encoding to convert to. In your case it is enough to specify utf8. But the question remains of how to put an arbitrary Unicode character (in your case U+1F601) into the string.
The way to do it depends on the character's code point.
- If the code fits in 8 bits, you write \xHH, with HH the hexadecimal representation of those 8 bits. Note that we are talking about the Unicode code point, not its UTF-8 encoding. So, for example, the code of the letter ñ (eñe) is U+00F1, but since the high part is 00, we only need to specify the F1, which fits in eight bits, so it would be \xf1.
Its UTF-8 representation is a different matter: it takes two bytes, which we can obtain with:
>>> u'\xf1'.encode("utf8")
b'\xc3\xb1'
- If it does not fit in 8 bits but fits in 16, as with the euro sign (€), which is U+20AC, you can use the form \uXXXX, where XXXX is the hexadecimal representation of those 16 bits. Its UTF-8 encoding is obtained as before:
>>> u'\u20ac'.encode("utf8")
b'\xe2\x82\xac'
- Finally, if it does not fit in 16 bits either, as is the case with emojis and your example, you have to represent it with 32 bits using the form \UXXXXXXXX, with XXXXXXXX being the hexadecimal representation of those 32 bits. In your example, U+1F601 would be written as \U0001F601. The bytes of its UTF-8 encoding are obtained the same way as before:
>>> u'\U0001F601'.encode("utf8")
b'\xf0\x9f\x98\x81'
Note that the last option is the most general of all, since what fits in 8 bits also fits in 32. It would therefore be possible to write the eñe as \xf1 and also as \U000000f1.
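As a quick check of the point above, here is a minimal sketch showing that the three escape forms denote the same character, plus a small helper that picks the shortest form for a given code point. The name shortest_escape is hypothetical and only for illustration; it is not part of Python.

```python
# All three spellings produce the same one-character string
assert '\xf1' == '\u00f1' == '\U000000f1' == 'ñ'


def shortest_escape(code):
    """Return the shortest Python escape sequence, as text, for a code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code <= 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    if code <= 0xFF:
        return "\\x%02x" % code    # fits in 8 bits
    if code <= 0xFFFF:
        return "\\u%04x" % code    # fits in 16 bits
    return "\\U%08x" % code        # needs the 32-bit form


for code in (0xF1, 0x20AC, 0x1F601):
    print(shortest_escape(code), chr(code).encode("utf8"))
```

The helper only builds the text of the escape; the character itself still comes from chr(code), as in the loop.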
Update. If what you have is a string of the form "U+XXXXX" and you want the UTF-8 version of the character it represents, you do not need any of the above. Just extract what comes after U+, parse it as a hexadecimal integer, and use chr() to get the (Unicode) character that corresponds to that code. Once you have the character, call .encode("utf8") to get its encoding. So:
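def unicode_to_utf8(unicode_point):
    # Drop the leading "U+" and parse the rest as a hexadecimal integer
    code = int(unicode_point[2:], 16)
    # Turn the code point into a character, then encode it as UTF-8
    return chr(code).encode("utf8")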
Examples:
>>> unicode_to_utf8("U+F1")
b'\xc3\xb1'
>>> unicode_to_utf8("U+20AC")
b'\xe2\x82\xac'
>>> unicode_to_utf8("U+1F601")
b'\xf0\x9f\x98\x81'