Convert U+XXXX to hexadecimal UTF-8

I would like to know how to convert a string like U+1F601 to this format: \xF0\x9F\x98\x81

We can see an example on this page: link

It lists each Unicode code point together with its value in bytes.

I use Python 2.7.

This website does what I want, but I do not know how it works internally: link

    
asked by XBoss 25.07.2018 at 12:19

1 answer

The .encode() method of Python Unicode strings lets you specify the target encoding. In your case it is enough to pass utf8. What remains is the question of how to put an arbitrary Unicode character (in your case U+1F601) into the string.

How to do that depends on how large the character's code point is.

  • If the code fits in 8 bits, you write \xHH, where HH is the hexadecimal representation of those 8 bits. Note that we are talking about the Unicode code point, not its UTF-8 encoding. For example, the code point of ñ (eñe) is U+00F1; since the high byte is 00, we only need to specify F1, which fits in eight bits, so it is written \xf1.

    Its UTF-8 representation is a different matter: it takes two bytes, which we can obtain with:

    >>> u'\xf1'.encode("utf8")
    b'\xc3\xb1'
    
  • If it does not fit in 8 bits but fits in 16, like the euro sign (€), which is U+20AC, you can use the form \uXXXX, where XXXX is the hexadecimal representation of those 16 bits. Its UTF-8 encoding is obtained as before:

    >>> u'\u20ac'.encode("utf8")
    b'\xe2\x82\xac'
    
  • Finally, if it does not fit in 16 bits either, as is the case with emojis and your example, you have to represent it with 32 bits using the form \UXXXXXXXX, where XXXXXXXX is the hexadecimal representation of those 32 bits. In your example, U+1F601 is written \U0001F601. Its UTF-8 bytes are obtained the same way as before:

    >>> u'\U0001F601'.encode("utf8")
    b'\xf0\x9f\x98\x81'
    

Note that the last form is the most general of all, since whatever fits in 8 bits also fits in 32. The ñ could therefore be written as \xf1 and also as \U000000f1.
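To make the three escape forms concrete, here is a minimal sketch. The helper name escape_literal is made up for this illustration and is not part of Python; it only picks the shortest escape, as text, for a given code point, following the size rules described above.

```python
# All three escape forms denote the same one-character string.
assert u'\xf1' == u'\u00f1' == u'\U000000f1'


def escape_literal(code):
    """Return the shortest Python escape sequence (as text) for a code point.

    Hypothetical helper, written only to illustrate the rules above.
    """
    if code <= 0xFF:        # fits in 8 bits
        return "\\x%02x" % code
    if code <= 0xFFFF:      # fits in 16 bits
        return "\\u%04x" % code
    return "\\U%08x" % code  # needs the 32-bit form


print(escape_literal(0xF1))     # \xf1
print(escape_literal(0x20AC))   # \u20ac
print(escape_literal(0x1F601))  # \U0001f601
```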

Update. If what you have is a string of the form "U+XXXXX" and you want the UTF-8 encoding of the character it represents, you do not need any of the above. Just extract what follows U+, parse it as a hexadecimal integer, and use chr() to get the (Unicode) character for that code point. Once you have the character, call .encode("utf8") to get its encoding. Note that chr() accepts any code point only in Python 3; in Python 2.7 it is limited to values below 256, and the corresponding function is unichr(). So:

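def unicode_to_utf8(unicode_point):
    # Drop the leading "U+" and parse the rest as a hexadecimal code point.
    code = int(unicode_point[2:], 16)
    # chr() maps the code point to its character; encode it as UTF-8 bytes.
    return chr(code).encode("utf8")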

Examples:

>>> unicode_to_utf8("U+F1")
b'\xc3\xb1'
>>> unicode_to_utf8("U+20AC")
b'\xe2\x82\xac'
>>> unicode_to_utf8("U+1F601")
b'\xf0\x9f\x98\x81'
    
answered by 25.07.2018 / 13:10