When you choose a collation such as utf8_spanish_ci, you are actually specifying two things:
- The charset: utf8
- The collation: spanish_ci
The charset determines how the data is represented internally (the bytes), while the collation determines the rules used to compare and sort the text.
Reference: Character Sets and Collations in MySQL.
Charset
So the same string (for example 'abc') is stored with different byte values, and a different number of bytes, depending on whether you use utf8_general_ci or utf16_general_ci, because they use two different charsets.
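As a rough illustration outside MySQL, the same difference can be seen by encoding the string in Python. This is only a sketch: it assumes Python's "utf-8" and "utf-16-be" codecs (big-endian, no byte-order mark) are a fair stand-in for what the utf8 and utf16 charsets store.

```python
# Encode the same string with two different character encodings.
text = "abc"

utf8_bytes = text.encode("utf-8")
# Assumption: big-endian UTF-16 without a BOM, used here to approximate the utf16 charset.
utf16_bytes = text.encode("utf-16-be")

print(utf8_bytes.hex(" "))                 # 61 62 63
print(utf16_bytes.hex(" "))                # 00 61 00 62 00 63
print(len(utf8_bytes), len(utf16_bytes))   # 3 6
```

The text is the same, but the bytes and their count are not.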
There are 2 main reasons why you would choose one charset rather than another in different circumstances:
- Some charsets cannot represent certain characters, so it is important to choose one that can handle all the characters you need. The charsets that start with utf (utf8, utf16, utf32, etc.) are Unicode encodings. Note, however, that in current MySQL versions utf8 is an alias for utf8mb3, which stores at most 3 bytes per character and therefore covers only the Basic Multilingual Plane; for the full Unicode range (emoji, for example) use utf8mb4.
- The number of bytes used per character varies between different charsets. So if you want to control the size of the database, it's something to think about.
Generally, utf-8 is favored because it strikes a good balance: it can handle all the characters in Unicode while keeping the number of bytes needed low.
For languages such as Spanish, utf-8 represents most characters with a single byte (accented letters and ñ take two). But if you were handling Chinese, for example, most characters take three bytes in utf-8 and only two in utf16, so utf16 may be more advantageous. (Note: contrary to what you put in the question, utf16 does not represent all the characters in Unicode with a single fixed size: most take 2 bytes, others take 4.)
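The size trade-off can be checked with a small Python sketch. The helper name byte_lengths and the sample strings are made up for illustration, and "utf-16-be" is again assumed as a stand-in for the utf16 charset.

```python
def byte_lengths(text):
    """Return the character count and the encoded size in UTF-8 and UTF-16."""
    return {
        "chars": len(text),
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-be")),
    }

spanish = "ma\u00f1ana"      # "mañana": only the ñ needs 2 bytes in UTF-8
chinese = "\u6570\u636e\u5e93"  # "数据库": 3 bytes each in UTF-8, 2 each in UTF-16
emoji = "\U0001F600"         # outside the Basic Multilingual Plane: 4 bytes in both

print(byte_lengths(spanish))  # {'chars': 6, 'utf8': 7, 'utf16': 12}
print(byte_lengths(chinese))  # {'chars': 3, 'utf8': 9, 'utf16': 6}
print(byte_lengths(emoji))    # {'chars': 1, 'utf8': 4, 'utf16': 4}
```

For mostly Latin text UTF-8 is clearly smaller; for Chinese text UTF-16 comes out ahead.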
Collate
Now, within a charset, you can choose between different collations. For example, you can choose between utf8_general_ci, utf8_spanish_ci, and even utf8_spanish2_ci. In all these cases, the internal representation of the text is identical (the bytes are the same).
What a different collation changes is how the text is compared and sorted in your queries.
For example, if the text contains letters such as ñ or the double L (ll), you will notice that the text is sorted differently under different collations when you use ORDER BY.
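To make the idea concrete, here is a simplified Python imitation of three sort orders. It is not MySQL's algorithm: the functions general_key and spanish_key are hypothetical helpers that only mimic the documented rules (in utf8_general_ci ñ sorts like n; in the Spanish collations ñ is a separate letter after n; in traditional Spanish, ch and ll are also separate letters after c and l).

```python
import unicodedata

def general_key(word):
    """Accent- and case-insensitive key: ñ is treated like n."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def spanish_key(word, traditional=False):
    """Key where ñ sorts after n; if traditional, ch and ll sort after c and l."""
    w = unicodedata.normalize("NFC", word.lower())
    key, i = [], 0
    while i < len(w):
        if traditional and w[i:i + 2] in ("ch", "ll"):
            key.append(2 * ord(w[i]) + 1)   # digraph ranks just after its first letter
            i += 2
            continue
        if w[i] == "\u00f1":                # ñ ranks just after n
            key.append(2 * ord("n") + 1)
        else:
            base = unicodedata.normalize("NFD", w[i])[0]  # drop accents
            key.append(2 * ord(base))
        i += 1
    return key

words = ["nube", "\u00f1u", "luz", "llave", "nada", "loma"]

print(sorted(words, key=general_key))
# ['llave', 'loma', 'luz', 'nada', 'ñu', 'nube']
print(sorted(words, key=spanish_key))
# ['llave', 'loma', 'luz', 'nada', 'nube', 'ñu']
print(sorted(words, key=lambda w: spanish_key(w, traditional=True)))
# ['loma', 'luz', 'llave', 'nada', 'nube', 'ñu']
```

The stored strings never change; only the key used to order them does, which is the role the collation plays.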
Here is a demonstration of how the collate affects the order of the text in MySQL.
As for the effect on comparisons (where col = 'abc'), I am not aware of any difference between utf8_general_ci and utf8_spanish_ci. But there may be. I know there are differences for other languages, such as German.
Reference: Examples of the Effect of Collation.