Replace characters in a field in SQL Server

4

My question is this,

I have a field called titulo whose records are all university title names. Some contain ñ , other letters with accents. I imported it from a csv that already came with errors and my intention is to clean them and insert them clean in a new column of the same table called TituloDepurado .

Example:

% co_of% professors of Physical Education

Titulo Physical Education faculty

I did a Stored Procedure with the replace function. It does not mark me error when executing itself but nevertheless it inserts to me in the new column the names as they were, without any modification.

The updates statement is as follows:

UPDATE tabla set TituloDepurado = REPLACE(titulo, "á", "á") 
UPDATE tabla set TituloDepurado = REPLACE(titulo, "é", "é") 
UPDATE tabla set TituloDepurado = REPLACE(titulo, "í", "í") 
UPDATE tabla set TituloDepurado = REPLACE(titulo,"ó", "ó") 
UPDATE tabla set TituloDepurado = REPLACE(titulo, "ú", "ú") 
UPDATE tabla set TituloDepurado = REPLACE(titulo,"ñ", "ñ")

I would like to know why you insert the titles in the new column as they are without paying attention to the function TituloDepurado .

What would be my mistake?

    
asked by merisavino 20.07.2017 в 22:46
source

1 answer

5

Function extracted from Stack Overflow

The only thing you need is to decode the text to UTF-8 by means of a function:

CREATE FUNCTION dbo.DecodeUTF8String(@value varchar(max))
RETURNS nvarchar(max)
AS
BEGIN
    -- Transforms a UTF-8 encoded varchar string into Unicode
    -- By Anthony Faull 2014-07-31
    DECLARE @result nvarchar(max);

    -- If ASCII or null there's no work to do
    IF (@value IS NULL
        OR @value NOT LIKE '%[^ -~]%' COLLATE Latin1_General_BIN
    )
        RETURN @value;

    -- Generate all integers from 1 to the length of string
    WITH e0(n) AS (SELECT TOP(POWER(2,POWER(2,0))) NULL FROM (VALUES (NULL),(NULL)) e(n))
        , e1(n) AS (SELECT TOP(POWER(2,POWER(2,1))) NULL FROM e0 CROSS JOIN e0 e)
        , e2(n) AS (SELECT TOP(POWER(2,POWER(2,2))) NULL FROM e1 CROSS JOIN e1 e)
        , e3(n) AS (SELECT TOP(POWER(2,POWER(2,3))) NULL FROM e2 CROSS JOIN e2 e)
        , e4(n) AS (SELECT TOP(POWER(2,POWER(2,4))) NULL FROM e3 CROSS JOIN e3 e)
        , e5(n) AS (SELECT TOP(POWER(2.,POWER(2,5)-1)-1) NULL FROM e4 CROSS JOIN e4 e)
    , numbers(position) AS
    (
        SELECT TOP(DATALENGTH(@value)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM e5
    )
    -- UTF-8 Algorithm (http://en.wikipedia.org/wiki/UTF-8)
    -- For each octet, count the high-order one bits, and extract the data bits.
    , octets AS
    (
        SELECT position, highorderones, partialcodepoint
        FROM numbers a
        -- Split UTF8 string into rows of one octet each.
        CROSS APPLY (SELECT octet = ASCII(SUBSTRING(@value, position, 1))) b
        -- Count the number of leading one bits
        CROSS APPLY (SELECT highorderones = 8 - FLOOR(LOG( ~CONVERT(tinyint, octet) * 2 + 1)/LOG(2))) c
        CROSS APPLY (SELECT databits = 7 - highorderones) d
        CROSS APPLY (SELECT partialcodepoint = octet % POWER(2, databits)) e
    )
    -- Compute the Unicode codepoint for each sequence of 1 to 4 bytes
    , codepoints AS
    (
        SELECT position, codepoint
        FROM
        (
            -- Get the starting octect for each sequence (i.e. exclude the continuation bytes)
            SELECT position, highorderones, partialcodepoint
            FROM octets
            WHERE highorderones <> 1
        ) lead
        CROSS APPLY (SELECT sequencelength = CASE WHEN highorderones in (1,2,3,4) THEN highorderones ELSE 1 END) b
        CROSS APPLY (SELECT endposition = position + sequencelength - 1) c
        CROSS APPLY
        (
            -- Compute the codepoint of a single UTF-8 sequence
            SELECT codepoint = SUM(POWER(2, shiftleft) * partialcodepoint)
            FROM octets
            CROSS APPLY (SELECT shiftleft = 6 * (endposition - position)) b
            WHERE position BETWEEN lead.position AND endposition
        ) d
    )
    -- Concatenate the codepoints into a Unicode string
    SELECT @result = CONVERT(xml,
        (
            SELECT NCHAR(codepoint)
            FROM codepoints
            ORDER BY position
            FOR XML PATH('')
        )).value('.', 'nvarchar(max)');

    RETURN @result;
END
GO

Next, pass the text to be decoded in the function. For example, the text Educación Física :

SELECT dbo.DecodeUTF8String('Educación Física')

It will have as output:

  

Educación Física

Here you can see the demo and the result

In your case, it is simply doing the UPDATE in the following way:

UPDATE tabla set TituloDepurado = dbo.DecodeUTF8String(titulo)
    
answered by 20.07.2017 / 22:59
source