Error extracting a UTF-8 character from a string

8

NOTE

Although it does not affect the template code (utf8iterator.hpp), main.cpp only works correctly if it is saved in the right encoding. On my system I work with UTF-8, and main.cpp is saved in that encoding.

Thanks to user @asdasdasd for pointing this out in the comments.

END OF THE NOTE

After reading this question: Why does cout not display accented vowels or "ñ" with gcc 4.9.4?, I felt an irresistible urge to iterate over the individual characters of a ::std::string or of a const char[].

After reading up a bit on the Wikipedia page on UTF-8, I wrote this simple template:

// utf8iterator.hpp

#ifndef UTF8ITERATOR_HPP
#define UTF8ITERATOR_HPP

#include <cstddef>

template< typename T > struct utf8iterator {
  //static constexpr char ReplacementCharacter[4] { '\xEF', '\xBF', '\xBD', '\x00' };

  T ptr;
  ::size_t size; // Size of the character, in bytes. == 0 -> ptr has changed.
                 // Its only purpose is to avoid unnecessary writes.
  char bytes[5]; // A UTF-8 character is at most 4 bytes. We leave room for the trailing 0.

  utf8iterator( const T &p ) :
    ptr( p ),
    size( 0 )
  {
    bytes[4] = 0; // Done only once. It is never overwritten.
  }
  utf8iterator &operator=( const T &iter ) {
    ptr = iter;
    size = 0;
    return *this;
    // 'bytes[4] = 0' was already done in the constructor.
  }

  bool operator==( const utf8iterator< T > &other ) const noexcept { return ptr == other.ptr; }
  bool operator!=( const utf8iterator< T > &other ) const noexcept { return ptr != other.ptr; }

  ::size_t calculateSize( ) const {
    if( ( *ptr & 248 ) == 240 ) { // 11110
      return 4;
    } else if( ( *ptr & 240 ) == 224 ) { // 1110
      return 3;
    } else if( ( *ptr & 224 ) == 192 ) // 110
      return 2;

    return 1;
  }
  utf8iterator &operator++( ) {
    if( size ) {
      ptr += size;
      size = 0; // Changing 'ptr' invalidates 'size'.
    } else
      ptr += calculateSize( ); // 'size' is already invalid.

    return *this;
  }
  utf8iterator operator++( int ) {
    utf8iterator tmp( *this );

    if( size ) {
      ptr += size;
      size = 0; // Changing 'ptr' invalidates 'size'.
    } else
      ptr += calculateSize( ); // 'size' is already invalid.

    return tmp;
  }

  operator const char *( ) {
    // If 'size' is invalid, we have to compute the size of the character, in bytes.
    if( !size ) {
      ::size_t c;
      T iter( ptr );

      size = calculateSize( );

      // Could be optimized by specializing for < const char * > and using ::std::memcpy( ).
      // Copy the number of bytes given by 'size' into the 'bytes' buffer.
      for( c = 0; c != size; ++c ) {
        bytes[c] = *iter;
        ++iter;
      }

      // The constructor already did 'bytes[4] = 0'. Writes are expensive.
      // We only write the 0 if 'size != 4'.
      if( size != 4 )
        bytes[size] = 0;
    }

    return bytes;
  }
};

#endif
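
For reference, the decimal masks in calculateSize( ) correspond to the UTF-8 lead-byte patterns described on the Wikipedia page. A minimal standalone sketch of the same test with hex masks follows; utf8_length is a hypothetical name that is not part of utf8iterator.hpp, and it assumes it is handed the first byte of a well-formed sequence:

```cpp
#include <cstddef>

// Hypothetical helper, not part of utf8iterator.hpp: returns the length in
// bytes of the UTF-8 sequence that starts with 'lead'.
// Assumption: 'lead' is a valid lead byte. Continuation bytes (10xxxxxx) and
// invalid bytes fall through to 1, the same fallback calculateSize( ) uses.
inline std::size_t utf8_length( char lead ) {
  // Work on an unsigned value so the masks read the same whether plain
  // char is signed or unsigned on the platform.
  const unsigned char b = static_cast< unsigned char >( lead );

  if( ( b & 0xF8 ) == 0xF0 ) return 4; // 11110xxx
  if( ( b & 0xF0 ) == 0xE0 ) return 3; // 1110xxxx
  if( ( b & 0xE0 ) == 0xC0 ) return 2; // 110xxxxx

  return 1;                            // 0xxxxxxx (ASCII) or fallback
}
```

For example, utf8_length( '\xC3' ) gives 2, because 0xC3 is the first byte of ñ in UTF-8.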

The template is accompanied by a small test program:

// main.cpp

#include <iostream>

#include "utf8iterator.hpp"

int main( void ) {
  const char *test = "abcdeññ";

  utf8iterator< const char * > charIter( test );

  while( *charIter ) {
    std::cout << charIter.calculateSize( ) << ": ";
    std::cout << *charIter << "\n";
    ++charIter;
  }

  std::cout << std::endl;

  return 0;
}

All of this compiles correctly with:

g++ -I . -std=c++11 -Wall -pedantic -o test main.cpp

The expected output would be:

1: a
1: b
1: c
1: d
1: e
2: ñ
2: ñ

However, the actual output is:

1: a
1: b
1: c
1: d
1: e
2:
2:

I'm pretty sure that the bug is in const char *utf8iterator::operator*( ), but I can't put my finger on it.

Any suggestions?

EDIT

Heh, in the end the problem was not there, but in how I print it in the test; my C++ is a bit rusty. I'll leave the question unanswered for a while.

    
asked by Trauma 20.03.2017 at 11:00

2 answers

2

This works for me:

#include <iostream>

#include "utf8iterator.hpp"

int main( void ) {
  const char *test = "abcdeññ";

  utf8iterator< const char * > charIter( test );

  while( *charIter ) {
    std::cout << charIter.calculateSize( ) << ": ";
    std::cout << charIter << "\n";

    ++charIter;
  }

  return 0;
}

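The change is replacing std::cout << *charIter << "\n"; with std::cout << charIter << "\n";. The template only defines a conversion to const char *, so *charIter first converts to that pointer and then dereferences it, which yields a single char (the first byte of the character). I guess printing charIter directly gives the behavior you wanted from the operator.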
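
To illustrate the difference between the two expressions, here is a minimal sketch. OneChar is a hypothetical stand-in for utf8iterator that, like the template in the question, offers only a conversion to const char *; the two helper functions are also made up for the example and just capture what each std::cout call would receive:

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for utf8iterator: it holds the bytes of one
// character and only converts to const char *.
struct OneChar {
  char bytes[5];
  operator const char *( ) const { return bytes; }
};

// What 'std::cout << *c' receives: the conversion runs first, then the
// built-in '*' dereferences the pointer, leaving a single char.
inline std::string viaDereference( const OneChar &c ) {
  return std::string( 1, *c );
}

// What 'std::cout << c' receives: the whole NUL-terminated sequence.
inline std::string viaConversion( const OneChar &c ) {
  const char *whole = c;
  return std::string( whole );
}
```

With the two UTF-8 bytes of ñ stored in bytes, viaDereference returns a one-byte string (an incomplete sequence that a terminal cannot display), while viaConversion returns both bytes.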

    
answered by 20.03.2017 at 23:33
0

If you want to interpret UTF-8 characters correctly, you can use wstring instead of reinventing the wheel:

// main.cpp

#include <iostream>
#include <locale>
#include <string>

using namespace std;

int main( void ) {
    ios_base::sync_with_stdio(false);
    wcout.imbue(locale("en_US.UTF-8"));

    for (auto const &t : wstring(L"áéíóúññ")) {
        wcout << t;
    }

    wcout << endl;
    return 0;
}

To check its operation:

$ g++ -I . -std=c++11 -Wall -pedantic -o test main.cpp
$ ./test 

áéíóúññ
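
As a rough illustration of why iterating a wstring visits whole characters, this sketch compares element counts for the text ññ. The function names are hypothetical, and the narrow string spells out the UTF-8 bytes explicitly so the result does not depend on how the source file is saved:

```cpp
#include <cstddef>
#include <string>

// Number of elements when "ññ" is stored as UTF-8 bytes in a narrow string:
// each ñ takes two char elements (0xC3 0xB1).
inline std::size_t narrowUnits( ) {
  return std::string( "\xC3\xB1\xC3\xB1" ).size( );
}

// Number of elements when the same text is stored in a wide string:
// each ñ (U+00F1) fits in one wchar_t.
inline std::size_t wideUnits( ) {
  return std::wstring( L"\u00F1\u00F1" ).size( );
}
```

Here narrowUnits( ) is 4 and wideUnits( ) is 2. This one-element-per-character property assumes each character fits in a single wchar_t; where wchar_t is 16 bits, characters outside the Basic Multilingual Plane take two elements.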
    
answered by 20.03.2017 at 11:23