Fix spacing and capitalization

I'm continuing with my problem. I have to correct a text. First I had to count the words and so on, and that part works (thank goodness), but now come these rules:

  • More than two spaces are not allowed; a tab may be used if desired, but two spaces may not.
  • After . , ; : there must always be a space.
  • After a period there must always be a space and a capital letter.

My code fixes some things, but not everything, and on top of that it adds spaces. Could you give me a hand?

def correcciones(texto):
    texto = texto.replace('  ', '')
    texto = texto.replace('. ', '.')
    texto = texto.replace('.', '. ')
    texto = texto.replace(':', ': ')
    texto = texto.replace(';', '; ')
    texto = texto.replace(';  ', '; ')
    texto = texto.replace(',', ', ')
    texto = texto.replace('  ', '')
    texto = texto.replace('.    ', '.\t')
    texto = texto.replace(' -', ':\t-')
    texto = texto.replace('      ', '\t  ')
    return texto
    
asked by Eva Rentero 05.01.2019 at 12:05

1 answer

Spaces and tabs

As you rightly say, by brute force and with great care, most cases can be handled with replace().

However, the order in which you apply replace() is crucial. For example, if you first apply the one that converts two spaces into one, a sequence of four spaces will no longer be detected as a TAB, because it will already have been compressed to just two.
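
A quick check illustrates the ordering problem. The sample string and the variable names below are made up for the demonstration:

```python
# Hypothetical sample: two words separated by exactly four spaces.
muestra = "uno" + " " * 4 + "dos"

# Collapsing pairs first destroys the four-space run before it can become a TAB.
primero_pares = muestra.replace("  ", " ").replace("    ", "\t")

# Converting four spaces to a TAB first gives the intended result.
primero_tab = muestra.replace("    ", "\t").replace("  ", " ")

print(repr(primero_pares))  # two spaces remain between the words
print(repr(primero_tab))    # a single TAB between the words
```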

I think the following sequence will work well in most cases. It would help if you provided the text you are testing with, to check that the result is as expected.

def correcciones(texto):
    texto = texto.replace('.', '. ')
    texto = texto.replace(':', ': ')
    texto = texto.replace(';', '; ')
    texto = texto.replace(',', ', ')
    texto = texto.replace('    ', '\t')
    texto = texto.replace('   ', ' ')
    texto = texto.replace('  ', ' ')
    texto = texto.replace(' -', ':\t-')
    return texto

  • We start by adding a space after each punctuation mark. Strictly speaking this is not ideal, because a space is added even where one already existed, but since sequences of two spaces are replaced by a single one later, that side effect disappears.
  • Then sequences of four spaces are replaced by a TAB. The special case you had for four spaces after a period is not needed, because this one covers it.
  • Then sequences of three spaces are replaced by a single one.
  • Finally, sequences of two spaces are replaced by a single one.
  • I do not understand the last line, but it was in your code, so I left it. I do not know what case it is meant to cover.
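
As a rough check of the steps just described, here is the same function applied to a made-up sample (the string and the name `muestra` are not from the original post). It shows the extra space added after the period being collapsed again later:

```python
def correcciones(texto):
    texto = texto.replace('.', '. ')    # space after punctuation
    texto = texto.replace(':', ': ')
    texto = texto.replace(';', '; ')
    texto = texto.replace(',', ', ')
    texto = texto.replace('    ', '\t')  # four spaces become a TAB
    texto = texto.replace('   ', ' ')    # three spaces become one
    texto = texto.replace('  ', ' ')     # two spaces become one
    texto = texto.replace(' -', ':\t-')  # kept from the question's code
    return texto

# Hypothetical input: no space after ',' and ':', one space after '.'
muestra = "Hola,mundo. Esto va:bien"
print(repr(correcciones(muestra)))
```

The first step turns ". " into a period followed by two spaces, and the two-space replacement brings it back to one, so the printed result is 'Hola, mundo. Esto va: bien'.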

The code above will work reasonably well, except in rare cases such as a sequence of 7 spaces, which would result in a TAB and a space (because the first 4 are replaced by a TAB, and then the remaining 3 by a single space), when perhaps the result should be a TAB and three spaces. In any case, it is not well specified what the output should be in these borderline cases.
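
If the borderline cases matter, one option that is not part of the approach above (and assumes the `re` module is allowed in the exercise) is to handle a run of spaces of any length in a single pass. The rule used here, one TAB for runs of four or more and one space for runs of two or three, is only an assumption, since the requirements do not pin it down. The helper name `normalizar_espacios` is made up:

```python
import re

def normalizar_espacios(texto):
    # Assumed rule: a run of 4+ spaces becomes one TAB,
    # a run of 2 or 3 spaces becomes one space.
    def reemplazo(coincidencia):
        return "\t" if len(coincidencia.group(0)) >= 4 else " "
    return re.sub(r" {2,}", reemplazo, texto)

print(repr(normalizar_espacios("a" + " " * 7 + "b")))
print(repr(normalizar_espacios("a   b")))
```

Because each run is matched as a whole, the result does not depend on the order of several replace() calls.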

Uppercase after a period

This part cannot be done with replace().

The simplest approach that occurs to me is the following. Assume that after the previous transformations all sentences are "well formed", in the sense that each period is followed by exactly one space and then the next word (example: "Esto va bien. esto es otra frase"). Then we can:

  • Split the text at every occurrence of the string ". " (period and space).
  • Each resulting piece is a sentence. Capitalize the first word of each one using str.capitalize().
  • Reassemble the pieces into a single text, joining them with the sequence ". ".

This code does that:

oraciones = texto.split(". ")
oraciones_bien = []
for oracion in oraciones:
  oraciones_bien.append(oracion.capitalize())
texto = ". ".join(oraciones_bien)

It has the side effect of also capitalizing the first letter of the text, which seems right to me because it is the beginning of a sentence.

The code can be written much more compactly with a comprehension, if you have already covered them, like this:

texto = ". ".join(oracion.capitalize() for oracion in texto.split(". "))

Example:

def mayusc_tras_punto(texto):
  texto = ". ".join(oracion.capitalize() for oracion in texto.split(". "))
  return texto

ejemplo = "Esta es una frase. esta es otra. aqui va una tercera. Esta ya estaba bien."
print(mayusc_tras_punto(ejemplo))

Output:

Esta es una frase. Esta es otra. Aqui va una tercera. Esta ya estaba bien.
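
One caveat: str.capitalize() also lowercases the rest of each piece, so proper nouns or acronyms inside a sentence lose their capitals. A minimal sketch of a variant that only touches the first character of each sentence (the function name `mayusc_primera` and the sample text are made up here):

```python
def mayusc_primera(texto):
    # Uppercase only the first character of each sentence; leave the rest untouched.
    return ". ".join(o[:1].upper() + o[1:] for o in texto.split(". "))

ejemplo = "vivo en Madrid. trabajo con Python."

# With capitalize(), "Madrid" and "Python" are lowercased.
con_capitalize = ". ".join(o.capitalize() for o in ejemplo.split(". "))
print(con_capitalize)

# With the variant, the capitals inside each sentence are preserved.
print(mayusc_primera(ejemplo))
```

The slice o[:1] is used instead of o[0] so that an empty piece does not raise an error.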
    
answered by 05.01.2019 / 13:41