chartr() removes documents from the corpus - R


I'm using R 3.3.1 on Windows 10. I have a set of 3099 txt files that I am using for text mining with the tm package.

The code was working perfectly, but suddenly it started to fail.

When I try to remove the accents from my corpus, the documents disappear.

I traced the problem to the following line of code, which I use to remove the accents:

setwd("C:/txt")
library(tm) 
cname <- file.path("C:", "txt")
docs <- Corpus(DirSource(cname))
docs <- tm_map(docs, tolower)
docs
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3099
docs <- chartr("áéíóú", "aeiou", docs)   # remove accents
docs <- Corpus(VectorSource(docs))   # back to a corpus
docs
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3

As you can see, suddenly the 3099 documents are now only 3, and those 3 are blank.

No error or warning was generated. The strangest thing is that this code used to work.

Can anyone help me with this problem? Since R does not report an error, I do not know how to track it down.

    
asked by Santiago Bel 13.11.2016 at 02:03

1 answer


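You cannot apply chartr directly to a VCorpus object. chartr expects a character vector and coerces anything else with as.character, so the corpus is most likely flattened into one string per internal component of the object rather than one per document, which would explain why only 3 documents remain. Once the corpus has been created with tm, use the package's transformation API via tm_map instead.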

library(tm)
docs <- c("esto sí es un documento.", "éste no lo es.")
corp <- Corpus(VectorSource(docs))
print(corp[[1]]$content)
[1] "esto sí es un documento."

removeAccents <- content_transformer(function(x) chartr("áéíóú", "aeiou", x))
corp <- tm_map(corp, removeAccents)
print(corp[[1]]$content)
[1] "esto si es un documento."

You can inspect the intermediate objects with R's str function.
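
To see why the document count can collapse to 3, here is a minimal base-R sketch that does not need tm. It assumes that a VCorpus is internally a list with three components named content, meta, and dmeta; fake_corpus is a hypothetical stand-in, not a real tm object, and the ASCII vowel mapping is used only to keep the sketch independent of the locale.

```r
# Hypothetical stand-in for a VCorpus: a plain list with three components.
# The component names mirror tm's internals as an assumption.
fake_corpus <- list(
  content = list("esto si es un documento.", "este no lo es."),
  meta    = list(),
  dmeta   = data.frame()
)

# str() shows the top-level structure: three components, not two documents.
str(fake_corpus, max.level = 1)

# chartr() coerces non-character input with as.character(), which produces
# one string per top-level list element instead of one per document.
flattened <- chartr("aeiou", "AEIOU", fake_corpus)
n_after <- length(flattened)
print(n_after)

# Applying chartr() to each document separately keeps the count intact.
# This is roughly the role tm_map() with content_transformer() plays.
per_doc <- vapply(
  fake_corpus$content,
  function(doc) chartr("aeiou", "AEIOU", doc),
  character(1)
)
n_docs <- length(per_doc)
print(n_docs)
```

With a real corpus, running str on the object before and after each step should reveal the same kind of change. It may also be worth wrapping tolower in content_transformer, since recent tm versions expect transformations passed to tm_map to return documents rather than bare character vectors.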

    
answered by 13.11.2016 / 11:08