Remove non-alphanumeric characters in R

1

I have a function in R that analyzes several text strings and returns all the words they contain.

library(twitteR)  # searchTwitter()
library(tm)       # Corpus(), tm_map(), removeWords(), stopwords()

data <- searchTwitter(input$select, n = input$numtweets)
data_text <- sapply(data, function(x) x$getText())
# Strip URLs: "http" followed by any run of non-whitespace characters
data_text <- gsub('http\\S+', '', data_text)
data_text_corpus <- Corpus(VectorSource(data_text))
# Re-encode to UTF-8, replacing unconvertible bytes with their hex codes
data_text_corpus <- tm_map(data_text_corpus,
                           content_transformer(function(x) iconv(x, to = 'UTF-8', sub = 'byte')))
data_text_corpus <- tm_map(data_text_corpus, removeNumbers)
data_text_corpus <- tm_map(data_text_corpus, content_transformer(tolower))
data_text_corpus <- tm_map(data_text_corpus, removePunctuation)
data_text_corpus <- tm_map(data_text_corpus, removeWords, stopwords(kind = "SMART"))

The problem is that it returns strings of non-alphanumeric characters such as:

asthma  
â€Ã 
„à   
ÂÃ   
attack  

I only want it to return words like:

asthma  
attack
    
asked by francesc 18.05.2017 at 10:57

3 answers

2

Indeed, @lois6b's solution is the right one; what remains is to apply it to the data.frame or vector, something like this:

> v <- c('„ÃA', 'AA')
> v[grepl('^[A-Za-z0-9]+$', v)]
[1] "AA"
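
The same filter can be applied to a column of a data.frame. The following is only a minimal sketch: the table `freq` and its columns `word` and `n` are hypothetical names invented for illustration, and the garbled entries are written as `\u` escapes so the snippet does not depend on the file encoding.

```r
# Hypothetical word-frequency table; `freq`, `word`, and `n` are made-up names
freq <- data.frame(
  word = c("asthma", "\u00e2\u20ac\u00c3", "\u00c2\u00c3", "attack"),
  n    = c(12, 3, 2, 7),
  stringsAsFactors = FALSE
)

# TRUE only where the whole string consists of ASCII letters or digits
is_alnum <- grepl("^[A-Za-z0-9]+$", freq$word)

# Keep only the rows whose word passed the check
clean <- freq[is_alnum, ]
clean$word
```

The `^` and `$` anchors matter here: without them, `grepl` would return TRUE for any string that contains at least one letter or digit, so a partly garbled entry such as '„ÃA' would be kept.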
    
answered by 18.05.2017 at 15:31
1

I am not sure, but from what I found while searching around, this function:

grepl('^[A-Za-z0-9]+$', str1)

uses regular expressions (

answered by 18.05.2017 at 11:20
0

When importing text such as tweets, the odd-looking characters may hide useful information. A step that comes before the filtering mentioned by Patricio is to "convert" the text:

Take the following example:

x <- c('asthma', 'â€Ã', '„Ã', 'ÂÃ', 'attack', 'gro\u00df', 'Ekstrøm', 'Jöreskog', 'bißchen', 'Zürcher')

You can try not to lose information in two ways:

x2 <- stringi::stri_trans_general(x, "latin-ascii")
x2

[1] "asthma" "a\u0080A" "a\u0080\u009eA" "AA"
[5] "attack" "gross" "Ekstrom" "Joreskog"
[9] "bisschen" "Zurcher"

In this case it works for gross, which was written with a Unicode escape, and it also converts the Swedish characters to their ASCII equivalents. Or this one:

x3 <- iconv(x,'utf-8','ascii', sub = '')
x3

[1] "asthma" "" "" "" "attack" "gro" "Ekstrm"
[8] "Jreskog" "bichen" "Zrcher"

This deletes every character that has no ASCII representation, so the garbled strings become empty, but the accented and special letters of the real words are lost as well.

Then zero-length strings can be removed.
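
As a minimal sketch of that last step, assuming a vector with the same contents as the `x3` result printed above, base R's `nzchar()` can be used to drop the empty strings (the name `x3_clean` is only illustrative):

```r
# Same contents as the x3 result shown above
x3 <- c("asthma", "", "", "", "attack", "gro", "Ekstrm",
        "Jreskog", "bichen", "Zrcher")

# nzchar() is TRUE for every string with at least one character
x3_clean <- x3[nzchar(x3)]
x3_clean
```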

The point is to try not to lose information.

    
answered by 24.10.2017 at 20:10