Remove non-alphanumeric characters in R

Question

1

I have a function in R that analyzes a set of text strings and returns all the words they contain.

library(twitteR)  # searchTwitter()
library(tm)       # Corpus(), tm_map() and the text transformations

# Fetch the tweets and extract their text
data <- searchTwitter(input$select, n = input$numtweets)
data_text <- sapply(data, function(x) x$getText())
data_text <- gsub('http+', '', data_text)

# Build the corpus and clean it step by step
data_text_corpus <- Corpus(VectorSource(data_text))
data_text_corpus <- tm_map(data_text_corpus,
                           content_transformer(function(x) iconv(x, to = 'UTF-8', sub = 'byte')))
data_text_corpus <- tm_map(data_text_corpus, removeNumbers)
data_text_corpus <- tm_map(data_text_corpus, content_transformer(tolower))
data_text_corpus <- tm_map(data_text_corpus, removePunctuation)
data_text_corpus <- tm_map(data_text_corpus, removeWords, stopwords(kind = "SMART"))

The problem is that it returns strings of non-alphanumeric characters such as:

asthma  
â€Ã 
â€žÃ    
ÂÃ   
attack

I just want it to return words like:

asthma  
attack

string r data

asked by francesc 18.05.2017 at 08:57

3 answers

score 2 · Answer 1

2

Indeed, @lois6b's solution is the right one; what remains is to "apply" it to the data.frame or vector, something like this:

> v <- c('â€žÃA', 'AA')
> v[grepl('^[A-Za-z0-9]+$', v)]
[1] "AA"
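
The same filter can also be applied to a data.frame column. This is only a minimal sketch: the data frame `df` and its column `word` are hypothetical names chosen for illustration and do not come from the question.

```r
# Hypothetical data frame; `word` is an assumed column name.
# Non-ASCII strings are written with \u escapes so the script does not
# depend on the encoding of the source file.
df <- data.frame(
  word = c("asthma", "\u00e2\u20ac\u00c3", "attack", "\u00c2\u00c3"),
  stringsAsFactors = FALSE
)

# ^ and $ anchor the pattern to the whole string, so a value is kept only
# if every character is an ASCII letter or digit.
is_alnum <- grepl("^[A-Za-z0-9]+$", df$word)

# Keep the matching rows; drop = FALSE preserves the data.frame structure.
clean <- df[is_alnum, , drop = FALSE]
clean$word
```

Note that this drops whole entries, so a legitimate word containing an accented letter would be discarded as well.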

answered by 18.05.2017 at 13:31

score 1 · Answer 2

1

I don't know R, but from what I've found searching around, this function:

grepl('^[A-Za-z0-9]+$', str1)

uses regular expressions (

answered by 18.05.2017 at 09:20

0

When importing tweets, for example, there may be useful information hidden in those characters. A step that comes before the one Patricio mentions is to "convert" the text:

Take the following example:

x <- c('asthma', 'â€Ã', 'â€žÃ', 'ÂÃ', 'attack', 'gro\u00df', 'Ekstrøm', 'Jöreskog', 'bißchen', 'Zürcher')

There are two ways to try to avoid losing information:

x2 <- stringi::stri_trans_general(x, "latin-ascii")
x2

[1] "asthma" "a\u0080A" "a\u0080\u009eA" "AA"
[5] "attack" "gross" "Ekstrom" "Joreskog"
[9] "bisschen" "Zurcher"

In this case it works for gross, which was written with a Unicode escape, and it converts the Swedish characters as well. Or this one:

x3 <- iconv(x,'utf-8','ascii', sub = '')
x3

[1] "asthma" "" "" "" "attack" "gro" "Ekstrm"
[8] "Jreskog" "bichen" "Zrcher"

This deletes characters from other languages or encodings and leaves only the ASCII ones.

Then zero-length strings can be removed.
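
A minimal sketch of that last step, using base R's `nzchar()`. The vector below simply copies the first values of the `iconv()` output shown above; the name `x3_clean` is an assumption made for illustration.

```r
# Values copied from the iconv() result above.
x3 <- c("asthma", "", "", "", "attack", "gro", "Ekstrm")

# nzchar() returns TRUE for every string with at least one character,
# so indexing with it drops the zero-length entries.
x3_clean <- x3[nzchar(x3)]
x3_clean
```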

The point is to try not to lose information.

answered by 24.10.2017 at 18:10

score 0 · Answer 3