I have a function in R that analyzes tweets matching different search strings and returns all the words they contain:
library(twitteR)  # searchTwitter()
library(tm)       # Corpus(), tm_map(), removeWords(), stopwords()
data <- searchTwitter(input$select, n=input$numtweets)
data_text <- sapply(data, function(x) x$getText())
data_text <- gsub('http\\S+', '', data_text)  # remove whole URLs, not just the "http" prefix
data_text_corpus <- Corpus(VectorSource(data_text))
data_text_corpus <- tm_map(
  data_text_corpus,
  content_transformer(function(x) iconv(x, to='UTF-8', sub='byte'))
)
data_text_corpus <- tm_map(data_text_corpus, removeNumbers)
data_text_corpus <- tm_map(data_text_corpus, content_transformer(tolower))
data_text_corpus <- tm_map(data_text_corpus, removePunctuation)
data_text_corpus <- tm_map(data_text_corpus, removeWords, stopwords(kind = "SMART"))
The problem is that it returns strings of non-alphanumeric characters such as:
asthma
â€Ã
„Ã
ÂÃ
attack
I just want it to return words like:
asthma
attack
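To make the goal concrete, here is a minimal sketch of the kind of filtering I mean. It assumes the stray strings are made up entirely of non-ASCII characters, so it would also strip legitimate accented letters. The sample_text vector is hypothetical and only stands in for data_text; in the code above, the same gsub() call would presumably be applied to data_text before Corpus() is built.

```r
# Hypothetical sample input standing in for the tweet text (data_text)
sample_text <- c("asthma \u00e2\u20ac\u00c3 attack", "\u00c2\u00c3 asthma")

# Assumption: the unwanted strings consist only of non-ASCII characters,
# so remove every character outside the ASCII range
ascii_only <- gsub("[^\\x01-\\x7F]", "", sample_text, perl = TRUE)

# Collapse the whitespace left behind and trim both ends
ascii_only <- trimws(gsub("\\s+", " ", ascii_only))

print(ascii_only)
```

With this filtering, only words such as asthma and attack should be left in each string.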