https://www.afaqs.com/news/story/53755_For-the-first-time-ever-Flipkart-has-let-YOU-set-their-offer-prices-during-The-Big-Billion-Days
TOPIC: An article, "Flipkart has let YOU set their offer prices during The Big Billion Days".
#Text mining
R is an open source language and environment for statistical computing and graphics, and RStudio is a widely used IDE for it. Packages such as tm, SnowballC, ggplot2 and wordcloud are used to carry out the text-processing steps mentioned earlier. The first prerequisite is that R and RStudio are installed on your machine.
Loading data into R
Start RStudio and open the Text Mining project that was created earlier. The next step is to load
the tm package, as it is not loaded by default. This is done using the library() function like so:
library(tm)
Loading required package: NLP
Dependent packages are loaded automatically – in this case the dependency is on the NLP
(natural language processing) package.
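As a quick sketch of the loading step (the file name bigbillion.txt is illustrative; any plain-text copy of the article will do):

```r
library(tm)

# Read the article into R, one element per line
# (the file name is illustrative)
text <- readLines("bigbillion.txt")

# Build a corpus in which each line is a document
docs <- Corpus(VectorSource(text))

# Inspect the first two documents
inspect(docs[1:2])
```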
Commands:
#Data cleaning:
Data cleaning is the process of detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or database. It involves identifying incomplete,
incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the
dirty or coarse data. Data cleansing may be performed interactively with data-wrangling tools, or in batch mode through scripting.
Uses:
The tm_map() function is used to remove unnecessary white space, to convert the text to lower
case, and to remove common stopwords like "the" and "we".
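These cleaning steps can be sketched as follows (a minimal example with an inline document rather than the full article):

```r
library(tm)

docs <- Corpus(VectorSource("The  Big Billion  Days: we set THE price"))

docs <- tm_map(docs, content_transformer(tolower))       # lower case
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop "the", "we", ...
docs <- tm_map(docs, stripWhitespace)                    # collapse extra spaces

inspect(docs[1])
```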
There are a few preliminary clean-up steps we need to do before we use these powerful
data-cleaning transformations. If you inspect some documents in the corpus (and we know how to do that
now), you will notice that I have some quirks in my writing. For example, I often use colons and
hyphens without spaces between the words separated by them.
Running removePunctuation on such text causes the two words on either side of the symbol to be
combined. Clearly, we need to fix this prior to the data cleaning, and then
inspect the corpus again to ensure that the offenders have been eliminated. This is also a good
time to check for any other special symbols that may need to be removed manually.
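One way to fix such symbols (a sketch; the toSpace helper is a local name, not a tm function) is to map them to spaces before removing punctuation:

```r
library(tm)

docs <- Corpus(VectorSource("price:drop on Big-Billion deals"))

# Helper that replaces a pattern with a space (name is illustrative)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

docs <- tm_map(docs, toSpace, ":")  # colons
docs <- tm_map(docs, toSpace, "-")  # hyphens

inspect(docs[1])
```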
If all is well, you can move on to the next step, which is to create the document term matrix (DTM).
Clearly there is nothing special about rows and columns – we could just as easily transpose
them. If we did so, we'd get a term document matrix (TDM) in which the terms are rows and
the documents are columns. One can work with either a DTM or a TDM; the script below uses a TDM.
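A minimal sketch of the two forms, showing that they are transposes of each other:

```r
library(tm)

docs <- Corpus(VectorSource(c("big billion days", "big brand sale")))

dtm <- DocumentTermMatrix(docs)  # documents as rows, terms as columns
tdm <- TermDocumentMatrix(docs)  # terms as rows, documents as columns

# The two matrices hold the same counts, transposed
all.equal(as.matrix(dtm), t(as.matrix(tdm)))  # should be TRUE
```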
# Word cloud
Interpretation:
Here we first load the wordcloud package, which is not loaded by default, using library(wordcloud).
The arguments of the wordcloud() function are straightforward enough. Note that one can specify the
maximum number of words to be included instead of the minimum frequency (as I have done
above). See the wordcloud documentation for more.
This word cloud also makes it clear that stop word removal has not done its job well; there are
a number of words it has missed ("also" and "however", for example). These can be removed by
augmenting the built-in stop word list with a custom one.
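Augmenting the stop word list can be sketched like so (the extra words "also" and "however" are the offenders noted above):

```r
library(tm)

docs <- Corpus(VectorSource("however the offer also cut the price"))

# Built-in English stop words plus our own additions
myStopwords <- c(stopwords("english"), "also", "however")
docs <- tm_map(docs, removeWords, myStopwords)

inspect(docs[1])
```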
When setting the thresholds, note that wordcloud() takes a minimum word frequency and a maximum number of words, not two frequencies; the script below uses min.freq = 100 and max.words = 200.
The above word cloud clearly shows that “October”, “idea”, “voice”, “set” ,“brand” and ”customer” are the six
most important words in the article “Flipkart has let YOU set their offer prices during The Big Billion Days”.
findFreqTerms(Thebigmillionday, lowfreq = 6)
[1] "big"        "billion"    "days"       "flipkart"   "assistant"  "bargaining" "experience"
[8] "google"     "price"
We can analyse the association between frequent terms (i.e., terms which correlate) using the findAssocs()
function. The R code below identifies which words are associated with "big" in "Flipkart has let YOU set
their offer prices during The Big Billion Days".
findAssocs(Thebigmillionday, terms = "big", corlimit = 0.5)
a=data.frame(word=names(f),freq=f)
head(a,10)
word freq
big big 8
billion billion 8
days days 8
assistant assistant 8
bargaining bargaining 8
experience experience 8
flipkart flipkart 6
google google 6
price price 6
can can 5
The first 10 most frequent words are plotted below:
The bar plot shows the most frequent words, the same ones identified with findFreqTerms(). The bars
are coloured using a palette such as brewer.pal(8, "Dark2"); alternatively, plain colour names like
"black" or "blue" can be supplied.
The full script I used in RStudio:
library(tm)
Thebigmillionday <- readLines(file.choose())
Thebigmillionday <- Corpus(VectorSource(Thebigmillionday))
#####################################################################
#Data cleaning
Thebigmillionday <- tm_map(Thebigmillionday, content_transformer(tolower)) # all text to lower case
Thebigmillionday <- tm_map(Thebigmillionday, removePunctuation)
Thebigmillionday <- tm_map(Thebigmillionday, removeNumbers)
Thebigmillionday <- tm_map(Thebigmillionday, removeWords, stopwords("english"))
Thebigmillionday <- tm_map(Thebigmillionday, stripWhitespace)
#####################################################################
#Term document matrix
Thebigmillionday <- TermDocumentMatrix(Thebigmillionday, control = list(wordLengths = c(1, Inf)))
flipkart <- as.matrix(Thebigmillionday)
f <- sort(rowSums(flipkart), decreasing = TRUE) # term frequencies, highest first
######################################################################
#wordcloud
library(wordcloud)
wordcloud(names(f), f, min.freq = 100, max.words = 200, colors = palette())
######################################################################
#findfrequencyterms
findFreqTerms(Thebigmillionday, lowfreq = 4)
a <- data.frame(word = names(f), freq = f)
head(a, 10)
######################################################################
#barplot of the most frequent words
library(RColorBrewer) # needed for brewer.pal()
barplot(a[1:10, ]$freq, las = 2, names.arg = a[1:10, ]$word,
        col = brewer.pal(8, "Dark2"), main = "Most frequent words",
        ylab = "Word frequencies")
#######################################################################