
DETECTION OF DEPRESSION RELATED POSTS

IN REDDIT SOCIAL MEDIA FORUM

GUIDE:
Dr. M. Sree Latha
Prof. in CSE

BATCH NO: 7

TEAM MEMBERS:
P. Sai Yasaswini
Nitheesha. B
K. Harish Kumar
INDEX
01 STEPS AND DATASET

02 PREPROCESSING

03 FEATURE EXTRACTION

04 CLASSIFICATION

05 CONCLUSION
STEPS:
• PREPROCESSING
• FEATURE EXTRACTION
• CLASSIFICATION
DATA SET
DATASET: SENTIMENT 140

DESCRIPTION:
• This sentiment dataset consists of 16,00,000 tuples and 6 columns.
• Out of these 16,00,000 tuples, 8,00,000 are depression-related tweets and the other 8,00,000 are non-depression-related tweets.
• We mainly focus on two columns, namely Sentiment and Tweet.

COLUMNS:
• SENTIMENT (POLARITY)
• ID
• DATE
• FLAG (NO_QUERY)
• USERNAME
• TWEET

File: dataset.csv
PREPROCESSING

STEP-1: Removing URLs, mentions and punctuation
STEP-2: Stemming
STEP-3: Removing stop words
RESULT: Clean tweets
HOW TO DO:

• URLs: the 're' module and Beautiful Soup, using the pattern r'https?://[^ ]+'
• Mentions: the 're' module, using the pattern r'@[A-Za-z0-9_]+'
• Stemming: NLTK's stem module (WordNetLemmatizer)
• Stop words: NLTK's corpus module (stopword list)
• RESULT: clean tweets, written back to data frames
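A minimal sketch of this cleaning pipeline in Python, assuming tweets arrive as plain strings. The two regex patterns are the ones listed above; the negation map and the exact stop-word handling are assumptions (the examples below keep negations such as 'not').

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# one-time setup: nltk.download('stopwords'); nltk.download('wordnet')
LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words('english')) - {'not', 'no'}  # keep negations (assumed)
NEGATIONS = {"couldn't": "could not", "can't": "can not"}     # partial map, illustrative

def clean_tweet(text):
    text = BeautifulSoup(text, 'html.parser').get_text()  # decode HTML entities
    text = re.sub(r'https?://[^ ]+', '', text)            # remove URLs
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)            # remove mentions
    text = text.lower()
    for contraction, expanded in NEGATIONS.items():       # expand negation words
        text = text.replace(contraction, expanded)
    text = re.sub(r'[^a-z]', ' ', text)                   # drop punctuation and digits
    tokens = [LEMMATIZER.lemmatize(w, pos='v') for w in text.split()]
    return ' '.join(w for w in tokens if w not in STOP_WORDS)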
EXAMPLE(1)
@caregiving couldn't bear to watch it. And I thought the UA loss was embarrassing . . . . .

Removal of URLs and mentions, expansion of negation words, and conversion to lowercase

could not bear to watch it and thought the ua loss was embarrassing

Applying Lemmatization

could not bear to watch it and think the ua loss be embarrass

Removing Stop words

could not bear watch think ua loss embarrass


EXAMPLE(2)
@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.
You shoulda got David Carr of Third Day to do it. ;D

Removal of URLs, mentions, punctuation and uppercase

awww that bummer you shoulda got david carr of third day to do it

Applying Lemmatization

awww that bummer you shoulda get david carr of third day to do it

Removing Stop words

awww bummer shoulda get david carr third day

IMPLEMENTATION

PREPROCESSING (RE, NLP) → FEATURE EXTRACTION (NLP) → CLASSIFICATION (ML) → RESULT
FEATURE EXTRACTION

Starting from the clean tweets, four kinds of features are extracted:

• N-GRAMS: used to calculate the probability of co-occurrence of each input sentence as a unigram and bigram.
• LDA: Latent Dirichlet Allocation is a probabilistic generative model helpful in discovering underlying topic structures.
• LIWC: the Linguistic Inquiry and Word Count dictionary can be used to obtain scores for standard linguistic dimensions, psychological processes and personal concerns.
• TF-IDF: Term Frequency–Inverse Document Frequency is a numeric statistic which highlights the importance of a word w.r.t. each document.
N-GRAM MODELING

• Used to calculate the probability of co-occurrence of each input sequence as unigrams and bigrams.
• An n-gram model is a type of probabilistic language model for predicting the next item in a sequence, in the form of an (n-1)-order Markov model.
N-GRAM MODELING(Contd..)
S1: This movie is Bad.
S2: This movie is Good.

Unigrams:
[This, movie, is, Bad, Good]

Bigrams:
[[This, movie], [movie, is], [is, Bad], [is, Good]]

Vectorising:
Unigrams: S1: [1,1,1,1,0]   S2: [1,1,1,0,1]
Bigrams:  S1: [1,1,1,0]     S2: [1,1,0,1]
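For reference, the same unigram/bigram counts can be produced with scikit-learn's CountVectorizer; this is only an illustrative sketch (note the column order follows the sorted vocabulary, not sentence order as in the slide).

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This movie is Bad.", "This movie is Good."]

uni = CountVectorizer(ngram_range=(1, 1))
print(uni.fit_transform(docs).toarray())
print(uni.get_feature_names_out())   # ['bad' 'good' 'is' 'movie' 'this']

bi = CountVectorizer(ngram_range=(2, 2))
print(bi.fit_transform(docs).toarray())
print(bi.get_feature_names_out())    # ['is bad' 'is good' 'movie is' 'this movie']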
N-GRAM MODELING(Contd..)
• TF-IDF will be used as a numeric statistic which highlights the importance of a word w.r.t. each document.

• TF: Term Frequency
The number of times a particular word has occurred in a given document.

• IDF: Inverse Document Frequency
The number of documents in which the word has appeared.

• Together they are called TF-IDF.
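A brief TF-IDF sketch on the same two sentences, using scikit-learn's TfidfVectorizer (the tool choice is an assumption; the slides only define the statistic itself).

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["This movie is Bad.", "This movie is Good."]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs).toarray()

# 'bad' and 'good' appear in only one document each, so they receive a
# higher weight than 'this', 'movie' and 'is', which appear in both.
for word, score in zip(tfidf.get_feature_names_out(), X[0].round(2)):
    print(word, score)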


LIWC
Linguistic Inquiry and Word Count

• The LIWC dictionary used in this demonstration is composed of 5,690 words and word stems. Each word or word stem defines one or more word categories.

• For example, the word 'cried' is part of four word categories: sadness, negative emotion, overall affect, and past-tense verbs. Hence, if it is found in the target text, each of these four category scale scores will be incremented. As this example suggests, many of the LIWC categories are arranged hierarchically: all anger words, by definition, are also categorized as negative emotion and overall emotion words.
LIWC(Contd..)

• Basically, it reads a given text and counts the percentage of words that reflect different emotions, thinking styles, social concerns, and even parts of speech.

• LIWC Dictionary: for each dictionary word, there is a corresponding dictionary entry that defines one or more word categories.
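LIWC itself ships as commercial software with a proprietary dictionary, so the snippet below is only a toy illustration of the mechanism: a tiny, invented dictionary maps words to categories, and category scores are reported as percentages of total words.

# Invented mini-dictionary for illustration only; the real LIWC
# dictionary has 5,690 words and word stems.
TOY_DICT = {
    'cried': ['sadness', 'negemo', 'affect', 'past'],
    'hate':  ['anger', 'negemo', 'affect'],
}

def liwc_scores(text):
    words = text.lower().split()
    counts = {}
    for w in words:
        for cat in TOY_DICT.get(w, []):
            counts[cat] = counts.get(cat, 0) + 1
    # each category score is a percentage of all words in the text
    return {cat: round(100 * n / len(words), 1) for cat, n in counts.items()}

print(liwc_scores("I cried because I hate this"))
# {'sadness': 16.7, 'negemo': 33.3, 'affect': 33.3, 'past': 16.7, 'anger': 16.7}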
LDA
Latent Dirichlet Allocation

• Used for topic modeling.
• A generative probabilistic model of a collection of composites made up of parts.
• Typically used to detect underlying topics in text documents.
• Particularly useful for finding reasonably accurate mixtures of topics within a given document.
LDA(Contd..)
Assumption

Documents with similar topics will use similar groups of words.

How to do:
• NLP
• Topic Modeling (a code sketch follows)
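A small topic-modeling sketch using scikit-learn's LatentDirichletAllocation; the library choice and the four-document toy corpus are assumptions (gensim is an equally common option).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "feel sad lonely cannot sleep",
    "no sleep again tired sad",
    "great movie fun weekend",
    "fun weekend with friends movie",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
print(lda.fit_transform(X).round(2))   # per-document topic mixtures

# top words per discovered topic
words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [words[i] for i in topic.argsort()[-3:]])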
CLASSIFICATION

ALGORITHMS:
• LOGISTIC REGRESSION
• SUPPORT VECTOR MACHINE
• ADAPTIVE BOOSTING
• RANDOM FOREST
• MULTILAYER PERCEPTRON
LOGISTIC REGRESSION
• Simple algorithm used for binary/multiclass classification tasks.
• The sigmoid function is used to squash values into the range [0,1].
• The output is the class label, either 1 (Yes) or 0 (No).
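A compact sketch of the sigmoid and of fitting a logistic regression with scikit-learn; the two example tweets and their labels are invented for illustration.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes any real value into [0, 1]

texts = ["feel hopeless and sad today", "wonderful day with friends"]
labels = [1, 0]                       # 1 = depression-related, 0 = not

X = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))                 # class labels: 1 (Yes) or 0 (No)
print(clf.predict_proba(X).round(2))  # the underlying sigmoid probabilities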
LOGISTIC REGRESSION(Contd..)

MERITS

• Provides a probabilistic (logistic) output for each prediction.
• Easy to interpret.
• Quick to update the model to incorporate new data.

DEMERITS

• Suffers from multicollinearity.
• Sensitive to extreme values of continuous variables.
SUPPORT VECTOR MACHINE

• Discriminative classifier formally defined by a separating hyperplane.
• Given labeled data, SVM outputs an optimal hyperplane which categorizes new examples.
• In a two-dimensional plane, the hyperplane is simply a line dividing the plane into two parts, with each class lying on one side.
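A minimal sketch with scikit-learn's LinearSVC on toy 2-D points, so the learned hyperplane really is a line; the data and the tool choice are illustrative assumptions.

from sklearn.svm import LinearSVC

# toy 2-D points: class 0 near the origin, class 1 farther out
X = [[0, 0], [1, 1], [4, 5], [5, 4]]
y = [0, 0, 1, 1]

svm = LinearSVC().fit(X, y)
print(svm.coef_, svm.intercept_)          # w and b of the line w.x + b = 0
print(svm.predict([[0.5, 0.5], [4.5, 4.5]]))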
SUPPORT VECTOR MACHINE(Contd..)

MERITS

• Good prediction in a variety of situations.
• Low generalization error.
• Easy to interpret results.

DEMERITS

• Computationally expensive; complexity is high.
• Requires more memory and time for training the model.
RANDOM FOREST

• Ensemble method which uses multiple learning models to gain better predictive results.
• Creates a forest with a number of decision trees.
• Decision trees are created from randomly selected subsets of the training set.
• Aggregates the votes from the different decision trees to decide the final class of the test object.
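A short sketch with scikit-learn's RandomForestClassifier; the synthetic data merely stands in for the extracted tweet features, and all sizes are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data standing in for the extracted tweet features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 100 trees, each trained on a random subset; prediction is a majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict(X[:5]))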
RANDOM FOREST(Contd..)

MERITS

• Efficient on large datasets.
• Deals well with high-dimensional data.

DEMERITS

• Observed to overfit on some datasets with noisy classification tasks.
• A large number of trees makes the algorithm slow for real-time prediction.
ADAPTIVE BOOSTING

• First practical boosting algorithm, proposed by Freund and Schapire in 1996.
• Focuses on classification problems and aims to convert a set of weak classifiers into a strong one.
• Used to boost the performance of a machine learning algorithm.
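A brief sketch with scikit-learn's AdaBoostClassifier on the same kind of synthetic data (an assumption; by default the weak learners are single-split decision trees).

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 50 weak learners are combined, each one reweighting the examples the
# previous ones misclassified
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(ada.score(X, y))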
ADAPTIVE BOOSTING(Contd..)
MERITS

• Simple to implement.
• Performs feature selection, resulting in a simple classifier.
• Fairly good generalization.

DEMERITS

• Sensitive to noisy data and outliers.

MULTI LAYER PERCEPTRON
• Class of feedforward artificial neural network.
• Consists of at least three layers: an input layer, a hidden layer and an output layer.
• Except for the input nodes, each node is a neuron that uses a non-linear activation function.
• Utilizes a supervised learning technique called backpropagation for training.
• Can distinguish data that is not linearly separable.
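A minimal sketch with scikit-learn's MLPClassifier; the hidden-layer size and iteration budget below are arbitrary illustrative choices.

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# one hidden layer of 50 ReLU units, trained with backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000,
                    random_state=0).fit(X, y)
print(mlp.score(X, y))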


MULTI LAYER PERCEPTRON(Contd..)

MERITS

• Powerful; can model complex functions.
• Adapts to unknown situations.

DEMERITS

• Can get stuck in a local minimum.
• The number of hidden layers must be set by the user.
EXTENSION

• Existing methods are developed based on single features.
• Single features may be inefficient and result in lower accuracy.
• To increase accuracy, combinations of these features are to be taken.
CONCLUSION

• Perform data preprocessing, feature extraction and classification.
• Multiple models are to be built to perform feature extraction and classification.
• Determine the best model, i.e. the one that gives the highest accuracy (a comparison sketch follows).
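One way the comparison step might look, sketched with scikit-learn on synthetic data; the particular models, the train/test split and the accuracy metric are assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)   # held-out accuracy
    print(f"{name}: {acc:.3f}")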
