
DETECTION OF DEPRESSION RELATED POSTS

IN REDDIT SOCIAL MEDIA FORUM

GUIDE:
Dr. M. Sree Latha
Prof. in CSE

BATCH NO: 7

TEAM MEMBERS:
P. Sai Yasaswini
Nitheesha. B
K. Harish Kumar
INDEX
01 STEPS AND DATASET

02 PREPROCESSING

03 FEATURE EXTRACTION

04 CLASSIFICATION

05 CONCLUSION
STEPS:
• PREPROCESSING
• FEATURE EXTRACTION
• CLASSIFICATION
DATA SET
DATASET: SENTIMENT 140

DESCRIPTION:
• This sentiment dataset consists of 16,00,000 tuples and 6 columns.
• Out of these 16,00,000 tuples, 8,00,000 are depression-related tweets and the other 8,00,000 are non-depression-related tweets.
• We mainly focus on two columns, namely Sentiment and Tweet.

COLUMNS:
• SENTIMENT (POLARITY)
• ID
• DATE
• FLAG (NO_QUERY)
• USERNAME
• TWEET

File: dataset.csv
PREPROCESSING

STEP-1: Removing URLs, mentions and punctuation
STEP-2: Stemming
STEP-3: Removing stop words
RESULT: Clean tweets
HOW TO DO:

• URLs: the 're' module and Beautiful Soup, using the pattern r'https?://[^ ]+'
• Mentions: the 're' module, using the pattern r'@[A-Za-z0-9_]+'
• Stemming: NLTK's stem module (WordNetLemmatizer)
• Stop words: NLTK's corpus module (stopword list)
• RESULT: clean tweets, written back to data frames
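A minimal sketch of this cleaning pipeline in Python, assuming tweets arrive as plain strings. The two regex patterns are the ones listed above; the negation map and the exact stop-word handling are assumptions (the examples below keep negations such as 'not').

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# one-time setup: nltk.download('stopwords'); nltk.download('wordnet')
LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words('english')) - {'not', 'no'}  # keep negations (assumed)
NEGATIONS = {"couldn't": "could not", "can't": "can not"}     # partial map, illustrative

def clean_tweet(text):
    text = BeautifulSoup(text, 'html.parser').get_text()  # decode HTML entities
    text = re.sub(r'https?://[^ ]+', '', text)            # remove URLs
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)            # remove mentions
    text = text.lower()
    for contraction, expanded in NEGATIONS.items():       # expand negation words
        text = text.replace(contraction, expanded)
    text = re.sub(r'[^a-z]', ' ', text)                   # drop punctuation and digits
    tokens = [LEMMATIZER.lemmatize(w, pos='v') for w in text.split()]
    return ' '.join(w for w in tokens if w not in STOP_WORDS)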
EXAMPLE(1)
@caregiving couldn't bear to watch it. And I thought the UA loss was embarrassing . . . . .

Removal of URLs and mentions, expansion of negation words, and conversion to lowercase

could not bear to watch it and thought the ua loss was embarrassing

Applying Lemmatization

could not bear to watch it and think the ua loss be embarrass

Removing Stop words

could not bear watch think ua loss embarrass


EXAMPLE(2)
@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.
You shoulda got David Carr of Third Day to do it. ;D

Removal of URLs, mentions, punctuation and uppercase

awww that bummer you shoulda got david carr of third day to do it

Applying Lemmatization

awww that bummer you shoulda get david carr of third day to do it

Removing Stop words

awww bummer shoulda get david carr third day

IMPLEMENTATION

PREPROCESSING (RE, NLP) → FEATURE EXTRACTION (NLP) → CLASSIFICATION (ML) → RESULT
FEATURE EXTRACTION

Starting from the clean tweets, four kinds of features are extracted:

• N-GRAMS: used to calculate the probability of co-occurrence of each input sentence as a unigram and bigram.
• LDA: Latent Dirichlet Allocation is a probabilistic generative model helpful in discovering underlying topic structures.
• LIWC: the Linguistic Inquiry and Word Count dictionary can be used to obtain scores for standard linguistic dimensions, psychological processes and personal concerns.
• TF-IDF: Term Frequency–Inverse Document Frequency is a numeric statistic which highlights the importance of a word w.r.t. each document.
N-GRAM MODELING

• Used to calculate the probability of co-occurrence of each input sequence as unigrams and bigrams.
• An n-gram model is a type of probabilistic language model for predicting the next item in a sequence, in the form of an (n-1)-order Markov model.
N-GRAM MODELING(Contd..)
S1: This movie is Bad.
S2: This movie is Good.

Unigrams:
[This, movie, is, Bad, Good]

Bigrams:
[[This, movie], [movie, is], [is, Bad], [is, Good]]

Vectorising:
Unigrams: S1: [1,1,1,1,0]   S2: [1,1,1,0,1]
Bigrams:  S1: [1,1,1,0]     S2: [1,1,0,1]
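For reference, the same unigram/bigram counts can be produced with scikit-learn's CountVectorizer; this is only an illustrative sketch (note the column order follows the sorted vocabulary, not sentence order as in the slide).

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This movie is Bad.", "This movie is Good."]

uni = CountVectorizer(ngram_range=(1, 1))
print(uni.fit_transform(docs).toarray())
print(uni.get_feature_names_out())   # ['bad' 'good' 'is' 'movie' 'this']

bi = CountVectorizer(ngram_range=(2, 2))
print(bi.fit_transform(docs).toarray())
print(bi.get_feature_names_out())    # ['is bad' 'is good' 'movie is' 'this movie']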
N-GRAM MODELING(Contd..)
• TF-IDF will be used as a numeric statistic which highlights the importance of a word w.r.t. each document.

• TF: Term Frequency
The number of times a particular word has occurred in a given document.

• IDF: Inverse Document Frequency
The number of documents in which the word has appeared.

• Together they are called TF-IDF.
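A brief TF-IDF sketch on the same two sentences, using scikit-learn's TfidfVectorizer (the tool choice is an assumption; the slides only define the statistic itself).

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["This movie is Bad.", "This movie is Good."]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs).toarray()

# 'bad' and 'good' appear in only one document each, so they receive a
# higher weight than 'this', 'movie' and 'is', which appear in both.
for word, score in zip(tfidf.get_feature_names_out(), X[0].round(2)):
    print(word, score)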


LIWC
Linguistic Inquiry and Word Count

• The LIWC dictionary used in this demonstration is composed of 5,690 words and word stems. Each word or word stem defines one or more word categories.

• For example, the word 'cried' is part of four word categories: sadness, negative emotion, overall affect, and past-tense verbs. Hence, if it is found in the target text, each of these four category scale scores will be incremented. As this example suggests, many of the LIWC categories are arranged hierarchically: all anger words, by definition, are also categorized as negative emotion and overall emotion words.
LIWC(Contd..)

• Basically, it reads a given text and counts the percentage of words that reflect different emotions, thinking styles, social concerns, and even parts of speech.

• LIWC Dictionary: for each dictionary word, there is a corresponding dictionary entry that defines one or more word categories.
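LIWC itself ships as commercial software with a proprietary dictionary, so the snippet below is only a toy illustration of the mechanism: a tiny, invented dictionary maps words to categories, and category scores are reported as percentages of total words.

# Invented mini-dictionary for illustration only; the real LIWC
# dictionary has 5,690 words and word stems.
TOY_DICT = {
    'cried': ['sadness', 'negemo', 'affect', 'past'],
    'hate':  ['anger', 'negemo', 'affect'],
}

def liwc_scores(text):
    words = text.lower().split()
    counts = {}
    for w in words:
        for cat in TOY_DICT.get(w, []):
            counts[cat] = counts.get(cat, 0) + 1
    # each category score is a percentage of all words in the text
    return {cat: round(100 * n / len(words), 1) for cat, n in counts.items()}

print(liwc_scores("I cried because I hate this"))
# {'sadness': 16.7, 'negemo': 33.3, 'affect': 33.3, 'past': 16.7, 'anger': 16.7}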
LDA
Latent Dirichlet Allocation

• Used for topic modeling.
• A generative probabilistic model of a collection of composites made up of parts.
• Typically used to detect underlying topics in text documents.
• Particularly useful for finding reasonably accurate mixtures of topics within a given document.
LDA(Contd..)
Assumption

Documents with similar topics will use similar groups of words.

How to do:
• NLP
• Topic Modeling (a code sketch follows)
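A small topic-modeling sketch using scikit-learn's LatentDirichletAllocation; the library choice and the four-document toy corpus are assumptions (gensim is an equally common option).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "feel sad lonely cannot sleep",
    "no sleep again tired sad",
    "great movie fun weekend",
    "fun weekend with friends movie",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
print(lda.fit_transform(X).round(2))   # per-document topic mixtures

# top words per discovered topic
words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [words[i] for i in topic.argsort()[-3:]])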
CLASSIFICATION

ALGORITHMS:
• LOGISTIC REGRESSION
• SUPPORT VECTOR MACHINE
• ADAPTIVE BOOSTING
• RANDOM FOREST
• MULTILAYER PERCEPTRON
LOGISTIC REGRESSION
• Simple algorithm used for binary/multiclass classification tasks.
• The sigmoid function is used to squash values into the range [0,1].
• The output is the class label, either 1 (Yes) or 0 (No).
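A compact sketch of the sigmoid and of fitting a logistic regression with scikit-learn; the two example tweets and their labels are invented for illustration.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes any real value into [0, 1]

texts = ["feel hopeless and sad today", "wonderful day with friends"]
labels = [1, 0]                       # 1 = depression-related, 0 = not

X = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))                 # class labels: 1 (Yes) or 0 (No)
print(clf.predict_proba(X).round(2))  # the underlying sigmoid probabilities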
LOGISTIC REGRESSION(Contd..)

MERITS

• Provides a probabilistic (logistic) output for each prediction.
• Easy to interpret.
• Quick to update the model to incorporate new data.

DEMERITS

• Suffers from multicollinearity.
• Sensitive to extreme values of continuous variables.
SUPPORT VECTOR MACHINE

• Discriminative classifier formally defined by a separating hyperplane.
• Given labeled data, SVM outputs an optimal hyperplane which categorizes new examples.
• In a two-dimensional plane, the hyperplane is simply a line dividing the plane into two parts, with each class lying on one side.
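A minimal sketch with scikit-learn's LinearSVC on toy 2-D points, so the learned hyperplane really is a line; the data and the tool choice are illustrative assumptions.

from sklearn.svm import LinearSVC

# toy 2-D points: class 0 near the origin, class 1 farther out
X = [[0, 0], [1, 1], [4, 5], [5, 4]]
y = [0, 0, 1, 1]

svm = LinearSVC().fit(X, y)
print(svm.coef_, svm.intercept_)          # w and b of the line w.x + b = 0
print(svm.predict([[0.5, 0.5], [4.5, 4.5]]))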
SUPPORT VECTOR MACHINE(Contd..)

MERITS

• Good prediction in a variety of situations.
• Low generalization error.
• Easy to interpret results.

DEMERITS

• Computationally expensive; complexity is high.
• Requires more memory and time for training the model.
RANDOM FOREST

• Ensemble method which uses multiple learning models to gain better predictive results.
• Creates a forest with a number of decision trees.
• Decision trees are created from randomly selected subsets of the training set.
• Aggregates the votes from the different decision trees to decide the final class of the test object.
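A short sketch with scikit-learn's RandomForestClassifier; the synthetic data merely stands in for the extracted tweet features, and all sizes are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data standing in for the extracted tweet features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 100 trees, each trained on a random subset; prediction is a majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict(X[:5]))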
RANDOM FOREST(Contd..)

MERITS

• Efficient on large datasets.
• Deals well with high-dimensional data.

DEMERITS

• Observed to overfit on some datasets with noisy classification tasks.
• A large number of trees makes the algorithm slow for real-time prediction.
ADAPTIVE BOOSTING

• First practical boosting algorithm, proposed by Freund and Schapire in 1996.
• Focuses on classification problems and aims to convert a set of weak classifiers into a strong one.
• Used to boost the performance of a machine learning algorithm.
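A brief sketch with scikit-learn's AdaBoostClassifier on the same kind of synthetic data (an assumption; by default the weak learners are single-split decision trees).

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 50 weak learners are combined, each one reweighting the examples the
# previous ones misclassified
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(ada.score(X, y))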
ADAPTIVE BOOSTING(Contd..)
MERITS

• Simple to implement.
• Performs feature selection, resulting in a simple classifier.
• Fairly good generalization.

DEMERITS

• Sensitive to noisy data and outliers.

MULTI LAYER PERCEPTRON
• Class of feedforward artificial neural network.
• Consists of at least three layers: an input layer, a hidden layer and an output layer.
• Except for the input nodes, each node is a neuron that uses a non-linear activation function.
• Utilizes a supervised learning technique called backpropagation for training.
• Can distinguish data that is not linearly separable.
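A minimal sketch with scikit-learn's MLPClassifier; the hidden-layer size and iteration budget below are arbitrary illustrative choices.

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# one hidden layer of 50 ReLU units, trained with backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000,
                    random_state=0).fit(X, y)
print(mlp.score(X, y))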


MULTI LAYER PERCEPTRON(Contd..)

MERITS

• Powerful; can model complex functions.
• Adapts to unknown situations.

DEMERITS

• Can get stuck in a local minimum.
• The number of hidden layers must be set by the user.
EXTENSION

• Existing methods are developed based on single features.
• Single features may be inefficient and result in lower accuracy.
• To increase accuracy, combinations of these features are to be taken.
CONCLUSION

• Perform data preprocessing, feature extraction and classification.
• Multiple models are to be built to perform feature extraction and classification.
• Determine the best model, i.e. the one that gives the highest accuracy (a comparison sketch follows).
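One way the comparison step might look, sketched with scikit-learn on synthetic data; the particular models, the train/test split and the accuracy metric are assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)   # held-out accuracy
    print(f"{name}: {acc:.3f}")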
