
1.) Introduction:

The main objective of this project is to detect fake news, which is a classic text
classification problem with a straightforward proposition: build a model that can
differentiate between "real" news and "fake" news.

In this project, we use various natural language processing techniques and machine
learning algorithms from Python's scikit-learn library to classify fake news articles.

This project applies NLP (Natural Language Processing) techniques to detect 'fake news',
that is, misleading news stories that come from non-reputable sources. A model based only
on a count vectorizer (using word tallies) or a TF-IDF (Term Frequency-Inverse Document
Frequency) matrix (word tallies relative to how often the words are used in other articles
in the dataset) can only get you so far, because such models do not consider important
qualities like word ordering and context. Still, combatting fake news is a classic text
classification project with a straightforward proposition: is it possible to build a model
that can differentiate between "real" news and "fake" news? The proposed work assembles a
dataset of both fake and real news and employs a Naive Bayes classifier to create a model
that classifies an article as fake or real based on its words and phrases.
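As a concrete illustration of the two text transformations mentioned above, the sketch below (with invented headlines) shows how a count vectorizer and a TF-IDF matrix represent the same text differently:

```python
# Sketch: raw word tallies vs tallies weighted down for words that
# appear across many documents. The headlines are made up.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

headlines = [
    "senate passes budget bill",
    "aliens secretly control the senate",
]

counts = CountVectorizer().fit_transform(headlines).toarray()
tfidf = TfidfVectorizer().fit_transform(headlines).toarray()

print(counts)  # integer word tallies per headline
print(tfidf)   # "senate", which occurs in both headlines, is weighted
               # lower than words unique to one headline
```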

One big assumption underlying this project is that there is considerable overlap in the topics
covered by each class. If certain words or phrases show up more often in "real" news than in
"fake" news, it does not necessarily mean that those terms are associated more with "real"
news; it could just mean that those words are used in topics more common in the real news
dataset. For example, the names of politicians show up more often in articles classified as
"real" than in ones classified as "fake". It would be a huge error to conclude that articles
mentioning politicians are significantly more likely to be factual: some of the most
prominent and outlandish fake news articles circulating on the internet are ludicrous
conspiracy theories about politicians.

Fake news detection has unique characteristics and presents new challenges. First, fake
news is intentionally written to mislead readers into believing false information, which
makes it difficult to detect based on news content alone. Thus, we need to include auxiliary
information, such as users' social engagements on social media, to help differentiate it
from true news. Second, exploiting this auxiliary information is nontrivial in and of
itself, as users' social engagements with fake news produce data that is big, incomplete,
unstructured, and noisy.
2.) Feasibility Study:

Social media for news consumption is a double-edged sword. On the one hand, its low cost,
easy access, and rapid dissemination of information allow users to consume and share news.
On the other hand, it can make "fake news", i.e., low-quality news with intentionally false
information, go viral. The quick spread of fake news has the potential for calamitous
impacts on individuals and society. For example, the most popular fake news stories were
more widely spread on Facebook than the most popular authentic mainstream news during
the 2016 U.S. presidential election.
It is a sign of the times that in 2018 the Government established a new unit to tackle fake
news, and every day seems to reveal more about the dirty tricks played by companies like
Cambridge Analytica, including deliberately spreading misinformation, to try to influence
electorates in favour of whoever happens to be paying them. This is a problem because it
hands more power to those with money, people and groups who already have plenty, and
away from ordinary people, whom democracy is supposed to serve. Whether or not our
political preferences happen to be served by recent fake news, everybody should be
concerned about this. It is also becoming increasingly difficult to doubt that certain
foreign states are interfering in the elections and referendums of other countries in a
significant way, by spreading fake news to those who are most willing to receive it.

The recent advances in the availability of artificial intelligence present us with an
opportunity to tackle this problem at scale.

Fake news detection on social media presents unique characteristics and challenges that
make existing detection algorithms from traditional news media ineffective or not
applicable. First, fake news is intentionally written to mislead readers to believe false
information, which makes it difficult and nontrivial to detect based on news content;
therefore, we need to include auxiliary information, such as user social engagements on
social media, to help make a determination. Second, exploiting this auxiliary information is
challenging in and of itself as users' social engagements with fake news produce data that is
big, incomplete, unstructured, and noisy. Because the issue of fake news detection on social
media is both challenging and relevant, we conducted this survey to further facilitate
research on the problem. In this survey, we present a comprehensive review of detecting
fake news on social media, including fake news characterizations on psychology and social
theories, existing algorithms from a data mining perspective, evaluation metrics and
representative datasets. We also discuss related research areas, open problems, and future
research directions for fake news detection on social media.
3.) Methodology:
A large body of research exists on machine learning methods for deception detection,
most of it focused on classifying online reviews and publicly available social media
posts. Particularly since the American presidential election in late 2016, the question
of determining 'fake news' has also been the subject of particular attention in the
literature.

News content based approaches focus on extracting various features from fake news content,
and can be knowledge-based or style-based. Since fake news attempts to spread false claims,
knowledge-based approaches use external sources to fact-check the truthfulness of the
claims in news content. In addition, fake news publishers often have malicious intent to
spread distorted and misleading information, and adopt particular writing styles, not seen
in true news articles, to appeal to and persuade a wide scope of consumers. Style-based
approaches try to detect fake news by capturing this manipulation in the writing style.

Social context based approaches aim to utilize user social engagements as auxiliary
information to help detect fake news. Stance-based approaches utilize users’ viewpoints
from relevant post contents to infer the veracity of original news articles. In addition,
propagation-based approaches reason about the relations of relevant social media posts to
guide the learning of credibility scores by propagating credibility values between users,
posts, and news. The veracity of a news piece is aggregated from the credibility values of
relevant social media posts.
In this project a model is built based on a count vectorizer or a TF-IDF matrix (i.e. word
tallies relative to how often they are used in other articles in the dataset). Since this
problem is a kind of text classification, implementing a Naive Bayes classifier is a
natural choice, as it is standard for text-based processing. The key decisions in
developing the model are the text transformation (count vectorizer vs. TF-IDF vectorizer)
and the type of text to use (headlines vs. full text). The next step is to extract the
most optimal features for the count or TF-IDF vectorizer. This is done by using the n
most-used words and/or phrases, optionally lower-casing, removing stop words (common
words such as "the", "when", and "there"), and keeping only those words that appear at
least a given number of times in the text dataset.
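The steps above can be sketched as a scikit-learn pipeline; the training statements below are invented placeholders, not data from the project:

```python
# Minimal sketch: lower-casing, stop-word removal, unigram+bigram counts
# with a minimum document frequency, feeding a Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

statements = [
    "the senate passed the budget bill on tuesday",
    "unemployment fell to a ten year low",
    "secret world government controls the weather",
    "celebrity clone spotted at the airport",
]
labels = ["Real", "Real", "Fake", "Fake"]

model = Pipeline([
    ("vect", CountVectorizer(
        lowercase=True,        # lower-casing
        stop_words="english",  # drop common words like "the", "when"
        ngram_range=(1, 2),    # single words and two-word phrases
        min_df=1,              # keep terms appearing at least once
    )),
    ("clf", MultinomialNB()),
])
model.fit(statements, labels)
print(model.predict(["budget bill passed by the senate"]))
```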
4.) Module and Team-Member-wise Work Distribution:

Dataset used: (Dataset Creation is handled by Kshitiz)

The data source used for this project is the LIAR dataset, which contains three files in
.tsv format for test, train and validation. Below is a description of the data files
used for this project.

LIAR: A BENCHMARK DATASET FOR FAKE NEWS DETECTION

William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for
Fake News Detection, to appear in Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics (ACL 2017), short paper,
Vancouver, BC, Canada, July 30-August 4, ACL. The original dataset contains the
following 14 variables/columns for the train, test and validation sets:

• Column 1: the ID of the statement ([ID].json).
• Column 2: the label (the label class contains: True, Mostly-true, Half-true,
  Barely-true, False, Pants-fire).
• Column 3: the statement.
• Column 4: the subject(s).
• Column 5: the speaker.
• Column 6: the speaker's job title.
• Column 7: the state info.
• Column 8: the party affiliation.
• Columns 9-13: the total credit history count, including the current statement:
  • 9: barely true counts.
  • 10: false counts.
  • 11: half true counts.
  • 12: mostly true counts.
  • 13: pants on fire counts.
• Column 14: the context (venue/location of the speech or statement).

To keep things simple, only two variables are chosen from the original dataset for this
classification. The other variables can be added later to add complexity and enhance the
features.

Below are the columns used to create the three datasets used in this project:

• Column 1: Statement (news headline or text).
• Column 2: Label (the label class contains: True, False).

The newly created dataset has only two classes, compared to six in the original. Below is
the mapping used to reduce the number of classes:

• Original -- New
• True -- True
• Mostly-true -- True
• Half-true -- True
• Barely-true -- False
• False -- False
• Pants-fire -- False

The datasets used for this project are in .csv format, named train.csv, test.csv and
valid.csv.
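The label reduction described above can be sketched in pandas; the sample rows here are invented, and in the project the mapping would be applied to the LIAR .tsv files:

```python
import pandas as pd

# Six-to-two label reduction, as described in the mapping above.
six_to_two = {
    "true": "True", "mostly-true": "True", "half-true": "True",
    "barely-true": "False", "false": "False", "pants-fire": "False",
}

df = pd.DataFrame({
    "Statement": ["Crime fell statewide last year.",
                  "The moon landing was staged."],
    "Label": ["Mostly-true", "Pants-fire"],
})
df["Label"] = df["Label"].str.lower().map(six_to_two)
print(df)  # labels are now just True / False
```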
File descriptions:

DataPrep.py:(Handling: Kshitiz and Ankit Singh)

This file contains all the pre-processing functions needed to process the input documents
and texts. First the train, test and validation data files are read, then pre-processing
such as tokenizing and stemming is performed.
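A minimal sketch of the tokenizing and stemming steps, using NLTK's Porter stemmer (the helper name process_statement is illustrative, not from the project code):

```python
# Lower-case a statement, tokenize on whitespace, keep alphabetic
# tokens, and reduce each token to its stem.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def process_statement(text):
    return [stemmer.stem(tok) for tok in text.lower().split() if tok.isalpha()]

print(process_statement("Politicians are running for election"))
```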

FeatureSelection.py: (Handling: Raunak Jalan and Ankit Singh)

In this file, feature extraction and selection methods from the scikit-learn Python
library are applied. For feature selection, methods like simple bag-of-words and
n-grams are used.
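A small sketch of the bag-of-words and n-gram features mentioned above; the two example documents are invented:

```python
# Bag-of-words (unigrams) vs two-word phrases (bigrams) extracted
# from the same pair of documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the economy is growing", "the economy is shrinking"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))  # single words
print(sorted(bigrams.vocabulary_))   # phrases such as "the economy"
```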

classifier.py: (Handling: Raunak Jalan)

Here all the classifiers for predicting fake news are built. The extracted features are
fed into different classifiers, and each of the extracted feature sets is used with all
of the classifiers. Finally, the selected model is used for fake news detection,
outputting a probability of truth.
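Feeding the same extracted features into several classifiers and reading off a probability of truth can be sketched as follows (tiny invented data; the project's real classifiers and features may differ):

```python
# Fit two candidate classifiers on the same TF-IDF features and
# report each one's estimated probability of truth for a statement.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

statements = [
    "taxes were cut last year",
    "unemployment fell in march",
    "secret cabal runs the weather",
    "moon base hidden from the public",
]
labels = [1, 1, 0, 0]  # 1 = real, 0 = fake

X = TfidfVectorizer().fit_transform(statements)
for clf in (MultinomialNB(), LogisticRegression()):
    clf.fit(X, labels)
    prob_true = clf.predict_proba(X[:1])[0, 1]  # P(real) for first statement
    print(type(clf).__name__, round(prob_true, 3))
```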

prediction.py: (Handling: Ankit Singh)

The selected model is copied to the user's machine and used by the prediction.py file to
classify news. It takes a news article as input from the user; the model then produces
the final classification output, which is shown to the user along with the probability
of truth.
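One common way to realize this hand-off is to serialize the fitted model; the sketch below assumes joblib and a file name final_model.sav, neither of which is specified in the report:

```python
# classifier.py saves the fitted pipeline once; prediction.py loads it
# and classifies an incoming article. Training data here is invented.
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model = Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])
model.fit(["rates rose again", "lizard people rule the banks"], [1, 0])

joblib.dump(model, "final_model.sav")    # done once, by classifier.py
loaded = joblib.load("final_model.sav")  # done in prediction.py

article = "rates rose for the third time"
prob_true = loaded.predict_proba([article])[0, 1]
print(f"Probability of truth: {prob_true:.2f}")
```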
5.) Software and Hardware Requirements:
I. Software Requirements:
a) Python: Python is an interpreted, high-level, general-purpose
programming language. Created by Guido van Rossum and first released
in 1991, Python has a design philosophy that emphasizes code
readability, notably using significant whitespace. It provides constructs
that enable clear programming on both small and large scales.
b) Numpy: NumPy is a library for the Python programming language,
adding support for large, multi-dimensional arrays and matrices, along
with a large collection of high-level mathematical functions to operate on
these arrays.
c) Pandas: Pandas is a software library written for the Python programming
language for data manipulation and analysis. In particular, it offers data
structures and operations for manipulating numerical tables and time
series. It is free software released under the three-clause BSD license.
d) Matplotlib: Matplotlib is a plotting library for the Python programming
language and its numerical mathematics extension NumPy. It provides an
object-oriented API for embedding plots into applications using general-
purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+.
e) Scikit-learn: Scikit-learn provides a range of supervised and
unsupervised learning algorithms via a consistent interface in Python. It
is licensed under a permissive simplified BSD license and is distributed
in many Linux distributions, encouraging academic and commercial use. The
library is built upon SciPy (Scientific Python), which must be installed
before scikit-learn can be used.

f) NLTK: The Natural Language Toolkit, or more commonly NLTK, is a
suite of libraries and programs for symbolic and statistical natural
language processing for English, written in the Python programming
language.

II. Hardware Requirements:


a) RAM: 8 Gigabytes
b) Operating System: Windows 10
c) Graphic Card: Not necessary but recommended (4 GB NVIDIA).
d) Storage: 1 TB
e) Processor: Intel i5
6.) Bibliography:
I. N. J. Conroy, V. L. Rubin, and Y. Chen, “Automatic deception detection: Methods
for finding fake news,” Proceedings of the Association for Information Science and
Technology, vol. 52, no. 1, pp. 1–4, 2015.
II. S. Feng, R. Banerjee, and Y. Choi, “Syntactic stylometry for deception detection,” in
Proceedings of the 50th Annual Meeting of the Association for Computational
Linguistics: Short Papers-Volume 2, Association for Computational Linguistics,
2012, pp. 171–175.
III. Shlok Gilda, Department of Computer Engineering, "Evaluating Machine Learning
Algorithms for Fake News Detection", 2017 IEEE 15th Student Conference on
Research and Development (SCoReD).
IV. https://www.udemy.com/30-days-of-python/learn/v4/overview (For Python)
V. https://www.youtube.com/playlist?list=PLD63A284B7615313A (For Machine
Learning Maths)
VI. https://www.udemy.com/machinelearning/?ranMID=39197&ranEAID=vedj0cWlu2
Y&ranSiteID=vedj0cWlu2Y-
dBl7UJRL3a4csA6paUwrwA&LSNPUBID=vedj0cWlu2Y
VII. https://www.udemy.com/data-analysis-with-pandas/learn/v4/ (For Pandas Library)
VIII. https://www.udemy.com/machine-learning-masterclass/learn/v4/
(For Machine Learning)
IX. https://www.youtube.com/playlist?list=PLQiyVNMpDLKnZYBTUOlSI9mi9wAEr
FtFm (For Natural Language Processing)
X. https://www.kaggle.com/getting-started/
XI. William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for
Fake News Detection, to appear in Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (ACL 2017), short paper, Vancouver,
BC, Canada, July 30-August 4, ACL.
