TOOL
Session 2014-2018
Project Advisors
Dr. Syed Asim Ali
Submitted By
degree
Bachelor of Science
in
Computer Science
by
University of Karachi
(Signed)
Date
In the past few years, the popularity of social media has grown dramatically, with more
and more users sharing all kinds of information through different platforms. Companies
use social media platforms to promote their brands, professionals maintain a public
profile online and use social media for networking, and regular users discuss any
topic. More users also means more data waiting to be mined.
Social media has turned the whole world into a global village, bringing people together
on shared platforms where they can exchange their thoughts, images, and feelings with the
people they know or are in contact with. There are multiple social media platforms on
which a person can do all such things.
Some well-known social media platforms are Twitter, Facebook, and Instagram, but the
most powerful of them for this project is Twitter, because it has a unique way of
connecting users: a user gets in touch with, and receives updates from, the people they
follow. Unlike Facebook, Twitter uses an asymmetric pattern of following and being
followed, which gives a user more reach over other users.
The main purpose of this project is to gather the data and activities that a Twitter user
generates on their timeline, analyze those tweets and activities through high-level
analysis and manipulation of the data, and show the end user a concise report of that
user in very little time. From this report the end user can easily judge the user's
personality, nature, and way of thinking, and can quickly make decisions about that
user, for example whether or not to follow them.
OVERVIEW
The main purpose of Social Reach is to apply data mining techniques to Twitter
using Python in order to extract interesting and useful insights about a user. In
2013, Twitter reported a volume of 500+ million tweets per day. These numbers are
just the tip of the iceberg: the popularity of social media has grown exponentially,
with more users sharing more and more information through Twitter. This wealth of
data provides unique opportunities for data mining practitioners to apply their
skills and surface interesting facts and figures.
OPPORTUNITIES
The key opportunity in developing data mining systems is to extract useful insights from
the data. The aim of the process is to answer interesting (and sometimes difficult)
questions using data mining techniques, enriching our knowledge about a particular
domain. This project offers many such opportunities. For example, with this product
you can easily mine any Twitter user's account, whether it belongs to a friend, a family
member, a favorite personality, a colleague, or any other person with a legitimate
Twitter account, and you will be provided with the latest real-time insights into that
user's activity.
SOFTWARE PLATFORM
Social Reach is a web application built with Django (version 1.9), a web application
development framework for Python. The core logic, which software engineers usually
refer to as the back end of a web application, is written in Python (version 3)
following the conventions of the language, while the user interface, usually called
the front end, is built with HTML, CSS, JavaScript, and jQuery.
CHALLENGES
Alongside these opportunities, the development team also faced challenges that affect
the performance of this product. Social Reach is a web application, which means anyone
can use it from a mobile phone, laptop, tablet, or personal computer, but an internet
connection is required to reach the application. Following are some of the major
challenges:
1) Authentication
a. The user agrees with the consumer to grant access to the social media platform.
b. As the user doesn't give their social media password directly to the consumer, the
consumer has an initial exchange with the resource provider to generate a token
and a secret. These are used to sign each request and prevent forgery.
c. The user is then redirected with the token to the resource provider, which will ask
to confirm authorizing the consumer to access the user's data.
d. Depending on the nature of the social media platform, it will also ask to confirm
whether the consumer can perform any action on the user's behalf, for example,
post an update, share a link, and so on.
e. The resource provider issues a valid token for the consumer.
f. The user is redirected back to the consumer with the valid token, confirming the access.
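Step (b) above relies on signing each request with the token and secret; on Twitter this follows the OAuth 1.0a HMAC-SHA1 scheme. The following is a simplified sketch of that signature (the function name is ours, and some percent-encoding edge cases of the spec are skipped):

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def sign_request(method, url, params, consumer_secret, token_secret):
    """Build an OAuth 1.0a-style HMAC-SHA1 signature (simplified sketch)."""
    # Percent-encode and sort the request parameters, then join as key=value pairs.
    encoded = sorted((quote(k, safe=''), quote(str(v), safe=''))
                     for k, v in params.items())
    param_string = '&'.join('%s=%s' % kv for kv in encoded)
    # The signature base string ties the HTTP method, URL, and parameters together,
    # so tampering with any of them invalidates the signature.
    base = '&'.join([method.upper(), quote(url, safe=''), quote(param_string, safe='')])
    # The signing key combines the consumer secret and the token secret.
    key = quote(consumer_secret, safe='') + '&' + quote(token_secret, safe='')
    digest = hmac.new(key.encode(), base.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()
```

Because the secrets never travel with the request, the resource provider can recompute the same signature on its side and reject any forged or modified request.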
2) Fetching Data
Twitter gives us access to some of its APIs (Application Programming Interfaces),
which we call from our back end in order to get data; these APIs return the requested
data in JSON (JavaScript Object Notation) format. When using a third-party API,
developers don't need to worry about the internals of the component, only about how
to use it. With the term Web API, we refer to a web service that exposes a number of
URIs to the public, possibly behind an authentication layer, to access the data. A
common architectural approach for designing this kind of API is called
Representational State Transfer (REST).
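For illustration, here is how such a JSON response is handled in Python (the payload below is a trimmed-down, made-up example; real Tweet objects carry many more fields):

```python
import json

# A trimmed-down, made-up payload using field names from the Twitter v1.1 Tweet object.
raw = '{"id": 1, "text": "Hello #world", "user": {"screen_name": "example"}}'

tweet = json.loads(raw)              # parse the JSON string into a Python dict
print(tweet['user']['screen_name'])  # nested JSON objects become nested dicts
```

Once parsed, the response behaves like any other Python dictionary, which is what makes JSON so convenient to work with on the back end.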
3) Data Volume:
When dealing with social data, we're often dealing with big data. To understand the
meaning of big data and the challenges it entails, we can go back to the traditional
definition (3D Data Management: Controlling Data Volume, Velocity and Variety, Doug
Laney, 2001) that is also known as the three Vs of big data: volume, variety, and
velocity. Over the years, this definition has also been expanded by adding more Vs, most
notably value, as providing value to an organization is one of the main purposes of
exploiting big data. Regarding the original three Vs, volume means dealing with data that
spans over more than one machine. This, of course, requires a different infrastructure
from small data processing (for example, in-memory). Moreover, volume is also
associated with velocity in the sense that data is growing so fast that the concept of big
becomes a moving target. Finally, variety concerns how data comes in different
formats and structures, often incompatible with one another and with different semantics.
Data from social media can tick all three Vs. The data provided to us by Twitter is in
JSON format, which means we can classify it as semi-structured data.
4) Rate Limits:
The Twitter API limits access to applications. These limits are set on a per-user basis, or
to be more precise, on a per-access-token basis. This means that when an application uses
the application-only authentication, the rate limits are considered globally for the entire
application; while with the per-user authentication approach, the application can enhance
the global number of requests to the API.
The implication of hitting the API limits is that Twitter will return an error message
rather than the data we're asking for. Moreover, if we keep making requests, the time
required to regain regular access will increase, as Twitter could flag us as potential
abusers. When our application needs many API requests, we need a way to avoid this.
In Python, the time module, part of the standard library, allows us to include
arbitrary suspensions of the code execution using the time.sleep() function.
For example, a pseudo-code is as follows:
# Assume first_request() and second_request() are defined.
# They are meant to perform an API request.
import time
first_request()
time.sleep(10)
second_request()
In this case, the second request will be executed 10 seconds (as specified by the sleep()
argument) after the first one.
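Beyond a fixed sleep, the same idea can be wrapped in a small retry helper with exponential backoff. The following is a generic sketch (the delays are illustrative, and a plain RuntimeError stands in for the client's rate-limit exception):

```python
import time

def call_with_backoff(request_fn, max_retries=3, base_delay=1.0):
    """Call request_fn(), sleeping base_delay * 2**attempt between failures."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RuntimeError:               # stand-in for a rate-limit error
            if attempt == max_retries - 1:
                raise                      # give up after the last retry
            # Each retry waits twice as long as the previous one.
            time.sleep(base_delay * 2 ** attempt)
```

Doubling the delay after each failure keeps the application from hammering the API while it is being throttled, which is exactly the behavior that can get an application flagged.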
In order to interact with the Twitter APIs, we are using a Python client that implements
the different calls to the API itself. There are several options to choose from; none of
them is officially maintained by Twitter, and all are backed by the open source
community. While several of them are almost equivalent, we chose Tweepy, as it offers
wider support for different features and is actively maintained. We have installed
Tweepy version 3.3 in order to authenticate and start fetching data from Twitter.
Using the Tweepy package, we define two methods by which we can easily authenticate
our application and create a Twitter client; the client creates the API object needed
to interface with Twitter.
All the code related to authentication and to creating the Twitter client is written in
the authentication.py file, our custom module, which we use frequently to authenticate
and initialize the client. The code in the authentication.py module is as follows:
Here we define two methods. The first, get_twitter_auth(), takes no arguments and is
responsible for authentication; the second, get_twitter_client(), also takes no
arguments and creates an instance of tweepy.API, used for many different types of
interaction with Twitter.
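Since the listing itself is not reproduced above, a minimal sketch of what authentication.py might look like follows. It assumes, as is common practice, that the four Twitter credentials are read from environment variables; the variable names and the read_credentials helper are our assumptions:

```python
import os

def read_credentials():
    """Read the four Twitter OAuth credentials from environment variables.

    The variable names below are an assumption of this sketch, not a Twitter
    or Tweepy convention.
    """
    return {name: os.environ[name]
            for name in ('TWITTER_CONSUMER_KEY', 'TWITTER_CONSUMER_SECRET',
                         'TWITTER_ACCESS_TOKEN', 'TWITTER_ACCESS_SECRET')}

def get_twitter_auth():
    """Build a tweepy OAuth handler from the credentials."""
    import tweepy  # imported locally so read_credentials() stays standalone
    creds = read_credentials()
    auth = tweepy.OAuthHandler(creds['TWITTER_CONSUMER_KEY'],
                               creds['TWITTER_CONSUMER_SECRET'])
    auth.set_access_token(creds['TWITTER_ACCESS_TOKEN'],
                          creds['TWITTER_ACCESS_SECRET'])
    return auth

def get_twitter_client():
    """Create the tweepy.API object used for all interactions with Twitter."""
    import tweepy
    return tweepy.API(get_twitter_auth())
```

Keeping the credentials out of the source code means the module can be committed to version control without leaking the keys.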
As we know, RESTful APIs usually provide data in the popular JSON format, and a tweet
is a complex object with multiple properties that we have to handle. We fetch hundreds
or thousands of tweets from the user's timeline and persist them in a list, a common
data structure; Python lists provide many built-in features that help us manipulate
the data. Since every tweet is a JSON object, every element of the list should be a
complete JSON object as well. Here we use the tweet's _json property, which gives us a
dictionary with the JSON response of the status, and append each status to the list.
The structure of a tweet is as follows:
# Structure of tweet
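Since the fetching listing is not reproduced here, a sketch of what it might look like follows (get_user_tweets and statuses_to_json are illustrative names; get_twitter_client is the function from our authentication.py module):

```python
def statuses_to_json(statuses):
    """Convert tweepy Status objects to raw JSON dicts via their _json property."""
    return [status._json for status in statuses]

def get_user_tweets(username, limit=1000):
    """Fetch up to `limit` tweets from a user's timeline as JSON dicts."""
    from tweepy import Cursor                       # imported locally so the
    from authentication import get_twitter_client   # helper above stays standalone
    client = get_twitter_client()
    # Cursor transparently handles Twitter's pagination for us.
    cursor = Cursor(client.user_timeline, screen_name=username)
    return statuses_to_json(cursor.items(limit))
```

Passing only the username keeps the rest of the analysis code independent of how the tweets were fetched.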
The preceding snippet shows how to use tweepy.Cursor to loop through the first XYZ
items of a user's timeline. First, we need to import Cursor and the get_twitter_client
function defined previously. We have wrapped the code in a method so that we can get
the tweets of any user simply by passing the username to the method.
From the previous chapter we already know how to get tweets from a user's timeline,
how to keep them in a list of tweets, and what the structure of a tweet looks like.
This chapter focuses on analyzing entities in tweets. We are going to perform some
frequency analysis using the data collected in the previous section. Slicing and
dicing this data will allow us to produce some interesting statistics that can be used
to gain insights into the data and answer some questions.
Analyzing entities such as hashtags is interesting as these annotations are an explicit way
for the author to label the topic of the tweet.
# Code for hashtag frequency
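The elided snippet can be sketched as follows, matching the description below. get_hashtags reads the entities section of the raw tweet JSON; the Counter-based tally and the function names are our choices:

```python
from collections import Counter

def get_hashtags(tweet):
    """Extract lowercase hashtag texts from a tweet's entities (raw JSON dict)."""
    entities = tweet.get('entities', {})
    return [tag['text'].lower() for tag in entities.get('hashtags', [])]

def hashtag_frequency(tweets, n=10):
    """Return the n most used hashtags as (hashtag, count) pairs."""
    counter = Counter()
    for tweet in tweets:
        counter.update(get_hashtags(tweet))
    return counter.most_common(n)
```

Lowercasing the hashtags merges variants such as #Python and #python into one count, which usually matches what the end user expects.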
The above code needs a list of tweets, which we already know how to create. We run a
for loop to iterate over all the tweets in the list; for each tweet we call the
get_hashtags(tweet) method, which takes a tweet as an argument and returns the
hashtags used in it. After collecting all the hashtags, we analyze the list to
determine which hashtags the user uses most in their tweets, and return a list
containing each hashtag and its count.
The previous script gave an overview of the hashtags most frequently used by the user,
but we want to dig a little bit deeper. We can, in fact, produce more descriptive statistics
that give us an overview of how hashtags are used by the user:
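These descriptive statistics can be sketched as a small histogram of hashtags per tweet (hashtag_stats is an illustrative name):

```python
from collections import defaultdict

def hashtag_stats(tweets):
    """Return a dict mapping k -> number of tweets containing exactly k hashtags."""
    counts = defaultdict(int)
    for tweet in tweets:
        k = len(tweet.get('entities', {}).get('hashtags', []))
        counts[k] += 1
    return dict(counts)
```

The entry for k = 0 tells us how many tweets carry no hashtags at all, which the most-frequent list alone cannot show.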
The above code gives us interesting insights into the user's hashtag usage: just by
looking at a user's timeline, we could not easily estimate how many tweets contain
hashtags, or how many hashtags each tweet contains. This script summarizes that
information at a glance.