Media Engineering and Technology Faculty

German University in Cairo

Location Detection Over Social Media

Bachelor Thesis

Author: Ahmed Soliman


Supervisors: Sarah Elkasrawy

Submission Date: XX July, 20XX



This is to certify that:

(i) the thesis comprises only my original work toward the Bachelor Degree

(ii) due acknowledgement has been made in the text to all other material used

Ahmed Soliman
XX July, 20XX
Acknowledgments

Text

Abstract

Abstact

Contents

Acknowledgments

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Outline

2 Related Work
2.1 Non-Content-Based Location Estimation
2.2 Content-Based Location Estimation

3 Data
3.1 Dataset Overview
3.2 Data Analysis
3.2.1 Geographical Analysis
3.2.2 Language Analysis

4 Location Detection Approaches
4.1 Profile location identification
4.2 Location detection by language
4.3 Machine learning approaches
4.3.1 Content-based Heuristic Classifier

5 Conclusion

6 Future Work

Appendix

A Lists
List of Abbreviations
List of Figures

References

Chapter 1

Introduction

1.1 Motivation
Micro-blogging services such as Twitter, Facebook and Tumblr have grown rapidly in recent
years. As of March 2013, 400 million tweets were being posted every day [7]. This has
initiated enormous research efforts to mine this data and use it in various applications,
such as event detection [Sakaki et al. 2010; Agarwal et al. 2012] and news recommendation
[Phelan et al. 2009]. Many applications could make use of information about users'
locations, but unfortunately this information is very sparse. The research firm Sysomos
studied Twitter usage between mid-October and mid-December 2009 and found that only 0.23%
of tweets in that period were geo-tagged, which shows how sparse this information is.
Although blogging services allow users to specify their location in their profiles, the
profile location field is not reliable. Cheng et al. found that only 26% of a random sample
of over 1 million Twitter users revealed their city-level location in their profiles, and
only 0.42% of the tweets in this dataset were geo-tagged [Cheng et al. 2010]. Moreover,
these profile locations are not always valid: only 42% of Twitter users in a random dataset
reported a valid city-level location in their profiles [Hecht et al. 2011].

1.2 Aim
In this thesis, approaches for predicting users' locations are discussed in order to
overcome the location sparseness problem described above. These approaches are based purely
on tweet content and tweeting behaviour, in the absence of any other location information.
The goal is to develop approaches that can predict the location of a tweet. The key step
towards this goal is to predict the home location of the user, as the home location can give
important clues to the actual location of the tweet. The intuition is that the content of a
tweet may contain words, entity names or phrases that are more likely to be used in
particular places than in others, which can serve as indicators of the actual location.
Approaches that can predict the possible locations of a tweet are very beneficial in
tracking applications such as news verification, in which we want to distinguish tweets
reported by users who are likely to be at the actual location of an event from tweets
reported by users who are likely to be far away.

1.3 Outline
The remainder of this thesis covers related work, the dataset, the formalization of the
location prediction problem, location classification approaches, and an evaluation of the
discussed algorithms and approaches. It closes with the conclusion and a discussion of
future work.
Chapter 2

Related Work

This chapter reviews prior work related to this study. The prior studies can be categorized
into the following areas:

2.1 Non-Content-Based Location Estimation


Many studies have explored estimating users' locations based on information provided in
user profiles, geo-tagged tweets and other social information.
A number of studies make use of location information provided in users' Twitter profiles.
For example, Kulshrestha et al. [2012][7] used the location information provided by users
in their profiles together with map APIs to estimate country-level location; they were able
to estimate the country-level location for about 23.5% of users with 94.7% accuracy.
However, location prediction techniques that rely on information provided by users are not
always reliable, because map APIs do not always return correct results. In addition, users
often do not enter correct information; for example, Hecht et al. reported that 34% of users
either enter non-geographic information in the location field or leave the field empty.
Other studies make use of GPS coordinates provided by users' mobile devices. However, the
number of geotagged tweets is too small to rely on: Cheng et al. [2010] reported that the
proportion of geotagged tweets is around 1% and that the locations of the majority of users
are not geotagged. Other studies use methods based on IP addresses, such as (Buyukkokten,
Cho, Garcia-Molina, Gravano, & Shivakumar, 1999). These methods have been shown to achieve
around 90% accuracy at mapping Internet hosts to their locations, as reported by
Padmanabhan and Subramanian [2001][8]. However, these methods are not applicable to Twitter
and other social media services, as the geographical divisions of IP addresses are not
always valid. For example, departments of an international corporation might share the same
IP addresses while their true locations are spread across the world. As another example,
users who use VPNs could be assigned IP addresses from locations other than their true
locations.


Some studies use other social information to infer the location of users. For example,
Popescu and Grefenstette [9] tried to estimate the home country of Flickr users using the
place names and coordinates provided with their photos. Backstrom et al. [2] presented an
algorithm that predicts the physical location of a user from the social network structure
of Facebook: given the known locations of a user's friends, they were able to locate 69.1%
of the users with 16 or more located friends to within 25 miles, compared to only 57.2%
using IP-based methods.

2.2 Content-Based Location Estimation


The content of users' tweets has been exploited to extract users' locations. For example, if
a place is frequently mentioned in a user's tweets, the user is likely tweeting from that
place. Several methods are based on that intuition, such as naive gazetteer matching
(Bilhaut, Charnois, Enjalbert, Mathet, 2003)[3], named entity recognition, and
vocabulary-based methods that identify location names in tweets (Agarwal et al. [2012])[1].
A number of methods have been proposed to estimate the home location of users based on
content analysis of tweets. These methods build probabilistic models from tweet content.
For example, Eisenstein et al. [2011][6] reported 58% accuracy for predicting regions
(4 regions) and 24% accuracy for predicting states (48 continental US states and the
District of Columbia) using geographic topic models.
Estimating city-level location is more challenging than location estimation at higher
levels, because the number of cities in a dataset is often larger than the number of states,
regions or countries. Cheng et al. [2010][5] described a city-level location estimation
algorithm in which local words are identified in tweets (for example, "red sox" is local to
Boston) and statistical models are built from them. However, their method was not promising,
as it needs manual selection of such local words to train a supervised classification model.
In addition, they reported approximately 51% accuracy using their approach. Chang et al.
[2012][4] more recently described another content-based location estimation method using a
Gaussian Mixture Model (GMM) and Maximum Likelihood Estimation (MLE), and reported 50%
accuracy in predicting city-level location within 100 miles of the actual city location,
which is comparable to Cheng et al. [2010].
Chapter 3

Data

3.1 Dataset Overview


In the experiments conducted in this thesis, a dataset collected and analyzed by Benjamin
Bischke has been used [?]. This dataset was collected from the Decahose streaming API and
represents about ten percent of random public tweets over all activities on Twitter during
the period from 2016-01-15 to 2016-02-06.
The dataset consists of around 1 billion activities, divided as follows: 47.16% are new
posts, 30.23% are shared content or retweets, and the remaining 22.6% are deletions of
tweets.

3.2 Data Analysis


The geolocation prediction models presented here have primarily been trained and evaluated
using geotagged tweets, which are filtered by language when conducting the different
experiments. This section presents a geographic and language analysis of the tweets used.

3.2.1 Geographical Analysis

Twitter gives users the option to embed their current GPS location. Only about 2.1%
(16,874,517) of the activities in the dataset included an embedded precise GPS location.
Geographical information can also be inferred indirectly from the location field in users'
profiles, but this information is not reliable: only 43.1% of unique user profiles included
a non-empty profile location field. In addition, only about half of these non-empty profile
location fields were successfully mapped to real locations; the other half contained
locations that are not valid existing places or, in some cases, valid but incomplete
addresses, such as state or country names, which are hard to map to a unique location.
By extracting GPS coordinates from users' profiles and mapping them to countries, we can see
the distribution of the geo-locations of tweets. Figure 3.1 shows the top 10 countries
extracted from the dataset.

3.2.2 Language Analysis

Previous research has focused primarily on English data or has used datasets that consisted
primarily of English tweets. However, Twitter is a multilingual platform, and including
other languages may help in the task of location prediction, as language can be a powerful
indicator of location. For example, if a user tweets mostly in Chinese, this could be an
indicator that the user is from China.
For the analysis of the languages used in tweets, a language detector was applied. Figure
3.2 shows the top 10 most frequently used languages in the dataset.
Chapter 4

Location Detection Approaches

In this chapter we introduce and describe several approaches for location detection over
social media.

4.1 Profile location identification

4.2 Location detection by language


Social media applications let users freely publish status updates and tweets in any
language. Language can be a strong indicator of location: for example, a user who writes
tweets in Chinese is most probably located in China. The problem is that some languages are
spoken in many locations around the world, such as English, which is an official language
in 67 countries [7]. Prediction based on language is therefore not accurate enough to
determine the country where a tweet was published, but it can yield a list of possible
locations, which is sufficient here. This section describes the approach used to obtain
this list.
Languages are mapped to the countries in which they are spoken as an official or second
language [9]. Then, by classifying the language of a tweet, a list of countries that speak
this language can be obtained. The problem is that this list can be large and contain
irrelevant countries. In this case we can use the time zone field in the user's profile, if
it exists, to remove countries that are not located in the user's time zone, which yields a
list of countries that could contain the location of the tweet.
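
The lookup and filtering steps described in this section can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the two tables, their contents and the function name `candidate_countries` are hypothetical, and the fallback to the unfiltered list when the time zone removes every candidate is an assumption.

```python
# Hypothetical lookup tables with a few illustrative entries only.
# A real system would load complete language and time zone mappings.
LANGUAGE_TO_COUNTRIES = {
    "zh": {"China", "Taiwan", "Singapore"},
    "en": {"United States", "United Kingdom", "Australia", "India"},
    "ar": {"Egypt", "Saudi Arabia", "Morocco"},
}

# Hypothetical mapping from a profile time zone label to the countries in it.
TIMEZONE_TO_COUNTRIES = {
    "Cairo": {"Egypt"},
    "London": {"United Kingdom"},
    "Sydney": {"Australia"},
    "Beijing": {"China"},
}


def candidate_countries(language, time_zone=None):
    """Return the countries that could contain the location of a tweet.

    language  -- language code produced by a language classifier
    time_zone -- value of the profile time zone field, or None if absent
    """
    candidates = set(LANGUAGE_TO_COUNTRIES.get(language, set()))

    # Without a usable time zone, the language list is all we have.
    if time_zone is None or time_zone not in TIMEZONE_TO_COUNTRIES:
        return candidates

    # Keep only countries located in the user's time zone.
    narrowed = candidates & TIMEZONE_TO_COUNTRIES[time_zone]

    # Assumption: if the time zone contradicts the language entirely,
    # fall back to the unfiltered list rather than returning nothing.
    return narrowed if narrowed else candidates
```

For an English tweet from a profile whose time zone is "London", the sketch narrows four candidate countries down to one; without a time zone it returns the full language list.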

4.3 Machine learning approaches


4.3.1 Content-based Heuristic Classifier
In this section we describe our statistical location classifier, which is trained on terms
extracted from all the users' geotagged tweets.


We created this classifier for city-level location, for which we have ground truth. Each
user in our training dataset corresponds to a training example, where the features are
extracted from the content of the user's tweets and the corresponding output is the
geolocation provided with the tweets. The number of classes in the trained model equals the
total number of locations (cities) in our training dataset.

4.3.1.1 Feature Extraction

First, we tokenize all tweets in our training dataset and filter them by removing URLs,
mentions and hashtags. We then remove any word that is identified as a stop word; stop words
are defined by the list provided in the NLTK stopwords corpus. Once the stop words are
removed, we perform lemmatization, which reduces the inflected forms of a word to a common
base form. Once the tokens have been extracted, we apply a simple heuristic algorithm called
CALGARI [2]. This algorithm is based on the intuition that a model will perform better if it
is trained on terms that are more likely to be used by users from particular regions than by
users from the general population. The algorithm defines a score for each term that
indicates how much more likely the term is to occur in one location than in the dataset as a
whole. The score is calculated as follows:
Let $s(T)$ be a function that takes a term $T$ and calculates its score, $F(T)$ be the
frequency of the term $T$ in our dataset, $n(T, c)$ be a function that counts how many times
the term $T$ is used with class $c$, $N$ be the total number of different terms in our
dataset, and $C$ be the set of classes (locations) in our dataset. We need to evaluate the
following equation for each term:

\[
s(T) = \frac{\max_{c \in C} P(T \mid c)}{P(T)}
\]

The term $P(T) = \frac{F(T)}{N}$, so we need to know how to evaluate the numerator:

\[
\max_{c \in C} P(T \mid c) = \max_{c_i \in C} \frac{n(T, c_i)}{\sum_{j} n(t_j, c_i)}
\]


After calculating a score for each term, the algorithm sorts the terms by this score and
chooses the 10,000 highest-scoring terms as features for our model.
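
The filtering and scoring steps of this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the thesis implementation: the small `STOP_WORDS` set stands in for the NLTK stopwords corpus, lemmatization is omitted, and the helper names `tokenize`, `term_scores` and `select_features` are hypothetical. The sketch normalizes $P(T)$ by the total number of term occurrences; because the normalizer is the same constant for every term, this choice does not change the ranking.

```python
import re
from collections import Counter, defaultdict

# Stand-in stop word list (assumption); the text uses the NLTK stopwords corpus.
STOP_WORDS = {"the", "a", "an", "in", "at", "on", "to", "is", "of", "and"}

# Matches URLs, @mentions and #hashtags, which are removed before tokenizing.
NOISE = re.compile(r"https?://\S+|[@#]\w+")


def tokenize(tweet):
    """Lowercase a tweet and drop URLs, mentions, hashtags and stop words."""
    text = NOISE.sub(" ", tweet.lower())
    return [w for w in re.findall(r"[a-z']+", text) if w not in STOP_WORDS]


def term_scores(labelled_tweets):
    """Return s(T) = max_c P(T | c) / P(T) for every term T.

    labelled_tweets -- iterable of (tweet text, class label) pairs
    """
    per_class = defaultdict(Counter)  # class -> term -> count n(T, c)
    overall = Counter()               # term -> frequency F(T)
    for tweet, label in labelled_tweets:
        tokens = tokenize(tweet)
        per_class[label].update(tokens)
        overall.update(tokens)

    total = sum(overall.values())
    # Denominator of P(T | c): all term occurrences observed with class c.
    class_totals = {c: sum(counts.values()) for c, counts in per_class.items()}

    scores = {}
    for term, freq in overall.items():
        p_term = freq / total
        p_term_given_class = max(
            per_class[c][term] / class_totals[c]
            for c in per_class
            if class_totals[c] > 0
        )
        scores[term] = p_term_given_class / p_term
    return scores


def select_features(scores, k=10_000):
    """Return the k terms with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Tiny made-up sample, for illustration only.
sample = [
    ("Red Sox win again at Fenway http://t.co/x", "Boston"),
    ("Watching the Red Sox tonight @friend", "Boston"),
    ("Traffic on the bridge tonight #sf", "San Francisco"),
    ("Fog over the bridge again", "San Francisco"),
]
scores = term_scores(sample)
features = select_features(scores, k=4)
```

In this sample, "sox" occurs only in Boston tweets and therefore scores higher than "tonight", which occurs in both cities.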

4.3.1.2 Training and Classification

Once the features (the terms chosen in the previous step) are extracted, we build a
probabilistic classifier based on the Multinomial Naive Bayes algorithm from the
scikit-learn library, which assumes conditional independence of the features.
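
To make the classification step concrete, the following is a minimal multinomial Naive Bayes classifier written in plain Python. It is a dependency-free stand-in for illustration, not the scikit-learn classifier the thesis uses; the class name `TinyMultinomialNB`, the Laplace smoothing parameter `alpha` and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes over token lists (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed value)

    def fit(self, documents, labels):
        """documents: list of token lists; labels: one class label per document."""
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)  # class -> term -> count
        for tokens, label in zip(documents, labels):
            self.term_counts[label].update(tokens)
        self.vocabulary = {
            term for counts in self.term_counts.values() for term in counts
        }
        return self

    def predict(self, tokens):
        """Return the class with the highest posterior log-probability."""
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            counts = self.term_counts[label]
            denominator = sum(counts.values()) + self.alpha * vocab_size
            # Log prior plus the sum of per-term log likelihoods; summing
            # the terms independently is the conditional independence assumption.
            score = math.log(n_label / n_docs)
            for term in tokens:
                if term in self.vocabulary:  # ignore terms unseen in training
                    score += math.log((counts[term] + self.alpha) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (made up): one token list per user, labelled with a city.
train_docs = [
    ["red", "sox", "fenway"],
    ["sox", "harvard"],
    ["bridge", "fog"],
    ["bridge", "bay"],
]
train_labels = ["Boston", "Boston", "San Francisco", "San Francisco"]
model = TinyMultinomialNB().fit(train_docs, train_labels)
```

In practice, each token list would first be restricted to the selected feature terms, so that only those terms contribute to the decision.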
Chapter 5

Conclusion

Conclusion

Chapter 6

Future Work

Text

Appendix

Appendix A

Lists

List of Figures

Bibliography

[1] Puneet Agarwal, Rajgopal Vaithiyanathan, Saurabh Sharma, and Gautam Shroff.
Catching the long-tail: Extracting local news events from twitter. 2012.

[2] Lars Backstrom, Eric Sun, and Cameron Marlow. Find me if you can: improving
geographical prediction with social and spatial proximity. pages 61-70, 2010.

[3] Frederik Bilhaut, Thierry Charnois, Patrice Enjalbert, and Yann Mathet. Geographic
reference analysis for geographic document querying. pages 55-62, 2003.

[4] Hau-wen Chang, Dongwon Lee, Mohammed Eltaher, and Jeongkyu Lee. @phillies
tweeting from philly? predicting twitter user locations with spatial word usage. pages
111-118, 2012.

[5] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You are where you tweet: a
content-based approach to geo-locating twitter users. pages 759-768, 2010.

[6] Jacob Eisenstein, Brendan O'Connor, Noah A Smith, and Eric P Xing. A latent
variable model for geographic lexical variation. pages 1277-1287, 2010.

[7] Juhi Kulshrestha, Farshad Kooti, Ashkan Nikravesh, and P Krishna Gummadi.
Geographic dissection of the twitter network. 2012.

[8] Venkata N Padmanabhan and Lakshminarayanan Subramanian. An investigation of
geographic mapping techniques for internet hosts. 31(4):173-185, 2001.

[9] Adrian Popescu, Gregory Grefenstette, et al. Mining user home location and gender
from flickr tags. 2010.
