
IIIT Hyderabad
Masters Research

Polar News: Detecting Polarity among Sources for a Particular News Story

Author: Devansh Shah
Supervisor: Prof. Navjyoti Singh

July 20, 2016

Abstract

In the information age, user time has become more precious than ever before. The average amount of content per consumer on the internet appears to be growing exponentially, and increasing emphasis is being laid on user relevance. Major internet companies continue their efforts to reduce barriers, both physical and temporal, to knowledge and information. As a consequence, print news agencies worldwide are declaring bankruptcy or moving to online platforms. New media sources have emerged, or have been reconstituted, to suit the pace of information induction over the Internet. News is no longer delivered on your doorstep the next morning; it is at your fingertips, almost instantly.

With the change in the medium of news delivery, there has been an inevitable change in the discipline once known as Journalism. In online media, where there is seldom reader loyalty or subscription, revenue translates directly from clicks or views. To drive these, news recommendations based on user interests and history have become omnipresent. Algorithms restrict the content a user sees over the internet, based on complex and non-deterministic numerical estimates of the user's click-through rate. This means that self-trained algorithms decide the authenticity, trustworthiness, and relevance of an article for a given user. This not only restricts the content a user can see on the web, but is also increasingly polarizing the population.

Given this scenario, it is difficult to form a holistic view of a particular story without analyzing all the sources reporting on the event. The aim of this thesis is two-fold. First, we aim to detect the affinity of each news source for a given story. We intend to use advances in graph theory, and in particular hyper-graphs, to study the network of news sources and the angle from which they report particular stories. These network features will help us determine the authenticity and angle of a given article on a story. Our aim is to detect communities among these sources, based on their affinity towards a particular angle of the story.

The second aim of the project is to objectively identify each of these communities and their corresponding angles for a given story. This involves using advances in Natural Language Processing to understand the context and identify the underlying similarity within each community. Polar News is a web application that aims to aggregate all available stories in the online media related to a particular event and to detect the polarity of the sources' critiques of the event. This would enable us to answer questions about how the majority of the world evaluates a particular event, and on what metrics those evaluations are based.

1 Introduction

Polar News is a web application that aggregates news articles from various media sources, such as The New York Times, Instapundit, Washington Times, Wired, and over 150 others. Each article is aggregated from a list of RSS feeds published on the corresponding media source's website. RSS feeds provide normalization of text across sources and are available in real time. The aggregated articles are then analyzed to extract tags, i.e. keywords that describe the context of the article. These keywords are used to group together articles pertaining to a particular story.
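As a minimal sketch of this first step, fetching a feed with Python's feedparser and collecting the publisher-supplied tags (the feed URL and field names below are illustrative, not the production ones):

    import feedparser

    def extract_articles(feed_url):
        # Parse the feed and keep the fields the aggregator needs.
        parsed = feedparser.parse(feed_url)
        articles = []
        for entry in parsed.entries:
            articles.append({
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),
                "published": entry.get("published", ""),
                # RSS/Atom tags arrive as a list of {"term": ...} dicts.
                "tags": [t["term"] for t in entry.get("tags", [])],
            })
        return articles

    # Illustrative feed URL, not necessarily one of the 150+ production feeds.
    articles = extract_articles("http://rss.nytimes.com/services/xml/rss/nyt/World.xml")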
Once we have the groups of articles that represent a single event, we begin the more complex analysis: detecting the polarity, or angle, that emanates from each source. We form a hyper-graph for each such story, where each hyper-edge aims to represent the angle of, or similarity between, a set of articles. To achieve this, we will use known techniques for mining hyper-edges. However, the scope of this paper is limited to the design and implementation of a web app, the implementation of an aggregation engine that accumulates news articles from over 150 sources and stores them in a NoSQL database in real time, and the construction of a hyper-graph among those sources. Several experiments have been performed on the graph; their results remain to be verified and published.

2 Background

2.1 What is a news aggregator?

At its most basic, a news aggregator is a website that takes information from multiple sources and displays it in a single place. While the concept is simple in theory, in practice news aggregators take many forms. For the purposes of this report, two types of aggregators are described:

Feed Aggregators

A Feed Aggregator is closest to the traditional conception of a news aggregator: a website that contains material from a number of websites, organized into various feeds, typically arranged by source, topic, or story. Feed Aggregators often draw their material from a particular type of source, such as news websites or blogs, although some contain content from more than one type of source. Some well-known examples are Yahoo! News (and its sister site, My Yahoo!) and Google News. Feed Aggregators generally display the headline of a story, and sometimes the first few lines of the story's lede, with a link to where the rest of the story appears on the original website. The name of the originating website is often listed as well.
Specialty Aggregators

A Specialty Aggregator is a website that collects information from a number of sources on a particular topic or location. Examples of Specialty Aggregators are hyperlocal websites like Everyblock and Outside.In, and websites that aggregate information about a particular topic, like Techmeme and Taegan Goddard's Political Wire. Like Feed Aggregators, Specialty Aggregators typically display the headline of a story, and occasionally the first few lines of the lede, with a link to the rest of the story, along with the name of the website on which the story originally appeared. Unlike Feed Aggregators, which cover many topics, Specialty Aggregators are more limited in focus and typically cover just a few topics or sources. This project aims to be a Specialty Aggregator, since it restricts articles to a particular topic.

2.2 Avoiding Copyright Infringement

Since we're aggregating a large amount of data, we need to be careful not to infringe copyright while scraping for news. The following guidelines are observed:

- Reproduce only the portions of the headline or article that are necessary to identify the story.
- Since we only aggregate articles for a particular tag, we never reproduce an entire source website.
- We acknowledge the source for every article.
- We maintain the link to the original article.
- We state the purpose of scraping any source, and follow robots.txt for all sources (a sketch of this check follows the list).
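A small sketch of the robots.txt check mentioned in the last point, using Python's standard-library robotparser; the user agent string is an assumption:

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    def allowed_to_fetch(url, user_agent="PolarNewsBot"):
        # Build the robots.txt URL for the site hosting the article.
        parts = urlparse(url)
        rp = RobotFileParser("{}://{}/robots.txt".format(parts.scheme, parts.netloc))
        rp.read()  # fetch and parse the site's robots.txt
        return rp.can_fetch(user_agent, url)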

3 Polar News

3.1 Design

Polar News is designed to be reliable and scalable. To produce valuable results, the web application needs to process millions of articles in every iteration and group them by story, so each component of Polar News needs to cope with large amounts of data. We therefore selected MongoDB, a NoSQL document store, as the database. It runs on a Hadoop cluster that replicates the data to guard against crashes, since any crash could cost hundreds of computing hours and valuable stories.
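A minimal sketch of connecting to such a replicated deployment with pymongo; the hostnames, replica-set name, and database name are placeholders, and a MongoDB replica set stands in here for whatever replication the cluster actually provides:

    from pymongo import MongoClient

    # The hostnames, replica-set name, and database name are placeholders.
    client = MongoClient(
        "mongodb://node1:27017,node2:27017,node3:27017/",
        replicaSet="polarnews-rs",
    )
    db = client["polarnews"]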
The web application itself is built using Django, a Python-based web framework widely used in industry to build reliable applications. Python was also chosen for its wide third-party offerings and ease of use. The application fetches articles from a syndicated feed or a single web page, and displays the headline and summary of the story, along with the published date, publisher, and tags. Its design is similar to that of a Feed Aggregator.

As of now, it contains two sections: a list of articles, and categories (per story). Media Cloud, a project at the Berkman Center for Internet and Society, also provides research students with an API for extracting news articles from its sources.
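To make the storage concrete, here is one plausible shape for an aggregated article document, matching the fields the application displays; the field and collection names are hypothetical:

    from pymongo import MongoClient

    db = MongoClient()["polarnews"]  # illustrative database name

    # Hypothetical article document; the real schema may differ.
    db["articles"].insert_one({
        "title": "Example headline",
        "summary": "First few lines of the lede ...",
        "published": "2016-07-20T08:00:00Z",
        "publisher": "The New York Times",
        "tags": ["politics", "election"],
        "url": "http://example.com/story",
    })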

3.2 What questions do we answer?

- In what parts of the world is a story being covered, geographically?
- What is the root of a particular story?
- How do news stories spread across sources?
- How are contrasting opinions or facts presented in different articles on the same story?
- How can we identify influencers, and what is the impact of an article?
- How does a story die?
- Does online user behavior ultimately shape the news?

4 Technology stack and implementation

4.1 News Aggregator

The news aggregator in Polar News fetches its news articles from RSS feeds using Python's feedparser library. To get the RSS feeds, it subscribes to the various websites that provide them and periodically polls them for updates. It also uses the Media Cloud API to aggregate stories from additional sources.

Feeds are fetched once every six hours, as a cronjob. When the aggregator sees new articles on a feed, it writes them to a queue, which is consumed by a writer that persists them to the NoSQL database. The database is made resilient to failure through replication. News articles are thus added or updated from time to time. The pipeline is shown in Figure 1.
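A condensed sketch of this fetch stage under stated assumptions: a producer polls the feeds, new entries land on an in-process queue, and a single writer drains the queue into MongoDB (the production pipeline may use a different queue and schema):

    import feedparser
    from queue import Queue
    from threading import Thread
    from pymongo import MongoClient

    queue = Queue()
    db = MongoClient()["polarnews"]  # illustrative database name

    def poll_feeds(feed_urls):
        # Producer: invoked every six hours (e.g. from cron) to enqueue entries.
        for url in feed_urls:
            for entry in feedparser.parse(url).entries:
                queue.put(entry)

    def writer():
        # Consumer: upsert on URL so re-fetched articles are updated, not duplicated.
        while True:
            entry = queue.get()
            db["articles"].update_one(
                {"url": entry.get("link")},
                {"$set": {
                    "title": entry.get("title"),
                    "published": entry.get("published"),
                    "tags": [t["term"] for t in entry.get("tags", [])],
                }},
                upsert=True,
            )
            queue.task_done()

    Thread(target=writer, daemon=True).start()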
The Algorithm

The algorithm used for fetching articles and grouping them under specific topics is as follows:

Detection of story

Articles are classified into tags and grouped under topics. Each RSS feed provides a set of tags with an article. These tags are stored, and similar tags are grouped after stemming. A high threshold is set to exclude false positives, which would harm community detection; by contrast, the cost of missing a story is relatively low. Screenshots of the web application interface are shown in Figures 2 and 3.
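A sketch of this tag-normalization step, assuming NLTK's Porter stemmer and an illustrative threshold value (the report does not state the actual figure):

    from collections import defaultdict
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    MIN_SHARED_ARTICLES = 3  # assumed value for the "high threshold"

    def stem_tags(tags):
        # Normalize tags so that e.g. "elections" and "election" fall together.
        return {stemmer.stem(t.lower()) for t in tags}

    def group_by_tags(articles):
        groups = defaultdict(list)
        for article in articles:
            for tag in stem_tags(article["tags"]):
                groups[tag].append(article)
        # Keep only tags shared by enough articles to count as a story,
        # trading missed stories for fewer false positives.
        return {t: arts for t, arts in groups.items()
                if len(arts) >= MIN_SHARED_ARTICLES}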
Once the articles have been either tagged to a particular story or discarded, we form a graph that captures the links and shared tags among the articles (Algorithm 1).

Algorithm 1 Algorithm for News Aggregator (Python)

procedure AggregateAndClassify
    db ← MongoDB instance
    mydb ← db[db_name]
    start ← 0
    if mydb.collections.find(medialist) is non-empty then
        start ← max(mydb.collections.find(medialist)[media_id])
    end ← start + 250
    mc ← MediaCloud instance
    media_sources ← GetMediaList(mc, start, end)
    iterNum ← 0
loop:
    start ← 0
    if mydb.collections.find(feedlist) is non-empty then
        start ← max(mydb.collections.find(feedlist)[media_id = media_sources[iterNum]])
    end ← start + 20
    feeds ← GetFeedList(mc, media_sources[iterNum], start, end)
    parsed_json ← json.loads(feeds)
    urls ← [x[url] for x in parsed_json]
    feed_parsed ← [feedparser.parse(x) for x in urls]
    // populate the database with the fetched articles and tags
    CreateGroupsFromArticles(mydb)        ▷ classify
    iterNum ← iterNum + 1
    if iterNum < 250 then
        goto loop

This graph is displayed to the user, along with the title of the story and the name of the source. We then use the community detection algorithm described in Honors report - I to find the communities that influence one another. However, since we are not using any NLP techniques beyond simple tag extraction and matching, we cannot yet expect to form a richer, more relevant hyper-graph among the set of sources.
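For illustration, the sketch below builds the weighted source graph from shared tags and runs a standard modularity-based community detector from networkx; this detector is a stand-in for illustration, not the algorithm from Honors report - I:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    def build_source_graph(story_articles):
        # One node per source; edge weight proportional to shared-tag count.
        G = nx.Graph()
        arts = list(story_articles)
        for i, a in enumerate(arts):
            for b in arts[i + 1:]:
                shared = set(a["tags"]) & set(b["tags"])
                if shared and a["publisher"] != b["publisher"]:
                    u, v = a["publisher"], b["publisher"]
                    w = len(shared) + (G[u][v]["weight"] if G.has_edge(u, v) else 0)
                    G.add_edge(u, v, weight=w)
        return G

    # story_articles would be one group produced by the tag-grouping step.
    # communities = greedy_modularity_communities(build_source_graph(story_articles))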
Current progress and state

I am currently working on a couple of things. First, the news aggregator attempts to aggregate more than 100,000 stories every run (every six hours). This data amounts to several GBs, so we need to limit the stories aggregated. I am working on a specialized aggregation engine that aggregates only stories from a particular section (e.g. Sports, Travel, Politics). This will help us collect more precise and relevant stories. The code has been released to the cronjob, and I am currently testing the results to ensure that it functions correctly.
Second, and most importantly, I am working on the implementation of hyper-graphs. Currently, the nodes are connected via simple edges whose weight is proportional to the tag similarity between two sources. This restricts the richness of the graph, so I am working on data structures to implement a hyper-graph. This was one of the reasons for choosing a document store like MongoDB, since it makes it possible to access elements with an arbitrary number of relationships: each document is a hyper-edge, storing multiple key-value pairs for the weight and the nodes included in that edge. The hyper-edges are being successfully stored in Mongo; the only remaining part is to render them to HTML for presentation.
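A sketch of that layout with pymongo; the collection, field names, and weights are hypothetical:

    from pymongo import MongoClient

    db = MongoClient()["polarnews"]  # illustrative database name

    # Hypothetical hyper-edge document: one document per hyper-edge, holding an
    # arbitrary number of member nodes with per-node weights.
    db["hyperedges"].insert_one({
        "story_id": "example-story",   # the story this hyper-edge belongs to
        "angle": None,                 # to be labelled by later NLP work
        "members": [
            {"node": "The New York Times", "weight": 0.8},
            {"node": "Washington Times", "weight": 0.6},
            {"node": "Wired", "weight": 0.4},
        ],
    })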
Future work

This suite of applications (the web application, the news aggregation engine, the clustering algorithms, and the hyper-graph construction) aims to form the underlying foundation for my Masters thesis. As discussed, the idea is to detect communities and the underlying opinions of various news sources for a given story. I intend to submit a working product that can easily be used to verify and reproduce the results of my research. This would help not only to substantiate any theoretical claims, but also to support further research on the same subject.

Figure 1: News Aggregator pipeline

Figure 2: Polar News - Home page

Figure 3: Polar News - List of Articles
