
Combating Link Spam

M.Tech. Seminar Report

Submitted in partial fulfillment of the requirements for the degree of


Master of Technology

by
Jubin Chheda
Roll No 06305003

under the guidance of:


Prof. Soumen Chakrabarti
examined by:
Om P. Damani

Department of Computer Science and Engineering,


Indian Institute of Technology, Bombay.

Abstract:
As more and more people rely on search engines as starting points to fulfill their information needs, it has become critically important for a page to appear among the top few results of popular search engines. Most search engines use, among other things, variants of the classic PageRank algorithm, which relies on the link structure of the web to rank pages. To have their pages rank higher than they deserve, some web designers resort to all sorts of tricks to mislead search engines, manipulating linkage (link-spam) and content (term-spam) on their pages and across the web, and in the process give form to what has come to be called web-spam. There is a continuing clash between search-engine algorithm designers and web-spammers, turning the web into a battleground: the Adversarial Web.

Our main focus in this report is link-spam. We take a look at the different methods of combating link-spam. We also look at optimal link-spam structures and test them using Java code. We implement popular ranking algorithms and test their efficacy on a web-graph made available by Webaroo.

Table of Contents:

I. Introduction
II. Web Model
III. Ranking Algorithms
    A. PageRank
    B. TrustRank
IV. Web Spam Taxonomy
    A. Link Spamming Techniques
    B. Term Spamming
    C. Hiding Techniques
V. Optimal Spamming Structures
VI. Tweaking Ranking Algorithms: a solution to link-spam?
    A. Antitrust and Distrust Ranking
    B. Combining Trust and Distrust
    C. Truncated PageRank
    D. Topical TrustRank
VII. Statistics about pages: features to classify link-spam
VIII. Scope for Future work
IX. Conclusion
X. References

I. Introduction:
Everyone has information needs; how do they get satisfied? The web seems to have some answers: it is the 21st century's answer to the library of Alexandria. However, it is too messy, too disorganized and too fast-changing; the web is Godzilla, Socrates and Jesse Owens all packed into one: huge, smart and fast. We need a catalog, and we need trust. Hence the need for the search engine.

“Users started at a search engine 88% of the time when we gave them a new task to complete on the Web.” [6] The key to the success of search engines is their simplicity and comprehensiveness. The difficulty of the search problem lies in presenting the top 10 relevant sites: the need is fulfilled only if the results of the search point to answers. In other words, the relevance of the results is key. Moreover, 85% of the time, people don't look beyond the top 10 results [7]. People make medical, financial, cultural and security-related decisions based on search engine results.

Traditionally, search engines have employed ranking algorithms which use the linkage between websites to represent endorsement, and have pushed up websites that are referred to by other high-ranking websites. PageRank and HITS are two such algorithms. Ranking high on a search engine is thus something that fetches a high premium; e-commerce, propagandistic and marketing websites have a business stake in featuring on top.

In this scenario, some web designers want to do all they can to have their pages rank high, artificially. Enter Web Spam: “Web Spam refers to hyperlinked pages on the WWW which are created with the intention of misleading search engines.” [2]

Literature with statistics on the amount of web-spam is limited. [8] report:

Table 1: Amount of web-spam

Data Set                          Crawl Date    Data set size        Sample size   Spam
Fetterly et al.                   11/02-02/03   150 million pages    751 pages     8.1%
Fetterly et al. (Yahoo BFS)       07/02-09/02   429 million pages    535 pages     6.9%
Gyöngyi et al. (Alta Vista set)   08/03         31 million pages     748 pages     18%

The methods used to spam the web are broadly classified into two categories: link spamming and term/content/text spamming. Link spamming refers to manipulating the in-links and/or out-links of pages, and in effect a link substructure of the web, to boost rankings for one's pages and mislead search engines. Link spamming exploits weaknesses in traditional ranking algorithms. To boost the rankings of a page, spammers induce high-ranking pages to point to it and orchestrate link structures within their own pages to boost the rankings of a few target pages. Some spammers even resort to arranging whole collections of sub-domains pointing to each other: setting up spam farms. Term spamming, on the other hand, refers to stuffing the text fields of a page with spam terms to make it appear more relevant. Techniques include dumping, which is the inclusion of a large number of unrelated terms on a page, even whole dictionaries, just so that the page will show up as relevant for some obscure terms. To cover up the tell-tale manipulations on spammed pages, so that human visitors cannot spot them, spammers use hiding techniques. Popular ones include cloaking, which is serving one version of a page to crawlers and another to human users.

This has created the war-zone of the Adversarial Web, where both sides, search engines and spammers, are trying to outwit each other.

In this report we concentrate on ways to detect and prevent link-spam. This problem involves ideas not only from the areas of Information Retrieval and Machine Learning, but also from domains as diverse as anthropology, linguistics, political science, and economics, among others. [9]

The rest of this section outlines the report. The report starts off with a model of the web as a graph. In Section III, we take a look at PageRank and then TrustRank. In Section IV, we come up with a taxonomy for web-spam, which we hope will help bring order to the means of tackling each type. In Section V, we look at optimal structures for spam-farms and stress how such structures can be created to maximize the rank of desired pages. In Section VI, we look at some algorithms which tweak PageRank to come up with alternative algorithms. In Section VII, we look at statistical features as a potential holy grail for detecting spam.

We implement the ranking algorithms (PageRank, TrustRank, DistrustRank, etc.) using a stream model. We try to verify the claims made in various papers by testing them with these algorithms.

II. Web Model:
We model the web as a graph G = (V, E), with V the set of pages (vertices) and E the set of directed links (edges). We remove multiple links and self-links. Consider the simple web-graph of Figure 1: it has 5 pages and 6 links.

The number of inlinks of a page p is its indegree ι(p), and the number of outlinks is its outdegree ω(p). Pages with no inlinks are called unreferenced pages, and those with no outlinks are non-referencing pages.

Figure 1: Simple web-graph

The transition matrix T is defined as:

T(p,q) = 1/ω(q)  if (q,p) ∈ E,  and  T(p,q) = 0  if (q,p) ∉ E

The T for the graph in Figure 1 is:

        [ 0    0    0    0    0 ]
        [ 1    0    1/2  1/2  0 ]
T  =    [ 0    1    0    0    0 ]
        [ 0    0    1/2  0    0 ]
        [ 0    0    0    1/2  0 ]
The inverse transition matrix U is defined as:

U(p,q) = 1/ι(q)  if (p,q) ∈ E,  and  U(p,q) = 0  if (p,q) ∉ E

The U for the graph in Figure 1 is:

        [ 0    1/3  0    0    0 ]
        [ 0    0    1    0    0 ]
U  =    [ 0    1/3  0    1    0 ]
        [ 0    1/3  0    0    1 ]
        [ 0    0    0    0    0 ]
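As a quick aid to reproducing these matrices, here is a minimal Java sketch (our own helper, with names we invented) that builds T from an edge list after dropping self-links and duplicate links; running it on the reversed edge list yields U in the same way, since the indegrees of the original graph are the outdegrees of the reversed one.

import java.util.*;

// Builds the transition matrix T for a small graph given as an edge list.
// T[p][q] = 1/outdegree(q) if (q,p) is an edge, 0 otherwise.
public class TransitionMatrix {
    public static double[][] build(int n, int[][] edges) {
        Set<Long> seen = new HashSet<>();
        int[] outDeg = new int[n];
        List<int[]> cleaned = new ArrayList<>();
        for (int[] e : edges) {
            long key = (long) e[0] * n + e[1];
            if (e[0] == e[1] || !seen.add(key)) continue;   // drop self-links and duplicate links
            cleaned.add(e);
            outDeg[e[0]]++;
        }
        double[][] t = new double[n][n];
        for (int[] e : cleaned)
            t[e[1]][e[0]] = 1.0 / outDeg[e[0]];             // edge q -> p contributes 1/omega(q)
        return t;
    }

    public static void main(String[] args) {
        // the 5-page graph of Figure 1 (pages numbered 0..4 instead of 1..5)
        int[][] edges = {{0,1},{2,1},{3,1},{1,2},{2,3},{3,4}};
        double[][] t = build(5, edges);
        for (double[] row : t) System.out.println(Arrays.toString(row));
    }
}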

III. Ranking Algorithms:
PageRank [1] and HITS [10] were the first attempts to provide an importance score to pages, and they made extensive use of the concept of prestige from social network analysis. Based on PageRank, Haveliwala [11] came up with topic-sensitive PageRank, which paved the way for Gyöngyi et al. [2] to come up with TrustRank.

We take a quick look at PageRank and TrustRank.

A. PageRank:

The idea behind PageRank is that a page is important if several pages point to it. Thus a page is influenced by, and influences, other pages. If each page is to have a rank, intuition tells us that a page must rank in proportion to the ranks of the pages pointing to it, with an inlink signifying a vote. Another thing built into such a scheme is that if a page points to several pages, its endorsement should be distributed equally among all its outlinks. Thus for a page p, the rank r(p) would be:

r(p) = Σ_{q: (q,p) ∈ E} r(q) / ω(q)

This arrangement works fine except in cases where, say, two nodes point to each other and nowhere else while one of them has an inlink; they would then end up as a rank sink [1]. The equation was therefore modified to:

r(p) = α · Σ_{q: (q,p) ∈ E} r(q) / ω(q) + (1 − α) · (1/N)

Here, α serves as a damping factor, and the second term serves as a random jump to p
from anywhere on the web. [1] also elucidates this idea using the random surfer model.

The matrix form is:

r = α · T · r + (1 − α) · (1/N) · 1_N

Biased PageRank:
Instead of using an equiprobable distribution for the random jump to any page, one can define one's own distribution by replacing (1/N) · 1_N with a vector d. Biased PageRank can be used to assign special non-zero scores (which add up to 1) to some pages [11]:

r = α · T · r + (1 − α) · d

where d can be initialized to scores which disseminate to other pages over the iterations. Thus, if d concentrates on sports pages, the biased PageRank will give a sports-based ranking [11].

We use some Scilab code to calculate the PageRank scores for the graph in Figure 1, taking α = 0.85; r1 is the unbiased PageRank, whereas r2 is biased with d:

d  = (0, 0.7, 0.3, 0, 0)
r1 = (0.03, 0.18, 0.18, 0.11, 0.08)
r2 = (0, 0.27, 0.28, 0.12, 0.05)

We also implement PageRank using Java. We plan to verify topic-sensitive PageRank by using a corpus of sports pages.
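A minimal sketch of the kind of power iteration we use (the class and method names are our own; a uniform d reproduces ordinary PageRank, a non-uniform d the biased variant):

import java.util.Arrays;

// Power iteration for (biased) PageRank: r = alpha * T * r + (1 - alpha) * d.
public class PageRank {
    public static double[] rank(double[][] t, double[] d, double alpha, int iterations) {
        double[] r = d.clone();                              // start from the bias distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[r.length];
            for (int p = 0; p < r.length; p++) {
                double sum = 0;
                for (int q = 0; q < r.length; q++) sum += t[p][q] * r[q];
                next[p] = alpha * sum + (1 - alpha) * d[p];
            }
            r = next;
        }
        return r;
    }

    public static void main(String[] args) {
        // Transition matrix of the 5-page graph of Figure 1 (Section II).
        double[][] t = {
            {0, 0, 0,   0,   0},
            {1, 0, 0.5, 0.5, 0},
            {0, 1, 0,   0,   0},
            {0, 0, 0.5, 0,   0},
            {0, 0, 0,   0.5, 0}};
        double[] uniform = {0.2, 0.2, 0.2, 0.2, 0.2};        // ordinary PageRank
        double[] biased  = {0.0, 0.7, 0.3, 0.0, 0.0};        // the bias vector d used for r2 above
        System.out.println(Arrays.toString(rank(t, uniform, 0.85, 20)));
        System.out.println(Arrays.toString(rank(t, biased,  0.85, 20)));
    }
}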

Consider another web-graph, shown in Figure 2. For 20 iterations and α = 0.85 we obtain:

Table 2: PageRank Scores

Page   PageRank
7      0.26328982
6      0.24046313
2      0.12690023
4      0.10361437
3      0.0776826
8      0.06859472
1      0.0531074
5      0.04968177
0      0.01666667

Figure 2: A web-graph for PageRank calculation. Gray nodes are spam pages.

B. TrustRank:

[2] builds on biased PageRank. The initial d is made to consist of normalized non-zero scores for known good (non-spam) pages. The idea is that goodness (trust) propagates in the forward direction: from known good nodes to the nodes that they point to. The main equation remains r = α · T · r + (1 − α) · d; the heart of the algorithm, however, is how to select d.

Seed set, s: This is done by coming up with a seed set s, the set of pages initially considered for goodness. The seed set can be obtained by a SelectSeed function, for example by applying inverse PageRank to the web-graph, i.e. PageRank on the transpose of the web-graph.
Oracle function, O(p): Human evaluation is used to decide whether a page p is spam. This is formalized as:

O(p) = 0 if p is bad,  O(p) = 1 if p is good
TrustRank algorithm:
Input:
• T: transition matrix
• N: number of pages
• L: limit to oracle invocations
• α: decay factor
• M: number of PageRank iterations

Output:
• rt: TrustRank scores.

Algorithm:
//evaluate seed-suitability of each page
1. s=SelectSeed(…) //Could be inverse PageRank
//generate corresponding ordering
2. σ=Rank({1,…,N}, s)
//select good seeds
3. d=0N
4. for i=1 to L:
a. if O(σ(i)) equals 1 then:
i. d(σ(i))=1
//normalize d
5. d=d/|d|
//compute TrustRank scores
6. rt=d
7. for i=1 to M:
a. rt = α · T · rt + (1 − α) · d

8. return rt
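A compressed Java sketch of the algorithm above (a simplification in which the ordering σ is passed in pre-computed and the oracle is simulated by a hand-labelled set of good pages; the names are ours):

import java.util.*;

// TrustRank: biased PageRank where the bias d is concentrated on oracle-approved seed pages.
public class TrustRank {
    // sigma: pages ordered by decreasing seed-suitability (e.g. inverse PageRank), computed elsewhere.
    public static double[] trustRank(double[][] t, Integer[] sigma, Set<Integer> goodPages,
                                     int L, double alpha, int M) {
        int n = t.length;
        double[] d = new double[n];
        int picked = 0;
        for (int i = 0; i < L && i < sigma.length; i++)      // invoke the oracle on the top-L candidates
            if (goodPages.contains(sigma[i])) { d[sigma[i]] = 1; picked++; }
        if (picked > 0) for (int i = 0; i < n; i++) d[i] /= picked;   // normalize d
        double[] r = d.clone();
        for (int it = 0; it < M; it++) {                     // r_t = alpha * T * r_t + (1 - alpha) * d
            double[] next = new double[n];
            for (int p = 0; p < n; p++) {
                double sum = 0;
                for (int q = 0; q < n; q++) sum += t[p][q] * r[q];
                next[p] = alpha * sum + (1 - alpha) * d[p];
            }
            r = next;
        }
        return r;
    }
}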

-9-
Consider the graph in Figure 3; note that 5, 6 and 7 are spam nodes. We use Java code and scan the graph as an edge list:

Edge list: 0→1, 0→3, 1→2, 1→8, 2→3, 2→4, 3→4, 3→5, 4→8, 4→1, 4→2, 8→2, 5→7, 6→7, 7→6

Figure 3: Web-graph for TrustRank calculation. Gray nodes are spam pages.
Using inverse PageRank for SelectSeed we obtain:

Page   Inverse PageRank
2      0.09
4      0.08
3      0.08
0      0.08
1      0.06
7      0.05
8      0.04
6      0.04
5      0.04

So s = {0.08, 0.06, 0.09, 0.08, 0.08, 0.04, 0.04, 0.05, 0.04}, which gives σ = {2, 4, 3, 0, 1, 7, 8, 6, 5}. Taking L = 2 we get {2, 4}; both are good, so the oracle gives d = {0, 0, 1/2, 0, 1/2, 0, 0, 0, 0}. Taking α = 0.85 and M = 20, we obtain the TrustRank scores rt shown in Table 3.

Table 3: TrustRank Scores

Page   TrustRank
2      0.24
4      0.22
7      0.13
6      0.11
3      0.10
8      0.09
1      0.06
5      0.04
0      0.00

We can compare this with the PageRank scores in Table 2. We observe that 2 and 4 retain trust; however, the spam pages 7 and 6 go undetected. Also, page 0 is wrongly given a low TrustRank. [2] report significant effectiveness of TrustRank in detecting spam, and pay a lot of attention to the systematic way of selecting seed pages and the rationale behind it.

IV. Web Spam Taxonomy:
As a first step in gearing up for counter-measures, it is prudent to understand the spammers' 'arsenal'. This section elucidates attempts to organize web-spamming techniques into a taxonomy. It also briefly brushes over published statistics about web-spam.

There have been discussions in the literature and on the web, but we draw heavily from [4]. We use two terms: importance, the ranking of a page in general, and relevance, the ranking of a page with respect to a specific query.

Boosting Techniques
    Link Spamming
        Inlink: Honey Pot, Directory, Wiki, Link Exchange, Expired Domains, Farm
        Outlink: Directory clone
    Term Spamming
        Dumping
        Weaving

Figure 4: Web-Spam taxonomy¹

¹ We modify the taxonomy proposed by [4] for our purpose.

A. Link Spamming Techniques:
To delve into link spamming let’s categorize pages according to the way they can be
manipulated by spammers to influence results:
a. Inaccessible pages: Spammers cannot modify these pages. However, they can
point to them.
b. Accessible pages: These pages don’t belong to the spammer, but they can modify
the content on these pages, in a limited manner. Typical examples are: wikis,
comments on blogs.
c. Own pages: The spammer wants to boost the ranking of one or more of these pages, the target pages t. There is a cap on the budget for these (e.g. web-hosting costs).

The target algorithms are HITS, PageRank, TrustRank, etc.


HITS:
HITS ranks hub and authority pages [10]. For HITS, the spammer can easily obtain high hub scores by adding outlinks to popular websites. Some spammers even pay users of highly ranked .edu sites to add links pointing to their spammy pages. The spammer can then obtain high authority scores by having his unscrupulous hub pages point to a target page, which thereby becomes a (spam) authority page.

Figure 5: Spamming for HITS
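For reference, a small sketch of the basic hub/authority iteration that such spamming targets (a textbook formulation, not any particular engine's implementation):

import java.util.Arrays;

// Basic HITS iteration. adj[p][q] = true if page p links to page q.
public class Hits {
    public static double[][] hubsAndAuthorities(boolean[][] adj, int iterations) {
        int n = adj.length;
        double[] hub = new double[n], auth = new double[n];
        Arrays.fill(hub, 1.0);
        Arrays.fill(auth, 1.0);
        for (int it = 0; it < iterations; it++) {
            double[] newAuth = new double[n], newHub = new double[n];
            for (int p = 0; p < n; p++)
                for (int q = 0; q < n; q++)
                    if (adj[p][q]) newAuth[q] += hub[p];      // authority: sum of hub scores of in-links
            for (int p = 0; p < n; p++)
                for (int q = 0; q < n; q++)
                    if (adj[p][q]) newHub[p] += newAuth[q];   // hub: sum of authority scores of out-links
            normalize(newAuth); normalize(newHub);
            auth = newAuth; hub = newHub;
        }
        return new double[][]{hub, auth};
    }

    private static void normalize(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        s = Math.sqrt(s);
        if (s > 0) for (int i = 0; i < v.length; i++) v[i] /= s;
    }
}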

PageRank:
For any set of pages A, the PageRank can be written as [12]:
PR(A) = PRstatic(A) + PRin(A) − PRout(A) − PRsink(A)
This decomposition is used in Section V.

Techniques:
Outgoing links: The spammer can manually add well-known links, but the smarter option is directory cloning: copying entire directory sites, like the DMOZ Open Directory, into one's own pages.
Incoming links:
• Creating a honey-pot:
The idea is to provide some useful resource (e.g. articles or documents) and have good sites link to it. These goodies could themselves be stolen (e.g. hosting a copy of Wikipedia with all outlinks changed to point to one's own pages).
• Infiltrating a web directory:
Directories are usually highly ranked, and a spammer can trick a directory's webmaster into allowing links to his pages.
• Wikis, blogs, guest-books, unmoderated message-boards:
Spammers post links wherever the public can write. A quick fix has been tools and bloggers maintaining white-lists of commenters, but all these make it harder to obtain feedback and affect the way people blog.
• Link exchange:
Spammers sometimes resort to mutual promotion.
• Expired domains:
Spammers take advantage of the high ranks conveyed by old links that still point to expired domains.
• Creating one's own spam farm:
Spammers battle ever-new prevention techniques by building link structures to which popular algorithms are vulnerable. They own large numbers of domains these days.

B. Term Spamming:
There are several fields on a web page which can be relevant to a query; these include the body, title, meta tags, anchor text and URL. Rigging these text fields to make pages appear relevant is term-spamming.

Target algorithm:
The TFIDF metric [13] has been widely used in information retrieval. The TFIDF score of a page p for a query q is computed as a sum over every term t common to p and q:

TFIDF(p,q) = Σ_{t ∈ p ∩ q} TF(t) · IDF(t)

where TF(t) is the (normalized) number of times term t appears in the document, and IDF(t), the inverse document frequency, is a measure of the general importance of the term.

Thus, spammers can try to make a page relevant to many queries, by including a large number of distinct terms, or relevant to some specific query, by repeating particular terms.
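A toy sketch of the kind of scoring being gamed (raw counts and a simple log IDF; a real engine's weighting and normalization are of course more elaborate):

import java.util.*;

// Toy TF-IDF: score(p, q) = sum over terms t common to page p and query q of TF(t, p) * IDF(t).
public class TfIdf {
    public static double score(List<String> page, Set<String> query, List<List<String>> corpus) {
        double score = 0;
        for (String term : query) {
            long tf = page.stream().filter(term::equals).count();            // raw term frequency in p
            if (tf == 0) continue;
            long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
            double idf = Math.log((double) corpus.size() / (1 + docsWithTerm));
            score += tf * idf;
        }
        return score;
    }
}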

Some prominent techniques:

Anchor-tag spam: This means spamming the anchor text of links that point to the target spam page. It thus affects the ranking of both the source and the target page.
<a href="target.html">free, cheap, mortgage, free</a>

Dumping:
Spammers build pages containing a large number of terms, even whole dictionaries. This makes them relevant to at least some obscure terms.

Weaving:
Interleave spam terms into relevant content; e.g. host a Wikipedia clone and randomly insert the repeated spam terms throughout its pages.

C. Hiding Techniques:

It is important for spammers to conceal their intent from a human visitor. Two techniques
used here are:

Content Hiding:
This involves making the spam invisible on the rendered page. This can be done by matching the spam text's color to the background color, or by using tiny 1×1-pixel anchor images.

Cloaking:
Spammers serve one version of a page to human users and a different one to crawlers. This is done by keeping track of the IP addresses of crawlers and serving them different content.

V. Optimal Spamming Structures:
There is sizable literature on how link structures should be organized to spam ranking algorithms. It is open to question whom this serves: the researchers or the spammers. We present here a small summary of some notable work in this area. We use some mathematical statements, but omit the proofs.

Optimal Spam Link Farms:

First we look at some optimal spam-farm structures for boosting the rank of a set of target pages t, using some boosting pages b, which are under the spammer's control, and hijacked pages h, which are not controlled by the spammer but on which he can place some outlinks. The rank contributed by the hijacked pages is the leakage, λ.

Single-target spam farm model:

Consider the single-target spam farm model of [14]. The score of the single target page t is maximal if:
• every bi ∈ b points to t, and to t alone;
• there is no link between any two boosting pages bi, bj ∈ b;
• t points to some or all bi ∈ b;
• every hi ∈ h points to t.
In fact, it has been shown that leakage has the same effect as boosting pages and need not be treated separately [14].

Figure 6: Single-target spam farm

Alliances of 2 spam-farms:
Consider the case where a group of spammers already have spam-farms and want to mutually boost their rankings by interconnecting them. [14] elucidate that the optimal way to link two spam-farms is to connect the two target pages and remove all links to boosting pages.

Figure 7: Alliance of 2 spam-farms

Web Rings and Complete Cores:


[14] present an analysis extending the idea of alliances to more than 2 spam-farms. The set of target pages is called the core. They find that both a ring and a complete sub-graph of target pages yield a PageRank for the target pages higher than is possible for an optimal unconnected page.

[14] also explore the budgetary and other considerations involved in entering and leaving a spam-farm.

We verify these claims using Java code; the edge lists we tried and the resulting PageRank scores are listed below.
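The harness is along the following lines (a sketch with our own names, here loaded with the first edge list below; the iteration matches the PageRank formulation of Section III):

import java.util.*;

// Builds the transition matrix from an edge list and prints pages sorted by PageRank,
// so that different farm / alliance structures can be compared side by side.
public class FarmExperiment {
    public static void main(String[] args) {
        int n = 9;
        int[][] edges = {{0,1},{0,3},{1,2},{1,3},{2,3},{2,4},{3,4},{3,5},
                         {4,8},{8,2},{4,1},{5,6},{6,7},{7,5},{5,3},{5,2}};
        double[][] t = new double[n][n];
        int[] out = new int[n];
        for (int[] e : edges) out[e[0]]++;
        for (int[] e : edges) t[e[1]][e[0]] = 1.0 / out[e[0]];

        double alpha = 0.85;
        double[] r = new double[n];
        Arrays.fill(r, 1.0 / n);
        for (int it = 0; it < 20; it++) {
            double[] next = new double[n];
            for (int p = 0; p < n; p++) {
                double sum = 0;
                for (int q = 0; q < n; q++) sum += t[p][q] * r[q];
                next[p] = alpha * sum + (1 - alpha) / n;     // uniform random jump
            }
            r = next;
        }
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        final double[] scores = r;
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        for (int v : order) System.out.println(v + "\t" + scores[v]);
    }
}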


Edge list (1): 0→1, 0→3, 1→2, 1→3, 2→3, 2→4, 3→4, 3→5, 4→8, 8→2, 4→1, 5→6, 6→7, 7→5, 5→3, 5→2

Vertex   PageRank
3        0.180987174
2        0.174505699
4        0.167751157
5        0.150629733
1        0.095044252
8        0.087960919
7        0.067110003
6        0.059345096
0        0.016666667

Edge list (2): 0→1, 0→3, 1→2, 1→3, 2→3, 2→4, 3→4, 3→5, 4→8, 8→2, 4→1, 5→6, 6→7, 7→5

Vertex   PageRank
5        0.21001205
6        0.19517694
7        0.18256711
4        0.09383356
2        0.09177315
3        0.08979602
1        0.06362926
8        0.05654593
0        0.01666667

Edge list (3): 0→1, 0→3, 1→2, 1→3, 2→3, 2→4, 3→4, 3→5, 4→8, 8→2, 4→1, 5→7, 6→7, 7→6

Vertex   PageRank
7        0.2790591
6        0.25386702
4        0.09383356
2        0.09177315
3        0.08979602
1        0.06362926
8        0.05654593
5        0.05482998
0        0.01666667

Edge list (4): 0→1, 0→3, 1→2, 1→3, 2→3, 2→4, 3→4, 3→5, 4→1, 5→7, 6→7, 7→8, 8→7, 1→10, 4→11, 10→9, 11→9, 9→10

Vertex   PageRank
7        0.20775228
9        0.19972227
10       0.19182597
8        0.18908964
4        0.03749541
3        0.03675077
1        0.03374805
11       0.02843555
5        0.02811908
2        0.02206195
6        0.0125
0        0.0125

Edge list (5): 0→1, 0→3, 1→2, 1→3, 2→3, 2→4, 3→4, 3→5, 4→1, 5→7, 6→7, 7→8, 8→7, 1→10, 4→11, 10→9, 11→9, 9→10, 7→9, 9→7

Vertex   PageRank
7        0.26554847
9        0.26345375
10       0.13402978
8        0.12535816
4        0.03749541
3        0.03675077
1        0.03374805
11       0.02843555
5        0.02811908
2        0.02206195
6        0.0125
0        0.0125

VI. Tweaking Ranking Algorithms: a solution to link-
spam?
A. Antitrust and Distrust Ranking:

[5, 3,15] have suggested a distrust back propagation method. Intuitively, just as trust
disseminates forward from a set of known good pages, distrust can also be imagined to
move out of a seed of known spam pages; however, distrust should propagate backward.
The idea is that pages pointing to spam pages are themselves very likely to be spam.

The algorithm is analogous to TrustRank:
Step 1: Seed: to find seed pages, PageRank can be used.
Step 2: G′, the transpose of the web graph, is computed.
Step 3: The biased PageRank algorithm is applied on G′.

Let us apply Distrust Rank to the graph in Figure 3. We take L = 2. From Table 2 we know that the seed set {7, 6} will be selected; little surprise, then, that they pass on distrust to 5, which is the only page pointing to them.

Table 4: Distrust Scores

Page   DistrustRank
7      0.22
6      0.17
5      0.09
3      0.09
2      0.05
0      0.05
4      0.03
1      0.02
8      0.01

[3] report that the Antitrust Rank algorithm has a better chance than TrustRank of finding high-PageRank spam pages, as it starts with a seed set of spam pages that themselves have high PageRank.

B. Combining Trust and Distrust:

How does one combine trust and distrust? One method, which we are exploring, is to use some sort of weighted sum of trust and distrust. We plan to test this on the web dataset crawled by Webaroo, a company based at IIT Bombay.

[15] gives a naïve discussion on combining Distrust Rank and TrustRank. Webaroo seem to have an interesting hack for combining distrust and trust. [9] present an in-depth abstract framework for trust and distrust propagation; exploring further here might lead to some answers.
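A very simple way to realize the weighted-sum idea (the combiner and its weight β are purely illustrative, not a method from the cited papers):

// Combine trust and distrust into a single score; beta controls how strongly distrust penalizes a page.
public class TrustDistrustCombiner {
    public static double[] combine(double[] trust, double[] distrust, double beta) {
        double[] combined = new double[trust.length];
        for (int p = 0; p < trust.length; p++)
            combined[p] = trust[p] - beta * distrust[p];   // weighted difference; can go negative for spam
        return combined;
    }
}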

C. Truncated PageRank:
Intuitively, if we ignore the direct contribution of the first few levels of links, we get a truer picture of the real rank of a page [16]. Spammers can afford to influence only a few levels, and algorithms should be able to see through them easily.

They suggest a generalization of the PageRank equation:

r = Σ_{v=0..∞} (damping(v)/N) · T^v · 1_N

where damping(v) decreases with v. For PageRank,

damping(v) = (1 − α) · α^v

To demote the immediate supporters of a page, the damping function may be redefined as:

damping(v) = (1 − α) · α^v  for v > V,  and 0 otherwise

Figure 8: truncated damping function

Truncated PageRank is easily obtained from snapshots of PageRank:

r^(0) = (C/N) · 1_N;   r^(v) = α · T · r^(v−1);   r_trunc = Σ_{v=V+1} r^(v)

where r^(v) is the PageRank snapshot at level v.

[16] themselves explain that Truncated PageRank is not intended to replace PageRank; rather, it can be used as a feature to classify spam pages, as discussed in Section VII.
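A sketch of the snapshot computation above (our own transcription; we take C = 1 − α so that the untruncated sum reproduces PageRank, and we stop the sum at a finite maximum level):

import java.util.Arrays;

// Truncated PageRank: sum PageRank "snapshots" but drop the first V + 1 levels,
// so that a page gains nothing from supporters closer than V + 1 hops.
public class TruncatedPageRank {
    public static double[] rank(double[][] t, double alpha, int V, int maxLevel) {
        int n = t.length;
        double c = 1 - alpha;                       // normalization: the untruncated sum gives PageRank
        double[] snapshot = new double[n];
        Arrays.fill(snapshot, c / n);               // r^(0) = (C/N) * 1
        double[] truncated = new double[n];
        for (int v = 1; v <= maxLevel; v++) {
            double[] next = new double[n];
            for (int p = 0; p < n; p++) {
                double sum = 0;
                for (int q = 0; q < n; q++) sum += t[p][q] * snapshot[q];
                next[p] = alpha * sum;              // r^(v) = alpha * T * r^(v-1)
            }
            snapshot = next;
            if (v > V)                              // accumulate only levels beyond the truncation point
                for (int p = 0; p < n; p++) truncated[p] += snapshot[p];
        }
        return truncated;
    }
}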

D. Topical TrustRank:
[17] propose to calculate TrustRank scores for different topics, instead of a single TrustRank score, with each score representing the trustworthiness of a site within that particular topic. A combination of these scores should present a better measure of the overall trustworthiness of a site. The interesting part is how to combine the different TrustRank scores; the options are simple summation and quality bias, where quality bias weights each topic, possibly by its average PageRank. They also emphasize seed selection, suggesting seed weighting instead of assigning equal weights to the seeds in d. Quality bias, unfortunately, is no answer to the problem of combining trust and distrust.
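A sketch of the two combination schemes just described (simple summation and quality bias; the names and array layout are ours):

// Combine per-topic TrustRank scores for a page into one trust value.
public class TopicalTrust {
    // topicScores[k][p]: TrustRank of page p computed with a topic-k seed set;
    // topicWeight[k]: e.g. the average PageRank of topic k's pages (quality bias), or 1.0 for simple summation.
    public static double[] combine(double[][] topicScores, double[] topicWeight) {
        int n = topicScores[0].length;
        double[] combined = new double[n];
        for (int k = 0; k < topicScores.length; k++)
            for (int p = 0; p < n; p++)
                combined[p] += topicWeight[k] * topicScores[k][p];
        return combined;
    }
}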

VII. Statistics about pages: features to classify link-
spam:

Figure 9: Distribution of in-degree and out-degree of web-pages[18]


[8] report this distribution of in-degree and out-degree of pages. They also suggest that, in distributions of such per-page statistics, the outliers tend to be spam.

Different features of pages and link-structures can be used to classify spam pages.
[18]2 use:
• Degree-based measures
• PageRank
• TrustRank
• Truncated PageRank
• Estimation of supporters
They compute these metrics only for the page with maximum PageRank. It would be an interesting exercise to cross-check their findings, and to come up with similar features that can be tested independently or in combination with these.
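As an illustration of how such features could feed a crude rule-based detector (the thresholds below are invented for illustration only; they are not values reported in [18]):

// A crude rule-based spam detector over per-page features; real systems learn such rules from labelled data.
public class SpamFeatures {
    public static boolean looksLikeSpam(double pageRank, double trustRank,
                                        double truncatedPageRank, double neighbourPrStdDev) {
        boolean lowTrustForRank = trustRank < 0.1 * pageRank;             // ranks high but little trust reaches it
        boolean dropsWhenTruncated = truncatedPageRank < 0.5 * pageRank;  // most support comes from nearby pages
        boolean uniformNeighbours = neighbourPrStdDev < 1e-4;             // suspiciously regular link farm
        return (lowTrustForRank && dropsWhenTruncated) || (dropsWhenTruncated && uniformNeighbours);
    }
}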

2
δ is a parameter they use to measure significance for Kolmogorov-Smirnov tests.

Degree-based measures:
They find no significant difference between in-degree and out-degree of spam pages and
normal pages. They also report that edge-reciprocity shows a marked difference.

PageRank:

Figure 10: (Left) Histogram of PageRank of normal and spam pages. (Right) Histogram of the standard deviation of PageRank of neighbors. [18]

[18] find that most spam home pages lie in a particular narrow PageRank strip. Also, the PageRank of the neighbors of a spam home page shows little dispersion.

TrustRank:
TrustRank scores also show a marked difference. Combinations of degree correlations, PageRank and TrustRank seem to yield good results. [18]

Figure 11: Histogram of TrustRank of normal and spam pages. [18]

Truncated PageRank:
Truncated PageRank proves particularly useful (see Figure 12)³. This means that spam pages lose a large part of their score when truncated at level 4.

Figure 12: Histogram of TruncatedPageRank(V=4)/PageRank. [18]

³ The T in the figure is V, the truncation threshold.

VIII. Scope for Future work:
We believe there is a lot of scope in exploring trust and distrust propagation, and algorithms that can combine the two might do well. The analysis of optimal structures for spamming ranking algorithms can provide direction for improving those algorithms. Combinations of different statistical features might throw up some nice surprises regarding the differences between the distributions for spam and normal pages.

Useful inputs from other fields like economics and game theory might open up new vistas. Some literature on monetary constraints and on the analysis of sponsored search has started to surface.

IX. Conclusion:
One may ponder how long we will be able to use links as endorsements, and whether someday search engines will stop using them altogether. Combating web-spam seems to need a combination of term-spam and link-spam detection techniques; hopefully, these approaches are not just orthogonal, but complementary. The spammers' and search engines' goals will, for some time at least, remain conflicting: creating the adversarial web.

X. References:

1. L. Page, S. Brin, R. Motwani, T. Winograd, "The PageRank citation ranking: Bringing order to the Web", 1998.
2. Z. Gyöngyi, H. Garcia-Molina, J. Pedersen, "Combating Web Spam with TrustRank", Proceedings of the 30th VLDB Conference, 2004.
3. V. Krishnan, R. Raj, "Web Spam Detection with Anti-Trust Rank", December 2005.
4. Z. Gyöngyi, H. Garcia-Molina, "Web Spam Taxonomy", First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
5. P. T. Metaxas, J. DeStefano, "Web Spam, Propaganda and Trust".
6. J. Nielsen, "When Search Engines Become Answer Engines", August 2004. http://www.useit.com/alertbox/20040816.html
7. C. Silverstein, M. Henzinger, H. Marais, M. Moricz, "Analysis of a Very Large Web Search Engine Query Log", 1999.
8. D. Fetterly, M. Manasse, M. Najork, "Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages", WebDB, 2004.
9. R. Guha, R. Kumar, P. Raghavan, A. Tomkins, "Propagation of Trust and Distrust", 2004.
10. J. M. Kleinberg, "Authoritative sources in a hyperlinked environment", 1999.
11. T. Haveliwala, "Topic-sensitive PageRank", WWW, 2002.
12. M. Bianchini, M. Gori, F. Scarselli, "Inside PageRank", 2005.
13. R. Baeza-Yates, B. Ribeiro-Neto, "Modern Information Retrieval", Addison Wesley, 1999.
14. Z. Gyöngyi, H. Garcia-Molina, "Link Spam Alliances".
15. BadRank. http://pr.efactory.de/e-pr0.shtml
16. L. Becchetti, C. Castillo, D. Donato, S. Leonardi, R. Baeza-Yates, "Using rank propagation and probabilistic counting for link-based spam detection", Technical report, 2006.
17. B. Wu, V. Goel, B. D. Davison, "Topical TrustRank: Using topicality to combat web spam", WWW, May 2006.
18. L. Becchetti, C. Castillo, D. Donato, S. Leonardi, R. Baeza-Yates, "Link-based characterization and detection of web spam", AIRWeb, 2006.
