Special Topics in Search Engines: Result Summaries Anti-Spamming Duplicate Elimination

Results summaries
Summaries
Having ranked the documents matching a query, we wish to present a results list.
Most commonly, each entry is the document title plus a short summary.
The title is typically automatically extracted from document metadata.
What about the summaries?
Summaries
Two basic kinds: static and dynamic.
A static summary of a document is always the same, regardless of the query that hit the doc.
Dynamic summaries are query-dependent: they attempt to explain why the document was retrieved for the query at hand.
Static summaries
In typical systems, the static summary is a subset of the document.
Simplest heuristic: the first 50 (or so; this can be varied) words of the document. The summary is cached at indexing time.
More sophisticated: extract from each document a set of "key" sentences. Simple NLP heuristics score each sentence, and the summary is made up of the top-scoring sentences.
Most sophisticated: NLP used to synthesize a summary. Seldom used in IR; cf. text summarization work.
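The first two heuristics can be sketched as follows. This is only an illustration: the function names are hypothetical, and the sentence score (average document-wide frequency of the words in a sentence) is an assumed stand-in for the "simple NLP heuristics", which the slides do not specify.

```python
import re

def first_words_summary(text, n_words=50):
    """Simplest heuristic: the first n_words words of the document."""
    return " ".join(text.split()[:n_words])

def key_sentence_summary(text, n_sentences=2):
    """Keep the top-scoring sentences, presented in document order.

    The score is an assumption made for this sketch: the average
    document-wide frequency of the words in the sentence.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        freq[word] = freq.get(word, 0) + 1

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)
```

Either function would run once per document at indexing time, so the result can be cached alongside the index entry.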
Dynamic summaries
Present one or more "windows" within the document that contain several of the query terms.
These are "KWIC" snippets: Keyword in Context presentation.
Generated in conjunction with scoring:
If the query is found as a phrase, show the/some occurrences of the phrase in the doc.
If not, show windows within the doc that contain multiple query terms.
The summary itself gives the entire content of the window: all terms, not only the query terms. How?
Generating dynamic summaries
If we have only a positional index, we cannot (easily) reconstruct the context surrounding hits.
If we cache the documents at index time, we can run the window through the cached copy, cueing to hits found in the positional index.
E.g., the positional index says "the query is a phrase in position 4378", so we go to this position in the cached document and stream out the content.
Most often, cache a fixed-size prefix of the doc.
Note: the cached copy can be outdated.
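A minimal sketch of this lookup, assuming the cached document is available as a token list and the positional index yields, per query term, the token offsets of its hits in that document. The function names and the fixed window width are illustrative assumptions, not part of the slides.

```python
def best_window(positions, doc_len, window=8):
    """Pick the window start that covers the most distinct query terms.

    positions: query term -> token offsets of its hits in this document,
    as read from the positional index (hypothetical layout).
    """
    hits = sorted((p, t) for t, plist in positions.items() for p in plist)
    best_start, best_cover = 0, -1
    for start, _ in hits:
        covered = {t for p, t in hits if start <= p < start + window}
        if len(covered) > best_cover:
            best_start, best_cover = start, len(covered)
    return best_start, min(doc_len, best_start + window)

def kwic_snippet(cached_tokens, positions, window=8):
    """Stream out every token in the chosen window, not only the query terms."""
    start, end = best_window(positions, len(cached_tokens), window)
    return " ".join(cached_tokens[start:end])
```

If only a fixed-size prefix of the document is cached, hits beyond the prefix cannot be shown; the slice above then simply comes back short or empty.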
Dynamic summaries
Producing good dynamic summaries is a tricky optimization problem
The real estate for the summary is normally small and fixed.
Want a short item, so show as many KWIC matches as possible, and perhaps other things like the title.
Want snippets to be long enough to be useful.
Want linguistically well-formed snippets: users prefer snippets that contain complete phrases.
Want snippets maximally informative about the doc.
But users really like snippets, even if they complicate IR system design.
Anti-spamming
Adversarial IR (Spam)
Motives: commercial, political, religious, lobbies. Promotion funded by advertising budget.
Operators: contractors (Search Engine Optimizers) for lobbies and companies; web masters; hosting services.
Forum: Web master world (www.webmasterworld.com). Search engine specific tricks. Discussions about academic papers.
Search Engine Optimization I: Adversarial IR ("search engine wars")
Can you trust words on the page?
[Figure: screenshots of auctions.hitsoffice.com/ (pornographic content) and www.ebay.com/. Examples from July 2002.]
Simplest forms
Early engines relied on the density of terms
The top-ranked pages for the query maui resort were the ones containing the most occurrences of maui and resort.
SEOs responded with dense repetitions of chosen terms, e.g., maui resort maui resort maui resort.
Often, the repetitions would be in the same color as the background of the web page, so the repeated terms got indexed by crawlers but were not visible to humans in browsers.
Can't trust the words on a web page for ranking.
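To see why density-based ranking is easy to game, here is a toy scorer. The scoring function is an assumed simplification for illustration, not the formula of any actual engine.

```python
def term_density(page_text, query):
    """Fraction of page tokens that are query terms (toy density score)."""
    tokens = page_text.lower().split()
    query_terms = set(query.lower().split())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in query_terms) / len(tokens)

honest = "our maui resort offers ocean views and a quiet beach"
stuffed = "maui resort " * 20 + "cheap pills"
```

Under this score the stuffed page outranks the honest one for the query maui resort, even though a human reader would never see the hidden repetitions.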
A few spam technologies
Cloaking: serve fake content to the search engine robot. DNS cloaking: switch IP address; impersonate.
Doorway pages: pages optimized for a single keyword that re-direct to the real target page.
Keyword spam: misleading meta-keywords, excessive repetition of a term, fake "anchor text". Hidden text with colors, CSS tricks, etc.
Link spamming: mutual admiration societies, hidden links, awards. Domain flooding: numerous domains that point or re-direct to a target page.
Robots: fake click stream, fake query stream. Millions of submissions via Add-Url.

More spam techniques
Cloaking
Serve fake content to the search engine spider. DNS cloaking: switch IP address; impersonate.
[Figure: the cloaking server asks "Is this a search engine spider?" If yes (Y), it serves the SPAM page; otherwise it serves the real doc.]
Tutorial on Cloaking & Stealth Technology
Variants of keyword stuffing

Misleading meta-tags, excessive repetition.
Hidden text with colors, style sheet tricks, etc.
Meta-Tags = "London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, ..."
More spam techniques
Doorway pages: pages optimized for a single keyword that redirect to the real target page.
Link spamming: mutual admiration societies, hidden links, awards (more on these later). Domain flooding: numerous domains that point or re-direct to a target page.
Robots: fake query stream from rank checking programs, which "curve-fit" the ranking programs of search engines. Millions of submissions via Add-Url.
The war against spam

Quality signals: prefer authoritative pages based on votes from authors (linkage signals) and votes from users (usage signals).
Policing of URL submissions: anti robot test.
Limits on meta-keywords.
Robust link analysis: ignore statistically implausible linkage (or text); use link analysis to detect spammers (guilt by association).
The war against spam
Spam recognition by machine learning: training set based on known spam.
Family friendly filters: linguistic analysis, general classification techniques, etc. For images: flesh tone detectors, source text analysis, etc.
Editorial intervention: blacklists; top queries audited; complaints addressed.
Acid test
Which SEOs rank highly on the query seo?
Web search engines have policies on the SEO practices they tolerate/block. See pointers in Resources.
Adversarial IR: the unending (technical) battle between SEOs and web search engines. See for instance http://airweb.cse.lehigh.edu/
Duplicate detection
Duplicate/Near-Duplicate Detection

Duplication: exact match with fingerprints.
Near-duplication: approximate match.
Overview
Compute syntactic similarity with an edit-distance measure.
Use a similarity threshold to detect near-duplicates, e.g., similarity > 80% => documents are "near duplicates".
Not transitive, though sometimes used transitively.
Computing Similarity

Segments of a document (natural or artificial breakpoints) [Brin95].
Shingles (word k-grams) [Brin95, Brod98], e.g., "a rose is a rose is a rose" => a_rose_is_a, rose_is_a_rose, is_a_rose_is.
Similarity measure between two docs (= sets of shingles):
Set intersection [Brod98], specifically Size_of_Intersection / Size_of_Union (the Jaccard measure).
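The shingling example and the Jaccard measure can be reproduced with a few lines. This is a sketch under simple assumptions: words are split on whitespace, k = 4 as in the rose example, and the function names are illustrative.

```python
def shingles(text, k=4):
    """Set of word k-grams, joined with underscores as in the example."""
    words = text.split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1, s2):
    """Size_of_Intersection / Size_of_Union for two shingle sets."""
    if not s1 and not s2:
        return 1.0  # convention chosen here for two empty sets
    return len(s1 & s2) / len(s1 | s2)

rose = shingles("a rose is a rose is a rose")
flower = shingles("a rose is a rose is a flower")
```

Here rose has three distinct shingles, flower has those three plus rose_is_a_flower, so their Jaccard similarity is 3/4.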
Shingles + Set Intersection

Computing the exact set intersection of shingles between all pairs of documents is expensive.
Approximate using a cleverly chosen subset of shingles from each (a sketch).
Estimate Jaccard from a short sketch:
Create a "sketch vector" (e.g., of size 200) for each document.
Documents which share more than t (say 80%) corresponding vector elements are similar.
For doc d, sketch_d[i] is computed as follows:
Let f map all shingles in the universe to 0..2^m.
Let π_i be a specific random permutation on 0..2^m.
Pick MIN π_i(f(s)) over all shingles s in d.
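A minimal sketch of the sketch-vector computation, with two loudly stated assumptions: f is taken to be a 64-bit BLAKE2b fingerprint, and each "random permutation" is approximated by a linear map x -> (a*x + b) mod P for a prime P larger than 2^64. Such a map is injective on the fingerprints but is not a truly random permutation. All names are illustrative, and the shingle sets are assumed non-empty.

```python
import hashlib
import random

P = (1 << 89) - 1  # a Mersenne prime larger than 2**64

def fingerprint(shingle):
    """f: map a shingle to a 64-bit integer (BLAKE2b is an assumption here)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def make_permutations(n=200, seed=0):
    """n pseudo-random linear maps standing in for the permutations pi_i."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(n)]

def sketch(shingle_set, perms):
    """sketch[i] = MIN over shingles s of pi_i(f(s)); needs a non-empty set."""
    fps = [fingerprint(s) for s in shingle_set]
    return [min((a * x + b) % P for x in fps) for a, b in perms]

def estimated_resemblance(sk1, sk2):
    """Fraction of corresponding sketch elements that agree."""
    return sum(1 for u, v in zip(sk1, sk2) if u == v) / len(sk1)
```

Two documents would then be flagged as near-duplicates when estimated_resemblance exceeds the threshold t, using the same list of permutations for both sketches.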
Shingling with sampling minima

Given two documents A1, A2, let S1 and S2 be their shingle sets.
Resemblance = |Intersection of S1 and S2| / |Union of S1 and S2|.
Let Alpha = min(π(S1)) and Beta = min(π(S2)).
Probability(Alpha = Beta) = Resemblance.
Computing Sketch[i] for Doc1
Start with 64-bit shingles, i.e., points on the number line from 0 to 2^64.
Permute on the number line with π_i.
Pick the min value.
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Figure: the permuted shingles of Document 1 and Document 2 on number lines from 0 to 2^64, with minima A and B. Are these equal?]
Test for 200 random permutations: π_1, π_2, ..., π_200.
However
[Figure: Document 1 and Document 2 on number lines from 0 to 2^64, with minima A and B.]
A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection).
This happens with probability Size_of_intersection / Size_of_union. Why?
Set Similarity
Set similarity (Jaccard measure):
$$\mathrm{sim}_J(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|}$$
View sets as columns of a matrix, with one row for each element in the universe; a_ij = 1 indicates presence of item i in set j.
Example:
C1  C2
0   1
1   0
1   1
0   0
1   1
0   1
sim_J(C1, C2) = 2/5 = 0.4
Key Observation
For columns C_i, C_j, there are four types of rows:
Type  C_i  C_j
A     1    1
B     1    0
C     0    1
D     0    0
Overload notation: A = # of rows of type A (and likewise for B, C, D).
Claim:
$$\mathrm{sim}_J(C_i, C_j) = \frac{A}{A + B + C}$$
Min Hashing

Randomly permute rows.
h(C_i) = index of first row with 1 in column C_i.
Surprising property:
$$P[h(C_i) = h(C_j)] = \mathrm{sim}_J(C_i, C_j)$$
Why? Both are A/(A+B+C). Look down columns C_i, C_j until the first non-type-D row; h(C_i) = h(C_j) iff that row is a type A row.
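The property can be checked exactly on the six-row example columns (C1 = 0 1 1 0 1 0, C2 = 1 0 1 0 1 1) by enumerating every row permutation rather than sampling. The helper names are illustrative, and exhaustive enumeration is only feasible because the example is tiny.

```python
from fractions import Fraction
from itertools import permutations

C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]

def jaccard_columns(ci, cj):
    """A / (A + B + C), counting row types over the two columns."""
    a = sum(1 for x, y in zip(ci, cj) if x and y)   # type A rows
    b_plus_c = sum(1 for x, y in zip(ci, cj) if x != y)  # type B and C rows
    return Fraction(a, a + b_plus_c)

def min_hash(column, order):
    """Rank, in the permuted order, of the first row holding a 1."""
    for rank, row in enumerate(order):
        if column[row]:
            return rank
    return None

def collision_probability(ci, cj):
    """Exact P[h(Ci) = h(Cj)] over all permutations of the rows."""
    orders = list(permutations(range(len(ci))))
    agree = sum(1 for o in orders if min_hash(ci, o) == min_hash(cj, o))
    return Fraction(agree, len(orders))
```

Both quantities come out to 2/5 for this example: of the five non-type-D rows, two are type A, and the hashes agree exactly when one of those two comes first.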
Mirror Detection
Mirroring is the systematic replication of web pages across hosts. It is the single largest cause of duplication on the web.
Host1/a and Host2/b are mirrors iff, for all (or most) paths p, when http://Host1/a/p exists, http://Host2/b/p exists as well with identical (or near identical) content, and vice versa.
Mirror Detection example
http://www.elsevier.com/ and http://www.elsevier.nl/
Structural Classification of Proteins:
http://scop.mrc-lmb.cam.ac.uk/scop
http://scop.berkeley.edu/
http://scop.wehi.edu.au/scop
http://pdb.weizmann.ac.il/scop
http://scop.protres.ru/
Repackaged mirrors: Auctions.msn.com and Auctions.lycos.com (Aug)
Motivation
Why detect mirrors?
Smart crawling: fetch from the fastest or freshest server; avoid duplication.
Better connectivity analysis: combine inlinks; avoid double counting outlinks.
Redundancy in result listings: "If that fails you can try: <mirror>/samepath".
Proxy caching.
Bottom Up Mirror Detection

[Cho00]
Maintain clusters of subgraphs.
Initialize clusters of trivial subgraphs: group near-duplicate single documents into a cluster.
Subsequent passes: merge clusters of the same cardinality and corresponding linkage; avoid decreasing cluster cardinality.
To detect mirrors we need: adequate path overlap; contents of corresponding pages within a small time range.

Can we use URLs to find mirrors?

[Figure: two graphs, each over nodes a, b, c, d.]
URLs on www.synthesis.org:
www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html
www.synthesis.org/Docs/annual.report96.final.html
www.synthesis.org/Docs/cicee-berlin-paper.html
www.synthesis.org/Docs/myr5
www.synthesis.org/Docs/myr5/cicee/bridge-gap.html
www.synthesis.org/Docs/myr5/cs/cs-meta.html
www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html
www.synthesis.org/Docs/myr5/mech/mech-take-home.html
www.synthesis.org/Docs/myr5/synsys/experiential-learning.html
www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html
www.synthesis.org/Docs/yr5ar
www.synthesis.org/Docs/yr5ar/assess
www.synthesis.org/Docs/yr5ar/cicee
www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html
www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html
URLs on synthesis.stanford.edu:
synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-enhanced
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-
synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-
synthesis.stanford.edu/Docs/annual.report96.final.html
synthesis.stanford.edu/Docs/annual.report96.final_fn.html
synthesis.stanford.edu/Docs/myr5/assessment
synthesis.stanford.edu/Docs/myr5/assessment/assessment-
synthesis.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-
synthesis.stanford.edu/Docs/myr5/assessment/neato-ucb.html
synthesis.stanford.edu/Docs/myr5/assessment/not-available.html
synthesis.stanford.edu/Docs/myr5/cicee
synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html
synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html
synthesis.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html
Top Down Mirror Detection

[Bhar99, Bhar00c]
E.g.,
www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html
What features could indicate mirroring?
Hostname similarity: word unigrams and bigrams: { www, www.synthesis, synthesis, ... }
Directory similarity: positional path bigrams { 0:Docs/ProjAbs, 1:ProjAbs/synsys, ... }
IP address similarity: 3 or 4 octet overlap. Many hosts sharing an IP address => virtual hosting by an ISP.
Host outlink overlap.
Path overlap: potentially, path + sketch overlap.
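The hostname, directory and IP features can be sketched as below. The function names are hypothetical, and two details are assumptions because the slide's examples are elided: hostnames are split on dots, and path bigrams run over every path segment, including the file name.

```python
def hostname_ngrams(host):
    """Word unigrams and bigrams of a hostname, split on dots."""
    parts = host.lower().split(".")
    unigrams = set(parts)
    bigrams = {".".join(parts[i:i + 2]) for i in range(len(parts) - 1)}
    return unigrams | bigrams

def positional_path_bigrams(path):
    """Bigrams of adjacent path segments, tagged with their position."""
    segs = [s for s in path.split("/") if s]
    return {"%d:%s/%s" % (i, segs[i], segs[i + 1]) for i in range(len(segs) - 1)}

def shared_leading_octets(ip1, ip2):
    """Number of leading IPv4 octets two addresses have in common."""
    count = 0
    for a, b in zip(ip1.split("."), ip2.split(".")):
        if a != b:
            break
        count += 1
    return count
```

For instance, www.synthesis.org and synthesis.stanford.edu share only the unigram synthesis, while the two example paths share the bigrams 0:Docs/ProjAbs and 1:ProjAbs/synsys.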
Implementation
Phase I - Candidate Pair Detection
Find features that pairs of hosts have in common.
Compute a list of host pairs which might be mirrors.
Phase II - Host Pair Validation
Test each host pair and determine the extent of mirroring.
Check if 20 paths sampled from Host1 have near-duplicates on Host2 and vice versa.
Use transitive inferences:
IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B)
IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)
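The two transitive inference rules can be sketched as a lookup over already validated pairs. This is a one-step illustration only: the data layout (a dict keyed by unordered host pairs) and the function name are assumptions, and Mirror is treated as symmetric, following the "and vice versa" in the definition of mirrors.

```python
def infer_mirror(known, a, b):
    """Return True, False, or None (unknown) for the host pair (a, b).

    known: dict mapping frozenset({host1, host2}) -> bool, holding the
    pairs already tested directly (hypothetical layout).
    """
    direct = known.get(frozenset((a, b)))
    if direct is not None:
        return direct
    hosts = {h for pair in known for h in pair} - {a, b}
    for x in sorted(hosts):
        ax = known.get(frozenset((a, x)))
        xb = known.get(frozenset((x, b)))
        if ax is True and xb is True:
            return True   # Mirror(A,x) AND Mirror(x,B) => Mirror(A,B)
        if (ax is True and xb is False) or (ax is False and xb is True):
            return False  # Mirror(A,x) AND !Mirror(x,B) => !Mirror(A,B)
    return None
```

Each successful inference saves a direct test, that is, fetching and comparing the sampled paths for that pair of hosts.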
Evaluation

140 million URLs on 230,000 hosts (1999).
Best approach combined 5 sets of features.
Top 100,000 host pairs had precision = 0.57 and recall = 0.86.
WebIR Infrastructure
Connectivity Server: fast access to links to support link analysis.
Term Vector Database: fast access to document vectors to augment link analysis.
Connectivity Server
[CS1: Bhar98b, CS2 & 3: Rand01]
Fast web graph access to support connectivity analysis.
Stores mappings in memory from URL to outlinks and from URL to inlinks.
Applications: HITS and Pagerank computations; crawl simulation; graph algorithms (web connectivity, diameter, etc.; more on this later); visualizations.
Usage
Input: graph algorithm + URLs + values. URLs are translated to FPs and then to IDs.
Execution: the graph algorithm runs in memory.
Output: IDs are translated back to URLs, giving URLs + values.
Translation tables on disk:
URL text: 9 bytes/URL (compressed from ~80 bytes)
FP(64b) -> ID(32b): 5 bytes
ID(32b) -> FP(64b): 8 bytes
ID(32b) -> URLs: 0.5 bytes
ID assignment
Partition URLs into 3 sets, sorted lexicographically:
High: Max degree > 254
Medium: 254 > Max degree > 24
Low: remaining (75%)
IDs are assigned in sequence (densely).
E.g., HIGH IDs: Max(indegree, outdegree) > 254

ID        URL
9891      www.amazon.com/
9912      www.amazon.com/jobs/
9821878   www.geocities.com/
40930030  www.google.com/
85903590  www.yahoo.com/

Adjacency lists
In-memory tables for outlinks and inlinks.
List index maps from a source ID to the start of its adjacency list.
Adjacency List Compression - I

List index: 104, 105, 106
Sequence of adjacency lists: 98 132 153 | 98 147 153
Delta-encoded adjacency lists: -6 34 21 | -8 49 6
The first delta in a list is relative to the source ID (98 - 104 = -6); later deltas are relative to the previous entry (132 - 98 = 34).
Adjacency list:
Smaller delta values are exponentially more frequent (80% of links are to the same host).
Compress deltas with variable-length encoding (e.g., Huffman).
List index pointers: 32b for high, Base+16b for med, Base+8b for low. Avg = 12b per pointer.
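Delta encoding of one adjacency list can be sketched as below, using the list for source ID 104 from the slide. The slide suggests Huffman coding for the deltas; the zigzag-plus-varint packing shown here is a simpler stand-in chosen for this sketch, not the scheme the connectivity server used, and all function names are illustrative.

```python
def delta_encode(source_id, adjacency):
    """First delta is relative to the source ID, the rest to the previous entry."""
    deltas, prev = [], source_id
    for target in adjacency:
        deltas.append(target - prev)
        prev = target
    return deltas

def delta_decode(source_id, deltas):
    """Invert delta_encode by accumulating the deltas."""
    out, prev = [], source_id
    for d in deltas:
        prev += d
        out.append(prev)
    return out

def zigzag(n):
    """Map signed to unsigned so small magnitudes stay small (|n| < 2**63)."""
    return (n << 1) ^ (n >> 63)

def varint(n):
    """Variable-length byte encoding: 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def pack(deltas):
    """Concatenate the varint encodings of the zigzagged deltas."""
    return b"".join(varint(zigzag(d)) for d in deltas)
```

With this packing the three links of ID 104 take three bytes in total, compared with twelve bytes as raw 32-bit IDs.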
Adjacency List Compression - II
Inter List Compression
Basis: similar URLs may share links. Close in ID space => adjacency lists may overlap.
Approach: define a representative adjacency list for a block of IDs, either the adjacency list of a reference ID or the union of adjacency lists in the block. Represent an adjacency list in terms of deletions and additions when it is cheaper to do so.
Measurements:
Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)
Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM)
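A minimal sketch of representing a list as deletions and additions against a representative list. The cost test (fewer entries than the raw list) is an assumption made for illustration; a real implementation would compare encoded bit counts. The function names and tuple layout are hypothetical.

```python
def encode_against_reference(adjacency, reference):
    """Encode as (deletions, additions) relative to the reference when cheaper."""
    adj, ref = set(adjacency), set(reference)
    deletions = sorted(ref - adj)
    additions = sorted(adj - ref)
    if len(deletions) + len(additions) < len(adj):
        return ("ref", deletions, additions)
    return ("raw", sorted(adj))

def decode_against_reference(encoded, reference):
    """Rebuild the sorted adjacency list from either representation."""
    if encoded[0] == "raw":
        return list(encoded[1])
    _, deletions, additions = encoded
    return sorted((set(reference) - set(deletions)) | set(additions))
```

For example, with the reference list 98 132 153, the list 98 147 153 becomes one deletion (132) and one addition (147).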

Term Vector Database

[Stat00]
Fast access to 50 word term vectors for web pages.
Term selection: restricted to the middle 1/3rd of the lexicon by document frequency; top 50 words in the document by TF.IDF.
Term weighting: deferred till run-time (can be based on term freq, doc freq, doc length).
Applications: content + connectivity analysis (e.g., topic distillation); topic specific crawls; document classification.
Performance: storage is 33GB for 272M term vectors; speed is 17 ms/vector on an AlphaServer 4100 (latency to read a disk block).
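The term selection rule can be sketched as follows. The exact TF.IDF variant is not given in the slides, so tf * log(N / df) is an assumption, as are the function names and the tie-breaking by term.

```python
import math

def middle_third_lexicon(doc_freq):
    """Terms in the middle third of the lexicon when sorted by document frequency."""
    terms = sorted(doc_freq, key=lambda t: (doc_freq[t], t))
    n = len(terms)
    return set(terms[n // 3:(2 * n) // 3])

def select_terms(term_counts, doc_freq, n_docs, allowed, k=50):
    """Top-k allowed terms of one document by tf * log(N / df)."""
    scored = [
        (tf * math.log(n_docs / doc_freq[t]), t)
        for t, tf in term_counts.items()
        if t in allowed
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]
```

Only the selected terms and their raw frequencies would need to be stored; since weighting is deferred till run-time, the TF.IDF score is used here for selection only.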
Architecture
[Figure: URLid to Term Vector Lookup. Labels: URLid * 64 / 480; offset; Base (4 bytes); bit vector for 480 URLids; 128 Byte TV Record with URL Info, Terms (LC:TID entries) and Freq (FRQ:RL entries).]
