
ANDHA COLLEGE OF TECHNOLOGY FOR WOMEN

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


CLASS : IV YEAR CSE
SUBJECT : CS6007/INFORMATION RETRIEVAL (Regulation 2013)

UNIT -3 QUESTION BANK

PART A

1. What is the purpose of a web crawler?(nov/dec 16) (apr/may 17)


Or What are the politeness policies used in web crawling?(nov/dec 17)

Web crawlers are mainly used to create a copy of all the visited pages for later processing by
a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be
used for automating maintenance tasks on a Web site, such as checking links or validating HTML
code. Web crawlers use a combination of policies to determine their crawling behavior; such policies
include a selection policy, a revisit policy, a politeness policy and a parallelization policy.

Rules: A Web crawler must identify itself as such, and must not pretend to be a regular Web
user. A Web crawler must obey the robots exclusion protocol (robots.txt). A Web crawler must keep
a low bandwidth usage in a given Web site.

2. What are the requirements of XML information retrieval system?( nov/dec 16)

XML retrieval, or XML information retrieval, is the content-based retrieval of documents


structured with XML (eXtensible Markup Language). As such it is used for computing relevance
of XML documents.

The goal of an XML information retrieval system is focused retrieval: returning document components,
i.e. XML elements, instead of whole documents in response to a user query. Query languages characterize
the information need with respect to both the content and the structure.

Requirements of XML information retrieval systems

 Query language that allows users to specify the nature of relevant components, in particular
with respect to their structure.

 Representation strategies providing a description not only of the content of XML documents,
but also their structure.

 Ranking strategies that determine the most relevant elements and rank these appropriately for
a given query

3. What is inversion in indexing process? (nov/dec 17)

For each term, we have a list that records in which documents the term occurs.
• Each item in the list is conventionally called a posting.
• A posting is a tuple of the form (t_i, d_j), where t_i is a term identifier and d_j is a document identifier.
• The list is called a postings list (or inverted list).
• All the postings lists taken together are referred to as the postings.
• Inverted indexes are independent from the adopted IR model (Boolean model, vector space model,
etc.)
• Each posting usually contains:
1. The identifier of the linked document
2. The frequency of appearance of the term in the document
3. The position of the term for each document (optional)
o Expressed as the number of words from the beginning of the document, the number of bytes, etc.
(a.k.a. a positional posting)
• For each term, the frequency of appearance of the term (its document frequency) is also usually stored in the dictionary

Building an inverted index


o Collect the documents to be indexed
o Tokenize the text, turning each document into a list of tokens
o Linguistic pre-processing. Each document is represented as a list of normalized tokens
o Index the documents that each token occurs in by creating an inverted index,
consisting of a dictionary and postings
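
A minimal sketch of these four steps in Python (the tokenization and normalization here are deliberately simplistic stand-ins for real linguistic preprocessing; the document IDs and texts are made up):

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping document id -> raw text.
    Returns a dict mapping each term to its postings list,
    i.e. a sorted list of (doc_id, term_frequency) pairs."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Tokenize and normalize: lowercase, keep alphanumeric runs only.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}
index = build_inverted_index(docs)
print(index["home"])   # [(1, 1), (2, 1), (3, 1)]
print(index["july"])   # [(2, 1), (3, 1)]
```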

4. How do spammers use cloaking to serve spam to the web user?(apr/may 18)

Cloaking means serving fake, keyword-rich content to the search engine spider while serving a different
page to human visitors, so the indexed content does not match what users actually see.

DNS/IP cloaking: the spammer's server inspects the requester's IP address or user agent to recognize a
crawler and switches (impersonates) the page content accordingly.

5. Can a digest of the characters in a web page be used to detect near-duplicate web pages? Why?
(apr/may 18)
Web contains multiple copies of the same content. Search engines try to avoid indexing multiple copies
of the same content, to keep down storage and processing overheads.The simplest approach to detecting
duplicates is to compute, for each web page, a fingerprint that is a succinct (say 64-bit) digest of the
characters on that page.

Then, whenever the fingerprints of two web pages are equal, we test whether the pages themselves are
equal and if so declare one of them to be a duplicate copy of the other.
This simplistic approach fails to capture a crucial and widespread phenomenon on the Web: near
duplication. In many cases, the contents of one web page are identical to those of another except for a few
characters - say, a notation showing the date and time at which the page was last modified.
Even in such cases, we want to be able to declare the two pages to be close enough that we only index
one copy.We now describe a solution to the problem of detecting near-duplicate web pages. The answer lies
in a technique known as shingling.
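
A small sketch of the exact-fingerprint approach described above (the 64-bit digest here is simply the first 8 bytes of a SHA-256 hash; any stable hash would do), illustrating why it misses near duplicates and why shingling is needed:

```python
import hashlib

def fingerprint(page_text: str) -> int:
    """Return a succinct 64-bit digest of the characters on a page."""
    digest = hashlib.sha256(page_text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")   # keep the first 64 bits

page_a = "Breaking news: market rises. Last modified 2019-01-01"
page_b = "Breaking news: market rises. Last modified 2019-01-02"

print(fingerprint(page_a) == fingerprint(page_a))  # True: identical pages collide
print(fingerprint(page_a) == fingerprint(page_b))  # False: one changed character
```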

6. What are the applications of web crawlers? (Nov/Dec 18)

 create an index covering broad topics (general Web search )


 create an index covering specific topics (vertical Web search )
 archive content (Web archival )
 analyze Web sites for extracting aggregate statistics (Web characterization )
 keep copies or replicate Web sites (Web mirroring )
 Web site analysis

7. Define search engine optimization. (Nov/Dec 18)

Search engine optimization (SEO) is the process of increasing the quality and quantity of website
traffic by increasing visibility of a website or a web page to users of a web search engine. SEO refers to the
improvement of unpaid results (known as "natural" or "organic" results), and excludes the purchase of paid
placement.
SEO may target different kinds of search, including image search, video search, academic search,
news search, and industry-specific vertical search engines. Optimizing a website may involve editing its
content, adding content, modifying HTML, and associated coding to both increase its relevance to specific
keywords and to remove barriers to the indexing activities of search engines.
Promoting a site to increase the number of backlinks, or inbound links, is another SEO tactic.
SEO is all about optimizing a website for search engines.
SEO is a technique for:
 designing and developing a website to rank well in search engine results.
 improving the volume and quality of traffic to a website from search engines.
 marketing by understanding how search algorithms work, and what human visitors might search.
SEO is a subset of search engine marketing. SEO is also referred to as SEO copywriting, because most of the
techniques that are used to promote sites in search engines deal with text.
8. What are the spam techniques?

Doorway pages: pages optimized for a single keyword that re-direct to the real target page.
Link spamming: mutual admiration societies, hidden links, awards.
Domain flooding: numerous domains that point or re-direct to a target page.
Robots: fake query streams from rank-checking programs, "curve-fit" ranking programs of search engines,
and millions of submissions via Add-URL.
9. State Heaps' law. (Apr/may 19)
Under mild assumptions, the Herdan-Heaps law is asymptotically equivalent to
Zipf's law concerning the frequencies of individual words within a text. Heaps' law also applies to
situations in which the "vocabulary" is just some set of distinct types which are attributes of some
collection of objects. If V is the size of the vocabulary and n is the length of the corpus in words:

V = K·n^β, with constants K > 0 and 0 < β < 1

Typical constants: K ≈ 10-100 and β ≈ 0.4-0.6 (approximately square-root growth).
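
A quick numeric illustration, using assumed constants K = 44 and β = 0.49 chosen from within the typical ranges above:

```python
K, beta = 44, 0.49          # illustrative constants within the typical ranges

def heaps_vocabulary(n_tokens: int) -> int:
    """Predicted number of distinct terms: V = K * n^beta."""
    return round(K * n_tokens ** beta)

for n in (10_000, 1_000_000, 100_000_000):
    print(n, heaps_vocabulary(n))
# roughly 4,000 distinct terms at 10,000 tokens,
# roughly 38,000 at 1 million, and roughly 366,000 at 100 million
```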

10. Explain the difference between SEO and Pay-per-click

SEO:
 Results take 2 weeks to 4 months.
 It is very difficult to control the flow of traffic.
 Requires ongoing learning and experience to reap results.
 It is more difficult to target local markets.
 Better for long-term and lower-margin campaigns.
 Generally more cost-effective; does not penalize for more traffic.

Pay-per-click:
 Results in 1-2 days.
 Has the ability to turn traffic on and off at any moment.
 Easier for a novice.
 Ability to target "local" markets.
 Better for short-term and high-margin campaigns.
 Generally more costly per visitor and per conversion.

11. Explain index compression with an example. (A/M 17)

We need less disk space. As we will see, compression ratios of 1:4 are easy to achieve,
potentially cutting the cost of storing the index by 75%.
There are two more subtle benefits of compression.
The first is increased use of caching. Search systems use some parts of the dictionary and the index
much more than others. For example, if we cache the postings list of a frequently used query term t,
then the computations necessary for responding to the one-term query t can be entirely done in
memory.
With compression, we can fit a lot more information into main memory. Instead of having to
expend a disk seek when processing a query with t, we instead access its postings list in memory and
decompress it. As we will see below, there are simple and efficient decompression methods, so that
the penalty of having to decompress the postings list is small.
As a result, we are able to decrease the response time of the IR system substantially.

Because memory is a more expensive resource than disk space, increased speed owing to caching –
rather than decreased space requirements – is often the prime motivator for compression.
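
As a concrete illustration of how such compression works, one standard scheme stores the gaps between successive document IDs in a postings list and encodes each gap with variable-byte codes; a minimal sketch (the postings values are made up):

```python
def vb_encode_number(n: int) -> bytes:
    """Variable-byte encode one non-negative integer; the high bit marks the last byte."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128          # set the terminating bit on the final byte
    return bytes(chunks)

def encode_postings(doc_ids):
    """Encode a sorted postings list as variable-byte encoded gaps."""
    out, prev = b"", 0
    for d in doc_ids:
        out += vb_encode_number(d - prev)   # store the gap, not the absolute id
        prev = d
    return out

postings = [824, 829, 215406]
print(len(encode_postings(postings)))   # 6 bytes instead of 12 with fixed 4-byte ints
```

Small gaps, which dominate the postings lists of frequent terms, then cost only one byte each, and decoding is a cheap byte-at-a-time loop.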

12. What are the Practical Issues in the Web?


 Security: Commercial transactions over the Internet are not yet a completely safe procedure.
 Privacy: Frequently, people are willing to exchange information as long as it does not become public.
 Copyright and patent rights: It is far from clear how the wide spread of data on the Web affects
copyright and patent laws in the various countries.
 Scanning, optical character recognition (OCR), and cross-language retrieval.

13. Why is compression needed and what are its types?


 Use less disk space (saves money)
 Keep more stuff in memory (increases speed)
 Increase speed of transferring data from disk to memory (increases speed)
[read compressed data and decompress] is faster than [read uncompressed data]
 Premise: Decompression algorithms are fast
True of the decompression algorithms we use
 In most cases, retrieval system runs faster on compressed postings lists than on uncompressed
postings lists.
Types
• Lossless compression: All information is preserved.
• What we mostly do in IR.
• Lossy compression: Discard some information

14. What are the levels of link analysis?


 microscopic level: related to the statistical properties of links and individual nodes
 mesoscopic level: related to the properties of areas or regions of the Web
 macroscopic level: related to the structure of the Web at large

15. What are the Indexing Issues?


• Availability and speed – Most search engines will cache the page being referenced.

• Multiple search terms – OR: separate searches concatenated


– AND: intersection of searches computed.
– Regular expressions not typically handled.

• Parsing – Must be able to handle malformed HTML, partial documents

PART-B

1. Elaborate on the web search architectures.(N/D 16)

Search Engine refers to a huge database of internet resources such as web pages, newsgroups,
programs, images, etc. It helps to locate information on the World Wide Web. The user can search for any
information by passing a query in the form of keywords or a phrase. The search engine then searches for
relevant information in its database and returns it to the user.
The main components of a search engine are the crawler, indexer, search index, query engine, and
search interface.
A web crawler is a software program that traverses web pages, downloads them for indexing, and
follows the hyperlinks that are referenced on the downloaded pages a web crawler is also known as a spider,
a wanderer or a software robot.
The second component is the indexer which is responsible for creating the search index from the web
pages it receives from the crawler.

The Search Index


The search index is a data repository containing all the information the search engine needs to match and
retrieve web pages. The type of data structure used to organize the index is known as an inverted file. It is
very much like an index at the back of a book. It contains all the words appearing in the web pages crawled,
listed in alphabetical order (this is called the index file), and for each word it has a list of references to the
web pages in which the word appears (this is called the posting list ).
Example:
Consider the entry for "chess" in the search index. Attached to the entry is the posting list of all web pages
that contain the word "chess"; for example, the entry for "chess" could be chess → [www.chess.co.uk,
www.uschess.org, www.chessclub.com, . . .]. Often, more information is stored for each entry in the index,
such as the number of documents in the posting list for the entry, that is, the number of web pages that
contain the keyword; for each individual entry in the posting file we may also store the number of
occurrences of the keyword in the web page and the position of each occurrence within the page. This type
of information is useful for determining content relevance.
The search index will also store information pertaining to hyperlinks in a separate link database,
which allows the search engine to perform hyperlink analysis, which is used as part of the ranking process of
web pages. The link database can also be organized as an inverted file in such a way that its index file is
populated by URLs and the posting list for each URL entry, called the source URL, contains all the
destination URLs forming links between these source and destination URLs.
Compression of the link database is thus an important issue for search engines, which need to perform
efficient hyperlink analysis. Randall et al. have developed compression techniques for the link database, which
take advantage of the structure of the Web. Their techniques are based on the observations that most web
pages tend to link to other pages on the same website, and many web pages on the same web site tend to link
to a common set of pages.
The text which is attached to a hyperlink, called link (or anchor) text, that is clicked on by users following
the link, is considered to be part of the web page it references.
The Query Engine
The inner working of a commercial query engine is a well-guarded secret, since search engines are rightly paranoid, fearing web
sites that wish to increase their ranking by unscrupulously taking advantage of the algorithms the search
engine uses to rank result pages. Search engines view such manipulation as spam, since it has direct effects
on the quality of the results presented to the user.
Spam
Spam is normally associated with unsolicited e-mail, also known as junk e-mail, although the word spam
originally derives from spiced ham and refers to a canned meat product. It is not straightforward to
distinguish between search engine spam and organic search engine optimization, where a good and healthy
design of web pages leads them to be visible on the top results of search engines for queries related to the
pages.
The text which is attached to a hyperlink, called link (or anchor) text, that is clicked on by users following
the link, is considered to be part of the web page it references. So when a word such as "chess" appears in
some link text, then the posting list for that word will contain an entry for the destination URL of
the link.
The Search Interface
Once the query is processed, the query engine sends the results list to the search interface, which
displays the results on the user's screen. The user interface provides the look and feel of the search engine,
allowing the user to submit queries, browse the results list, and click on chosen web pages for further
browsing.

Search Engine Working


Web crawler, database and the search interface are the major components of a search engine that
actually make a search engine work. Search engines make use of the Boolean expressions AND, OR, NOT to
restrict and widen the results of a search. Following are the steps that are performed by the search engine:
• The search engine looks for the keyword in the index for predefined database instead of going directly to
the web to search for the keyword.
• It then uses software to search for the information in the database. This software component is known as
web crawler.
• Once web crawler finds the pages, the search engine then shows the relevant web pages as a result. These
retrieved web pages generally include title of page, size of text portion, first several sentences etc.
• These search criteria may vary from one search engine to the other. The retrieved information is ranked
according to various factors such as frequency of keywords, relevancy of information, links etc.
• User can click on any of the search results to open it.

Architecture
The search engine architecture comprises the three basic layers listed below:
 Content collection and refinement.
 Search core
 User and application interfaces

Search Engine Processing


Indexing Process
Indexing process comprises the following three tasks:
 Text acquisition
 Text transformation
 Index creation
Text acquisition
It identifies and stores documents for indexing.
Text Transformation
It transforms document into index terms or features.
Index Creation
It takes the index terms created by text transformation and creates data structures to support fast searching.
Query Process
Query process comprises the following three tasks:
 User interaction
 Ranking
 Evaluation
User interaction
It supports creation and refinement of the user query and displays the results.
Ranking
It uses the query and the indexes to create a ranked list of documents (a toy scoring sketch is given after this list).
Evaluation

It monitors and measures the effectiveness and efficiency. It is done offline.
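
A toy sketch of the ranking step referred to above: score documents by summing the term frequencies of the query terms over a small hand-built index (a stand-in for a real weighting scheme such as tf-idf or BM25; the index contents are made up):

```python
from collections import defaultdict

# A toy inverted index: term -> list of (doc_id, term_frequency)
index = {"web":     [(1, 3), (2, 1)],
         "crawler": [(1, 2), (3, 4)],
         "ranking": [(2, 2), (3, 1)]}

def rank(query_terms):
    """Score each document by the summed term frequencies of the query terms."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id, tf in index.get(term, []):
            scores[doc_id] += tf
    # Sort documents by descending score to produce the ranked list.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(rank(["web", "crawler"]))   # [(1, 5), (3, 4), (2, 1)]
```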

Examples
Following are several search engines available today:
 Google: It was originally called BackRub. It is the most popular search engine globally.
 Bing: It was launched in 2009 by Microsoft. It is the latest web-based search engine that also delivers
Yahoo's results.
 Ask: It was launched in 1996 and was originally known as Ask Jeeves. It includes support for
match, dictionary, and conversation questions.
 AltaVista: It was launched by Digital Equipment Corporation in 1995. Since 2003, it has been
powered by Yahoo technology.
 AOL Search: It is powered by Google.
 LYCOS: It is a top 5 internet portal and the 13th largest online property according to Media Metrix.
 Alexa: It is a subsidiary of Amazon and is used for providing website traffic information.

2. Describe Meta and focused crawling. (N/D 16)


Write short notes on the following. (N/D 17)
Focused crawling , Deep web , Distributed crawling, site map

META-CRAWLERS
MetaCrawler, a specific search engine owned by InfoSpace, was one of the first meta search engines.
Instead of crawling the Web, metasearch engines search multiple search engines, Internet directory sites,
government sites, news sites and aggregate the results.
For general search purposes, initiatives such as the Google Knowledge Graph accomplish many of the
same goals as a metasearch engine. For example, a decision about which hotel room to book for a trip is
usually not based on price alone. In fact, the average consumer visits five or six websites before booking a
room.
Meta-search engines consolidate results from multiple sites and categorize results so consumers can
view results based on characteristics other than price. E.g., http://www.kartoo.com: this meta search site shows
the results with sites being interconnected by keywords. In specific industries such as the travel industry,
metasearch engines are gaining popularity with consumers because they aggregate results across many
websites. MetaCrawler provided users with the option to search news, videos, images, white pages, and yellow pages.

MetaCrawler blends data from many of the top search engines including Yahoo!, Google, Bing, Ask.com,
About.com and others.
• Examples: Metacrawler, Savvysearch, MetaSearch, Mamma.
• Advantages:
1. Combine the results of many sources.
2. Save users from the need to pose queries to multiple searchers.
3. Ability to sort the results by different attributes.
4. Pages retrieved by multiple searchers are more relevant.
5. Improve coverage: Individual searchers cover a small fraction of the Web.

• A metasearcher submits queries over multiple sources. But the interfaces and capabilities of these sources
may vary dramatically. Even the basic query model that the sources support may vary. Some search engines
only support the Boolean retrieval model.

• In this model, a query is a condition that documents either do or do not satisfy. The query result is then a
set of documents. For example, a query "distributed AND systems" returns all documents that contain both the
words distributed and systems in them.

• Metasearch engines present the results of their searches in one of two ways:

1. Single list: Most metasearchers display multiple-engine search results in a single merged list, from which
duplicate entries have been removed.
2. Multiple lists: Some metasearchers do not collate the results but display them instead in separate lists, as
they are received. Duplicate entries may appear.

• Following is a list of metasearchers with their URLs and the number of sources used:

S.No  Metasearcher    URL                              Sources Used
1     Cyber411        www.cyber411.com                 14
2     Dogpile         www.dogpile.com                  25
3     Highway61       www.highway61.com                5
4     Inference Find  www.infind.com                   6
5     Mamma           www.mamma.com                    7
6     MetaCrawler     www.metacrawler.com              7
7     MetaFind        www.metafind.com                 7
8     MetaMiner       www.miner.uol.com.br             13
9     MetaSearch      www.metasearch.com               -
10    SavvySearch     www.savvy.cs.colostate.edu:2000  >13

• MetaCrawler is a search engine which clusters the retrieved results; a metasearch engine is a search tool
that sends user requests to several search engines and/or databases and aggregates the results into a single list.

Step 1: When MetaCrawler receives a query, it posts the query to multiple search engines in parallel.

Step 2: It performs sophisticated pruning on the responses returned.

• The search results you receive are a combination of the top commercial (sponsored advertising) and non-
commercial (algorithmic) results from search engines on the Web.

• Meta Crawler's metasearch technology combines the top ranking search results from each of the separate
search engines based on your specific query. The blend of sponsored and non-sponsored results for a given
search term depends on the nature of the term, but sponsored results are always clearly identified with a
“Sponsored” or similar designation.

• Features:

1. Unifies the Search Interface and provides a consistent user interface,


2. Standardizes the query structure,
3. May make use of an independent ranking method for the results {rank-merging},
4. May have an independent ranking system for each search engine/database it searches,
5. Meta Search is not a search for Meta Data.

• User interface: Normally resembles search engine interfaces, with options for the types of search and the
search engines to use.

• Dispatcher: Generates the actual queries to the search engines by using the user query. It may involve
choosing/expanding the search engines to use.

• Display: Generates the results page from the replies received. It may involve ranking, parsing, clustering
of the search results, or just plain stitching.

• Personalization/Knowledge: It may contain either or both. Personalization may involve weighting of
search results/query/engine for each user.

Focused crawling
Focused crawler is used to collect those web pages that are relevant to a particular topic while filtering out
the irrelevant. Thus focused crawling can be used to generate data for an individual user.
There are three major challenges for focused crawling:
(i)It needs to determine the relevance of a retrieved web page.

(ii)Predict and identify potential URLs that can lead to relevant pages.
(iii)Rank and order the relevant URLs so the crawler knows exactly what to follow next.

Architecture of Focused crawling

URL queue contains a list of unvisited URLs maintained by the crawler and is initialized with seed URLs.
Web page downloader fetches URLs from URL queue and downloads corresponding pages from the
internet.

The parser and extractor extracts information such as the terms and the hyperlink URLs from a downloaded
page. Relevance calculator calculates relevance of a page w.r.t. topic, and assigns score to URLs extracted
from the page. Topic filter analyses whether the content of parsed pages is related to topic or not. If the page
is relevant, the URLs extracted from it will be added to the URL queue, otherwise added to the Irrelevant
table.
A focused crawling algorithm loads a page and extracts the links. By rating the links based on keywords
the crawler decides which page to retrieve next. The Web is traversed link by link and the existing work is
extended in the area of focused document crawling.

There are various categories in focused crawlers:


(a) Classic focused crawler
(b) Semantic crawler
(c) Learning crawler
(a)Classic focused crawlers
Guides the search towards interested pages by taking the user query which describes the topic as input.
They assign priorities to the links based on the topic of query and the pages with high priority are
downloaded first. These priorities are computed on the basis of similarity between the topic and the page
containing the links. Text similarity is computed using an information similarity model such as the Boolean
or the Vector Space Model

(b) Semantic crawlers


It is a variation of the classic focused crawler. To compute topic-to-page relevance, download priorities are
assigned to pages by applying semantic similarity criteria: the sharing of conceptually similar terms defines
the relevance between a page and the topic. An ontology is used to define the conceptual similarity between the terms.

(c)Learning crawlers
Uses a training process to guide the crawling process and to assign visit priorities to web pages. A learning
crawler supplies a training set which consist of relevant and not relevant Web pages in order to train the
learning crawler . Links are extracted from web pages by assigning the higher visit priorities to classify
relevant topic. Methods based on context graphs and Hidden Markov Models take into account not only the
page content but also the link structure of the Web and the probability that a given page will lead to a
relevant page .

Distributed crawling

The crawling operation can be performed by several dedicated threads


• Parallel crawling can be distributed over nodes of a distributed system (geographical distribution, link-
based distribution, etc.)
• Distribution involves a host-splitter, which dispatches URLs to the corresponding crawling node

The URL frontier


2 types of queues:
• front queues for prioritization
• back queues for politeness
Each back queue contains only URLs for a given host (a table maps hosts to back queue identifiers).
When a back queue is empty, it is refilled with URLs from the front (priority) queues. A heap stores, for each
host, the earliest time t_e at which that host may be contacted again.
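
A hedged sketch of the back-queue timing idea, keeping a heap keyed by the earliest allowed contact time t_e per host (the hosts, URLs and the 30-second delay are illustrative):

```python
import heapq
import time

POLITENESS_DELAY = 30                      # seconds between requests to one host

back_queues = {"example.com": ["http://example.com/a", "http://example.com/b"],
               "example.org": ["http://example.org/x"]}
# Heap of (earliest_allowed_contact_time, host)
host_heap = [(time.time(), host) for host in back_queues]
heapq.heapify(host_heap)

def next_url():
    """Pop the host whose politeness window expires first and return one of its URLs."""
    ready_at, host = heapq.heappop(host_heap)
    wait = ready_at - time.time()
    if wait > 0:
        time.sleep(wait)                   # respect the politeness window
    url = back_queues[host].pop(0)
    if back_queues[host]:                  # re-schedule the host for later
        heapq.heappush(host_heap, (time.time() + POLITENESS_DELAY, host))
    return url

print(next_url())
```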

Site map and index:

A site map is a finding aid organized in the same way as a table of contents, it follows the structure
of the site, section by section, instead of being alphabetical. While providing a helpful overview of a site,
therefore, site maps do not always enable users to quickly find a specific topic.
Also, a site map lists each Web page only once and by its correct name (for example, "Work For
Us"), without cross-references or variants to make it more easily found. Finally, site maps tend to include
only Web pages, and perhaps not even all pages, rather than the additional sections within pages.
This is usually dictated by spatial factors: the entire site map should fit on one screen to be most
usable for browsing. An index can be much larger and more detailed, and is not constrained by space, for
example, a common convention of indexes is the use of a separate Web page for each letter of the alphabet.

Issues covered by Web index


The Web Index assesses the Web's contribution to social, economic and political progress in countries
around the world.

The Index measures and ranks:


 Universal Access:

o This sub-index measures whether countries have invested in affordable access to high
quality internet infrastructure, as well as investing in the education and skills citizens need to
use the Web well.
 Freedom and Openness:
o This sub-index assesses the extent to which citizens enjoy rights to information, opinion,
expression, safety and privacy online.
 Relevant Content and use:
This sub-index maps both Web use by citizens and the content available to citizens in each country,
with an emphasis on the extent to which different stakeholders can access information that is relevant to
them, in the language that they are most comfortable using and via platforms and channels that are widely
available.
 Empowerment:
This sub-Index aims to assess the difference that the Web is making to people, and the extent to
which use of the Web by stakeholders is fostering positive change in four key areas: society, economy,
politics and environment.

Web Indexing Examples


 Parliament of Australia
A large back-of-the-book style index.
 UNIXhelp for Users
Both a back-of-the-book style index and a searchable index.
 Daily Herald story index - 1901 to 1964
An alphabetical list of subject headings for newspaper articles.
 Writer's Block
A "living" index that is updated quarterly.
 The World Bank Group
Subject list showing major topics and sub-topics. Each page for a major topic is structured
differently.
 US Census Bureau
A large multi-level subject list. Also provides a list of subjects in alphabetical order.

3. Explain the working of a web crawler with an example. (A/M 17)


Draw the basic crawler architecture and describe its components. (A/M 18)
Explain the features and architecture of web crawler. (N/D 18)
Describe in detail about Web Crawling with a neat architecture diagram. (A/M 19)

Web Crawling
Crawling is the process of gathering pages from the internet, in order to index them. The objective of
crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link
structure that interconnects them.
Basic rules for Web crawler operation
 A Web crawler must identify itself as such, and must not pretend to be a regular Web user
 A Web crawler must obey the robots exclusion protocol (robots.txt)
 A Web crawler must keep a low bandwidth usage in a given Web site.
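
The robots exclusion rule can be checked with Python's standard-library parser before fetching; a small sketch (the URL and user-agent string are illustrative):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()                                  # fetch and parse robots.txt

user_agent = "MyCrawler"                   # the crawler must identify itself
url = "https://www.example.com/private/page.html"

if rp.can_fetch(user_agent, url):
    print("allowed to fetch", url)
else:
    print("robots.txt disallows", url)
```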

Features of a crawler
 Robustness: ability to handle spider-traps (cycles, dynamic web pages)
 Politeness: policies about the frequency of robot visits
 Distribution: crawling should be distributed within several machines
 Scalability: crawling should be extensible by adding machines, extending bandwidth, etc.
 Efficiency: clever use of the processor, memory, bandwidth
 Quality: should detect the most useful pages, to be indexed first
 Freshness: should continuously crawl the web (visiting frequency of a page should be close to its
modification frequency)
 Extensibility: should support new data formats (e.g. XML-based formats), new protocols (e.g. ftp),
etc.
Crawling process
 The crawler begins with a seed set of URLs to fetch
 The crawler fetches and parses the corresponding web pages, and extracts both text and links
 The text is fed to a text indexer, the links (URL) are added to a URL frontier ( crawling agenda)
 (continuous crawling) Already fetched URLs are appended to the URL frontier for later re-
processing
Types of Web search based on crawling
 General Web search: done by large search engines
 Vertical Web search: the set of target pages is delimited by a topic, a country or a language
Applications of a Web Crawler
 create an index covering broad topics (general Web search )
 create an index covering specific topics (vertical Web search )
 archive content (Web archival )
 analyze Web sites for extracting aggregate statistics (Web characterization )
 keep copies or replicate Web sites (Web mirroring )
 Web site analysis

Architecture of a crawler
The simple scheme outlined above for crawling demands several modules that fit together

1) The URL frontier, containing URLs yet to be fetched in the current crawl (in the case of continuous
crawling, a URL may have been fetched previously but is back in the frontier for re-fetching).
2) A DNS resolution module that determines the web server from which to fetch the page specified by a
URL.
3) A fetch module that uses the http protocol to retrieve the web page at a URL.
4) A parsing module that extracts the text and set of links from a fetched web page.
5) A duplicate elimination module that determines whether an extracted link is already in the URL frontier or
has recently been fetched.
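
A minimal single-threaded sketch of how these modules fit together, using only the standard library (the regular-expression link extractor and the seed URL are simplifications; a real crawler adds DNS caching, robots.txt checks, politeness windows and robust HTML parsing):

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)            # the URL frontier
    seen = set(seed_urls)                  # duplicate-URL elimination
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                       # skip pages that fail to fetch
        max_pages -= 1
        # Parse: extract out-links with a (crude) regular expression.
        for link in re.findall(r'href="([^"#]+)"', html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        print("fetched", url, "frontier size", len(frontier))

crawl(["https://example.com/"])
```

In a real deployment the fetched text would be handed to the indexer and the frontier would be the prioritized, per-host-politeness structure described earlier.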

Politeness policies:

• Within the website www.cs.sfu.ca, many pages contain many links to other pages in the same site.
Downloading many pages simultaneously from one website may overwhelm the server.

• A reasonable web crawler should use only a tiny portion of the bandwidth of a website server, not fetching
more than one page at a time.

• Implementation: Logically split the request queue into a single queue per web server; a server queue is open
only if it has not been accessed within the specified politeness window.
• Suppose a web crawler can fetch 100 pages per second and the politeness policy dictates that it cannot fetch
more than 1 page every 30 seconds from a server; then we need URLs from at least 3,000 different servers to
make the crawler reach its peak throughput.

• Detecting updates: If a web page is updated, the page should be crawled again. In HTTP, a HEAD request
returns only header information about the page but not the page itself. A crawler can compare the date it
received from the last GET request with the Last-Modified value from a HEAD request.
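
A sketch of that update check: issue a HEAD request and compare the Last-Modified header with the value stored at the previous crawl (the URL and stored timestamp are illustrative, and not every server returns this header):

```python
import urllib.request

def last_modified(url: str):
    """Issue a HEAD request and return the Last-Modified header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.headers.get("Last-Modified")

stamp_at_last_crawl = "Mon, 01 Jan 2024 00:00:00 GMT"   # stored from the previous GET
current = last_modified("https://example.com/")
if current and current != stamp_at_last_crawl:
    print("page changed, re-crawl it")
else:
    print("no change detected (or the server does not report Last-Modified)")
```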

4. Explain with an example the framework of a XML retrieval system. (A/M 17)
Illustrate the various changes in XML retrieval with appropriate example. (A/M 18)
Explain about the vector space model in XML retrieval .Write in detail about the evaluation of the
data retrieved. (A/M 19) Explain in detail about Vector space model of XML retrieval. (N/D 18)
What is XML?
A meta-language (a language used to describe other languages). XML is able to represent a mix of
structured and unstructured (text) information. It was defined by a working group of the WWW Consortium
(W3C) whose technical lead was James Clark, and it is the de facto standard markup language.
Example : Fig 1

An XML Document
An XML document is an ordered, labeled tree. Each node of the tree is an XML element and is
written with an opening and closing tag. An element can have one or more XML attributes.
In the XML document in Figure above , the scene element is enclosed by the two tags <scene ...>
and </scene>. It has an attribute number with value vii and two child elements, title and verse.

Fig 2:The XML document in Figure 1 as a simplified DOM object.

Figure 2 shows the document in Figure 1 as a tree. The leaf nodes of the tree consist of text, e.g., Shakespeare,
Macbeth, and Macbeth's castle. The tree's internal nodes encode either the structure of the document (title,
act, and scene) or metadata functions (author).
The standard for accessing and processing XML documents is the XML Document Object Model or
DOM . The DOM represents elements, attributes and text within elements as nodes in a tree. Figure 2 is a
simplified DOM representation of the XML document in Figure 1 . With a DOM API, we can process an
XML document by starting at the root element and then descending
down the tree from parents to children.

3.14.1 Compare XML vs HTML

XML is an important standard for both exchanging data between applications and encoding documents. To
support this more data-oriented view, the database community has defined languages for describing the
structure of XML data (XML Schema), and querying and manipulating that data (XQuery and XPath).

XQuery is a query language that is similar to the database language SQL, with the major difference that it
must handle the hierarchical structure of XML data instead of the simpler tabular structure of relational data.

XPath, or XML Path Language, is a subset of XQuery that is used to specify the search constraints for a single type of XML data or
document. XPath is a standard for enumerating paths in an XML document collection. For example, it
could be used in an XML movie database to find the movies that were directed by a particular person and
were released in a particular year.

The XPath expression node selects all nodes of that name. Successive elements of a path are separated by
slashes, so act/scene selects all scene elements whose parent is an act element. Double slashes indicate that
an arbitrary number of elements can intervene on a path: play//scene selects all scene elements occurring in
a play element.
In Figure 2 this set consists of a single scene element, which is accessible via the path play, act,
scene from the top. An initial slash starts the path at the root element. /play/title selects the play's title in
Figure 1 , /play//title selects a set with two members (the play's title and the scene's title), and /scene/title
selects no elements. For notational convenience, we allow the final element of a path to be a vocabulary
term and separate it from the element path by the symbol #, even though this does not conform to the XPath
standard. For example, title#"Macbeth" selects all titles containing the term Macbeth.
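
These path expressions can be tried with Python's built-in ElementTree, which supports a subset of XPath; the toy document below mirrors the play/act/scene structure of Figure 1:

```python
import xml.etree.ElementTree as ET

xml_doc = """
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>
"""
root = ET.fromstring(xml_doc)              # root corresponds to the play element

print([t.text for t in root.findall("title")])    # /play/title  -> ['Macbeth']
print([t.text for t in root.findall(".//title")])  # /play//title -> both titles
print([s.get("number") for s in root.findall("act/scene")])  # act/scene -> ['vii']
```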

XQuery could be used to combine information about movies with information about actors, assuming that
the XML movie database contained both movie and actor "documents." An example would be to find
movies starring actors who were born in Australia. A schema puts constraints on the structure of allowable
XML documents for a particular application. A schema for Shakespeare's plays may stipulate that scenes can
only occur as children of acts and that only acts and scenes have the number attribute. Two standards for
schemas for XML documents are XML DTD (document type definition) and XML Schema. Users can only
write structured queries for an XML retrieval system if they have some minimal knowledge about the
schema of the collection.

Fig 3: An XML query in NEXI format and its partial representation as a tree.

A common format for XML queries is NEXI (Narrowed Extended XPath I). We give an example in
Figure 3. We display the query on four lines for typographical convenience, but it is intended to be read as
one unit without line breaks. In particular, //section is embedded under //article.
The query in Figure 3 specifies a search for sections about the summer holidays that are part of
articles from 2001 or 2002. As in XPath double slashes indicate that an arbitrary number of elements can
intervene on a path.
The dot in a clause in square brackets refers to the element the clause modifies. The clause [.//yr =
2001 or .//yr = 2002] modifies //article. Thus, the dot refers to //article in this case. Similarly, the dot in
[about(., summer holidays)] refers to the section that the clause modifies.
The two yr conditions are relational attribute constraints. Only articles whose yr attribute is 2001 or
2002 (or that contain an element whose yr attribute is 2001 or 2002) are to be considered. The about clause
is a ranking constraint: Sections that occur in the right type of article are to be ranked according to how
relevant they are to the topic summer holidays.

We now focus on the core information retrieval problem in XML retrieval, namely how to rank documents
according to the relevance criteria expressed in the about conditions of the NEXI query.
If we discard relational attributes, we can represent documents as trees with only one type of node:
element nodes. In other words, we remove all attribute nodes from the XML document, such as the number
attribute in Figure 1. Figure 4 shows a subtree of the document in Figure 1 as an element-node tree (labeled
d1). We can represent queries as trees in the same way. This is a query-by-example approach to query
language design because users pose queries by creating objects that satisfy the same formal description as
documents. In Figure 4, q1 is a search for books whose titles score highly for the keywords Julius Caesar.
q2 is a search for books whose author elements score highly for Julius Caesar and whose title elements score
highly for Gallic war.

Vector space model for XML retrieval

The main idea is to have each dimension of the vector space encode a word together with its position within the XML tree.

Take each text node (leaf) and break it into multiple nodes, one for each word. E.g. split Bill Gates into Bill
and Gates
Define the dimensions of the vector space to be lexicalized subtrees of documents: subtrees that
contain at least one vocabulary term. Here we represent queries and documents as vectors in this space of
lexicalized subtrees and compute matches between them, e.g. using the vector space formalism. The main
difference is that the dimensions of the vector space in unstructured retrieval are vocabulary terms, whereas they
are lexicalized subtrees in XML retrieval.
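
A hedged sketch of this idea, using the common simplification of indexing (path, term) pairs (sometimes called structural terms) instead of full lexicalized subtrees, and scoring with cosine similarity; the paths and terms are made up:

```python
import math
from collections import Counter

def structural_terms(path_word_pairs):
    """path_word_pairs: list of (path, word) pairs for one document or query."""
    return Counter(path_word_pairs)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc = structural_terms([("book/title", "julius"), ("book/title", "caesar"),
                        ("book/author", "plutarch")])
query = structural_terms([("book/title", "julius"), ("book/title", "caesar")])

print(round(cosine(query, doc), 3))   # about 0.816: the title matches on both terms
```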

5. Explain in detail about finger print algorithm for near duplicate detection. (N/D 17)
Explain in detail about fingerprint algorithm for near duplication detection. (N/D 18)

Near duplicate detection


The Web contains multiple copies of the same content. By some estimates, as many as 40% of the
pages on the Web are duplicates of other pages. Search engines try to avoid indexing multiple copies of the
same content, to keep down storage and processing overheads.
The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is
a succinct (say 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web
pages are equal, we test whether the pages themselves are equal and if so declare one of them to be a
duplicate copy of the other.
This simplistic approach fails to capture a crucial and widespread phenomenon on the Web: near
duplication. In many cases, the contents of one web page are identical to those of another except for a few
characters - say, a notation showing the date and time at which the page was last modified.
Even in such cases, we want to be able to declare the two pages to be close enough that we only
index one copy. Short of exhaustively comparing all pairs of web pages, an infeasible task at the scale of
billions of pages, how can we detect and filter out such near duplicates?
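
As noted in the Part A answer above (question 5), the standard technique is shingling. A hedged sketch: represent each page by its set of k-word shingles and measure Jaccard overlap (real systems hash the shingles and estimate this overlap with min-hash sketches rather than comparing full sets; the page texts are made up):

```python
def shingles(text: str, k: int = 4) -> set:
    """Return the set of k-word shingles (k-grams of consecutive words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

page_a = "a rose is a rose is a rose last modified january first"
page_b = "a rose is a rose is a rose last modified january second"

similarity = jaccard(shingles(page_a), shingles(page_b))
print(round(similarity, 2))   # 0.75: most shingles are shared, so flag as near duplicates
```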

6. Brief about search engine optimization. (N/D 17)

The Search Engine Optimization Techniques


1.Domain name Strategies
Domain naming is important to our overall foundation: use sub-directory root domains
(example.com/awesome) versus sub-domains (awesome.example.com). All inbound links pointing to a
website's main domain (www.example.com), as well as its internal pages, contribute to the domain authority.
Getting a back link from a high domain authority site is always valuable.

2.Linking Strategies
-the text in the links should include keywords
- the more inbound links the higher the SE ranking
- if the site linking to us is already indexed, spiders will also receive our site

3.Keywords
- the most important factor in optimizing rankings
- keywords are the words that appear most often in a page
- the spider chooses the appropriate keywords for each page, then sends them back to its SE
- our web site will then be indexed based on our keywords
- can be key phrases or a single keyword
- do not use common words, e.g. 'the', 'and', 'of': spiders ignore them
- write keyword-rich text (a rough density check is sketched below)
- balance keyword richness with readability
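
As a small illustration of the keyword-density idea referred to above (the page text and keyword are made up, and real engines use far richer weighting than raw density):

```python
import re

def keyword_density(page_text: str, keyword: str) -> float:
    """Fraction of the page's words that match the given keyword."""
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

text = "Chess openings explained: learn chess strategy and chess tactics fast"
print(f"{keyword_density(text, 'chess'):.1%}")   # 30.0% of the words are 'chess'
```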

4.Title tags

7.Submit your website to SEs for indexing
-submit your site to search engine directories, directory sites and portal sites
- indexing takes time

SEO techniques are classified into two broad categories:

• White Hat SEO - Techniques that search engines recommend as part of a good design.
• Black Hat SEO - Techniques that search engines do not approve and attempt to minimize the effect of.
These techniques are also known as spamdexing.

White Hat SEO

An SEO tactic is considered as White Hat if it has the following features:


 It conforms to the search engine's guidelines.
 It does not involve in any deception.
 It ensures that the content a search engine indexes, and subsequently ranks, is the same content a user
will see.
 It ensures that web page content has been created for the users and not just for the search engines.
 It ensures good quality of the web pages. It ensures availability of useful content on the web pages.

Black Hat or Spamdexing


An SEO tactic, is considered as Black Hat or Spamdexing if it has the following features:
 Attempting ranking improvements that are disapproved by the search engines and/or involve
deception.
 Redirecting users from a page that is built for search engines to one that is more human friendly.
 Redirecting users to a page that was different from the page the search engine ranked.
 Serving one version of a page to search engine spiders/bots and another version to human visitors.
This is called Cloaking SEO tactic.
 Using hidden or invisible text colored the same as the page background, using a tiny font size, or hiding
text within the HTML code, such as in "noframes" sections.
 Repeating keywords in the meta-tags, and using keywords that are unrelated to the website content.
This is called meta-tag stuffing.
 Calculated placement of keywords within a page to raise the keyword count, variety, and density of
the page. This is called keyword stuffing.
 Creating low-quality web pages that contain very little content but are instead stuffed with very
similar keywords and phrases. These pages are called Doorway or Gateway Pages.
 Mirror websites by hosting multiple websites - all with conceptually similar content but using
different URLs.
 Creating a rogue copy of a popular website which shows contents similar to the original to a web
crawler, but redirects web surfers to unrelated or malicious websites. This is called page hijacking.

SEO Tools

 Keyword Discovery - Find popular keywords that your site should be targeting.
 Keyword Volume - Estimate how much search traffic a specific keyword receives each month.
 Keyword Density - Analyze how well a specific webpage is targeting a keyword.
 Back link Trackers - Keep track of how many sites are linking to you.
 Site Popularity - Determine how popular your site is on the internet.
 Keyword Rankings - Track your site's rankings for a keyword in the search engines.
 Firefox Add-ons - Install these add-ons to turn your browser into an SEO
research powerhouse

What are the two major types of SEO


There are two major types of search engine optimization, white hat search engine optimization (the
'good' kind), and black hat (the 'not so good' kind).

SEO Services
There are a number of SEO services which can help contribute to the improvement of the organic
search engine rankings of a website. These services include, but are not limited to, on-page (or on-site)
optimization, link building, search engine friendly website design and development, and search engine
friendly content writing services.

Search Engine Spam

Search engines make billions of dollars each year selling ads. Most search engine traffic goes to the
free, organically listed sites. The ratio of traffic distribution is going to be keyword dependent and search
engine dependent, but we believe about 85% of Google's traffic clicks on the organic listings. Most other
search engines display ads a bit more aggressively than Google does. In many of those search engines,
organic listings get around 70% of the traffic.
Some sites rank well on merit, while others are there due exclusively to ranking manipulation. In
many situations, a proper SEO campaign can provide a much greater ROI (Return on Investment) than paid
ads do. This means that while search engine optimizers—known in the industry as SEOs—and search
engines have business models that may overlap, they may also compete with one another for ad dollars.
Sometimes SEOs and search engines are friends with each other, and, unfortunately, sometimes they are
enemies.
When search engines return relevant results, they get to deliver more ads. When their results are not
relevant, they lose market share. Beyond relevancy, some search engines also try to bias the search results to
informational sites such that commercial sites are forced into buying ads.
There is a huge sum of money in manipulating search results. There are ways to improve search
engine placement that go with the goals of the search engines, and there are also ways that go against them.
Quality SEOs aim to be relevant, whether or not they follow search guidelines. Many effective SEO
techniques may be considered somewhat spammy.

How Search Engines Works
Unlike humans, search engines are text-driven. Although technology advances rapidly, search
engines are far from intelligent creatures that can feel the beauty of a cool design or enjoy the sounds and
movement in movies. Instead, search engines crawl the Web, looking at particular site items (mainly text) to
get an idea what a site is about.

Search engines perform several activities in order to deliver search results – crawling, indexing,
processing, calculating relevancy, and retrieving.

First, search engines crawl the Web to see what is there. This task is performed by a piece of
software, called a crawler or a spider (or Googlebot, as is the case with Google). Spiders follow links from
one page to another and index everything they find on their way. Having in mind the number of pages on the
Web (over 20 billion), it is impossible for a spider to visit a site daily just to see if a new page has appeared
or if an existing page has been modified, sometimes crawlers may not end up visiting our site for a month or
two.
After a page is crawled, the next step is to index its content. The indexed page is stored in a giant
database, from where it can later be retrieved. Essentially, the process of indexing is identifying the words
and expressions that best describe the page and assigning the page to particular keywords. For a human it
will not be possible to process such amounts of information but generally search engines deal just fine with
this task. Sometimes they might not get the meaning of a page right but if we help them by optimizing it, it
will be easier for them to classify our pages correctly and for us to get higher rankings.
When a search request comes, the search engine processes it – i.e. it compares the search string in
the search request with the indexed pages in the database. Since it is likely that more than one page
(practically it is millions of pages) contains the search string, the search engine starts calculating the
relevancy of each of the pages in its index with the search string.
There are various algorithms to calculate relevancy. Each of these algorithms has different relative
weights for common factors like keyword density, links, or meta-tags. That is why different search engines
give different search results pages for the same search string. All major search engines, like Yahoo!, Google,
Bing, etc., periodically change their algorithms, and to keep at the top we also
need to adapt our pages to the latest changes. This is one reason (the other is your competitors) to devote
permanent efforts to SEO.

The last step in search engines' activity is retrieving the results. Basically, it is nothing more than
simply displaying them in the browser – i.e. the endless pages of search results that are sorted from the most
relevant to the least relevant sites.

