
The Deep Web

Ameya Hanamsagar
Department of Computer Engineering,
Sinhgad Institute of Technology & Science,
Narhe, Pune - 41
ameyah@live.com
Ninad Jane
Department of Computer Engineering,
Sinhgad Institute of Technology & Science,
Narhe, Pune - 41
ninadjane@live.com


Abstract: The Deep Web is the portion of the World Wide Web whose hidden value cannot be easily indexed by standard search engines. Deep Web content is accessed through queries submitted to Web databases, and the returned data records are wrapped in dynamically generated Web pages. Recent estimates indicate that the information available on the Deep Web is 400 to 550 times larger than that of the Surface Web. The Deep Web is the best hope for those who want to escape the bonds of totalitarian state censorship and share their ideas or experiences with the outside world. Furthermore, it holds a great deal of legitimate and valuable content, such as scientific data sets, documents and databases. Although it is useful in many ways, it also has a dark side: the Deep Web is a hub for terrorists, drug traders, hitmen and all kinds of illegal activity.

Keywords: censorship, darknet, freedom, knowledge, levels

I. INTRODUCTION
Internet content is considerably more diverse and
certainly much larger than what is commonly understood.
Firstly, though sometimes used synonymously, the
World Wide Web is but a subset of Internet content.
Secondly, even within the strict context of the Web, most
users are only aware of the content presented to them via
search engines such as Excite, Google, AltaVista, Snap
or Northern Light, or search directories such as Yahoo!,
About.com or LookSmart.
The part of the Internet that is indexed by search engines is called the Surface Web, while the rest is called the Deep Web. The Deep Web, also known as the Invisible Web or Hidden Web, consists of content generated dynamically from data sources such as databases or file systems. Unlike the Surface Web, where data are available through URLs, data on the Deep Web are guarded by a search interface.
The amount of data on the Deep Web far exceeds that of the Surface Web: recent estimates put it at nearly 92,000 terabytes, versus only 167 terabytes on the Surface Web.
With its myriad databases and hidden content, the Deep Web is an important yet largely unexplored frontier for information search.

II. HOW SEARCH ENGINES WORK
Search engines obtain their listings in two ways:
i. Authors may submit their own Web pages.
ii. Search engines "crawl" or "spider" documents by following hyperlinks from one page to another.
Crawlers work by recording every hypertext link in every page they crawl and index. Like ripples propagating across a pond, search engine crawlers are able to extend their indices further and further from their starting points.
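The sketch below illustrates this crawl-and-index loop in Python. It is a minimal sketch with illustrative names of our own; a production crawler would add politeness delays, robots.txt checks and ranking.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href target of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=100):
    """Breadth-first crawl: index a page, then follow its hyperlinks."""
    frontier, seen, index = [seed_url], {seed_url}, {}
    while frontier and len(index) < max_pages:
        url = frontier.pop(0)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable pages are simply skipped
        index[url] = html  # a real engine would tokenize and rank here
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return index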
Why, then, are some Web pages invisible to search engines? Most Websites are not invisible as a whole, but for sites referred to as Deep Web sites, the majority of pages within the site may be invisible, i.e., not discoverable through search engine queries.
Many documents or Web pages are accessible to search engines only with difficulty, or not at all, for several reasons:
• Pages that search engines could find and include, but choose not to because the page lies too deep within the site and search engine producers do not expend the resources to crawl that deeply.
• Pages that a search engine has been asked not to index.
• Pages where, for privacy reasons, the creator of the site asks search engine crawler programs ("robots") not to index the page. This is done using the robots exclusion protocol (see the sketch after Fig. 1).
• Pages that Websites produce dynamically by processing a user's query through an internal search engine. These pages cannot be indexed by search engines because they have no static URL.
• Pages protected by login and password authentication, which cannot be indexed.
• Pages that a search engine cannot index because it cannot interpret their document format.

Fig. 1. Poorly Indexed Webpages
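The robots exclusion protocol mentioned above can be honoured with a few lines of Python's standard library. A minimal sketch, assuming a site at example.com and a crawler named "MyCrawler" (both placeholders):

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt, where the owner lists
# the paths that crawlers are asked to stay away from.
rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# A well-behaved crawler consults this file before indexing a page.
page = "http://example.com/private/report.html"
if rp.can_fetch("MyCrawler", page):
    print("allowed to index", page)
else:
    print("excluded by robots.txt; the page stays hidden from the engine")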

III. THE GRAY ZONE
Surface Web content is persistent on static pages
discoverable by search engines through crawling, while
Deep Web content is only presented dynamically in
response to a direct request. However, once directly
requested, Deep Web content comes associated with a
URL, most often containing the database record number,
that can be re-used later to obtain the same document.
Consider an example:
http://www.flipkart.com/search/a/all?fk-search=all&query=nokia
In the above URL, the query= parameter carries the search term the user entered; it is passed to the site's database, and the results are served dynamically from the flipkart.com database.
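A minimal sketch of how such a URL encodes the user's query, using Python's standard library (the parameter names simply mirror the Flipkart example above):

from urllib.parse import parse_qs, urlencode, urlparse

# Build the dynamic URL from its base address and query parameters.
base = "http://www.flipkart.com/search/a/all"
params = {"fk-search": "all", "query": "nokia"}
url = base + "?" + urlencode(params)
print(url)  # http://www.flipkart.com/search/a/all?fk-search=all&query=nokia

# Anyone holding this URL (a user, or a crawler that finds it on a
# static page) can recover the query and re-issue the same request.
print(parse_qs(urlparse(url).query)["query"])  # ['nokia']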
Now, if we were doing comprehensive research
on this company and posting these results on our own
Web page, other users could click on this URL and get
the same information. Importantly, if we had posted this
URL on a static Web page, search engine crawlers could
also discover it, use the same URL as shown above, and
then index the contents. It is in this manner that
Deep content can be brought to the Surface. Any Deep
content listed on a static Web page is discoverable by
crawlers and therefore indexable by search engines. It is
impossible to scrub large Deep Web sites for all of their content in this manner, but it does show why Deep Web content occasionally appears in search engine results. This gray zone also encompasses Surface Web sites that are available through Deep Web sites. For instance, the Open Directory Project is an effort to organize the best of Surface Web content using volunteer editors or guides.
The Open Directory looks something like Yahoo!; that is,
it is a tree structure with directory URL results at each
branch. The results pages are static, laid out like disk
directories, and are therefore easily indexable by the
major search engines.
The Open Directory claims a subject structure
of 248,000 categories, each of which is a static page. The
key point is that every one of these 248,000 pages is
indexable by major search engines.
Four major search engines with broad Surface
coverage allow searches to be specified based on URL.
The query "URL:dmoz.org" (the address for the Open
Directory site) was posed to these engines with these
results:
Engine                  OPD Pages    Yield
Open Directory (OPD)      248,706
Bing                       17,833     7.2%
Fast                       12,199     4.9%
Northern Light             11,120     4.5%
Go (Infoseek)               1,970     0.8%
Table 1. Incomplete indexing of Surface Web sites
Clearly, the engines themselves are imposing
decision rules with respect to either depth or breadth of
Surface pages indexed for a given site. There was also
broad variability in the timeliness of results from these
engines. Specialized Surface sources or engines should
therefore be considered when truly Deep searching is
desired. Again, the bright line between Deep and
Surface Web shows shades of gray.

IV. QUALITY OF SITES ON DEEP WEB
Quality is of course a purely subjective concept. This
study is therefore based on this simple precept: if a user
obtains from a site on the Deep Web exactly the results
he was looking for, the quality of that site can be
considered very good. Conversely, if the user does not obtain satisfactory results, the quality is considered very poor.
In terms of relevance, the quality of the Deep Web is
thought to be three times better than that of the Surface Web.
Deep Web content includes:
a) Specialized Databases
b) Publications
c) Internal Databases
d) Online Libraries
e) Yellow Pages and Phone Directories
f) Research Databases

Fig. 2. Distribution of Deep Web sites by content type

A table comparing Surface Web and Deep Web quality is given below:

Field           Surface Web                 Deep Web
              Total  Quality  Yield      Total  Quality  Yield
Agriculture     400       20   5.0%        300       42  14.0%
Medicine        500       23   4.6%        400       50  12.5%
Finance         350       18   5.1%        600       75  12.5%
Science         700       30   4.3%        700       80  11.4%
Law             260       12   4.6%        320       38  11.9%
TOTAL         2,210      103   4.7%      2,320      285  12.3%
Table 2. Surface vs. Deep Web Quality
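The yield column can be verified directly: yield is simply quality sites divided by total sites. A quick check in Python against the figures above:

# Rows hold (surface total, surface quality, deep total, deep quality).
rows = {
    "Agriculture": (400, 20, 300, 42),
    "Medicine":    (500, 23, 400, 50),
    "Finance":     (350, 18, 600, 75),
    "Science":     (700, 30, 700, 80),
    "Law":         (260, 12, 320, 38),
}
for field, (s_tot, s_q, d_tot, d_q) in rows.items():
    print(f"{field:11s} surface {s_q / s_tot:5.1%}  deep {d_q / d_tot:5.1%}")
# Overall: 103/2,210 = 4.7% on the Surface Web versus 285/2,320 = 12.3%
# on the Deep Web, i.e. roughly the threefold quality advantage claimed.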

Apart from the useful data mentioned above, the Deep Web also includes illegal and disturbing content.

V. THE DARK SIDE OF THE INTERNET
The dark side of the Internet, commonly called the Dark Internet or dark address space, refers to any or all unreachable network hosts on the Internet. The Dark Internet represents only a small part of the Deep Web, comprising its illegal and secret content. It is any portion of the Internet that can no longer be accessed through conventional means.
Popular belief holds that the Internet represents a completely connected system. It turns out that this is just not true: researchers, over years of gathering and analyzing data, have discovered that for much of the Internet the shortest path between two points does not exist.
Failures within the allocation of Internet
resources due to the Internet's chaotic tendencies of
growth and decay are a leading cause of dark address
formation. One form of dark address is military sites on
the archaic MILNET. These government networks are as
old as the original Arpanet, and have simply not been
incorporated into the Internet's evolving architecture. It is also thought that hackers use malicious techniques to hijack private routers, either to divert traffic or to mask illegal activity. Through these hijacked routers, hackers create private networks over which illegal activities spread.
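What "no longer accessible through conventional means" looks like in practice can be sketched with an ordinary reachability probe: a host in dark address space never completes even a basic TCP handshake. A minimal sketch (the host name is a placeholder):

import socket

def is_reachable(host, port=80, timeout=5):
    """Attempt a plain TCP connection; True only if the handshake completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # timeouts, refusals and unroutable addresses all land here
        return False

print(is_reachable("example.com"))  # a routable Surface Web host: True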

VI. DARKNETS
A Darknet or anonymity network is a private,
distributed P2P file sharing network where connections
are made only between trusted peers. Darknets are
distinct from other distributed P2P networks as sharing is
anonymous (that is, IP addresses are not publicly shared),
and therefore users can communicate with little fear of
governmental or corporate interference. For this reason,
Darknets are often associated with dissident political
communications, as well as various illegal activities.
More generally, the term "Darknet" is used to refer to all
"underground" Web communications and technologies,
most commonly those associated with illegal activity or
dissent.
The three major anonymity networks on the Internet are Tor/Onionland, I2P and Freenet. Each is designed for a different specific purpose, and no one network can do what the three can do together: Tor and I2P cannot persist information the way Freenet can, Tor and Freenet cannot offer the generic transports that I2P provides, and Freenet does not handle data streaming as well as Tor and I2P. There is also no better proxy system than the Tor network.
A. Tor/Onionland
Tor (The Onion Router) is an anonymous Internet proxy. Traffic is proxied through multiple Tor relays and eventually passes through a Tor exit relay that allows it to leave Tor and enter the Internet. Tor has the most attention and the most support of the three networks; its user base averages 100,000 to 200,000 users, the largest of the three. Tor also provides an anonymous intranet, often referred to as Onionland.
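As an illustration of Tor's proxy role, the sketch below routes an ordinary HTTPS request through a locally running Tor client, which by default exposes a SOCKS5 proxy on 127.0.0.1:9050. It assumes the third-party requests and PySocks packages are installed (pip install requests[socks]).

import requests

# Point both schemes at the local Tor client; the socks5h variant
# resolves DNS through Tor as well, so lookups do not leak.
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

# The request leaves the network through a Tor exit relay, so the
# destination sees the exit relay's address rather than ours.
response = requests.get("https://check.torproject.org/",
                        proxies=proxies, timeout=30)
print(response.status_code)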
B. Freenet
Freenet is an anonymous data-publishing network and is very different from Tor and I2P. It has much higher latency and focuses more on friend-to-friend interactions, often with military-grade security. Freenet uses UDP and is the oldest of the three networks. Its size is hard to gauge because of its ability to connect exclusively to friends rather than strangers; it is estimated to have about 20,000 active machines, but may have more.
C. I2P (Invisible Internet Project)
I2P is a distributed peer-to-peer anonymous network layer. It allows data to be sent between computers running I2P anonymously, with multilayer end-to-end encryption. I2P derives from IIP (the Invisible IRC Project), one of Freenet's sister projects. Unlike Tor, I2P focuses exclusively on internal communication rather than proxying to the regular Internet. It currently has 9,000 to 14,000 active machines, depending on the time of day; most of the nodes are European or Russian.

VII. LEVELS OF DARK INTERNET
As discussed earlier, the Dark Internet cannot be accessed by conventional methods.
There are, supposedly, five levels of the Deep Web (not counting Level 0), each more difficult to access than the one before it. For example, only a simple proxy is needed to access Level 3, but complex hardware is apparently needed to access parts of Level 4 and all the following levels.
A brief description of these levels is given below:

A. Level 0 - Common Web
This level is the one you browse every day: YouTube, Facebook, Wikipedia and other famous or easily accessible Websites can be found here.
B. Level 1 - Surface Web
This level is still accessible through normal means,
but contains "darker" Websites.
C. Level 2 - Bergie Web
This level is the last one normally accessible: all levels that follow it have to be accessed with a proxy, through Darknets, or by modifying your hardware.
D. Level 3 - Deep Web
The first part of this level has to be accessed with a proxy; here the Deep Web begins. This part of the Web can be accessed by connecting to Darknets such as Tor, I2P, Freenet, etc.
E. Level 4 - Charter Web
This level is also divided into two parts. The first can be accessed through private networks; things such as drug dealers, banned movies and books, and black markets exist there. The second part can be accessed through a hardware modification, a "Closed Shell System". Here, experimental hardware information (Gadolinium Gallium Garnet Quantum Electronic Processors) and darker information, such as the "Law of 13" and World War 2 experiments, can be found.

F. Level 5 - Marianas Web
Secret government documentation exists here.
According to acclaimed hackers, three more levels exist after the fifth one, though this is yet to be proven.

VIII. FUTURE OF DEEP WEB
The lines between search engine content and the
Deep Web have begun to blur, as search services start to
provide access to part or all of once-restricted content.
An increasing amount of Deep Web content is opening
up to free search as publishers and libraries make
agreements with large search engines. In the future, Deep
Web content may be defined less by opportunity for
search than by access fees or other types of
authentication.

IX. ADVANTAGES
• The Deep Web is a vast store of comprehensive scientific knowledge.
• Due to its higher quality, research work can be done more efficiently.
• Due to its high degree of anonymity and secure connections on par with military grades, the Deep Web is a secure place for activists and intelligence organizations.
• Bitcoin, the currency commonly used on the Deep Web, enables anonymous transactions involving large amounts of money, making the process much safer.
• The Deep Web provides a safe ground for government agencies to discuss and share sensitive and potentially dangerous data.

X. DISADVANTAGES
• The Deep Web is an unstructured entity based on the dynamic generation of data, so the data may be ephemeral.
• Connection speeds are quite low due to P2P networks and multiple security layers.
• The Deep Web is a thriving place for criminals such as terrorists, pedophiles, hitmen and drug dealers, making it an unsafe place for unsuspecting users.
• Darknets also have big ramifications for digital rights management, piracy, and other things that keep the entertainment and software industries awake at night.
• Details of sensitive information, such as credit card account data, are readily available.



XI. CONCLUSION
The Deep Web is a vast source of reliable information that can be harnessed for constructive purposes. However, this will be possible only if the darker part of the Deep Web can be kept under control.

XII. REFERENCES
[1] BrightPlanet, "The Deep Web: Surfacing Hidden Value", white paper.
[2] DigiMind, "Discover and Exploit the Invisible Web for Competitive Intelligence", white paper.
[3] "Using the Deep Web: A How-To Guide for IT Professionals", white paper.
[4] R. Hock, "The Deep Web", 2008.
[5] The Guardian, "The dark side of the internet". http://www.guardian.co.uk/technology/2009/nov/26/dark-side-internet-freenet
[6] BBC News, "Expedition to the lost net". http://news.bbc.co.uk/2/hi/science/nature/1721006.stm
[7] Computer Research & Technology, "The Dark Internet". http://www.crt.net.au/About/ETopics/Archives/darkint.htm
[8] Wikia, "Levels of the Deep Web". http://DeepWeb.wikia.com/wiki/Levels_of_the_Deep_Web
[9] Wikipedia, "Deep Web". http://en.wikipedia.org/wiki/Deep_Web
[10] Wikipedia, "Dark Internet". http://en.wikipedia.org/wiki/Dark_Internet
[11] Wikipedia, "Darknet (file sharing)". http://en.wikipedia.org/wiki/Darknet_(file_sharing)
