HTTP connections

Nonpersistent HTTP
• At most one object is sent over a TCP connection.
• HTTP/1.0 uses nonpersistent HTTP.

Persistent HTTP
• Multiple objects can be sent over a single TCP connection between client and server.
• HTTP/1.1 uses persistent connections in default mode.

Nonpersistent HTTP

Suppose user enters URL www.someSchool.edu/someDepartment/home.index (contains text, references to 10 jpeg images)

1a. HTTP client initiates TCP connection to HTTP server (process) at www.someSchool.edu on port 80
1b. HTTP server at host www.someSchool.edu waiting for TCP connection at port 80. "accepts" connection, notifying client
2. HTTP client sends HTTP request message (containing URL) into TCP connection socket. Message indicates that client wants object someDepartment/home.index
3. HTTP server receives request message, forms response message containing requested object, and sends message into its socket
2: Application Layer 5 2: Application Layer 6
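The nonpersistent steps 1a to 3 above can be sketched on the loopback interface. This is a minimal illustration, not a real Web server: the function names `serve_one_request`, `fetch_once`, and `demo`, the placeholder body, and the use of 127.0.0.1 with an OS-chosen port (instead of www.someSchool.edu on port 80) are all assumptions made so the example is self-contained.

```python
import socket
import threading


def serve_one_request(listener: socket.socket) -> None:
    """Steps 1b and 3: accept one connection, answer one request, then close."""
    conn, _addr = listener.accept()          # 1b: server "accepts" the connection
    with conn:
        request = b""
        while b"\r\n\r\n" not in request:    # read until the blank line ending the request
            chunk = conn.recv(1024)
            if not chunk:
                break
            request += chunk
        body = b"<html>home.index placeholder</html>"   # hypothetical object
        response = (
            b"HTTP/1.0 200 OK\r\n"
            b"Content-Type: text/html\r\n"
            b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
            b"\r\n" + body
        )
        conn.sendall(response)               # 3: response goes into the socket
    # Leaving the "with" block closes the connection: one object per connection.


def fetch_once(host: str, port: int, path: str) -> bytes:
    """Steps 1a and 2: open a TCP connection, send one GET, read until close."""
    with socket.create_connection((host, port)) as sock:   # 1a
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))               # 2
        chunks = []
        while True:
            chunk = sock.recv(1024)
            if not chunk:                    # server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


def demo() -> bytes:
    """Run one nonpersistent request/response exchange on localhost."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_one_request, args=(listener,))
    server.start()
    reply = fetch_once("127.0.0.1", port, "/someDepartment/home.index")
    server.join()
    listener.close()
    return reply
```

Fetching the 10 referenced images this way would mean calling `fetch_once` ten more times, each with its own TCP connection.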
Persistent HTTP

Nonpersistent HTTP issues:
• requires 2 RTTs per object
• OS overhead for each TCP connection
• browsers often open parallel TCP connections to fetch referenced objects

Persistent HTTP
• server leaves connection open after sending response
• subsequent HTTP messages between same client/server sent over open connection
• client sends requests as soon as it encounters a referenced object
• as little as one RTT for all the referenced objects

HTTP request message

• two types of HTTP messages: request, response
• HTTP request message: ASCII (human-readable format)

request line (GET, POST, HEAD commands):
  GET /somedir/page.html HTTP/1.1
header lines:
  Host: www.someschool.edu
  User-agent: Mozilla/4.0
  Connection: close
  Accept-language:fr
Carriage return, line feed indicates end of message:
  (extra carriage return, line feed)
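The layout of the request message above (request line, header lines, then an extra carriage return and line feed) can be assembled as follows. This is a small sketch; the helper name `build_request` is an assumption, and the header values are the ones shown on the slide.

```python
CRLF = "\r\n"  # carriage return, line feed


def build_request(method, path, headers, version="HTTP/1.1"):
    """Return the ASCII bytes of an HTTP request message without a body."""
    request_line = f"{method} {path} {version}"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    lines = [request_line, *header_lines]
    # Every line ends with CRLF; one extra CRLF marks the end of the message.
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")


example_request = build_request(
    "GET",
    "/somedir/page.html",
    {
        "Host": "www.someschool.edu",
        "User-agent": "Mozilla/4.0",
        "Connection": "close",
        "Accept-language": "fr",
    },
)
```

Printing `example_request.decode("ascii")` shows the same human-readable text as on the slide.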
www.somesite.com/animalsearch?monkeys&banana
Method types

HTTP/1.0
• GET
• POST
• HEAD
  • asks server to leave requested object out of response

HTTP/1.1
• GET, POST, HEAD
• PUT
  • uploads file in entity body to path specified in URL field
• DELETE
  • deletes file specified in the URL field

HTTP response message

status line (protocol, status code, status phrase):
  HTTP/1.1 200 OK
header lines:
  Connection: close
  Date: Thu, 06 Aug 1998 12:00:15 GMT
  Server: Apache/1.3.0 (Unix)
  Last-Modified: Mon, 22 Jun 1998 …...
  Content-Length: 6821
  Content-Type: text/html
data, e.g., requested HTML file:
  data data data data data ...
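A response message with the structure above (status line, header lines, blank line, data) can be split into its parts with a few string operations. This is a minimal sketch under simplifying assumptions: `parse_response` is a hypothetical helper, it ignores repeated or folded headers, and the sample body is a placeholder that does not match the Content-Length value from the slide.

```python
def parse_response(raw: bytes):
    """Split an HTTP response into (version, status code, phrase, headers, body)."""
    head, _sep, body = raw.partition(b"\r\n\r\n")   # blank line ends the headers
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: protocol, status code, status phrase.
    version, status_code, status_phrase = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _colon, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(status_code), status_phrase, headers, body


sample_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Connection: close\r\n"
    b"Date: Thu, 06 Aug 1998 12:00:15 GMT\r\n"
    b"Server: Apache/1.3.0 (Unix)\r\n"
    b"Content-Length: 6821\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"data data data data data ..."   # placeholder, not 6821 bytes
)
```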
Trying out HTTP (client side) for yourself

1. Telnet to your favorite Web server:
   telnet cis.poly.edu 80
   Opens TCP connection to port 80 (default HTTP server port) at cis.poly.edu. Anything typed in is sent to port 80 at cis.poly.edu.
2. Type in a GET HTTP request:
   GET /~ross/ HTTP/1.1
   Host: cis.poly.edu
   By typing this in (hit carriage return twice), you send this minimal (but complete) GET request to the HTTP server.

User-server state: cookies

Many major Web sites use cookies

Four components:
1) cookie header line of HTTP response message
2) cookie header line in HTTP request message
3) cookie file kept on user's host, managed by user's browser
4) back-end database at Web site

Example:
• Susan always accesses the Internet from the same PC
• visits a specific e-commerce site for the first time
• when the initial HTTP request arrives at the site, the site creates a unique ID
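The four cookie components listed above can be mimicked with two small classes. This is a toy model, not a real HTTP implementation: `CookieSite`, `Browser`, the site name `shop.example`, and the starting ID 1678 are hypothetical, and the cookie value is reduced to a bare ID rather than the name=value pairs real sites send.

```python
import itertools


class CookieSite:
    """Web site side: issues IDs and keeps the back-end database (component 4)."""

    def __init__(self, first_id=1678):           # hypothetical starting ID
        self._ids = itertools.count(first_id)
        self.backend_db = {}                      # ID -> list of requested paths

    def handle(self, path, request_headers):
        user_id = request_headers.get("Cookie")   # component 2, if present
        response_headers = {}
        if user_id not in self.backend_db:        # first visit: create unique ID
            user_id = str(next(self._ids))
            self.backend_db[user_id] = []
            response_headers["Set-cookie"] = user_id   # component 1
        self.backend_db[user_id].append(path)
        return response_headers


class Browser:
    """User side: keeps the cookie file (component 3)."""

    def __init__(self):
        self.cookie_file = {}                     # site name -> ID

    def visit(self, site_name, site, path):
        request_headers = {}
        if site_name in self.cookie_file:
            request_headers["Cookie"] = self.cookie_file[site_name]
        response_headers = site.handle(path, request_headers)
        if "Set-cookie" in response_headers:
            self.cookie_file[site_name] = response_headers["Set-cookie"]
        return response_headers
```

On the first visit the response carries a Set-cookie line; on later visits the browser sends the stored ID back and the site can look up the visit history.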
Web caches (proxy server)

Goal: satisfy client request without involving origin server
• user sets browser: Web accesses via cache
• browser sends all HTTP requests to cache
  • object in cache: cache returns object
  • else cache requests object from origin server, then returns object to client

[Figure: two clients send HTTP requests to a proxy server and receive HTTP responses; the proxy server exchanges HTTP requests and responses with two origin servers]

More about Web caching

• cache acts as both client and server
• typically cache is installed by ISP (university, company, residential ISP)

Why Web caching?
• reduce response time for client request
• reduce traffic on an institution's access link.
• Internet dense with caches: enables "poor" content providers to effectively deliver content (but so does P2P file sharing)
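The cache behaviour described above (return the object if it is cached, otherwise fetch it from the origin server and then return it) fits in a few lines. This is a sketch with assumed names: `WebCache` and the `fetch_from_origin` callback are hypothetical, and the cache never expires or evicts objects.

```python
class WebCache:
    """Toy proxy cache: acts as a server to clients and as a client to origins."""

    def __init__(self, fetch_from_origin):
        self._fetch_from_origin = fetch_from_origin   # callable: url -> object
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self._store:                # object in cache: return it
            self.hits += 1
            return self._store[url]
        self.misses += 1                      # else: request it from the origin
        obj = self._fetch_from_origin(url)
        self._store[url] = obj                # keep a copy for later requests
        return obj


origin_requests = []                          # records traffic reaching the origin


def fake_origin(url):
    """Stand-in for an origin server (assumption for the example)."""
    origin_requests.append(url)
    return f"<object at {url}>"
```

Requesting the same URL twice through `WebCache(fake_origin)` reaches the origin only once, which is the traffic reduction on the access link that the slide refers to.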
Caching example (cont)

possible solution: install cache
• suppose hit rate is 0.4

consequence
• 40% requests will be satisfied almost immediately
• 60% requests satisfied by origin server
• utilization of access link reduced to 60%, resulting in negligible delays (say 10 msec)
• total avg delay = Internet delay + access delay + LAN delay = .6*(2.01) secs + .4*milliseconds < 1.4 secs

[Figure: origin servers, public Internet, 1.5 Mbps access link, institutional network with 10 Mbps LAN, institutional cache]

Conditional GET

• Goal: don't send object if cache has up-to-date cached version
• cache: specify date of cached copy in HTTP request
    If-modified-since: <date>
• server: response contains no object if cached copy is up-to-date:
    HTTP/1.0 304 Not Modified

[Figure: cache and server timeline. Object not modified: HTTP request msg with If-modified-since: <date>, HTTP response HTTP/1.0 304 Not Modified. Object modified: HTTP request msg with If-modified-since: <date>, HTTP response HTTP/1.0 200 OK <data>]
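The server side of the conditional GET exchange can be sketched as a single decision: compare the object's last-modified time with the date in the If-modified-since header line. The function name `handle_conditional_get` and the example dates are assumptions; date parsing uses the standard library's `email.utils.parsedate_to_datetime`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def handle_conditional_get(obj_body, last_modified, request_headers):
    """Return (status line, body) for a GET that may carry If-modified-since."""
    since = request_headers.get("If-modified-since")
    if since is not None and last_modified <= parsedate_to_datetime(since):
        # Cached copy is up to date: response contains no object.
        return "HTTP/1.0 304 Not Modified", b""
    # No date given, or object changed after that date: send the object.
    return "HTTP/1.0 200 OK", obj_body


# Hypothetical object, last modified on 22 Jun 1998 at 12:00:00 GMT.
page = b"<data>"
page_last_modified = datetime(1998, 6, 22, 12, 0, 0, tzinfo=timezone.utc)
```

A cache holding a copy dated after `page_last_modified` gets a 304 with an empty body; a cache holding an older copy gets 200 OK with the data.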
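The average-delay estimate in the caching example is a weighted sum over cache misses and cache hits. The sketch below reproduces that arithmetic; the 2.01 s miss delay and the 0.4 hit rate come from the slide, while the 10 ms delay for a hit is an assumption (the slide only says "milliseconds"), and `average_delay` is a hypothetical helper name.

```python
def average_delay(hit_rate, miss_delay_s, hit_delay_s):
    """Expected delay per request: misses go to the origin, hits stay on the LAN."""
    return (1 - hit_rate) * miss_delay_s + hit_rate * hit_delay_s


# 60% of requests take about 2.01 s; 40% are served by the cache.
# The 0.010 s hit delay is an assumed value of "milliseconds" order.
avg_delay_s = average_delay(hit_rate=0.4, miss_delay_s=2.01, hit_delay_s=0.010)
```

With these numbers the result is about 1.21 s, consistent with the "< 1.4 secs" bound stated in the example.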
DNS Distributed, Hierarchical Database
Root DNS Servers
Local Name Server

• Does not strictly belong to hierarchy
• Each ISP (residential ISP, company, university) has one.
  • Also called "default name server"
• When a host makes a DNS query, query is sent to its local DNS server
  • Acts as a proxy, forwards query into hierarchy.

DNS name resolution example

• Host at cis.poly.edu wants IP address for gaia.cs.umass.edu

iterated query:
• contacted server replies with name of server to contact
• "I don't know this name, but ask this server"

[Figure: requesting host cis.poly.edu, local DNS server dns.poly.edu, root DNS server, TLD DNS server, authoritative DNS server dns.cs.umass.edu, target host gaia.cs.umass.edu; messages numbered 1 to 8]

DNS name resolution example

[Figure: root DNS server, gaia.cs.umass.edu]

DNS: caching and updating records
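The iterated query in the resolution example can be imitated with an in-memory hierarchy, where each contacted server either answers or names the next server to ask. Everything in this sketch is a toy: the `SERVERS` table, the helper names, and the address 192.0.2.10 (taken from a documentation-only range) are assumptions, not real DNS data.

```python
# Each toy server knows some answers and some referrals ("ask this server").
SERVERS = {
    "root DNS server": {
        "answers": {},
        "referrals": {"edu": "TLD DNS server"},
    },
    "TLD DNS server": {
        "answers": {},
        "referrals": {"umass.edu": "dns.cs.umass.edu"},
    },
    "dns.cs.umass.edu": {                     # authoritative DNS server
        "answers": {"gaia.cs.umass.edu": "192.0.2.10"},   # made-up address
        "referrals": {},
    },
}


def query(server, name):
    """One query to one server: returns ("answer", ip) or ("referral", next)."""
    data = SERVERS[server]
    if name in data["answers"]:
        return "answer", data["answers"][name]
    for suffix, next_server in data["referrals"].items():
        if name == suffix or name.endswith("." + suffix):
            return "referral", next_server    # "I don't know, but ask this server"
    return "unknown", None


def resolve_iteratively(name, start="root DNS server"):
    """What the local DNS server does: follow referrals until an answer arrives."""
    contacted = []
    server = start
    while True:
        contacted.append(server)
        kind, value = query(server, name)
        if kind == "answer":
            return value, contacted
        if kind != "referral":
            raise LookupError(f"no server knows {name}")
        server = value
```

The list of contacted servers matches the order in the figure: root, then TLD, then the authoritative server.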
DNS records

DNS: distributed db storing resource records (RR)
RR format: (name, value, type, ttl)

Type=A
• name is hostname
• value is IP address
• E.g.: (dns.umass.edu, 128.119.40.111, A)

Type=NS
• name is domain (e.g. foo.com)
• value is hostname of authoritative name server for this domain
• E.g.: (umass.edu, dns.umass.edu, NS)

Type=CNAME
• name is alias name for some "canonical" (the real) name
  www.ibm.com is really servereast.backup2.ibm.com
• value is canonical name
• E.g.: (www.ibm.com, servereast.backup2.ibm.com, CNAME)

Type=MX
• value is name of mailserver associated with name
• E.g.: (foo.com, mail.bar.foo.com, MX)
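The resource records listed above can be stored as (name, value, type, ttl) tuples and filtered by name and type. This is a small sketch: the record contents are the slide's examples, while the ttl of 3600 seconds and the helper names `lookup` and `canonical_name` are assumptions.

```python
from collections import namedtuple

RR = namedtuple("RR", ["name", "value", "type", "ttl"])

ASSUMED_TTL = 3600   # seconds; the slide's examples omit the ttl field

RECORDS = [
    RR("dns.umass.edu", "128.119.40.111", "A", ASSUMED_TTL),
    RR("umass.edu", "dns.umass.edu", "NS", ASSUMED_TTL),
    RR("www.ibm.com", "servereast.backup2.ibm.com", "CNAME", ASSUMED_TTL),
    RR("foo.com", "mail.bar.foo.com", "MX", ASSUMED_TTL),
]


def lookup(name, rtype):
    """Return the values of all records matching the given name and type."""
    return [rr.value for rr in RECORDS if rr.name == name and rr.type == rtype]


def canonical_name(name):
    """Follow CNAME records from an alias to its canonical name."""
    seen = set()
    while name not in seen:          # the "seen" set guards against alias loops
        seen.add(name)
        targets = lookup(name, "CNAME")
        if not targets:
            break
        name = targets[0]
    return name
```

For instance, resolving the name server of umass.edu takes two lookups: an NS lookup for the domain, then an A lookup for the returned hostname.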
DNS protocol, messages

• Name, type fields for a query
• RRs in response to query

Inserting records into DNS

• Example: just created startup "Network Utopia"
• Register name networkutopia.com at a registrar (e.g., Network Solutions)
  • Need to provide registrar with names and IP addresses of your authoritative name server (primary and secondary)
  • Registrar inserts two RRs into the com TLD server
Exercise

Suppose within your Web browser, you click on a link to obtain a Webpage. The IP address for the associated URL is not cached in your local host. Suppose that n DNS servers should be visited before your host receives the IP address. The successive visits incur an RTT of RTT1, RTT2, …, RTTn. Suppose that the base HTML file associated with the link references three very small objects (small pictures) on the same server. Let RTT0 denote the RTT between the local host and the server containing the objects. Neglecting transmission times, how much time elapses with
a) Non-persistent HTTP with no parallel TCP connections?
b) Non-persistent HTTP with parallel connections?
c) Persistent HTTP with pipelining?

Chapter2: Application layer

• Principles of network applications
  • Architecture: client-server or P2P
  • Services that an application needs
• important application-level protocols
  • FTP, SMTP, P2P, ……
• programming network applications
  • socket API
• Web stuff
  • Web searching
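One common way to reason about the exercise is to count round trips: the DNS lookups cost RTT1 + … + RTTn, the base HTML file costs 2·RTT0 (one RTT to set up the TCP connection, one for request and response), and the three objects cost a further amount that depends on the connection strategy. The sketch below encodes that counting; the function name `elapsed_times` is hypothetical, and case b) assumes enough parallel connections to fetch all three objects at once.

```python
def elapsed_times(dns_rtts, rtt0, num_objects=3):
    """Total elapsed time for the three cases, neglecting transmission times."""
    dns = sum(dns_rtts)          # RTT1 + RTT2 + ... + RTTn
    base = 2 * rtt0              # TCP setup + request/response for the base file
    return {
        # a) each object needs its own connection: 2 RTT0 per object, in series
        "nonpersistent_serial": dns + base + num_objects * 2 * rtt0,
        # b) all objects fetched over parallel connections: 2 RTT0 in total
        "nonpersistent_parallel": dns + base + 2 * rtt0,
        # c) connection already open and requests pipelined: 1 RTT0 in total
        "persistent_pipelined": dns + base + rtt0,
    }


# Example values in milliseconds (made up for illustration).
example_times = elapsed_times(dns_rtts=[10, 20, 30], rtt0=100)
```

In symbols, with D = RTT1 + … + RTTn, the three cases come to D + 8·RTT0, D + 4·RTT0, and D + 3·RTT0 under these assumptions.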
How Search Engines Work

• Gather the contents of all web pages (using a program called a crawler or spider)

Standard Web Search Engine Architecture

[Figure: crawler machines crawl the web; check for duplicates, store the documents; docIDs; inverted index; search engine servers; user query; show results to user]
Spiders or crawlers

How to find web pages to visit and copy?
• Can start with a list of domain names, visit the home pages there.
• Look at the hyperlinks on the home page, and follow those links to more pages.
  • Use HTTP commands to GET the pages
• Keep a list of URLs visited, and those still to be visited.
• Each time the program loads in a new HTML page, add the links in that page to the list to be crawled.

Spider behaviour varies

• Parts of a web page that are indexed
• How deeply a site is indexed
• Types of files indexed
• How frequently the site is spidered

Slide adapted from Lew & Davis
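The crawling procedure described above (start from seed pages, keep a list of URLs visited and a list still to visit, add each new page's links to the list) can be sketched without any network access. The `TOY_WEB` link table and the names `crawl` and `toy_get_links` are hypothetical; a real crawler would issue an HTTP GET and parse the returned HTML where this sketch calls `get_links`.

```python
from collections import deque

# Toy link structure standing in for real pages: URL -> links on that page.
TOY_WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/"],
    "b.example/": ["b.example/y"],
    "b.example/y": [],
}


def toy_get_links(url):
    """Stand-in for 'GET the page and extract its hyperlinks'."""
    return TOY_WEB.get(url, [])


def crawl(seed_urls, get_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    to_visit = deque(seed_urls)      # URLs still to be visited
    seen = set(seed_urls)            # URLs already visited or queued
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        visited.append(url)
        for link in get_links(url):  # add the new page's links to the list
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited
```

The `max_pages` limit is one simple way to bound "how deeply a site is indexed"; real spiders apply more elaborate policies.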
The Internet Is Enormous

• A small fraction of the Web that search engines know about; no search engine is exhaustive
• Not the "live" Web, but the search engine's index
• Not the "Deep Web"

"Freshness"

• Need to keep checking pages
  • Pages change (25%,7% large changes)
    • At different frequencies
    • Who is the fastest changing?
  • Pages are removed
• Many search engines cache the pages (store a copy on their own servers)

Record information about each page
• List of words
  • In the title?
  • How far down in the page?
  • Was the word in boldface?
• URLs of pages pointing to this one
• Anchor text on pages pointing to this one
  • The anchor text summarizes what the website is about.
    <a href=http://web.njit… > CS 656 </a>
Inverted Index Example
Inverted Index
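An inverted index maps each word to the documents that contain it, which is what lets a search engine answer a query without scanning every page. The sketch below is a minimal illustration with made-up documents; the names `build_inverted_index` and `search` are hypothetical, and word positions are stored as one example of the per-page information ("how far down in the page?") an index might record.

```python
from collections import defaultdict

# Hypothetical documents keyed by docID.
TOY_PAGES = {
    1: "web search engines crawl the web",
    2: "search engines rank results",
    3: "the web is enormous",
}


def build_inverted_index(pages):
    """Map each word to {docID: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return sorted docIDs of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    matching = set(index.get(words[0], {}))
    for word in words[1:]:
        matching &= set(index.get(word, {}))   # keep docs that have all words
    return sorted(matching)


toy_index = build_inverted_index(TOY_PAGES)
```

Ranking the matching documents is a separate step that this sketch leaves out.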
iii. Results ranking

Some ranking criteria
Image and explanation from http://www.economist.com/science/tq/displayStory.cfm?story_id=3172188
Search Engine Information

• www.searchenginewatch.com
• www.searchenginejournal.com
• www.searchengineshowdown.com
• http://battellemedia.com
• http://jeremy.zawodny.com/blog/

Acknowledgement

Slides about web searching are adapted from the slides authored by Dr. Marti Hearst.