
CSE 5337/7337: Information Retrieval and Web Search

Spring 2016, Project 2: Query engine implementation (100 points)


MAKE UP VERSION
Deliverables:
1. Complete code in a compressed archive (zip, tgz, etc.)
2. A readme file with a complete description of the software used, plus installation,
compilation, and execution instructions, so that I can install and run your program if needed.
3. A document with the results for the tasks below.
Task:
Develop a simplified query engine.
Test your program only on the data in:
http://lyle.smu.edu/~fmoore
1. I have provided you with the complete Java source for a simple web crawler. One
inefficiency is that the robots.txt file associated with a URL is retrieved again for
every page. Modify the program so that when the robots.txt file is retrieved, it is cached
(either in memory or on disk, your choice) so you can refer to your copy rather than
refetching the file every time. [10 points]
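
As an illustration of the caching idea in item 1, here is a minimal in-memory sketch. The class name RobotsCache, its methods, and the fetcher function are hypothetical and do not come from the provided crawler; the sketch also assumes that one robots.txt per host is enough. The fetcher is passed in so the cache can be tried out without network access.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper; not part of the provided crawler source.
class RobotsCache {
    // host name -> robots.txt body fetched for that host
    private final Map<String, String> cache = new HashMap<>();
    // Assumed signature: takes a host name, returns the robots.txt text (or null).
    private final Function<String, String> fetcher;
    private int fetchCount = 0;

    RobotsCache(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Returns the robots.txt body for the host, calling the fetcher only on a cache miss.
    String get(String host) {
        String key = host.toLowerCase(Locale.ROOT); // host names are case-insensitive
        if (!cache.containsKey(key)) {
            String body = fetcher.apply(key);
            fetchCount++;
            cache.put(key, body == null ? "" : body); // store an empty body if nothing came back
        }
        return cache.get(key);
    }

    // Number of times the fetcher was actually invoked.
    int fetchCount() {
        return fetchCount;
    }
}
```

In a real crawler the fetcher would perform the HTTP request; a disk-backed variant would replace the map with files keyed by host.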
2. You will need to build a dictionary of words. [20 points]
a) What is your definition of a word?
b) You can assume an upper bound of 3000 words. Modify the processpage routine to
add terms to the dictionary and to create the inverted index. You will need a data
structure that holds the page identifier, URL, checksum, and a pointer to the words
on the page.
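
One possible shape for the dictionary and inverted index in item 2 is sketched below. It assumes, purely for illustration, that a word is a maximal run of ASCII letters or digits, folded to lower case; your own answer to 2a may differ. The class InvertedIndex and its methods are hypothetical names, and the per-page record (URL, checksum) is left out to keep the sketch short.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical structure; not taken from the provided crawler.
class InvertedIndex {
    // term -> (page identifier -> number of occurrences on that page)
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    // Tokenizes the page text and records each term occurrence under the page identifier.
    void addPage(int pageId, String text) {
        // Assumed word definition: runs of a-z and 0-9 after lower-casing.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split yields an empty first token when text starts with a separator
            }
            postings.computeIfAbsent(token, k -> new TreeMap<>())
                    .merge(pageId, 1, Integer::sum);
        }
    }

    // Occurrences of the term on the page; 0 if the term or the page is unknown.
    int termFrequency(String term, int pageId) {
        Map<Integer, Integer> pages = postings.get(term);
        if (pages == null) {
            return 0;
        }
        return pages.getOrDefault(pageId, 0);
    }

    // The dictionary is the set of indexed terms, in sorted order.
    Set<String> dictionary() {
        return postings.keySet();
    }
}
```

The same per-page counts can later serve as the word/document frequency matrix needed for the cosine computation.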
3. For the purpose of this project, you may assume a maximum of 30 documents. You will
need to create a word/document frequency matrix to support queries. [20 points]
a) Modify addnewurl so that it retrieves .txt files in addition to .htm and .html files,
and make sure you don't retrieve URLs outside of my directory.
b) Modify the program to read in a list of stop words from a file, then modify
processpage to remove stop words from the page being processed.
c) Modify the run procedure to compute a checksum of the page returned by getpage. If
that checksum matches the checksum of any previously read page, then display a
message that this is a duplicate file and ignore it.
d) Make the necessary modifications to save the words and their numbers of occurrences
to support the cosine computation.
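
The duplicate check in item 3c can be sketched as follows. The choice of CRC32 is an assumption made for brevity (any checksum or hash would do), and DuplicateDetector is a hypothetical class, not part of the provided crawler. Note that CRC32 can collide, so two different pages could in principle be reported as duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.CRC32;

// Hypothetical helper; not part of the provided crawler source.
class DuplicateDetector {
    // Checksums of every page seen so far.
    private final Set<Long> seen = new HashSet<>();

    // CRC32 over the UTF-8 bytes of the page text (assumed checksum choice).
    static long checksum(String page) {
        CRC32 crc = new CRC32();
        crc.update(page.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Returns true if a page with the same checksum was seen before;
    // otherwise records the checksum and returns false.
    boolean isDuplicate(String page) {
        return !seen.add(checksum(page));
    }
}
```

The run procedure could call isDuplicate on the text returned by getpage and print the duplicate message when it returns true.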
4. The user will be able to enter multiple queries, each consisting of one or more query
words separated by spaces. [10 points]
a) You will need to develop a new procedure that is run after wc.run(argv) and reads a
line of input. If quit is entered, then stop the program; otherwise, the input contains a
query to be processed.
b) What happens if a user enters a stop word?

c) Make sure the input and matching are not case sensitive.
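
A minimal sketch of the query handling in item 4 is shown below. It assumes one possible answer to 4b, namely that stop words are silently dropped from the query; other policies are equally defensible. QueryParser and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of the provided crawler source.
class QueryParser {
    // True when the user typed quit, in any letter case, possibly with surrounding spaces.
    static boolean isQuit(String line) {
        return line.trim().equalsIgnoreCase("quit");
    }

    // Lower-cases the line, splits it on whitespace, and drops stop words
    // (assumed policy for 4b). The stop-word set is expected to be lower case.
    static List<String> parse(String line, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String word : line.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty() && !stopWords.contains(word)) {
                terms.add(word);
            }
        }
        return terms;
    }
}
```

A read loop would call isQuit on each input line first and pass every other line to parse; an empty result means the query contained only stop words.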


5. Implement the cosine similarity of the query against all documents. [40 points]

a) Display the similarity measure and document URL in descending numerical order for
the top 5 non-zero results.
b) Also display the first 20 words of the document.
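
For reference, the cosine similarity of a query vector q and a document vector d is their dot product divided by the product of their lengths: cos(q, d) = (q . d) / (|q| |d|). The sketch below assumes raw term counts as the vector weights (no tf-idf weighting) and uses a hypothetical class name, Cosine; it is one way to compute the measure, not a required design.

```java
import java.util.Map;

// Hypothetical helper; not part of the provided crawler source.
class Cosine {
    // Cosine similarity between two sparse term-count vectors.
    // Returns 0.0 when either vector is empty or all zeros.
    static double similarity(Map<String, Integer> query, Map<String, Integer> doc) {
        double dot = 0.0;
        double queryNormSq = 0.0;
        double docNormSq = 0.0;
        for (Map.Entry<String, Integer> e : query.entrySet()) {
            double q = e.getValue();
            queryNormSq += q * q;
            Integer d = doc.get(e.getKey());
            if (d != null) {
                dot += q * d; // only terms present in both vectors contribute
            }
        }
        for (int count : doc.values()) {
            docNormSq += (double) count * count;
        }
        if (queryNormSq == 0.0 || docNormSq == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(queryNormSq) * Math.sqrt(docNormSq));
    }
}
```

For example, a query with the single term "web" against a document containing "web" three times and "search" four times gives 3 / (1 * 5) = 0.6. Ranking then amounts to computing this value for every document, discarding zeros, and sorting in descending order.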
