
Writing a Web Crawler in the Java Programming Language


By Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold of Muscle Fish, LLC
January 1998
Everyone uses web crawlers, indirectly at least! Every time you search the Internet using a service such as Alta Vista, Excite, or Lycos, you're making use of an index that's based on the output of a web crawler. Web crawlers, also known as spiders, robots, or wanderers, are software programs that automatically traverse the Web. Search engines use crawlers to find what's on the Web; then they construct an index of the pages that were found.

However, you might want to use a crawler directly. You might even want to write your own! Here are some possible reasons:

- You want to maintain mirror sites for popular Web sites.
- You need to test web pages and links for valid syntax and structure.
- You want to monitor sites to see when their structure or contents change.
- Your company needs to search for copyright infringements.
- You'd like to build a special-purpose index, for example, one that has some understanding of the content stored in multimedia files on the Web.

This article explains what web crawlers are. It includes a web-crawling demo program, written in the Java programming language, that you can run from your browser. The demo traverses the Web automatically, shows a running list of files it has found, and updates the list each time it finds a new one. You can specify what type of file you want to find. The Java language source code for this demo application is provided as a programming example.

How Web Crawlers Work


Web crawlers start by parsing a specified web page, noting any hypertext links on that page that point to other web pages. They then parse those pages for new links, and so on, recursively. Web-crawler software doesn't actually move around to different computers on the Internet, as viruses or intelligent agents do. A crawler resides on a single machine. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. All the crawler really does is to automate the process of following links. Following links isn't greatly useful in itself, of course. The list of linked pages almost always serves some subsequent purpose. The most common use is to build an index for a web search engine, but crawlers are also used for other purposes, such as those mentioned in the previous section.
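To make that request-and-parse cycle concrete, here is a minimal sketch in modern Java (not the demo's code): the class name PageFetcher, the extractLinks method, and the crude href regular expression are all invented for this illustration. A production crawler would use a real HTML parser rather than a regular expression.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PageFetcher {
        // Crude pattern for href="..." attributes; good enough for a sketch.
        private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

        // Download one page over HTTP and return the raw link targets found in it.
        static List<String> extractLinks(String pageUrl) throws Exception {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(pageUrl).openConnection();
            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
            List<String> links = new ArrayList<>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                links.add(m.group(1));   // the value inside href="..."
            }
            return links;
        }
    }

Everything a crawler does beyond this, queuing the extracted links, deciding which ones to visit next, and recording what it finds, is bookkeeping on top of this fetch-and-extract step.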


Muscle Fish uses a crawler to search the Web for audio files. This is a straightforward task, as shown by the demo in the next section. It turns out that searching for audio files is not very different from searching for any other kind of file. On the other hand, indexing audio is anything but straightforward. Most search engines, if they handle audio at all, index only textual information that's associated with the sound file. Muscle Fish's approach is to acoustically analyze the audio itself. This feature lets you search for sound files based on how they actually sound; you're not limited to searching for whatever words happen to be located nearby on the same web page. (A forthcoming article and demo program will show this feature.)

A Web-Crawling Demo Program


The simple application shown below crawls the Web, searching for a specified type of file.

Note: This demo was written using JDK 1.1.3, and not all web browsers support such a recent version of the JDK. You can run the demo on any platform by using the HotJava browser. On the Macintosh, the demo should work with any browser that uses MRJ (Macintosh Runtime for Java) 2.0.

Application source code.

To run the demo, follow these steps:

1. Type a valid URL (web address), including the "http://" portion, in the text field at the top of the application window.
2. Click the Search button.
3. Look at the status area below the scrolling list. In this area, the application reports which page it is currently searching. As it encounters links on the page, it adds any new URLs to the scrolling list.

The application remembers which pages it's already visited, so it won't search any web page twice. This prevents infinite loops. As you inspect the list of URLs, you can see that the application performs a breadth-first search. In other words, it accumulates a list of all the links that are on the current page before it follows any of the links to a new page.

If you tire of witnessing this little tour of the Web, click the Stop button. The status area reports "stopped." If you let the tour run without stopping, it will eventually stop on its own once it's found 50 files. At this point, it reports "reached search limit of 50." (You can increase the limit by changing the SEARCH_LIMIT constant in the source code.) The application will also stop automatically if it encounters a dead end, meaning that it's traversed all the files that are directly or indirectly available from the starting position you specified. If this happens, the application reports "done." The next time you click Search, the list of files gets cleared, and the search process starts over again.

Notice that there's a pull-down menu that lets you specify what type of file you want to find. The default is HTML text files. You can also choose "audio/basic," "audio/au," "audio/aiff," "audio/wav," "video/mpeg," or "video/x-avi."
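Each entry in that pull-down menu is a MIME content type, so one simple way for a crawler to decide whether a URL matches the requested type is to compare it against the Content-Type header the server returns. The fragment below is a sketch of that idea only; it is written in modern Java rather than JDK 1.1, and the MimeCheck class and matchesType method are invented for this illustration rather than taken from the demo's source.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class MimeCheck {
        // Returns true if the resource at the URL reports the MIME type the user
        // asked for, e.g. "text/html", "audio/basic", or "video/mpeg".
        static boolean matchesType(String url, String wantedMime) {
            try {
                HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
                conn.setRequestMethod("HEAD");   // headers only; don't download the body
                String contentType = conn.getContentType();
                return contentType != null && contentType.startsWith(wantedMime);
            } catch (Exception e) {
                return false;                    // unreachable or malformed URL
            }
        }
    }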

A Look at the Code


Take a look at the Java-language source code for this demo. The code occupies fewer than 400 lines, including comments. It is a testament to the elegance of the JDK that this application took only a few person-hours to write from scratch. (Muscle Fish had never written a crawler before, nor was any pre-existing web-crawler code borrowed or studied.) Here's a pseudocode summary of the algorithm:

    Get the user's input: the starting URL and the desired file type.
    Add the URL to the currently empty list of URLs to search.
    While the list of URLs to search is not empty,
    {
        Get the first URL in the list.
        Move the URL to the list of URLs already searched.
        Check the URL to make sure its protocol is HTTP
            (if not, break out of the loop, back to "While").
        See whether there's a robots.txt file at this site that includes a
            "Disallow" statement. (If so, break out of the loop, back to "While".)
        Try to "open" the URL (that is, retrieve that document from the Web).
        If it's not an HTML file, break out of the loop, back to "While."
        Step through the HTML file. While the HTML text contains another link,
        {
            Validate the link's URL and make sure robots are allowed
                (just as in the outer loop).
            If it's an HTML file,
                If the URL isn't present in either the to-search list or the
                already-searched list, add it to the to-search list.
            Else if it's the type of the file the user requested,
                Add it to the list of files found.
        }
    }
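For readers who want something concrete to start from, here is one way the breadth-first loop above might look in compact, modern Java. This is a sketch only, not the demo's code: it reuses the hypothetical extractLinks and matchesType helpers sketched earlier, resolves relative links against the page they appeared on, and omits the robots.txt check, which is discussed next.

    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class SimpleCrawler {
        static final int SEARCH_LIMIT = 50;   // stop after this many matching files

        static List<String> crawl(String startUrl, String wantedMime) throws Exception {
            Deque<String> toSearch = new ArrayDeque<>();   // URLs still to visit (FIFO = breadth-first)
            Set<String> searched = new HashSet<>();        // URLs already visited
            List<String> found = new ArrayList<>();        // files of the requested type

            toSearch.add(startUrl);
            while (!toSearch.isEmpty() && found.size() < SEARCH_LIMIT) {
                String url = toSearch.removeFirst();
                if (!searched.add(url) || !url.startsWith("http")) {
                    continue;                              // already visited, or not an HTTP URL
                }
                for (String raw : PageFetcher.extractLinks(url)) {
                    // Resolve relative links against the page they appeared on.
                    String link = new URL(new URL(url), raw).toString();
                    if (MimeCheck.matchesType(link, "text/html")) {
                        if (!searched.contains(link) && !toSearch.contains(link)) {
                            toSearch.addLast(link);        // queue HTML pages for later parsing
                        }
                    } else if (MimeCheck.matchesType(link, wantedMime)) {
                        found.add(link);                   // a file of the type the user asked for
                    }
                }
            }
            return found;
        }
    }

Because new pages are always appended to the end of the to-search queue, every link on the current page is recorded before any of those links is itself visited, which is exactly the breadth-first behavior described above.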

This demo tries to respect the robots exclusion standard, meaning that it avoids sites where it's unwelcome. Any site can exclude web crawlers from all or part of its filesystem by putting certain statements in a file called robots.txt. See the robotSafe function in the demo's source code. This function is conservative in that it avoids sites where any crawler is disallowed, even if this particular one is not. (There is a new HTML meta-tag called ROBOTS, which this demo does not yet support. If you revise the source code to support this meta-tag, send your code to the authors and the version posted here will be updated.)
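By way of illustration only, a conservative check along those lines might look like the sketch below. This is not the demo's robotSafe function; it is deliberately cruder, treating any Disallow line in a site's robots.txt as a reason to stay away, regardless of which user-agent or path it names.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class RobotsCheck {
        // Overly conservative test: returns false if the site's robots.txt contains
        // any "Disallow:" line with a non-empty path, no matter whom it applies to.
        static boolean robotSafe(URL target) {
            try {
                URL robots = new URL(target.getProtocol() + "://"
                        + target.getHost() + "/robots.txt");
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(robots.openStream()))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String t = line.trim().toLowerCase();
                        if (t.startsWith("disallow:")
                                && !t.substring("disallow:".length()).trim().isEmpty()) {
                            return false;   // some crawler is excluded somewhere on this site
                        }
                    }
                }
            } catch (Exception e) {
                // No robots.txt or site unreachable: treat the site as allowed.
            }
            return true;
        }
    }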

Where to Go from Here


This simple programming example might have given you some ideas about how to write a full-fledged web crawler. Muscle Fish can't provide technical support for running this demo program or for writing crawlers. However, there are various resources on the Web for people interested in crawlers. The Web Robots Pages is a good starting point, and it contains links to other important sites.

Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold are members of Muscle Fish, LLC, a software consulting firm in Berkeley, California. Muscle Fish specializes in audio and music technology, and produces software that searches for sound based on its acoustical content.
