CSCI4964: Crawling the Web

Li Ding
Feb 21, 2008

0. Prologue

Tim Berners-Lee's talk on Web day 2007

1. From Web Surfing to Web Crawling

1.1 Surfing the web

1.2 Crawling the web

2. Building a Simple Java Web Crawler

Here is a ten-year-old online tutorial (by Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold of Muscle Fish, LLC, Jan 1998). Beyond the example code, we need to handle the following issues:

2.1 Java Issues

Java has evolved over the past 10 years, so we should use the newer APIs and avoid deprecated code. Consider using JDK 5 or higher.

Below are some useful classes from JDK:

2.2 HTTP content negotiation 

Check out this example:

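        // Open a connection and identify the crawler to the server.
        // (url is a java.net.URL; HTTP_USER_AGENT is a String constant defined elsewhere.)
        URLConnection con = url.openConnection();
        con.setRequestProperty("User-Agent", HTTP_USER_AGENT);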

We need to set properties of the HTTP request header (as in the example above).
We need to get properties of the HTTP response header:

        // connection is the URLConnection opened for the URL
        task.m_conn = (HttpURLConnection) connection;
        // process HTTP response information
        task.m_nHttpResponseCode = task.m_conn.getResponseCode();
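
The two fragments above can be combined into one small sketch. This is only an illustration: the class name HeaderDemo, the User-Agent value, the Accept value, and the timeouts are all assumptions, not part of the original example. Setting the Accept header is what asks the server for content negotiation.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class HeaderDemo {
    // Hypothetical User-Agent string; a real crawler should identify itself clearly.
    static final String HTTP_USER_AGENT = "CSCI4964-Crawler/0.1";

    /** Opens a connection and sets request headers. Nothing is sent yet. */
    static HttpURLConnection prepare(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", HTTP_USER_AGENT);
        // Content negotiation: tell the server which media types we prefer (assumed values).
        conn.setRequestProperty("Accept", "text/html, application/xhtml+xml;q=0.9, */*;q=0.1");
        conn.setConnectTimeout(5000); // milliseconds, assumed
        conn.setReadTimeout(5000);    // milliseconds, assumed
        return conn;
    }

    /** Sends the request and prints a few response header properties. */
    static void printResponseInfo(HttpURLConnection conn) throws IOException {
        int code = conn.getResponseCode(); // this call triggers the actual request
        System.out.println("status:        " + code);
        System.out.println("content type:  " + conn.getContentType());
        System.out.println("last modified: " + conn.getLastModified());
        conn.disconnect();
    }
}
```

Request properties must be set before the request is sent; calling getResponseCode() (or reading the input stream) is what actually contacts the server.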

2.3 Download content of web page

Be careful with character encoding when converting a byte array into a String (not necessary in this work, but important in real-world practice).
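
A minimal sketch of the encoding issue, assuming the page body has already been read into a byte array and that the server sent a Content-Type header such as "text/html; charset=ISO-8859-1". The class name Decoder and the UTF-8 fallback are assumptions; real pages may also declare their encoding in a meta tag, which this sketch ignores.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class Decoder {
    /** Returns the charset named in a Content-Type value, or UTF-8 as an assumed fallback. */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Converts the downloaded bytes into a String using the declared charset. */
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType));
    }
}
```

Calling new String(bytes) without a charset uses the platform default, which can silently garble non-ASCII pages.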

2.4 Extract URLs from web page

Typical pattern of hyperlinks:
    <a href=""> example </a>

URL parsing approaches, e.g. simple string matching or regular expression matching
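
As an illustration of the regular expression approach, here is a rough sketch. The class name LinkExtractor and the pattern are assumptions for this example; the pattern only handles quoted href values inside <a> tags and is no substitute for a real HTML parser. Relative links are resolved against the URL of the page they came from.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Matches <a ... href="..."> or <a ... href='...'>, case-insensitively.
    // Unquoted attribute values are not handled in this sketch.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns absolute http/https URLs linked from the given page. */
    static List<String> extract(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            if (href.isEmpty() || href.startsWith("#")) {
                continue; // empty link or same-page anchor
            }
            try {
                URI abs = base.resolve(href); // turn relative links into absolute ones
                String scheme = abs.getScheme();
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    links.add(abs.toString());
                }
            } catch (IllegalArgumentException e) {
                // malformed href (e.g. contains spaces): skip it
            }
        }
        return links;
    }
}
```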

URL normalization 
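
Normalization maps different spellings of the same URL to one canonical string, so the crawler can recognize a URL it has already seen. The sketch below shows a few common steps (lowercase scheme and host, drop the default port, resolve "." and ".." path segments, drop the fragment). The class name UrlNormalizer and the choice of steps are assumptions; it expects absolute http or https URLs and discards any user-info part.

```java
import java.net.URI;
import java.util.Locale;

class UrlNormalizer {
    /** Returns a canonical form of an absolute http/https URL. */
    static String normalize(String url) {
        URI u = URI.create(url).normalize(); // removes "." and ".." path segments
        if (u.getScheme() == null || u.getHost() == null) {
            throw new IllegalArgumentException("not an absolute URL with a host: " + url);
        }
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);
        int port = u.getPort();
        // The default port carries no information, so drop it.
        if ((scheme.equals("http") && port == 80) || (scheme.equals("https") && port == 443)) {
            port = -1;
        }
        String path = u.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // "http://host" and "http://host/" name the same page
        }
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (port != -1) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        // The fragment (#...) is never sent to the server, so it is left out.
        return sb.toString();
    }
}
```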

2.5 When to skip a URL for politeness or other reasons

respect robots.txt

only visit URLs hosted by certain websites
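
Both checks can be sketched in a few lines. The class name Politeness and its methods are hypothetical. The robots.txt handling is deliberately simplified: it only collects Disallow prefixes from the "User-agent: *" groups and ignores Allow lines, wildcards, and crawler-specific groups.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class Politeness {
    /** True if the URL's host is the allowed host or one of its subdomains. */
    static boolean onAllowedHost(String url, String allowedHost) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        String allowed = allowedHost.toLowerCase(Locale.ROOT);
        return host.equals(allowed) || host.endsWith("." + allowed);
    }

    /** Collects the Disallow prefixes that apply to "User-agent: *". */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<String>();
        boolean inStarGroup = false;
        boolean lastWasUserAgent = false;
        for (String raw : robotsTxt.split("\r?\n")) {
            String line = raw.replaceFirst("#.*", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or unrecognized line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines share one group of rules.
                boolean star = value.equals("*");
                inStarGroup = lastWasUserAgent ? (inStarGroup || star) : star;
                lastWasUserAgent = true;
            } else {
                if (field.equals("disallow") && inStarGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
                lastWasUserAgent = false;
            }
        }
        return prefixes;
    }

    /** True if the URL path does not start with any disallowed prefix. */
    static boolean allowedByRobots(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

The crawler would download /robots.txt once per host, keep the resulting prefix list, and test every candidate URL against it before fetching.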

2.6 Scheduling 

do not revisit some URLs (e.g. content negotiation)

which URLs in the "frontier" pool should be visited first

termination conditions

avoid crawler traps
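
The scheduling points above can be sketched as one loop. This is a simplified illustration under stated assumptions: the class name Frontier is hypothetical, and the linkGraph map stands in for "download the page and extract its URLs" so that the example runs without a network. A FIFO queue gives breadth-first order, a "seen" set prevents revisits, a page limit is the termination condition, and a depth limit is a crude guard against crawler traps.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class Frontier {
    /** A URL together with its link distance from the seed. */
    static final class Entry {
        final String url;
        final int depth;
        Entry(String url, int depth) { this.url = url; this.depth = depth; }
    }

    /** Breadth-first crawl; returns URLs in the order they were visited. */
    static List<String> crawl(String seed, Map<String, List<String>> linkGraph,
                              int maxPages, int maxDepth) {
        Deque<Entry> frontier = new ArrayDeque<Entry>(); // FIFO: oldest URL first
        Set<String> seen = new HashSet<String>();        // URLs already queued or visited
        List<String> visited = new ArrayList<String>();
        frontier.add(new Entry(seed, 0));
        seen.add(seed);
        // Termination: the frontier is empty or the page budget is used up.
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            Entry e = frontier.poll();
            visited.add(e.url);
            if (e.depth >= maxDepth) {
                continue; // do not follow links any deeper (trap guard)
            }
            List<String> out = linkGraph.get(e.url);
            if (out == null) {
                continue; // page has no known outgoing links
            }
            for (String next : out) {
                if (seen.add(next)) { // add() returns false if already seen
                    frontier.add(new Entry(next, e.depth + 1));
                }
            }
        }
        return visited;
    }
}
```

Replacing the FIFO queue with a priority queue is one way to visit "more important" URLs first; what counts as important is a policy decision.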

3. Advanced Topic

3.1 Revisit Policy

A web page may change over time (see, for example, the CS department home page).

How frequently should a crawler revisit a discovered URL to update its status?
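
One simple heuristic, shown here only as an illustration and not as a policy these notes prescribe: revisit sooner when the page was found changed, and back off when it was unchanged. The class name RevisitScheduler, the halving and doubling factors, and the one-hour and thirty-day bounds are all assumptions.

```java
class RevisitScheduler {
    static final long MIN_INTERVAL_MS = 60L * 60 * 1000;           // 1 hour (assumed)
    static final long MAX_INTERVAL_MS = 30L * 24 * 60 * 60 * 1000; // 30 days (assumed)

    /**
     * Computes the next revisit interval for a URL.
     * changed = true means the content differed from the previous visit.
     */
    static long nextInterval(long currentIntervalMs, boolean changed) {
        long next = changed ? currentIntervalMs / 2 : currentIntervalMs * 2;
        // Clamp so that a page is never polled too often or forgotten entirely.
        return Math.max(MIN_INTERVAL_MS, Math.min(MAX_INTERVAL_MS, next));
    }
}
```

For example, a page currently revisited daily would move to every 12 hours after a detected change and to every 2 days after an unchanged visit.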

3.2 Web Page duplication detection

Avoid indexing the same document again and again.

Handle URLs containing query parameters: the same parameters can appear in any order, so n parameters may yield up to n! distinct URLs for the same page.
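
Two inexpensive checks can be sketched as follows. The class name Dedup is hypothetical. Sorting the query parameters assumes the server does not care about their order, which is common but not guaranteed, and the URL is assumed to have had its fragment removed already. The content fingerprint only catches byte-for-byte identical documents; near-duplicates need other techniques.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class Dedup {
    /** Sorts the "&"-separated query parameters so that order no longer matters. */
    static String canonicalQuery(String url) {
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return url; // no query string to reorder
        }
        String[] params = url.substring(q + 1).split("&");
        Arrays.sort(params);
        return url.substring(0, q + 1) + String.join("&", params);
    }

    /** Returns a hex SHA-256 digest of the page content, usable as a set key. */
    static String fingerprint(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

A crawler could keep one set of canonical URLs and one set of fingerprints, and skip indexing whenever either set already contains the new entry.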

3.3 Advanced crawling scheduling