The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)

In this paper, Larry and Sergey present Google.

Features

One of Google's main features is its use of PageRank to rank search results. For an arbitrary site $A$ and a set of sites $T_1, \ldots, T_n$ that link to it, the PageRank function satisfies the following equation:

$$ PageRank(A) = (1 - d) + d \sum_{i=1}^n \frac{PageRank(T_i)}{OutDegree(T_i)} $$

where $d$ is a damping factor. Intuitively, sites get a high PageRank if they are frequently linked to or linked to by other sites with high PageRank.

Google also associates anchor text (the text part of a link) with the page being linked to instead of the page with the link. This anchor text provides information about a site which the site itself may not include.

Google also provides a number of other miscellaneous features. For example, it scores words in documents based on their size and boldness. It also maintains a repository of HTML that researchers can use as a dataset.

Most related work is in the field of information retrieval, but this literature typically assumes that information is being extracted from a small set of homogeneous documents. The web is huge and varied, and users can even maliciously try to affect search engine results.

System Anatomy

A URL Server sends URLs to a set of crawlers which download sites and send them to a store server which compresses them and stores them in a repository. An indexer uses the repository to build a forward index (stored in a set of barrels) which maps documents to a set of hits. Each hit records a word in the document, its position in the document, its boldness, etc. The indexer also stores all anchors in a file which a URL resolver uses to build a graph for PageRank. A sorter periodically converts forward indexes into inverted indexes.

Data Structures

Crawling the Web

The URL server and the crawlers are implemented in Python. Each crawler has about 300 open connections, performs event loop asynchronous IO, and maintains a local DNS cache to boost performance. Crawling is the most fragile part of the entire search engine and requires a lot of corner case handling.

Indexing

Google implements a custom parser to handle exotic HTML. The HTML is converted into a set of hits using the lexicon. New words not in the lexicon are logged and later added to the lexicon.

A search query is converted into a set of WordIds. The index is searched to find a set of documents that contain the words. The results are ranked by a number of factors (e.g. PageRank, proximity for range searches) before being returned to the user.