The Anatomy of a Search Engine

Bookmark History

Saved by 46 people (-16 private), first by an anonymous user on 2006-10-29


Public Sticky notes

The citation (link) graph of the web is an important resource that has larg

Highlighted by galaen

PageRank

Highlighted by galaen

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/

Highlighted by cmccooey

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Highlighted by decoeur

(Note: There are two versions of this paper -- a longer full version and a shorter printed version. The full version is available on the web and the conference CD-ROM.)
The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high quality human maintained indices such as Yahoo! or with search engines. Human maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines. We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10^100, and fits well with our goal of building very large-scale search engines.

Highlighted by hkfn_123

Human maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics

Highlighted by zeenko

rely on keyword matching

Highlighted by zeenko

low quality matches

Highlighted by zeenko

advertisers attempt

Highlighted by zeenko

mislead automated search engines

Highlighted by zeenko

Search engine technology has had to scale dramatically to keep up with the growth of the web. In 1994, one of the first web search engines, the World Wide Web Worm (WWWW) [McBryan 94] had an index of 110,000 web pages and web accessible documents. As of November, 1997, the top search engines claim to index from 2 million (WebCrawler) to 100 million web documents (from Search Engine Watch). It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents. At the same time, the number of queries search engines handle has grown incredibly too. In March and April 1994, the World Wide Web Worm received an average of about 1500 queries per day. In November 1997, Altavista claimed it handled roughly 20 million queries per day. With the increasing number of users on the web, and automated systems which query search engines, it is likely that top search engines will handle hundreds of millions of queries per day by the year 2000. The goal of our system is to address many of the problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers.

Highlighted by hkfn_123

a comprehensive index of the Web will contain over a billion documents.

Highlighted by microli

20 million queries per day.

Highlighted by microli

top search engines will handle hundreds of millions of queries per day

Highlighted by microli

Our main goal is to improve the quality of web search engines

Highlighted by zeenko

make it easy to find almost anything on the Web (once all the data is entered)

Highlighted by zeenko

"Junk results" often wash out any results that a user is interested in

Highlighted by zeenko

Fast crawling technology is needed to gather the web documents and keep them up to date.

Highlighted by microli

People are still only willing to look at the first few tens of results

Highlighted by zeenko

Its data structures are optimized for fast and efficient access

Highlighted by microli

One of the main causes of this problem is that the number of documents in the indices has been increasing by many orders of magnitude,

Highlighted by microli

First,

Highlighted by zeenko

two important features that help it produce high precision results

Highlighted by zeenko

link structure

Highlighted by zeenko

PageRank

Highlighted by zeenko

Second

Highlighted by zeenko

Indeed, we want our notion of "relevant" to only include the very best documents

Highlighted by microli

objective measure of its citation importance that corresponds well with people's subjective idea of importance

Highlighted by zeenko

In 1994, one of the first web search engines, the World Wide Web Worm (WWWW) [McBryan 94] had an index of 110,000 web pages and web accessible documents

Highlighted by glasswort

remain largely a black art and to be advertising oriented

Highlighted by microli

WebCrawler

Highlighted by glasswort

makes use of the link structure

Highlighted by microli

utilizes link

Highlighted by microli

page can have a high PageRank if there are many pages that point to it

Highlighted by zeenko

or if there are some pages that point to it and have a high PageRank

Highlighted by zeenko

PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page.

Highlighted by microli

damping factor

Highlighted by microli

a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation.

Highlighted by microli

One important variation is to only add the damping factor d to a single page,

Highlighted by microli

Some argue that on the web, users should specify more accurately what they want and add more words to their query. We disagree vehemently with this position. If a user issues a query like "Bill Clinton" they should get reasonable results since there is an enormous amount of high quality information available on this topic. Given examples like these, we believe that the standard information retrieval work needs to be extended to deal effectively with the web.

Highlighted by zeenko

the number of documents in the indices has been increasing by many orders of magnitude, but the user's ability to look at documents has not

Highlighted by glasswort

There are even numerous companies which specialize in manipulating search engines for profit.

Highlighted by zeenko

This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm

Highlighted by microli

We use anchor propagation mostly because anchor text can help provide better quality results.

Highlighted by microli

technically difficult

Highlighted by microli

has location information for all hits

Highlighted by microli

visual presentation details

Highlighted by microli

web crawling (downloading of web pages)

Highlighted by zeenko

In this section, we will give a high level overview of how the whole system works as pictured in Figure 1. Further sections will discuss the applications and data structures not mentioned in this section. Most of Google is implemented in C or C++ for efficiency and can run in either Solaris or Linux.

In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.

Highlighted by hariprasade
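
The sorter step in the passage above (barrels ordered by docID are re-sorted by wordID to produce the inverted index) can be illustrated with a minimal Python sketch. The tuple layout (doc_id, word_id, hits) and the function name invert_barrel are assumptions made for illustration; the paper's sorter works in place on disk-resident barrels, which this in-memory version does not attempt.

```python
def invert_barrel(forward_barrel):
    """Turn forward-index entries into an inverted index.

    forward_barrel: list of (doc_id, word_id, hits) tuples, ordered by doc_id.
    Returns a dict mapping word_id -> list of (doc_id, hits), ordered by doc_id.
    """
    # Re-sort by wordID first, then docID, mirroring the sorter's job.
    by_word = sorted(forward_barrel, key=lambda entry: (entry[1], entry[0]))
    inverted = {}
    for doc_id, word_id, hits in by_word:
        inverted.setdefault(word_id, []).append((doc_id, hits))
    return inverted


# Hypothetical data: two documents, two words, hit positions as plain ints.
forward = [(1, 7, [0, 4]), (1, 9, [2]), (2, 7, [1])]
index = invert_barrel(forward)
```

Here index[7] lists both documents that contain word 7, which is the doclist a searcher would scan for that word.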

build systems that reasonable numbers of people can actually use

Highlighted by glasswort

the standard vector space model tries to return the document that most closely approximates the query,

Highlighted by microli

s very short documents that are the query plus a few words.

Highlighted by microli

"Bill Clinton Sucks" and picture from a "Bill Clinton" query.

Highlighted by microli

To support novel research uses, Google stores all of the actual documents it crawls in compressed form

Highlighted by glasswort

completely uncontrolled heterogeneous documents.

Highlighted by microli

documents are stored one after the other and are prefixed by docID, length, and URL

Highlighted by zeenko

On the other hand, we define external meta information as information that can be inferred about a document, but is not contained within it.

Highlighted by microli

update frequency, quality, popularity or usage, and citations.

Highlighted by microli

The document index keeps information about each document

Highlighted by zeenko

virtually no control over what people can put on the web.

Highlighted by microli

because any text on the page which is not directly represented to the user is abused to manipulate search engines.

Highlighted by microli

First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank and is described in detail in [Page 98]. Second, Google utilizes links to improve search results.

Highlighted by glasswort

crawling, indexing, and searching

Highlighted by microli

A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information

Highlighted by zeenko

Fancy hits include hits occurring in a URL, title, anchor text, or meta tag

Highlighted by zeenko

Plain hits include everything else

Highlighted by zeenko

several distributed crawlers.

Highlighted by microli

URLserver

Highlighted by microli

Every web page has an associated ID number called a docID

Highlighted by microli

Each document is converted into a set of word occurrences called hits.

Highlighted by microli

a partially sorted forward index.

Highlighted by microli

It puts the anchor text into the forward index,

Highlighted by microli

resorts them by wordID to generate the inverted index.

Highlighted by microli

in place

Highlighted by microli

PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:

Highlighted by glasswort
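
The highlight above stops before the definition itself. In the paper the formula reads PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1...Tn are the pages that link to A, C(T) is the number of links going out of page T, and d is the damping factor (the paper mentions usually setting it to 0.85). The following is a minimal Python sketch of iterating that formula; the dict-based graph, the fixed iteration count, and the handling of pages without outgoing links are assumptions of this sketch, not details taken from the paper.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # assumption: a page with no outgoing links passes on nothing
            share = d * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this toy graph page c collects links from both a and b, so it ends up with the highest rank, while b, which receives only half of a's vote, ends up lowest.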

a list of wordIDs and offsets into the inverted index

Highlighted by microli

this list together with the lexicon produced by the indexer

Highlighted by microli

addressable by 64 bit integers

Highlighted by microli

BigFiles are virtual files spanning multiple file systems

Highlighted by microli

operating systems do not provide enough for our needs.

Highlighted by microli

rudimentary compression

Highlighted by microli

We chose zlib's speed over a significant improvement in compression offered by bzip.

Highlighted by microli

prefixed by docID, length, and URL

Highlighted by microli
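
The repository layout in these highlights (documents stored one after the other, each prefixed by docID, length, and URL, and compressed with zlib) can be sketched roughly as below. The field widths, the byte order, the extra URL-length field, and the function names are assumptions chosen for this sketch; the highlights do not give the exact on-disk format.

```python
import struct
import zlib

# Assumed header: 64-bit docID, 32-bit compressed length, 32-bit URL length.
HEADER = struct.Struct("<QII")


def pack_record(doc_id, url, html):
    """Serialize one document as header + URL + zlib-compressed body."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(body), len(url_bytes)) + url_bytes + body


def unpack_record(buf, offset=0):
    """Read one record at offset; return (doc_id, url, html, next_offset)."""
    doc_id, body_len, url_len = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    url = buf[start:start + url_len].decode("utf-8")
    body_start = start + url_len
    html = zlib.decompress(buf[body_start:body_start + body_len]).decode("utf-8")
    return doc_id, url, html, body_start + body_len


# Two records stored back to back, as in a sequential repository file.
repo = pack_record(1, "http://a.example/", "<html>A</html>")
repo += pack_record(2, "http://b.example/", "<html>B</html>")
first = unpack_record(repo)
second = unpack_record(repo, first[3])
```

Because every record carries its own length, a reader can walk the whole file sequentially without any other index, which fits the stated goal of avoiding disk seeks.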

a file which lists crawler errors

Highlighted by microli

PageRank can be thought of as a model of user behavior

Highlighted by glasswort

The information stored in each entry includes the current document status, a pointer into the repository,

Highlighted by microli

pointer points into the URLlist

Highlighted by microli

width file called docinfo which contains its URL and title.

Highlighted by microli

reasonably compact data structure

Highlighted by microli

In order to find the docID of a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file to find its docID.

Highlighted by microli
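
The URL-to-docID lookup described above (compute the URL's checksum, then binary-search a checksum file) might look like the following in Python. Using CRC-32 as the checksum, keeping the table in memory as a sorted list, and ignoring checksum collisions are all assumptions of this sketch; the highlight does not say which checksum the system uses.

```python
import bisect
import zlib


def url_checksum(url):
    """Assumed checksum function: CRC-32 of the UTF-8 encoded URL."""
    return zlib.crc32(url.encode("utf-8"))


def build_checksum_table(url_to_docid):
    """Return a list of (checksum, doc_id) pairs sorted by checksum."""
    return sorted((url_checksum(url), doc_id) for url, doc_id in url_to_docid.items())


def lookup_docid(table, url):
    """Binary-search the sorted table; return the docID or None if absent."""
    target = url_checksum(url)
    # (target,) sorts before any (target, doc_id), so this finds the first match.
    i = bisect.bisect_left(table, (target,))
    if i < len(table) and table[i][0] == target:
        return table[i][1]
    return None


table = build_checksum_table({"http://a.example/": 1, "http://b.example/": 2})
doc_id = lookup_docid(table, "http://b.example/")
```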

URLresolver uses to turn URLs into docIDs.

Highlighted by microli

The goal of searching is to provide quality search results efficiently

Highlighted by zeenko

The current lexicon contains 14 million words (though some rare words were not added to the lexicon).

Highlighted by microli

a hash table of pointers.

Highlighted by microli

Every hitlist includes position, font, and capitalization information

Highlighted by zeenko

factor in hits from anchor text and the PageRank of the document

Highlighted by zeenko

Hit lists account for most of the space used in both the forward and the inverted indices

Highlighted by microli

The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on

Highlighted by glasswort

URL, title, anchor text, or meta tag.

Highlighted by microli

For anchor hits, the 8 bits of position are split into 4 bits for position in anchor and 4 bits for a hash of the docID the anchor occurs in.

Highlighted by microli
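
The 4-bit/4-bit split of the position field for anchor hits can be illustrated with a little bit packing. Which half occupies the high bits, the clamping of large positions, and the use of the low four bits of the docID as the "hash" are assumptions of this sketch; the highlight only states how the 8 bits are divided.

```python
def pack_anchor_position(position_in_anchor, source_doc_id):
    """Pack an anchor hit's 8-bit position field.

    High 4 bits: position of the word within the anchor text (clamped to 15).
    Low 4 bits: a hash of the docID the anchor occurs in (here, its low 4 bits).
    """
    position = min(position_in_anchor, 0xF)
    doc_hash = source_doc_id & 0xF  # hypothetical hash function
    return (position << 4) | doc_hash


def unpack_anchor_position(field):
    """Return (position_in_anchor, doc_hash) from the 8-bit field."""
    return field >> 4, field & 0xF


field = pack_anchor_position(3, 0x2A)
position, doc_hash = unpack_anchor_position(field)
```

With only 4 bits per half, both values are lossy: positions past 15 collapse together and many source documents share a hash value.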

wordID in the forward index and the docID in the inverted index.

Highlighted by microli

Each barrel holds a range of wordID's.

Highlighted by microli

name servers

Highlighted by microli

web servers

Highlighted by microli

A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3).

Highlighted by microli

Python

Highlighted by microli

performance stress is DNS lookup

Highlighted by microli

looking up DNS, connecting to host, sending request, and receiving response.

Highlighted by microli

half a million servers

Highlighted by microli

"This page is copyrighted and should not be indexed"

Highlighted by microli

This resulted in lots of garbage messages

Highlighted by microli

immense variation

Highlighted by microli

hundreds of obscure problems

Highlighted by microli

very robust and carefully tested

Highlighted by microli

typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep,

Highlighted by microli

flex to generate a lexical analyzer which we outfit with its own stack.

Highlighted by microli

The biggest problem facing users of web search engines today is the quality of the results they get back

Highlighted by zeenko

Google is a research tool.

Highlighted by zeenko

Scan through the doclists

Highlighted by microli

looks at that document's hit list for that word.

Highlighted by microli

we have a user feedback mechanism

Highlighted by microli

This feedback is saved.

Highlighted by microli

Although far from perfect, this gives us some idea of how a change in the ranking function affects the search results.

Highlighted by microli

results are clustered by server.

Highlighted by microli

relied on anchor text

Highlighted by microli

heavy importance on the proximity of word occurrences.

Highlighted by microli

The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them

Highlighted by glasswort

It parses out all the links in every web page and stores important information about them in an anchors file.

Highlighted by glasswort

just over one third of the total data it stores.

Highlighted by microli

the total of all the data used by the search engine requires a comparable amount of storage, about 55 GB.

Highlighted by microli

short inverted index.

Highlighted by microli

the major operations are Crawling, Indexing, and Sorting.

Highlighted by microli

disks filled up, name servers crashed, or any number of other problems which stopped the system.

Highlighted by microli

4 million pages per day or 48.5

Highlighted by microli

These optimizations included bulk updates to the document index and placement of critical data structures on the local disk. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.

Highlighted by microli

optimizing the indexer

Highlighted by microli

most queries in between 1 and 10 seconds.

Highlighted by microli

Furthermore, Google does not have any optimizations such as query caching, subindices on common terms, and other common optimizations.

Highlighted by microli

Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.

Highlighted by glasswort

page rank, anchor text, and proximity information

Highlighted by microli

a complex system

Highlighted by microli

simple improvements to efficiency

Highlighted by microli

updates

Highlighted by microli

search databases

Highlighted by microli

proxy caches

Highlighted by microli

We are planning to add simple features

Highlighted by microli

relevance feedback and clustering

Highlighted by microli

extend the use of link structure and link text.

Highlighted by microli

PageRank can be personalized

Highlighted by microli

Google is designed to provide higher quality search so as the Web continues to grow rapidly, information can be found easily.

Highlighted by microli

The analysis of link structure via PageRank

Highlighted by microli

the use of proximity information

Highlighted by microli

The document index keeps information about each document.

Highlighted by glasswort

bottlenecks in CPU, memory access, memory capacity, disk seeks, disk throughput, disk capacity, and network IO.

Highlighted by microli

24 million pages, in less than one week.

Highlighted by microli

shown a number of limitations to queries about the Web that may be answered without having the Web available locally.

Highlighted by microli

A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information

Highlighted by glasswort

These include things like the crawlers, indexers, and sorters.

Highlighted by microli

100 million web pages we will be very close up against all sorts of operating system limits in the common operating systems (currently we run on both Solaris and Linux). These include things like addressable memory, number of open file descriptors, network sockets and bandwidth, and many others.

Highlighted by microli

would greatly increase the complexity

Highlighted by microli

Of course, distributed systems like Gloss [Gravano 94] or Harvest will often be the most efficient and elegant technical solution for indexing

Highlighted by microli

So we are optimistic that our centralized web search engine architecture will improve in its ability to cover the pertinent text information over time and that there is a bright future for search.

Highlighted by microli

A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document.

Highlighted by glasswort
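
A per-crawler DNS cache of the kind described above can be approximated in a few lines of Python. The class name CachingResolver and the injectable resolve function are illustrative assumptions, and the sketch never expires entries, which a real crawler would likely need to do.

```python
import socket


class CachingResolver:
    """Remember hostname lookups so each host is resolved at most once."""

    def __init__(self, resolve=socket.gethostbyname):
        self._resolve = resolve  # injectable so the cache can be tested offline
        self._cache = {}

    def lookup(self, host):
        if host not in self._cache:
            self._cache[host] = self._resolve(host)
        return self._cache[host]


# Offline demonstration with a stand-in resolver that records its calls.
calls = []


def fake_resolve(host):
    calls.append(host)
    return "192.0.2.1"  # address from the documentation-only TEST-NET-1 range


resolver = CachingResolver(resolve=fake_resolve)
resolver.lookup("example.com")
resolver.lookup("example.com")  # served from the cache, no second resolve
```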

For maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which we outfit with its own stack.

Highlighted by glasswort

Google maintains much more information about web documents than typical search engines. Every hitlist includes position, font, and capitalization information. Additionally, we factor in hits from anchor text and the PageRank of the document. Combining all of this information into a rank is difficult.

Highlighted by glasswort

Some simple improvements to efficiency include query caching, smart disk allocation, and subindices.

Highlighted by glasswort
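
Of the improvements listed above, query caching is the simplest to sketch. The version below wraps an arbitrary search function with Python's functools.lru_cache; the wrapper name make_cached_search, the cache size, and the stand-in search function are assumptions for illustration and say nothing about how Google actually cached queries.

```python
from functools import lru_cache


def make_cached_search(search, maxsize=1024):
    """Wrap a search function so repeated queries skip the index entirely."""

    @lru_cache(maxsize=maxsize)
    def cached(query):
        # Store an immutable tuple so cached results cannot be modified by callers.
        return tuple(search(query))

    return cached


# Stand-in search function that records how often it is really invoked.
calls = []


def slow_search(query):
    calls.append(query)
    return [query.upper()]


cached_search = make_cached_search(slow_search)
cached_search("bill clinton")
cached_search("bill clinton")  # second call is answered from the cache
```

A cache like this pays off only when the query stream repeats, and it has to be flushed whenever the index is rebuilt.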