blank
 
 
 

Please Do Not Read The Text Below.

Download the original file at

The anatomy of a large-scale hypertextual web search engine
L Page, S Brin - InTo Appear: Proceedings of the Seventh International Web …, 1998 - site.uottawa.ca
1 1 The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented by: Diem –
Hang Nguyen Ngoc Lawrence Chan 2 ... Designed to be a scalable search engine ...

We copied the scrambled text from pdf file (that's how it gets copied, sorry we had no control) so the keywords would be searchable.

1
1
The Anatomy of a Large-Scale
Hypertextual Web Search Engine
Presented by:
Diem – Hang Nguyen Ngoc
Lawrence Chan
2
1. Introduction
�� Authors: Lawrence Page, Sergey Brin
�� Started small at Stanford
�� Google index grew to over 1 billion pages in June 2000
�� At the time, Google served an average of 18 million queries/day
�� Google is now employing over 1900 worldwide
�� Traditional keyword searches not very accurate
�� Google makes use of html document structure
3
1. Introduction (cont’d)
�� Information is growing at a tremendous rate:
�� Scaling up to internet
�� Design Goals of Google:
�� Return high quality Results
�� Support Research
4
2. System Features
�� PageRank1: Bringing Order to the Web
(Maps contains 518 million of hyperlinks)
�� Description of PageRank calculation
�� Intuitive Justification
�� Anchor Text
�� Other Features
1 Demo available at google.stanford.edu
5
Description of PageRank
A = given page
T1 … Tn = pages that point to page A (i.e. citations)
d = damping factor which can be between 0 and 1
(usually we set d = 0.85)
C(A) = number of links going out of page A
PR(A) = the PageRank of a page A
PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
NOTE: the sum of all web pages’ PageRank = 1
6
Intuitive Justification
�� Assume there is a “random surfer”
�� Probability that random surfer visits a page is its PR
�� d is the probability at each page the “random surfer”
will get bored
�� Only add d to a single page or a group of page
�� If there are many pages that point to it
�� If there are some pages that point to it and have a
high PageRank
2
7
Anchor Text
�� Associate the text of a link with the page that
the link is on and the page the link points to
�� Advantages:
�� Anchors often provide more accurate description
�� Anchors may exist for documents which cannot be
indexed (I.e. image, programs, and database)
�� Propagating anchor text helps provide better
quality results
�� 24 million pages �� over 259 million anchors
8
Other Features
�� Location information for all hits
�� Keep track of some visual presentation details
(i.e. font size of words)
�� Full raw HTML of pages is available in a
repository
9
3. Related Work
�� Information Retrieval
�� In the past, focused largely on scientific stories, articles, etc.
�� Ex. Assignment 1 (Vector Space Model)
�� Differences between web and controlled collections
�� The web contains varying content (html, doc, PDF, etc.)
�� No control of web content
10
4. System Anatomy
�� Provide a high level discussion of the
architecture
�� Some in-depth description of important data
structure
�� The major applications:
�� crawling
�� indexing
�� searching
11
Google Architecture overview
URL Server
crawlers Store Server
Repository
Indexer
Anchors
URL Resolver
Links
Doc
index
Lexicon
PageRank
Searcher
Barrels
Sorters
Figure 4.1: High Level Google Architecture
12
Major Data Structure
�� BigFiles
�� Virtual files are addressable by 64 bit integers
�� Support rudimentary compression options
�� Repository
�� Contains full HTML of every web page
�� Compression rate of zlib is 3 to 1 compare to bzib is 4 to 1
�� docID, length, URL
�� Document Index
�� Keep info of each document (docID, fixed width ISAM index)
�� Current doc status, a pointer into the repository, a doc
checksum, and various statistics
3
13
Major Data .. (cont’d)
�� Lexicon
�� Repository of words
�� Implemented with a list and a hash table of pointers
�� Hit Lists
�� Stores occurrences of a particular word in a particular
document
�� Types of hits: Plain, Fancy and anchor (2 bytes per hit)
Lexicon
14
Major Data .. (cont’d)
�� Forward Index
�� Barrel holds a range of word ids
�� Doc id and a list of word ids
�� Inverted Index
�� Similar to Assignment 1 inverted index
�� Can be sorted by doc id or by ranking of word occurrence
Barrels
15
4.3 Crawling the Web
�� Crawlers need to be reliable, fast and robust
�� Some authors don’t want their pages to be crawled
�� Google usually has about 3-4 crawlers running
�� At peek speeds, the 4 crawlers can process 100 web pages/sec.
�� Crawlers have different states: DNS lookup, connecting to host,
send request, receiving response
�� Crawlers and URL server are implemented in Python
16
4.4 Indexing the Web
�� Parsing
�� Parsers need to handle errors very well (typos, formatting,
etc.)
�� Indexing
�� After parsing, placed into forward barrels
�� Words converted into word id and occurrences into hit lists
�� Sorting
�� Forward barrels are sorted by word id to produce an
inverted index
17
4.5 Searching
�� Goal of searching: quality search results efficiently
�� Ranking system:
�� Every hit list includes position, font and capitalization info
�� Consider each hit to be one of several different type and
each of which has it’s own type-weight
�� Feedback:
�� Many parameters: type-weight, type-prox-weight
�� User feedback mechanism
18
5. Results
and Performance
�� Storage Requirements
�� Google compresses it’s repository to about 53GB
(24 million pages)
�� Additional storage used by Google for indexes,
temporary storage, lexicon, etc. requires about
55GB.
4
19
System Performance
�� Crawling & Indexing efficiently
�� Crawling may take a week or more
�� To improve efficiency, the Indexer and Crawlers
need to be capable of simultaneous execution.
�� Sorting needs to be done in parallel on several
machines
20
Search Performance
�� Disk I/O – Requires most time (Seek time, transfer
time, network delay, etc.)
�� Caching – Reduces the number of disk access
�� Performance times can improve from 2.13 sec. to
0.06 sec.
21
6. Overview
�� Designed to be a scalable search engine
�� Provide high quality search
�� Complete architecture for gathering web
pages, indexing them, and performing search
queries over them
22
Future Work
�� Improve search efficiency (i.e. query caching,
smart disk allocation, subindices)
�� Scale to ~ 100 million web pages
�� Smart algorithms
�� Add simple features (i.e. Boolean operators,
negation, stemming)
�� Relevance feedback and clustering
�� Support user context, result summarization
23
High Quality Search
�� Heavy use of hypertextual info
�� Link structure
�� Link (anchor) text
�� Proximity and font info �� increase relevance
�� PageRank �� quality
�� Link text �� relevant
24
Scalable Architecture
�� Efficient in both space and time
�� Bottlenecks in:
�� CPU
�� Memory capacity
�� Disk seeks
�� Disk throughput
�� Disk capacity
�� Network IO
�� Major data structure make efficient use of available
storage space (24 million pages in > 1 week)
�� Build an index of 100 million pages > 1 month
5
25
Research Tool
�� Google is a research tool
�� Google is a necessary one for a wide range of
applications
�� Resource for searchers and researchers all
around the globe
�� Spark the next generation of search engine
technology

 

 

  © 2002-2004   Home Page ; Iconocast offers eMarketing, Internet Advertising, Online Advertising, Internet Marketing, Online Branding, and eMarketing News Services.