Branding Laws · Internet Marketing · eMarketing · Internet Advertising · Online Branding |
| |
The anatomy of a large-scale hypertextual web search engine
L Page, S Brin - InTo Appear: Proceedings of the Seventh International Web …, 1998 - site.uottawa.ca
1 1 The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented by: Diem –
Hang Nguyen Ngoc Lawrence Chan 2 ... Designed to be a scalable search engine ...
1
1
The Anatomy of a Large-Scale
Hypertextual Web Search Engine
Presented by:
Diem – Hang Nguyen Ngoc
Lawrence Chan
2
1. Introduction
Authors: Lawrence Page, Sergey Brin
Started small at Stanford
Google index grew to over 1 billion pages in June 2000
At the time, Google served an average of 18 million queries/day
Google is now employing over 1900 worldwide
Traditional keyword searches not very accurate
Google makes use of html document structure
3
1. Introduction (cont’d)
Information is growing at a tremendous rate:
Scaling up to internet
Design Goals of Google:
Return high quality Results
Support Research
4
2. System Features
PageRank1: Bringing Order to the Web
(Maps contains 518 million of hyperlinks)
Description of PageRank calculation
Intuitive Justification
Anchor Text
Other Features
1 Demo available at google.stanford.edu
5
Description of PageRank
A = given page
T1 … Tn = pages that point to page A (i.e. citations)
d = damping factor which can be between 0 and 1
(usually we set d = 0.85)
C(A) = number of links going out of page A
PR(A) = the PageRank of a page A
PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
NOTE: the sum of all web pages’ PageRank = 1
6
Intuitive Justification
Assume there is a “random surfer”
Probability that random surfer visits a page is its PR
d is the probability at each page the “random surfer”
will get bored
Only add d to a single page or a group of page
If there are many pages that point to it
If there are some pages that point to it and have a
high PageRank
2
7
Anchor Text
Associate the text of a link with the page that
the link is on and the page the link points to
Advantages:
Anchors often provide more accurate description
Anchors may exist for documents which cannot be
indexed (I.e. image, programs, and database)
Propagating anchor text helps provide better
quality results
24 million pages over 259 million anchors
8
Other Features
Location information for all hits
Keep track of some visual presentation details
(i.e. font size of words)
Full raw HTML of pages is available in a
repository
9
3. Related Work
Information Retrieval
In the past, focused largely on scientific stories, articles, etc.
Ex. Assignment 1 (Vector Space Model)
Differences between web and controlled collections
The web contains varying content (html, doc, PDF, etc.)
No control of web content
10
4. System Anatomy
Provide a high level discussion of the
architecture
Some in-depth description of important data
structure
The major applications:
crawling
indexing
searching
11
Google Architecture overview
URL Server
crawlers Store Server
Repository
Indexer
Anchors
URL Resolver
Links
Doc
index
Lexicon
PageRank
Searcher
Barrels
Sorters
Figure 4.1: High Level Google Architecture
12
Major Data Structure
BigFiles
Virtual files are addressable by 64 bit integers
Support rudimentary compression options
Repository
Contains full HTML of every web page
Compression rate of zlib is 3 to 1 compare to bzib is 4 to 1
docID, length, URL
Document Index
Keep info of each document (docID, fixed width ISAM index)
Current doc status, a pointer into the repository, a doc
checksum, and various statistics
3
13
Major Data .. (cont’d)
Lexicon
Repository of words
Implemented with a list and a hash table of pointers
Hit Lists
Stores occurrences of a particular word in a particular
document
Types of hits: Plain, Fancy and anchor (2 bytes per hit)
Lexicon
14
Major Data .. (cont’d)
Forward Index
Barrel holds a range of word ids
Doc id and a list of word ids
Inverted Index
Similar to Assignment 1 inverted index
Can be sorted by doc id or by ranking of word occurrence
Barrels
15
4.3 Crawling the Web
Crawlers need to be reliable, fast and robust
Some authors don’t want their pages to be crawled
Google usually has about 3-4 crawlers running
At peek speeds, the 4 crawlers can process 100 web pages/sec.
Crawlers have different states: DNS lookup, connecting to host,
send request, receiving response
Crawlers and URL server are implemented in Python
16
4.4 Indexing the Web
Parsing
Parsers need to handle errors very well (typos, formatting,
etc.)
Indexing
After parsing, placed into forward barrels
Words converted into word id and occurrences into hit lists
Sorting
Forward barrels are sorted by word id to produce an
inverted index
17
4.5 Searching
Goal of searching: quality search results efficiently
Ranking system:
Every hit list includes position, font and capitalization info
Consider each hit to be one of several different type and
each of which has it’s own type-weight
Feedback:
Many parameters: type-weight, type-prox-weight
User feedback mechanism
18
5. Results
and Performance
Storage Requirements
Google compresses it’s repository to about 53GB
(24 million pages)
Additional storage used by Google for indexes,
temporary storage, lexicon, etc. requires about
55GB.
4
19
System Performance
Crawling & Indexing efficiently
Crawling may take a week or more
To improve efficiency, the Indexer and Crawlers
need to be capable of simultaneous execution.
Sorting needs to be done in parallel on several
machines
20
Search Performance
Disk I/O – Requires most time (Seek time, transfer
time, network delay, etc.)
Caching – Reduces the number of disk access
Performance times can improve from 2.13 sec. to
0.06 sec.
21
6. Overview
Designed to be a scalable search engine
Provide high quality search
Complete architecture for gathering web
pages, indexing them, and performing search
queries over them
22
Future Work
Improve search efficiency (i.e. query caching,
smart disk allocation, subindices)
Scale to ~ 100 million web pages
Smart algorithms
Add simple features (i.e. Boolean operators,
negation, stemming)
Relevance feedback and clustering
Support user context, result summarization
23
High Quality Search
Heavy use of hypertextual info
Link structure
Link (anchor) text
Proximity and font info increase relevance
PageRank quality
Link text relevant
24
Scalable Architecture
Efficient in both space and time
Bottlenecks in:
CPU
Memory capacity
Disk seeks
Disk throughput
Disk capacity
Network IO
Major data structure make efficient use of available
storage space (24 million pages in > 1 week)
Build an index of 100 million pages > 1 month
5
25
Research Tool
Google is a research tool
Google is a necessary one for a wide range of
applications
Resource for searchers and researchers all
around the globe
Spark the next generation of search engine
technology