Branding Laws · Internet Marketing · eMarketing · Internet Advertising · Online Branding |
| |
Building Domain-Specific Search Engines with Machine Learning Techniques
Building Domain-Specic Search Engines with
Machine Learning Techniques
Andrew McCallumzy
mccallum@justresearch.com
Kamal Nigamy
knigam@cs.cmu.edu
Jason Renniey
jr6b@andrew.cmu.edu
Kristie Seymorey
kseymore@ri.cmu.edu
zJust Research
4616 Henry Street
Pittsburgh, PA 15213
ySchool of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
Abstract
Domain-specic search engines are growing in popu-
larity because they oer increased accuracy and extra
functionality not possible with the general, Web-wide
search engines. For example, www.campsearch.com
allows complex queries by age-group, size, location
and cost over summer camps. Unfortunately these
domain-specic search engines are dicult and time-
consuming to maintain. This paper proposes the
use of machine learning techniques to greatly auto-
mate the creation and maintenance of domain-specic
search engines. We describe new research in rein-
forcement learning, information extraction and text
classication that enables ecient spidering, iden-
tifying informative text segments, and populating
topic hierarchies. Using these techniques, we have
built a demonstration system: a search engine for
computer science research papers. It already con-
tains over 50,000 papers and is publicly available at
www.cora.justresearch.com.
1 Introduction
As the amount of information on the World Wide
Web grows, it becomes increasingly dicult to nd
just what we want. While general-purpose search en-
gines, such as Altavista and HotBot oer high coverage,
they often provide only low precision, even for detailed
queries.
When we know that we want information of a certain
type, or on a certain topic, a domain-specic search
engine can be a powerful tool. For example:
www.campsearch.com allows the user to search for
summer camps for children and adults. The user
can query the system based on geographic location,
cost, duration and other requirements.
www.netpart.com lets the user search over company
pages by hostname, company name, and location.
www.mrqe.com allows the user to search for reviews
of movies. Type a movie title, and it provides links
to relevant reviews from newspapers, magazines, and
individuals from all over the world.
www.maths.usyd.edu.au/MathSearch.html lets the
user search web pages about mathematics.
www.travel-nder.com allows the user to search web
pages about travel, with special facilities for search-
ing by activity, category and location.
Performing any of these searches with a traditional,
general-purpose search engine would be extremely te-
dious or impossible. For this reason, domain-specic
search engines are becoming increasingly popular. Un-
fortunately, however, building these search engines is a
labor-intensive process, typically requiring signicant
and ongoing human eort.
This paper describes the Ra Project|an eort to
automate many aspects of creating and maintaining
domain-specic search engines by using machine learn-
ing techniques. These techniques allow search en-
gines to be created quickly with minimal eort and
are suited for re-use across many domains. This pa-
per presents machine learning methods for spidering
in an ecient topic-directed manner, extracting topic-
relevant substrings, and building a browseable topic
hierarchy. These approaches are brie
y described in
the following three paragraphs.
Every search engine must begin with a collection of
documents to index. A spider (or \crawler") is an
agent that traverses the Web, looking for documents
to add to the search engine. When aiming to popu-
late a domain-specic search engine, the spider need
not explore the Web indiscriminantly, but should ex-
plore in a directed fashion in order to nd domain-
relevant documents eciently. We frame the spidering
task in a reinforcement learning framework (Kaelbling,
Littman, & Moore 1996), allowing us to precisely and
mathematically dene \optimal behavior." This ap-
proach provides guidance for designing an intelligent
spider that aims to select hyperlinks optimally. Our
experimental results show that a reinforcement learn-
ing spider is three times more ecient than a spider
with a breadth-rst search strategy.
Extracting characteristic pieces of information from
the documents of a domain-specic search engine al-
lows the user to search over these features in a way
that general search engines cannot. Information ex-
traction, the process of automatically nding specic
textual substrings in a document, is well suited to this
task. We approach information extraction with a tech-
nique from statistical language modeling and speech
recognition, namely hidden Markov models (Rabiner
1989). Our initial algorithm extracts elds such as the
title, authors, institution, and journal name from re-
search paper reference sections with 93% accuracy.
Search engines often provide a hierarchical organiza-
tion of materials into relevant topics; Yahoo is the pro-
totypical example. Automatically adding documents
into a topic hierarchy can be framed as a text classi-
cation task. We present extensions to a probabilis-
tic text classier known as naive Bayes (Lewis 1998;
McCallum & Nigam 1998) that succeed in this task
without requiring large sets of labeled training data.
The extensions reduce the need for human eort in
training the classier by (1) using keyword match-
ing to automatically assign approximate labels, (2) us-
ing a statistical technique called shrinkage that nds
more robust parameter estimates by taking advantage
of the hierarchy, and (3) increasing accuracy further
by iterating Expectation-Maximization to probabilisti-
cally reassign approximate labels and incorporate un-
labeled data. Use of the resulting algorithms places
documents into a 70-leaf computer science hierarchy
with 66% accuracy|performance approaching human
agreement levels.
2 The Cora Search Engine
We have brought all the above-described machine
learning techniques together in a demonstration sys-
tem: a domain-specic search engine on computer sci-
ence research papers named Cora. The system is pub-
licly available at www.cora.justresearch.com. Not only
does it provide keyword search facilities over 50,000
collected papers, it also places these papers into a com-
puter science topic hierarchy, maps the citation links
between papers, and provides bibliographic informa-
tion about each paper. Our hope is that in addition
to providing a platform for testing machine learning
research, this search engine will become a valuable
tool for other computer scientists, and will complement
similar eorts, such as the Computing Research Repos-
itory (xxx.lanl.gov/archive/cs), by providing function-
ality and coverage not available online elsewhere.
The construction of a search engine can be decom-
posed into three functional stages: collecting new in-
formation, collating and extracting from that informa-
tion, and presenting it in a publicly-available web in-
terface. Cora implements each stage by drawing upon
machine learning techniques described in this paper.
The rst stage is the collection of computer science
research papers. A spider crawls the Web, starting
from the home pages of computer science departments
and laboratories. Using reinforcement learning, it e-
ciently explores the Web, collecting all postscript doc-
uments it nds. Nearly all computer science papers
are in postscript format, though we are adding more
formats, such as PDF. These postscript documents are
then converted into plain text. If the document can be
reliably determined to have the format of a research pa-
per (e.g. by having Abstract and Reference sections),
it is added to Cora. Using this system, we have found
50,000 computer science research papers, and are con-
tinuing to spider for even more.
The second stage of building a search engine is to
extract relevant knowledge from each paper. To this
end, the beginning of each paper (up to the abstract)
is passed through an information extraction system
that automatically nds the title, author, institution
and other important header information. Addition-
ally, the bibliography section of each paper is located,
individual references identied, and each reference bro-
ken down into the appropriate elds, such as author,
title, journal, and date. Using this extracted informa-
tion, reference and paper matches are made|grouping
citations to the same paper together, and matching ci-
tations to papers in Cora. Of course, many papers that
are cited do not appear in the repository. This match-
ing procedure is similar to one described by Bollacker,
Lawrence, & Giles (1998), except that we use addi-
tional eld-level constraints provided by knowing, for
example, the title and authors of each paper.
The third stage is to provide a publicly-available user
interface. We have implemented two methods for nd-
ing papers. First, a search engine over all the papers is
provided. It supports commonly-used searching syn-
tax for queries, including +, -, and phrase searching
with "", and ranks resulting matches by the weighted
log of term frequency, summed over all query terms. It
also allows searches restricted to extracted elds, such
as authors and titles. Query response time is usually
less than a second. The results of search queries are
presented as in Figure 1. Additionally, each individual
paper has a \details" page that shows all the relevant
information, such as title and authors, links to the ac-
tual postscript paper, and a citation map that can be
traversed either forwards or backwards. One example
Figure 1: A screen shot of the query results page of the Cora
search engine (www.cora.justresearch.com). Extracted pa-
per titles, authors and abstracts are provided at this level.
of this is shown in Figure 2. We also provide auto-
matically constructed BibTeX entries, general Cora in-
formation links, and a mechanism for submitting new
papers and web sites for spidering.
The other user interface access method is through a
topic hierarchy, similar to that provided by Yahoo, but
customized specically for computer science research.
This hierarchy was hand-constructed, and contains 70
leaves, varying in depth from one to three. Using text
classication techniques, each research paper is auto-
matically placed into a topic node. By following hy-
perlinks to traverse the topic hierarchy, the most-cited
papers in each research topic can be found.
3 Ecient Spidering
Spiders are agents that explore the hyperlink graph of
the Web, often for the purpose of nding documents
with which to populate a search engine. Extensive spi-
dering is the key to obtaining high coverage by the
major Web search engines, such as AltaVista and Hot-
Bot. Since the goal of these general-purpose search
engines is to provide search capabilities over the Web
as a whole, for the most part they simply aim to nd
as many distinct web pages as possible. Such a goal
lends itself to strategies like breadth-rst search. If,
on the other hand, the task is to populate a domain-
specic search engine, then an intelligent spider should
Figure 2: A screen shot of a details page of the Cora search
engine. At this level, all extracted information about a
paper is displayed, including the citation linking, which are
hyperlinks to other details pages.
try to avoid hyperlinks that lead to o-topic areas, and
concentrate on links that lead to documents of interest.
In Cora ecient spidering is a major concern. The
majority of the pages in many computer science de-
partment web sites do not contain links to research
papers, but instead are about courses, homework,
schedules and admissions information. Avoiding whole
branches and neighborhoods of departmental web
graphs can signicantly improve eciency and increase
the number of research papers found given a nite
amount of crawling time. We use reinforcement learn-
ing to perform ecient spidering.
Several other systems have also studied spidering,
but without a framework dening optimal behavior.
Arachnid (Menczer 1997) maintains a collection of
competitive, reproducing and mutating agents for nd-
ing information on the Web. Cho, Garcia-Molina, &
Page (1998) suggest a number of heuristic ordering
metrics for choosing which link to crawl next when
searching for certain categories of web pages. Ad-
ditionally, there are systems that use reinforcement
learning for non-spidering Web tasks. WebWatcher
(Joachims, Freitag, & Mitchell 1997) is a browsing as-
sistant that uses a combination of supervised and re-
inforcement learning to help a user nd information
by recommending which hyperlinks to follow. Laser
uses reinforcement learning to tune the parameters of
a search engine (Boyan, Freitag, & Joachims 1996).
3.1 Reinforcement Learning
In machine learning, the term \reinforcement learn-
ing" refers to a framework for learning optimal deci-
sion making from rewards or punishment (Kaelbling,
Littman, & Moore 1996). It diers from supervised
learning in that the learner is never told the correct
action for a particular state, but is simply told how
good or bad the selected action was, expressed in the
form of a scalar \reward."
A task is dened by a set of states, s 2 S, a set
of actions, a 2 A, a state-action transition function,
T : SA ! S, and a reward function, R : SA!<.
At each time step, the learner (also called the agent)
selects an action, and then as a result is given a reward
and its new state. The goal of reinforcement learning
is to learn a policy, a mapping from states to actions,
: S !A, that maximizes the sum of its reward over
time. The most common formulation of \reward over
time" is a discounted sum of rewards into an innite
future. A discount factor,
, 0
< 1, expresses
\in
ation," making sooner rewards more valuable than
later rewards. Accordingly, when following policy ,
we can dene the value of each state to be:
V (s) =
1
X
t=0
trt; (1)
where rt is the reward received t time steps after start-
ing in state s. The optimal policy, written ?, is the
one that maximizes the value, V (s), for all states s.
In order to learn the optimal policy, we learn its
value function, V ?, and its more specic correlate,
called Q. Let Q?(s; a) be the value of selecting action
a from state s, and thereafter following the optimal
policy. This is expressed as:
Q?(s; a) = R(s; a) +
V ?(T(s; a)): (2)
We can now dene the optimal policy in terms of Q
by selecting from each state the action with the high-
est expected future reward: ?(s) = arg maxa Q?(s; a).
The seminal work by Bellman (1957) shows that the
optimal policy can be found straightforwardly by dy-
namic programming.
3.2 Spidering as Reinforcement Learning
As an aid to understanding how reinforcement learn-
ing relates to spidering, consider the common reinforce-
ment learning task of a mouse exploring a maze to nd
several pieces of cheese. The agent's actions are mov-
ing among the grid squares of the maze. The agent
receives a reward for nding each piece of cheese. The
state is the position of the mouse and the locations
of the cheese pieces remaining to be consumed (since
the cheese can only be consumed and provide reward
once). Note that the agent only receives immediate re-
ward for nding a maze square containing cheese, but
that in order to act optimally it must choose actions
considering future rewards as well.
In the spidering task, the on-topic documents are
immediate rewards, like the pieces of cheese. The ac-
tions are following a particular hyperlink. The state
is the bit vector indicating which on-topic documents
remain to be consumed. The state does not include
the current \position" of the agent since a crawler can
go to any URL next. The number of actions is large
and dynamic, in that it depends on which documents
the spider has visited so far.
The key features of topic-specic spidering that
make reinforcement learning the proper framework for
dening the optimal solution are: (1) performance is
measured in terms of reward over time, and (2) the
environment presents situations with delayed reward.
3.3 Practical Approximations
The problem now is how to apply reinforcement learn-
ing to spidering in such a way that it can be practically
solved. Unfortunately, the state space is huge: two to
the power of the number of on-topic documents on the
Web. The action space is also large: the number of
unique URLs with incoming links on theWeb. Thus we
need to make some simplifying assumptions in order to
make the problem tractable and to aid generalization.
Note, however, that by dening the exact solution in
terms of the optimal policy, and making our assump-
tions explicit, we will better understand what inaccura-
cies we have introduced, and how to select areas of fu-
ture work that will improve performance further. The
assumptions we choose initially are the following two:
(1) we assume that the state is independent of which
on-topic documents have already been consumed; that
is, we collapse all states into one, and (2) we assume
that the relevant distinctions between the actions can
be captured by the words in the neighborhood of the
hyperlink corresponding to each action.
Thus our Q function becomes a mapping from
a \bag-of-words" to a scalar (sum of future re-
ward). Learning to perform ecient spidering then in-
volves only two remaining sub-problems: (1) gathering
training data consisting of bag-of-words/future-reward
pairs, and (2) learning a mapping using the training
data.
There are several choices for how to gather training
data. Although the agent could learn from experience
on-line, we currently train the agent o-line, using col-
lections of already-found documents and hyperlinks.
In the vocabulary of traditional reinforcement learn-
ing, this means that the state transition function, T,
and the reward function, R, are known, and we learn
the Q function by dynamic programming in the origi-
nal, uncollapsed state space.
We represent the mapping using a collection of naive
Bayes text classiers (see Section 5.2). We perform the
mapping by casting this regression problem as classi-
cation (Torgo & Gama 1997). We discretize the dis-
counted sum of future reward values of our training
data into bins, place the hyperlinks into the bin corre-
sponding to their Q values by dynamic programming,
and use the hyperlinks' neighborhood text as train-
ing data for a naive Bayes text classier. We dene
a hyperlink's neighborhood to be two bags-of-words:
1) the full text of the page on which the hyperlink is
located, and 2) the anchor text of the hyperlink and
portions of the URL.1 For each hyperlink, we calculate
the probabilistic class membership of each bin. Then
the reward value of a hyperlink is estimated by taking
a weighted average of each bins' reward value, using
the probabilistic class memberships as weights.
3.4 Data and Experimental Results
In August 1998 we completely mapped the docu-
ments and hyperlinks of the web sites of computer sci-
ence departments at Brown University, Cornell Univer-
sity, University of Pittsburgh and University of Texas.
They include 53,012 documents and 592,216 hyper-
links. We perform a series of four test/train splits,
in which the data from three universities was used to
train a spider that then is tested on the fourth. The
target pages (for which a reward of 1 is given) are
computer science research papers. They are identi-
ed with very high precision by the simple hand-coded
algorithm mentioned in Section 2.
We present results of two dierent reinforcement
learning spiders and compare them to breadth-rst
search. Immediate uses
= 0, utilizing only immediate
reward in its assignment of hyperlink values. This em-
ploys a binary classier that distinguishes links that
do or do not point directly to a research paper. Fu-
ture uses
= 0:5 and represents the Q-function with a
more nely-discriminating 10-bin classier that makes
use of future reward.
Spiders trained in this fashion are evaluated on each
test/train split by spidering the test university. Fig-
ure 3 plots the number of research papers found over
the course of all the pages visited, averaged over all
four universities. Notice that at all times during their
progress, the reinforcement learning spiders have found
1We have found that performance does not improve
when a more restricted set of neighborhood text is chosen.
0
10
20
30
40
50
60
70
80
90
100
0 10 20 30 40 50 60 70 80 90 100
Percent Research Papers Found
Percent Hyperlinks Followed
Spidering CS Departments
RL Immediate
RL Future
Breadth-First
Figure 3: The performance of reinforcement learning spi-
dering versus traditional breadth-rst search, averaged over
four test/train splits with data from four universities. The
reinforcement learning spiders nd target documents sig-
nicantly faster than the traditional method.
more research papers than Breadth-rst.
One measure of performance is the number of hy-
perlinks followed before 75% of the research papers
are found. Reinforcement learning performs signi-
cantly more eciently, requiring exploration of only
16% of the hyperlinks; in comparison Breadth-rst re-
quires 48%. This represents a factor of three increase
in spidering eciency.
Note also that the Future reinforcement learning spi-
der performs better than the Immediate spider in the
beginning, when future reward must be used to cor-
rectly select among alternative branches, none of which
give immediate reward. On average, the Immediate spi-
der takes nearly three times as long as Future to nd
the rst 28 (5%) papers.
In Figure 3, after the rst 50% of the papers are
found, the Immediate spider performs slightly better
than the Future spider. This is because the system has
uncovered many links that will give immediate reward
if followed, and the Immediate spider recognizes them
more accurately. In ongoing work we are investigating
techniques for improving classication with the larger
number of bins required for regression with future re-
ward. We believe that adding features based on the
HTML structure around a hyperlink (headers, titles,
and neighboring pages) will improve classication and
thus regression.
We are also currently applying the Future spider to
other tasks where rewards are more sparse, and thus
modeling future reward is more crucial. For example,
information about a company's corporate ocers is of-
ten contained on a single web page in the company's
web site; here there is a single reward. Our prelimi-
nary experiments show that our current Future spider
performs signicantly better than the Immediate spider
on these common tasks.
4 Information Extraction
Information extraction is concerned with identifying
phrases of interest in textual data. For many applica-
tions, extracting items such as names, places, events,
dates, and prices is a powerful way to summarize the
information relevant to a user's needs. In the case of
a domain-specic search engine, the automatic identi-
cation of important information can increase the ac-
curacy and eciency of a directed search.
We use hidden Markov models (HMMs) to extract
the elds relevant to research papers, such as title, au-
thor, journal and publication date. The extracted text
segments are used (1) to allow searches over specic
elds, (2) to provide useful eective presentation of
search results (e.g. showing title in bold), and (3) to
match references to papers. Our interest in HMMs for
information extraction is particularly focused on learn-
ing the state and transition structure of the models
from training data.
4.1 Hidden Markov Models
Hidden Markov models, widely used for speech recog-
nition and part-of-speech tagging (Rabiner 1989; Char-
niak 1993), provide a natural framework for modeling
the production of the headers and reference sections
of research papers. Discrete output, rst-order HMMs
are composed of a set of states Q, with specied ini-
tial and nal states qI and qF , a set of transitions be-
tween states (q ! q0), and a discrete vocabulary of out-
put symbols = 12 : : : M. The model generates a
string x = x1x2 : : : xl by beginning in the initial state,
transitioning to a new state, emitting an output sym-
bol, transitioning to another state, emitting another
symbol, and so on, until a transition is made into the
nal state. The parameters of the model are the tran-
sition probabilities P(q ! q0) that one state follows
another and the emission probabilities P(q " ) that
a state emits a particular output symbol. The prob-
ability of a string x being emitted by an HMM M is
computed as a sum over all possible paths by:
P(xjM) = X
q1;:::;ql2Ql
l+1
Y
k=1
P(qk1 ! qk)P(qk " xk); (3)
where q0 and ql+1 are restricted to be qI and qF respec-
tively, and xl+1 is an end-of-string token. The observ-
able output of the system is the sequence of symbols
that the states emit, but the underlying state sequence
itself is hidden. One common goal of learning prob-
lems that use HMMs is to recover the state sequence
V (xjM) that has the highest probability of having pro-
duced an observation sequence:
V (xjM)= arg max
q1:::ql2Ql
l+1
Y
k=1
P(qk1 ! qk)P(qk " xk): (4)
Fortunately, there is an ecient algorithm, called the
Viterbi algorithm (Viterbi 1967), that eciently recov-
ers this state sequence.
HMMs may be used for information extraction from
research papers by formulating a model in the follow-
ing way: each state is associated with a eld class
that we want to extract, such as title, author or in-
stitution. Each state emits words from a class-specic
unigram distribution. We can learn the class-specic
unigram distributions and the transition probabilities
from data. In our case, we collect BibTeX les from
the Web with reference classes explicitly labeled, and
use the text from each class as training data for the ap-
propriate unigram model. Transitions between states
are estimated directly from a labeled training set, since
BibTeX data does not contain this information. In or-
der to label new text with classes, we treat the words
from the new text as observations and recover the
most-likely state sequence with the Viterbi algorithm.
The state that produces each word is the class tag for
that word.
HMMs have been used in other systems for infor-
mation extraction and the closely related problems of
topic detection and text segmentation. Leek (1997)
uses hidden Markov models to extract information
about gene names and locations from scientic ab-
stracts. The Nymble system (Bikel et al. 1997) deals
with named-entity extraction, and a system by Yam-
ron et al. (1998) uses an HMM for topic detection and
tracking. Unlike our work, these systems do not con-
sider automatically determining model structure from
data; they either use one state per class, or use hand-
built models assembled by inspecting training exam-
ples.
4.2 Experiments
Our experiments on reference extraction are based on
ve hundred references that were selected at random
from a set of 500 research papers. The words in each
of the 500 references were manually tagged with one
of the following 13 classes: title, author, institution,
location, note, editor, publisher, date, pages, volume,
journal, booktitle, and technical report. The tagged
references were split into a 300-instance, 6995 word to-
ken training set and a 200-instance, 4479 word token
test set. Unigram language models were built for each
of the thirteen classes from almost 2 million words of
BibTeX data acquired from the Web, and were based
volume
editor
institution
title
journal
booktitle
tech
date
institution
author
0.85
0.03
0.86
0.14
0.80
0.20
0.80
0.20
0.11
0.89
1.0
0.33
0.03 0.66
0.03
0.03
0.03 0.91
0.09
0.18
0.37
0.82
Figure 4: An example HMM built from only ve labeled ref-
erences after merging neighbors and collapsing V-neighbors
in the forward and backward directions. Note that the
structure is close to many reference section formats.
on a 44,000 word vocabulary. Each HMM state is as-
sociated with one class label, and uses the appropriate
unigram distribution to provide its emission probabil-
ities. Emission distributions are not re-estimated dur-
ing the training process.
We examine the potential of learning model struc-
ture from data by comparing four dierent HMMs.
The rst two models are fully-connected HMMs where
each class is represented by a single state. In the rst
model (HMM-0), the transitions out of each state re-
ceive equal probability. Finding the most likely path
through this model for an observation sequence is
equivalent to consulting each unigram model for each
test set word, and setting each word's class to the
class of the unigram model that produces the highest
probability. The transition probabilities for the second
model (HMM-1) are set to the maximum likelihood es-
timates from the labeled training data. A smoothing
count of 1 is added to all transitions to avoid non-zero
probabilities.
Next, an HMM is built where each word token in the
training set is assigned a single state that only transi-
tions to the state that follows it. Each state is associ-
ated with the class label of its word token. From the
initial state, there are 300 equiprobable transitions into
sequences of states, where each sequence represents the
tags for one of the 300 training references. This model
consists of 6997 states, and is maximally specic in
that its transitions exactly explain the training data.
This HMM is put through a series of state merges in
order to generalize the model. First, \neighbor merg-
ing" combines all states that share a unique transi-
Accuracy
Model # states Any word Punc word
HMM-0 13 59.2 80.8
HMM-1 13 91.5 92.9
HMM-2 1677 90.2 91.1
HMM-3 46 91.7 92.9
Table 1: Word classication accuracy results (%) on 200
test references (4479 words).
tion and have the same class label. For example, all
adjacent title states are merged into one title state,
representing the sequence of title words for that refer-
ence. As multiple neighbor states with the same class
label are merged into one, a self-transition loop is intro-
duced, whose probability represents the expected state
duration for that class. After neighbor merging, 1677
states remain in the model (HMM-2).
Next, the neighbor-merged HMM is put through for-
ward and backward V-merging. For V-merging, any
two states that share transitions from or to a common
state and have the same label are merged. A simple
example of an HMM built from just 5 tagged refer-
ences after V-merging is shown in Figure 4. Notice
that even with just ve references, the model closely
matches formats found in many reference sections. Af-
ter V-merging, the HMM is reduced from 1677 states
to 46 states (HMM-3).
All four HMM models are used to tag the 200 test
references by nding the Viterbi path through each
HMM for each reference. The class labels of the states
in the Viterbi path are the classications assigned to
each word in the test references. Word classication
accuracy results for two testing scenarios are reported
in Table 1. In the Any word case, state transitions
are allowed to occur after any observation word. In
the Punc word case, state transitions to a new state
(with a dierent class label) are only allowed to occur
after observations ending in punctuation, since punc-
tuation is often a delimiter between elds in references.
For HMM-0, allowing transitions only after words with
punctuation greatly increases classication accuracy,
since in this case punctuation-delimited phrases are be-
ing classied instead of individual words. For the last
three cases, the overall classication accuracy is quite
high. The V-merged HMM derived directly from the
training data (HMM-3) performs at 93% accuracy, as
well as the HMM where only one state was allowed per
class (HMM-1). For these three cases, limiting state
transitions to occur only after words with punctuation
improves accuracy by about 1% absolute.
...
...
... ...
...
...
computer, university, science, system, paper
...
language
NLP
processing
Compiler
Design
compiler
Garbage
garbage
collection
Collection
Semantics
semantics
denotational
software
design
engineering
Software
Engineering programming
Programming
language
logic
programs
OS
distributed
system
systems
network
time
learning
Artificial
Intelligence
intelligence
Hardware &
Architecture
circuits
design
HCI
multimedia
information
text
retrieval
Information
Retrieval
classification
Cooperative
cscw
multimedia
Multimedia
interface
Interface
Design
design
Planning
knowledge
representation
Knowledge
Representation
tools
environments
construction
types optimization
memory
region
parallel
data
language
text
information
learning
Learning
Machine
algorithms
networks
algorithm
university problems
plan
reasoning
temporal
language
natural
system
interfaces
sketch
user
group
provide
work
collaborative
real
time
data
media
computer
system
university
paper
performance
university
computer
based
computer
university
university
language code
natural
planning
documents
Computer Science
Figure 5: A subset of Cora's topic hierarchy. Each node contains its title, and the ve most probable words, as calculated
by naive Bayes and shrinkage with vertical word redistribution (Hofmann & Puzicha 1998). Words that were not among the
keywords for that class are indicated with italics.
4.3 Future Work
All of the experiments presented above use HMMs
where the model structure and parameters were esti-
mated directly from labeled training instances. Our fu-
ture work will focus on using unlabeled training data.
Unlabeled training data is preferable to labeled data
because generally greater quantities of unlabeled data
are available, and model parameters may be more reli-
ably estimated from larger amounts of data. Addition-
ally, manually labeling large amounts of training data
is costly and error-prone.
Specically, if we are willing to x the model struc-
ture, we can use the Baum-Welch estimation tech-
nique (Baum 1972) to estimate model parameters. The
Baum-Welch method is an Expectation-Maximization
procedure for HMMs that nds local likelihood max-
ima, and is used extensively for acoustic model estima-
tion in automatic speech recognition systems.
We can remove the assumption of a xed model
structure and estimate both model structure and pa-
rameters directly from the data using Bayesian Model
Merging (Stolcke 1994). Bayesian Model Merging in-
volves starting out with a maximally specic hidden
Markov model, where each training observation is rep-
resented by a single state. Pairs of states are iteratively
merged, generalizing the model until an optimal trade-
o between t to the training data and a preference
for smaller, more generalized models is attained. This
merging process can be explained in Bayesian terms
by considering that each merging step is looking to
nd the model that maximizes the posterior probabil-
ity of the model given the training data. We believe
that Bayesian Model Merging, when applied to the V-
merged model (HMM-3, 46 states), will result in an in-
termediate HMM structure that will outperform both
the fully-connected model (HMM-1, 13 states) and the
V-merged model on the reference extraction task.
We will test both of these induction methods on ref-
erence extraction, and will include new experiments on
header extraction. We believe that extracting informa-
tion from headers will be a more challenging problem
than references because there is less of an established
format for presenting information in the header of a
paper.
5 Classication into a Topic Hierarchy
Topic hierarchies are an ecient way to organize, view
and explore large quantities of information that would
otherwise be cumbersome. The U.S. Patent database,
Yahoo, MedLine and the Dewey Decimal system are
all examples of topic hierarchies that exist to make
information more manageable.
As Yahoo has shown, a topic hierarchy can be a use-
ful, integral part of a search engine. Many search en-
gines (e.g. Lycos, Excite, and HotBot) now display
hierarchies on their front page. This feature is equally
valuable for domain-specic search engines. We have
created a 70-leaf hierarchy of computer science topics
for Cora, part of which is shown in Figure 5. Creating
the hierarchy took about 60 minutes, during which we
examined conference proceedings, and explored com-
puter science sites on the Web. Selecting a few key-
words associated with each node took about 90 min-
utes.
A much more dicult and time-consuming part of
creating a hierarchy is populating it with documents
that are placed in the correct topic branches. Ya-
hoo has hired large numbers of people to categorize
web pages into their hierarchy. The U.S. patent oce
also employs people to perform the job of categorizing
patents. In contrast, we automate this process with
learned text classiers.
5.1 Seeding Naive Bayes using Keywords
One method of classifying documents into a hierarchy
is to match them against the keywords in a rule-list
fashion; for each document, we step through the key-
words, and place the document in the category of the
rst keyword that matches. If an extensive keyword
list is carefully chosen, this method can be reason-
ably accurate. However, nding enough keywords to
obtain broad coverage and nding suciently specic
keywords to obtain high accuracy can be very di-
cult; it requires intimate knowledge of the data and a
lot of trial and error. Without this extensive eort,
keyword matching will be brittle, incapable of nding
documents that do not contain matches for selected
keywords.
A less brittle approach is provided by naive Bayes,
an established text classication algorithm (Lewis
1998; McCallum & Nigam 1998) based on Bayesian
machine learning techniques. However, it requires
large amounts of labeled training data to work well.
Traditionally, training data is labeled by a human, and
is dicult and tedious to obtain.
In this paper, we propose using a combination of
these two approaches with what we call pseudo-labeled
data. Instead of asking the builder to hand-label nu-
merous training examples, or to generate a complete-
coverage set of keywords, the builder simply provides
just a few keywords for each category. A large col-
lection of unlabeled documents are \pseudo-labeled"
by using the keywords as a rule-list classier. These
pseudo-labels are noisy, and the majority of documents
remain unlabeled. However, we then build an improved
classier by using all the documents and any pseudo-
labels to bootstrap a naive Bayes text classier that has
been combined with Expectation-Maximization (EM)
(Dempster, Laird, & Rubin 1977) and a powerful tech-
nique from statistics called shrinkage. EM serves to
incorporate the evidence from the unlabeled data, and
to correct, to some extent, the pseudo-labels. Hier-
archical shrinkage serves to alleviate poor parameter
estimates caused by sparse training data.
In this approach, using an enhanced naive Bayes text
classier acts to smooth the brittleness of the original
keywords. One way to understand this is that naive
Bayes discovers new keywords that are probabilistically
correlated with the original keywords. The resulting
pseudo-labeled method provides classication accuracy
that is higher than keyword matching.
5.2 Naive Bayes Text Classication
We use the framework of multinomial naive Bayes text
classication. The classier parameterizes each class
separately with a document frequency, and also word
frequencies. Each class, cj , has a document frequency
relative to all other classes, written P(cj ). For every
word, wt, in the vocabulary, V , P(wtjcj ) indicates the
frequency that the classier expects word wt to occur
in documents in class cj .
We represent a document, di, as an unordered collec-
tion of its words. Given a document and a classier,
we determine the probability that the document be-
longs in class cj by Bayes' rule and the naive Bayes
assumption|that the words in a document occur in-
dependently of each other given the class. If we denote
wdi;k to be the kth word in document di, then classi-
cation becomes:
P(cj jdi) / P(cj)P(dijcj )
/ P(cj )
jdij
Y
k=1
P(wdi;k jcj ): (5)
Learning these parameters (P(cj ) and P(wtjcj)) for
classication is accomplished using a set of labeled
training documents, D. To estimate the word prob-
ability parameters, P(wtjcj ), we count the frequency
that word wt occurs in all word occurrences for doc-
uments in class cj . We supplement this with Laplace
smoothing that primes each estimate with a count of
one to avoid probabilities of zero. Dene N(wt; di) to
be the count of the number of times word wt occurs in
document di, and dene P(cj jdi) 2 f0; 1g, as given by
the document's class label. Then, the estimate of the
probability of word wt in class cj is:
P(wtjcj)=
1 +Pdi2D N(wt; di)P(cj jdi)
jV j +PjV j
s=1Pdi2D N(ws; di)P(cj jdi)
: (6)
The class frequency parameters are set in the same
way, where jCj indicates the number of classes:
P(cj) =
1 +Pdi2D P(cj jdi)
jCj + jDj
: (7)
Empirically, when given a large number of training
documents, naive Bayes does a good job of classifying
text documents (Lewis 1998). More complete presenta-
tions of naive Bayes for text classication are provided
by Mitchell (1997) and McCallum & Nigam (1998).
5.3 Combining Labeled and Unlabeled
Data
When there are both labeled and unlabeled docu-
ments available when training a naive Bayes classi-
er, Expectation-Maximization can be used to prob-
abilistically ll in the missing class labels of the un-
labeled data, allowing them to be used for training.
This results in parameters that are more likely given
all the data, both the labeled and the unlabeled. In
previous work (Nigam et al. 1999), we have shown
this technique signicantly increases classication ac-
curacy with limited amounts of labeled data and large
amounts of unlabeled data.
EM is a class of iterative algorithms for maximum
likelihood estimation in problems with incomplete data
(Dempster, Laird, & Rubin 1977). Given a model of
data generation, and data with some missing values,
EM iteratively uses the current model to estimate the
missing values, and then uses the missing value esti-
mates to improve the model. Using all the available
data, EM will locally maximize the likelihood of the
parameters and give estimates for the missing values.
In our scenario, the class labels of the unlabeled data
are treated as the missing values.
In implementation, EM is an iterative two-step pro-
cess. Initially, the parameter estimates are set in the
standard naive Bayes way, using just the labeled data.
Then we iterate the E- and M-steps. The E-step cal-
culates probabilistically-weighted class labels, P(cj jdi),
for every unlabeled document using the classier and
Equation 5. The M-step estimates new classier pa-
rameters using all the labeled data, both original and
probabilistically labeled, by Equations 6 and 7. We it-
erate the E- and M-steps until the classier converges.
5.4 Shrinkage and Naive Bayes
When the classes are organized hierarchically, as they
are in Cora, naive Bayes parameter estimates can also
be signicantly improved with the statistical technique
shrinkage.
Consider trying to estimate the probability of the
word \intelligence" in the class NLP. This word should
clearly have non-negligible probability there, however,
with limited training data we may be unlucky, and the
observed frequency of \intelligence" in NLP may be
very far from its true expected value. One level up
the hierarchy, however, the Articial Intelligence class
contains many more documents (the union of all the
children); there, the probability of the word \intelli-
gence" can be more reliably estimated.
Shrinkage calculates new word probability estimates
for each leaf by a weighted average of the estimates
on the path from the leaf to the root. The technique
balances a trade-o between specicity and reliability.
Estimates in the leaf are most specic but unreliable;
further up the hierarchy estimates are more reliable
but unspecic. We can calculate mixture weights that
are guaranteed to maximize the likelihood of held-out
data by an iterative procedure.
More formally, let fP1(wtjcj ); : : : ;Pk(wtjcj )g be
word probability estimates, where P1(wtjcj ) is the es-
timate using training data just in the leaf, Pk1(wtjcj )
is the estimate at the root using all the training data,
and Pk(wtjcj ) is the uniform estimate (Pk(wtjcj ) =
1=jV j). The interpolation weights among cj 's \ances-
tors" (which we dene to include cj itself) are writ-
ten f1
j; 2
j; : : : ; kj
g, where Pk
i=1 i
j = 1. The new
word probability estimate based on shrinkage, denoted
P
(
wtjcj )
,
is then
P
(
wtjcj)
=
1
jP1(wtjcj) + : : : + k
jPk(wtjcj ): (8)
The j vectors are calculated using EM. In the E-
step, for every word of training data in class cj , we
determine the expectation that each ancestor was re-
sponsible for generating it (in leave-one-out fashion,
withholding from parameter estimation the data in
the word's document). In the M-step, we normalize
the sum of these expectations to obtain new mixture
weights j . Convergence usually occurs in less than 10
iterations and less than 5 minutes of wall clock time.
A more complete description of hierarchical shrinkage
for text classication is presented by McCallum et al.
(1998).
5.5 Experimental Results
Now we describe results of classifying computer sci-
ence research papers into our 70-leaf hierarchy. A test
set was created by one expert hand-labeling a random
sample of 625 research papers from the 30,682 papers
in the Cora archive at the time we began these experi-
ments. Of these, 225 did not t into any category, and
were discarded|resulting in a 400 document test set.
Some of these papers were outside the area of com-
puter science (e.g. astrophysics papers), but most of
these were papers that, with a more complete hierar-
chy, would be considered computer science papers. In
these experiments, we used the title, author, institu-
tion, references, and abstracts of papers for classica-
tion, not the full text.
Table 2 shows classication results with dierent
techniques used. The rule-list classier based on the
keywords alone provides 45%. Traditional naive Bayes
with 399 labeled training documents, tested in a leave-
one-out fashion, results in 47% classication accuracy.
However, only 100 documents could have been hand-
labeled in the time it took to create the keyword-lists;
using this smaller training set results in 30% accuracy.
We now turn to our pseudo-label approach. Applying
the keyword rule-list to the 30,682 documents in the
Method # Lab # P-Lab # Unlab Acc
Keyword | | | 45%
NB 100 | | 30%
NB 399 | | 47%
NB | 12,657 | 47%
NB+S | 12,657 | 63%
NB+EM+S | 12,657 18,025 66%
Human | | | 72%
Table 2: Classication results with dierent techniques:
keyword matching, human agreement, naive Bayes (NB),
and naive Bayes combined with hierarchical shrinkage (S),
and EM. The classication accuracy (Acc), and the number
of labeled (Lab), keyword-matched pseudo-labeled (P-Lab),
and unlabeled (Unlab) documents used by each method are
shown.
archive results in 12,657 matches, and thus an equiva-
lent number of pseudo-labeled documents. When these
noisy labels are used to train a traditional naive Bayes
text classier, 47% accuracy is reached on the test
set. When naive Bayes is augmented with hierarchical
shrinkage, accuracy increases to 63%. The full algo-
rithm, including naive Bayes, shrinkage, and EM re-
assignment of the pseudo-labeled and unlabeled data,
achieves 66% accuracy. As an interesting comparison,
a second expert classied the same test set; human
agreement between the two was 72%.
These results demonstrate the utility of the pseudo-
label approach. Keyword matching alone is noisy, but
when naive Bayes, shrinkage and EM are used together
as a regularizer, the resulting classication accuracy is
close to human agreement levels. We expect that using
EM with unlabeled data will yield much larger bene-
ts when the hierarchy is expanded to cover more of
computer science. Approximately one-third of the un-
labeled data do not t the hierarchy; this mismatch of
the data and the model misleads EM. The paradigm of
creating pseudo-labels, either from keywords or other
sources, avoids the signicant human eort of hand-
labeling training data.
In future work we plan to rene our probabilistic
model to allow for documents to be placed in interior
hierarchy nodes, documents to have multiple class as-
signments, and classes to be modeled with multiple
mixture components. We are also investigating prin-
cipled methods of re-weighting the word features for
\semi-supervised" clustering that will provide better
discriminative training with unlabeled data.
6 Related Work
Several related research projects investigate the gath-
ering and organization of specialized information. The
WebKB (Craven et al. 1998) project focuses on the col-
lection and organization of information from the Web
into knowledge bases. This project also has a strong
emphasis on using machine learning techniques, includ-
ing text classication and information extraction, to
promote easy re-use across domains. Two example do-
mains, computer science departments and companies,
have been developed.
The CiteSeer project (Bollacker, Lawrence, & Giles
1998) has also developed a search engine for computer
science research papers. It provides similar functional-
ity for searching and linking of research papers, but
does not currently provide a hierarchy of the eld.
CiteSeer focuses on the domain of research papers, but
not as much on using machine learning techniques to
automate search engine creation.
The New Zealand Digital Library project (Witten et
al. 1998) has created publicly-available search engines
for domains from computer science technical reports
to song melodies. The emphasis of this project is on
the creation of full-text searchable digital libraries, and
not on machine learning techniques that can be used
to autonomously generate such repositories. The web
sources for their libraries are manually identied. No
high-level organization of the information is given. No
information extraction is performed and, for the paper
repositories, no citation linking is provided.
The WHIRL project (Cohen 1998) is an eort to in-
tegrate a variety of topic-specic sources into a single
domain-specic search engine. Two demonstration do-
mains of computer games and North American birds
integrate information from many sources. The empha-
sis is on providing soft matching for information re-
trieval searching. Information is extracted from web
pages by hand-written extraction patterns that are cus-
tomized for each web source. Recent WHIRL research
(Cohen & Fan 1999) learns general wrapper extractors
from examples.
7 Conclusions and Future Work
The amount of information available on the Internet
continues to grow exponentially. As this trend con-
tinues, we argue that not only will the public need
powerful tools to help them sort through this informa-
tion, but the creators of these tools will need intelligent
techniques to help them build and maintain these ser-
vices. This paper has shown that machine learning
techniques can signicantly aid the creation and main-
tenance of domain-specic search engines. We have
presented new research in reinforcement learning, text
classication and information extraction towards this
end.
Much future work in each machine learning area has
already been discussed. However, we also see many
other areas where machine learning can further au-
tomate the construction and maintenance of domain-
specic search engines. For example, text classica-
tion can decide which documents on the Web are rel-
evant to the domain. Unsupervised clustering can au-
tomatically create a topic hierarchy and generate key-
words. Citation graph analysis can identify seminal
papers. We anticipate developing a suite of many ma-
chine learning techniques so domain-specic search en-
gine creation can be accomplished quickly and easily.
Acknowledgements
Most of the work in this paper was performed while all
the authors were at Just Research. The second, third
and fourth authors are listed in alphabetic order. Ka-
mal Nigam was supported in part by the Darpa HPKB
program under contract F30602-97-1-0215.
References
Baum, L. E. 1972. An inequality and associated maxi-
mization technique in statistical estimation of probabilis-
tic functions of a Markov process. Inequalities 3:1{8.
Bellman, R. E. 1957. Dynamic Programming. Princeton,
NJ: Princeton University Press.
Bikel, D. M.; Miller, S.; Schwartz, R.; and Weischedel, R.
1997. Nymble: a high-performance learning name-nder.
In Proceedings of ANLP-97, 194{201.
Bollacker, K. D.; Lawrence, S.; and Giles, C. L. 1998.
CiteSeer: An autonomous web agent for automatic re-
trieval and identication of interesting publications. In
Agents '98, 116{123.
Boyan, J.; Freitag, D.; and Joachims, T. 1996. A machine
learning architecture for optimizing web search engines. In
AAAI workshop on Internet-Based Information Systems.
Charniak, E. 1993. Statistical Language Learning. Cam-
bridge, Massachusetts: The MIT Press.
Cho, J.; Garcia-Molina, H.; and Page, L. 1998. Ecient
crawling through URL ordering. In WWW7.
Cohen, W., and Fan, W. 1999. Learning page-independent
heuristics for extracting data from web pages. In AAAI
Spring Symposium on Intelligent Agents in Cyberspace.
Cohen, W. 1998. A web-based information system that
reasons with structured collections of text. In Agents '98.
Craven, M.; DiPasquo, D.; Freitag, D.; McCallum, A.;
Mitchell, T.; Nigam, K.; and Slattery, S. 1998. Learning
to extract symbolic knowledge from the World Wide Web.
In AAAI-98, 509{516.
Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977.
Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society, Series
B 39(1):1{38.
Hofmann, T., and Puzicha, J. 1998. Statistical models
for co-occurrence data. Technical Report AI Memo 1625,
Articial Intelligence Laboratory, MIT.
Joachims, T.; Freitag, D.; and Mitchell, T. 1997. Web-
watcher: A tour guide for the World Wide Web. In Pro-
ceedings of IJCAI-97.
Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996.
Reinforcement learning: A survey. Journal of Articial
Intelligence Research 237{285.
Leek, T. R. 1997. Information extraction using hidden
Markov models. Master's thesis, UC San Diego.
Lewis, D. D. 1998. Naive (Bayes) at forty: The indepen-
dence assumption in information retrieval. In ECML-98.
McCallum, A., and Nigam, K. 1998. A comparison
of event models for naive Bayes text classication. In
AAAI-98 Workshop on Learning for Text Categorization.
http://www.cs.cmu.edu/mccallum.
McCallum, A.; Rosenfeld, R.; Mitchell, T.; and Ng, A.
1998. Improving text clasication by shrinkage in a hier-
archy of classes. In ICML-98, 359{367.
Menczer, F. 1997. ARACHNID: Adaptive retrieval agents
choosing heuristic neighborhoods for information discov-
ery. In ICML '97.
Mitchell, T. M. 1997. Machine Learning. New York:
McGraw-Hill.
Nigam, K.; McCallum, A.; Thrun, S.; and Mitchell, T.
1999. Text classication from labeled and unlabeled doc-
uments using EM. Machine Learning. To appear.
Rabiner, L. R. 1989. A tutorial on hidden Markov models
and selected applications in speech recognition. Proceed-
ings of the IEEE 77(2):257{286.
Stolcke, A. 1994. Bayesian Learning of Probabilistic Lan-
guage Models. Ph.D. Dissertation, UC Berkeley.
Torgo, L., and Gama, J. 1997. Regression using classi-
cation algorithms. Intelligent Data Analysis 1(4).
Viterbi, A. J. 1967. Error bounds for convolutional codes
and an asymtotically optimum decoding algorithm. IEEE
Transactions on Information Theory IT-13:260{269.
Witten, I. H.; Nevill-Manning, C.; McNab, R.; and
Cunnningham, S. J. 1998. A public digital library based
on full-text retrieval: Collections and experience. Com-
munications of the ACM 41(4):71{75.
Yamron, J.; Carp, I.; Gillick, L.; Lowe, S.; and van Mul-
bregt, P. 1998. A hidden Markov model approach to text
segmentation and event tracking. In Proceedings of the
IEEE ICASSP.