Patterns in Unstructured Data

Discovery, Aggregation, and Visualization

A Presentation to the Andrew W. Mellon Foundation by
Clara Yu
John Cuadrado
Maciej Ceglowski
J. Scott Payne
National Institute for Technology and Liberal Education (NITLE)
INTRODUCTION - THE NEED FOR SMARTER SEARCH ENGINES
As of early 2002, there were just over two billion web pages listed in the Google search engine index,
widely taken to be the most comprehensive. No one knows how many more web pages there are on the
Internet, or the total number of documents available over the public network, but there is no question
that the number is enormous and growing quickly. Every one of those web pages has come into
existence within the past ten years. There are web sites covering every conceivable topic at every level
of detail and expertise, and information ranging from numerical tables to personal diaries to public
discussions. Never before have so many people had access to so much diverse information.
Even as the early publicity surrounding the Internet has died down, the network itself has continued to
expand at a fantastic rate, to the point where the quantity of information available over public networks
is starting to exceed our ability to search it. Search engines have been in existence for many decades,
but until recently they have been specialized tools for use by experts, designed to search modest, static,
well-indexed, well-defined data collections. Today's search engines have to cope with rapidly changing,
heterogeneous data collections that are orders of magnitude larger than ever before. They also have to
remain simple enough for average and novice users to use. While computer hardware has kept up with
these demands - we can still search the web in the blink of an eye - our search algorithms have not. As
any Web user knows, getting reliable, relevant results for an online search is often difficult.
For all their problems, online search engines have come a long way. Sites like Google are pioneering the
use of sophisticated techniques to help distinguish content from drivel, and the arms race between
search engines and the marketers who want to manipulate them has spurred innovation. But the
challenge of finding relevant content online remains. Because of the sheer number of documents
available, we can find interesting and relevant results for any search query at all. The problem is that
those results are likely to be hidden in a mass of semi-relevant and irrelevant information, with no easy
way to distinguish the good from the bad.
Precision, Ranking, and Recall - the Holy Trinity
In talking about search engines and how to improve them, it helps to remember what distinguishes a
useful search from a fruitless one. To be truly useful, there are generally three things we want from a
search engine:
1. We want it to give us all of the relevant information available on our topic.
2. We want it to give us only information that is relevant to our search.
3. We want the information ordered in some meaningful way, so that we see the most relevant
results first.
The first of these criteria - getting all of the relevant information available - is called recall. Without good
recall, we have no guarantee that valid, interesting results won't be left out of our result set. We want
the rate of false negatives - relevant results that we never see - to be as low as possible.
The second criterion - the proportion of documents in our result set that is relevant to our search - is
called precision. With too little precision, our useful results get diluted by irrelevancies, and we are left
with the task of sifting through a large set of documents to find what we want. High precision means the
lowest possible rate of false positives.
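The two definitions above can be turned into a few lines of arithmetic. The following is a minimal sketch, assuming we already know which documents are truly relevant (in practice that judgement is the hard part); the function name and the document identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids the search engine returned
    relevant:  document ids a human judged relevant (assumed known here)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set: four documents returned, three truly relevant overall.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(precision, recall)
```

Here two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67). The missing document d5 is a false negative; d3 and d4 are false positives.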
There is an inevitable tradeoff between precision and recall. Search results generally lie on a continuum
of relevancy, so there is no distinct place where relevant results stop and extraneous ones begin. The
wider we cast our net, the less precise our result set becomes. This is why the third criterion, ranking, is
so important. Ranking has to do with whether the result set is ordered in a way that matches our
intuitive understanding of what is more and what is less relevant. Of course the concept of 'relevance'
depends heavily on our own immediate needs, our interests, and the context of our search. In an ideal
world, search engines would learn our individual preferences so well that they could fine-tune any search
we made based on our past expressed interests and peccadilloes. In the real world, a useful ranking is
anything that does a reasonable job distinguishing between strong and weak results.
The Platonic Search Engine
Building on these three criteria of precision, ranking and recall, it is not hard to envision what an ideal
search engine might be like:
Scope: The ideal engine would be able to search every document on the Internet
Speed: Results would be available immediately
Currency: All the information would be kept completely up-to-date
Recall: We could always find every document relevant to our query
Precision: There would be no irrelevant documents in our result set
Ranking: The most relevant results would come first, and the ones furthest afield would come
last
Of course, our mundane search engines have a way to go before reaching the Platonic ideal. What will it
take to bridge the gap?
For the first three items in the list - scope, speed, and currency - it's possible to make major
improvements by throwing resources at the problem. Search engines can always be made more
comprehensive by adding content, they can always be made faster with better hardware and
programming, and they can always be made more current through frequent updates and regular purging
of outdated information.
Improving our trinity of precision, ranking and recall, however, requires more than brute force. In the
following pages, we will describe one promising approach, called latent semantic indexing, that lets us
make improvements in all three categories. LSI was first developed at Bellcore in the late 1980's, and is
the object of active research, but is surprisingly little-known outside the information retrieval community.
But before we can talk about LSI, we need to talk a little more about how search engines do what they
do.
INSIDE THE MIND OF A SEARCH ENGINE
Taking Things Literally
If I handed you a stack of newspapers and magazines and asked you to pick out all of the articles having
to do with French Impressionism, it is very unlikely that you would pore over each article
word-by-word, looking for the exact phrase. Instead, you would probably flip through each publication, skimming
the headlines for articles that might have to do with art or history, and then reading through the ones
you found to see if you could find a connection.
If, however, I handed you a stack of articles from a highly technical mathematical journal and asked you
to show me everything to do with n-dimensional manifolds, the chances are high (unless you are a
mathematician) that you would have to go through each article line-by-line, looking for the phrase
"n-dimensional manifold" to appear in a sea of jargon and equations.
The two searches would generate very different results. In the first example, you would probably be
done much faster. You might miss a few instances of the phrase French Impressionism because they
occurred in an unlikely article - perhaps a mention of a business figure's being related to Claude Monet -
but you might also find a number of articles that were very relevant to the search phrase French
Impressionism, even though they didn't contain the actual words: articles about a Renoir exhibition, or
visiting the museum at Giverny, or the Salon des Refusés.
With the math articles, you would probably find every instance of the exact phrase n-dimensional
manifold, given strong coffee and a good pair of eyeglasses. But unless you knew something about
higher mathematics, it is very unlikely that you would pick out articles about topology that did not
contain the search phrase, even though a mathematician might find those articles very relevant.
These two searches represent two opposite ways of searching a document collection. The first is a
conceptual search, based on a higher-level understanding of the query and the search space, including
all kinds of contextual knowledge and assumptions about how newspaper articles are structured, how the
headline relates to the contents of an article, and what kinds of topics are likely to show up in a given
publication.
The second is a purely mechanical search, based on an exhaustive comparison between a certain set of
words and a much larger set of documents, to find where the first appear in the second. It is not hard to
see how this process could be made completely automatic: it requires no understanding of either the
search query or the document collection, just time and patience.
Of course, computers are perfect for doing rote tasks like this. Human beings can never take a purely
mechanical approach to a text search problem, because human beings can't help but notice things. Even
someone looking through technical literature in a foreign language will begin to recognize patterns and
clues to help guide them in selecting candidate articles, and start to form ideas about the context and
meaning of the search. But computers know nothing about context, and excel at performing repetitive
tasks quickly. This rote method of searching is how search engines work.
Every full-text search engine, no matter how complex, finds its results using just such a mechanical
method of exhaustive search. While the techniques it uses to rank the results may be very fancy indeed
(Google is a good example of innovation in choosing a system for ranking), the actual search is based
entirely on keywords, with no higher-level understanding of the query or any of the documents being
searched.
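To make the "purely mechanical" search concrete, here is a minimal sketch of an exhaustive keyword match. It is an illustration only, not how any particular search engine is implemented: real engines use precomputed indexes rather than scanning every text, and the two sample documents and the function name are invented.

```python
def keyword_search(documents, query):
    """Return ids of documents that contain every word of the query.

    documents: dict mapping a document id to its text
    query:     string of whitespace-separated keywords
    Punctuation handling is deliberately omitted to keep the sketch short.
    """
    terms = query.lower().split()
    results = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        # A document either contains all of the keywords or it is discarded.
        if all(term in words for term in terms):
            results.append(doc_id)
    return results


# Hypothetical two-document collection.
DOCS = {
    "monet": "French Impressionism and Claude Monet",
    "renoir": "A Renoir exhibition opens in Paris",
}
print(keyword_search(DOCS, "french impressionism"))
```

The Renoir article is clearly about French Impressionism, but because it contains neither keyword the mechanical search never returns it.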
John Henry Revisited
Of course, while it is nice to have repetitive things automated, it is also nice to have our search agent
understand what it is doing. We want a search agent who can behave like a librarian, but on a massive
scale, bringing us relevant documents we didn't even know to look for. The question is, is it possible to
augment the exhaustiveness of a mechanical keyword search with some kind of a conceptual search that
looks at the meaning of each document, not just whether or not a particular word or phrase appears in
it? If I am searching for information on the effects of the naval blockade on the economy of
the Confederacy during the Civil War, chances are high that a number of documents pertinent to
that topic might not contain every one of those keywords, or even a single one of them. A discussion of
cotton production in Georgia during the period 1860-1870 might be extremely revealing and useful to
me, but if it does not mention the Civil War or the naval blockade directly, a keyword search will never
find it.
Many strategies have been tried to get around this 'dumb computer' problem. Some of these are simple
measures designed to enhance a regular keyword search - for example, lists of synonyms for the search
engine to try in addition to the search query, or fuzzy searches that tolerate bad spelling and different
word forms. Others are ambitious exercises in artificial intelligence, using complex language models and
search algorithms to mimic how we aggregate words and sentences into higher-level concepts.
Unfortunately, these higher-level models are really bad. Despite years of trying, no one has been able to
create artificial intelligence, or even artificial stupidity. And there is growing agreement that nothing
short of an artificial intelligence program can consistently extract higher-level concepts from written
human language, which has proven far more ambiguous and difficult to understand than any of the early
pioneers of computing expected.
That leaves natural intelligence, and specifically expert human archivists, to do the complex work of
organizing and tagging data to make a conceptual search possible.
STRUCTURED DATA - EVERYTHING IN ITS PLACE
The Joys of Taxonomy
Anyone who has ever used a card catalog or online library terminal is familiar with structured data.
Rather than indexing the full text of every book, article, and document in a large collection, works are
assigned keywords by an archivist, who also categorizes them within a fixed hierarchy. A search for the
keywords Khazar empire, for example, might yield several titles under the category Khazars -
Ukraine - Kiev - History, while a search for beet farming might return entries under Vegetables
- Postharvest Diseases and Injuries - Handbooks, Manuals, etc. The Library of Congress is
a good example of this kind of comprehensive classification - each work is assigned keywords from a
rigidly constrained vocabulary, then given a unique identifier and placed into one or more categories to
facilitate later searching.
While most library collections do not feature full-text search (since so few works in print are available in
electronic form), there is no reason why structured databases can't also include a full-text search. Many
early web search engines, including Yahoo, used just such an approach, with human archivists reviewing
each page and assigning it to one or more categories before including it in the search engine's document
collection.
The advantage of structured data is that it allows users to refine their search using concepts rather than
just individual keywords or phrases. If we are more interested in politics than mountaineering, it is very
helpful to be able to limit a search for Geneva summit to the category Politics-International-20th
Century, rather than Switzerland-Geography. And once we get our result, we can use the classifiers
to browse within a category or sub-category for other results that may be conceptually similar, such as
Reykjavik summit or SALT II talks, even if they don't contain the keyword Geneva.
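A small sketch may help show what category-limited search and browsing look like in practice. Everything below is hypothetical: the three records, their category paths, and the function names are invented for illustration and do not reflect any real catalog's schema.

```python
# Each record carries a title plus a category path assigned by an archivist.
RECORDS = [
    {"title": "Geneva summit of 1985",
     "category": ("Politics", "International", "20th Century")},
    {"title": "Reykjavik summit",
     "category": ("Politics", "International", "20th Century")},
    {"title": "Hiking trails near Geneva",
     "category": ("Switzerland", "Geography")},
]


def search(records, keyword, category_prefix=()):
    """Keyword match on the title, optionally limited to a category subtree."""
    keyword = keyword.lower()
    prefix = tuple(category_prefix)
    return [r["title"] for r in records
            if keyword in r["title"].lower()
            and r["category"][:len(prefix)] == prefix]


def browse(records, category):
    """List every title filed under exactly this category."""
    return [r["title"] for r in records if r["category"] == tuple(category)]


print(search(RECORDS, "geneva"))                    # both Geneva records
print(search(RECORDS, "geneva", ("Politics",)))     # only the political one
print(browse(RECORDS, ("Politics", "International", "20th Century")))
```

Browsing the category surfaces the Reykjavik record even though it does not contain the keyword Geneva, which is exactly the benefit described above. The cost, of course, is that someone had to assign those category paths by hand.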
You Say Vegetables::Tomato, I Say Fruits::Tomato
We can see how assigning descriptors and classifiers to a text gives us one important advantage, by
returning relevant documents that don't necessarily contain a verbatim match to our search query. Fully
described data sets also give us a view of the 'big picture' - by examining the structure of categories and
sub-categories (or taxonomy), we can form a rough image of the scope and distribution of the
document collection as a whole.
But there are serious drawbacks to this approach to categorizing data. For starters, there are the
problems inherent in any kind of taxonomy. The world is a fuzzy place that sometimes resists
categorization, and putting names to things can constrain the ways in which we view them. Is a tomato a
fruit or a vegetable? The answer depends on whether you are a botanist or a cook. Serbian and Croatian
are mutually intelligible, but have different writing systems and are spoken by different populations with
a dim view of one another. Are they two different languages? Russian and Polish have two words for
'blue', where English has one. Which is right? Classifying something inevitably colors the way in which we
see it.
Moreover, what happens if I need to combine two document collections indexed in different ways? If I
have a large set of articles about Indian dialects indexed by language family, and another large set indexed
by geographic region, I either need to choose one taxonomy over the other, or combine the two into a
third. In either case I will be re-indexing a lot of the data. There are many efforts underway to mitigate
this problem - ranging from standards-based approaches like Dublin Core to rarefied research into
ontological taxonomies (finding a sort of One True Path to classifying data). Nevertheless, the
underlying problem is a thorny one.
One common-sense solution is to classify things in multiple ways - assigning a variety of categories,
keywords, and descriptors to every document we want to index. But this runs us into the problem of
limited resources. Having an expert archivist review and classify every document in a collection is an
expensive undertaking, and it grows more expensive and time-consuming as we expand our taxonomy
and keyword vocabulary. What's more, making changes becomes more expensive. Remember that if we
want to augment or change our taxonomy (as has actually happened with several large tagged linguistic
corpora), there is no recourse except to start from the beginning. And if any document gets misclassified,
it may never be seen again.
Simple schemas may not be descriptive enough to be useful, and complex schemas require many
thousands of hours of expert archivist time to design, implement, and maintain. Adding documents to a
collection requires more expert time. For large collections, the effort becomes prohibitive.
Better Living Through Matrix Algebra
So far the choice seems pretty stark - either we live with amorphous data that we can only search by
keyword, or we adopt a regimented approach that requires enormous quantities of expensive skilled user
time, filters results through the lens of implicit and explicit assumptions about how the data should be
organized, and is a chore to maintain. The situation cries out for a middle ground, some way to at least
partially organize complex data without human intervention in a way that will be meaningful to human
users. Fortunately for us, techniques exist to do just that.
LATENT SEMANTIC INDEXING
Taking a Holistic View
Regular keyword searches approach a document collection with a kind of accountant mentality: a
document contains a given word or it doesn't, with no middle ground. We create a result set by looking
through each document in turn for certain keywords and phrases, tossing aside any documents that don't
contain them, and ordering the rest based on some ranking system. Each document stands alone in
judgement before the search algorithm - there is no interdependence of any kind between documents,
which are evaluated solely on their contents.
Latent semantic indexing adds an important step to the document indexing process. In addition to
recording which keywords a document contains, the method examines the document collection as a
whole, to see which other documents contain some of those same words. LSI considers documents that
have many words in common to be semantically close, and ones with few words in common to be
semantically distant. This simple method correlates surprisingly well with how a human being, looking at
content, might classify a document collection. Although the LSI algorithm doesn't understand anything
about what the words mean, the patterns it notices can make it seem astonishingly intelligent.
When you search an LSI-indexed database, the search engine looks at similarity values it has calculated
for every content word, and returns the documents that it thinks best fit the query. Because two
documents may be semantically very close even if they do not share a particular keyword, LSI does not
require an exact match to return useful results. Where a plain keyword search will fail if there is no exact
match, LSI will often return relevant documents that don't contain the keyword at all.
To use an earlier example, let's say we use LSI to index our collection of mathematical articles. If the
words n-dimensional, manifold and topology appear together in enough articles, the search
algorithm will notice that the three terms are semantically close. A search for n-dimensional
manifolds will therefore return a set of articles containing that phrase (the same result we would get
with a regular search), but also articles that contain just the word topology. The search engine
understands nothing about mathematics, but examining a sufficient number of documents teaches it that
the three terms are related. It then uses that information to provide an expanded set of results with
better recall than a plain keyword search.
Ignorance is Bliss
We mentioned the difficulty of teaching a computer to organize data into concepts and demonstrate
understanding. One great advantage of LSI is that it is a strictly mathematical approach, with no insight
into the meaning of the documents or words it analyzes. This makes it a powerful, generic technique able
to index any cohesive document collection in any language. It can be used in conjunction with a regular
keyword search, or in place of one, with good results.
Before we discuss the theoretical underpinnings of LSI, it's worth citing a few actual searches from some
sample document collections. In each search, a red title or asterisk indicates that the document doesn't
contain the search string, while a blue title or asterisk informs the viewer that the search string is present.
In an AP news wire database, a search for Saddam Hussein returns articles on the Gulf War, UN
sanctions, the oil embargo, and documents on Iraq that do not contain the Iraqi president's
name at all.
Looking for articles about Tiger Woods in the same database brings up many stories about the
golfer, followed by articles about major golf tournaments that don't mention his name.
Constraining the search to days when no articles were written about Tiger Woods still brings up
stories about golf tournaments and well-known players.
In an image database that uses LSI indexing, a search on Normandy invasion shows images of
the Bayeux tapestry - the famous tapestry depicting the Norman invasion of England in 1066,
the town of Bayeux, followed by photographs of the English invasion of Normandy in 1944.
In all these cases LSI is 'smart' enough to see that Saddam Hussein is somehow closely related to Iraq
and the Gulf War, that Tiger Woods plays golf, and that Bayeux has close semantic ties to
invasions and England. As we will see in our exposition, all of these apparently intelligent connections
are artifacts of word use patterns that already exist in our document collection.
HOW LSI WORKS
The Search for Content
We mentioned that latent semantic indexing looks at patterns of word distribution (specifically, word
co-occurrence) across a set of documents. Before we talk about the mathematical underpinnings, we should
be a little more precise about what kind of words LSI looks at.
Natural language is full of redundancies, and not every word that appears in a document carries semantic
meaning. In fact, the most frequently used words in English are words that don't carry content at all:
functional words, conjunctions, prepositions, auxiliary verbs and others. The first step in doing LSI is
culling all those extraneous words from a document, leaving only content words likely to have semantic
meaning. There are many ways to define a content word - here is one recipe for generating a list of
content words from a document collection:
1. Make a complete list of all the words that appear anywhere in the collection
2. Discard articles, prepositions, and conjunctions
3. Discard common verbs (know, see, do, be)
4. Discard pronouns
5. Discard common adjectives (big, late, high)
6. Discard frilly words (therefore, thus, however, albeit, etc.)
7. Discard any words that appear in every document
8. Discard any words that appear in only one document
This process condenses our documents into sets of content words that we can then use to index our
collection.
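The recipe above can be sketched in a few lines. This is a simplified illustration under stated assumptions: steps 2 through 6 are collapsed into a single hand-written word set (STOP_WORDS, which is tiny and hypothetical), and the three sample documents are invented.

```python
import re

# Hypothetical, deliberately tiny stand-in for steps 2-6 of the recipe.
STOP_WORDS = {
    "the", "a", "an", "of", "in", "on", "and", "or", "but",   # articles, prepositions, conjunctions
    "know", "see", "do", "be", "is",                            # common verbs
    "it", "they", "we",                                         # pronouns
    "big", "late", "high",                                      # common adjectives
    "therefore", "thus", "however", "albeit",                   # frilly words
}


def content_words(documents, stop_words=STOP_WORDS):
    """Return the sorted list of content words for a collection of texts."""
    # Step 1: collect every word, remembering how many documents contain it.
    doc_freq = {}
    for text in documents:
        for word in set(re.findall(r"[a-z']+", text.lower())):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    total = len(documents)
    # Steps 2-6: drop stop words. Steps 7-8: drop words found in every
    # document or in only one document.
    return sorted(word for word, df in doc_freq.items()
                  if word not in stop_words and 1 < df < total)


DOCS = [
    "the bank offers loans",
    "the bank cut grants",
    "the grants and loans rose",
]
print(content_words(DOCS))
```

With these three toy documents the surviving content words are bank, grants, and loans: each appears in exactly two of the three documents, while words such as offers and rose occur only once and the appears everywhere.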
Thinking Inside the Grid
Using our list of content words and documents, we can now generate a term-document matrix. This is
a fancy name for a very large grid, with documents listed along the horizontal axis, and content words
along the vertical axis. For each content word in our list, we go across the appropriate row and put an 'X'
in the column for any document where that word appears. If the word does not appear, we leave that
column blank.
Doing this for every word and document in our collection gives us a mostly empty grid with a sparse
scattering of X-es. This grid displays everything that we know about our document collection. We can list
all the content words in any given document by looking for X-es in the appropriate column, or we can
find all the documents containing a certain content word by looking across the appropriate row.
Notice that our arrangement is binary - a square in our grid either contains an X, or it doesn't. This big
grid is the visual equivalent of a generic keyword search, which looks for exact matches between
documents and keywords. If we replace blanks and X-es with zeroes and ones, we get a numerical
matrix containing the same information.
The key step in LSI is decomposing this matrix using a technique called singular value decomposition.
The mathematics of this transformation are beyond the scope of this article (a rigorous treatment is
available here), but we can get an intuitive grasp of what SVD does by thinking of the process spatially.
An analogy will help.
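As a concrete illustration of the grid, here is a minimal sketch that builds the zero-and-one matrix as a list of rows (one row per content word, one column per document) and then reads it by column and by row. The terms, documents, and function names are invented; a real collection would need a sparse representation rather than nested lists.

```python
def term_document_matrix(terms, documents):
    """Rows are content words, columns are documents; 1 means 'word appears'."""
    doc_words = [set(text.lower().split()) for text in documents]
    return [[1 if term in words else 0 for words in doc_words] for term in terms]


def words_in_document(matrix, terms, column):
    """Read down one column: which content words does this document contain?"""
    return [term for term, row in zip(terms, matrix) if row[column]]


def documents_with(matrix, terms, term):
    """Read across one row: which documents contain this content word?"""
    row = matrix[terms.index(term)]
    return [column for column, cell in enumerate(row) if cell]


TERMS = ["bank", "grants", "loans"]
DOCS = ["the bank offers loans", "the bank cut grants", "the grants and loans rose"]

matrix = term_document_matrix(TERMS, DOCS)
for term, row in zip(TERMS, matrix):
    print(f"{term:8s}", row)
```

For this toy collection the rows come out as bank [1, 1, 0], grants [0, 1, 1], and loans [1, 0, 1]. Looking up a keyword is just reading across its row, which is why the grid is equivalent to an exact-match keyword search.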
Breakfast in Hyperspace
Imagine that you are curious about what people typically order for breakfast down at your local diner,
and you want to display this information in visual form. You decide to examine all the breakfast orders
from a busy weekend day, and record how many times the words bacon, eggs and coffee occur in each
order.
You can graph the results of your survey by setting up a chart with three orthogonal axes - one for each
keyword. The choice of direction is arbitrary - perhaps a bacon axis in the x direction, an eggs axis in
the y direction, and the all-important coffee axis in the z direction. To plot a particular breakfast order,
you count the occurrences of each keyword, and then take the appropriate number of steps along the axis
for that word. When you are finished, you get a cloud of points in three-dimensional space, representing
all of that day's breakfast orders.
If you draw a line from the origin of the graph to each of these points, you obtain a set of vectors in
'bacon-eggs-and-coffee' space. The size and direction of each vector tells you how many of the three key
items were in any particular order, and the set of all the vectors taken together tells you something
about the kind of breakfast people favor on a Saturday morning.
What your graph shows is called a term space. Each breakfast order forms a vector in that space, with
its direction and magnitude determined by how many times the three keywords appear in it. Each
keyword corresponds to a separate spatial direction, perpendicular to all the others. Because our
example uses three keywords, the resulting term space has three dimensions, making it possible for us
to visualize it. It is easy to see that this space could have any number of dimensions, depending on how
many keywords we chose to use. If we were to go back through the orders and also record occurrences of
sausage, muffin, and bagel, we would end up with a six-dimensional term space, and six-dimensional
document vectors.
Applying this procedure to a real document collection, where we note each use of a content word, results
in a term space with many thousands of dimensions. Each document in our collection is a vector with as
many components as there are content words. Although we can't possibly visualize such a space, it is
built in the exact same way as the whimsical breakfast space we just described. Documents in such a
space that have many words in common will have vectors that are near to each other, while documents
with few shared words will have vectors that are far apart.
Latent semantic indexing works by projecting this large, multidimensional space down into a smaller
number of dimensions. In doing so, keywords that are semantically similar will get squeezed together,
and will no longer be completely distinct. This blurring of boundaries is what allows LSI to go beyond
straight keyword matching. To understand how it takes place, we can use another analogy.
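The breakfast example translates directly into code. The sketch below turns each order into a vector of keyword counts and compares two orders by the cosine of the angle between their vectors. Using cosine similarity as the measure of "nearness" is an assumption on our part (it is a common choice in information retrieval, but the text above does not name a specific measure), and the sample orders are invented.

```python
import math

KEYWORDS = ("bacon", "eggs", "coffee")  # one axis per keyword


def order_vector(order_text, keywords=KEYWORDS):
    """Count how often each keyword occurs in an order: one component per axis."""
    words = order_text.lower().split()
    return [words.count(keyword) for keyword in keywords]


def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; 0.0 means nothing in common."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


hearty = order_vector("eggs eggs bacon coffee")   # [1, 2, 1]
just_coffee = order_vector("coffee coffee")       # [0, 0, 2]
bacon_only = order_vector("bacon bacon")          # [2, 0, 0]

print(cosine_similarity(hearty, just_coffee))     # some overlap (coffee)
print(cosine_similarity(just_coffee, bacon_only)) # no shared keywords
```

Orders that share keywords get a similarity above zero, while orders with no keywords in common are perpendicular in term space. Adding more keywords simply lengthens the vectors; nothing else in the code changes.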
Singular Value Decomposition
Imagine you keep tropical fish, and are proud of your prize aquarium - so proud that you want to submit
a picture of it to Modern Aquaria magazine, for fame and profit. To get the best possible picture, you will
want to choose a good angle from which to take the photo. You want to make sure that as many of the
fish as possible are visible in your picture, without being hidden by other fish in the foreground. You also
won't want the fish all bunched together in a clump, but rather shot from an angle that shows them
nicely distributed in the water. Since your tank is transparent on all sides, you can take a variety of
pictures from above, below, and from all around the aquarium, and select the best one.
In mathematical terms, you are looking for an optimal mapping of points in 3-space (the fish) onto a
plane (the film in your camera). 'Optimal' can mean many things - in this case it means 'aesthetically
pleasing'. But now imagine that your goal is to preserve the relative distance between the fish as much
as possible, so that fish on opposite sides of the tank don't get superimposed in the photograph to look
like they are right next to each other. Here you would be doing exactly what the SVD algorithm tries to
do with a much higher-dimensional space.
Instead of mapping 3-space to 2-space, however, the SVD algorithm goes to much greater extremes. A
typical term space might have tens of thousands of dimensions, and be projected down into fewer than
150. Nevertheless, the principle is exactly the same. The SVD algorithm preserves as much information
as possible about the relative distances between the document vectors, while collapsing them down into
a much smaller set of dimensions. In this collapse, information is lost, and content words are
superimposed on one another.
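For readers who want the projection stated symbolically, the standard textbook form is given below. The notation is our own assumption rather than the article's: A is the term-document matrix, the sigma values are its singular values in decreasing order, and k is the number of dimensions kept (fewer than 150 in the example above).

```latex
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}, \qquad
\lVert A - A_k \rVert_F = \min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F
```

The last equality (the Eckart-Young theorem) is the precise sense in which the reduced matrix preserves "as much information as possible": among all matrices of rank at most k, it is the one closest to the original.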
Information loss sounds like a bad thing, but here it is a blessing. What we are losing is noise from our
original term-document matrix, revealing similarities that were latent in the document collection. Similar
things become more similar, while dissimilar things remain distinct. This reductive mapping is what gives
LSI its seemingly intelligent behavior of being able to correlate semantically related terms. We are really
exploiting a property of natural language, namely that words with similar meaning tend to occur
together.
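The effect can be seen on a toy matrix. The sketch below is not a full SVD and not the implementation used for the searches described earlier: to stay dependency-free it uses power iteration (a swapped-in technique) to find only the single strongest singular direction, then rebuilds the matrix from that one direction. The three terms and three documents are invented.

```python
import math


def leading_singular_triplet(A, iterations=200):
    """Approximate (sigma, u, v) for the largest singular value of A
    by power iteration on A^T A. A is a list of rows."""
    rows, cols = len(A), len(A[0])
    v = [1.0] * cols
    for _ in range(iterations):
        w = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]   # w = A v
        v = [sum(A[i][j] * w[i] for i in range(rows)) for j in range(cols)]   # v = A^T w
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    w = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in w))
    u = [x / sigma for x in w]
    return sigma, u, v


def rank_one_approximation(A):
    """Rebuild A from its strongest singular direction only."""
    sigma, u, v = leading_singular_triplet(A)
    return [[sigma * ui * vj for vj in v] for ui in u]


# Rows: manifold, topology, golf. Columns: three hypothetical documents.
A = [
    [1, 0, 0],   # "manifold" appears only in document 0
    [1, 1, 0],   # "topology" appears in documents 0 and 1
    [0, 0, 1],   # "golf" appears only in document 2
]
approx = rank_one_approximation(A)
print(round(approx[0][1], 3))  # manifold vs. document 1: now clearly positive
print(round(approx[0][2], 3))  # manifold vs. document 2: still essentially zero
```

In the original matrix, document 1 does not contain the word manifold at all. After the reduction, the manifold entry for document 1 is clearly positive, because document 1 shares topology with a document that does mention manifolds. The golf document stays unrelated. With only one dimension kept, the golf information is discarded entirely; a realistic system keeps many more dimensions.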
LSI EXAMPLE - INDEXING A DOCUMENT
Putting Theory into Practice
While a discussion of the mathematics behind singular value decomposition is beyond the scope of our
article, it's worthwhile to follow the process of creating a term-document matrix in some detail, to get a
feel for what goes on behind the scenes. Here we will process a sample wire story to demonstrate how
real-life texts get converted into the numerical representation we use as input for our SVD algorithm.
The first step in the chain is obtaining a set of documents in electronic form. This can be the hardest
thing about LSI - there are all too many interesting collections not yet available online. In our
experimental database, we download wire stories from an online newspaper with an AP news feed. A
script downloads each day's news stories to a local disk, where they are stored as text files.
Let's imagine we have downloaded the following sample wire story, and want to incorporate it in our
collection:
O'Neill Criticizes Europe on Grants
PITTSBURGH (AP)
Treasury Secretary Paul O'Neill expressed irritation
Wednesday that European countries have refused to go
along with a U.S. proposal to boost the amount of
direct grants rich nations offer poor countries.
The Bush administration is pushing a plan to increase
the amount of direct grants the World Bank provides the
poorest nations to 50 percent of assistance, reducing
use of loans to these nations.
The first thing we do is strip all formatting from the article, including capitalization, punctuation, and
extraneous markup (like the dateline). LSI pays no attention to word order, formatting, or capitalization,
so we can safely discard that information. Our cleaned-up wire story looks like this:
o'neill criticizes europe on grants treasury secretary
paul o'neill expressed irritation wednesday that
european countries have refused to go along with a us
proposal to boost the amount of direct grants rich
nations offer poor countries the bush administration is
pushing a plan to increase the amount of direct grants
the world bank provides the poorest nations to 50
percent of assistance reducing use of loans to these
nations
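The clean-up step above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for our collection: the function name clean_text is hypothetical, and removing the headline dateline is left out because it depends on the layout of the feed.

```python
import re

def clean_text(raw: str) -> str:
    """Lowercase the text and strip punctuation, keeping apostrophes inside words."""
    text = raw.lower()
    # Drop periods outright so that "U.S." collapses to "us", as in the cleaned story.
    text = text.replace(".", "")
    # Turn anything that is not a letter, digit, apostrophe, or space into a space.
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())

sample = "Treasury Secretary Paul O'Neill expressed irritation\nWednesday over a U.S. proposal."
print(clean_text(sample))
```

Dropping periods before the general substitution is a design choice made here so that abbreviations stay together as one word; other collections may need different rules.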
The next thing we want to do is pick out the content words in our article. These are the words we
consider semantically significant - everything else is clutter. We do this by applying a stop list of
commonly used English words that don't carry semantic meaning. Using a stop list greatly reduces the
amount of noise in our collection, as well as eliminating a large number of words that would make the
computation more difficult. Creating a stop list is something of an art - they depend very much on the
nature of the data collection. You can see our full wire stories stop list here.
Here is our sample story with stop-list words highlighted:
o'neill criticizes europe on grants treasury secretary
paul o'neill expressed irritation wednesday that
european countries have refused to go along with a US
proposal to boost the amount of direct grants rich
nations offer poor countries the bush administration is
pushing a plan to increase the amount of direct grants
the world bank provides the poorest nations to 50
percent of assistance reducing use of loans to these
nations
Removing these stop words leaves us with an abbreviated version of the article containing content words
only:
o'neill criticizes europe grants treasury secretary
paul o'neill expressed irritation european countries
refused US proposal boost direct grants rich nations
poor countries bush administration pushing plan
increase amount direct grants world bank poorest
nations assistance loans nations
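Applying a stop list amounts to a simple membership test. The sketch below assumes a tiny, hypothetical stop list (STOP_WORDS) chosen only to cover one sentence of the sample story; a real list would be far longer and tuned to the collection.

```python
# Hypothetical stop list for illustration only; a production list is much longer.
STOP_WORDS = {"a", "along", "go", "have", "is", "of", "on",
              "that", "the", "these", "to", "with"}

def remove_stop_words(words):
    """Keep only the words that are not on the stop list, preserving order."""
    return [w for w in words if w not in STOP_WORDS]

cleaned = "the bush administration is pushing a plan to increase the amount of direct grants"
print(" ".join(remove_stop_words(cleaned.split())))
```

Storing the stop list as a set keeps each lookup cheap, which matters once the collection grows to thousands of documents.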
However, one more important step remains before our document is ready for indexing. Notice how many
of our content words are plural nouns (grants, nations) and inflected verbs (pushing, refused). It
doesn't seem very useful to have each inflected form of a content word be listed separately in our master
word list - with all the possible variants, the list would soon grow unwieldy. More troubling is that LSI
might not recognize that the different variant forms were actually the same word in disguise. We solve
this problem by using a stemmer.
Stemming
While LSI itself knows nothing about language (we saw how it deals exclusively with a mathematical
vector space), some of the preparatory work needed to get documents ready for indexing is very
language-specific. We have already seen the need for a stop list, which will vary entirely from language
to language and to a lesser extent from document collection to document collection. Stemming is
similarly language-specific, derived from the morphology of the language. For English documents, we use
an algorithm called the Porter stemmer to remove common endings from words, leaving behind an
invariant root form. Here are some examples of words before and after stemming:
information -> inform
presidency -> presid
presiding -> presid
happiness -> happi
happily -> happi
discouragement -> discourag
battles -> battl
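To give a feel for what a stemmer does, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases and handles cases such as "happiness" that this toy ignores; the suffix list and the min_stem guard are assumptions made for the illustration.

```python
# Toy suffix list, checked in order; NOT the Porter rule set.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word

for w in ["grants", "nations", "pushing", "refused", "us"]:
    print(w, "->", naive_stem(w))
```

The length guard keeps very short words such as "us" from being cut down to a single letter.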
And here is our sample story as it appears to the stemmer:
o'neill criticizes europe grants treasury secretary
paul o'neill expressed irritation european countries
refused US proposal boost direct grants rich nations
poor countries
bush administration pushing plan increase amount direct
grants world bank poorest nations assistance loans
nations
Note that at this point we have reduced the original natural-language news story to a series of word
stems. All of the information carried by punctuation, grammar, and style is gone - all that remains is
word order, and we will be doing away with even that by transforming our text into a word list. It is
striking that so much of the meaning of text passages inheres in the number and choice of content
words, and relatively little in the way they are arranged. This is very counterintuitive, considering how
important grammar and writing style are to human perceptions of writing.
Having stripped, pruned, and stemmed our text, we are left with a flat list of words:
administrat
amount
assist
bank
boost
bush
countri (2)
direct
europ
express
grant (2)
increas
irritat
loan
nation (3)
o'neill
paul
plan
poor (2)
propos
push
refus
rich
secretar
treasuri
US
world
This is the information we will use to generate our term-document matrix, along with a similar word list
for every document in our collection.
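Collapsing a stemmed document into a flat word list is a counting exercise. A minimal sketch, assuming the stems arrive as a Python list (the function name word_list and the short input are ours, for illustration):

```python
from collections import Counter

def word_list(stems):
    """Return {stem: count} for one document, discarding word order."""
    return dict(Counter(stems))

stems = ["grant", "nation", "poor", "grant", "nation", "nation"]
for stem, count in sorted(word_list(stems).items()):
    # Show the count only when a stem occurs more than once, as in the list above.
    print(stem if count == 1 else f"{stem} ({count})")
```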
THE TERM-DOCUMENT MATRIX
Doing the Numbers
As we mentioned in our discussion of LSI, the term-document matrix is a large grid representing every
document and content word in a collection. We have looked in detail at how a document is converted
from its original form into a flat list of content words. We prepare a master word list by generating a
similar set of words for every document in our collection, and discarding any content words that either
appear in every document (such words won't let us discriminate between documents) or in only one
document (such words tell us nothing about relationships across documents). With this master word list
in hand, we are ready to build our TDM.
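The pruning rule for the master word list can be sketched as follows. The three tiny documents and the function name master_word_list are hypothetical; the rule itself (drop words found in every document or in only one) is the one described above.

```python
from collections import Counter

def master_word_list(doc_words):
    """doc_words maps a document id to the set of stems it contains."""
    n_docs = len(doc_words)
    doc_freq = Counter()
    for words in doc_words.values():
        doc_freq.update(set(words))  # count each word once per document
    # Keep words that appear in more than one document but not in all of them.
    return sorted(w for w, n in doc_freq.items() if 1 < n < n_docs)

docs = {
    "a": {"grant", "nation", "bank"},
    "b": {"grant", "bank", "loan"},
    "c": {"grant", "nation", "wrestler"},
}
print(master_word_list(docs))
```

Here "grant" is dropped because it appears everywhere, and "loan" and "wrestler" are dropped because each appears in a single document.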
We generate our TDM by arranging our list of all content words along the vertical axis, and a similar list
of all documents along the horizontal axis. These need not be in any particular order, as long as we keep
track of which row and column correspond to which keyword and document. For clarity we will show the
keywords as an alphabetized list.
We fill in the TDM by going through every document and marking the grid square for all the content
words that appear in it. Because any one document will contain only a tiny subset of our content word
vocabulary, our matrix is very sparse (that is, it consists almost entirely of zeroes).
Here is a fragment of the actual term-document matrix from our wire stories database:
Document: a b c d e f g h i j k l m n o p q r { 3000 more columns }
aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
amotd 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
aaliyah 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
aarp 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...
ab 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
...
zywicki 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
We can easily see if a given word appears in a given document by looking at the intersection of the
appropriate row and column. In this sample matrix, we have used ones to represent document/keyword
pairs. With such a binary scheme, all we can tell about any given document/keyword combination is
whether the keyword appears in the document.
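A binary term-document matrix of this kind can be built directly from the word sets. The sketch below uses a dense list of lists for readability, with hypothetical terms and documents; a real collection this sparse would normally be held in a sparse structure.

```python
def build_tdm(terms, docs):
    """Return (doc_ids, matrix) with one row per term and one column per document."""
    doc_ids = sorted(docs)
    matrix = [[1 if term in docs[d] else 0 for d in doc_ids] for term in terms]
    return doc_ids, matrix

terms = ["bank", "nation"]
docs = {
    "a": {"grant", "nation", "bank"},
    "b": {"grant", "bank", "loan"},
    "c": {"grant", "nation", "wrestler"},
}
doc_ids, tdm = build_tdm(terms, docs)
for term, row in zip(terms, tdm):
    print(term, row)
```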
This approach will give acceptable results, but we can significantly improve our results by applying a kind
of linguistic favoritism called term weighting to the value we use for each non-zero term/document
pair.
Not all Words are Created Equal
Term weighting is a formalization of two common-sense insights:
1. Content words that appear several times in a document are probably more meaningful than
content words that appear just once.
2. Infrequently used words are likely to be more interesting than common words.
The first of these insights applies to individual documents, and we refer to it as local weighting. Words
that appear multiple times in a document are given a greater local weight than words that appear once.
We use a formula called logarithmic local weighting to generate our actual value.
The second insight applies to the set of all documents in our collection, and is called global term
weighting. There are many global weighting schemes; all of them reflect the fact that words that
appear in a small handful of documents are likely to be more significant than words that are distributed
widely across our document collection. Our own indexing system uses a scheme called inverse
document frequency to calculate global weights.
By way of illustration, here are some sample words from our collection, with the number of documents
they appear in, and their corresponding global weights.
word       count   global weight
unit       833     1.44
cost       295     2.47
project    169     3.03
tackle     40      4.47
wrestler   7       6.22
You can see that a word like wrestler, which appears in only seven documents, is considered twice as
significant as a word like project, which appears in over a hundred.
There is a third and final step to weighting, called normalization. This is a scaling step designed to keep
large documents with many keywords from overwhelming smaller documents in our result set. It is
similar to handicapping in golf - smaller documents are given more importance, and larger documents
are penalized, so that every document has equal significance.
These three values multiplied together - local weight, global weight, and normalization factor - determine
the actual numerical value that appears in each non-zero position of our term/document matrix.
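The three weighting steps can be combined in a short routine. We name the schemes above but do not give their formulas, so the exact forms used here are assumptions: log(1 + count) for the local weight, ln(N / n) for inverse document frequency, and scaling each document column to unit length for normalization.

```python
import math

def weight_matrix(counts):
    """counts[i][j] is the raw count of term i in document j."""
    n_terms, n_docs = len(counts), len(counts[0])
    # Global weight (assumed form): ln(N / n), n = number of documents containing the term.
    idf = []
    for row in counts:
        df = sum(1 for c in row if c > 0)
        idf.append(math.log(n_docs / df) if df else 0.0)
    # Local weight (assumed form) times global weight.
    weighted = [[math.log(1 + c) * idf[i] for c in row] for i, row in enumerate(counts)]
    # Normalization (assumed form): scale every document column to unit length.
    for j in range(n_docs):
        norm = math.sqrt(sum(weighted[i][j] ** 2 for i in range(n_terms)))
        if norm > 0:
            for i in range(n_terms):
                weighted[i][j] /= norm
    return weighted

counts = [[2, 0, 1],
          [0, 1, 1],
          [1, 0, 0]]
for row in weight_matrix(counts):
    print([round(v, 3) for v in row])
```

Zero entries stay zero under this scheme, so the weighted matrix is exactly as sparse as the binary one.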
Although this step may appear language-specific, note that we are only looking at word frequencies
within our collection. Unlike the stop list or stemmer, we don't need any outside source of linguistic
information to calculate the various weights. While weighting isn't critical to understanding or
implementing LSI, it does lead to much better results, as it takes into account the relative importance of
potential search terms.
The Moment of Truth
With the weighting step done, we have done everything we need to construct a finished term-document
matrix. The final step will be to run the SVD algorithm itself. Notice that this critical step will be purely
mathematical - although we know that the matrix and its contents are a shorthand for certain linguistic
features of our collection, the algorithm doesn't know anything about what the numbers mean. This is
why we say LSI is language-agnostic - as long as you can perform the steps needed to generate a
term-document matrix from your data collection, it can be in any language or format whatsoever.
You may be wondering what the large matrix of numbers we have created has to do with the term
vectors and many-dimensional spaces we discussed in our earlier explanation of how LSI works. In fact,
our matrix is a convenient way to represent vectors in a high-dimensional space. While we have been
thinking of it as a lookup grid that shows us which terms appear in which documents, we can also think
of it in spatial terms. In this interpretation, every column is a long list of coordinates that gives us the
exact position of one document in a many-dimensional term space. When we applied term weighting to
our matrix in the previous step, we nudged those coordinates around to make the document's position
more accurate.
As the name suggests, singular value decomposition breaks our matrix down into a set of smaller
components. The algorithm alters one of these components (this is where the number of dimensions
gets reduced), and then recombines them into a matrix of the same shape as our original, so we can
again use it as a lookup grid. The matrix we get back is an approximation of the term-document matrix
we provided as input, and looks much different from the original:
a b c d e f g h i j k

aa -0.0006 -0.0006 0.0002 0.0003 0.0001 0.0000 0.0000 -0.0001 0.0007 0.0001 0.0004 ...
amotd -0.0112 -0.0112 -0.0027 -0.0008 -0.0014 0.0001 -0.0010 0.0004 -0.0010 -0.0015 0.0012 ...
aaliyah -0.0044 -0.0044 -0.0031 -0.0008 -0.0019 0.0027 0.0004 0.0014 -0.0004 -0.0016 0.0012 ...
aarp 0.0007 0.0007 0.0004 0.0008 -0.0001 -0.0003 0.0005 0.0004 0.0001 0.0025 0.0000 ...
ab -0.0038 -0.0038 0.0027 0.0024 0.0036 -0.0022 0.0013 -0.0041 0.0010 0.0019 0.0026 ...
...
zywicki -0.0057 0.0020 0.0039 -0.0078 -0.0018 0.0017 0.0043 -0.0014 0.0050 -0.0020 -0.0011 ...
Notice two interesting features in the processed data:
The matrix contains far fewer zero values. Each document has a similarity value for most
content words.
Some of the similarity values are negative. In our original TDM, this would correspond to a
document with fewer than zero occurrences of a word, an impossibility. In the processed matrix,
a negative value is indicative of a very large semantic distance between a term and a document.
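For readers who want the decompose, alter, and recombine step in symbols, it is conventionally written as below. The notation is the standard one and is our addition here: A is the weighted term-document matrix, the diagonal matrix Sigma holds the singular values in decreasing order, and k is the number of dimensions kept.

```latex
A = U \Sigma V^{T}, \qquad A_k = U_k \Sigma_k V_k^{T}
```

The component that gets altered is Sigma: all but its k largest singular values are set to zero, which is equivalent to keeping only the first k columns of U and V. The product A_k has the same shape as A, which is why it can still be used as a lookup grid, and the discarded dimensions are why its entries are no longer simple counts.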
This finished matrix is what we use to actually search our collection. Given one or more terms in a search
query, we look up the values for each search term/document combination, calculate a cumulative score
for every document, and rank the documents by that score, which is a measure of their similarity to the
search query. In practice, we will probably assign an empirically-determined threshold value to serve
as a cutoff between relevant and irrelevant documents, so that the query does not return every
document in our collection.
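The lookup-and-rank procedure can be sketched as follows. The numbers are taken from the first three columns of the aarp and ab rows in the processed matrix fragment above; the function name search and the default threshold of zero are assumptions for illustration.

```python
def search(matrix, terms, doc_ids, query, threshold=0.0):
    """Sum the matrix values of the query terms per document and rank the results."""
    rows = [matrix[terms.index(t)] for t in query if t in terms]
    scores = {d: sum(row[j] for row in rows) for j, d in enumerate(doc_ids)}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Keep only the documents that score above the cutoff.
    return [(d, s) for d, s in ranked if s > threshold]

terms = ["aarp", "ab"]
doc_ids = ["a", "b", "c"]
matrix = [[0.0007, 0.0007, 0.0004],
          [-0.0038, -0.0038, 0.0027]]
print(search(matrix, terms, doc_ids, ["aarp", "ab"]))
```

In this small example only document c survives the cutoff, because the negative values for ab pull documents a and b below zero.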
The Big Picture
Now that we have looked at the details of latent semantic indexing, it is instructive to step back and
examine some real-life applications of LSI. Many of these go far beyond plain search, and can assume
some surprising and novel guises. Nevertheless, the underlying techniques will be the same as the ones
we have outlined here.
APPLICATIONS
What Can LSI Do For Me Today?
Throughout this document, we have been presenting LSI in its role as a search tool for unstructured
data. Given the shortcomings in current search technologies, this is undoubtedly a critical application of
semantic indexing, and one with very promising results. However, there are many applications of LSI
that go beyond traditional information retrieval, and many more that extend the notion of what a search
engine is, and how we can best use it. To illustrate this, here are just a few examples of the areas where
exciting work is happening (or should be happening) with LSI:
Relevance Feedback
Most regular search engines work best when searching a small set of keywords, and very quickly
decline in recall when the number of search terms grows high. Because LSI shows the reverse
behavior (the more it knows about a document, the better it is at finding similar ones), a latent
semantic search engine can allow a user to create a 'shopping cart' of useful results, and then go
out and search for further results that most closely match the stored ones. This lets the user do
an iterative search, providing feedback to guide the search engine towards a useful result.
Archivist's Assistant
In introducing LSI we contrasted it with more traditional approaches to structuring data,
including human-generated taxonomies. Given LSI's strength at partially structuring
unstructured data, the two techniques can be used in tandem. This is potentially a very powerful
combination - it would allow archivists to use their time much more efficiently, enhancing,
labeling and correcting LSI-generated categories rather than having to index every document
from scratch. In the next section, we will look at a data visualization approach that could be
used in conjunction with LSI to create a sophisticated, interactive application for archivist use.
Automated Writing Assessment
By comparing student writing against a large data set of stored essays on a given topic, LSI
tools can analyze submitted assignments and highlight content areas that the student essay
didn't cover. This can be used as a kind of automated grading system, where the assignment is
compared to a pool of essays of known quality, and given the closest matching grade. We
believe a more appropriate use of the technology is a feedback tool to guide the student in
revising his essay, and suggest directions for further study.
[ More info and demo: http://www-psych.nmsu.edu/essay/ ]
Textual Coherence:
LSI can look at the semantic relationships within a text to calculate the degree of topical
coherence between its constituent parts. This kind of coherence correlates well with readability
and comprehension, which suggests that LSI might be a useful feedback tool in writing
instruction (along the lines of existing readability metrics).
[ source: http://www.knowledge-technologies.com/papers/abs-dp2.foltz.html ]
Information Filtering:
LSI is potentially a powerful customizable technology for filtering spam (unsolicited electronic
mail). By training a latent semantic algorithm on your mailbox and known spam messages, and
adjusting a user-determined threshold, it might be possible to flag junk mail much more
efficiently than with current keyword based approaches. The same may apply to common
Microsoft Outlook computer viruses, which tend to share a basic structure.
LSI could also be used to filter newsgroup and bulletin board messages.
[ source: http://www-psych.nmsu.edu/~pfoltz/cois/filtering-cois.html ]
MULTI-DIMENSIONAL SCALING
The Big Picture
In our discussion of information retrieval, we have been talking about searches that give textual results -
we enter queries and phrases as text, and look at results in the form of a list. In this section, we will be
looking at an annex technology to LSI called multi-dimensional scaling, or MDS, which lets us
visualize complex data and interact with it in a graphical environment. This approach lets us use the
advanced pattern-recognition software in our visual cortex to notice things in our data collection that
might not be apparent from a text search.
Multi-dimensional scaling, or MDS, is a method for taking a two- or three-dimensional 'snapshot' of a
many-dimensional term space, so that dimensionally-challenged human beings can see it. Where before
we used singular value decomposition to compress a large term space into a few hundred dimensions,
here we will be using MDS to project our term space all the way down to two or three dimensions, where
we can see it.
The reason for using a different method (and another three-letter acronym) has to do with how the two
algorithms work. SVD is a perfectionist algorithm that finds the best projection onto a given space.
Finding this projection requires a lot of computing horsepower, and consequently a good deal of time.
MDS is a much faster, iterative approach that gives a 'good-enough' result close to the optimum one we
would get from SVD. Instead of deriving the best possible projection through matrix decomposition, the
MDS algorithm starts with a random arrangement of data, and then incrementally moves it around,
calculating a stress function after each perturbation to see if the projection has grown more or less
accurate. The algorithm keeps nudging the data points until it can no longer find lower values for the
stress function. What this produces is a scatter plot of the data, with dissimilar documents far apart, and
similar ones closer together. Some of the actual distances may be a little bit off (since we are using an
approximate method), but the general distribution will show good agreement with the actual similarity
data.
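The nudge-and-check loop described above can be sketched in plain Python. This is a toy under stated assumptions, not the implementation behind our graphs: stress is taken to be the sum of squared differences between plotted and target distances, one point is perturbed at random per step, and the loop runs for a fixed number of steps instead of testing for convergence.

```python
import math
import random

def stress(points, target):
    """Sum of squared gaps between plotted distances and target distances."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += (math.dist(points[i], points[j]) - target[i][j]) ** 2
    return total

def mds(target, dims=2, steps=5000, step_size=0.1, seed=0):
    rng = random.Random(seed)
    n = len(target)
    # Start from a random arrangement of the documents.
    points = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in range(n)]
    best = stress(points, target)
    for _ in range(steps):
        i = rng.randrange(n)
        old = points[i][:]
        points[i] = [x + rng.uniform(-step_size, step_size) for x in old]
        new = stress(points, target)
        if new < best:
            best = new          # keep the nudge: the projection got more accurate
        else:
            points[i] = old     # undo the nudge
    return points, best

# Hypothetical semantic distances: documents 0 and 1 are similar, 2 is far from both.
target = [[0, 1, 4],
          [1, 0, 4],
          [4, 4, 0]]
points, final_stress = mds(target)
print(final_stress)
```

Because a nudge is kept only when it lowers the stress, the result can never be worse than the random starting arrangement, though it is not guaranteed to reach the best possible layout.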
To show what this kind of projection looks like, here is an example of an MDS graph of approximately
3000 AP wire stories from February, 2002. Each square in the graph represents a one-page news story
indexed using the procedures described earlier:
Figure 1: MDS of News Wire Stories for Feb. 2002
MDS graph of 3,000 AP wire stories from the month of February, 2002. Each square is a single story. Colored
areas indicate topic clusters discussed below.
2002 NITLE
As you can see, the stories cluster towards the center of the graph, and thin out towards the sides. This
is not surprising given the nature of the document collection - there tend to be many news stories about
a few main topics, with a smaller number of stories about a much wider variety of topics.
The actual axes of the graph don't mean anything - all that matters is the relative distance between any
two stories, which reflects the semantic distance LSI calculates from word co-occurrence. We can
estimate the similarity of any two news stories by eyeballing their position in the MDS scatter graph. The
further apart the stories are, the less likely they are to be about the same topic.
Note that our distribution is not perfectly smooth, and contains 'clumps' of documents in certain regions.
Two of these clumps are highlighted in red and blue on the main MDS graph. If we look at the articles
within the clumps, we find out that these document clusters consist of a set of closely related articles.
For example, here is the result of zooming in on the blue cluster from Figure 1:
Figure 2: Closeup of Cluster 1
Closeup of document cluster shown in blue in Figure 1. Text shows first 30 characters of story headline. Note how
articles are grouped by topic.
2002 NITLE
As you can see, nearly all of the articles in this cluster have to do with the Israeli-Palestinian conflict.
Because the articles share a great many keywords, they are bunched together in the same semantic
space, and form a cluster in our two-dimensional projection. If we examine the red cluster, we find a
similar set of related stories, this time about the Enron fiasco:
Figure 3: Closeup of Cluster 2
Closeup of red document cluster from Figure 1. Text shows first 30 characters of story headline.
2002 National Institute for Technology in Liberal Education
Notice that these topic clusters have formed spontaneously from word co-occurrence patterns in our
original set of data. Without any guidance from us, the unstructured data collection has partially
organized itself into categories that are conceptually meaningful. At this point, we could apply more
refined mathematical techniques to automatically detect boundaries between clusters, and try to sort our
data into a set of self-defined categories. We could even try to map different clusters to specific
categories in a taxonomy, so that in a very real sense unstructured data would be organizing itself to fit
an existing framework.
This phenomenon of clustering is a visual expression in two dimensions of what LSI does for us in a
higher number of dimensions - it reveals preexisting patterns in our data. By graphing the relative
semantic distance between documents, MDS reveals similarities on the conceptual level, and takes the
first step towards organizing our data for us in a useful way.
PUTTING MDS AND LSI TO WORK
Creative Approaches
Much of the work in applied MDS has come from the fields of advertising and cognitive psychology
(where it is also known as perceptual mapping). Researchers in both fields use the technique to
transform questionnaires about relative preferences and similarities into a visual representation using the
scaling techniques we have outlined. These techniques do not appear to have been applied to linguistic
data until relatively recently.
This illustrates a common theme in latent semantic research - combining familiar techniques from
different disciplines in a novel way to tackle problems in data retrieval. This kind of creative juxtaposition
is one of the things that makes LSI interesting to work on, and levels the playing field between major
research institutions and liberal arts colleges. One does not need an enormous supercomputer or
advanced mathematical knowledge to do interesting work with these techniques. In fact, because LSI
research draws on pure and applied mathematics, linguistics, computer science, psychology, information
retrieval, and the social sciences, what really matters is breadth of knowledge. There are likely to be
connections further afield that remain to be discovered.
With this eclectic background in mind, here are some potential applications of semantic indexing coupled
with MDS data visualization:
1. Archive Management Tools:
We already mentioned the potential use of LSI as an archivist's assistant, using LSI to highlight
content patterns in a data collection, and more traditional taxonomies to formalize and heighten
those patterns. One intuitive method for creating such tools is to display data visually using
MDS, and allow for human feedback. An interactive program using multi-dimensional scaling
would allow an archivist to graphically manipulate data, draw boundaries between clusters,
examine content relationships and add classifiers using an intuitive, click-and-drag type
interface. What's more, different expert users would be able to use MDS to generate their own
personal view of a data set, and then reconcile or combine those views.
2. Concept Maps:
Concept maps take this notion of interactivity and classification further, letting users manipulate
and edit LSI-generated views of a data collection to produce a spatial map of topics and
concepts. By drawing connections between items and moving them around, users can create
their own view of a data collection. These views can be 'untangled' using mathematical
techniques to create clear, visually direct concept maps. These maps can be shared, combined,
and compared with others, making a unique pedagogical or research tool.
3. Bioinformatics:
The same LSI techniques we use to find similarities in language have enormous potential in the
field of bioinformatics. Both DNA and protein molecules consist of long strings of biochemical
'words'. Finding and understanding patterns in those words is one of the major research
problems in modern biology. Using the tools we describe would make it possible to detect and
visualize such patterns, and conduct important basic research in this nascent field.
Further Reading
LSI Journal Articles
The following are journal articles relevant to the material covered in this overview. We welcome
suggestions for expanding this list.
1. Dumais, S. T., Landauer, T. K. and Littman, M. L. (1996) "Automatic cross-linguistic information
retrieval using Latent Semantic Indexing." In SIGIR'96 - Workshop on Cross-Linguistic
Information Retrieval, pp. 16-23, August 1996.
Available online
2. Foltz, P. W., Kintsch, W., and Landauer, T. K. (1998). The Measurement of Textual Coherence
with Latent Semantic Analysis. Discourse Processes, 25, 285-307.
Available online
3. Foltz, P. W. (1990) "Using Latent Semantic Indexing for Information Filtering". In R. B. Allen
(Ed.) Proceedings of the Conference on Office Information Systems, Cambridge, MA, 40-47.
Available online
4. Landauer, T. K., Foltz, P. W., and Laham, D. (1998). Introduction to Latent Semantic Analysis.
Discourse Processes, 25, 259-284.
Available online
5. Landauer, T. K. and Dumais, S. T. (1997) - html only, "Solution to Plato's Problem: The Latent
Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge."
Psychological Review, 1997, 104 (2), 211-240.
Available online
6. Person, N. K., Graesser, A. C., Bautista, L., Mathews, E. C., & the Tutoring Research Group
(2001). Evaluating Student Learning Gains in Two Versions of AutoTutor. In J. D. Moore, C. L.
Redfield, & W. L. Johnson (Eds.) Artificial intelligence in education: AI-ED in the wired and
wireless future (pp. 286-293). Amsterdam, IOS Press.
Available online
7. Rehder, B., Schreiner, M. E., Wolfe, M. B., Laham, D., Landauer, T. K., & Kintsch, W. (1998).
Using Latent Semantic Analysis to assess knowledge: Some technical considerations. Discourse
Processes, 25, 337-354.
8. Robinson, C.S. & Burgess, C. Vocabulary Performance of HAL and LSA Using a Standardized
Performance Measure. Society for Computers in Psychology, November 15, 2001.
Available online
LSI Websites
Links to websites with comprehensive bibliographies on LSI and related techniques.
1. Latent Semantic Indexing Website (University of Tennessee). Includes links to sites by Michael
Berry and Susan Dumais, both of whom have done extensive work in LSI.
2. Telcordia Technologies LSI links page. Comprehensive list of articles. Telcordia holds a patent in
LSI.
3. Knowledge Analysis Technologies. Company that makes an LSI product for automated essay
assessment.
4. AuthorTutor. Papers related to AuthorTutor.
