ScienceDirect
Procedia Technology 25 (2016) 310–317
Global Colloquium in Recent Advancement and Effectual Researches in Engineering, Science and Technology (RAEREST 2016)
An efficient privacy preserving search scheme with access control for cloud data centers
Tresa Mary George V, Shamna S, Jubilant J Kizhakkethottam
Department of Computer Science and Engineering, Musaliar College of Engineering and Technology, Pathanamthitta 689653, India
Abstract
The internet and the emergence of social networks produce terabytes of data every day. In this big data scenario, the ability to outsource data to a cloud storage facility saves data management and storage costs. Some major challenges with this scheme are providing security and ensuring the privacy of the outsourced data. Although data security can be achieved through encryption, searching on encrypted data becomes a complex task. The proposed work suggests an efficient searching scheme for encrypted cloud data based on hierarchical clustering of documents. The hierarchical clustering method preserves the semantic relationship between the documents in the encrypted domain to speed up the search process. Consequently, the proposed system has linear computational complexity during the search phase in response to an exponential increase in the number of documents. The system also ensures data privacy by giving the different types of users only limited access to the documents through access control mechanisms, resulting in more secure data storage in the cloud.
© 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the organizing committee of RAEREST 2016
Keywords: searchable encryption; multi-keyword search; hierarchical clustering; access control
Corresponding author. E-mail address: vtresamg@gmail.com
ISSN 2212-0173
doi:10.1016/j.protcy.2016.08.112
Introduction
A fundamental application of cloud computing is the ability to outsource remote data to external cloud servers to enable scalable data storage. The cloud server can provide a huge storage space and high computational power [1]. Accordingly, enterprises and users who own a large amount of data can overcome their hardware limitations. As this technique becomes more and more popular, the data volume in cloud storage facilities is growing dramatically.
A major concern regarding the use of cloud computing for data storage is that the outsourced data may contain sensitive information such as photos, emails and bank statements. If the data is stored in a public cloud that is accessible to several other people without an efficient protection mechanism, it can lead to severe privacy and confidentiality violations [2]. The traditional way to protect sensitive data is encryption. The documents are encrypted before they are outsourced to the cloud. This, however, introduces further complexity during the search operation on encrypted data when legitimate users need access to those documents. Many researchers have investigated this issue recently and proposed several ciphertext search schemes based on cryptographic techniques [3], [4]. However, these methods need extensive computation and suffer from high time complexity. Hence, they are not suitable for a big data environment [5]. Another major drawback is that the relationship between the documents is concealed during the encryption process. Maintaining such a relationship is important as it represents the properties of the documents.
It is also necessary to provide controlled access to the outsourced cloud data for different classes of users. The system must prevent unauthorized users from uploading corrupted documents to the cloud server. For example, consider a university cloud in which the student mark lists are stored. In such a scenario, the students must be prevented from uploading their own mark lists and thereby overwriting the original copy. To prevent this, the system provides only download privileges to the student users of the cloud. Proper implementation of access control mechanisms ensures such limited access for the different classes of cloud users.
The proposed system uses a searching scheme based on multi-keyword ranked search. In addition, a hierarchical clustering method is used to cluster the documents based on a relevance score. There is also a limit on the maximum size of each cluster. If the size of a cluster exceeds this limit, the cluster is further divided into sub-clusters until the size of each cluster falls below the threshold value. During the search phase, the system iteratively determines the most relevant cluster. Only the documents in that cluster need to be searched, which reduces the overall search time.
Related works
Many researchers have proposed methods for search on encrypted data in the cloud. Some of them and their drawbacks are discussed below.
In the method proposed by Song et al. [6], each word in the document is encrypted independently. This requires scanning the entire data collection word by word. The major drawback of this method is the high search cost that results from scanning the entire document. Cash et al. [7] proposed a symmetric searchable encryption scheme. Though it provides high efficiency for large databases, it lacks a ranking mechanism. If a large number of documents contain the searched keyword, the user has to manually select what they actually want, which in turn increases the overall search time.
Cao et al. [8] proposed an architecture that performs multi-keyword search and also supports result ranking by using the k-nearest neighbor algorithm. However, the search time of this method grows exponentially in response to an exponentially increasing size of the document collection. Sun et al. [9] proposed a new architecture. Though it provides better efficiency, the relevance between the documents is ignored and hence it does not return the most relevant results.
Tarik Moataz and Abdullatif Shikfa [10] proposed a system for searching multiple keywords over encrypted data using Boolean Symmetric Searchable Encryption (BSSE). It uses the Gram-Schmidt process to optimize the search process. It considers arbitrary boolean expressions such as conjunctions and disjunctions of keywords and their complements.
The above-mentioned searching schemes retrieve files only on an exact match of the keyword. With any typos or inconsistencies in the format, the required documents are not returned. J. Li et al. [11] proposed a wildcard-based technique to create efficient fuzzy keyword sets that can be used for matching relevant documents. Whenever the exact match search fails, the search result is provided based on the fuzzy keyword set.
System model and problem formulation
The proposed system uses a vector space model in which every document is represented by a vector. Every document can be seen as a point in a high-dimensional space. The documents are classified into categories by using a clustering method. The proposed system uses a hierarchical clustering index, i.e. a hierarchy of clusters at different levels. Each cluster has a constraint on the minimum relevance score between the documents in that cluster. When a new document is added to a cluster, the constraint may be broken. In such a case, a new cluster center is added to the system. After that, all the cluster centers are reselected and all the documents are reassigned. The maximum size of a cluster is also fixed for each level. If the size of a cluster exceeds the maximum limit, that cluster is divided into multiple sub-clusters. When a search is performed, only the documents in the relevant clusters need to be searched, which reduces the overall search time.
During the search phase, the relevance score between the search query and the cluster centers of the first-level index is computed. The cluster center with the maximum relevance score is selected, and this process is repeated iteratively for its children in the next-level clusters until the smallest cluster in the lowest level is found. If this cluster does not contain the desired document, the system traces back to the parent of the smallest cluster. This process is repeated until the desired document is found or the root cluster is reached.
The system architecture is composed of four main entities, as shown in Fig. 1. They are the data owner, the data user, the cloud server and the cloud manager. The data owner is the module responsible for collecting documents, performing the encapsulation, building the document index and outsourcing the encrypted documents to the cloud server. The data user is the consumer of the documents and must have the necessary authorization before accessing this data. The cloud server is the entity that provides a huge storage space and the necessary computational resources for the ciphertext search. The cloud manager is responsible for ensuring access control. It blocks all unauthorized requests for the data by checking the privacy settings of each user. When the cloud server receives a request for a document, this request is verified by the cloud manager. Upon successful verification, the cloud server returns the required documents.
Fig. 1. System architecture
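The cloud manager's role can be illustrated with a minimal sketch. The role names, the `PERMISSIONS` table and the `handle_request` function are hypothetical and chosen to match the university example from the introduction; the paper does not specify how privacy settings are stored.

```python
# Hypothetical role table: students may only download; owners may also upload.
PERMISSIONS = {
    "university": {"upload", "download"},
    "college": {"upload", "download"},
    "student": {"download"},
}


def verify_request(role, action):
    """Cloud manager check: is this action allowed for this class of user?"""
    return action in PERMISSIONS.get(role, set())


def handle_request(store, role, action, doc_id, ciphertext=None):
    """Cloud server entry point; every request is verified before it touches the store."""
    if not verify_request(role, action):
        raise PermissionError(f"role {role!r} is not allowed to {action}")
    if action == "upload":
        store[doc_id] = ciphertext
        return None
    return store[doc_id]


store = {}
handle_request(store, "university", "upload", "marklist-1", b"encrypted-bytes")
downloaded = handle_request(store, "student", "download", "marklist-1")
```

A student upload raises `PermissionError` before the store is modified, so the original mark list cannot be overwritten.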
Implementation details
The proposed system uses the Multi-keyword Ranked Search over Encrypted data based on Hierarchical Clustering Index (MRSE-HCI) scheme, in which the vector space model is adopted from Multi-keyword Ranked Search over Encrypted data (MRSE) [8] and the indexing is based on the Hierarchical Indexing Structure (HCI) [12]. The detailed description is as follows. Every document is indexed by a vector, and each dimension of the vector refers to a keyword. The value of each dimension indicates whether the keyword appears in the particular document. The query is represented in the same way, as a vector. The lengths of the document vectors are normalized, and hence the distance between points in the n-dimensional space reflects the relevance of the corresponding documents. During the search phase, the cloud server component computes the relevance score between the query vector and a document vector by computing their inner product. When the documents are stored in the cloud in encrypted form, the semantic relationship between the documents would normally be lost. The proposed system, however, uses a clustering method. In the n-dimensional space, the points of highly relevant documents are very close to each other, so the semantic relationship between the documents is preserved.
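The vector space model described above can be sketched in a few lines of plain Python. The dictionary, the document names and the helper functions are illustrative assumptions, and the sketch works on unencrypted vectors only.

```python
import math


def document_vector(keywords_present, dictionary):
    """Binary keyword vector over the dictionary, scaled to unit length."""
    raw = [1.0 if word in keywords_present else 0.0 for word in dictionary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw


def relevance(query_vec, doc_vec):
    """Relevance score as the inner product of two equal-length vectors."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


# Hypothetical keyword dictionary and documents.
dictionary = ["cloud", "encryption", "marks", "search"]
docs = {
    "d1": {"cloud", "encryption", "search"},
    "d2": {"marks"},
}
doc_vectors = {name: document_vector(words, dictionary) for name, words in docs.items()}
query = document_vector({"encryption", "search"}, dictionary)

# Rank documents by descending relevance to the query.
ranked = sorted(doc_vectors, key=lambda name: relevance(query, doc_vectors[name]), reverse=True)
```

Here `d1` shares two keywords with the query and `d2` shares none, so `d1` is ranked first.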
When the volume of data in the cloud grows dramatically, traditional search approaches become very inefficient and their search time grows exponentially. To improve the search efficiency, a hierarchical clustering method is used. The hierarchical approach clusters the documents based on the relevance score at different levels. When the size of a cluster reaches the maximum cluster size threshold, the system partitions the cluster into sub-clusters until the criterion is satisfied. When the documents are uploaded, the data owner also builds an encrypted index. A symmetric-key encryption algorithm is used, and the documents are encrypted using some random numbers and a secret key. When the data user needs a particular document, a query is submitted to the cloud server. The cloud server returns the target document to the data user.
The functions of the different components are described below.
Keygen: This function generates the secret key used to encrypt the index and the documents. For this, a bit vector in which each element is 0 or 1 and two invertible matrices M1 and M2 whose elements are random integers are generated.
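A minimal sketch of such a key generation step is shown below. The `size` parameter, the small entry range and the construction of the matrices as a product of unit-triangular factors are assumptions made so that invertibility is guaranteed; the paper does not state how M1 and M2 are sampled. Python's `random` module is used for readability and is not suitable for real key material.

```python
import random


def unit_triangular(size, rng, lower):
    """Triangular integer matrix with ones on the diagonal (determinant 1)."""
    matrix = [[0] * size for _ in range(size)]
    for i in range(size):
        matrix[i][i] = 1
        for j in range(size):
            if (j < i) if lower else (j > i):
                matrix[i][j] = rng.randint(-3, 3)
    return matrix


def multiply(a, b):
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def invertible_matrix(size, rng):
    """Product of a lower and an upper unit-triangular matrix, so det = 1."""
    return multiply(unit_triangular(size, rng, True), unit_triangular(size, rng, False))


def determinant(m):
    """Cofactor expansion; adequate for the small sizes used in this sketch."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for col, value in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * determinant(minor)
    return total


def keygen(size, seed=None):
    rng = random.Random(seed)  # illustration only, not cryptographically secure
    bits = [rng.randint(0, 1) for _ in range(size)]
    return bits, invertible_matrix(size, rng), invertible_matrix(size, rng)
```

Calling `keygen(5, seed=1)` returns a 5-element bit vector and two 5 x 5 integer matrices whose determinant is 1, so both are invertible.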
Index: This phase generates the encrypted index by using the secret key generated above. The clustering process also takes place in this phase. The index algorithm is as follows:
1. A tokenizer and a parser are used to extract all the keywords present in the document.
2. The documents are transformed into a collection of Document Vectors (DV).
3. A Quality Hierarchical Clustering (QHC) method is used to generate the information about Documents Classification (DC) and the collection of Cluster Centers Vectors (CCV).
4. The data owner performs the dimension-expanding and vector-splitting procedures on every document vector.
   a. During the dimension-expanding procedure, each vector in CCV is extended to a longer vector, in which the values in the added dimensions are randomly generated integers and the last dimension is set to a fixed value.
   b. During the vector-splitting procedure, every extended document vector is split into two vectors, using the bit vector generated above as the splitting indicator.
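The two procedures in step 4 can be sketched as follows. The splitting rule shown is the one commonly used in MRSE-style schemes (copy on one indicator value, random additive shares on the other, with the opposite rule for queries); the paper does not spell out its rule, so this is an assumption. The constant 1 in the last dimension, the number of added dimensions and all function names are likewise assumptions, and the matrix multiplication by M1 and M2 is omitted.

```python
import random


def expand(vector, extra_dims, rng):
    """Append random integers, then a final constant dimension (the value 1 is an assumption)."""
    return list(vector) + [rng.randint(1, 9) for _ in range(extra_dims)] + [1]


def split(vector, indicator, rng, split_on):
    """Copy each value into both halves, except where the indicator bit equals
    split_on: there the value is divided into two random additive shares."""
    first, second = [], []
    for value, bit in zip(vector, indicator):
        if bit == split_on:
            share = rng.randint(-9, 9)
            first.append(share)
            second.append(value - share)
        else:
            first.append(value)
            second.append(value)
    return first, second


def inner(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(7)
indicator = [1, 0, 1, 1, 0]                   # secret bit vector from Keygen
doc = expand([1, 0], extra_dims=2, rng=rng)   # 2 keywords become 5 dimensions
query = [0, 1, 3, 2, 1]                       # an already-extended query vector
doc_a, doc_b = split(doc, indicator, rng, split_on=1)        # index side
query_a, query_b = split(query, indicator, rng, split_on=0)  # query side, opposite rule

# The two partial inner products add up to the original score.
assert inner(doc_a, query_a) + inner(doc_b, query_b) == inner(doc, query)
```

Because each position is split on exactly one side, the sum of the two partial inner products equals the inner product of the unsplit vectors, which is what lets the server rank results without seeing either original vector.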
Encryption: The plain document set D is encrypted using any secure symmetric encryption algorithm, such as AES. The encrypted documents are then outsourced to the cloud.
Trapdoor: When a user submits a query, the cloud manager analyzes the query and verifies that the request comes from an authenticated user. The keywords in the query are analyzed with the help of the dictionary DW and a query vector QV is generated, which is then extended to a longer vector.
Search: When the cloud server receives the query vector, the relevance scores between the query vector and the index vectors of the clusters are computed in a hierarchical manner. It finally chooses the cluster with the maximum relevance score as the target cluster and searches it for the required document. If the document is not found, it backtracks and chooses a different cluster with the next highest score. This process is repeated until the target document is found.
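The backtracking behaviour of the search step can be sketched on plaintext vectors. The `Cluster` class, the `is_target` predicate and the `trace` list are hypothetical helpers introduced for illustration; in the actual scheme the scores would be computed on encrypted vectors.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    center: list                                    # cluster center vector
    children: list = field(default_factory=list)    # sub-clusters (empty for a leaf)
    documents: dict = field(default_factory=dict)   # leaf only: doc id -> vector


def score(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(cluster, query, is_target, trace):
    """Try the best-scoring child first; fall back to the next one if it fails."""
    if not cluster.children:                        # leaf: scan its documents
        trace.append(cluster)
        for doc_id in cluster.documents:
            if is_target(doc_id):
                return doc_id
        return None
    ranked = sorted(cluster.children, key=lambda c: score(query, c.center), reverse=True)
    for child in ranked:
        found = search(child, query, is_target, trace)
        if found is not None:
            return found                            # stop as soon as the target is found
    return None                                     # every child failed: caller backtracks


leaf_a = Cluster(center=[1.0, 0.0], documents={"a1": [1.0, 0.0], "a2": [0.9, 0.1]})
leaf_b = Cluster(center=[0.0, 1.0], documents={"b1": [0.0, 1.0]})
root = Cluster(center=[0.5, 0.5], children=[leaf_a, leaf_b])

trace = []
result = search(root, [0.8, 0.2], lambda doc_id: doc_id == "b1", trace)
```

In this example the query scores highest against `leaf_a`, which does not hold `b1`, so the search backtracks and visits `leaf_b` next.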
Decryption: This component is used by the data user to decrypt the returned document. The secret key is delivered to the user through a secure mechanism.
In the proposed system, the concept of coordinate matching is used as the relevance measure. The relevance score between document d_i and the query is determined as described in Equation (1).
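For binary keyword vectors, coordinate matching counts the query keywords that occur in a document, which is the inner product of the two vectors. One form consistent with the inner-product description in the implementation details is sketched below; the symbols D_i and Q (the vectors of document d_i and of the query q over an n-keyword dictionary) are assumed names, not necessarily the paper's notation.

```latex
\mathrm{Score}(d_i, q) \;=\; D_i \cdot Q \;=\; \sum_{j=1}^{n} D_i[j]\, Q[j]
```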
Some of the most widely used clustering algorithms are K-means and K-medoids. In these algorithms, the value of k is fixed in advance. However, in a big data scenario, it is impossible to predict the value of k in advance, so the clusters have to be generated dynamically. Hence, a dynamic K-means algorithm is used. To keep the clusters dense and compact, a minimum relevance threshold is maintained. During the clustering process, the relevance score between each document and its cluster center is computed, and if this value is less than the minimum threshold, a new cluster is added and all the documents are reassigned accordingly. This procedure is executed iteratively until a stable value of k is reached.
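A minimal sketch of this dynamic K-means idea on plaintext vectors follows. Seeding the first cluster with the first document, seeding each new cluster at the worst-fitting document, and the `max_rounds` cap are assumptions made for illustration; the paper does not give these details.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def relevance(a, b):
    return sum(x * y for x, y in zip(a, b))


def assign(docs, centers):
    """Index of the most relevant center for every document."""
    return [max(range(len(centers)), key=lambda c: relevance(d, centers[c])) for d in docs]


def recompute(docs, labels, centers):
    """New center = normalized mean of the members; an empty cluster keeps its center."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [d for d, label in zip(docs, labels) if label == c]
        if not members:
            new_centers.append(old)
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        new_centers.append(normalize(mean))
    return new_centers


def dynamic_kmeans(docs, min_relevance, max_rounds=100):
    docs = [normalize(d) for d in docs]
    centers = [docs[0]]                       # start with a single cluster
    labels = assign(docs, centers)
    for _ in range(max_rounds):
        centers = recompute(docs, labels, centers)
        labels = assign(docs, centers)
        # Find the document that fits its own cluster worst.
        worst = min(range(len(docs)), key=lambda i: relevance(docs[i], centers[labels[i]]))
        if relevance(docs[worst], centers[labels[worst]]) >= min_relevance:
            break                             # threshold holds everywhere: k is stable
        centers.append(docs[worst])           # otherwise seed a new cluster there
        labels = assign(docs, centers)
    return centers, labels


points = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
centers, labels = dynamic_kmeans(points, min_relevance=0.9)
```

With a threshold of 0.9 the four points end up in two clusters; with a looser threshold such as 0.5 a single cluster already satisfies the constraint.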
To search for a particular document, the cloud server first needs to find the cluster that best matches the query. The cloud server uses the cluster index and the iterative procedure described below to find the top-matched cluster.
1. The cloud server first computes the relevance score between the query and the encrypted vectors of the first-level cluster centers in the cluster index, as described in Equation (1). It then chooses the ith cluster center, which has the highest score.
2. For each child of the selected cluster center, the cloud server computes the relevance score between the query and the encrypted vector of that child cluster center, and keeps the child cluster center with the top score.
3. The above procedure is iterated until the final cluster center in the last level l is reached.
Results and analysis
The efficiency of the system was tested with a two-level clustering model. The number of operations needed for the entire search process can be computed as described in Equation (2). To increase the search efficiency, the system uses a static dictionary of keywords that do not contribute effectively to the search process. Terms such as "for" and "and" in the search query are removed and a modified query vector is constructed. The subsequent comparisons are made only with the modified query vector. Let x denote the size of the static dictionary, w the number of query keywords, u the number of keywords in the modified query vector, n the total number of documents in the document collection, k the number of categories in the first-level cluster, and t the average number of documents in the subsequent cluster.
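The construction of the modified query can be sketched as a simple filter. The contents of `STATIC_DICTIONARY` beyond "for" and "and" are hypothetical, as is the whitespace tokenization.

```python
# Hypothetical static dictionary; only "for" and "and" are named in the text.
STATIC_DICTIONARY = {"for", "and", "the", "of", "in"}


def modified_query(query, static_dictionary=STATIC_DICTIONARY):
    """Drop the terms that appear in the static dictionary."""
    keywords = query.lower().split()
    return [word for word in keywords if word not in static_dictionary]


original = "marks for semester and year".split()
reduced = modified_query("marks for semester and year")
w, u = len(original), len(reduced)   # w query keywords, u keywords after filtering
```

In this example w is 5 and u is 3, so each later comparison involves fewer keywords.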
The number of operations required by a system without any clustering technique is described in Equation (3). During the search step, the existing system compares the query vector with the entire document collection, whereas the proposed system compares it only with the relevant cluster, leading to a significant reduction in search time.
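The difference can be illustrated with a rough comparison count. The two formulas below are a simplified assumption (one keyword-wise comparison per vector position, a two-level index, no backtracking) and are not the paper's Equations (2) and (3); the symbols n, k, t and u follow the definitions given in the text.

```python
def flat_comparisons(n, u):
    """Without clustering: the query is compared with all n document vectors."""
    return n * u


def clustered_comparisons(k, t, u):
    """Two-level index: k first-level centers, then t documents in the chosen cluster."""
    return (k + t) * u


flat = flat_comparisons(n=5000, u=3)
clustered = clustered_comparisons(k=50, t=100, u=3)
```

For 5000 documents grouped into 50 clusters of about 100 documents each, this simplified count drops from 15000 to 450 comparisons.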
To test the performance of the proposed system, an experimental setup was built as follows. An application simulating the activities of a university was created. The cloud storage platform for the system was provided by the Google public cloud. The data owners of the system are:
- The university, which owns the mark lists and certificates of all former and current students.
- The college, which uploads the sessional marks and other student-specific documents of all the students.
The dataset for the performance analysis was built from the above-mentioned types of documents. The system was tested with a linear increase in the number of documents and the corresponding search times were estimated. It is evident from Fig. 2 that the proposed system outperforms the existing system without clustering. The system was also tested with an exponential growth in the number of documents. Fig. 3 shows that the proposed system with clustering has a linear growth in search time, while the system without clustering has an exponential growth in search time.
Fig. 2. Comparison of search time with a linear growth in documents collection. (Plot: x-axis, number of documents (x100), 10 to 50; y-axis, search time, 0 to 12000; series: without clustering, with hierarchical clustering.)
Fig. 3. Comparison of search time with an exponential growth in documents collection. (Plot: x-axis, number of documents, 148, 403, 1096, 2980, 8103; y-axis, search time, 0 to 20000; series: without clustering, with clustering.)
A dedicated module called the cloud manager is added to the proposed system to verify the authenticity of the arriving requests. To ensure the confidentiality and privacy of the documents stored in the cloud server, all the documents are encrypted using a symmetric encryption algorithm before they are uploaded to the cloud. In addition, the cloud storage provider performs a two-level encryption on the documents and returns a public key to the cloud manager. All the keys are managed by the cloud manager, and only people with sufficient access rights can decrypt a document. Consequently, the system ensures that even if an intruder accesses a document directly from the cloud server, they cannot obtain the plain text of the document.
Conclusion and future work
The problem of searching and securely accessing encrypted data in the cloud is analyzed. Maintaining the semantic relationship between the documents reduces the search time for a document. The proposed work is based on multi-keyword ranked search over encrypted data. The use of a hierarchical clustering method to cluster the documents preserves the semantic relationship between the documents. The experimental results show that the proposed system has a linear growth in time complexity when the size of the document collection increases exponentially. It also implements a dedicated module named the cloud manager to ensure the privacy of cloud data by granting different classes of users only limited access to the document collection. As future work, more secure algorithms can be developed for improving the privacy of the uploaded documents. More secure access control schemes, such as Dynamic Information Flow Tracking (DIFT) techniques [13] with capabilities to recognize advanced vulnerabilities, can also improve the overall performance of the system.
References
[1] Xian, C., Lu, Y. H., & Li, Z. (December). Adaptive computation offloading for energy conservation on battery-powered systems. In Parallel and Distributed Systems, International Conference on. IEEE.
[2] Li, H., Dai, Y., Tian, L., & Yang, H. Identity-based authentication for cloud computing. In Cloud Computing. Springer Berlin Heidelberg.
[3] Sun, W., Wang, B., Cao, N., Li, M., Lou, W., Hou, Y. T., & Li, H. (May). Privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking. In Proceedings of the ACM SIGSAC Symposium on Information, Computer and Communications Security. ACM.
[4] Wang, B., Yu, S., Lou, W., & Hou, Y. T. (April). Privacy-preserving multi-keyword fuzzy search over encrypted data in the cloud. In INFOCOM, Proceedings IEEE. IEEE.
[5] Sebastian, L. R., Babu, S., & Kizhakkethottam, J. J. (February). Challenges with big data mining: A review. In Soft-Computing and Networks Security (ICSNS), International Conference on. IEEE.
[6] Song, D. X., Wagner, D., & Perrig, A. Practical techniques for searches on encrypted data. In Security and Privacy (S&P), Proceedings, IEEE Symposium on. IEEE.
[7] Cash, D., Jaeger, J., Jarecki, S., Jutla, C., Krawczyk, H., Rosu, M. C., & Steiner, M. (October). Dynamic searchable encryption in very-large databases: Data structures and implementation. In Network and Distributed System Security Symposium (NDSS).
[8] Cao, N., Wang, C., Li, M., Ren, K., & Lou, W. Privacy-preserving multi-keyword ranked search over encrypted cloud data. Parallel and Distributed Systems, IEEE Transactions on.
[9] Sun, W., Wang, B., Cao, N., Li, M., Lou, W., Hou, Y. T., & Li, H. Verifiable privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking. Parallel and Distributed Systems, IEEE Transactions on.
[10] Moataz, T., & Shikfa, A. (May). Boolean symmetric searchable encryption. In Proceedings of the ACM SIGSAC Symposium on Information, Computer and Communications Security. ACM.
[11] Li, J., Wang, Q., Wang, C., Cao, N., Ren, K., & Lou, W. Fuzzy keyword search over encrypted data in cloud computing. In Proc. of IEEE INFOCOM Mini-Conference.
[12] Chen, C., Zhu, X., Shen, P., Hu, J., Guo, S., Tari, Z., & Zomaya, A. An efficient privacy-preserving ranked keyword search method.
[13] Dalton, M., Kozyrakis, C., & Zeldovich, N. (August). Nemesis: Preventing authentication & access control vulnerabilities in web applications. In USENIX Security Symposium.