
kv.sreehari@yahoo.com

A Genetic Programming Approach to Record Deduplication
*******************************************************************

Domain: knowledge and data engineering (data mining + data warehousing, web mining)

=>Setting: multiple web databases. Every day, new documents related to the same topics are uploaded into these large databases, which creates many problems for analysis.
=>Each web database holds content on some number of topics. The user enters any one topic name; the searching process extracts matching records from the multiple databases and displays the results of content.

Problem 1: there is no communication between one database and another database.
Problem 2: records stored in the first database may also appear in another (second) database.
Problem 3: such duplicate records mean these databases are not quality databases.

Existing system: some approaches or methods
****************************************************
=>Government organizations and digital libraries have used three previous methods for the detection of duplicates:
1.record matching
2.record linkage
3.record deduplication
=>Communication between the different databases (data integration) and data cleaning are used to remove the duplicate records of content.
=>Even using the above three approaches it is not possible to remove all the duplicates; there is no guarantee that every duplicate across the multiple databases is removed.

Record matching
******************
1.Training phase: retrieval of records.
2.Testing phase: check whether duplicates are present or not, and remove the duplicates.

Training phase:
=>A query is issued against the multiple databases, and the results (records) are extracted based on that query.
=>The results are displayed without any communication between the databases, so duplicate records are present.
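The setting above can be sketched in a few lines: two independent databases answer the same topic query without communicating, so the merged result contains duplicates. This is a minimal illustration; the database contents and record layout are made-up assumptions, not data from the paper.

```python
# Two independent web databases that never communicate with each other.
# Record contents here are hypothetical illustrations.
database_a = [
    {"id": "a1", "title": "Genetic Programming for Deduplication"},
    {"id": "a2", "title": "Web Data Mining Basics"},
]
database_b = [
    {"id": "b1", "title": "Genetic Programming for Deduplication"},  # same record again
    {"id": "b2", "title": "Data Warehousing Overview"},
]

def search(topic, *databases):
    """Extract matching records from each database independently."""
    results = []
    for db in databases:
        results.extend(r for r in db if topic.lower() in r["title"].lower())
    return results

merged = search("deduplication", database_a, database_b)
titles = [r["title"] for r in merged]
# The same title appears twice -- problems 2 and 3 from the notes.
assert titles.count("Genetic Programming for Deduplication") == 2
```

Because each database answers on its own, nothing in the pipeline can tell that `a1` and `b1` describe the same record.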

Testing phase:
=>Communication across the multiple databases: record1, record2, ... recordN.
=>Pairwise comparison: record1-record2, record2-record1, record1-record3, record2-record3, record1-record4, and so on.
=>From these comparisons we separate similar records, duplicate records, and unique records.
=>However, the unique records we obtain here are not meaningful or semantic records; we derive no evidence for them.

Record linkage
****************
1.Training phase: retrieve the records of information.
2.Testing phase: matching or linkage of records.
=>A threshold parameter is set over the total number of attributes: for example, if 7 attributes match, those records are also detected as duplicates.
=>The linkage process starts from one record and runs against every other record.
=>Detection of duplicates: 10=10, 10=9, 10=8, 10=7 matching attributes count as duplicates; records where fewer attributes than the threshold match in any other record are not duplicates.
=>Some extra duplicate detection becomes possible this way, improving the performance of detection.

Disadvantages:
1.More computational cost.
2.Overhead problems are also generated here.

Record deduplication approach
******************************
=>We reduce the overhead, resources, and time parameters.
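The attribute-threshold linkage rule above can be sketched as follows: two records are linked as duplicates when at least `threshold` of their attributes match. The records, attribute names, and the 7-out-of-10 threshold are illustrative assumptions taken from the example figures in the notes.

```python
def matching_attributes(rec_a, rec_b):
    """Count attributes holding identical values in both records."""
    return sum(1 for key in rec_a if rec_a[key] == rec_b.get(key))

def is_duplicate(rec_a, rec_b, threshold=7):
    """Linkage rule: duplicates when at least `threshold` attributes match."""
    return matching_attributes(rec_a, rec_b) >= threshold

keys = [f"attr{i}" for i in range(10)]
rec1 = {k: i for i, k in enumerate(keys)}
rec2 = dict(rec1)                                           # 10 of 10 attributes match
rec3 = dict(rec1, attr0=99, attr1=99, attr2=99, attr3=99)   # only 6 of 10 match

assert is_duplicate(rec1, rec2)        # 10 >= 7 -> duplicate
assert not is_duplicate(rec1, rec3)    # 6 < 7  -> not a duplicate
```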

Training phase / testing phase:
=>Search for the duplicates with a concise query or range-based query: in large databases we do not search the total database; only the area that requires duplicate detection is examined, and duplicates are detected in that area itself.
=>The duplicates are removed in that particular area only.
=>So this approach, too, cannot remove all of the duplicates.

Disadvantages:
1.Slow response time.
2.Quality loss.
3.Performance degradation.
4.Operational cost is high.

New system: duplicate detection
********************************
=>Record deduplication produces unique records.
=>The multiple databases form large data repositories before duplicate detection.
=>In the large data repositories, apply the record deduplication function.
=>Removing the duplicates yields small repositories.
=>Next we offer another function, the suggested function, over the small repositories of unique records.
=>The unique records relate to one topic, and subtopics are present within that topic.
=>Select one of the subtopics; find how many records are available for that same subtopic as relationship records, and find the similar features of those records.
=>Do the same for the different subtopics and their records with similar features:
subtopic1: 20
subtopic2: 25
subtopic3: 23
=>The count acts as a population, and a rating is given based on the count.
=>The relationship present may be parent - children - sub-children - leaf. Even after identification of the rating, it is not possible to arrange the records directly without any logic: the records are not yet aligned in a proper way.
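The "suggested function" step above can be sketched as: group the unique records by subtopic, treat the group size as that subtopic's population count, and rate the subtopics by count. The subtopic names and counts (20, 25, 23) follow the illustrative figures in the notes.

```python
from collections import Counter

# One subtopic label per unique record, mirroring the counts in the notes.
records = (["subtopic1"] * 20) + (["subtopic2"] * 25) + (["subtopic3"] * 23)

population = Counter(records)                            # count per subtopic
rating = [name for name, _ in population.most_common()]  # highest count first

assert population["subtopic2"] == 25
assert rating[0] == "subtopic2"   # largest population gets the top rating
```

As the notes say, this rating alone does not arrange the records: a further logic (the arithmetic function below) is still needed to place them into a parent/children relationship.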

Arithmetic function
**********************
=>We use all four operations: addition, subtraction, multiplication, division.
=>Using them we arrange all the records into a proper relationship.
=>This gives evidence, proof, and semantic (meaningful) records.

Existing genetic programming approach
********************************************
=>Given the records, GP finds suitable answers for the arrangement of records.
=>But GP has problems: there is no communication between one record and another record.
=>Every record is aligned as an independent record in the tree.
=>This causes a conflict problem.
=>It does not give an effective result in the implementation part, effective evidence results, or a meaningful arrangement of records.

New GP: generational evolutionary approach
=>Improves on this with better evidence results.
=>Based on the user's requirement, the structure of the results changes in the implementation part.
=>Proof results are produced as optimal results.
=>Experiment: the first iteration's arrangement of records does not give efficient results; the second iteration is better; by the third iteration and on up to n iterations we reach an optimal record arrangement, using SVM (support vector machine).
=>The first iteration has fewer features; as the iterations increase, the features increase.
=>The feature vector of features grows here, giving more features of content to users.
=>Tuning of the data with many related databases: the tuning operation provides the better results.

Summary:
Previous system: independent documents or records.
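The arithmetic function can be sketched as a GP individual: a tree that combines pieces of evidence (per-attribute similarity scores) with the four arithmetic operations into one record-comparison score. The tree shape and evidence values below are hypothetical; a real GP system would evolve many such trees over generations and keep the fittest.

```python
import operator

# The four operations the notes name, with protected division (divide-by-zero -> 0).
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": lambda a, b: a / b if b else 0.0}

def evaluate(tree, evidence):
    """Recursively evaluate a nested (op, left, right) tuple or a leaf name."""
    if isinstance(tree, str):
        return evidence[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, evidence), evaluate(right, evidence))

# One individual: (name_sim * two) + (address_sim / year_sim) -- an arbitrary tree.
individual = ("+", ("*", "name_sim", "two"), ("/", "address_sim", "year_sim"))
evidence = {"name_sim": 0.9, "address_sim": 0.5, "year_sim": 1.0, "two": 2.0}

score = evaluate(individual, evidence)
assert abs(score - 2.3) < 1e-9   # 0.9 * 2.0 + 0.5 / 1.0
```

A generational loop would mutate and recombine such trees, scoring each one by how well its output separates duplicate from non-duplicate pairs.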

=>There are no supporting documents; the features are fewer; these results are not meaningful.

New system: dependent documents or records.
=>Supporting documents exist.
=>Based on the generational evolutionary approach we increase the features.
=>New documents are automatically evaluated every time, the new documents are inserted, and a new tree of results (or modified results of content) is produced.

We describe two good applications: scientific citations of articles and a restaurant catalog.
=>On the Wikipedia website, hyperlinks are present as independent items: we do not know which hyperlink is the strongest hyperlink, so our own decisions lead to wrong results.
=>Restaurant catalog: the users are not satisfied.
=>Wikipedia hyperlinks or citations contain duplicate citations; we remove the duplicates with the help of the deduplication function, giving unique records.
=>Suggested function: for one citation we find the other citations with similar features.
=>We calculate the fitness of the present citations; those citations are not in proper alignment.
=>The arithmetic function gives better-fit results.
=>When new records are added, any duplicates among them are automatically removed, keeping the records of content unique with better-fit results.

Related work (background / literature survey)
****************
=>Data entry problems:
1.Any record content can be uploaded.
2.We do not maintain the constraints.
3.Providing wrong data is also allowed.
4.Duplicate data.
5.Format:
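The incremental behaviour described above can be sketched as: when new records arrive, any that duplicate an already-stored unique record are dropped automatically. The citation strings and the case-insensitive equality test are made-up illustrations, not the paper's actual similarity function.

```python
def add_records(unique, incoming, same=lambda a, b: a.lower() == b.lower()):
    """Insert incoming records, dropping any that duplicate a stored one."""
    for rec in incoming:
        if not any(same(rec, kept) for kept in unique):
            unique.append(rec)
    return unique

citations = ["Koza 1992, Genetic Programming"]
add_records(citations, ["KOZA 1992, GENETIC PROGRAMMING",   # duplicate (case only)
                        "Fellegi & Sunter 1969"])

assert len(citations) == 2
assert "Fellegi & Sunter 1969" in citations
```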

=>Each source uploads in its own format.
=>Recognition of results in the searching process: it is not possible to recognize all duplicates, so a low duplicate detection ratio is available in the implementation part.
=>This is related to optical character recognition (OCR), a neural-network-related concept.
=>Among all the formats, decide which format is the quality standard format, or combine all formats into a new quality standard format and display the results in that quality standard.

Techniques surveyed:
word matching
phrase matching
subfield matching
edit distance
ad-hoc domain knowledge
training approaches
domain knowledge
statistics-related probabilistic approach
machine learning

Word matching, phrase matching, subfield matching: edit distance
**********************************************************************
=>Edit distance: within the total database area, we choose in how much of the area to start the detection of duplicates.
=>If the total word of characters matches, the records are detected as duplicates.
=>Different databases hold information on different topics. When a topic name is entered, the same word may be available in two or more topics.
=>Matching duplicates within the present topic: the same word may be used for a different purpose, yet we identify it as duplicate data, and the data becomes meaningless.

Phrase matching: information retrieval, n-grams.
=>We generate combinations of two or more words as sequences here.
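The word-matching idea above rests on edit distance: a distance of 0 means the words match exactly, and small distances flag near-duplicates. Below is the classic dynamic-programming Levenshtein algorithm as a general illustration; it is not code from any of the surveyed systems.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

assert edit_distance("record", "record") == 0   # exact word match -> duplicate
assert edit_distance("record", "recird") == 1   # one substitution away
```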

=>Two or more words in sequence form a phrase.
=>Using the phrase-based approach we start the detection of duplicates.
=>Those duplicates can be detected in a particular area (edit distance), but it is not possible to detect the duplicates in all dimensions.

Subfield extraction
********************
=>Works in the words-and-phrases environment.
=>In a word many characters are present: if 10 characters are present and 5 characters match, that is a subfield or substring representation.
=>Duplicates are detected and removed this way.
=>But it is not a meaningful way to start the detection of duplicates: this substring-based approach starts detection only within a particular distance.

New approaches: training approaches.

Ad-hoc domain knowledge based approach
***********************************************
1.Input: multiple web databases.
2.Enter the topic; the same topic may by chance be present in two or more domains.
3.The information or records of all domains are displayed in the output: a large (huge) amount of records.
4.From all domains we first separate out the required domain's results, reducing the huge amount of record content to a small amount of records.
5.It is then very easy to detect the duplicates.
=>In the present domain's results, entering a string or topic name still gives many results.
=>Change the query by adding a new string: with two strings this is the ad-hoc string mechanism, which gives meaningful duplicate detection.
=>But we have no idea which combination of strings makes it possible to detect all duplicates, so it is expensive in the detection of duplicates.

Training-based approaches
******************************
1.Input: multiple databases.
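The subfield/substring idea above can be sketched as: compare two words by their longest common substring and treat the pair as a subfield match when at least half the characters fall in a shared substring. The 50% cut-off mirrors the 5-of-10-characters example in the notes and is an assumption, not a standard value.

```python
def longest_common_substring(a, b):
    """Length of the longest substring shared by a and b (O(len(a)*len(b)))."""
    best = 0
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                best = max(best, table[i][j])
    return best

def subfield_match(a, b, ratio=0.5):
    """Subfield rule: shared substring covers at least `ratio` of the longer word."""
    return longest_common_substring(a, b) >= ratio * max(len(a), len(b))

assert subfield_match("abcdefghij", "abcdezzzzz")      # 5 of 10 characters shared
assert not subfield_match("abczzzzzzz", "abcdefghij")  # only 3 of 10 shared
```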

2.Enter the query; from the multiple databases, extract the duplicate records.
3.Using a classification technique we write a rule, and based on that rule we start the detection of duplicates.
=>With a single rule it is not possible to detect all the duplicates.
=>Supervised clustering with machine learning gives semi-supervised results: 50%.

Domain knowledge approaches
************************************
=>Multiple databases; enter the query; display the related records.
=>For the detection of duplicates we introduce a new similarity function.
=>For one record, identify the similar records of information and count them; for the second record, identify the similar records and count them; and so on.
=>Threshold: records whose similarity count or weight is above the threshold are detected as duplicates (high weight of duplicates).
=>Records below the threshold are treated as non-duplicates (low weight), but some duplicates are still present there, so it is not possible to detect all duplicates accurately.

Probabilistic approaches
***************************
=>This probabilistic approach depends completely on naive Bayesian classification.
=>Divide the records into two subsets: a positive-records-related subset and a negative-records-related subset.
=>The positive-record subset holds the unique records.
=>Record deduplication yields unique records, but we are not satisfied with the unique records: some duplicates are still present.
=>For removing a larger number of the duplicate records: naive Bayesian + statistical approach.
=>For the unique records, similar records get a weight value or count value.
=>Two boundary values are used: a lower bound and a high-weight boundary.
=>Records below the lower bound are non-duplicate records.
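The two-boundary rule above can be sketched as: a pair's similarity weight is compared against a lower and an upper boundary; above the upper boundary it is a duplicate, below the lower boundary a non-duplicate, and in between it stays undecided. The boundary values 0.3 and 0.8 are illustrative assumptions.

```python
def classify(weight, lower=0.3, upper=0.8):
    """Two-boundary classification of a record pair's similarity weight."""
    if weight >= upper:
        return "duplicate"
    if weight <= lower:
        return "non-duplicate"
    return "possible-duplicate"   # needs further evidence or review

assert classify(0.95) == "duplicate"
assert classify(0.10) == "non-duplicate"
assert classify(0.55) == "possible-duplicate"
```

As the notes warn next, badly chosen boundaries make this rule misclassify pairs, which is listed among the approach's disadvantages.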

=>There is no proper alignment.

Disadvantages:
1.More amount of time and processing memory.
2.The similarity-of-records concept is very complex.
3.The boundary values sometimes work as bad boundaries, giving wrong results of specification.

Machine learning approaches
*********************************
=>Supervised clustering: naive Bayesian classification is a one-time classification; it is not possible to derive the best results from it.
=>Unsupervised clustering: two or more rules (up to n rules) may by chance be generated, as with an SVM, giving n rounds of classification.
=>Record matching and record linkage: two rounds of classification are possible in the implementation, giving somewhat optimal or good results of content.
=>The two classifications are: 1.training environment 2.testing environment.

Record matching:
=>One record's attributes are matched against another record's attributes; when the attributes (and thereby the records) match, they are detected as duplicates.

Record linkage:
=>Some of a record's attributes match in the implementation; when most of the attributes match, there is a higher probability, and the duplicates are detected.

Record deduplication: adaptive approach
*********************
=>A clustering approach.

=>Records: select a record and its attributes; using the cluster, find the number of records with similar features.
=>These are static records; for new records, we also check how many are available as duplicates.

Decision trees
*****************
=>This is also a classification approach for the detection of duplicates.
=>We use binary operations of content: true or false.
=>The training cost we spend is low.
=>But the tree of unique records is not a meaningful tree record alignment.

Ranking approach
*********************
=>Multiple databases; enter the topic name; retrieve the similar and relevant records.
=>Duplicates are removed through data integration, leaving unique records.
=>A similarity function assigns a weight, and the records are ranked according to that weight.
=>The records are aligned based on ranking, but the ranked records display without any relationship: the ranked records are still available as independent records.

GP approach
Generational evolutionary approach:
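The ranking approach above can be sketched as: score each unique record with a similarity weight against the query and sort by that weight. The toy scorer (shared-word overlap) and the record titles are illustrative assumptions, not the surveyed systems' actual similarity function.

```python
def weight(query, title):
    """Similarity weight = number of query words appearing in the title."""
    q, t = set(query.lower().split()), set(title.lower().split())
    return len(q & t)

records = ["record deduplication with genetic programming",
           "data warehousing overview",
           "duplicate record detection survey"]
query = "record deduplication"

ranked = sorted(records, key=lambda r: weight(query, r), reverse=True)
assert ranked[0] == "record deduplication with genetic programming"  # weight 2
```

As the notes observe, the ranked list orders the records but encodes no relationship between them; each record still stands alone.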
