Fundamentals of Business Analytics RN Prasad and Seema Acharya Copyright 2011 Wiley India Pvt. Ltd. All rights reserved.
(a) To differentiate between structured, unstructured and semi-structured data
(b) To understand the need to integrate structured, unstructured and semi-structured data
1. Structured data: origin, organization, storage, access and usage
2. Semi-structured data: origin, organization, storage, access and usage
3. Unstructured data: origin, organization, storage, access and usage
Session Plan
: 45 to 60 minutes : 15 minutes
Agenda
Types of digital data
Unstructured
  Origin
  Management
  Storage
  Storage of unstructured data in relational database
  Process of extracting information
  Key take-away and additional reads
Semi-structured
  Origin
  Management
  Storage
  Storage of semi-structured data in relational database
  Process of extracting information
  XML
  Key take-away and additional reads
Agenda (contd.)
Types of digital data (contd.)
Structured
  Origin
  Management
  Storage
  Process of extracting information
  Key take-away and additional reads
Digital Data
Digital data can be
  Unstructured
  Semi-structured
  Structured
According to Merrill Lynch, 80 to 90% of business data is either unstructured or semi-structured
Data is usually in a format which makes it difficult to extract information from it
Unstructured Data
Unstructured data
Challenges faced
Ensuring security is difficult due to the varied sources of data (e.g. e-mail, web pages)
Updating, deleting, etc. are not easy due to the unstructured form
Indexing becomes difficult as the volume of data increases
Searching is difficult for non-text data
Possible solutions
Create hardware that supports unstructured data, either complementing the existing storage devices or standing alone for unstructured data
Store in relational databases that support BLOBs (Binary Large Objects)
Store in XML, which tries to give some structure to unstructured data by using tags and elements
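The BLOB option above can be illustrated with a minimal sketch. It assumes SQLite through Python's standard sqlite3 module; the table name documents, the helper names and the sample bytes are hypothetical and only serve the illustration.

```python
import sqlite3


def store_blob(conn, name, payload):
    """Insert raw bytes into a BLOB column and return the new row id."""
    cur = conn.execute(
        "INSERT INTO documents (name, content) VALUES (?, ?)",
        (name, sqlite3.Binary(payload)),
    )
    conn.commit()
    return cur.lastrowid


def load_blob(conn, doc_id):
    """Read the bytes back; the database treats them as opaque content."""
    row = conn.execute(
        "SELECT content FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return bytes(row[0])


# In-memory database with a hypothetical table for unstructured files.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT, content BLOB)"
)

payload = b"\x89PNG\r\n\x1a\n placeholder image bytes"
doc_id = store_blob(conn, "scan.png", payload)
restored = load_blob(conn, doc_id)
```

The database stores and returns the bytes unchanged but cannot search inside them, which is why searching non-text data remains a challenge even after it is stored.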
Challenges faced
Designing algorithms that understand the meaning of a document and then tag or index it accordingly is difficult
Computer programs cannot automatically derive meaning or structure from unstructured data
The increasing number of file formats makes it difficult to interpret data
Different naming conventions followed across the organization make it difficult to classify data
Possible solutions
Text mining tools help in grouping, classifying and analyzing unstructured data by considering grammar, context, synonyms, etc.
Application platforms like XOLAP help extract information from e-mail and XML-based documents
Taxonomies within the organization can be managed automatically to organize data in hierarchical structures
Following naming conventions or standards across an organization can greatly improve storage and retrieval
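As a rough illustration of classifying text against a taxonomy, the sketch below tags a message by keyword match only. It is far simpler than a real text mining tool, which would also consider grammar, context and synonyms. The TAXONOMY categories, the keywords and the function names are hypothetical.

```python
import re

# Hypothetical taxonomy: category name -> keywords that signal it.
TAXONOMY = {
    "finance": {"invoice", "payment", "budget"},
    "hr": {"leave", "payroll", "recruitment"},
}


def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify(text, taxonomy=TAXONOMY):
    """Return the categories whose keywords occur in the text, sorted by name."""
    words = tokenize(text)
    return sorted(cat for cat, keys in taxonomy.items() if words & keys)


labels = classify("Please approve the invoice before the payroll run.")
unmatched = classify("See you at lunch.")
```

Here labels holds both categories because the message mentions an invoice and payroll, while unmatched is empty because no keyword matches.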
Further Reading
http://www.information-management.com/issues/20030201/6287-1.html
http://www.enterpriseitplanet.com/storage/features/article.php/11318_3407161_2
http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.index.html
http://www.research.ibm.com/UIMA/UIMA%20Architecture%20Highlights.html
Ask the participants of the learning program to state some more examples of Unstructured data
Do it Exercise
Search, think and write about two best practices for managing the growth of unstructured data
Semi-structured Data
Semi-structured data
Challenges faced
Semi-structured data cannot be stored as it is in an existing RDBMS, as the data cannot be mapped into tables directly
Some data elements may have extra information while others have none at all
In many cases the structure is implicit, so interpreting relationships and correlations is very difficult
Schemas keep changing with requirements, making it difficult to capture them in a database
The distinction between schema and data is vague at times, making it difficult to capture data
Possible solutions
Semi-structured data can be stored in a relational database by mapping the data to a relational schema, which is then mapped to a table
Use databases that are specifically designed to store semi-structured data
Data can be stored and exchanged in the form of a graph, where entities are represented as objects that are the vertices of the graph
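One way to picture the mapping to a relational schema is the sketch below, which assumes SQLite through Python's sqlite3 module. The sample records are loosely modeled on the e-mail example used later in this deck; their exact field layout, the table names person and email, and the record with no e-mail are assumptions made for illustration.

```python
import sqlite3

# Irregular records: one has two e-mail addresses, one has one, one has none.
records = [
    {"name": "Patrick Wood", "emails": ["ptw@dcs.abc.ac.uk", "p.wood@ymail.uk"]},
    {"name": "Alex Bourdoo", "emails": ["AlexBourdoo@dcs.ymail.ac.uk"]},
    {"name": "No Mail"},  # hypothetical record without any e-mail field
]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE email (
        person_id INTEGER REFERENCES person(id),
        address   TEXT NOT NULL
    );
    """
)

# The repeating or missing field goes to its own table, one row per value.
for rec in records:
    cur = conn.execute("INSERT INTO person (name) VALUES (?)", (rec["name"],))
    for address in rec.get("emails", []):
        conn.execute(
            "INSERT INTO email (person_id, address) VALUES (?, ?)",
            (cur.lastrowid, address),
        )
conn.commit()

# Count addresses per person; people without e-mail still appear with 0.
counts = conn.execute(
    """
    SELECT p.name, COUNT(e.address)
    FROM person p LEFT JOIN email e ON e.person_id = p.id
    GROUP BY p.id
    ORDER BY p.id
    """
).fetchall()
```

Splitting the variable part into a second table is only one possible design; it keeps the extra or absent information without forcing every record into the same set of columns.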
Challenges faced
Data comes from varied sources and is difficult to tag and search
Extracting structure when there is none, and interpreting the relations within whatever structure is present, is a difficult task
Possible solutions
Storing data in a graph-based data model makes it easier to index and search
Arranging data in a hierarchical or tree-like structure, as XML does, enables indexing and searching
Various mining tools are available that search data based on graphs, schemas, structure, etc.
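The graph-based idea can be sketched with plain Python dictionaries. This is an assumption-laden illustration, not a specific product or exchange format: the vertex ids, edge labels and sample values are hypothetical.

```python
from collections import defaultdict

# Each key is a vertex (an object); each edge is a (label, target) pair.
# A target is either another vertex id or an atomic value.
graph = {
    "root": [("person", "p1"), ("person", "p2")],
    "p1": [("name", "Patrick Wood"), ("email", "ptw@dcs.abc.ac.uk")],
    "p2": [("name", "Alex Bourdoo")],
}


def build_label_index(graph):
    """Index edges by label so a search does not walk the whole graph."""
    index = defaultdict(list)
    for vertex, edges in graph.items():
        for label, target in edges:
            index[label].append((vertex, target))
    return index


index = build_label_index(graph)
emails = index["email"]                       # every edge labeled "email"
names = [target for _, target in index["name"]]
```

Because labels travel with the data, objects with different sets of fields can share one graph, and the label index gives a direct lookup by field name.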
The words in the <> (angle brackets) are user-defined tags
XML is known as self-describing, as data can exist without a schema and a schema can be added later
A schema can be described in a DTD or in XML Schema (XSD)
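A small sketch of reading self-describing XML without any schema, using Python's standard xml.etree.ElementTree module. The tag names library, book, title and author are made up for this example, as is the second book entry.

```python
import xml.etree.ElementTree as ET

# User-defined tags; the second book simply omits the author element.
xml_text = """
<library>
  <book id="1">
    <title>Fundamentals of Business Analytics</title>
    <author>RN Prasad</author>
    <author>Seema Acharya</author>
  </book>
  <book id="2">
    <title>Untitled Draft</title>
  </book>
</library>
""".strip()

root = ET.fromstring(xml_text)

# Walk the tree using the tag names; no schema is needed to read the data.
books = [
    {
        "id": book.get("id"),
        "title": book.findtext("title"),
        "authors": [author.text for author in book.findall("author")],
    }
    for book in root.findall("book")
]
```

The parser accepts both book elements even though they carry different children, which shows how the tags describe the data while the structure stays flexible.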
Further Reading
http://queue.acm.org/detail.cfm?id=1103832
http://www.computerworld.com/s/article/93968/Taming_Text
http://searchstorage.techtarget.com/generic/0,295582,sid5_gci1334684,00.html
http://searchdatamanagement.techtarget.com/generic/0,295582,sid91_gci1264550,00.html
http://searchdatamanagement.techtarget.com/news/article/0,289142,sid91_gci1252122,00.html
Structured Data
Structured Data
Semi-structured
Name Patrick Wood
E-mail ptw@dcs.abc.ac.uk, p.wood@ymail.uk
MarkT@dcs.ymail.ac.uk
Structured
AlexBourdoo@dcs.ymail.ac.uk
Data can be indexed based not only on a text string but other attributes as well. This enables streamlined search
Structured data can be easily mined and knowledge can be extracted from it
BI works extremely well with structured data. Hence data mining, warehousing, etc. can be easily undertaken
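Indexing on attributes other than a text string can be sketched as follows, assuming SQLite through Python's sqlite3 module. The employee table, its columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employee (name, dept, salary) VALUES (?, ?, ?)",
    [
        ("Asha", "sales", 52000),
        ("Ben", "sales", 48000),
        ("Chen", "hr", 61000),
        ("Dina", "sales", 70000),
    ],
)

# Composite index on two typed attributes rather than on free text.
conn.execute("CREATE INDEX idx_employee_dept_salary ON employee (dept, salary)")

# A search that filters on both indexed attributes.
rows = conn.execute(
    "SELECT name FROM employee WHERE dept = ? AND salary >= ? ORDER BY salary",
    ("sales", 50000),
).fetchall()
```

Because every row has the same typed columns, the database can build such an index once and reuse it for any query on those attributes.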
Further Readings
http://www.govtrack.us/articles/20061209data.xpd
http://www.sapdesignguild.org/editions/edition2/sui_content.asp
Do it Exercise
Think and write about an instance where data was presented to you in each of the unstructured, semi-structured and structured formats
Summary please