
Universidad Nacional Abierta y a Distancia (UNAD) - Vicerrectoría Académica y de Investigación (VIACI)

School: Basic Sciences, Technology and Engineering - Program: Systems Engineering - Course: Advanced Databases (Base de Datos Avanzada)

Cases and Study Materials

Intermediate Stage, Phase 3 - Unit 3 - Introduction, Modeling, and Administration of Big Data Systems

(https://github.com/Unad-BDAvanzadas/U3_Caso_Material_Estudio)

Review of Reference Material (Lectures/Videos, Topics for Discussion/Opinion Forums, and Quizzes)


Group 1: Lectures and Introduction to Big Data
Discussion Topics
Lecture: What's in Big Data Applications and Systems?

Big Data: Why and Where

Why Big Data?

Lecture: What launched the Big Data era?
Lecture: Applications: What makes big data valuable
Lecture: Example: Saving lives with Big Data
Lecture: Example: Using Big Data to Help Patients
Lecture: A Sentiment Analysis Success Story: Meltwater helping Danone

Reading: Did you know?: 25 facts about big data

Big Data: Where Does It Come From?

Lecture: Getting Started: Where Does Big Data Come From?
Lecture: Machine-Generated Data: It's Everywhere and There's a Lot!
Lecture: Machine-Generated Data: Advantages
Lecture: Big Data Generated By People: The Unstructured Challenge
Lecture: Big Data Generated By People: How Is It Being Used?
Lecture: Organization-Generated Data: Structured but often siloed
Lecture: Organization-Generated Data: Benefits Come From Combining With Other Data Types
Lecture: The Key: Integrating Diverse Data
Readings: McKinsey report: http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
Lecture: The WIFIRE Project: https://www.youtube.com/watch?v=0ohwGggaXZM

Discussion Topics to Post in the Forum (Optional)

Let's Discuss: What application area interests you?

If you had your choice of Big Data analysis areas to work in, which would you choose? If you can, describe a specific type of data or
problem you would like to work on.
Health/Medical
City/Government/Infrastructure
Personalized Marketing
Product Growth
Something else?

Let's discuss: Who are you providing data to?

It's commonly discussed in the news how social media sites like Twitter and Facebook gather data on their users. But take a minute to
think in detail about the various ways you interact with machines and applications on a given day.

What's one surprising or uncomfortable thing you may be providing data on?
Is there a non-social media (or shopping) application you realize you do give information to (perhaps that you hadn't thought of
before)?

Group 2: Lectures and Characteristics of Big Data and Dimensions of Scalability

Discussion Topics
Characteristics of Big Data

Lecture: Getting Started: Characteristics Of Big Data
Lecture: Characteristics of Big Data - Volume
Lecture: What does astronomical scale mean?
Lecture: Characteristics of Big Data - Variety
Lecture: Characteristics of Big Data - Velocity
Lecture: Characteristics of Big Data - Veracity
Lecture: Characteristics of Big Data - Valence
Lecture: The Sixth V: Value

Reading: A Small Definition of Big Data

Reading: http://spaceanalytics.blogspot.com.co/2016/05/caracteristicas-de-big-data.html
Data Science: Getting Value out of Big Data
Defining the Question
Lecture: Data Science: Getting Value out of Big Data
Lecture: Building a Big Data Strategy
Lecture: How does big data science happen?: Five Components of Data Science
Lecture: Asking the Right Questions

Reading: Five P's of Data Science

The Data Analysis Process
Lecture: Steps in the Data Science Process
Lecture: Step 1: Acquiring Data
Lecture: Step 2-A: Exploring Data
Lecture: Step 2-B: Pre-Processing Data
Lecture: Step 3: Analyzing Data
Lecture: Step 4: Communicating Results
Lecture: Step 5: Turning Insights into Action

Discussion Topics to Post in the Forum (Optional)

Practice: Writing Big Data questions


Now that you have learned about the characteristics of Big Data, it's important to practice the process of considering what data you
have and brainstorming what kinds of data science questions you could ask of that data.
Let's take the example of Eginece, our builder of the Flamingo game. Think about the kind of big data generated by users playing this
game.

What questions can you formulate that could be answered with this data that would bring value to the company?

Let's Discuss: Improving the Flamingo Game

In the value lectures, we mentioned a social media game called Catch the Flamingo. Big data is generated when we track all of the
user's data and store them in our database.
Why do you think it would be beneficial to collect player data and perform analysis on it for the future of the game? In particular,
which aspect of the game could you imagine we could improve using this data?

Let's Discuss: Thinking more deeply about the Ps

Think about a big data problem of interest to you (e.g., in your career, in your life, etc.). When you think about that problem, which of
the 5 Ps is most interesting or most challenging to address?
Note: If you don't have a big data problem of your own, think about Ilkay's work in fire analytics.
The 5 Ps are: People, Process, Purpose, Platforms, Programmability

Let's Discuss: Building a Team


What type of expertise do you think you would need in order to create a data science team for a social media site such as Facebook,
Twitter, Instagram or similar?

Group 3: Foundations of Big Data Systems and Programming

Basic Scalable Computing Concepts
Lecture: Getting Started: Why worry about foundations?
Lecture: What is a Distributed File System?
Lecture: Scalable Computing over the Internet
Lecture: Programming Models for Big Data
Systems: Getting Started with Hadoop
Getting Started with Hadoop
Lecture: Hadoop: Why, Where and Who?
Lecture: The Hadoop Ecosystem: Welcome to the zoo!
Lecture: The Hadoop Distributed File System: A Storage System for Big Data
Lecture: YARN: A Resource Manager for Hadoop
Lecture: MapReduce: Simple Programming for Big Results

Reading: MapReduce in the Pasta Sauce Example

Lecture: When to Reconsider Hadoop?
Lecture: Cloud Computing: An Important Big Data Enabler
Lecture: Cloud Service Models: An Exploration of Choices
Lecture: Value From Hadoop and Pre-built Hadoop Images

Hands On (Workshops)

Downloading and Installing Hadoop

Reading: Downloading and Installing the Cloudera VM Instructions (Mac)
Reading: Downloading and Installing the Cloudera VM Instructions (Windows)
Reading: Copy your data into the Hadoop Distributed File System (HDFS) Instructions
Running your First Application on Hadoop
Lecture: Copy your data into the Hadoop Distributed File System (HDFS)
Reading: Run the WordCount program Instructions
Lecture: Run the WordCount program
Other: Let's Discuss: Map Reduce in your life
Optional Materials
Reading: How do I figure out how to run Hadoop MapReduce programs?
Discussions:

Let's Discuss: Map Reduce in your life


What are some examples in your work or daily life where applying the MapReduce algorithm could speed up a process?
In particular, if it is possible to implement, state who or what would store the data in each stage and who or what would execute the
mapping and reducing.
Participation is optional

Project 1 instructions (Groups 1-3)

MapReduce is the core programming model for the Hadoop Ecosystem. We've found it's really helpful to walk through the steps of
MapReduce for yourself in order to internalize how it really works. In the lecture, we walked through the steps of MapReduce to
count words -- our keys were words. In this exercise, we'll have you count shapes -- the keys will be shapes.
Note: This assignment can be done in PPT and printed to PDF or on paper and submitted as a picture. Template in PPT, template in
JPG.
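
The walk-through described above (map, then shuffle and sort, then reduce, with shapes as keys) can be sketched in plain Python. This is only an illustrative sketch, not Hadoop code and not the assignment template: the function names (map_phase, shuffle_and_sort, reduce_phase, count_shapes) and the sample input splits are hypothetical, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(shapes):
    """Map: emit a (shape, 1) pair for every shape in one input split."""
    return [(shape, 1) for shape in shapes]


def shuffle_and_sort(mapped_pairs):
    """Shuffle and sort: group the emitted values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: sum the list of values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_shapes(splits):
    """Run the three steps over all input splits and return the shape counts."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle_and_sort(mapped))


# Hypothetical input: two splits, as if the shapes were stored on two nodes.
splits = [
    ["circle", "square", "circle"],
    ["triangle", "square", "circle"],
]

print(count_shapes(splits))  # {'circle': 3, 'square': 2, 'triangle': 1}
```

In a real cluster each split would be mapped on the node that stores it, and the grouped keys would be distributed across reducers; here the three functions simply run one after another to make the data flow visible.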

Quiz

Why Big Data and Where Did it Come From?

1. Which of the following is an example of big data utilized in action today?

Wi-Fi Networks
Social Media
The Internet
Individual, Unconnected Hospital Databases
2. What reasoning was given for the following: why is the data storage to price ratio relevant to big data?

Larger storage means easier accessibility to big data for every user because it allows users to download in bulk.
Access of larger storage becomes easier for everyone, which means client-facing services require very large data
storage.
It isn't, it was just an arbitrary example on big data usage.
Companies can't afford to own, maintain, and spend the energy to support large data storage unless the cost is
sufficiently low.
3. What is the best description of personalized marketing enabled by big data?

Being able to use the data from each customer for marketing needs.
Being able to obtain and use customer information for specific groups and utilize them for marketing needs.
Marketing to each customer on an individual level and suiting to their needs.
4. Of the following, which are some examples of personalized marketing related to big data?

Facebook revealing posts that cater towards similar interests.


News outlets gathering information from the internet in order to report them to the public.
A survey that asks your age and markets to you a specific brand.

5. What is the workflow for working with big data?

Extrapolation -> Understanding -> Reproducing


Big Data -> Better Models -> Higher Precision
Theory -> Models -> Precise Advice
6. Which is the most compelling reason why mobile advertising is related to big data?

Mobile advertising benefits from data integration with location which requires big data.
Since almost everyone owns a cell/mobile phone, the mobile advertising market is large and thus requires big data to
contain all the information.
Mobile advertising allows massive cellular/mobile texting to a wide audience, thus providing large amounts of data.
Mobile advertising in and of itself is always associated with big data.
7. What are the three types of diverse data sources?

Machine Data, Map Data, and Social Media


Sensor Data, Organizational Data, and Social Media
Information Networks, Map Data, and People
Machine Data, Organizational Data, and People
8. What is an example of machine data?

Social Media
Weather station sensor output.
Sorted data from Amazon regarding customer info.
9. What is an example of organizational data?

Disease data from Center for Disease Control.


Social Media
Satellite Data

10. Of the three data sources, which is the hardest to implement and streamline into a model?

Organizational Data
Machine Data
People
11. Which of the following summarizes the process of using data streams?

Integration -> Personalization -> Precision


Big Data -> Better Models -> Higher Precision
Theory -> Models -> Precise Advice
Extrapolation -> Understanding -> Reproducing
12. Where does the real value of big data often come from?

Combining streams of data and analyzing them for new insights.


Size of the data.
Having data-enabled decisions and actions from the insights of new data.
Using the three major data sources: Machines, People, and Organizations.
13. What does it mean for a device to be "smart"?

Must have a way to interact with the user.


Connect with other devices and have knowledge of the environment.
Having a specific processing speed in order to keep up with the demands of data processing.
14. What does the term "in situ" mean in the context of big data?

In the situation
The sensors used in airplanes to measure altitude.
Accelerometers.
Bringing the computation to the location of the data.

15. Which of the following are reasons mentioned for why data generated by people are hard to process?

The velocity of the data is very high.


Very unstructured data.
Skilled people to analyze the data are hard to come by.
They cannot be modeled and stored.
16. What is the purpose of retrieval and storage; pre-processing; and analysis in order to convert multiple
data sources into valuable data?

To enable ETL methods.


Since the multi-layered process is built into the Neo4j database connection.
Designed to work like the ETL process.
To allow scalable analytical solutions to big data.
17. Which of the following are benefits for organization generated data?

Customer Satisfaction
Better Profit Margins
Improved Safety
High Velocity
Higher Sales
18. What are data silos and why are they bad?

Highly unstructured data. Bad because it does not provide meaningful results for organizations.
A giant centralized database to house all the data produced within an organization. Bad because it is hard to maintain
as highly structured data.
Data produced from an organization that is spread out. Bad because it creates unsynchronized and invisible data.
A giant centralized database to house all the data production within an organization. Bad because it hinders
opportunity for data generation.

19. Which of the following is a benefit of data integration?

Increase data collaboration.


Adds value to big data.
Increase data availability.
Unify your data system.
Reduce data complexity.
Monitoring of data.

Why Big Data and Where Did it Come From?

1. Which of the following is an example of big data utilized in action today?

Wi-Fi Networks
Social Media
The Internet

While the Internet may be enabling the easier collection and sharing of big data, in and of itself, it is not an example of
big data utilized in action today.

Individual, Unconnected Hospital Databases


2. What reasoning was given for the following: why is the "data storage to price ratio" relevant to big data?

Larger storage means easier accessibility to big data for every user because it allows users to download in bulk.
Access of larger storage becomes easier for everyone, which means client-facing services require very large data
storage.
It isn't, it was just an arbitrary example on big data usage.
Companies can't afford to own, maintain, and spend the energy to support large data storage unless the cost is
sufficiently low.
3. What is the best description of personalized marketing enabled by big data?

Being able to use the data from each customer for marketing needs.
Being able to obtain and use customer information for specific groups and utilize them for marketing needs.
Marketing to each customer on an individual level and suiting to their needs.
4. Of the following, which are some examples of personalized marketing related to big data?

Facebook revealing posts that cater towards similar interests.


News outlets gathering information from the internet in order to report them to the public.
A survey that asks your age and markets to you a specific brand.

5. What is the workflow for working with big data?

Extrapolation -> Understanding -> Reproducing


Big Data -> Better Models -> Higher Precision
Theory -> Models -> Precise Advice
6. Which is the most compelling reason why mobile advertising is related to big data?

Mobile advertising benefits from data integration with location which requires big data.
Since almost everyone owns a cell/mobile phone, the mobile advertising market is large and thus requires big data to
contain all the information.
Mobile advertising allows massive cellular/mobile texting to a wide audience, thus providing large amounts of data.
Mobile advertising in and of itself is always associated with big data.
7. What are the three types of diverse data sources?

Machine Data, Map Data, and Social Media


Sensor Data, Organizational Data, and Social Media
Information Networks, Map Data, and People
Machine Data, Organizational Data, and People
8. What is an example of machine data?

Social Media
Weather station sensor output.
Sorted data from Amazon regarding customer info.
9. What is an example of organizational data?

Disease data from Center for Disease Control.


Social Media
Satellite Data

10. Of the three data sources, which is the hardest to implement and streamline into a model?

Organizational Data
Machine Data
People
11. Which of the following summarizes the process of using data streams?

Integration -> Personalization -> Precision


Big Data -> Better Models -> Higher Precision
Theory -> Models -> Precise Advice
Extrapolation -> Understanding -> Reproducing
12. Where does the real value of big data often come from?

Combining streams of data and analyzing them for new insights.


Size of the data.
Having data-enabled decisions and actions from the insights of new data.
Using the three major data sources: Machines, People, and Organizations.
13. What does it mean for a device to be "smart"?

Must have a way to interact with the user.


Connect with other devices and have knowledge of the environment.
Having a specific processing speed in order to keep up with the demands of data processing.
14. What does the term "in situ" mean in the context of big data?

In the situation
The sensors used in airplanes to measure altitude.
Accelerometers.
Bringing the computation to the location of the data.

15. Which of the following are reasons mentioned for why data generated by people are hard to process?

The velocity of the data is very high.


Very unstructured data.
Skilled people to analyze the data are hard to come by.
They cannot be modeled and stored.
16. What is the purpose of retrieval and storage; pre-processing; and analysis in order to convert multiple data sources
into valuable data?

To enable ETL methods.


Since the multi-layered process is built into the Neo4j database connection.
Designed to work like the ETL process.
To allow scalable analytical solutions to big data.
17. Which of the following are benefits for organization generated data?

Customer Satisfaction
Better Profit Margins
Improved Safety
High Velocity
Higher Sales
18. What are data silos and why are they bad?

Highly unstructured data. Bad because it does not provide meaningful results for organizations.
A giant centralized database to house all the data produced within an organization. Bad because it is hard to maintain
as highly structured data.
Data produced from an organization that is spread out. Bad because it creates unsynchronized and invisible data.
A giant centralized database to house all the data production within an organization. Bad because it hinders
opportunity for data generation.

V for the V's of Big Data

1. Amazon has been collecting review data for a particular product. They have realized that almost 90% of the reviews
were mostly a 5/5 rating. However, of the 90%, they realized that 50% of them were customers who did not have proof
of purchase or customers who did not post serious reviews about the product. Of the following, which is true about the
review data collected in this situation?

High Veracity
Low Valence
High Valence
High Volume
Low Veracity
Low Volume

2. As mentioned in the slides, what are the challenges to data with high valence?

Complex Data Exploration Algorithms


Difficult to Integrate
Reliability of Data
3. Which of the following are the 6 V's in big data?

Veracity
Value
Volume
Valence
Vision
Velocity
Variety

4. What is the veracity of big data?

The speed at which data is produced.


The size of the data.
The connectedness of data.
The abnormality or uncertainties of data.
5. What are the challenges of data with high variety?

The quality of data is low.


Hard to utilize group event detection.
Hard to perform emergent behavior analysis.
Hard to integrate.
6. Which of the following is the best way to describe why it is crucial to process data in real-time?

Batch processing is an older method that is not as accurate as real-time processing.


More expensive to batch process.
Prevents missed opportunities.
More accurate.
7. What are the challenges with big data that has high volume?

Storage and Accessibility


Speed Increase in Processing
Cost, Scalability, and Performance
Effectiveness and Cost

Data Science 101

1. Which of the following are parts of the 5 P's of data science and part of an additional P introduced in the slides?

Purpose
Process
People
Platforms
Perception

Programmability
Product
2. Which of the following are part of the four main categories to acquire, access, and retrieve data?

Remote Data
Web Services

Text Files
Traditional Databases
NoSQL Storage
3. What are the steps required for data analysis?

Select Technique, Build Model, Evaluate


Classification, Regression, Analysis
Regression, Evaluate, Classification
Investigate, Build Model, Evaluate

4. Of the following, what is a technique mentioned in the videos for building a model?

Analysis
Validation
Evaluation
Investigation
5. What is the first step in finding a right problem to tackle in data science?

Ask the Right Questions


Define Goals
Assess the Situation
Define the Problem
6. What is the first step to big data strategy?

Collect Data
Build In-House Expertise
Business Objectives
Organizational Buy-In
7. According to Ilkay, why is exploring data crucial to better modeling?

Data exploration... <complete the sentence>

enables histograms and other graphs as data visualization.


enables a description of data which allows visualization.
enables understanding of general trends, correlations, and outliers.
leads to data understanding which allows an informed analysis of the data.

8. Why is data science mainly about teamwork?

Data science requires a variety of expertise in different fields.


Exhibition of curiosity is required.
Engineering solutions are preferred.
Analytic solutions are required.
9. What are the ways to address data quality issues?

Remove data with missing values.


Data Wrangling

Generate best estimates for invalid values.


Remove outliers.
Merge duplicate records.
10. What is done to the data in the preparation stage?

Understand Nature of Data and Preliminary Analysis.


Build Models
Identify Data Sets and Query Data
Select Analytical Techniques
Retrieve Data

Foundations for Big Data

1. Which of the following is the best description of why it is important to learn about the foundations in big data?

Foundations allow understanding of practical concepts in Hadoop.


Understanding of practical concepts in Hadoop allows solid foundation.
Foundations stand the test of time.
Since foundations can be retained for a long time, Hadoop should be learned.
2. What is the benefit of a commodity cluster?

Much faster than a traditional super computer.


Prevents individual component failures.
Cost Effective
Prevents network connection failure.
3. What is a way to enable fault tolerance?

System Wide Restart


Better LAN Connection
Redundant Data Storage
Distributed Computing

4. What is a benefit specific to a distributed file system?

High Concurrency
Large Storage

Data Scalability
High Fault Tolerance

5. Which of the following are general requirements for a programming language in order to support big data models?

Handle Fault Tolerance


Optimization of Specific Data Types
Enable Adding of More Racks
Utilize Map Reduction Methods
Support Big Data Operations

Intro to Hadoop

1. What does IaaS provide?

Hardware Only
Computing Environment
Software On-Demand
2. What does PaaS provide?

Software On-Demand
Computing Environment
Hardware Only
3. What does SaaS provide?

Hardware Only
Software On-Demand
Computing Environment
4. What are the two key components of HDFS and what are they used for?

NameNode for metadata and DataNode for block storage.


FASTA for genome sequence and Rasters for geospatial data.
NameNode for block storage and DataNode for metadata.
5. What is the job of the NameNode?

Coordinates operations and assigns tasks to DataNodes.


Listens from DataNode for block creation, deletion, and replication.
For gene sequencing calculations.

6. What are the three steps to Map Reduce?

Shuffle and Sort -> Map -> Reduce


Map -> Shuffle and Sort -> Reduce
Map -> Reduce -> Shuffle and Sort
Shuffle and Sort -> Reduce -> Map
7. What is a benefit of using pre-built Hadoop images?

Quick prototyping, deploying, and guaranteed bug free.


Quick prototyping, deploying, and validating of projects.
Guaranteed hardware support.
Fewer software choices to choose from.
8. What are some examples of open-source tools built for Hadoop and what do they do?

Pig, for real-time and in-memory processing of big data.


Zookeeper, analyze social graphs.
Giraph, for SQL-like queries.
Zookeeper, a management system for animal-named related components.
9. What is the difference between low level interfaces and high level interfaces?

Low level deals with interactivity while high level deals with storage and scheduling.
Low level deals with storage and scheduling while high level deals with interactivity.
10. Which of the following are problems to look out for when you want to integrate your project with Hadoop?

Task Level Parallelism


Random Data Access
Advanced Algorithms
Infrastructure Replacement
Data Level Parallelism

11. As covered in the slides, which of the following are the major goals of Hadoop?

Latency Sensitive Tasks


Enable Scalability
Handle Fault Tolerance
Provide Value for Data
Facilitate a Shared Environment
Optimized for a Variety of Data Types
12. What is the purpose of YARN?

Allows various applications to run on the same Hadoop cluster.


Enables large scale data across clusters.
Implementation of Map Reduce.
13. What are the two main components for a data computation framework that were described in the slides?

Applications Master and Container


Node Manager and Applications Master
Resource Manager and Container
Node Manager and Container
Resource Manager and Node Manager

Lectures for Designing the Concept Maps

The texts for each of the lectures that must be read to prepare the concept maps are available at the following link.
(https://github.com/Unad-BDAvanzadas/U3_Caso_Material_Estudio)

Instructor's recommendations: The group's final work for each of the established phases must result from discussing, reviewing,
complementing, and consolidating the products and contributions submitted individually. There must be ongoing interaction and
meaningful contributions within the group, according to the role each member takes on both in the collaborative work and in
producing the deliverables (the group's final product).
Submit a single file containing the completed work. The idea is to present a document that consolidates the consensus or
agreements reached from the individual proposals, which is different from merging (copying and pasting) everything that was
submitted and also different from presenting just one of the individual contributions.
Use of the APA standard, version 3 in Spanish (translation of version 6 in English)
The APA Standards (Normas APA) are the most widely used style for organizing and presenting information in the social sciences.
They are published in a Manual that sets out how a scientific article should be presented. Here you can find the most relevant
aspects of the sixth edition of the APA Manual, such as references, citations, preparing and presenting tables and figures,
headings, and seriation, among others. You can learn how to apply them by visiting http://normasapa.com/
Plagiarism policies: What is plagiarism for UNAD? Plagiarism is defined by the dictionary of the Real Academia as the act of
"copying the substance of others' works and passing them off as one's own." Plagiarism is therefore a serious offense: it is the
academic equivalent of theft. A student who plagiarizes does not take their education seriously and does not respect the
intellectual work of others.
There is no such thing as minor plagiarism. If a student uses any portion of another person's work and does not document the
source, they are committing an act of plagiarism. Of course, we all rely on the ideas of others when presenting our own, and our
knowledge builds on the knowledge of others. But when we draw on the work of others, academic honesty requires that we
explicitly state that we are using an external source, either through a quotation or through an annotated paraphrase (these terms
will be defined later). When we quote or paraphrase, we clearly identify our source, not only to give credit to its author but also
so that the reader can refer to the original if they wish.
There are academic circumstances in which, exceptionally, it is not acceptable to quote or paraphrase the work of others. For
example, if an instructor assigns students a task that clearly asks them to respond using exclusively their own ideas and words,
the student must not turn to external sources, even if they are properly referenced.
