
Big Data Question Bank

Answer Pattern:
1. Introduction statement
2. Relevant explanation
3. Example
4. Diagram
5. Anything more you want to add

Q1. Explain the concept of big data. Compare and contrast with data
warehouse.

Answer:

The term Big Data is being increasingly used almost everywhere on the planet – online
and offline. And it is not related to computers only.
It comes under a blanket term called Information Technology, which is now part of
almost all other technologies and fields of studies and businesses.
Big Data is not a big deal. The hype surrounding it, however, is surely a pretty big deal and can confuse
you. This article takes a look at what Big Data is.
It also contains an example of how Netflix used its data, or rather, Big Data, to better
serve its clients’ needs.
What is Big Data
The data lying in the servers of your company was just data until yesterday – sorted and
filed. Suddenly, the term Big Data got popular, and now the data in your company is
Big Data.
The term covers each and every piece of data your organization has stored till now. It
includes data stored in clouds and even the URLs that you bookmarked. Your company
might not have digitized all the data.
You may not have structured all the data yet. But then, all the digital and paper-based,
structured and unstructured data with your company is now Big Data.
In short, all the data – whether or not categorized – present in your servers is
collectively called BIG DATA. All this data can be used to get different results using
different types of analysis.
Not every analysis uses all the data; different analyses use different
parts of the BIG DATA to produce the results and predictions needed.
Big Data is essentially the data that you analyse to produce results that you can use for
predictions and other purposes. When using the term Big Data, suddenly your company or
organization is working with top-level information technology to deduce different types of
results from the same data that you stored, intentionally or unintentionally, over the
years.
How big is Big Data
Essentially, all the data combined is Big Data, but many researchers agree that Big
Data – as such – cannot be manipulated using normal spreadsheets and regular tools
of database management.
It needs special analysis tools like Hadoop (we’ll study this separately) so that
all the data can be analysed in one go (which may include iterations of analysis).
Contrary to the above, though I am not an expert on the subject, I would say that data
with any organization – big or small, organized or unorganized – is Big Data for that
organization and that the organization may choose its own tools to analyse the data.
Normally, for analysing data, people used to create different data sets based on one or
more common fields so that analysis becomes easy. In case of Big Data, there is no
need to create subsets for analysing it.

We now have tools that can analyse data irrespective of how huge it is. Probably, these
tools themselves categorize the data even as they are analysing it.
Related Question: What is the relation between a data warehouse and big data? Explain
with a suitable example.

Answer:

Data warehouse = historical data only. Big data = historical data + current, real-time data
(e.g., from IoT devices)

Data warehouse = DSS (Decision Support System) type of model

Big Data = expert-system type of approach


Q2. What are the V's of BIG DATA?

Answer:

The commonly cited V's of Big Data are: Volume – the sheer scale of data generated;
Velocity – the speed at which data is produced and must be processed; Variety – the many
formats of data (structured, semi-structured and unstructured); Veracity – the uncertainty
and quality of the data; and Value – the business value that can be extracted from it.

Q3. Explain one application each from Manufacturing and Service Industry.

Answer:

Improving Manufacturing Processes

McKinsey and Company offers a big data use case in pharmaceutical manufacturing. A
biopharmaceutical company was using live, genetically engineered cells and monitoring
200 variables to track the purity of its manufacturing process for vaccines and blood
components. However, two batches of the same substance manufactured using
identical processes showed a yield variation from 50 to 100 percent. The inconsistency
in capacity and quality could attract regulatory attention.
The project team segmented its manufacturing processes into clusters of activity. Using
big data analytics the team assessed process interdependencies and identified nine
parameters that had a direct impact on vaccine yield. By modifying target processes the
company was able to increase vaccine production by 50 percent resulting in savings
between $5 and $10 million annually.

Custom Product Design

Tata Consultancy Services cites the case of a $2 billion company that generates most
of its revenue by manufacturing products to order.
Using big data analytics this company was able to analyze the behavior of repeat
customers. The outcome is critical to understanding how to deliver goods in a timely
and profitable manner.
Much of the analysis centered on how to make sure strong contracts were in place.
The company also was able to shift to lean manufacturing to determine which products
were viable and which ones needed to be scrapped.

Better Quality Assurance

Intel has been harnessing big data for its processor manufacturing for some time. The
chipmaker has to test every chip that comes off its production line. That normally means
running each chip through 19,000 tests.
Using big data for predictive analytics Intel was able to significantly reduce the number
of tests required for quality assurance. Starting at the wafer level, Intel analyzed data
from the manufacturing process to cut down test time and focus on specific tests.
The result was a savings of $3 million in manufacturing costs for a single line of Intel
Core processors. By expanding big data use in its chip manufacturing, the company
expects to save an additional $30 million.

Managing Supply Chain Risk


One manufacturer is using big data to reduce risk in delivery of raw materials, no matter
what happens in the supply chain.
Using big data analytics, the company has overlaid potential delays on a map, analyzing
weather statistics for tornadoes, earthquakes, hurricanes, etc. Predictive analytics allow
the company to calculate the probabilities of delays. The company uses the analytics
findings to identify backup suppliers and develop contingency plans to make sure
production isn’t interrupted by natural disaster.
These are just four examples of big data use cases in the manufacturing industry. There
are dozens of others. If you can narrowly define the problem and assemble the right
data you can harness big data to address almost any manufacturing problem.

Service Based Industries

Since consumers expect rich media on-demand in different formats and in a variety of
devices, some big data challenges in the communications, media and entertainment
industry include:

1. Collecting, analyzing, and utilizing consumer insights
2. Leveraging mobile and social media content
3. Understanding patterns of real-time media content usage

Applications of big data in the communications, media and entertainment industry:

Organizations in this industry simultaneously analyze customer data along with
behavioral data to create detailed customer profiles that can be used to create content
for different target audiences.

A case in point is the Wimbledon Championships, which leverages big
data to deliver detailed sentiment analysis on the tennis matches to TV, mobile, and
web users in real time.

Spotify, an on-demand music service, uses Hadoop big data analytics, to collect data
from its millions of users worldwide and then uses the analyzed data to give informed
music recommendations to individual users.

Amazon Prime, which is driven to provide a great customer experience by offering
video, music and Kindle books in a one-stop shop, also heavily utilizes big data.
Q4. Write Short Notes on:
a) HDFS and Tools
b) Data->Information->Knowledge

Answer:

HDFS

HDFS is the storage system of the Hadoop framework. It is a distributed file system
that can conveniently run on commodity hardware for processing unstructured data.
Because HDFS is built to run on commodity hardware, it is designed to be highly
fault-tolerant.

The same data is stored in multiple locations and in the event of one storage location
failing to provide the required data, the same data can be easily fetched from another
location. It owes its existence to the Apache Nutch project but today is a top level
Apache Hadoop project.

HDFS is a major constituent of Hadoop along with Hadoop YARN, Hadoop MapReduce
and Hadoop Common.

HDFS key features

HDFS is a highly scalable and reliable storage system for big data platform Hadoop.
Working closely with Hadoop YARN for data processing and data analytics, it improves
the data management layer of the Hadoop cluster making it efficient enough to process
big data concurrently. HDFS also works in close coordination with HBase. Let us find
out some of the highlights that make this technology special :

HDFS key features and their description:

Stores bulks of data – capable of storing terabytes and petabytes of data

Minimum intervention – HDFS manages thousands of nodes without operator intervention

Computing – benefits of distributed and parallel computing at once

Scaling out – it works on scaling out rather than scaling up, without downtime

Rollback – allows returning to the previous version after an upgrade

Data integrity – deals with corrupted data by replicating blocks several times

The servers in HDFS are fully connected and communicate with each other through TCP-based
protocols. Though designed for huge data sets, HDFS stores its blocks on top of the nodes'
normal local file systems (such as FAT or NTFS). A periodic checkpoint of the file system
namespace is obtained through the Checkpoint Node.

TOOLS

Data Extraction Tools: Talend, Pentaho
Data Storing Tools: Hive, Sqoop, MongoDB
Data Mining Tool: Oracle
Data Analysing Tools: HBase, Pig
Coordination Tool: ZooKeeper (coordinates and synchronizes the other components)

MongoDB

MongoDB is an open source database that uses a document-oriented data model.

How it Works:

MongoDB stores data using a flexible document data model that is similar to JSON.
Documents contain one or more fields, including arrays, binary data and sub-documents.
Fields can vary from document to document.

Some features of MongoDB Tool:

MongoDB can be used as a file system with load balancing and data replication
features over multiple machines for storing files.

Following are the main features of MongoDB:

1. Ad hoc queries
2. Indexing
3. Replication
4. Load balancing
5. Aggregation
6. Server-side JavaScript execution
7. Capped collections
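
To make the flexible document model concrete, here is a minimal Python sketch. It assumes
a MongoDB server on localhost and the pymongo driver; the collection name and fields are
invented for illustration, not taken from any real system.

# Minimal pymongo sketch: documents in one collection may have different fields.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo_shop"]

db.products.insert_one({"name": "leather wallet", "price": 45, "tags": ["gift", "brown"]})
db.products.insert_one({"name": "belt", "price": 30, "size": "L"})  # different fields, no schema change

db.products.create_index("price")                     # indexing
for doc in db.products.find({"price": {"$lt": 40}}):  # ad hoc query
    print(doc)

The two insert_one calls show why MongoDB suits semi-structured data: the second document
adds a field the first one lacks, without any schema migration.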
Apache Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing
data summarization, query, and analysis. Hive gives an SQL-like interface to query data
stored in various databases and file systems that integrate with Hadoop.

How it Works:

Hive has three main functions: data summarization, query and analysis. It supports
queries expressed in a language called HiveQL, which automatically translates SQL-like
queries into MapReduce jobs executed on Hadoop.

Features of Apache Hive

Apache Hive supports analysis of large datasets stored in Hadoop’s HDFS and
compatible file systems such as Amazon S3 file system.
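
Hive is normally queried from its own shell or through Beeline; the short sketch below
instead uses PySpark's Hive support purely to show what a HiveQL-style query looks like
from code. It assumes a Spark installation configured against the Hive metastore, and the
"sales" table and its columns are hypothetical.

# Hedged sketch: run a HiveQL-style aggregation through Spark's Hive support.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()   # read tables registered in the Hive metastore
         .getOrCreate())

spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
""").show()

The query itself is plain SQL; the engine (classic Hive-on-MapReduce or Spark) turns it
into distributed jobs, which is exactly the summarization-and-query role described above.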
Q5. What is Predictive analytics? Explain with suitable example

Answer:

What Is Predictive Analytics?

Predictive analytics refers to using historical data, machine learning, and artificial
intelligence to predict what will happen in the future. This historical data is fed into a
mathematical model that considers key trends and patterns in the data. The model is
then applied to current data to predict what will happen next.

Using the information from predictive analytics can help companies—and business
applications—suggest actions that can affect positive operational changes. Analysts can
use predictive analytics to foresee if a change will help them reduce risks, improve
operations, and/or increase revenue. At its heart, predictive analytics answers the
question, “What is most likely to happen based on my current data, and what can I do to
change that outcome?”

Real World Examples of Predictive Analytics in Business Intelligence

For many companies, predictive analytics is nothing new. But it is increasingly used by
various industries to improve everyday business operations and achieve a competitive
differentiation.

In practice, predictive analytics can take a number of different forms. Take these
scenarios for example.

Identify customers that are likely to abandon a service or product:

Consider a yoga studio that has implemented a predictive analytics model. The system
may identify that ‘Jane’ will most likely not renew her membership and suggest an
incentive that is likely to get her to renew based on historical data. The next time Jane
comes into the studio, the system will prompt an alert to the membership relations staff
to offer her an incentive or talk with her about continuing her membership. In this
example, predictive analytics can be used in real time to remedy customer churn before
it takes place.
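
A toy sketch of that churn idea follows. It is not the yoga studio's actual system: the
feature names and numbers are invented, and a simple scikit-learn logistic regression
stands in for whatever model a real deployment would use.

# Train on historical member data, then score current members for churn risk.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical data: [visits last month, months as member]; label 1 = did not renew
X_hist = np.array([[2, 3], [10, 24], [1, 6], [8, 12], [0, 2], [12, 36]])
y_hist = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X_hist, y_hist)

# Current members: flag likely non-renewals so staff can offer an incentive
X_now = np.array([[1, 4], [9, 18]])
for name, risk in zip(["Jane", "Raj"], model.predict_proba(X_now)[:, 1]):
    if risk > 0.5:
        print(f"Offer {name} a renewal incentive (churn risk {risk:.2f})")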

Send marketing campaigns to customers who are most likely to buy:

If your business only has a $5,000 budget for an upsell marketing campaign and you
have three million customers, you obviously can’t extend a 10 percent discount to each
customer.
Predictive analytics and business intelligence can help forecast the customers who
have the highest probability of buying your product, then send the coupon to only those
people to optimize revenue.
Improve customer service by planning appropriately:

Businesses can better predict demand using advanced analytics and business
intelligence. For example, consider a hotel chain that wants to predict how many
customers will stay in a certain location this weekend so they can ensure they have
enough staff and resources to handle demand.
Q6. Case Study - ABC ltd. is a company who is a maker of boutique leather articles. It
has been in business for last 20 years. It has implemented a CRM system 5 years back
and has transferred all the sales and customer data since inception. As a big data
consultant, chart out the Data -> information lifecycle for the organization and suggest a
suitable advertisement mix based on suitable assumptions (stating them).
Q7. List the components of Hadoop, explain its use.

Answer:

HDFS Components:

There are two major components of Hadoop HDFS:

1. NameNode
2. DataNode

Let’s now discuss these Hadoop HDFS Components-

NameNode

It is also known as the Master node. The NameNode does not store the actual data or dataset.
It stores metadata, i.e. the number of blocks, their locations, on which rack and on which
DataNode the data is stored, and other details. The namespace it manages consists of files
and directories.

Tasks of HDFS NameNode


• Manages the file system namespace.
• Regulates clients’ access to files.
• Executes file system operations such as naming, closing and opening files and
directories.

DataNode

It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual
data in HDFS. The DataNode performs read and write operations as per the requests of the
clients. Each block replica on a DataNode consists of two files on the local file system:
the first file holds the data and the second records the block’s metadata.

HDFS metadata includes checksums for the data. At start-up, each DataNode connects to
its corresponding NameNode and performs a handshake, during which the namespace ID and
the software version of the DataNode are verified. If a mismatch is found, the DataNode
shuts down automatically.

Tasks of HDFS DataNode

• DataNode performs operations like block replica creation, deletion, and
replication according to the instructions of the NameNode.
• DataNode manages the data storage of the system.
MapReduce

Hadoop MapReduce is the core Hadoop ecosystem component which provides data
processing. MapReduce is a software framework for easily writing applications that
process the vast amount of structured and unstructured data stored in the Hadoop
Distributed File system.

MapReduce programs are parallel in nature and are therefore very useful for performing
large-scale data analysis using multiple machines in the cluster. This parallel processing
improves the speed and reliability of the cluster.

Hadoop MapReduce

Working of MapReduce

Hadoop Ecosystem component ‘MapReduce’ works by breaking the processing into two
phases:
• Map phase
• Reduce phase

Each phase has key-value pairs as input and output. In addition, the programmer also
specifies two functions: the map function and the reduce function.

The Map function takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs).
The Reduce function takes the output from the Map as its input and combines those data
tuples based on the key, modifying the value of the key accordingly.
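
The word-count example below is a conceptual, single-machine sketch of those two phases in
plain Python; real MapReduce code would use the Hadoop APIs and run the same functions in
parallel across the cluster.

# Conceptual map/reduce word count (not Hadoop API code).
from collections import defaultdict

def map_phase(line):
    # Map: emit a (key, value) pair for each word
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Reduce: combine all values that share a key
    return (key, sum(values))

lines = ["big data needs big tools", "big data big value"]

grouped = defaultdict(list)          # the shuffle/sort step groups pairs by key
for line in lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

print([reduce_phase(k, v) for k, v in grouped.items()])  # e.g. ('big', 4), ('data', 2), ...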

Features of MapReduce

• Simplicity – MapReduce jobs are easy to run. Applications can be written in any
language such as Java, C++ and Python.

• Scalability – MapReduce can process petabytes of data.

• Speed – By means of parallel processing, problems that take days to solve can be
solved in hours or minutes by MapReduce.

• Fault Tolerance – MapReduce takes care of failures. If one copy of the data is
unavailable, another machine has a copy of the same key-value pairs, which can be
used to solve the same subtask.
Q8. What is HDFS and what is its fault tolerance?

Answer:

The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications. It employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance access to data
across highly scalable Hadoop clusters.

Fault tolerance in HDFS refers to the working strength of the system in unfavorable
conditions and how the system handles such situations. HDFS is highly fault tolerant.
It handles faults by the process of replica creation: replicas of the user’s data are
created on different machines in the HDFS cluster. So whenever any machine in the cluster
goes down, the data can still be accessed from the other machines on which the same copy
of the data was created.
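
The following toy Python sketch illustrates the replication idea only; it is not how HDFS
is implemented internally, and the node names and replication factor are made up for the
example.

# Toy model: each block is copied to several nodes, so reads survive a node failure.
REPLICATION_FACTOR = 3
nodes = {"node1": set(), "node2": set(), "node3": set(), "node4": set()}

def write_block(block_id):
    # Place copies of the block on the first N nodes (real HDFS picks racks/nodes smartly)
    for name in list(nodes)[:REPLICATION_FACTOR]:
        nodes[name].add(block_id)

def read_block(block_id, failed):
    # Try any node that holds a replica, skipping machines that are down
    for name, blocks in nodes.items():
        if name not in failed and block_id in blocks:
            return f"block {block_id} served from {name}"
    return "data unavailable"

write_block("B1")
print(read_block("B1", failed={"node1"}))  # still served from node2 or node3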
Q9. Hadoop is a great file system for running big data applications but it is
very costly, comment on the Truthfulness of this statement.

Answer:

Hadoop is not just a file system; it is an ecosystem (HDFS is only its storage layer).

It is not costly, as it runs on commodity hardware and its core software is open source.


Q10. What are the 2 types of subroutines/procedures? Explain with
examples.

Answer:

A subroutine is a section of code that can be re-used several times in the same
program. It is separate from the main code and has to be ‘called’ upon. In a game of
Mario you could imagine a subroutine as the part of the level that is reached by
travelling down a pipe. It is away from the main level / program, and once you have
gone through it you return to the program again (you can also re-visit it several times).
Subroutines are designed to be repeated and they have three key benefits:

1. Subroutines make programs more readable.
2. They reduce the duplication of code.
3. Complex problems are broken down into smaller chunks.

There are two types of subroutines, procedures and functions. A procedure just
executes commands, such as printing something a certain number of times. A function
produces information by receiving data from the main program and returning a value
back to the main program. For example, a function could take the radius of a sphere
from the main program and then calculate a sphere’s area and return the value of the
area back to the main program. A function generally requires parameters to work –
these are the values to be transferred from the main program to the subroutine.

Whenever we require code to be reused, we bundle it into subroutines.

Subroutines are of two types: one that returns a value (a function) and one that does
NOT return a value (a procedure).

Modularity and reusability are achieved using subroutines.

Predefined and user-defined functions and procedures is another classification, based
on the source (library routine or programmer-developed).
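
A minimal Python sketch of the two kinds of subroutines described above; the
banner-printing procedure and the sphere function are illustrative examples, not taken
from any particular program.

import math

def print_banner(times):           # procedure: performs a task, returns nothing
    for _ in range(times):
        print("=== report ===")

def sphere_area(radius):           # function: receives data and returns a value
    return 4 * math.pi * radius ** 2

print_banner(3)                    # called purely for its effect
area = sphere_area(2.0)            # called for its result
print(area)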
Q11. Explain with examples from Data Analytics the difference between
predefined and user defined subroutines.

Answer:

Using functions and procedures

In a computer program there are often sections of the program that we want to re-use or
repeat. Chunks of instructions can be given a name - they are
called functions and procedures.

Algorithms can be broken down into procedures or functions. This saves time by only
having to execute (call) the function when it is required, instead of having to type out the
whole instruction set.

Programming languages have a set of pre-defined (also known as built-in) functions and
procedures. If the programmer makes their own ones, they are custom-made or user-
defined.

Procedures or functions?

A procedure performs a task, whereas a function produces information.


Functions differ from procedures in that functions return values, unlike procedures
which do not. However, parameters can be passed to both procedures and functions.

In a program for drawing shapes, the program could ask the user what shape to draw.
The instructions for drawing a square could be captured in a procedure. The algorithm
for this action could be a set of tasks, such as these:

Repeat the next two steps four times: Draw a line of length n. Turn right by 90 degrees.
If this were a computer program, this set of instructions could be given the name
'square' and this sequence would be executed by running (calling) that procedure.

A function could calculate the VAT due on goods sold. The algorithm for this function
could be:

VAT equals (value_of_goods_sold * 0.2) Return VAT

If this were a computer program, this set of instructions could be given the name
'calculate_VAT' and would be executed by running (calling) that function.

In our example, the function would be called by using:


calculate_VAT(value_of_goods_sold)

The function would then return the value as VAT which is then used elsewhere.
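
A short Python sketch contrasting predefined and user-defined routines, reusing the VAT
example from above (a 20 percent rate is assumed, as in the text):

values_of_goods_sold = [120.0, 80.0, 45.5]

def calculate_VAT(value_of_goods_sold):        # user-defined function
    return value_of_goods_sold * 0.2

total_sales = sum(values_of_goods_sold)        # sum() and round() are predefined (built-in)
total_VAT = round(calculate_VAT(total_sales), 2)
print(total_VAT)

In data analytics work the same split appears constantly: library routines (such as a mean
or sum supplied by the tool) are predefined, while business-specific calculations such as
calculate_VAT are user-defined.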
Q12. What are the various types of decision making models
in data analytics and how are they related to the MIS, DSS and Expert
Systems?

Answer:

Types of data analytics

There are 4 types of analytics. Here, we start with the simplest one and go down to
more sophisticated. As it happens, the more complex an analysis is, the more value it
brings.

Descriptive analytics

Descriptive analytics answers the question of what happened. For instance, a
healthcare provider will learn how many patients were hospitalized last month; a retailer
– the average weekly sales volume; a manufacturer – the rate of products returned in the
past month, etc.

Let us also bring an example from our practice: a manufacturer was able to decide on
focus product categories based on the analysis of revenue, monthly revenue per
product group, income by product group, and the total quantity of metal parts produced
per month.

Descriptive analytics juggles raw data from multiple data sources to give valuable
insights into the past. However, these findings simply signal that something is wrong or
right, without explaining why. For this reason, highly data-driven companies do not
content themselves with descriptive analytics only, and prefer combining it with other
types of data analytics.
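
A minimal pandas sketch of descriptive analytics, summarising what happened (monthly
revenue per product group); the figures are invented for illustration.

import pandas as pd

sales = pd.DataFrame({
    "month":         ["Jan", "Jan", "Feb", "Feb"],
    "product_group": ["bags", "belts", "bags", "belts"],
    "revenue":       [12000, 4500, 13500, 5200],
})

# Describes the past; it does not explain why it happened or predict what comes next
print(sales.groupby(["month", "product_group"])["revenue"].sum())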

Diagnostic analytics

At this stage, historical data can be measured against other data to answer the question
of why something happened. Thanks to diagnostic analytics, there is a possibility to drill
down, to find out dependencies and to identify patterns. Companies go for diagnostic
analytics, as it gives in-depth insights into a particular problem. At the same time, a
company should have detailed information at their disposal, otherwise data collection
may turn out to be individual for every issue and time-consuming.
Let’s take another look at the examples from different industries: a healthcare provider
compares patients’ response to a promotional campaign in different regions; a retailer
drills the sales down to subcategories. Another flashback to our BI projects: in the
healthcare industry, customer segmentation coupled with several filters applied (like
diagnoses and prescribed medications) allowed measuring the risk of hospitalization.

Predictive analytics

Predictive analytics tells what is likely to happen. It uses the findings of descriptive and
diagnostic analytics to detect tendencies, clusters and exceptions, and to predict future
trends, which makes it a valuable tool for forecasting. Despite numerous advantages
that predictive analytics brings, it is essential to understand that forecasting is just an
estimate, the accuracy of which highly depends on data quality and stability of the
situation, so it requires a careful treatment and continuous optimization.
Thanks to predictive analytics and the proactive approach it enables, a telecom
company, for instance, can identify the subscribers who are most likely to reduce their
spend, and trigger targeted marketing activities to remediate; a management team can
weigh the risks of investing in their company’s expansion based on cash flow analysis
and forecasting. One of our case studies describes how advanced data
analytics allowed a leading FMCG company to predict what they could expect after
changing brand positioning.

Prescriptive analytics

The purpose of prescriptive analytics is to literally prescribe what action to take to
eliminate a future problem or take full advantage of a promising trend. An example of
prescriptive analytics from our project portfolio: a multinational company was able to
identify opportunities for repeat purchases based on customer analytics and sales
history.
This state-of-the-art type of data analytics requires not only historical data, but also
external information due to the nature of statistical algorithms. Besides, prescriptive
analytics uses sophisticated tools and technologies, like machine learning, business
rules and algorithms, which makes it sophisticated to implement and manage. That is
why, before deciding to adopt prescriptive analytics, a company should compare
required efforts vs. an expected added value.

Difference between DSS, MIS and Expert Systems


Another answer approach
Let us find out the characteristics of the three systems:

DSS (DECISION SUPPORT SYSTEM):

• DSS generally provide support for unstructured or semi-structured decisions
(decisions that cannot be described in detail).

• DSS problems are often characterized by incomplete or uncertain knowledge, or
the use of qualitative data.

• DSS will often include modelling tools, where various alternative
scenarios can be modelled and compared.

• Investment decisions are an example of those that might be supported by DSS.

MIS (MANAGEMENT INFORMATION SYSTEMS):

• MIS are generally more sophisticated reporting systems built on existing
transaction processing systems.

• They are often used to support structured decision making (decisions that can be
described in detail before the decision is made).

• They typically support tactical-level management, but are sometimes used at
other levels.

• Examples of structured decisions supported by MIS might include deciding on
stock levels or the pricing of products.

The same comparison, dimension by dimension:

Focus – DSS: analysis, decision support; MIS: information processing; EIS: status access

Typical users served – DSS: analysts, professionals, managers (via intermediaries);
MIS: middle and lower levels, sometimes senior executives; EIS: senior executives

Impetus – DSS: effectiveness; MIS: efficiency; EIS: expediency

Application – DSS: diversified areas where managerial decisions are made; MIS: production
control, sales forecasts, financial analysis, human resource management;
EIS: environmental scanning, performance evaluation, identifying problems and opportunities

Database(s) – DSS: special; MIS: corporate; EIS: special

Decision support capabilities – DSS: supports semi-structured and unstructured decision
making, mainly ad hoc but sometimes repetitive decisions; MIS: direct or indirect support,
mainly structured routine problems, using standard operations research and other models;
EIS: indirect support, mainly high-level and unstructured decisions and policies

Type of information – DSS: information to support specific situations; MIS: scheduled and
demand reports, structured flow, exception reporting, mainly internal operations;
EIS: news items, external information on customers, competitors and the environment

Principal use – DSS: planning, organizing, staffing and control; MIS: control;
EIS: tracking and control

Adaptability to individual users – DSS: permits individual judgment, what-if capabilities,
some choice of dialogue style; MIS: usually none, standardized; EIS: tailored to the
decision-making style of each individual executive, offers several options of outputs

Graphics – DSS: integrated part of many DSS; MIS: desirable; EIS: a must

User friendliness – DSS: a must where no intermediaries are used; MIS: desirable;
EIS: a must

Treatment of information – DSS: information provided by the EIS and/or MIS is used as an
input to the DSS; MIS: information is provided to a diversified group of users who then
manipulate or summarize it as needed; EIS: filters and compresses the information, tracks
critical data and information

Supporting detailed information – DSS: can be programmed into the DSS; MIS: inflexible
reports, cannot get the supporting details quickly; EIS: instant access to the supporting
details of any summary

Model base – DSS: the core of the DSS; MIS: standard models are available but are not
managed; EIS: can be added, but usually not included or limited in nature

Construction – DSS: by users, either alone or with specialists from IS or IC departments;
MIS: by vendors or IS specialists; EIS: by vendors or IS specialists

Hardware – DSS: mainframes, micros or distributed; MIS: mainframes, micros or distributed;
EIS: distributed system

Nature of computing packages – DSS: large computational capabilities, modelling languages
and simulation, applications and DSS generators; MIS: application oriented, performance
reports, strong reporting capabilities, standard statistical, financial, accounting and
management science models; EIS: interactive, easy access to multiple databases, on-line
access, sophisticated DBMS capabilities and complex linkages

EIS (EXECUTIVE INFORMATION SYSTEM):

• EIS support a range of decision making, but more often than not, this tends to be
unstructured

• EIS support the executive level of management, often used to formulate high level
strategic decisions impacting on the direction of the organization

• These systems will usually have the ability to extract summary data from internal
systems, along with external data that provides intelligence on the environment of the
organization

• Generally these systems work by providing a user friendly interface into other systems,
both internal and external to the organization
Related questions:
1. Difference between DSS, MIS and expert systems.
2. What are the decision-making models: scenario modelling, Goal Seek and Data Table?

Answer: MIS/Data Table gives only the report and tells people who can provide the
information.
DSS is where data is presented and a certain amount of support for decision making is
provided.
An Expert System does the calculation and returns a result.
Q13. Explain with a suitable example the various tasks for a business
analyst and the required skills for data analysis in a business
environment.

Answer:

Professional business analysts can play a critical role in a company's productivity,
efficiency, and profitability. Essential skills range from communication and interpersonal
skills to problem-solving and critical thinking.

Business analysts can hone their skills through executive education programs and
eventually earn a Certified Business Analysis Professional (CBAP) certification from the
International Institute of Business Analysis.

How to Use Skills Lists

When writing your resume, list relevant skills. Don’t assume hiring supervisors know you
have what they want.

When you find a job that appeals to you, read the job description thoroughly and
research the company. That way, you will know what to highlight in your cover letter,
based on what the business values.

The interviewer will want you to elaborate on the skills you bring to the table, so choose
three or four that relate to the position itself and be ready to share a few stories which
showcase your qualifications. It also may help to review the skills listed by job and types
of skills.

Core Skills

A number of skills are beneficial for business analysts, but there are a handful of
abilities that are absolutely necessary.

This is a rundown of those fundamental skills:

Communicating

Business analysts spend a significant amount of time interacting with clients, users,
management, and developers. Therefore, being an effective communicator is key. You
will be expected to facilitate work meetings, ask the right questions, and actively listen
to your colleagues to take in new information and build relationships. A project's
success may revolve around your ability to communicate things like project
requirements, changes, and testing results. In your interview, focus on your ability to
communicate proficiently in person, on conference calls, in meetings both digitally and
otherwise, and through email. Consider having an example ready that demonstrates
how being an effective communicator has served former employers well.

Problem-Solving

Every project you work on is, at its core, developing a solution to a problem. Business
analysts work to build a shared understanding of problems, outline the parameters of
the project, and determine potential solutions.

Negotiating

A business analyst is an intermediary between a variety of people with various types of
personalities: clients, developers, users, management, and information technology (IT).
You have to be able to achieve a profitable outcome for your company while finding a
solution for the client that makes them happy. This balancing act demands the ability to
influence a mutual solution while maintaining professional relationships.

Critical Thinking

Business analysts must assess multiple choices before leading the team toward a
solution. Effectively doing so requires a critical review of data, documentation, user
input surveys, and workflow. They ask probing questions until every issue is evaluated
in its entirety to determine the best conflict resolution.

General Skills

Besides the core skills, employers also will be looking for more general skills and
attributes:

Personal Attributes: Sought-after personal attributes include adaptability and the
ability to work in a fast-paced environment with cross-functional teams. You also should
hone analytical thinking, attention to detail, and creativity. Business analysts also are
equipped with strong organizational skills, the ability to multitask and be an assertive,
diplomatic leader.

Computer Skills: As a business analyst, you’ll need to be able to use many types of
software, from the popular Microsoft Office Suite to less common packages like
SharePoint, Visio, and Software Design Tools. You will need to stay abreast of new
developments in IT as well.

Analytical Skills: Of course, a business analyst needs analytical skills for everything
from the efficient design and implementation of processes to forecasting and gap analysis.
Q14. Explain with suitable example the concept of Internal and
External Data sources for performing data analysis in the business
environment.

Answer:

Internal data is information generated from within the business, covering areas such as
operations, maintenance, personnel, and finance. External data comes from the market,
including customers and competitors. It’s things like statistics from surveys,
questionnaires, research, and customer feedback.

Research has shown that business analysts consider data generated internally to be
more valuable. According to one survey, “About 65% of respondents rank internal data
as more important than data collected outside the company.”

Both kinds of data are helpful. Internal data helps you run your business and optimize
your operations. External data helps you better understand your customer base and the
competitive landscape. You need a clear view of both to have truly insightful business
intelligence.

Various types of data are very useful for business reports, and in business reports, you
will quickly come across things like revenue (money earned in a given period, usually a
year), turnover (people who left the organization in a given period), and many others.

There are a variety of data available when one is constructing a business report. We
may categorize data in the following manner:

Internal

Employee headcount
Employee demographics (e.g., sex, ethnicity, marital status)
Financials (e.g., revenue, profit, cost of goods sold, margin, operating ratio)

External

Number of vendors used
Number of clients in a company’s book of business
Size of the industry (e.g., number of companies, total capital)

Internal and external business or organizational data come in two main categories:
qualitative and quantitative.

Qualitative data are data that are generally non-numeric and require context, time, or
variance to have meaning or utility.
Examples: taste, energy, sentiments, emotions

Quantitative data are data that are numeric and therefore largely easier to understand.

Example: temperature, dimensions (e.g., length), prices, headcount, stock on hand

Both types of data are useful for business report writing. Usually a report will feature as
much “hard” quantitative data as possible, typically in the form of earnings or revenue,
headcount, and other numerical data available. Most organizations keep a variety of
internal quantitative data.

Qualitative data, such as stories, case studies, or narratives about processes or events,
are also very useful, and provide context. We may consider that a good report will have
both types of data, and a good report writer will use both types of data to build a picture
of information for their readers.
Q15. What is granularity (Explain it along the lines of roll-up and drill-
down) of data and how does it affect the data → Information cycle?

Answer:

When designing the data warehouse, one of the most basic concepts is that of storing
data at the lowest level of granularity. By storing data at the lowest level of granularity,
the data can be reshaped to meet different needs – of the finance department, of the
marketing department, of the sales department, and so forth. Granular data can be
summarized, aggregated, broken into many different subsets and so forth. There are
indeed many good reasons for storing data in the data warehouse at the lowest level of
granularity.

And why does data need to be broken into low levels of granularity? The answer is that
most data warehouse data comes from transactions. And typically, transactions contain
data that is very denormalized. Denormalized data is at a high level of granularity.
Let’s take a look at a typical transaction.

The typical transaction may have data such as:


• the date of the transaction,
• the item being purchased,
• the terms of the purchase,
• the person making the purchase,
• the location where the transaction was made,
• the price of the transaction, and
• the salesperson.

All of the data that has been brought to bear on the transaction is natural and normal.
Naturally enough, the data in the transaction focuses on the transaction itself. At the
same time, the data in the transaction is very denormalized.

Roll Up and Drill Down:

1) Roll-up:
Roll-up is also known as "consolidation" or "aggregation." The roll-up operation can be
performed in two ways:
1. Reducing dimensions
2. Climbing up a concept hierarchy. A concept hierarchy is a system of grouping things
based on their order or level.
Consider the following diagram:
• In this example, the cities New Jersey and Los Angeles are rolled up into the country
USA.
• The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively.
They become 2000 after the roll-up.
• In this aggregation process, data in the location hierarchy moves up from the city to
the country level.
• In the roll-up process at least one dimension needs to be removed. In
this example, the Quarter dimension is removed.

2) Drill-down
In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up
process. It can be done via:
• Moving down the concept hierarchy
• Adding a dimension
Consider the diagram above:
• Quarter Q1 is drilled down to the months January, February, and March. The
corresponding sales figures are also recorded.
• In this example, the month dimension is added.
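
A hedged pandas sketch of the two operations, using the figures mentioned above
(New Jersey 440 + Los Angeles 1560 rolled up to 2000 for USA); the monthly split used for
the drill-down is invented.

import pandas as pd

sales = pd.DataFrame({
    "country": ["USA", "USA"],
    "city":    ["New Jersey", "Los Angeles"],
    "quarter": ["Q1", "Q1"],
    "amount":  [440, 1560],
})

# Roll-up: climb the location hierarchy (city -> country), dropping a dimension
print(sales.groupby("country")["amount"].sum())        # USA: 2000

# Drill-down: add a finer dimension (months inside the quarter)
monthly = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1"],
    "month":   ["Jan", "Feb", "Mar"],
    "amount":  [700, 600, 700],                         # invented monthly split
})
print(monthly.groupby(["quarter", "month"])["amount"].sum())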

Granularity means the level of detail of your data within the data structure. In a
typical Data Warehouse one might find very detailed data (such as
seconds, single product, one specific attribute) and aggregated data (such
as total number of, monthly orders, all products).
The higher the granularity of a fact table, the more data (or, in an Excel sheet, rows) you
will have. But the granularity of your data also determines what kind of information you
can get out of the stored data: to aggregate data you of course need the appropriate
granularity. (A weekly report can only be generated when time-related data is stored at
least at the level of a “week”; better still is to have “day”.)
Q16. Difference between transactional system and big data system.

Answer: Both systems are DBMS based.

A transactional system has insert, update and select operations.

In the case of a big data system we mainly have insert and select.

A transactional system follows a normalized form.

A big data system keeps historical data so as to analyse it.

Transactional systems are single-dimensional.

Big data systems are multidimensional, combining for example accounting data, production
data and sales data.

Transactional systems cater to B2C.
Big data systems cater to B2B.

Faster/better hardware is required for big data.


Q17. How is the emergence of Cloud Technologies related to the
growth in BIG DATA?

Answer:

How is Big Data Related to Cloud Computing?

Cloud computing enables the “as-a-service”
pattern by abstracting the challenges and complexity behind a scalable and elastic self-
service application. The big data requirement is similar: the distributed processing of
massive data is abstracted from the end users.

There are multiple benefits of Big data analysis in Cloud.

Improved analysis

With the advancement of Cloud technology, big data analysis has become more
improved causing better results. Hence, companies prefer to perform big data analysis
in the Cloud. Moreover, Cloud helps to integrate data from numerous sources.

Simplified Infrastructure

Big data analysis puts tremendous strain on infrastructure, as the data comes in
large volumes with varying speeds and types that traditional infrastructures usually
cannot keep up with. As cloud computing provides flexible infrastructure, which we
can scale according to the needs of the moment, it becomes easy to manage workloads.

Lowering the cost

Both big data and Cloud technology deliver value to organizations by reducing the cost of
ownership. The pay-per-use model of Cloud turns CAPEX into OPEX. On the other
hand, Apache Hadoop has cut down the licensing cost of big data platforms, which would
otherwise cost millions to build and buy. Cloud enables customers to process big data
without large-scale big data resources of their own. Hence, both big data and Cloud
technology drive costs down for the enterprise and bring value to it.

Security and Privacy

Data security and privacy are two major concerns when dealing with enterprise data.
Moreover, when your application is hosted on a Cloud platform, security becomes a primary
concern due to the platform’s open environment and limited user control. On the other
hand, being open source, big data solutions like Hadoop use a lot of
third-party services and infrastructure. Hence, nowadays system integrators offer
private Cloud solutions that are elastic and scalable, and that also leverage
scalable distributed processing.

Besides that, Cloud data is stored and processed in a central location, commonly known
as the Cloud storage server. Along with this, the service provider and the customer sign a
service level agreement (SLA) to build trust between them. If required, the provider
also puts in place the necessary advanced levels of security control.
Q18. What are IOT Devices and how they are related to Big Data and
Cloud Technologies?

Answer:

In order to understand the relationship between big data, IoT and cloud computing, we
might need to rearrange the order. The interconnection that would then be established
would paint the bigger picture for you to understand.

First off, IoT is an ecosystem of interconnected devices. Basically, it is a network
of devices, each with its own IP address, capable of generating,
transmitting and receiving data without human intervention. IoT is the
abbreviated form of ‘Internet of Things’. It makes one wonder, “where does all
this data get processed then?”

This is where big data steps in. Big data is the term coined for data sets so humongous –
such as the trillions of data units generated by IoT devices – that they need a dedicated
processing approach. Contrary to the common misconception, big data is not some sort of
database, but a software ecosystem. This then leads one to the next question: “what about
the infrastructure and the expenses involved in setting up such massive machines of data
processing?”

The solution to that is cloud computing. With cloud computing, you are just a click away
from accessing your data, from anywhere in the world, within a second or even less.

This not only saves the space needed for infrastructure, but also cuts down on the expense
of maintaining it.

And this is how IoT, big data and cloud computing are connected.
