
INTRODUCTION

Data processing is any computer process that converts data into information.
The processing is usually assumed to be automated and running on a mainframe,
minicomputer, microcomputer, or personal computer. Because data are most useful when
well presented and actually informative, data-processing systems are often referred to as
information systems to emphasize their practicality. Nevertheless, both terms are roughly
synonymous, performing similar conversions: data-processing systems typically
manipulate raw data into information, and likewise information systems typically take
raw data as input to produce information as output. To better market their profession, a
computer programmer or systems analyst who might once have referred (during the
1970s, for example) to the computer systems they produce as data-processing systems
now more often refers to them by some other term that includes the word information,
such as information systems, information technology systems, or management
information systems.

In the context of data processing, data are defined as numbers or characters that
represent measurements from the real world. A single datum is a single measurement
from the real world. Measured information is then algorithmically derived and/or
logically deduced and/or statistically calculated from multiple data. Information is
defined as either a meaningful answer to a query or a meaningful stimulus that can
cascade into further queries.

More generally, the term data processing can apply to any process that converts
data from one format to another, although data conversion would be the more logical and
correct term. From this perspective, data processing becomes the process of converting
information into data and also the converting of data back into information. The
distinction is that conversion doesn't require a question (query) to be answered. For
example, information in the form of a string of characters forming a sentence in English
is converted or encoded from a keyboard's key-presses, as represented by
hardware-oriented integer codes, into ASCII integer codes, after which it may be more
easily processed by a computer (not as merely raw, amorphous integer data, but as a
meaningful character in a natural language's set of graphemes) and finally converted or
decoded to be displayed as characters, represented by a font on the computer display. In
that example we can see the stage-by-stage conversion of the presence and then
absence of electrical conductivity in the key-press and subsequent release at the keyboard
from raw, substantially meaningless, hardware-oriented integer data to ever more
meaningful information as the processing proceeds toward the human being.
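The keyboard example can be made concrete with a short sketch. The scancode values below are invented for illustration (real hardware codes vary by keyboard); the point is the conversion from raw integers to ASCII codes to displayable text.

```python
# Hypothetical illustration of the key-press -> ASCII -> display chain described above.
# The scancode values are invented for the example; real keyboards use
# hardware-specific codes.
SCANCODE_TO_CHAR = {0x23: 'h', 0x17: 'i'}   # raw, hardware-oriented integers

def decode_keypresses(scancodes):
    """Convert raw scancodes into ASCII codes, then into displayable text."""
    ascii_codes = [ord(SCANCODE_TO_CHAR[code]) for code in scancodes]  # encode step
    return ''.join(chr(code) for code in ascii_codes)                  # decode step

print(decode_keypresses([0x23, 0x17]))   # -> "hi": meaningless integers become meaningful text
```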

A more conventional example of the established practice of using the term data
processing is that a business has collected numerous data concerning an aspect of its
operations and that this multitude of data must be presented in meaningful, easy-to-access
presentations for the managers who must then use that information to increase revenue or
to decrease cost. That conversion and presentation of data as information is typically
performed by a data-processing application.

Definition of Data Processing

Data processing can be defined as the conversion of raw data to meaningful
information through a process. Data is manipulated to produce results that lead to a
resolution of a problem or improvement of an existing situation. Similar to a production
process, it follows a cycle where inputs (raw data) are fed to a process (computer
systems, software, etc.) to produce output (information and insights).

Generally, organizations employ computer systems to carry out a series of
operations on the data in order to present, interpret, or obtain information. The process
includes activities like data entry, summary, calculation, storage, etc. Useful and
informative output is presented in various appropriate forms such as diagrams, reports,
graphics, etc.

Objectives

After going through this lesson, you will be in a position to:
- define the concepts of data, information and data processing
- explain various data processing activities
- utilise the data processing cycle
- explain data elements, records, files and databases.

Data

The word "data" is the plural of datum, which means fact, observation,
assumption or occurrence. More precisely, data are representations of facts pertaining to
people, things, ideas and events. Data are represented by symbols such as letters of the
alphabet, numerals or other special symbols.

Data Processing

Data processing is the act of handling or manipulating data in some fashion.
Regardless of the activities involved in it, processing tries to assign meaning to data.
Thus, the ultimate goal of processing is to transform data into information. Data
processing is the process through which facts and figures are collected, assigned
meaning, communicated to others and retained for future use. Hence we can define data
processing as a series of actions or operations that converts data into useful information.
We use the term 'data processing system' to include the resources that are used to
accomplish the processing of data.

Information

Information, thus, can be defined as data that has been transformed into a
meaningful and useful form for specific purposes. In some cases data may not require
any processing before constituting information. However, generally, data is not useful
unless it is subjected to a process through which it is manipulated and organised, its
contents analyzed and evaluated. Only then does data become information.
HISTORY

Manual data processing

Although widespread use of the term data processing dates only from the
nineteen-fifties,[3] data processing functions have been performed manually for millennia.
For example, bookkeeping involves functions such as posting transactions and producing
reports like the balance sheet and the cash flow statement. Completely manual methods
were augmented by the application of mechanical or electronic calculators. A person
whose job was to perform calculations manually or using a calculator was called a
"computer."

The 1850 United States Census schedule was the first to gather data by
individual rather than household. A number of questions could be answered by making a
check in the appropriate box on the form. From 1850 through 1880 the Census Bureau
employed "a system of tallying, which, by reason of the increasing number of
combinations of classifications required, became increasingly complex. Only a limited
number of combinations could be recorded in one tally, so it was necessary to handle the
schedules 5 or 6 times, for as many independent tallies." It took over 7 years to publish
the results of the 1880 census using manual processing methods.

Automatic data processing

The term automatic data processing was applied to operations performed by
means of unit record equipment, such as Herman Hollerith's application of punched card
equipment for the 1890 United States Census. "Using Hollerith's punchcard equipment,
the Census Office was able to complete tabulating most of the 1890 census data in 2 to 3
years, compared with 7 to 8 years for the 1880 census.... It is also estimated that using
Herman Hollerith's system saved some $5 million in processing costs" [5] (in 1890 dollars)
even with twice as many questions as during 1880.
Electronic data processing

Computerized data processing, or electronic data processing, represents a
later development, with a computer used instead of several independent pieces of
equipment. The Census Bureau first made limited use of electronic computers for the
1950 United States Census, using a UNIVAC I system,[4] delivered during 1952.

Other developments

The term data processing has mostly been subsumed by the newer and
somewhat more general term information technology (IT).[citation needed] The term "data
processing" is presently sometimes considered to have a negative connotation, suggesting
use of older technologies. As an example, during 1996 the Data Processing Management
Association (DPMA) changed its name to the Association of Information Technology
Professionals. Nevertheless, the terms are approximately synonymous.
DATA PROCESSING ACTIVITIES

As discussed above, data processing consists of those activities which are
necessary to transform data into information. Man has, in the course of time, devised certain
tools to help him in processing data. These include manual tools such as pencil and paper,
mechanical tools such as filing cabinets, electromechanical tools such as adding machines
and typewriters, and electronic tools such as calculators and computers. Many people
immediately associate data processing with computers. As stated above, a computer is not
the only tool used for data processing; it can be done without computers as well. However,
computers have outperformed people for certain tasks. There are other tasks for
which the computer is a poor substitute for human skill and intelligence.

Regardless of the type of equipment used, the various functions and activities which need to
be performed for data processing can be grouped under five basic categories.

COLLECTION

Data originates in the form of events, transactions or observations. This data
is then recorded in some usable form. Data may be initially recorded on paper source
documents and then converted into a machine-usable form for processing.
Alternatively, it may be recorded by a direct input device in a paperless, machine-
readable form. Data collection is also termed data capture.
CONVERSION

Once the data is collected, it is converted from its source documents to a form
that is more suitable for processing. The data is first codified by assigning identification
codes. A code comprises numbers, letters, special characters, or a combination of
these. For example, an employee may be allotted a code as 52-53-162, his category as A
class, etc. It is useful to codify data when it requires classification. To classify means
to categorize, i.e., data with similar characteristics are placed in similar categories or
groups. For example, one may like to arrange accounts data according to account number
or date. Hence a balance sheet can easily be prepared.

After classification of data, it is verified or checked to ensure its accuracy before
processing starts.

After verification, the data is transcribed from one data medium to another. For
example, in case data processing is done using a computer, the data may be transformed
from source documents to machine-sensible form using magnetic tape or a disk.

MANIPULATION

Once data is collected and converted, it is ready for the manipulation function
which converts data into information. Manipulation consists of following activities:

Sorting

It involves the arrangement of data items in a desired sequence. Usually, it is
easier to work with data if it is arranged in a logical sequence. Most often, the data are
arranged in alphabetical sequence. Sometimes sorting itself will transform data into
information. For example, a simple act of sorting the names in alphabetical order gives
meaning to a telephone directory. The directory will be practically worthless without
sorting.
Business data processing extensively utilises sorting technique. Virtually all the
records in business files are maintained in some logical sequence. Numeric sorting is
common in computer-based processing systems because it is usually faster than
alphabetical sorting.
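A minimal sketch of the telephone-directory example, with made-up entries: unsorted, the list is hard to use; sorted by name, it becomes a usable directory.

```python
# Sorting as data processing: an unsorted list of subscribers becomes a usable
# directory once it is arranged alphabetically by name.
directory = [
    ("Mehta, R.", "555-0143"),
    ("Anand, K.", "555-0190"),
    ("Iyer, S.",  "555-0172"),
]

for name, phone in sorted(directory, key=lambda entry: entry[0]):
    print(f"{name:<12} {phone}")
```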

Calculating

Arithmetic manipulation of data is called calculating. Items of recorded data can
be added to one another, subtracted, divided or multiplied to create new data as shown in
fig. 2.2(a). Calculation is an integral part of data processing. For example, in calculating
an employee's pay, the hours worked multiplied by the hourly wage rate gives the gross
pay. Based on total earnings, income-tax deductions are computed and subtracted from
gross-pay to arrive at net pay.
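A short sketch of that payroll calculation; the hours, wage rate, and flat tax rate below are illustrative assumptions, not actual payroll rules.

```python
# Gross-to-net pay calculation as described above. The tax rate is an
# illustrative assumption, not an actual payroll rule.
hours_worked = 160
hourly_rate = 12.50
tax_rate = 0.10          # assumed flat income-tax rate for the example

gross_pay = hours_worked * hourly_rate
income_tax = gross_pay * tax_rate
net_pay = gross_pay - income_tax

print(f"Gross: {gross_pay:.2f}  Tax: {income_tax:.2f}  Net: {net_pay:.2f}")
```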

Summarizing

To summarize is to condense or reduce masses of data to a more usable and
concise form as shown in fig. 2.2(b). For example, you may summarize a lecture attended
in a class by writing small notes in one or two pages. When the data involved is numbers,
you summarize by counting or accumulating the totals of the data in a classification or by
selecting strategic data from the mass of data being processed. For example, the
summarizing activity may provide a general manager with sales-totals by major product
line, the sales manager with sales totals by individual salesman as well as by the product
line and a salesman with sales data by customer as well as by product line.
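A brief sketch of the summarizing activity, accumulating made-up sale records into totals by product line and by salesman:

```python
# Summarizing: condensing individual sale records into totals by product line
# and by salesman, as in the example above. The records are made up.
from collections import defaultdict

sales = [
    {"product_line": "Books",       "salesman": "Rao",   "amount": 1200},
    {"product_line": "Electronics", "salesman": "Gupta", "amount": 4500},
    {"product_line": "Books",       "salesman": "Gupta", "amount": 800},
]

by_product_line = defaultdict(int)
by_salesman = defaultdict(int)
for sale in sales:
    by_product_line[sale["product_line"]] += sale["amount"]
    by_salesman[sale["salesman"]] += sale["amount"]

print(dict(by_product_line))   # totals for the general manager
print(dict(by_salesman))       # totals for the sales manager
```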

Comparing

To compare data is to perform an evaluation in relation to some known measure.
For example, business managers compare data to discover how well their companies are
doing. They may compare current sales figures with those for last year to analyze the
performance of the company in the current month.
MANAGING THE OUTPUT RESULTS

Once data has been captured and manipulated, the following activities may be
carried out:

Storing

To store is to hold data for continued or later use. Storage is essential for any
organised method of processing and re-using data. The storage mechanisms for data
processing systems are file cabinets in a manual system, and electronic devices such as
magnetic disks/magnetic tapes in the case of a computer-based system. The storing activity
involves storing data and information in an organised manner in order to facilitate the
retrieval activity. Of course, data should be stored only if the value of having them in
future exceeds the storage cost.

Retrieving

To retrieve means to recover or find again the stored data or information.
Retrieval techniques use data storage devices. Thus data, whether in file cabinets or in
computers can be recalled for further processing. Retrieval and comparison of old data
gives meaning to current information.

COMMUNICATION

Communication is the process of sharing information. Unless the information is
made available to the users who need it, it is worthless. Thus, communication involves
the transfer of data and information produced by the data processing system to the
prospective users of such information or to another data processing system. As a result,
reports and documents are prepared and delivered to the users. In electronic data
processing, results are communicated through display units or terminals.
STAGES OF DATA PROCESSING

1. Collection

Collection is the first stage of the cycle, and is very crucial, since the
quality of data collected will impact heavily on the output. The collection process needs
to ensure that the data gathered are both defined and accurate, so that subsequent
decisions based on the findings are valid. This stage provides both the baseline from
which to measure, and a target on what to improve.

Some types of data collection include census (data collection about
everything in a group or statistical population), sample survey (collection method that
includes only part of the total population), and administrative by-product (data collection
is a by-product of an organization's day-to-day operations).

2. Preparation

Preparation is the manipulation of data into a form suitable for further
analysis and processing. Raw data cannot be processed and must be checked for accuracy.
Preparation is about constructing a dataset from one or more data sources to be used for
further exploration and processing. Analyzing data that has not been carefully screened
for problems can produce highly misleading results that are heavily dependent on the
quality of data prepared.

3. Input

Input is the task where verified data is coded or converted into machine
readable form so that it can be processed through a computer. Data entry is done through
the use of a keyboard, digitizer, scanner, or data entry from an existing source. This time-
consuming process requires speed and accuracy. Most data need to follow a formal and
strict syntax, since a great deal of processing power is required to break down the complex
data at this stage. Due to the costs, many businesses resort to outsourcing this stage.
4. Processing

Processing is when the data is subjected to various means and methods of
manipulation; it is the point where a computer program is being executed, and it contains the
program code and its current activity. The process may be made up of multiple threads of
execution that simultaneously execute instructions, depending on the operating system.
While a computer program is a passive collection of instructions, a process is the actual
execution of those instructions. Many software programs are available for processing
large volumes of data within very short periods.

5. Output and interpretation

Output and interpretation is the stage where the processed information is
transmitted to the user. Output is presented to users in various report formats, such as a
printed report, audio, video, or on a monitor. Output needs to be interpreted so that it can
provide meaningful information that will guide future decisions of the company.

6. Storage

Storage is the last stage in the data processing cycle, where data,
instructions and information are held for future use. The importance of this cycle is that it
allows quick access and retrieval of the processed information, allowing it to be passed
on to the next stage directly, when needed. Every computer uses storage to hold system
and application software.

Data processing is a series of steps carried out to extract information
from raw data. Although each step must be taken in order, the order is cyclic. The output
and storage stage can lead to the repeat of the data collection stage, resulting in another
cycle of data processing. The cycle provides a view of how the data travels and
transforms from collection to interpretation and, ultimately, to its use in effective business
decisions.
BASIC PRINCIPLES OF DATA PROCESSING

Processing of data is any operation or set of operations performed upon data,
whether or not by automatic means, such as collection, recording, organization, storage,
alteration, restoration, retrieval, use or disclosure through transfer, dissemination or
otherwise making data available, alignment or combination, blocking, erasure or
destruction.

Data processing is any operation performed upon data. For example: running
the list of the participants of a meeting, recording data collected during entrances to and
exits from the premises, collecting information on contractor company directors or
persons having insurance, recording customer contact information by shops, etc.

Principles of Data Processing

While processing personal data, the following principles shall be applied:

Fair and Lawful Processing - Personal data shall be processed fairly and lawfully
without violating the dignity of data subjects;

Existence of Legitimate Purpose - Personal data shall be processed for specified,
explicit and legitimate purposes. It is necessary to define the specific purpose prior
to the processing of the data;

Adequacy and Proportionality - Personal data shall be adequate, relevant and
not excessive in relation to the purpose for which they are collected and/or further
processed;

Accuracy of the Data - Personal data must be accurate and kept up to date. The data
controller shall take necessary measures to ensure the data accuracy; credibility of
the source from which data is obtained shall be verified, inaccurate and invalid
data shall be erased;
Storage of the Data - Personal data may only be stored for as long as necessary to
achieve the purpose for which they were collected or further processed.

The Grounds for the Data Processing

The personal data can only be processed if one of the following grounds exists:

Consent of the Data Subject - Data might be processed if the data subject
has given unambiguous, freely given, explicit and informed consent to the processing
of his/her personal data for specific purposes and to a specific extent;

Permitted by Law - Data can be processed if it is directly prescribed by law;

Legal Obligations - When the data processing is necessary for the purposes of
fulfilling the obligations of the data controller prescribed by the legislation;

Vital Interest - Data processing is necessary to protect the vital interests of the
data subject;

Protection of the Legal Interest - The data might be processed for protection of
the legal interest of the data controller or a third party, except where such interest
is overridden by the interest of the protection of the fundamental rights and
freedoms of the data subject;

Publicity of the Data - Covers the situations where publicity of the data is
prescribed by law or if the data subject makes them publicly available;

Protection of the Public Interest - Personal data can be processed in order to
protect significant public interest;

Handling an Application - Covers situations when the data is processed for the
purposes of providing service to the data subject or dealing with his/her
application.
Processing of the Sensitive Data

It is prohibited to process the sensitive personal data except the following cases:

The Written Consent of the Data Subject - The consent shall be explicitly
provided for in the document and the free will of the data subject to process
his/her specific data for specific purposes shall be specified;

Publicity of the Data - If the data subject has made the data public without evidently
or explicitly restricting their use;

Employment Purposes - If the processing is necessary for the purposes of
fulfilling the obligations and specific rights of a data controller in the field of
employment;

Vital Interest - If the processing is necessary to protect the vital interests of the
data subject or of another person where the data subject is physically or legally
incapable of giving his or her consent.

Healthcare Purposes - If the data are processed by public health protection bodies, as well
as other health-care institutions (staff members), for the purposes of protecting the
health of the public and individuals and the management or operation of health
services;

Processing of the Data by a Political, Philosophical or Religious Organization,
Trade Union or Other Non-Commercial Organization Within the Scope of Its Legal
Activities - This covers the situations when processing is carried out in the course
of its legitimate activities by a foundation, association or any other non-profit-
seeking body with a political, philosophical, religious or trade-union aim and on
condition that the processing relates solely to the members of the body or to
persons having regular contact with it. In such cases the disclosure of the data to
third parties is prohibited unless the data subject consents.

INTRODUCTION

Amazon.com, Inc. is an American electronic commerce company with
headquarters in Seattle, Washington. It is the largest Internet-based retailer in the United
States. Amazon.com started as an online bookstore, but soon diversified,
selling DVDs, Blu-rays, CDs, video downloads/streaming, MP3 downloads/streaming,
software, video games, electronics, apparel, furniture, food, toys and jewellery. The company
also produces consumer electronics, notably Amazon Kindle e-book readers,
Fire tablets, Fire TV and Fire Phone, and is a major provider of cloud
computing services. Amazon also sells certain low-end products like USB cables under
its in-house brand Amazon Basics.

Amazon has separate retail websites for the United States, United Kingdom &
Ireland, France, Canada, Germany, The Netherlands, Italy, Spain, Australia, Brazil, Japan,
China, India and Mexico. Amazon also offers international shipping to certain other
countries for some of its products. In 2017, it had professed an intention to launch its
websites in Poland and Sweden.
HISTORY

The company was founded in 1994, spurred by what Bezos called his "regret
minimization framework", which described his efforts to fend off any regrets for not
participating sooner in the Internet business boom during that time. In 1994, Bezos left
his employment as vice-president of D. E. Shaw & Co., a Wall Street firm, and moved to
Seattle. He began to work on a business plan for what would eventually become
Amazon.com.

Jeff Bezos incorporated the company as "Cadabra" on July 5, 1994 and the site
went online as Amazon.com in 1995. Bezos changed the name cadabra.com to
amazon.com because it sounded too much like cadaver. Additionally, a name beginning
with "A" was preferential due to the probability it would occur at the top of any list that
was alphabetized.

Bezos selected the name Amazon by looking through the dictionary, and settled
on "Amazon" because it was a place that was "exotic and different" just as he planned for
his store to be; the Amazon River, he noted, was by far the "biggest" river in the world, and
he planned to make his store the biggest in the world. Bezos placed a premium on his
head start in building a brand, telling a reporter, "There's nothing about our model that
can't be copied over time. But you know, McDonald's got copied. And it still built a huge,
multibillion-dollar company. A lot of it comes down to the brand name. Brand names are
more important online than they are in the physical world."

After reading a report about the future of the Internet which projected annual
Web commerce growth at 2,300%, Bezos created a list of 20 products which could be
marketed online. He narrowed the list to what he felt were the five most promising
products which included: compact discs, computer hardware, computer software, videos,
and books. Bezos finally decided that his new business would sell books online, due to
the large world-wide demand for literature, the low price points for books, along with the
huge number of titles available in print. Amazon was originally founded in Bezos' garage
in Bellevue, Washington.

The company began as an online bookstore, an idea spurred by discussions
with John Ingram of Ingram Book (now called Ingram Content Group), along
with Keyur Patel, who still holds a stake in Amazon. In the first two months of business,
Amazon sold to all 50 states and over 45 countries. Within two months, Amazon's sales
were up to $20,000/week. While the largest brick-and-mortar bookstores and mail-
order catalogs might offer 200,000 titles, an online bookstore could "carry" several times
more, since it would have an almost unlimited virtual (not actual) warehouse: those of the
actual product makers/suppliers.

Since June 19, 2000, Amazon's logotype has featured a curved arrow leading
from A to Z, representing that the company carries every product from A to Z, with the
arrow shaped like a smile.

Amazon was incorporated in 1994, in the state of Washington. In July 1995,
the company began service and sold its first book on Amazon.com: Douglas
Hofstadter's Fluid Concepts and Creative Analogies: Computer Models of the
Fundamental Mechanisms of Thought. In October 1995, the company announced itself to
the public. In 1996, it was reincorporated in Delaware. Amazon issued its initial public
offering of stock on May 15, 1997, trading under the NASDAQ stock exchange
symbol AMZN, at a price of US$18.00 per share ($1.50 after three stock splits in the late
1990s).

Amazon's initial business plan was unusual; it did not expect to make a profit
for four to five years. This "slow" growth caused stockholders to complain about the
company not reaching profitability fast enough to justify investing in, or to even survive
in the long-term. When the dot-com bubble burst at the start of the 21st century,
destroying many e-companies in the process, Amazon survived, and grew on past the
bubble burst to become a huge player in online sales. It finally turned its first profit in the
fourth quarter of 2001: $5 million (i.e., 1¢ per share), on revenues of more than $1
billion. This profit margin, though extremely modest, proved to skeptics that Bezos'
unconventional model could succeed. In 1999, Time magazine named Bezos the Person
of the Year, recognizing the company's success in popularizing online shopping.

Barnes & Noble sued Amazon on May 12, 1997, alleging that Amazon's
claim to be "the world's largest bookstore" was false. Barnes and Noble asserted, "[It]
isn't a bookstore at all. It's a book broker." The suit was later settled out of court, and
Amazon continued to make the same claim. Wal-Mart sued Amazon on October 16,
1998, alleging that Amazon had stolen Wal-Mart's trade secrets by hiring former Wal-
Mart executives. Although this suit was also settled out of court, it caused Amazon to
implement internal restrictions and the reassignment of the former Wal-Mart executives.
COMPANY VISION & MISSION

Amazon Vision Statement

We have provided below the content of the Amazon Vision Statement which
details their outlook of the future. Effective and successful statements are powerful and
compelling, conveying confidence and inspiring views of the future. The importance of
these types of statements should not be underestimated. One good paragraph will
describe the values, services and vision for the future. Whether you are looking to
compose a personal or a company vision statement our samples and this example of the
Amazon vision statement will provide you with some excellent ideas and inspiration.

Our vision is to be earth's most customer centric company; to build a place where
people can come to find and discover anything they might want to buy online.

The Mission Statement of Amazon.com:

Amazon.com has had a clear focus and a solitary mission since it began. Founder Jeff
Bezos has publicly referred to the Amazon.com mission statement as the guiding force
behind his leadership decisions many times in the company's 18-year history. It can be
concluded that the success of Amazon.com as the top Internet retailing company in the
world is due at least in part to their unwavering commitment to this mission and the daily
execution of it. The mission and vision of Amazon.com is...

Our vision is to be earth's most customer centric company; to build a place where people
can come to find and discover anything they might want to buy online.
STAGES OF DATA PROCESSING IN AMAZON.COM

DATA ANALYTICS FRAMEWORKS (Managed, distributed computing for data)

HADOOP & SPARK (Amazon EMR)

Amazon EMR provides a managed Hadoop framework that makes it easy,
fast, and cost-effective to process vast amounts of data across dynamically scalable
Amazon EC2 instances. You can also run other popular distributed frameworks such as
Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other
AWS data stores such as Amazon S3 and Amazon DynamoDB.

Amazon EMR securely and reliably handles a broad set of big data use cases,
including log analysis, web indexing, data transformations (ETL), machine learning,
financial analysis, scientific simulation, and bioinformatics.
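As a rough illustration, a cluster like the one described above could be launched programmatically with the AWS SDK for Python (boto3); the region, release label, instance types, roles, and log bucket below are placeholders, and the default EMR roles are assumed to already exist in the account.

```python
# A minimal boto3 sketch of launching an EMR cluster with Spark installed.
# Region, instance types, roles, and the log bucket are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-data-processing-cluster",
    ReleaseLabel="emr-5.20.0",                     # assumed release label
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    LogUri="s3://example-bucket/emr-logs/",        # placeholder bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])
```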

INTERACTIVE QUERY SERVICE (Amazon Athena)

Amazon Athena is an interactive query service that makes it easy to analyze
data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure
to manage, and you pay only for the queries that you run.

Athena is easy to use. Simply point to your data in Amazon S3, define the
schema, and start querying using standard SQL. Most results are delivered within
seconds. With Athena, there's no need for complex ETL jobs to prepare your data for
analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale
datasets.

Start Querying Instantly (Serverless. No ETL.)


Athena is serverless. You can quickly query your data without having to set up
and manage any servers or data warehouses. Just point to your data in Amazon S3, define
the schema, and start querying using the built-in query editor. Amazon Athena allows you
to tap into all your data in S3 without the need to set up complex processes to extract,
transform, and load the data (ETL).
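A hedged sketch of that workflow with boto3; the database, table, and output bucket are placeholders, and the table is assumed to have already been defined over data in S3.

```python
# A minimal boto3 sketch of running an Athena query against data in S3.
# The database, table, and output bucket are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = athena.start_query_execution(
    QueryString="SELECT product_line, SUM(amount) FROM sales GROUP BY product_line",
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)

# Results land in the output location; status can be polled with the query id.
status = athena.get_query_execution(QueryExecutionId=query["QueryExecutionId"])
print(status["QueryExecution"]["Status"]["State"])
```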

Pay Per Query (Only pay for data scanned)

With Amazon Athena, you pay only for the queries that you run. You are charged
$5 per terabyte scanned by your queries. You can save from 30% to 90% on your per-
query costs and get better performance by compressing, partitioning, and converting your
data into columnar formats. Athena queries data directly in Amazon S3. There are no
additional storage charges beyond S3.

ELASTICSEARCH (Amazon Elasticsearch Service)


Amazon Elasticsearch Service makes it easy to deploy,
operate, and scale Elasticsearch for log analytics, full text search,
application monitoring, and more. Amazon Elasticsearch Service is a
fully managed service that delivers Elasticsearch's easy-to-use APIs
and real-time capabilities along with the availability, scalability, and
security required by production workloads. The service offers built-in
integrations with Kibana, Logstash, and AWS services including
Amazon Kinesis Firehose, AWS Lambda, and Amazon CloudWatch so
that you can go from raw data to actionable insights quickly.

It's easy to get started with Amazon Elasticsearch Service.
You can set up and configure your Amazon Elasticsearch Service
domain in minutes from the AWS Management Console. Amazon
Elasticsearch Service provisions all the resources for your domain and
launches it. The service automatically detects and replaces failed
Elasticsearch nodes, reducing the overhead associated with self-
managed infrastructure and Elasticsearch software. Amazon
Elasticsearch Service allows you to easily scale your cluster via a single
API call or a few clicks in the console.

REAL-TIME DATA PROCESS

As you may know, Amazon Kinesis greatly simplifies the
process of working with real-time streaming data in the AWS Cloud.
Instead of setting up and running your own processing and short-term
storage infrastructure, you simply create a Kinesis Stream or Kinesis
Firehose, arrange to pump data into it, and then build an application to
process or analyze it.

While it is relatively easy to build streaming data solutions
using Kinesis Streams and Kinesis Firehose, we want to make it even
easier. We want you, whether you are a procedural developer, a data
scientist, or a SQL developer, to be able to process voluminous
clickstreams from web applications, telemetry and sensor reports from
connected devices, server logs, and more using a standard query
language, all in real time.

Amazon Kinesis Firehose

Amazon Kinesis Firehose is the easiest way to load
streaming data into AWS. It can capture, transform, and load streaming
data into Amazon Kinesis Analytics, Amazon S3, Amazon Redshift, and
Amazon Elasticsearch Service, enabling near real-time analytics with
existing business intelligence tools and dashboards you're already
using today. It is a fully managed service that automatically scales to
match the throughput of your data and requires no ongoing
administration. It can also batch, compress, and encrypt the data
before loading it, minimizing the amount of storage used at the
destination and increasing security.

You can easily create a Firehose delivery stream from the
AWS Management Console, configure it with a few clicks, and start
sending data to the stream from hundreds of thousands of data
sources to be loaded continuously to AWS all in just a few minutes.

With Amazon Kinesis Firehose, you only pay for the amount of data you
transmit through the service. There is no minimum fee or setup cost.
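A minimal boto3 sketch of pushing a record into a Firehose delivery stream; the stream name is a placeholder, and the delivery stream (for example, one configured in the console to deliver to S3) is assumed to already exist.

```python
# Sending a single JSON record to an existing Kinesis Firehose delivery stream.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

event = {"user_id": 42, "action": "add_to_cart", "sku": "B00EXAMPLE"}
firehose.put_record(
    DeliveryStreamName="example-clickstream",          # placeholder name
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```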

Amazon Kinesis Analytics

Amazon Kinesis Analytics is the easiest way to process
streaming data in real time with standard SQL without having to learn
new programming languages or processing frameworks. Amazon
Kinesis Analytics enables you to create and run SQL queries on
streaming data so that you can gain actionable insights and respond to
your business and customer needs promptly.

Amazon Kinesis Analytics takes care of everything required
to run your queries continuously and scales automatically to match the
volume and throughput rate of your incoming data. With Amazon
Kinesis Analytics, you only pay for the resources your queries
consume. There is no minimum fee or setup cost.

Amazon Kinesis Streams

Amazon Kinesis Streams enables you to build custom
applications that process or analyze streaming data for specialized
needs. Amazon Kinesis Streams can continuously capture and store
terabytes of data per hour from hundreds of thousands of sources such
as website clickstreams, financial transactions, social media feeds, IT
logs, and location-tracking events. With the Amazon Kinesis Client Library
(KCL), you can build Amazon Kinesis applications and use streaming
data to power real-time dashboards, generate alerts, implement
dynamic pricing and advertising, and more. You can also emit data
from Amazon Kinesis Streams to other AWS services such as Amazon
Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elastic
MapReduce (Amazon EMR), and AWS Lambda.
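A small boto3 sketch of writing to and reading from a stream; the stream name and shard ID are placeholders, and a production consumer would normally use the KCL rather than raw GetRecords calls.

```python
# Writing one record to an existing Kinesis stream and reading it back.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

kinesis.put_record(
    StreamName="example-stream",
    Data=json.dumps({"page": "/checkout", "user_id": 42}).encode("utf-8"),
    PartitionKey="42",                 # records with the same key go to the same shard
)

shard_iterator = kinesis.get_shard_iterator(
    StreamName="example-stream",
    ShardId="shardId-000000000000",    # placeholder shard ID
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=shard_iterator, Limit=10)
print(records["Records"])
```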

DATA STORAGE & DATABASES

Object Storage
Amazon Simple Storage Service (Amazon S3) is object
storage with a simple web service interface to store and retrieve any
amount of data from anywhere on the web. It is designed to deliver
99.999999999% durability, and scale past trillions of objects
worldwide.

Customers use S3 as primary storage for cloud-native
applications; as a bulk repository, or "data lake," for analytics; as a
target for backup & recovery and disaster recovery; and with
serverless computing.

It's simple to move large volumes of data into or out of
Amazon S3 with Amazon's cloud data migration options. Once data is
stored in S3, it can be automatically tiered into lower cost, longer-term
cloud storage classes like S3 Standard - Infrequent Access and Amazon
Glacier for archiving.
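A minimal boto3 sketch of storing and retrieving an object; the bucket name and key are placeholders and the bucket is assumed to already exist.

```python
# Storing a small CSV object in S3 and reading it back.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-data-lake",                     # placeholder bucket
    Key="raw/2017/03/orders.csv",
    Body=b"order_id,amount\n1001,250\n1002,75\n",
)

obj = s3.get_object(Bucket="example-data-lake", Key="raw/2017/03/orders.csv")
print(obj["Body"].read().decode("utf-8"))
```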

Graph Databases

Graph databases elegantly and efficiently represent entities
(generally known as vertices or nodes) and relationships (edges) that
connect them. Here's a very simple example of a graph:
Bill and Candace have a daughter named Janet, and she has a son
named Bob. This makes Candace Bob's grandmother, and Bill his
grandfather.

Once a graph has been built, it is processed by traversing
the edges between the vertices. In the graph above, we could traverse
from Bill to Janet, and from there to Bob. Graphs can be used to model
social networks (friends and likes), business relationships
(companies, employees, partners, suppliers, and customers),
dependencies, and so forth. Both vertices and edges can be typed;
some vertices could be people as in our example, and others places.
Similarly some edges could denote (as above) familial relationships
and others could denote likes. Every graph database allows
additional information to be attached to each vertex and to each edge,
often in the form of name-value pairs.
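The family graph above can be sketched as a plain adjacency list with a simple traversal; a real graph database such as Titan adds typed edges, properties, and scale, but the idea of following edges is the same.

```python
# The family graph from the example above, with a traversal from Bill to Bob.
graph = {
    "Bill":    [("parent_of", "Janet")],
    "Candace": [("parent_of", "Janet")],
    "Janet":   [("parent_of", "Bob")],
    "Bob":     [],
}

def descendants(person):
    """Follow parent_of edges and collect everyone reachable from person."""
    found = []
    for edge_type, child in graph[person]:
        if edge_type == "parent_of":
            found.append(child)
            found.extend(descendants(child))
    return found

print(descendants("Bill"))   # ['Janet', 'Bob'] -> Bob is Bill's grandson
```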

Titan is a scalable graph database that is optimized for storing
and querying graphs that contain hundreds of billions of vertices and
edges. It is transactional, and can support concurrent access from
thousands of users.

Amazon Aurora

Amazon Aurora is a MySQL-compatible relational database
engine that combines the speed and availability of high-end
commercial databases with the simplicity and cost-effectiveness of
open source databases. Amazon Aurora provides up to five times
better performance than MySQL with the security, availability, and
reliability of a commercial database at one tenth the cost.

NoSQL
Amazon DynamoDB is a fast and flexible NoSQL database
service for all applications that need consistent, single-digit millisecond
latency at any scale. It is a fully managed cloud database and supports
both document and key-value store models. Its flexible data model and
reliable performance make it a great fit for mobile, web, gaming, ad
tech, IoT, and many other applications. Start today by downloading the
local version of DynamoDB, then read our Getting Started Guide.
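A minimal boto3 sketch of writing and reading one item; the table name and its "user_id" partition key are placeholders and the table is assumed to already exist.

```python
# Writing a single item to an existing DynamoDB table and reading it back.
import boto3

table = boto3.resource("dynamodb", region_name="us-east-1").Table("example-users")

table.put_item(Item={"user_id": "42", "name": "Janet", "cart_items": 3})

item = table.get_item(Key={"user_id": "42"}).get("Item")
print(item)
```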

HBase on Amazon EMR

Apache HBase is a massively scalable, distributed big data
store in the Apache Hadoop ecosystem. It is an open-source, non-
relational, versioned database which runs on top of Amazon S3 (using
EMRFS) or the Hadoop Distributed Filesystem (HDFS), and it is built for
random, strictly consistent, real-time access to tables with billions of
rows and millions of columns. Apache Phoenix integrates with Apache
HBase for low-latency SQL access over Apache HBase tables and
secondary indexing for increased performance. Additionally, Apache
HBase has tight integration with Apache Hadoop, Apache Hive, and
Apache Pig, so you can easily combine massively parallel analytics with
fast data access. Apache HBase's data model, throughput, and fault
tolerance are a good match for workloads in ad tech, web analytics,
financial services, applications using time-series data, and many more.

Relational Databases

Amazon Relational Database Service (Amazon RDS) makes it
easy to set up, operate, and scale a relational database in the cloud. It
provides cost-efficient and resizable capacity while managing time-
consuming database administration tasks, freeing you up to focus on
your applications and business. Amazon RDS provides you with six familiar
database engines to choose from, including Amazon Aurora,
PostgreSQL, MySQL, MariaDB, Oracle, and Microsoft SQL Server.

DATA WAREHOUSING

Amazon Redshift

Amazon Redshift is a fast, fully managed, petabyte-scale
data warehouse that makes it simple and cost-effective to analyze all
your data using your existing business intelligence tools. Start small for
$0.25 per hour with no commitments and scale to petabytes for $1,000
per terabyte per year, less than a tenth the cost of traditional
solutions. Customers typically see 3x compression, reducing their costs
to $333 per uncompressed terabyte per year.

AMAZON EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web
service that provides secure, resizable compute capacity in the cloud.
It is designed to make web-scale cloud computing easier for
developers.

Amazon EC2's simple web service interface allows you to
obtain and configure capacity with minimal friction. It provides you
with complete control of your computing resources and lets you run on
Amazon's proven computing environment. Amazon EC2 reduces the
time required to obtain and boot new server instances to minutes,
allowing you to quickly scale capacity, both up and down, as your
computing requirements change. Amazon EC2 changes the economics
of computing by allowing you to pay only for capacity that you actually
use. Amazon EC2 provides developers the tools to build failure-resilient
applications and isolate them from common failure scenarios.
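A brief boto3 sketch of that pay-for-what-you-use model: launch an instance, then terminate it when the work is done. The AMI ID and key pair name are placeholders.

```python
# Launching and terminating a single EC2 instance.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

result = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="t2.micro",
    KeyName="example-keypair",         # placeholder key pair
    MinCount=1,
    MaxCount=1,
)
instance_id = result["Instances"][0]["InstanceId"]
print("Launched:", instance_id)

# Pay only for capacity you actually use: terminate when the work is done.
ec2.terminate_instances(InstanceIds=[instance_id])
```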

DATA MOVEMENT

Direct Connectivity

AWS Direct Connect makes it easy to establish a dedicated
network connection from your premises to AWS. Using AWS Direct
Connect, you can establish private connectivity between AWS and your
datacenter, office, or colocation environment, which in many cases can
reduce your network costs, increase bandwidth throughput, and
provide a more consistent network experience than Internet-based
connections.

AWS Direct Connect lets you establish a dedicated network
connection between your network and one of the AWS Direct Connect
locations. Using industry standard 802.1q VLANs, this dedicated
connection can be partitioned into multiple virtual interfaces. This
allows you to use the same connection to access public resources such
as objects stored in Amazon S3 using public IP address space, and
private resources such as Amazon EC2 instances running within an
Amazon Virtual Private Cloud (VPC) using private IP space, while
maintaining network separation between the public and private
environments. Virtual interfaces can be reconfigured at any time to
meet your changing needs.

Database Migration

AWS Database Migration Service helps you migrate
databases to AWS easily and securely. The source database remains
fully operational during the migration, minimizing downtime to
applications that rely on the database. The AWS Database Migration
Service can migrate your data to and from most widely used
commercial and open-source databases.

The service supports homogeneous migrations, such as Oracle
to Oracle, as well as heterogeneous migrations between different
database platforms, such as Oracle to Amazon Aurora or Microsoft SQL
Server to MySQL.

Homogeneous Database Migrations

In homogeneous database migrations, the source and target
database engines are the same or are compatible, such as Oracle to
Amazon RDS for Oracle, MySQL to Amazon Aurora, MySQL to Amazon
RDS for MySQL, or Microsoft SQL Server to Amazon RDS for SQL Server.
Since the schema structure, data types, and database code are
compatible between the source and target databases, this kind of
migration is a one-step process. You create a migration task with
connections to the source and target databases, then start the
migration with the click of a button. AWS Database Migration Service
takes care of the rest. The source database can be located in your own
premises outside of AWS, running on an Amazon EC2 instance, or it can
be an Amazon RDS database. The target can be a database in Amazon
EC2 or Amazon RDS.
Heterogeneous Database Migrations

In heterogeneous database migrations, the source and target
database engines are different, as in the case of Oracle to Amazon
Aurora, Oracle to PostgreSQL, or Microsoft SQL Server to MySQL
migrations. In this case, the schema structure, data types, and
database code of source and target databases can be quite different,
requiring a schema and code transformation before the data migration
starts. That makes heterogeneous migrations a two-step process. First
use the AWS Schema Conversion Tool to convert the source schema
and code to match that of the target database, and then use the AWS
Database Migration Service to migrate data from the source database
to the target database. All the required data type conversions will
automatically be done by the AWS Database Migration Service during
the migration. The source database can be located in your own
premises outside of AWS, running on an Amazon EC2 instance, or it can
be an Amazon RDS database. The target can be a database in Amazon
EC2 or Amazon RDS.
Large Scale Data Transfer

Snowball is a petabyte-scale data transport solution that uses
secure appliances to transfer large amounts of data into and out of the
AWS cloud. Using Snowball addresses common challenges with large-
scale data transfers including high network costs, long transfer times,
and security concerns. Transferring data with Snowball is simple, fast,
secure, and can be as little as one-fifth the cost of high-speed Internet.

With Snowball, you don't need to write any code or purchase
any hardware to transfer your data. Simply create a job in the AWS
Management Console and a Snowball appliance will be automatically
shipped to you. Once it arrives, attach the appliance to your local
network, download and run the Snowball client to establish a
connection, and then use the client to select the file directories that
you want to transfer to the appliance. The client will then encrypt and
transfer the files to the appliance at high speed. Once the transfer is
complete and the appliance is ready to be returned, the E Ink shipping
label will automatically update and you can track the job status via
Amazon Simple Notification Service (SNS), text messages, or directly in
the Console.
Snowball uses multiple layers of security designed to
protect your data including tamper-resistant enclosures, 256-bit
encryption, and an industry-standard Trusted Platform Module (TPM)
designed to ensure both security and full chain-of-custody of your data.
Once the data transfer job has been processed and verified, AWS
performs a software erasure of the Snowball appliance.

FUNCTIONS OF DATA PROCESSING

Data processing may involve various processes, including the following (a brief
illustrative sketch in Python appears after this list):

Validation - ensuring that supplied data is correct and relevant.

Sorting - "arranging items in some sequence and/or in different sets."

Summarization - reducing detailed data to its main points.

Aggregation - combining multiple pieces of data.

Analysis - the "collection, organization, analysis, interpretation and presentation of data."

Reporting - listing detail or summary data or computed information.

Classification - separating data into various categories.
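A brief illustrative sketch (referenced above) applying several of these functions to made-up records:

```python
# Validation, sorting, aggregation, classification and summarization on sample records.
records = [
    {"region": "North", "amount": 120},
    {"region": "South", "amount": 80},
    {"region": "North", "amount": -5},    # invalid: negative amount
    {"region": "South", "amount": 200},
]

valid = [r for r in records if r["amount"] >= 0]            # validation
ordered = sorted(valid, key=lambda r: r["amount"])          # sorting
total = sum(r["amount"] for r in valid)                     # aggregation
by_region = {}                                              # classification + summarization
for r in valid:
    by_region.setdefault(r["region"], 0)
    by_region[r["region"]] += r["amount"]

print(ordered)
print("Total:", total, "By region:", by_region)
```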

CONCLUSION

As more and more data is generated and collected, data processing requires scalable,
flexible, and high-performing tools to provide insights in a timely fashion. However,
organizations are facing a growing big data ecosystem where new tools emerge and die
very quickly. Therefore, it can be very difficult to keep pace and choose the right tools.

This whitepaper offers a first step to help you solve this challenge. With a broad
set of managed services to collect, process, and analyze data, the AWS platform makes it
easier to build, deploy, and scale big data applications, allowing you to focus on business
problems instead of updating and managing these tools.
AWS provides many solutions to address big data processing requirements. Most
big data architecture solutions use multiple AWS tools to build a complete solution: this
can help meet the stringent business requirements in the most cost-optimized, performant,
and resilient way possible. The result is a flexible, big data architecture that is able to
scale along with your business on the AWS global infrastructure.

WEBLOGRAPHY

http://aws.amazon.com/solutions/case-studies/big-data/
https://aws.amazon.com/kinesis/streams
http://docs.aws.amazon.com/kinesis/latest/APIReference/Welcome.html
http://docs.aws.amazon.com/aws-sdk-php/v2/guide/amazon-service-kinesis
https://personaldata.ge/en/for-public-bodies/monatsemta-damushavebis-printsipebi-da-safudzvlebi
www.google.com
www.google.co/data/processing-amazon.co.in
