1

2
3
4
When you hear the term Big Data, there are usually multiple problems
associated with it. The sheer size of the source information is of course one, but why
is the data so large in the first place? Because it is not ERP data but something else, such as
weblogs, social media data, and so on. Such data is either completely
unstructured or semi-structured, and it needs to be related back to the
ERP data.

Let us have a look at three examples of what one might want to call Big Data.
5
You might have an extremely large table in the database with billings that you want
to sum up and show in a report for further analysis. This is something
Hadoop can do very well, but so can databases, HANA in particular. So unless
the data is in Hadoop already, you would probably not use Hadoop for that
purpose.
6
A weblog is a nice example of what Hadoop can do very well. It is a large
dataset, it is not very well structured, and you do not want simple aggregations
like "How often was the URL xyz called?" but more complex
calculations like "How much time passed between a single user opening one page and then
the next?", which is the potential reading time of the first page. Expressing such
calculations in SQL is almost impossible, and even where it can be done, it would
take a long time to execute.
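
To make this concrete, here is a minimal sketch of how such a reading-time calculation could look as a Hadoop map-reduce job. It assumes a simplified, tab-separated weblog format of user id, epoch timestamp, and URL; the class and field names are made up for illustration and are not part of any product.

// Minimal sketch, not production code: estimate the potential reading time of a page
// from a simplified weblog where each line is  userId <TAB> epochSeconds <TAB> url
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReadingTime {

  // Map phase: key every log line by its user id so that all clicks of one
  // user end up in the same reduce call.
  public static class LogMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length < 3) {
        return; // skip malformed lines
      }
      context.write(new Text(fields[0]), new Text(fields[1] + "\t" + fields[2]));
    }
  }

  // Reduce phase: per user, sort the clicks by time and emit the gap between two
  // consecutive clicks as the potential reading time of the first page.
  public static class ReadingTimeReducer extends Reducer<Text, Text, Text, LongWritable> {
    @Override
    protected void reduce(Text user, Iterable<Text> clicks, Context context)
        throws IOException, InterruptedException {
      List<String[]> visits = new ArrayList<>();
      for (Text click : clicks) {
        visits.add(click.toString().split("\t")); // [epochSeconds, url]
      }
      visits.sort(Comparator.comparingLong((String[] v) -> Long.parseLong(v[0])));
      for (int i = 0; i + 1 < visits.size(); i++) {
        long seconds = Long.parseLong(visits.get(i + 1)[0]) - Long.parseLong(visits.get(i)[0]);
        context.write(new Text(visits.get(i)[1]), new LongWritable(seconds));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "weblog reading time");
    job.setJarByClass(ReadingTime.class);
    job.setMapperClass(LogMapper.class);
    job.setReducerClass(ReadingTimeReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note how the session-style logic (sorting a user's clicks and comparing neighbors) is trivial in a few lines of Java but awkward to express as a single SQL statement.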
7
For simple text statistics you might want to count the number of times each
word occurs. But how do you break a long text apart into its words in SQL? And
even if you can, the amount of data would grow so much that it could not be
processed anymore.
8
Hadoop by itself is very simple: it is a file storage system that can execute
programs which read those files and write new ones. Such a program is
written in Java and can consist of just two methods, a map method and a
reduce (aggregation) method.
What is unique about Hadoop, however, is that if both requirements are met,
files as the source and map-reduce based programs, the Hadoop environment can
execute these in a massively parallel way. So the input of the map method
might be an Amazon product review, as suggested in Example 3, and the map
phase will output the list of words. The reduce method will then take the words
and count them across all reviews, so that per product you can see how often
the word "issue" was found. This is just a first, simple approach; in practice we
would of course use the Text Data Processing Transform for that.
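
A minimal version of such a word-count job, written against the standard Hadoop map-reduce API, could look like this (class names are illustrative):

// Minimal word-count sketch: map emits (word, 1) per review line, reduce sums the counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReviewWordCount {

  // Map: break each review line into words and emit (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString().toLowerCase());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts of each word across all reviews.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "review word count");
    job.setJarByClass(ReviewWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that this sketch counts words across all reviews in the input; to get the counts per product, the map method would simply emit the product id together with the word as the key.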
Hadoop has two extensions that we also use: Hive, which gives you SQL-like
options to join and summarize data, and Pig, which lets you write Map-Reduce logic
not in low-level Java code but in a higher-level scripting language that is compiled into
Map-Reduce logic at runtime.
9
It does not make sense to fill up a database with terabytes of text data if this
data cannot be analyzed because the logic cannot be expressed in SQL or
other database languages, or if the source data is just raw data that has to
be processed in order to get a measure out of it. In that case you use Hadoop for the
cheap storage and processing of the data; the results are then loaded into
HANA or any other system to enrich the analysis possible there.
10
And why would we use weblogs and text information entered by millions of users?
Because it can help us to find out the "why".
11
Combining the ERP numbers with the Hadoop-produced statistics for each
product allows you to relate the two, for example revenue numbers
with the hit rates and with what is said about the product, to get the big picture of
the sales situation.
12
13
It is quite possible that you have many tools from different vendors
supporting your EIM and/or data management initiatives for ETL, data quality,
data profiling, metadata management, and text analytics.

What if you could simplify your IT environment with a single foundation to
deliver data services across your entire enterprise, for current or new projects?
Such a foundation needs to be open to support all data sources and targets,
scalable to handle small and extreme data volumes, and reusable so that you
can build once and reuse for other projects.

Such a Data Services foundation can be your standard for delivering all of the
critical information management capabilities across your enterprise. As a result,
you can gain significantly greater IT efficiency and deliver maximum business
effectiveness.
14
For us as Data Services users, Hadoop is just another source and target. We
can read and write files in Hadoop, and we can push logic down into Hadoop
so we do not have to execute it in the engine.
15
The internal flow logic when Data Services uses Hadoop is quite simple, as
Hadoop itself is quite simple. We prepare the file, we generate the processing logic,
and then let this script be executed in Hadoop.

One important limitation at the moment is that we cannot load a remote Hadoop
system, so a job server needs to be installed on the Hadoop system.
16
17
The CDC checkbox and type can only be set when creating the datastore;
they cannot be modified later on.
Once a CDC datastore is created, it will only show tables where CDC or
change tracking is activated.
18
Change Data Capture (CDC) and Change Tracking enable applications to
determine the Insert, Update, and Delete operations that were made to user
tables in a database.
19
After each successful execution (at the end of the job), the watermark is written
to a status table; for the next execution, only changes after this watermark are
extracted.
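
As an illustration of this watermark idea (not of how Data Services implements it internally), the following JDBC sketch reads a delta from a SQL Server table with change tracking enabled. The table, key column, and status-table names (dbo.Orders, OrderID, CDC_STATUS) as well as the connection details are made-up placeholders.

// Illustrative sketch only: the watermark pattern behind a change-tracking delta load.
// Assumes change tracking is already enabled on the database and on dbo.Orders;
// dbo.Orders, OrderID, and CDC_STATUS are hypothetical names.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ChangeTrackingDelta {

  public static void main(String[] args) throws Exception {
    try (Connection con = DriverManager.getConnection(
             "jdbc:sqlserver://myhost;databaseName=SOURCE_DB", "user", "password");
         Statement st = con.createStatement()) {

      // 1. Read the watermark written by the previous successful run.
      long lastVersion = 0L;
      try (ResultSet rs = st.executeQuery(
          "SELECT LAST_VERSION FROM CDC_STATUS WHERE TABLE_NAME = 'dbo.Orders'")) {
        if (rs.next()) {
          lastVersion = rs.getLong(1);
        }
      }

      // 2. Remember the current version; it becomes the new (high) watermark.
      long currentVersion;
      try (ResultSet rs = st.executeQuery("SELECT CHANGE_TRACKING_CURRENT_VERSION()")) {
        rs.next();
        currentVersion = rs.getLong(1);
      }

      // 3. Extract only the rows changed after the old watermark.
      //    lastVersion is a number read from our own status table, so simple
      //    string concatenation is safe here.
      try (ResultSet rs = st.executeQuery(
          "SELECT ct.SYS_CHANGE_OPERATION, ct.OrderID "
              + "FROM CHANGETABLE(CHANGES dbo.Orders, " + lastVersion + ") AS ct")) {
        while (rs.next()) {
          // 'I' = insert, 'U' = update, 'D' = delete
          System.out.println(rs.getString(1) + " " + rs.getLong(2));
        }
      }

      // 4. Only after the load succeeded, persist the new watermark for the next run.
      st.executeUpdate("UPDATE CDC_STATUS SET LAST_VERSION = " + currentVersion
          + " WHERE TABLE_NAME = 'dbo.Orders'");
    }
  }
}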

You can use the CDC subscription name (for change tracking only) when
multiple subscribers need to access the delta independently.

For the type CDC, low and high watermarks are set to ensure data
consistency. This applies to all dataflows in the job.

Additional CDC columns are available in the input schema.
20
21
22
