
Syncsort Incorporated, 2016

All rights reserved. This document contains proprietary and confidential material, and is only for use by
licensees of DMExpress. This publication may not be reproduced in whole or in part, in any form,
except with written permission from Syncsort Incorporated. Syncsort is a registered trademark and
DMExpress is a trademark of Syncsort Incorporated. All other company and product names used
herein may be the trademarks of their respective owners.

The accompanying DMExpress program and the related media, documentation, and materials
("Software") are protected by copyright law and international treaties. Unauthorized reproduction or
distribution of the Software, or any portion of it, may result in severe civil and criminal penalties, and
will be prosecuted to the maximum extent possible under the law.

The Software is a proprietary product of Syncsort Incorporated, but incorporates certain third-party
components that are each subject to separate licenses and notice requirements. Note, however, that
while these separate licenses cover the respective third-party components, they do not modify or form
any part of Syncsort's SLA. Refer to the Third-party license agreements topic in the online help for
copies of respective third-party license agreements referenced herein.
Table of Contents
1 Introduction .................................................................................................................................... 1

2 Getting Started ............................................................................................................................... 2


2.1 Getting the DMX-h Components ......................................................................................... 2
2.2 Getting the Use Case Accelerators ..................................................................................... 2
2.3 Getting the Hortonworks Sandbox VM................................................................................ 2
2.4 Getting Help ........................................................................................................................ 2

3 Installing DMX-h Software ............................................................................................................. 3


3.1 Installing DMX-h on Linux in the Sandbox VM ................................................................... 3
3.2 Installing DMX-h Workstation on Windows ......................................................................... 4
3.3 Installing the Use Case Accelerators .................................................................................. 4
3.4 Configuring the Classpath ................................................................................................... 4

4 Using DMX-h ................................................................................................................................... 6


4.1 Use Case Accelerators ....................................................................................................... 6
4.2 Running the Use Case Accelerators ................................................................................... 7
4.3 Additional Information ......................................................................................................... 9

DMX-h Quick Start for Hortonworks Sandbox i



1 Introduction
DMX-h ETL is Syncsort's high-performance ETL software for Hadoop. It combines
powerful ETL data processing capabilities with the enhanced performance and
scalability of Hadoop, without the need to learn complex MapReduce
programming. A downloadable package of use case accelerators demonstrates how
common ETL applications, easily developed in DMExpress, can be run in the
Hadoop environment.

Installing the DMX-h software and setting up the use case accelerators in your
Hortonworks Sandbox VM is fast and easy. Just follow the instructions in this
document, and try out DMX-h for yourself.


2 Getting Started
2.1 Getting the DMX-h Components
The following components are included in the zip file DMX-h_<DMX-h
version>_Hortonworks_Sandbox.zip, downloadable from www.syncsort.com/hortonworks:

DMX-h software for Linux:

dmexpress_<DMExpress version>_en_linux_2-6_x86-64_64bit.tar

DMX-h software for Windows:

dmexpress_<DMExpress version>_windows_x86.exe

DMExpress Installation Guide

2.2 Getting the Use Case Accelerators


The use case accelerators, a pre-developed set of DMX-h ETL example applications,
along with a set of sample data needed to run them, can be downloaded from Guide to
DMX-h ETL Use Case Accelerators.

2.3 Getting the Hortonworks Sandbox VM


If you don't already have the Hortonworks Sandbox 2.4 VM installed, you can
download it from the Hortonworks website, and install it as directed.

2.4 Getting Help


For assistance with the DMX-h Test Drive, please visit the Syncsort User Community.


3 Installing DMX-h Software


3.1 Installing DMX-h on Linux in the Sandbox VM
Install the DMX-h software on Linux in your Hortonworks Sandbox VM as follows:

1. Be sure that the network adapter on the VM is configured correctly for SSH
   connectivity. Refer to
   http://hortonworks.com/wp-content/uploads/2013/03/InstallingHortonworksSandboxonWindowsUsingVMwarePlayerv2.pdf
   for details.
2. From your desktop, use scp (Secure Copy) via PuTTY or WinSCP to copy (in
   binary mode) the file dmexpress_<DMExpress version>_en_linux_2-6_x86-64_64bit.tar
   to the /root directory on the Sandbox VM, logging in as root with
   password hadoop.
3. Log into the Sandbox VM as root with password hadoop using ssh via PuTTY or
another terminal emulator. This will put you in the /root folder.
4. Run the following command to extract the DMExpress software into a subfolder
named dmexpress:

tar xvof dmexpress_<DMExpress version>_en_linux_2-6_x86-64_64bit.tar

5. Run the following commands to install the product:

cd dmexpress
./install

a. When prompted, select the option to start a free trial. The trial has a duration of
30 days, starting from the first time you run DMExpress.
b. Choose /usr/local/dmexpress as the installation directory.
c. When prompted to install the service, choose Yes.
6. Create a new user named dmxdemo for running DMX-h jobs (this will automatically
create the folder /home/dmxdemo on the VM) as follows:

useradd dmxdemo
passwd dmxdemo    # follow the prompts to set the password to dmxdemo

7. Create a home directory in HDFS for the dmxdemo user, then return to the
   root shell, as follows:

su - hdfs
hadoop fs -mkdir /user/dmxdemo
hadoop fs -chown dmxdemo /user/dmxdemo
exit

8. Switch user to dmxdemo (su - dmxdemo), then edit the dmxdemo user's
   .bash_profile and add the following lines, adjusting the JVM library path as
   needed for your installation:

DMX_HOME=/usr/local/dmexpress
PATH=$PATH:$HOME/bin:$DMX_HOME/bin


LD_LIBRARY_PATH=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.95.x86_64/jre/lib/amd64/server/:$LD_LIBRARY_PATH:$DMX_HOME/lib

export DMX_HOME
export PATH
export LD_LIBRARY_PATH
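The profile edits in step 8 can also be scripted so that re-running the
installation is harmless. A minimal sketch, assuming the Sandbox default JVM
path shown above (the function name append_dmx_profile is illustrative, not
part of DMX-h):

```shell
# append_dmx_profile: add the DMX-h environment settings to the given
# profile file, but only if they are not already present (idempotent).
# Usage: append_dmx_profile /home/dmxdemo/.bash_profile
append_dmx_profile() {
    profile="$1"
    # Sandbox default; adjust to your JVM installation.
    jvm_lib="/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.95.x86_64/jre/lib/amd64/server"
    # Skip the append when the marker line is already in the file.
    if ! grep -q '^DMX_HOME=/usr/local/dmexpress$' "$profile" 2>/dev/null; then
        cat >> "$profile" <<EOF
DMX_HOME=/usr/local/dmexpress
PATH=\$PATH:\$HOME/bin:\$DMX_HOME/bin
LD_LIBRARY_PATH=$jvm_lib:\$LD_LIBRARY_PATH:\$DMX_HOME/lib
export DMX_HOME PATH LD_LIBRARY_PATH
EOF
    fi
}
```

Running append_dmx_profile twice against the same file leaves exactly one
copy of the settings, so the install steps can be repeated safely.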

3.2 Installing DMX-h Workstation on Windows


To view the sample DMExpress jobs/tasks and develop your own solutions, install the
DMX-h Workstation software on a Windows machine as follows:

1. Double-click on the DMExpress Windows installation file.


2. Follow the on-screen instructions.
a. When prompted, select the option to start a free trial. The trial has a duration of
30 days, starting from the first time you run DMExpress.
b. When prompted, the DBMS and SAP verification screens can be skipped.

To simplify creating remote server and file browsing connections, edit the Windows
hosts file as an Admin user and add an entry for the IP address of the VM (shown
when you connect to the VM) with the hostname sandbox.hortonworks.com. For
example:

192.168.137.128 sandbox.hortonworks.com

3.3 Installing the Use Case Accelerators


1. Using scp and connecting as the dmxdemo user, copy (in binary mode) the
   downloaded zipped tar files DMX-h_UCA_Solutions.tar.gz and DMX-h_UCA_Data.tar.gz
   to /home/dmxdemo on the VM.
2. Log into the VM as dmxdemo (or su - dmxdemo if already logged in as a
   different user) and unzip and extract both tar files as follows:

tar xvof DMX-h_UCA_Solutions.tar.gz
tar xvof DMX-h_UCA_Data.tar.gz

This will create Data, Jobs, and bin directories under /home/dmxdemo, which you
will later designate as the value of $DMXHADOOP_EXAMPLES_DIR.
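Before moving on, it can help to confirm that the extraction produced the
expected layout. A small sketch (check_uca_layout is a hypothetical helper,
not part of the product):

```shell
# check_uca_layout: verify that the use case accelerator directories
# (Data, Jobs, bin) exist under the given base directory.
# Prints "OK" when all are present, otherwise lists the missing names.
check_uca_layout() {
    base="$1"
    missing=""
    for d in Data Jobs bin; do
        [ -d "$base/$d" ] || missing="$missing $d"
    done
    if [ -z "$missing" ]; then
        echo "OK"
    else
        echo "missing:$missing"
    fi
}
```

For example, check_uca_layout /home/dmxdemo should print OK once both tar
files have been extracted.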

3.4 Configuring the Classpath


Due to a known problem in Hortonworks, the following steps are necessary to ensure
that DMX-h can write to multiple HDFS targets in MapReduce:

1. Log in to Ambari as the admin user (password admin).


a. If needed, the password can be reset by logging into the VM as root
(password hadoop), running ambari-admin-password-reset, and following
the prompts.
2. Select MapReduce2, then the Configs tab.


3. Expand Advanced mapred-site, and append the Hadoop conf directory,
   /etc/hadoop/conf, to the end of the list in mapreduce.application.classpath.

4. Save the configuration, and allow Ambari to restart the MR2 services if prompted.
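After the edit in step 3, the property should end with the Hadoop conf
directory. Schematically (the existing entries vary by HDP release, so only
the appended tail is shown literally):

```
mapreduce.application.classpath=<existing entries, unchanged>:/etc/hadoop/conf
```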


4 Using DMX-h
4.1 Use Case Accelerators
Syncsort provides a set of use case accelerators that cover a variety of common ETL
use cases to quickly and easily demonstrate both the development and running of
DMX-h ETL jobs in Hadoop.

There are two broad categories of use case accelerators:

DMExpress Hadoop ETL Jobs

   Jobs that are eligible for DMX-h Intelligent Execution (IX) are created as
   standard DMExpress jobs and are found in a subdirectory named
   DMXStandardJobs within the example directory structure. When run in Hadoop,
   they are automatically converted to MapReduce jobs.

   Jobs that are not currently supported for IX are created as user-defined
   MapReduce jobs and are found in a subdirectory named DMXUserDefinedMRJobs
   within the example directory structure. This folder is also present for
   IX-eligible jobs, to demonstrate how those jobs would be defined as
   explicit MapReduce jobs, but the IX solution is the recommended one to use
   when available.

DMExpress HDFS Load/Extract Jobs

   These are standard DMExpress jobs that are run on the edge node for
   extracting and loading HDFS data. They are found in a subdirectory named
   DMXHDFSJobs within the example directory structure.
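Putting the subdirectory names above together, each example under the Jobs
directory follows this layout (shown schematically; <JobName> is a
placeholder, and not every subdirectory exists for every example):

```
$DMXHADOOP_EXAMPLES_DIR/Jobs/<JobName>/
    DMXStandardJobs/        IX-eligible standard DMExpress jobs
    DMXUserDefinedMRJobs/   the same logic as explicit MapReduce jobs
    DMXHDFSJobs/            edge-node HDFS load/extract jobs
```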

A brief description of each use case accelerator is provided below, with links to more
detailed descriptions:

Change Data Capture (CDC)

    CDC Single Output            Performs change data capture (CDC) against two
                                 large input files, producing a single output
                                 file marking records as inserted, deleted, or
                                 updated.

    CDC Distributed Output       Same as CDC Single Output, except that it
                                 produces three separate output files for the
                                 inserted, deleted, and updated records.

    Mainframe Extract + CDC      Same as CDC Single Output, but also converts
                                 and loads mainframe data to HDFS before
                                 passing the HDFS data to the CDC job.

Joins and Lookups

    Join Large Side | Small Side Performs an inner join between a small
                                 distributed cache file and a large HDFS file.

    Join Large Side | Large Side Performs a join of two large files stored in
                                 HDFS.

    File Lookup                  Performs a lookup in a small distributed cache
                                 file while processing a large HDFS file.

Aggregations

    Web Log Aggregation          Calculates the total number of visits per site
                                 in a set of web logs using aggregate tasks.

    Lookup + Aggregation         Performs a lookup followed by an aggregation.

    Word Count                   Performs the standard Hadoop word count
                                 example.

Mainframe Translation and Connectivity

    Direct Mainframe Extract & Load
                                 Loads two files residing on a remote mainframe
                                 system to HDFS, converting to ASCII
                                 displayable text.

    Mainframe File Load          Same as Direct Mainframe Extract & Load,
                                 except that the mainframe files are loaded to
                                 HDFS from the local file system.

    Direct Mainframe Redefine Extract & Load
                                 Loads one file residing on a remote mainframe
                                 system to HDFS, interpreting REDEFINES clauses
                                 and converting to ASCII displayable text.

    Mainframe Redefine File Load Same as Direct Mainframe Redefine Extract &
                                 Load, except that the mainframe file is loaded
                                 to HDFS from the local file system.

Connectivity

    HDFS Extract                 Extracts data from HDFS using HDFS
                                 connectivity in a DMExpress copy task.

    HDFS Load                    Same as HDFS Extract, but loads data to HDFS.

    HDFS Load Parallel           Same as HDFS Load, but splits the data into
                                 multiple partitions and loads to HDFS in
                                 parallel.

4.2 Running the Use Case Accelerators


To run the use case accelerators, do the following:

1. Log into the VM as dmxdemo and set the following environment variables as
indicated in order to run the prep script:

export DMXHADOOP_EXAMPLES_DIR=/home/dmxdemo
export LOCAL_SOURCE_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Source
export LOCAL_TARGET_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Target
export HDFS_SOURCE_DIR=<HDFS directory under which to copy the sample source data, such as /user/dmxdemo/source/>
export HDFS_TARGET_DIR=<HDFS directory under which to write the target data, such as /user/dmxdemo/target/>
export LOCAL_TEMP_DATA_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Temp

2. Create the target directory specified as HDFS_TARGET_DIR, for example:

hadoop fs -mkdir /user/dmxdemo/target

3. Run the prep script to pre-load the sample data to HDFS.


a. This can be done for all use case accelerators using the ALL option:

$DMXHADOOP_EXAMPLES_DIR/bin/prep_dmx_example.sh ALL


b. Or it can be done for the specified space-separated list of folder names under
$DMXHADOOP_EXAMPLES_DIR/Jobs. For example:

$DMXHADOOP_EXAMPLES_DIR/bin/prep_dmx_example.sh FileCDC
WebLogAggregation

4. On the Windows Workstation, start the DMExpress Job Editor, and run the desired
use case accelerator(s) as follows:
a. Select File->Open Job, click on the Remote Servers tab, click on New file
browsing connection, specify the connection as follows, and click OK:
Server: sandbox.hortonworks.com
Connection type: Secure FTP
Authentication: Password
User name: dmxdemo
Password: dmxdemo
b. Open the desired job as follows:
i. Browse to the location of the job you want to run in one of the following
folders as described earlier:

$DMXHADOOP_EXAMPLES_DIR/Jobs/<JobName>/DMXStandardJobs
$DMXHADOOP_EXAMPLES_DIR/Jobs/<JobName>/DMXUserDefinedMRJobs
$DMXHADOOP_EXAMPLES_DIR/Jobs/<JobName>/DMXHDFSJobs

ii. Select J_<JobName>.dxj (or MRJ_<JobName>.dxj, as applicable).
iii. Click on Open.
c. Click on the Status button in the Job Editor toolbar.
i. In the DMExpress Server Connection dialog (raised automatically if the
DMExpress server field is empty; otherwise, click on Select Server), click
on the UNIX tab, populate the Connect to server, User name, and Password
fields as indicated in step a, and click OK.
ii. Select the Environment Variables tab, enter and set the following
environment variables as indicated, and click on Close:

HADOOP_HOME=<directory where Hadoop is installed>
HDFS_SERVER=sandbox.hortonworks.com
HDFS_SOURCE_DIR=<same as specified in step 1>
HDFS_TARGET_DIR=<same as specified in step 1>
LOCAL_SOURCE_DIR=<same as specified in step 1>
LOCAL_TARGET_DIR=<same as specified in step 1>
LOCAL_TEMP_DATA_DIR=<same as specified in step 1>
MAPRED_TEMP_DATA_DIR=.
DMX_HADOOP_ON_VALIDATION_FAILURE=LOCALNODE

Setting DMX_HADOOP_ON_VALIDATION_FAILURE to LOCALNODE allows any jobs or
subjobs that do not run in the cluster to run on the edge node instead.
Otherwise, they would run on a single cluster node, which is unnecessary
for a single-node test drive sandbox.


d. Click on the Run button in the Job Editor toolbar. In the Run Job dialog, select
Cluster in the Run on section and click OK.
e. This will bring up the DMExpress Server dialog, which will show the progress
of the running job. Upon completion, select the job and click on Detail to see
Hadoop messages and statistics. (The SRVCDFL warning message about the
data directory can be safely ignored.) To view the Hadoop logs, see Where to
Find DMExpress Hadoop Logs on YARN (MRv2).
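Since the Job Editor connection in step 4 must mirror the shell variables from
step 1, a quick check on the VM before submitting a job can save a failed run.
A hypothetical helper (the variable names are the ones used in this guide):

```shell
# check_dmx_env: report which of the environment variables required by
# the use case accelerators are unset or empty.
check_dmx_env() {
    missing=""
    for v in DMXHADOOP_EXAMPLES_DIR LOCAL_SOURCE_DIR LOCAL_TARGET_DIR \
             HDFS_SOURCE_DIR HDFS_TARGET_DIR LOCAL_TEMP_DATA_DIR; do
        # Indirect expansion; the ':-' default keeps this safe under 'set -u'.
        eval "val=\${$v:-}"
        [ -n "$val" ] || missing="$missing $v"
    done
    if [ -z "$missing" ]; then
        echo "all set"
    else
        echo "unset:$missing"
    fi
}
```

Running check_dmx_env after step 1 should print "all set"; anything it lists
as unset will also be missing from the Environment Variables tab.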

4.3 Additional Information


For details on the Use Case Accelerator directory structure and instructions on running
the jobs outside of Hadoop, see the Guide to DMX-h ETL Use Case Accelerators.

For information on how to develop your own DMExpress Hadoop solutions, see DMX-
h ETL in the DMExpress Help, accessible via the DMExpress GUI (Job Editor or Task
Editor).



About Syncsort

Syncsort provides enterprise software that allows organizations to collect, integrate, sort, and distribute
more data in less time, with fewer resources and lower costs. Thousands of customers in more than
85 countries, including 87 of the Fortune 100 companies, use our fast and secure software to optimize
and offload data processing workloads. Powering over 50% of the world's mainframes, Syncsort software
provides specialized solutions spanning Big Iron to Big Data, including next-gen analytical platforms
such as Hadoop, cloud, and Splunk. For more than 40 years, customers have turned to Syncsort's
software and expertise to dramatically improve performance of their data processing environments, while
reducing hardware and labor costs. Experience Syncsort at www.syncsort.com.

Syncsort Inc. 50 Tice Boulevard, Suite 250, Woodcliff Lake, NJ 07677 201.930.8200
