
Everything that you ever wanted to know about Oozie, but were afraid to ask

B Lublinsky, A Yakubovich

Apache Oozie
Oozie is a workflow/coordination system for managing Apache Hadoop jobs. A single Oozie server implements all four functional Oozie components:
Oozie workflow
Oozie coordinator
Oozie bundle
Oozie SLA

Main components
[Architecture diagram: third-party applications reach the single Oozie server through the WS API or the Oozie Command Line Interface. Inside the server, the bundle engine monitors time conditions, the coordinator engine monitors time and data conditions, and the workflow engine executes workflow logic, handling job submission and monitoring. Workflow definitions and states are persisted by the server; Oozie shared libraries reside in HDFS. A bundle groups coordinators, each coordinator triggers workflows, and workflow actions run as MapReduce jobs on Hadoop.]

Oozie workflow

Workflow Language
Flow-control nodes
Decision (workflow:DECISION): expresses switch-case logic
Fork (workflow:FORK): splits one path of execution into multiple concurrent paths
Join (workflow:JOIN): waits until every concurrent execution path of a previous fork node arrives at it
Kill (workflow:KILL): forces a workflow job to kill (abort) itself

Action nodes
java (workflow:JAVA): invokes the main() method of the specified Java class
fs (workflow:FS): manipulates files and directories in HDFS; supports the commands move, delete, mkdir
MapReduce (workflow:MAP-REDUCE): starts a Hadoop map/reduce job, which can be a Java MR job, a streaming job, or a pipes job
Pig (workflow:PIG): runs a Pig job
Sub-workflow (workflow:SUB-WORKFLOW): runs a child workflow job
Hive * (workflow:HIVE): runs a Hive job
Shell * (workflow:SHELL): runs a shell command
ssh * (workflow:SSH): starts a shell command on a remote machine as a remote secure shell
Sqoop * (workflow:SQOOP): runs a Sqoop job
Email * (workflow:EMAIL): sends emails from an Oozie workflow application
Distcp: under development (Yahoo)
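A minimal hPDL sketch combining several of these node types (workflow, node, and class names are illustrative), assuming hPDL schema 0.4:

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="forking"/>
    <fork name="forking">
        <path start="first"/>
        <path start="second"/>
    </fork>
    <action name="first">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.example.FirstStep</main-class>
        </java>
        <ok to="joining"/>
        <error to="fail"/>
    </action>
    <action name="second">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.example.SecondStep</main-class>
        </java>
        <ok to="joining"/>
        <error to="fail"/>
    </action>
    <join name="joining" to="end"/>
    <kill name="fail">
        <message>Step failed</message>
    </kill>
    <end name="end"/>
</workflow-app>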

Workflow actions
Oozie workflow supports two types of actions:
Synchronous, executed inside the Oozie runtime
Asynchronous, executed as a Map Reduce job
[Sequence diagram: starting an asynchronous action. Participants: ActionStartCommand, WorkflowStore, Services, ActionExecutorContext, JavaActionExecutor, JobClient]
1: workflow := getWorkflow()
2: action := getAction()
3: context := init<>()
4: executor := get()
5: start()
6: submitLauncher()
7: jobClient := get()
8: runningJob := submit()
9: setStartData()

Workflow lifecycle
[State diagram: PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED]

Oozie execution console

Extending Oozie workflow


Oozie provides a minimal workflow language that contains only a handful of control and action nodes. Oozie supports a very elegant extensibility mechanism: custom action nodes. Custom action nodes make it possible to extend the Oozie language with additional actions (verbs). Creating a custom action requires implementing the following (see the sketch below):
A Java action implementation, which extends the ActionExecutor class
An XML schema defining the action's configuration parameters
Packaging of the Java implementation and configuration schema into an action jar, which has to be added to the Oozie war
Extending oozie-site.xml to register the custom executor with the Oozie runtime
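A minimal sketch of such an executor, following Oozie's ActionExecutor contract; the "echo" action type and class name are illustrative:

import org.apache.oozie.action.ActionExecutor;
import org.apache.oozie.action.ActionExecutorException;
import org.apache.oozie.client.WorkflowAction;

public class EchoActionExecutor extends ActionExecutor {

    public EchoActionExecutor() {
        super("echo");  // action type; matches the custom XML element name
    }

    @Override
    public void start(Context context, WorkflowAction action) throws ActionExecutorException {
        // Synchronous action: do the work here, then record completion.
        System.out.println("echo: " + action.getConf());
        context.setExecutionData("OK", null);
    }

    @Override
    public void end(Context context, WorkflowAction action) throws ActionExecutorException {
        context.setEndData(WorkflowAction.Status.OK, WorkflowAction.Status.OK.toString());
    }

    @Override
    public void check(Context context, WorkflowAction action) throws ActionExecutorException {
        // Nothing to poll for a synchronous action.
    }

    @Override
    public void kill(Context context, WorkflowAction action) throws ActionExecutorException {
        context.setEndData(WorkflowAction.Status.KILLED, WorkflowAction.Status.KILLED.toString());
    }

    @Override
    public boolean isCompleted(String externalStatus) {
        return true;
    }
}

The executor is then registered through the oozie.service.ActionService.executor.ext.classes property in oozie-site.xml.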

Oozie Workflow Client


Oozie provides an easy way to integrate with enterprise applications through the Oozie client APIs. It provides two types of APIs.

REST HTTP API
A number of HTTP requests:
Info requests (job status, job configuration)
Job management (submit, start, suspend, resume, kill)
Example: job definition info request
GET /oozie/v0/job/job-ID?show=definition
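A hypothetical invocation of that request with curl (host and job id are placeholders):

curl "http://oozie-host:11000/oozie/v0/job/0000001-130606115200591-oozie-W?show=definition"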

Java API - package org.apache.oozie.client


OozieClient: start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
WorkflowJob, WorkflowAction
CoordinatorJob, CoordinatorAction
SLAEvent
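A short sketch of submitting a workflow and polling its status through this API; the server URL and application path are placeholders:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/me/workflow");
        String jobId = client.run(conf);  // submit and start in one call
        while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10 * 1000);      // poll every 10 seconds
        }
        System.out.println("final status: " + client.getJobInfo(jobId).getStatus());
    }
}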

Oozie workflow good, bad and ugly


Good
Nice integration with the Hadoop ecosystem, making it easy to build processes encompassing synchronized execution of multiple Map Reduce, Hive, Pig, etc. jobs
Nice UI for tracking execution progress
Simple APIs for integration with other applications
Simple extensibility APIs

Bad
The process has to be expressed directly in hPDL, with no visual support
No support for uber jars (but we added our own)

Ugly
Static forking (but you can regenerate the workflow and invoke it on the fly)
No support for loops

Oozie Coordinator

Coordinator language
Element: coordinator-app
Description: top-level element in a coordinator instance
Attributes and sub-elements: frequency, start, end

Element: controls
Description: specifies the execution policy for the coordinator and its elements (workflow actions)
Attributes and sub-elements: timeout (actions), concurrency (actions), execution order (workflow instances)

Element: action
Description: required singular element specifying the associated workflow; the jobs specified in the workflow consume and produce dataset instances
Attributes and sub-elements: workflow name

Element: datasets
Description: collection of data referred to by a logical name; datasets serve to specify data dependencies between workflow instances

Element: input event
Description: specifies the input conditions (in the form of present datasets) that are required in order to execute a coordinator action

Element: output event
Description: specifies the dataset that should be produced by a coordinator action
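A minimal coordinator sketch tying these elements together (names, paths, and dates are placeholders):

<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2013-01-01T00:00Z" end="2013-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
    <controls>
        <timeout>60</timeout>
        <concurrency>1</concurrency>
    </controls>
    <datasets>
        <dataset name="input" frequency="${coord:days(1)}"
                 initial-instance="2013-01-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://namenode/data/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="in" dataset="input">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs://namenode/user/me/workflow</app-path>
        </workflow>
    </action>
</coordinator-app>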

Coordinator lifecycle

Oozie Bundle

Bundle lifecycle
[State diagram: PREP, PREPPAUSED, PREPSUSPENDED, RUNNING, PAUSED, SUSPENDED, SUCCEEDED, KILLED, FAILED]

Oozie SLA

SLA Navigation
[ER diagram: the SLA_EVENT table (event_id, alert_contact, alert_frequency, sla_id, ...) links through sla_id to COORD_JOBS (id, app_name, app_path, ...), COORD_ACTIONS (id, action_number, action_xml, external_id, ...), WF_JOBS (id, app_name, app_path, ...), and WF_ACTIONS (id, conf, console_url, ...)]

Using Probes to analyze/monitor Places


Select probe data for a specified time/location
Validate, filter, and transform probe data
Calculate statistics on the available probe data
Distribute data per geo-tile
Calculate place statistics (e.g. an attendance index)
If an exception condition occurs, report failure; if all steps succeed, report success

Workflow as acyclic graph

Workflow fragment 1

Workflow fragment 2

Oozie tips and tricks

Configuring workflow
Oozie provides three overlapping mechanisms to configure a workflow: config-default.xml, a job properties file, and job arguments that can be passed to Oozie as part of a command-line invocation. Oozie processes these three sets of parameters as follows:
All of the parameters from the command-line invocation are used first
For remaining unresolved parameters, the job properties file is used
config-default.xml is used for everything else

Although the documentation does not clearly describe when to use which, the overall recommendation is as follows (see the illustration below):
Use config-default.xml for parameters that never change for a given workflow
Use job properties for parameters that are common to a given deployment of the workflow
Use command-line arguments for parameters that are specific to a given workflow invocation
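A hypothetical illustration of the precedence (property names, paths, and the -D override are placeholders for whatever your CLI version supports):

# config-default.xml, stored next to workflow.xml: fixed defaults
#   <property><name>queueName</name><value>default</value></property>
# job.properties, per deployment:
#   nameNode=hdfs://namenode:8020
#   oozie.wf.application.path=${nameNode}/user/me/workflow
# command line, per invocation (overrides both of the above):
#   oozie job -oozie http://oozie-host:11000/oozie -config job.properties \
#             -DqueueName=urgent -run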

Accessing and storing process variables


Accessing
Through the arguments in java main

Storing
// The launcher exposes a file path through this system property; properties
// written to that file become the action's output data (the java action must
// declare <capture-output/> for Oozie to read the file back).
String ooziePropFileName = System.getProperty("oozie.action.output.properties");
OutputStream os = new FileOutputStream(new File(ooziePropFileName));
Properties props = new Properties();
props.setProperty(key, value);  // key/value supplied by the action logic
props.store(os, "");
os.close();
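The stored value can then be read elsewhere in the workflow as ${wf:actionData('nodeName')['key']}, where 'nodeName' stands for the name of the java action node.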

Validating data presence


Oozie provides two possible approaches for validating the presence of resource file(s):
Using Oozie coordinator input events based on datasets: technically the simplest implementation approach, but it does not provide the more complex decision support that might be required; it either runs the corresponding workflow or not.
Using a custom java node inside an Oozie workflow: allows the decision logic to be extended, for example by sending notifications about data absence, running on partial data under certain timing conditions, etc. (see the sketch below).

Additional configuration parameters for the Oozie coordinator, for example the ability to wait for file arrival, could expand the applicability of the coordinator approach.
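A minimal sketch of such a custom java node (the class name is illustrative), assuming the files to check are passed as action arguments:

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DataPresenceCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        boolean allPresent = true;
        for (String arg : args) {
            allPresent &= fs.exists(new Path(arg));  // check each input path
        }
        // Expose the result as action output data so that a downstream
        // decision node can branch on ${wf:actionData('check')['present']}.
        Properties props = new Properties();
        props.setProperty("present", Boolean.toString(allPresent));
        String fileName = System.getProperty("oozie.action.output.properties");
        try (OutputStream os = new FileOutputStream(fileName)) {
            props.store(os, "");
        }
    }
}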

Invoking map Reduce jobs


Oozie provides two different ways of invoking a Map Reduce job: the MapReduce action and the java action. Invoking a Map Reduce job with a java action is quite similar to invoking it with the Hadoop command line from an edge node: you specify a driver as the class for the java action, and Oozie invokes the driver. This approach has two main advantages:
The same driver class can be used both for running the Map Reduce job from an edge node and as a java action in an Oozie process.
The driver provides a convenient place for executing additional code, for example clean-up required for Map Reduce execution.

The driver requires a proper shutdown hook to ensure that there are no lingering Map Reduce jobs (see the sketch below).
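A condensed sketch of such a driver (the class name is illustrative and the job setup is elided):

import org.apache.hadoop.mapreduce.Job;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // ... configure mapper, reducer, and input/output paths from args ...
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                if (!job.isComplete()) {
                    job.killJob();  // no lingering MR job if the launcher dies
                }
            } catch (Exception e) {
                // best-effort cleanup
            }
        }));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}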

Implementing predefined looping and forking


hPDL is an XML document with a well-defined schema. This means that the actual workflow can be easily manipulated through JAXB objects, which can be generated from the hPDL schema using the xjc compiler. As a result, a complete workflow can be created programmatically, based on a calculated number of fork branches, or with loops implemented as repeated actions. Another option is to create a template process and modify it based on calculated parameters.
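A sketch of the programmatic approach; it assumes xjc has been run against the hPDL schema, and the generated class names used here (WORKFLOWAPP, FORK, FORKTRANSITION, ObjectFactory) depend on the schema version and JAXB bindings:

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;

public class WorkflowGenerator {
    public static void main(String[] args) throws Exception {
        WORKFLOWAPP wf = new WORKFLOWAPP();   // generated by xjc
        wf.setName("generated-wf");
        FORK fork = new FORK();
        fork.setName("forking");
        int branches = Integer.parseInt(args[0]);  // calculated at run time
        for (int i = 0; i < branches; i++) {
            FORKTRANSITION path = new FORKTRANSITION();
            path.setStart("branch-" + i);
            fork.getPath().add(path);
        }
        // ... add the branch actions and the matching join the same way ...
        Marshaller m = JAXBContext.newInstance(WORKFLOWAPP.class).createMarshaller();
        m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        m.marshal(new ObjectFactory().createWorkflowApp(wf), System.out);
    }
}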

Oozie client security (or lack of)


By default, the Oozie client reads the client's identity from the local machine OS and passes it to the Oozie server, which uses this identity for MR job invocation. Impersonation can be implemented by overriding the OozieClient method createConfiguration, where the client's identity can be set through a new constructor (see the skeleton below):

public Properties createConfiguration() {
    Properties conf = new Properties();
    if (user == null) {
        conf.setProperty(USER_NAME, System.getProperty("user.name"));
    } else {
        conf.setProperty(USER_NAME, user);
    }
    return conf;
}
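A skeleton of the subclass the override above would live in (the class name is illustrative):

public class ImpersonatingOozieClient extends OozieClient {
    private final String user;

    public ImpersonatingOozieClient(String oozieUrl, String user) {
        super(oozieUrl);
        this.user = user;  // identity to impersonate, or null for the OS user
    }

    // the createConfiguration() override shown above goes here
}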

uber jars with Oozie


An uber jar contains nested resources: other jars, .so libraries, zip files.
The launcher (see the sketch below):
unpacks the resources into the current directory
sets up an inverse (child-first) classloader
invokes the MR driver, passing through the arguments
sets a shutdown hook and waits for completion
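A condensed sketch of such a launcher (class and argument names are illustrative, matching the java action below):

import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Arrays;

public class UberLauncher {
    public static void main(String[] args) throws Exception {
        // 1. Unpack the nested jars/.so/zip resources from the uber jar
        //    into the current working directory (elided).
        File[] jars = new File(".").listFiles((dir, name) -> name.endsWith(".jar"));
        URL[] urls = new URL[jars.length];
        for (int i = 0; i < jars.length; i++) {
            urls[i] = jars[i].toURI().toURL();
        }
        // 2. "Inverse" classloader; a real implementation overrides loadClass
        //    to consult these URLs before delegating to the parent.
        ClassLoader loader = new URLClassLoader(urls, UberLauncher.class.getClassLoader());
        Thread.currentThread().setContextClassLoader(loader);
        // 3. Invoke the application driver reflectively, forwarding arguments
        //    (the shutdown hook discussed earlier is elided).
        String appMain = args[0].substring("-appStart=".length());
        Method main = loader.loadClass(appMain).getMethod("main", String[].class);
        main.invoke(null, (Object) Arrays.copyOfRange(args, 1, args.length));
    }
}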

[Diagram: the Oozie server submits a launcher java action; the uber jar bundles the launcher classes together with the nested jars, .so libraries, and zip files used by the spawned mappers]

<java>
    <main-class>${wfUberLauncher}</main-class>
    <arg>-appStart=${wfAppMain}</arg>
</java>
