B Lublinsky, A Yakubovich
Apache Oozie
Oozie is a workflow/coordination system for managing Apache Hadoop jobs. A single Oozie server implements all four functional Oozie components:
Oozie workflow
Oozie coordinator
Oozie bundle
Oozie SLA
Main components
Figure: main components. Labels: Oozie Server, 3rd party application, Oozie Command Line Interface, Bundle, Coordinator, time condition monitoring, wf logic, action, job submission and monitoring, definitions and states, Hadoop.
Oozie workflow
Workflow Language
Flow-control nodes (node, XML element type: description)
Decision, workflow:DECISION: expresses switch-case logic
Fork, workflow:FORK: splits one path of execution into multiple concurrent paths
Join, workflow:JOIN: waits until every concurrent execution path of a previous fork node arrives at it
Kill, workflow:kill: forces a workflow job to kill (abort) itself

Action nodes (node, XML element type: description)
java, workflow:JAVA: invokes the main() method of the specified Java class
fs, workflow:FS: manipulates files and directories in HDFS; supports the commands move, delete, mkdir
MapReduce, workflow:MAP-REDUCE: starts a Hadoop map/reduce job, which can be a Java MR job, a streaming job or a pipes job
Pig, workflow:pig: runs a Pig job
Sub workflow, workflow:SUB-WORKFLOW: runs a child workflow job
Hive *, workflow:HIVE: runs a Hive job
Shell *, workflow:SHELL: runs a shell command
ssh *, workflow:SSH: starts a shell command on a remote machine as a remote secure shell
Sqoop *, workflow:SQOOP: runs a Sqoop job
Email *, workflow:EMAIL: sends emails from an Oozie workflow application
Distcp ?: under development (Yahoo)
Workflow actions
Oozie workflow supports two types of actions:
Synchronous, executed inside the Oozie runtime
Asynchronous, executed as a Map Reduce job
Sequence diagram participants: ActionStartCommand, WorkflowStore, Services, ActionExecutorContext, JavaActionExecutor, JobClient
1 : workflow := getWorkflow()
2 : action := getAction()
3 : context := init<>()
4 : executor := get()
5 : start()
6 : submitLauncher()
9 : setStartData()
Workflow lifecycle
PREP
KILLED
RUNNING
FAILED
SUSPENDED
SUCCEEDED
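The lifecycle diagram did not survive on this slide, so the sketch below models the six states as a Java enum. The allowed transitions are an assumption taken from memory of the Oozie workflow specification, not from this deck, and the names `WorkflowState`, `next` and `canMoveTo` are hypothetical.

```java
import java.util.EnumSet;
import java.util.Set;

/** Workflow job states from the slide; the transition table is an assumption. */
enum WorkflowState {
    PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED;

    /** States assumed to be reachable directly from this one. */
    Set<WorkflowState> next() {
        switch (this) {
            case PREP:
                return EnumSet.of(RUNNING, KILLED);
            case RUNNING:
                return EnumSet.of(SUSPENDED, SUCCEEDED, KILLED, FAILED);
            case SUSPENDED:
                return EnumSet.of(RUNNING, KILLED);
            default:
                // SUCCEEDED, KILLED and FAILED are treated as terminal.
                return EnumSet.noneOf(WorkflowState.class);
        }
    }

    /** True if a job in this state may move directly to the target state. */
    boolean canMoveTo(WorkflowState target) {
        return next().contains(target);
    }

    /** True if no further transition is possible from this state. */
    boolean isTerminal() {
        return next().isEmpty();
    }
}
```

A monitoring tool could use `isTerminal()` to decide when to stop polling a job.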
Bad
Process has to be expressed directly in hPDL with no visual support
No support for Uber Jars (but we added our own)
Ugly
Static forking (but you can regenerate the workflow and invoke it on the fly)
No support for loops
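One way to work around static forking is to generate the fork and join nodes just before submission. The following is a minimal sketch under that assumption: the class `ForkJoinGenerator` and its parameters are hypothetical, the node names are not XML-escaped, and each generated path still needs an action whose ok transition points at the join.

```java
import java.util.List;

/** Hypothetical helper that emits an hPDL fork/join pair for a variable number of paths. */
class ForkJoinGenerator {

    /**
     * Builds a fork with one path per action name, followed by the matching join.
     * Names are inserted as-is, so callers must pass valid XML attribute values.
     */
    static String forkJoin(String forkName, String joinName, String nextNode,
                           List<String> actionNames) {
        StringBuilder xml = new StringBuilder();
        xml.append("<fork name=\"").append(forkName).append("\">\n");
        for (String action : actionNames) {
            xml.append("    <path start=\"").append(action).append("\"/>\n");
        }
        xml.append("</fork>\n");
        // The join waits for every path started by the fork, then moves on.
        xml.append("<join name=\"").append(joinName)
           .append("\" to=\"").append(nextNode).append("\"/>\n");
        return xml.toString();
    }
}
```

The generated fragment would be spliced into a workflow definition that is written to HDFS and submitted as a new job, which is what "regenerate and invoke" amounts to.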
Oozie Coordinator
Coordinator language
Element type: description (attributes and sub-elements)
coordinator-app: top-level element in a coordinator instance (frequency, start, end)
controls: specifies the execution policy for the coordinator and its elements (workflow actions) (timeout (actions), concurrency (actions), execution order (workflow instances))
action: required singular element specifying the associated workflow. The jobs specified in the workflow consume and produce dataset instances (workflow name)
datasets: collection of data referred to by a logical name. Datasets serve to specify data dependencies between workflow instances
input event: specifies the input conditions (in the form of present datasets) that are required in order to execute a coordinator action
output event: specifies the dataset that should be produced by a coordinator action
Coordinator lifecycle
Oozie Bundle
Bundle lifecycle
PREP
PREPSUSPENDED
PREPPAUSED
RUNNING
KILLED
SUSPENDED
SUCCEEDED
FAILED
PAUSED
Oozie SLA
SLA Navigation
COORD_JOBS: id, app_name, app_path
WF_JOBS: id, app_name, app_path
SLA_EVENT: event_id, alert_contact, alert_frequency, sla_id, ...
COORD_ACTIONS: id, action_number, action_xml, external_id, ...
WF_ACTIONS: id, conf, console_url
Workflow fragment 1
Workflow fragment 2
Configuring workflow
Oozie provides three overlapping mechanisms to configure a workflow: config-default.xml, the job properties file, and job arguments that can be passed to Oozie as part of the command line invocation. Oozie processes these three sets of parameters as follows:
Use all of the parameters from the command line invocation
For remaining unresolved parameters, use the job properties file
Use config-default.xml for everything else
Although the documentation does not clearly describe when to use which, the overall recommendation is as follows:
Use config-default.xml for parameters that never change for a given workflow
Use the job properties file for parameters that are common to a given deployment of a workflow
Use command line arguments for parameters that are specific to a given workflow invocation
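The precedence order given earlier in this section (command line first, then job properties, then config-default.xml) can be illustrated with plain `java.util.Properties`. This is only a sketch of the lookup order, not Oozie's actual implementation; `WorkflowConfigResolver` and `resolve` are hypothetical names.

```java
import java.util.Properties;

/** Hypothetical illustration of the three-level parameter precedence. */
class WorkflowConfigResolver {

    /**
     * Merges the three parameter sources. Later putAll calls overwrite earlier
     * ones, so the command line wins over job properties, which win over
     * config-default.xml.
     */
    static Properties resolve(Properties configDefault,
                              Properties jobProperties,
                              Properties commandLine) {
        Properties resolved = new Properties();
        resolved.putAll(configDefault);   // lowest priority
        resolved.putAll(jobProperties);   // overrides the defaults
        resolved.putAll(commandLine);     // highest priority
        return resolved;
    }
}
```

With a queue name set in all three sources, the value passed on the command line is the one that survives in the merged result.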
Storing
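import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;

// Oozie passes the location of the action output file as a system property.
String ooziePropFileName = System.getProperty("oozie.action.output.properties");
OutputStream os = new FileOutputStream(new File(ooziePropFileName));
Properties props = new Properties();
// key and value are supplied by the surrounding action code.
props.setProperty(key, value);
props.store(os, "");
os.close();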
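The fragment above can be wrapped into a small helper that closes the stream even when writing fails. This is a sketch under assumptions: `ActionOutputWriter`, `write` and `writeForOozie` are hypothetical names, and the java action is assumed to declare `<capture-output/>` so that Oozie reads the file back.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper for passing values from a java action back to the workflow. */
class ActionOutputWriter {

    /** Stores the given values as a properties file at the target location. */
    static void write(File target, Map<String, String> values) throws IOException {
        Properties props = new Properties();
        for (Map.Entry<String, String> entry : values.entrySet()) {
            props.setProperty(entry.getKey(), entry.getValue());
        }
        // try-with-resources closes the stream even if store() throws.
        try (OutputStream os = new FileOutputStream(target)) {
            props.store(os, "");
        }
    }

    /** Writes to the file named by the system property that Oozie sets for the action. */
    static void writeForOozie(Map<String, String> values) throws IOException {
        String fileName = System.getProperty("oozie.action.output.properties");
        if (fileName == null) {
            throw new IllegalStateException("oozie.action.output.properties is not set");
        }
        write(new File(fileName), values);
    }
}
```

Outside Oozie the system property is not set, so `writeForOozie` fails fast instead of throwing a less readable `NullPointerException`.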
Additional configuration parameters for the Oozie coordinator, for example the ability to wait for file arrival, could expand its usage.
Driver requires a proper shutdown hook to ensure that there are no lingering Map Reduce jobs
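A shutdown hook of this kind might look like the sketch below. The slides do not show the authors' driver, so everything here is an assumption: `DriverShutdown` and `KillableJob` are hypothetical, and in a real driver `kill()` could delegate to something like Hadoop's `Job.killJob()`.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical registry that kills still-running jobs when the driver JVM exits. */
class DriverShutdown {

    /** Hypothetical handle for a submitted job. */
    interface KillableJob {
        void kill() throws Exception;
    }

    // Thread-safe list because the hook runs on a separate thread.
    private final List<KillableJob> running = new CopyOnWriteArrayList<>();

    /** Registers a job right after submission. */
    void track(KillableJob job) {
        running.add(job);
    }

    /** Unregisters a job that completed normally. */
    void finished(KillableJob job) {
        running.remove(job);
    }

    /** Kills every tracked job and returns how many kill calls succeeded. */
    int killAll() {
        int killed = 0;
        for (KillableJob job : running) {
            try {
                job.kill();
                killed++;
            } catch (Exception e) {
                System.err.println("Could not kill job: " + e);
            }
        }
        running.clear();
        return killed;
    }

    /** Installs the JVM shutdown hook; call once at driver start-up. */
    void installHook() {
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> { killAll(); }, "driver-shutdown"));
    }
}
```

The hook runs on normal JVM exit and on SIGTERM, but not when the process is killed with SIGKILL, so it reduces lingering jobs without ruling them out.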
Figure labels: jars, so, zip
<java>
    <main-class>${wfUberLauncher}</main-class>
    <arg>-appStart=${wfAppMain}</arg>
</java>