
D5.1: EUBra-BIGSEA Software Architecture


Author(s) Daniele Lezzi (BSC), Walter dos Santos Filho (UFMG)
Status Final
Version V1.0
Date 05.10.2016

Dissemination Level
X PU: Public
PP: Restricted to other programme participants (including the Commission)
RE: Restricted to a group specified by the consortium (including the Commission)
CO: Confidential, only for members of the consortium (including the Commission)

EUBra-BIGSEA is funded by the European Commission under the
Cooperation Programme, Horizon 2020 grant agreement No 690116.
This project results from the 3rd BR-EU Coordinated Call in Information and Communication
Technologies (ICT), announced by the Ministry of Science, Technology and Innovation (MCTI)

Abstract: Europe - Brazil Collaboration of BIG Data Scientific Research through Cloud-Centric
Applications (EUBra-BIGSEA) is a medium-scale research project funded by the European
Commission under the Cooperation Programme, and the Ministry of Science and Technology (MCT)
of Brazil in the frame of the third European-Brazilian coordinated call. The document has been
produced with the co-funding of the European Commission and the MCT.
The purpose of this report is to design the software architecture of the EUBra-BIGSEA platform.
The document describes the overall functioning of the platform components and the interactions between
them, and serves as a development roadmap for the developers of the project.

www.eubra-bigsea.eu | contact@eubra-bigsea.eu |@bigsea_eubr


EUBra-BIGSEA D5.1: EUBra-BIGSEA Software Architecture

Document identifier: EUBRA BIGSEA –WP5-D5.1


Deliverable lead BSC
Related work package WP5
Author(s) Daniele Lezzi (BSC), Walter dos Santos Filho (UFMG)
Contributor(s) Ignacio Blanquer (UPV), Dorgival Guedes (UFMG), Sandro Fiore (CMCC), Danilo
Ardagna (POLIMI)
Due date 30/09/2016
Actual submission date 05/10/2016
Reviewed by Germán Moltó (UPV), Dorgival Guedes (UFMG)
Approved by PMB
Start date of Project 01/01/2016
Duration 24 months
Keywords Big Data, programming models, architecture design, analytics

Versioning and contribution history


Version Date Authors Notes
0.1 31/08/2016 Daniele Lezzi (BSC) Table of Contents
0.2 Daniele Lezzi (BSC) Sections about COMPSs
0.3 15/09/2016 Walter Santos (UFMG) Section about Lemonade
0.4 21/09/2016 Daniele Lezzi (BSC) Requirements and technology evaluation sections
0.5 21/09/2016 Nuno Antunes (UC) Security section
0.6 26/09/2016 Daniele Lezzi (BSC) General edits
0.7 27/09/2016 Daniele Lezzi (BSC) Edits and formatting
0.8 29/09/2016 Danilo Ardagna (POLIMI), Dorgival Guedes (UFMG) Revisions
0.9 30/09/2016 Ignacio Blanquer (UPV) Revision
1.0 03/10/2016 Daniele Lezzi Final version reviewed

Copyright notice: This work is licensed under the Creative Commons CC-BY 4.0 license. To view a copy of this license, visit
https://creativecommons.org/licenses/by/4.0.
Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does not necessarily represent the
views expressed by the European Commission or its services.
While the information contained in the document is believed to be accurate, the author(s) or any other participant in the EUBra-
BIGSEA Consortium make no warranty of any kind with regard to this material including, but not limited to the implied warranties of
merchantability and fitness for a particular purpose.
Neither the EUBra-BIGSEA Consortium nor any of its members, their officers, employees or agents shall be responsible or liable in
negligence or otherwise howsoever in respect of any inaccuracy or omission herein.
Without derogating from the generality of the foregoing neither the EUBra-BIGSEA Consortium nor any of its members, their officers,
employees or agents shall be liable for any direct or indirect or consequential loss or damage caused by or arising from any
information advice or inaccuracy or omission herein.


TABLE OF CONTENTS
EXECUTIVE SUMMARY
1 Introduction
1.1 Scope of the Document
1.2 Target Audience
1.3 Structure
2 EUBra-BIGSEA Architectural Overview
3 Programming Abstractions Layer Requirements
3.1 Use Case and technical requirements
3.2 Types of users
4 Programming Abstractions Layer Design
4.1 Architecture Design
4.1.1 Programming frameworks
4.1.2 Support to QoS specification
4.1.3 Code generation and composition
4.2 Application lifecycle
4.3 Security aspects
4.4 Technology Analysis
5 Conclusions
6 References
7 GLOSSARY

LIST OF TABLES
Table 1 - QoS constraints specification
Table 1 - Lemonade supported operations
Table 2 - Requirements and technologies

LIST OF FIGURES
Figure 1 - High-level view of the EUBra-BIGSEA architecture
Figure 2 - Detailed view of the software architecture
Figure 3 - Architecture of the Abstraction Layer
Figure 4 - Detailed view of WP5 components
Figure 5 - JSON specification of the QoS
Figure 6 - Lemonade components and supported frameworks (in progress)
Figure 7 - Lemonade Citron user interface
Figure 8 - Application lifecycle diagram
Figure 9 - Relation between WP5 and WP6 (adapted from D6.1)


EXECUTIVE SUMMARY
The EUBra-BIGSEA project aims at developing a set of cloud services empowering Big Data analytics to ease
the development of massive data processing applications. EUBra-BIGSEA will develop models, predictive and
reactive cloud infrastructure QoS techniques, efficient and scalable Big Data operators and a privacy and
quality analysis framework, exposed to several programming environments. EUBra-BIGSEA aims at covering
general requirements of multiple application areas, although it will be showcased in the treatment of massive
connected society information, and particularly in route recommendation.
The abstractions layer provides functionalities that allow applications composed of data operators mapped to
different Big Data frameworks to be built transparently. In addition, the integration with the infrastructure
layer makes applications scale effectively across the infrastructure, and provides developers with appropriate
abstractions to specify QoS constraints and a unified programming interface that includes computing, data
analytics, and security APIs.
The programming layer will be based on COMPSs and Spark, which provide complementary capabilities to
satisfy the use case requirements. Spark is an open source data processing framework. COMPSs is a
programming framework that aims to facilitate the parallelisation of existing applications through a simple
programming model based on sequential development. The COMPSs runtime is in charge of exploiting the
inherent concurrency of the code: it automatically detects and enforces the data dependencies between
tasks, spawns these tasks on the available resources, and provides scalability and elasticity features that
allow the dynamic provision of resources. The COMPSs programming interface will be enhanced in the
project through the integration of QoS constraints and security hints; the COMPSs runtime will be extended
with support for Mesos [R04] in order to benefit from the proactive elasticity of the EUBra-BIGSEA
infrastructure.
The final goal is to have an integrated layer that provides building blocks, developed with COMPSs and Spark,
that can be imported into user applications.


1 INTRODUCTION
1.1 Scope of the Document
This document summarizes the architectural design to be implemented in the course of the EUBra-BIGSEA
project. We highlight the interrelations among the different work packages involved in the decisions adopted,
and also outline the reasoning behind the choices made.
This document is intended for general reference but mostly focuses on the design of the Programming Model
Abstraction Layer and its integration with the other layers for applications (WP7), Big Data Ecosystem (WP4),
QoS infrastructure (WP3) and security (WP6) whose architectures have been detailed in D7.1, D4.1, D3.1 and
D6.1 respectively.

1.2 Target Audience


The document is mainly intended for internal use, although it is publicly released. Its main target is the
global team of technical experts of EUBra-BIGSEA, including WP3, WP4, WP5 and WP6.

1.3 Structure
The rest of the document is structured as follows. Section 2 contains a high-level summary of the BIGSEA
architecture. Section 3 analyses the use case requirements that relate to the programming
frameworks and proposes a list of requirements specific to the design of the abstractions layer. Section 4
addresses the main objective of the document, the design of the software architecture, with an analysis of
the components selected to implement the layer. Section 5 concludes the document, providing a timeline for
the remaining implementation activities.


2 EUBRA-BIGSEA ARCHITECTURAL OVERVIEW


The EUBra-BIGSEA general architecture, as described in deliverable D3.1, comprises four main blocks:
● QoS Cloud Infrastructure services, which integrate the modelling of the workload, the monitoring of
the resources, the implementation of vertical and horizontal elasticity and the contextualization.
● Big Data Analytics services, which provide operators to process huge datasets and which can be
integrated in the programming models. Analytics services are characterized in the QoS cloud
infrastructure models of the underlying layer, which will adjust resources to the expected workload
automatically (or explicitly driven by the analytics services), taking its specificities into account.
● Programming Models, which provide a higher-level programmatic framework and are also
characterized by the models of the infrastructure. The programming models will ease the
parallelization of the applications developed on top of them.
● Privacy and Security framework, which provides the means to annotate data and processing and
ensures the proper protection of privacy and security.
On top of those four blocks, applications are developed using the programming models and the data analytics
extensions. Application developers are expected to use the programming models and may use other features
of underlying layers, such as the user-level QoS metrics.
Figure 1 shows the high-level view of the EUBra-BIGSEA architecture depicting the interactions among the
main blocks.

Figure 1 - High-level view of the EUBra-BIGSEA architecture

Figure 2 highlights the separation between the infrastructure components for the management of the
resources (described in D3.1) and the software components that are targeted in this document. In particular,
WP5 focuses on how applications are composed using the abstractions provided by the programming models
and how those applications are deployed benefitting from the high-availability and reliability features of the
infrastructure.


Figure 2 - Detailed view of the software architecture

As shown in Figure 2, the resources of the infrastructure are managed by a Cloud Management Framework
(OpenNebula or OpenStack) that deploys or undeploys Virtual Machines (VMs) when requested. These VMs
build up the Mesos cluster, providing the agents on demand. The cluster is configured by the Infrastructure
Manager (IM), and the elasticity at the level of the resources is managed through Elastic Computing Clusters
in the Cloud (EC3), which monitors the Mesos cluster to detect when more resources are needed and when
idle ones can be powered off. The Mesos cluster is accessed through a scheduler.
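
The reactive behaviour attributed to EC3 above (detecting when the cluster needs more agents and when idle ones can be powered off) can be pictured as a simple decision function. The following is a minimal sketch under stated assumptions, not the actual EC3 policy: ClusterState, elasticity_decision, tasks_per_agent and max_agents are hypothetical names and values introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    pending_tasks: int  # tasks waiting for resources (assumed metric)
    busy_agents: int    # agents currently running tasks
    idle_agents: int    # agents powered on but unused


def elasticity_decision(state, tasks_per_agent=4, max_agents=10):
    """Return how many agents to add (> 0) or power off (< 0)."""
    total = state.busy_agents + state.idle_agents
    if state.pending_tasks > 0:
        # Ceiling division: agents required to absorb the pending tasks.
        needed = -(-state.pending_tasks // tasks_per_agent)
        # Idle agents take work first; only the remainder needs new VMs.
        to_add = max(0, needed - state.idle_agents)
        # Never grow beyond the (assumed) cluster size limit.
        return max(0, min(to_add, max_agents - total))
    # No pending work: every idle agent is a candidate for power-off.
    return -state.idle_agents


scale_up = elasticity_decision(ClusterState(pending_tasks=9, busy_agents=2, idle_agents=1))
scale_down = elasticity_decision(ClusterState(pending_tasks=0, busy_agents=2, idle_agents=3))
```

A positive result would translate into VM deployment requests to the Cloud Management Framework, a negative one into undeploy requests; how the real monitoring data is gathered is outside the scope of this sketch.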

On the logical components side, users write their applications using the Lemonade IDE, which transforms
them into Spark and COMPSs code. Users can also write their programs directly in Spark or COMPSs, or port
existing applications written in Java or Python. Proactive policies give an estimate of the resources needed by
the execution; this estimate is readjusted by the Monitoring system, which monitors QoS compliance.


3 PROGRAMMING ABSTRACTIONS LAYER REQUIREMENTS


3.1 Use Case and technical requirements

This section provides a summary of the requirements, described in D7.1, relevant for the definition of the
abstraction layer.

● RE.1. Batch jobs. The infrastructure must support unrestricted batch execution of data analytic jobs.
Unrestricted here means batch jobs with no QoS bounds, where latency is not a key issue. Single jobs
will be those that normally could fit in memory.
● RE.2. Bag of tasks. The infrastructure must support unrestricted batch execution of a bag of data
analytics jobs. A bag of jobs is a model that fits the High-Throughput Computing (HTC) paradigm.
● RE.3. QoS Batch jobs. The infrastructure must execute batch jobs with associated QoS. Executions
should be characterized in terms of time and could also require a bounded budget, expressed as the
maximum resource time to be spent. The scheduler should adjust the resources to meet the expected QoS.
● RE.4. Deadline-based jobs. The execution service of the infrastructure should accept deadline-based
execution requests, which have to finish by a given time and are characterized in terms of
resources and expected execution time. If the deadline is not feasible at submission time, the service
will notify the user and run the job as soon as resources are available. Otherwise, it will schedule the
execution for the future: the closer to the deadline the execution takes place, the more up-to-date
the data will be.
● RE.5. Self-adapting elasticity. The algorithms will be described in a way that the infrastructure can
dedicate more resources to fulfill the QoS. The infrastructure must be reactive in both allocated
computing resources and allocated memory. The infrastructure must be self-adapting in order to
accommodate the workload peaks that can appear in HTC applications.
● RE.6. Short jobs. The infrastructure must support the execution of short jobs, finishing in interactive
time, which could arrive massively (hundreds per minute).
● RE.7. Workflows management. The infrastructure must support the execution of big data workflows,
where the input data and the products can be large (in the order of tens of GBs).
● RA.1. Authentication. The infrastructure must support end-user authentication for access control
and accounting purposes.
● RA.2. Authorization. The infrastructure must support end-user authorization for accessing the data
and the applications deployed with the infrastructure.
● R1.5. Data Access API. An API must be exposed to deal with the storage resources: to authenticate,
populate data, retrieve and filter data, and update data. The same operations must be available for
metadata. Data access should have a short latency (near real-time access).
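
One possible reading of RE.4 can be written down as a small planning function: a request is infeasible when its expected execution time no longer fits before the deadline, in which case the user is notified and the job runs as soon as possible; otherwise the job is scheduled as late as the deadline allows, so that it works on the most recent data. This is a sketch of that interpretation only; plan_deadline_job and the returned field names are hypothetical, and resource availability is deliberately left out.

```python
from datetime import datetime, timedelta


def plan_deadline_job(now, deadline, expected_runtime):
    """Decide when a deadline-based job should start (illustrative only)."""
    latest_start = deadline - expected_runtime
    if latest_start < now:
        # The deadline cannot be met: warn the user, run as soon as possible.
        return {"feasible": False, "notify_user": True, "start_at": now}
    # The deadline can be met: start as late as possible for fresher data.
    return {"feasible": True, "notify_user": False, "start_at": latest_start}


now = datetime(2016, 10, 5, 12, 0)
deadline = datetime(2016, 10, 5, 18, 0)
on_time = plan_deadline_job(now, deadline, timedelta(hours=2))
too_late = plan_deadline_job(now, deadline, timedelta(hours=8))
```

A real scheduler would also have to account for the resources requested by the job and for the error margin of the execution time estimate.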

The requirements analysis has already been performed in other technical WPs from different points of view,
leading to technical choices related to the definition of the QoS infrastructure, of the Big Data ecosystem and
of the security strategy. The findings of those activities are relevant for the definition and implementation of
the programming abstraction layer. This document analyses the technological choices described in those
deliverables, focusing on how the WP5 components have to be selected and extended to be fully integrated
in the BIGSEA platform.


In particular, the D3.1 document has the objective of identifying the services of the QoS cloud architecture
for the Big Data analytics platform developed in EUBra-BIGSEA. Mesos has been selected for the
management of distributed resources, whose availability will be monitored and elastically managed according
to the QoS parameters defined in the applications.
D4.1 describes the big data systems integrated to address multifaceted use case requirements, including
fast data analysis over continuous streams from external data sources, general purpose data mining and
machine learning tools as well as OLAP-based systems for multidimensional data analysis. A relevant
outcome of the document is the proposal of a data access API (to address R1.5) that can be used directly by
the applications or through the tools developed in WP5.
D6.1 addresses security requirements related to the application development tools. In particular, the main
concern is the possibility to define privacy annotations in the programming model interface and to support
authentication, authorization and accounting mechanisms.
Based on this, the Abstraction Layer (AL) has the following requirements:
RAL1. Support to QoS batch jobs. The programming framework runtime must provide support to the
execution of batch jobs with possible QoS constraints such as time and number of resources.
RAL2. Integration with Mesos. The programming framework runtime has to be able to schedule tasks on the
Mesos middleware.
RAL3. Support to reactive elasticity. The runtime must be aware of the changes in the QoS infrastructure in
order to adapt the scheduling policies.
RAL4. Support to HDFS data locations. The applications should be able to read and write data in HDFS
backends. This could require extending the runtime data manager.
RAL5. Definition of QoS constraints in the programming interface. A central topic of the project is the ability
to define QoS parameters both at application definition and at execution time, depending on the type of
metric.
RAL6. Support to data privacy in the definition of the algorithms. Include ways for the programmer to
express the privacy characteristics of the developed algorithms.
RAL7. Definition of Big Data workflows. Maps directly to RE.7; as explained later, the idea of the
Abstractions Layer is to provide building blocks for the use cases that can be composed as workflows.

3.2 Types of users


Different classes of users could use the tools provided by the Programming Abstraction Layer. The following
classes of users have been identified, according to their type of requirement and privilege:

● Developers use the Programming Abstraction Layer to develop end-user applications, to test the
underlying system and to execute ETL, data mining and analytics processing. Developers may
also define restrictions regarding QoS and AAA.
● Domain experts use higher levels of abstraction (e.g. workflows) to compose processing tasks,
generate new machine learning models and assess AAA policies.
● Students and practitioners learning about distributed algorithm processing, machine learning and
data science.

4 PROGRAMMING ABSTRACTIONS LAYER DESIGN


This section provides details on the design of the programming layer of the EUBra-BIGSEA Platform. The
description of each component, its role in the platform and the relations with other WPs are presented.

4.1 Architecture Design


Figure 3 depicts the architectural diagram of the programming abstraction layer. This layer provides the
functionalities needed to satisfy the requirements for the implementation of the application scenarios on
top of the Big Data layer.

Figure 3 - Architecture of the Abstraction Layer

The programming frameworks enable the implementation of the use cases by providing modules and libraries
(building blocks in the figure) that abstract the big data technologies used to access and process the data
sources, and that optimize their execution on the QoS infrastructure.
Abstractions for specifying QoS constraints (e.g., job execution deadlines, minimum throughput rate to the
storage sub-system) will be integrated with the programming model and will be translated into resource
management policies (see 4.1.2).
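
To make the idea of QoS abstractions in the programming model more concrete, the sketch below attaches a constraint to a function and derives a naive resource policy from it. It is an illustration under strong assumptions, not the EUBra-BIGSEA interface: the qos decorator, the QoS class, to_resource_policy and the field names are all hypothetical, and the policy assumes perfectly parallel work whose total duration is known in advance.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QoS:
    deadline_s: float                  # maximum allowed execution time
    max_workers: Optional[int] = None  # optional cap on resources


def qos(**constraints):
    """Hypothetical annotation: attach QoS metadata without changing behaviour."""
    def decorate(func):
        func.qos = QoS(**constraints)
        return func
    return decorate


def to_resource_policy(constraints, estimated_work_s):
    """Translate a QoS constraint into a worker count (naive model)."""
    workers = math.ceil(estimated_work_s / constraints.deadline_s)
    capped = constraints.max_workers is not None and workers > constraints.max_workers
    if capped:
        workers = constraints.max_workers
    return {"workers": max(1, workers), "deadline_at_risk": capped}


@qos(deadline_s=600, max_workers=8)
def train_model(data):
    return len(data)


policy = to_resource_policy(train_model.qos, estimated_work_s=3000)
```

Keeping the constraint as metadata on the function leaves the application code unchanged, so the same module can run with or without QoS support in the runtime.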
There is a strong integration with the tools provided in WP3, in order to make use of the execution and
deployment services and to adapt the runtimes to changes in the available resources according to the
QoS policies.
According to the design of the Big Data Ecosystem described in D4.1, a minimum set of modules should be
implemented to support the three use cases and related scenarios:

• Data access and loading: entity matching, data quality analysis, data ingestion. Applications that
periodically read sources of data and execute a potentially parallel algorithm that produces a large
data output plus some indicators. Here QoS is critical to ensure that the data is obtained on time.
Other applications are related to real-time data inspection, which needs a scalable
persistent service that supports client requests. For this use case, data parallel programming models
that support streaming will be adopted.
• Descriptive models: long-lasting execution tasks that run in parallel and train models with the new
data. They run periodically with a deadline and produce models whose output may be classified
as more sensitive to privacy protection than the input. In this case, different options for clustering
algorithms should be provided by the abstraction layer.
• Predictive models: a continuously running service with scalability capabilities, which runs jobs
that produce the result of the prediction. The scenarios included in this use case are more
compute-bound than data-bound. Task parallel approaches are well suited to
implement this use case, even though the user stories include runs of near real-time predictions.

In order to ease the composition of the programmed modules, a tool for the generation of code will be
introduced and extended to support the programming frameworks.
The following sections provide a detailed description of the components that implement each functionality.

4.1.1 Programming frameworks


One of the main ambitions of the EUBra-BIGSEA project is to offer a programming layer that makes
applications scale effectively across the infrastructure, and that provides developers with appropriate
abstractions to specify QoS constraints and a unified interface that includes computing, data analytics, and
security APIs. The base of this layer is the COMPSs [R01] framework, which provides a simple programming
model based on sequential development and a runtime system in charge of exploiting the inherent
concurrency of the code, automatically detecting and enforcing the data dependencies between tasks and
spawning these tasks on the available resources. In order to guarantee interoperability with existing
applications, the Spark programming ecosystem is also included. D3.1 and D4.1 have already analysed the
use of Spark at the level of execution services and big data ecosystem. This report addresses the integration
of Spark at the level of the programming interface, including issues related to the definition of data locations
in HDFS used in COMPSs.
Figure 4 depicts the detailed architecture of the programming frameworks. COMPSs is used to implement a
set of high-level functionalities, which could be workflows of Ophidia [R03] operators or modules implementing
operations on Big Data backends to address fast data analysis over continuous streams from external data
sources, general purpose data mining and machine learning tools as well as OLAP-based systems for
multidimensional data analysis. At the level of WP7, each of these functionalities is a building block for the
implementation of complex use cases. The implementation of each block is transparent to the user, who
only has to import the module in the code, optionally providing constraints on the execution of that module
through QoS annotations.


Figure 4 - Detailed view of WP5 components

4.1.1.1 Technologies evaluation

Identification COMPSs

Type Programming framework

License Apache v.2

Current version 1.4

Website http://compss.bsc.es

Purpose Programming model which aims to ease the development of applications for
distributed infrastructures, such as Clusters, Grids and Clouds. COMP superscalar
also features a runtime system that exploits the inherent parallelism of
applications at execution time.

High level architecture The following figure depicts the architecture of COMPSs


The COMPSs runtime is implemented using the Java language, so the most natural
programming language for new COMPSs applications is Java. Nevertheless, to
simplify the porting of existing applications written in other languages, COMPSs
has support also for C/C++ and Python applications.
A central concept in COMPSs is that of a task, which represents the model's unit
of parallelism. A task is a method or a service called from the application code that
is intended to be spawned asynchronously and possibly run in parallel with other
tasks on a set of resources, instead of locally and sequentially. In the model, the
user is mainly responsible for identifying and selecting which methods/services
she wants to be tasks.
When the sequential code is executed, the COMPSs runtime intercepts the
method invocations and replaces them with calls to the runtime that create new
asynchronous tasks. Accesses to task data within the main code are also
instrumented, so that the runtime can fetch the correct data values if necessary
from the remote resource where the task was generated (synchronization).

This task selection is done by means of an annotated interface where all the
methods that have to be considered as tasks are defined with annotations
describing their data accesses and the constraints on the resources used for their
execution. At execution time this information is used by the runtime to build a
dependency graph and orchestrate the tasks on the available resources.

Dependencies COMPSs dependencies are resolved at installation time.

Interfaces and language support COMPSs does not provide any specific API for the development of
applications. Supported languages are Java, Python and C/C++.

Security support The interoperability with different backends is implemented through
connectors. In this way, specific security policies can be configured on the
connector.


Data Primitive types (integer, long, float, boolean), strings, objects (instances of user-
defined classes, dictionaries, lists, tuples, complex numbers) and files are
supported in the definition of a task.

Needed improvement Support to Mesos and reconfiguration of the resources according to proactive
policies.
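
The task model described in the COMPSs entry above (sequential main code, intercepted invocations that become asynchronous tasks, dependencies derived from the data passed between tasks, and synchronization when the main code reads a result) can be illustrated with a toy runtime. This is a minimal sketch using only the Python standard library; it is not the COMPSs API, and the names task, wait_on and graph are hypothetical stand-ins chosen for illustration.

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)
graph = []  # dependency edges as (producer task, consumer task)


def task(func):
    """Hypothetical decorator: calls are spawned asynchronously, not run inline."""
    def spawn(*args):
        # A Future argument means this call depends on an earlier task.
        for arg in args:
            if isinstance(arg, Future):
                graph.append((arg.task_name, func.__name__))

        def run():
            # Block until every producer has finished, then run the real code.
            values = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*values)

        future = _pool.submit(run)
        future.task_name = func.__name__
        return future
    return spawn


def wait_on(future):
    """Synchronization point: the main code fetches the real value."""
    return future.result()


@task
def load(n):
    return list(range(n))


@task
def total(values):
    return sum(values)


@task
def add(a, b):
    return a + b


# The main program reads as sequential code; the two load/total chains are independent.
left = total(load(4))
right = total(load(3))
result = wait_on(add(left, right))
```

Here the two load/total chains share no data, so the toy runtime is free to run them concurrently, while add waits for both results. A real runtime would additionally use the recorded graph to place tasks on remote resources and move the data between them.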

Identification Apache Spark

Type Framework for processing big data

License Apache License, Version 2.0.

Current version 2.0.0 - Aug/2016

Website Website: http://spark.apache.org
Documentation: http://spark.apache.org/docs/latest/
Download/Source code: http://spark.apache.org/downloads.html

Purpose To provide a functional programming paradigm abstraction to implement ETL and
machine learning algorithms. Spark implements many data processing operations
and supports different programming paradigms:
● Functional programming using Scala, Python or Java;
● Declarative programming using the SQL language, compatible with the SQL:2003
specification.

High Level Architecture In the latest version, 2.0.0, Spark supports 3 different views of data: RDD (Resilient
Distributed DataSet), DataFrames and DataSets. All structures are stored in memory if
possible and may be written to disk otherwise.
RDDs have been in Spark since version 1.0. They provide a set of transformation methods,
such as map() and filter(), for processing data. Each transformation creates a new
RDD representing the transformed data. Operations on RDDs are executed in a lazy
fashion; transformations are not performed until an action method, for example collect(),
count() or reduce(), is called.
The DataFrame API was introduced in version 1.3.0 to improve Spark performance. A new
concept of schema was introduced to describe the data, enabling much more expressive
code to be built using efficient network communication and off-heap memory JVM
optimization. Catalyst, the Spark query optimizer, was built on top of the
DataFrame API and now allows users to write SQL 2003 compatible queries or use a
fluent-style API to process data.
The last API, DataSet, was introduced in Spark 1.6.0. and aims to provide the best of the
RDD and DataFrame worlds: the familiar object-oriented programming style, with

www.eubra-bigsea.eu | contact@eubra-bigsea.eu |@bigsea_eubr 14


EUBra-BIGSEA D5.1: EUBra-BIGSEA Software Architecture

compile-time type-checking (present in RDD) and optimization capabilities (present in


DataFrame). The DataSet API uses a specialized Encoder to serialize the objects for
processing or transmission over the network. Conceptually, a DataFrame is an alias for a
collection of generic objects of type Dataset[Row], where a Row is a generic untyped JVM
object. A Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a
case class defined in Scala or by a class in Java.

[Image source: http://spark.apache.org/docs/latest/img/cluster-overview.png]

Under the hood, Spark is an in-memory engine for large-scale data processing: a fast,
general-purpose cluster computing system with an optimized engine
that supports general execution graphs.
As shown in the previous image, a Spark program is controlled by a driver program started
by the user, which interacts with a cluster manager to start worker nodes where data
processing tasks are executed. In a standalone cluster deployment, the cluster manager
is a Spark master instance. When using Mesos, the Mesos master replaces the Spark
master as the cluster manager. Similarly, when using YARN, the YARN scheduler takes that
role.
Spark can run batch jobs through spark-submit, which can be used to execute
application binaries remotely. There is also spark-shell, a Scala interactive console, and PySpark,
a Python shell. With these shells, one can run data analytics operations
interactively on a remote system.
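The lazy evaluation of RDD operations described above can be illustrated with a small toy model. This is a minimal sketch in plain Python, not Spark code: the class `LazyDataset` and its methods are hypothetical names that only mimic the idea that transformations are recorded and nothing is computed until an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the evaluation of the recorded transformations.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []

def square(x):
    calls.append(x)  # lets us observe when the function really runs
    return x * x

dataset = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
print(len(calls))         # still 0: no transformation has been executed
print(dataset.collect())  # the action triggers map and filter
```

In real Spark the recorded lineage is additionally used to recompute lost partitions and to schedule the work on the cluster, which this toy model does not attempt.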

Dependencies A bare metal, YARN or Mesos cluster. The use of an HDFS file server architecture is
optional.

Interfaces and languages supported It provides high-level APIs in Java, Scala, Python and R.

Security support Spark currently supports authentication via a shared secret. Spark supports SSL for Akka
and HTTP (for broadcast and file server) protocols. SASL encryption is supported for the
block transfer service. Encryption is not yet supported for the WebUI.
Encryption is not yet supported for data stored by Spark in temporary local storage, such
as shuffle files, cached data, and other application files. If encrypting this data is desired,
a workaround is to configure the cluster manager to store application data on encrypted
disks.

Data Data stored in file systems (e.g., local, NFS, HDFS). Many other connectors
allow Spark to read and write data from other data sources and storage systems (e.g. cloud blob stores such as Amazon
S3 and Microsoft xxx).

Potential usage within BIGSEA Spark is one of the supported programming models in EUBra-BIGSEA. It provides a library,
called ML, that supports the distributed execution of different machine learning techniques (e.g.
linear regression, classification, clustering) by using the programming
abstractions and infrastructure of Spark.

4.1.2 Support to QoS specification


QoS plays a pivotal role in EUBra-BIGSEA, whose main goal is to provide solutions for the optimal delivery
and runtime management of big data applications. Since applications in private or public clouds share the
same infrastructure, their demand for resources may create contention that reduces the final QoS perceived
by the users.
As discussed in Section 3.1, EUBra-BIGSEA will support the execution of three main classes of QoS-based
jobs:
• QoS batch jobs, characterized by a total resource budget for application execution in terms of
number of cores/containers/memory.
• Deadline-based jobs, characterized by a maximum execution time/deadline.
• Short jobs, possibly executed by streaming systems (e.g., for tweet analysis), characterized not only
by a deadline but also by the sustained throughput that must be guaranteed.

A constraint is completely defined by the fields reported in Table 1.

Field: Name
Description: Constraint unique identifier.
Mandatory: Yes
Example: my_constraint_name

Field: Target application
Description: The application to which the constraint is applied.
Mandatory: Yes
Example: my_job_name

Field: Target metric
Description: Metric the constraint predicates on. Multiple target metrics can be provided.
Mandatory: Yes
Example: CPU or container number (cpu); memory size (memory); application makespan
(application_execution_time); number of successful runs per time unit (application_throughput)

Field: Metric value
Description: Value and inequality that must be fulfilled (one for each target metric).
Mandatory: Yes
Example: <=10 (for CPU or container number); <=10GB (for memory); <=10 min (application execution
time); >=4 completions/min (application throughput)

Field: Priority
Description: A value defining whether a constraint cannot be violated (hard constraint,
characterized by value 0) or can be violated (soft constraint; in this case the value provides a
ranking among constraints, the higher the better). One priority field has to be specified for each
target metric.
Mandatory: Yes
Example: Integer value

Table 1 - QoS constraints specification

The metrics that will be initially considered are the number of CPUs/containers that support the
application execution, the total memory allocated in the infrastructure, the application execution time and
the throughput.
Constraints can predicate on multiple metrics (e.g., short jobs are characterised by a deadline but
also by a minimum throughput). Constraints will be specified as JSON files and stored in the Mesos master.
The information is coded within the application JSON description, as shown in Figure 5.

{ "type": "CMD",
  "name": "my_job_name",
  "periodic": "R24P60",
  "QoS" : [
    { "metric": "deadline",
      "op": "==",
      "value": "2016-06-10T17:22:00Z+2",
      "priority": 0 },
    { "metric": "cpu",
      "op": "<=",
      "value": 10,
      "priority": 2 },
    { "metric": "memory",
      "op": "<=",
      "value": "10G",
      "priority": 1 },
    { "metric": "application_execution_time",
      "op": "<=",
      "value": "10M",
      "priority": 1 },
    { "metric": "application_throughput",
      "op": ">=",
      "value": "24d",
      "priority": 3 }
  ],
  "command" : "mycommand"
}
Figure 5 - JSON specification of the QoS
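As an illustration of how such a description could be consumed, the following minimal Python sketch checks a set of measured metrics against the QoS constraints of a job. It is not part of the EUBra-BIGSEA code base: the function `check_constraints`, the numeric units (minutes, cores, GB) and the ordering rule for soft constraints (a larger priority value ranks higher) are assumptions made for the example.

```python
import json
import operator

# Comparison operators allowed in the "op" field of a constraint.
OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}

def check_constraints(job, measured):
    """Split violated constraints into hard ones (priority 0) and soft ones."""
    hard, soft = [], []
    for constraint in job["QoS"]:
        metric = constraint["metric"]
        if metric not in measured:
            continue  # nothing observed for this metric yet
        if OPS[constraint["op"]](measured[metric], constraint["value"]):
            continue  # constraint fulfilled
        if constraint["priority"] == 0:
            hard.append(metric)
        else:
            soft.append((constraint["priority"], metric))
    # Assumption: a larger priority value ranks a soft constraint higher.
    soft.sort(reverse=True)
    return hard, [metric for _, metric in soft]

job = json.loads("""
{ "type": "CMD", "name": "my_job_name", "command": "mycommand",
  "QoS": [
    {"metric": "application_execution_time", "op": "<=", "value": 10, "priority": 0},
    {"metric": "cpu", "op": "<=", "value": 10, "priority": 2},
    {"metric": "memory", "op": "<=", "value": 10, "priority": 1}
  ] }
""")

# Hypothetical measurements: minutes, cores and GB respectively.
measured = {"application_execution_time": 12, "cpu": 16, "memory": 12}
hard, soft = check_constraints(job, measured)
print("hard violations:", hard)
print("soft violations:", soft)
```

Values with unit suffixes such as "10G" or "10M" would need to be normalized to numbers before a comparison like this one.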

QoS constraints will be specified through Lemonade (see Section 4.1.3.1). Moreover, since EUBra-BIGSEA
envisions the definition and runtime support of applications encompassing multiple runtime environments
(e.g., applications in which part of the data analysis workflow is implemented in Spark and part in COMPSs,
possibly accessing Ophidia operators), task T5.3 will develop solutions to optimally split a global application
constraint (i.e., a constraint predicating, e.g., on the whole application execution time) into local constraints
that have to be enacted on the underlying runtime environments. In this way the definition of the WP3
proactive runtime management policies can be simplified by specifying sets of adaptation rules predicating
on metrics and actuating mechanisms provided by the individual runtime frameworks that support the
application execution.
According to the EUBra-BIGSEA DoW, this latter activity will start at M13.

4.1.3 Code generation and composition

4.1.3.1 Lemonade
Lemonade (Live Environment for Mining Of Non-trivial Amount of Data Everywhere) is a web application
in which users can drag and drop operations and data sources to compose different ETL and machine learning
workflows. Lemonade targets users who do not want to learn a programming language or who need to
develop workflows using the existing toolset.

In terms of user profile, Lemonade suits users from areas such as Mathematics, Statistics and
Business Administration, as well as those learning about Data Science.

Lemonade components are shown in Figure 6.

Figure 6 - Lemonade components and supported frameworks (in progress)


The first component, Citron (Figure 7), is a web-based user interface for creating workflows. Users choose
among a set of predefined operations, which compose the workflow, by dragging and dropping them into
the design area. Input data is specified by choosing one or more data sets from the toolbox. Workflows are
stored in Citron’s relational database. When a user triggers the execution of a workflow, a JSON file
describing it is generated and sent to Juicer for processing.

Figure 7 - Lemonade Citron user interface

Only the data sets accessible to the logged-in user are made available through the interface. Operations are
the smallest unit of processing and represent a coarse-grained task executed on one of the supported
backends. Currently, Lemonade supports ETL and some machine learning operations, as listed in Table 2.

Operation                   Purpose
Add Columns                 Adds columns from one data source to another
Aggregation                 Performs aggregation of data grouped by a set of fields
Apply math                  Apply math
Classification model        Trains and applies a classification model
Clean missing               Cleans or replaces missing values from fields
Clustering model            Trains and applies a clustering model
Comment                     Comment
Correlation                 Identifies correlations between records
Data reader                 Reads data from a data set
Data writer                 Writes a new data set
Filter (selection)          Filters data according to some criteria
Join                        Joins two data sets using a set of fields (keys)
K-Means Clustering          Uses the K-Means algorithm for clustering
Linear regression           Applies a linear regression algorithm
Logistic regression         Performs logistic regression
Naive Bayes Classifier      Uses a Naive Bayes Classifier
Outlier detection           Performs outlier detection
Projection/Select columns   Selects a subset of the fields from a data set
Publish as a visualization  Publishes result as a visualization
Publish as web service      Publishes a workflow as a web service
Sample                      Generates a sample of data
Score model                 Scores a machine learning model
Set intersection            Performs set intersection
Sort                        Sorts data from a data set according to a set of fields and directions
Split                       Splits a data set into 2 different data sets using weights
SVM Classification          Uses an SVM Classifier
Time series                 Time series
Topic discovery             Performs topic discovery in text
Transformation              Performs a data transformation
Union/set union             Performs set union

Table 2 - Lemonade supported operations

New operations can be implemented if the underlying processing framework supports them.

The second component is called Tahiti and is responsible for keeping all operations’ metadata needed to run
the workflows. Metadata include operation name, description, parameters and ports. Ports are
communication points that have a direction (input or output) and a multiplicity (how many connections are
supported), and that “implement” interfaces in order to guarantee compatibility between operations. For example, if
an operation has only one output port that implements an interface “Algorithm”, it can only connect to an
input port that implements the same interface; it is not possible to connect it to an operation with an input
port that only implements the interface “Data”.
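The port compatibility rule can be sketched as follows. This is a minimal illustration in Python and not Tahiti's actual data model: the `Port` class, its field names and the function `can_connect` are hypothetical, and multiplicity is left out for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Port:
    name: str
    direction: str         # "input" or "output"
    interfaces: frozenset  # names of the interfaces the port implements

def can_connect(source: Port, target: Port) -> bool:
    """An output port may feed an input port only if they share an interface."""
    if source.direction != "output" or target.direction != "input":
        return False
    return bool(source.interfaces & target.interfaces)

model_out = Port("model", "output", frozenset({"Algorithm"}))
algorithm_in = Port("algorithm", "input", frozenset({"Algorithm"}))
data_in = Port("input data", "input", frozenset({"Data"}))

print(can_connect(model_out, algorithm_in))  # compatible interfaces
print(can_connect(model_out, data_in))       # no shared interface
```

A user interface could run a check of this kind while the user drags a connection, and refuse to draw the edge when it fails.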

Each operation has a set of parameters grouped as forms. Forms are organized in 3 classes: execution
parameters, AAA parameters and QoS parameters. Execution parameters allow users to configure the algorithms’
run-time arguments and behaviour. AAA parameters are related to security and privacy aspects and will be
aligned with WP6 guidelines. QoS parameters define the infrastructure requirements to execute the workflow
and are related to WP3.


The third component, Limonero, is similar to Tahiti, but instead of keeping metadata about operations, it
keeps metadata information about data sources. Data sources can be input to workflows and also can be
created by them as output. Data source metadata includes:

● Location: where the data are located and in which storage technology (for instance, HDFS).
● Data format and structure: for example, whether the data are in JSON format, which columns exist and
their data types, whether a given column is optional, and whether it is a feature or a label.
● Access restrictions: ownership of data sets, authorization and privacy concerns.
● Statistics about the data: number of records, size in MB, column-specific information such as the total of
missing records, min/max/average/median values, decile distribution, etc.
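The column-specific statistics listed above could be computed along these lines. This is a minimal sketch using only the Python standard library; the function name `column_statistics` and the use of `None` to mark a missing record are assumptions for the example, not Limonero's implementation.

```python
import statistics

def column_statistics(values):
    """Summarize one numeric column; None marks a missing record."""
    present = [v for v in values if v is not None]
    return {
        "records": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "average": statistics.mean(present),
        "median": statistics.median(present),
        # Nine cut points dividing the data into ten equal-sized groups.
        "deciles": statistics.quantiles(present, n=10),
    }

column = [1, 2, None, 4, 5, 6, 7, 8, 9, 10, 11]
for key, value in column_statistics(column).items():
    print(key, value)
```

For data sets that do not fit in memory, the same figures would have to be obtained with the distributed backend rather than with in-process lists.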

Metadata are used by the web interface to enable or disable data visualisations and operations, according to
data/visualisation and data/operation compatibility. For example, a pie chart requires at least 2 fields,
one for the label and another for the value, and the value must be numeric. If the data set attributes do not match
the visualisation requirements, the visualisation will not be available. As another example, a classification
operation is disabled in the interface if the input data set does not have a column specified as a label,
since otherwise the operation would not be able to learn how to classify the data.
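The two compatibility rules just described can be expressed as simple predicates over column metadata. This is a minimal sketch: the metadata layout (a list of dictionaries with `name`, `type` and `role` keys) and the function names are hypothetical and only serve to make the rules concrete.

```python
NUMERIC_TYPES = {"integer", "long", "float", "double"}

def pie_chart_available(columns):
    """A pie chart needs a numeric value column plus another column for the label."""
    has_numeric = any(col["type"] in NUMERIC_TYPES for col in columns)
    return has_numeric and len(columns) >= 2

def classification_available(columns):
    """A classifier can only be trained if some column is marked as the label."""
    return any(col.get("role") == "label" for col in columns)

sales = [
    {"name": "region", "type": "string", "role": "label"},
    {"name": "revenue", "type": "float", "role": "feature"},
]
notes = [{"name": "text", "type": "string", "role": "feature"}]

for name, columns in (("sales", sales), ("notes", notes)):
    print(name, pie_chart_available(columns), classification_available(columns))
```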

Metadata enable Lemonade to load data in optimized formats. Instead of having to parse CSV or JSON files
into records, Lemonade can load data in binary formats, such as Parquet[R02].

Finally, the last component, Juicer, has four main responsibilities:


1. Receive a workflow specification in JSON format from Citron and convert it into executable code.
2. Execute the generated code, controlling the execution flow.
3. Report execution status to the user interface (Citron).
4. Interact with Limonero API in order to create new intermediate data sets. Such data sets cannot be
used as input to other workflows, except if explicitly specified. They are used to enable Citron to
show intermediate processed data to the user.
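The first responsibility, turning a JSON workflow into executable code, can be sketched as follows. The workflow layout (`tasks` and `flows`), the operation names and the code templates are hypothetical and do not reproduce Juicer's actual format; the generated lines use PySpark DataFrame calls (`spark.read.parquet`, `filter`, `write.parquet`) and are only produced as text here, not executed. The sketch requires Python 3.9 or later for `graphlib`.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical code templates, one per supported operation.
TEMPLATES = {
    "data-reader": "{out} = spark.read.parquet({path!r})",
    "filter": "{out} = {inp}.filter({condition!r})",
    "data-writer": "{inp}.write.parquet({path!r})",
}

def generate_code(workflow_json):
    """Emit one line of code per task, ordered so that inputs come first."""
    workflow = json.loads(workflow_json)
    tasks = {task["id"]: task for task in workflow["tasks"]}
    predecessors = {task_id: set() for task_id in tasks}
    for flow in workflow["flows"]:
        predecessors[flow["target"]].add(flow["source"])
    lines = []
    for task_id in TopologicalSorter(predecessors).static_order():
        task = tasks[task_id]
        upstream = next(iter(predecessors[task_id]), None)  # single input assumed
        lines.append(TEMPLATES[task["operation"]].format(
            out=f"df_{task_id}", inp=f"df_{upstream}", **task.get("params", {})))
    return "\n".join(lines)

workflow = json.dumps({
    "tasks": [
        {"id": "t3", "operation": "data-writer", "params": {"path": "/data/out.parquet"}},
        {"id": "t1", "operation": "data-reader", "params": {"path": "/data/in.parquet"}},
        {"id": "t2", "operation": "filter", "params": {"condition": "age > 30"}},
    ],
    "flows": [{"source": "t1", "target": "t2"}, {"source": "t2", "target": "t3"}],
})
print(generate_code(workflow))
```

Ordering the tasks topologically means the tasks may appear in any order in the JSON file, as long as the flows between them form an acyclic graph.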

Under the hood, Lemonade will generate code targeting a distributed processing platform, such as COMPSs
or Spark. The current version supports only Spark, and the generated code is executed in batch mode. Future
versions may implement support for interactive execution. This kind of execution has advantages because
keeping the Spark context loaded avoids the overhead of starting the processing environment and loading
data for each step. This approach (keeping the context) is used in many implementations of data analytics
notebooks, such as Jupyter, Cloudera Hue and Databricks notebook.

4.2 Application lifecycle


In this section, a sequence diagram of the interactions between components of the software architecture is
presented. In Figure 8 a developer interacts with the Lemonade interface to compose an application using
existing modules stored in the internal operation metadata database. As a result, the code of a COMPSs
application is generated by Lemonade; the COMPSs runtime takes care of the deployment of the application
on the BIGSEA QoS infrastructure and of the calls to the required Analytics Services.


Figure 8 - Application lifecycle diagram

4.3 Security aspects


The deliverable D6.1 (Requirements and Coordinated Security Strategy) defines the security scope of the
project and proposes a global security solution to address the security objectives of the project: the
provisioning of Authentication, Authorization and Accounting (AAA), the assurance of the security properties
of the cloud and Big Data services, and the protection of data privacy.

Figure 9 presents a high-level view of the WP5 components together with the specific concerns that should
be handled by WP6. Details on the represented roles, the colour code used and the corresponding components
can be found in D6.1, while details on Spark, COMPSs and Lemonade can be found in Section 4.1.1.


Figure 9 - Relation between WP5 and WP6 (adapted from D6.1)

According to the analysis performed in D6.1, the security concerns related to the architecture
proposed in WP5, together with WP4 and WP3, can be summarized in 5 main points, identified in red and
green in Figure 9, which are aligned with the security concerns of the project. WP5 will propose
programming abstractions that build on the underlying layers, and therefore the security concerns of
WP5 are integrally connected with those of these layers.

Regarding security, WP5 requires services for authentication, authorization and accounting for both the
infrastructure and the applications. It is also important to assure the privacy and access control of the
data for the operators that work with WP4. Another concern is the security of the API provided for
application development, which should neither allow developers to perform tasks that interfere with other
running applications nor allow malicious users to exploit its inputs to subvert the
functionalities of the applications.

To address these objectives, requirements were defined in D6.1. The key requirements are summarized
below, while details can be found in Section 4 of D6.1.

WP6 AAA corresponds to AAA provisioning. It will be necessary to develop two distinct AAA blocks
with distinct functionality, as follows:
1. EUBra-BIGSEA Infrastructure AAA, which provides the AAA functionalities required for managing the
EUBra-BIGSEA framework (access to cloud resources), from both the Infrastructure and Platform
perspectives (focusing on infrastructure managers and application developers/providers). The scope
of this service is the whole EUBra-BIGSEA framework, as it matches the nature of the services focused
on cloud infrastructure management.
2. EUBra-BIGSEA Applications AAAaaS, which provides AAA-as-a-Service for applications developed
and hosted in the EUBra-BIGSEA framework that need services for authenticating and
authorizing their end users. The scope of an AAAaaS instance is limited to the application making use of
it, and AAAaaS directly matches the nature of the set of services focused on end-users and
enterprise/consumer applications.

WP6 Assurances corresponds to the security assurance requirements R6.3.1 and R6.3.2 (see D6.1), which
can be detailed as follows:


1. The End Users of the applications that will run inside the EUBra-BIGSEA infrastructure should not be
able to subvert, through the inputs of such applications, the functionalities implemented. Therefore,
it is necessary to perform a detailed assessment of the APIs to be used in the development of
applications.
2. The Data App Developers should not be able to develop applications that interfere with the
remaining applications running inside the framework. For this, besides the assessment of the APIs
made available, it will also be necessary to propose a set of recommendations for development best
practices.

WP6 Privacy corresponds to the concerns regarding the security of the data, i.e. the protection of the
confidentiality, privacy and anonymity of the data. The WP5 programming abstractions need to include
ways for the programmer to express the privacy characteristics of the developed algorithms.
These abstractions will be enforced in the underlying layers, through mechanisms to be implemented in WP4.
Lemonade (see 4.1.3.1) is the best candidate to let users express these preferences. In practice,
a set of operators will be developed to extend the Lemonade syntax with information that
characterizes the algorithms according to the way their internals influence the data being
processed in terms of privacy.


4.4 Technology Analysis


This section analyses how the previously described components satisfy the requirements to implement the
use cases on the QoS BIGSEA platform.

Requirement: RAL1. Support to QoS batch jobs
Description: Support to QoS bounded jobs
COMPSs: COMPSs supports different backends through connectors that implement specific
functionalities; QoS provided by the user can be translated to resources constraints
Spark: QoS constraints will be supported through command line options
Lemonade: Provided by underlying technology. Will provide mechanisms to annotate QoS
requirements for applications and their components.

Requirement: RAL2. Integration with Mesos
Description: The programming runtime must execute the tasks on the Mesos middleware and
available frameworks such as YARN and Myriad.
COMPSs: COMPSs can easily be extended through connectors
Spark: Yes
Lemonade: Provided by underlying technology

Requirement: RAL3. Support to reactive elasticity
Description: The runtime must adapt its resources according to the changes in the QoS
infrastructure
COMPSs: COMPSs adapts the usage of resources according to the computational load.
Reconfiguration of the pool of resources will be evaluated in the project
Spark: Spark provides basic mechanisms for adding and removing worker nodes. WP3 will
implement solutions and advanced policies for runtime cluster reconfiguration
Lemonade: Provided by underlying technology

Requirement: RAL4. Support to HDFS data locations
Description: The applications should be able to reference data in HDFS storage
COMPSs: No. Support to HDFS is an objective of the project
Spark: Yes
Lemonade: Provided by underlying technology

Requirement: RAL5. Definition of QoS constraints in the programming interface
Description: The programming interface must provide ways to express QoS that will be
translated to resources constraints by the runtime
COMPSs: It is part of the WP5 activities
Spark: No. It is part of WP5 objectives
Lemonade: Yes

Requirement: RAL6. Support to privacy in the definition of the algorithms
Description: Include ways for the programmer to express the privacy characteristics of the
developed algorithms
COMPSs: No. It is part of WP5 objectives
Spark: No. It is part of WP5 objectives
Lemonade: Yes

Requirement: RAL7. Definition of Big Data workflows
Description: Support the execution of big data workflows, where the input data and the
products can be large (in the order of tens of GBs).
COMPSs: Yes. Workflows can be programmed in COMPSs without any API. Data dependencies are
automatically discovered and managed by the runtime
Spark: Yes
Lemonade: Yes

Table 3 - Requirements and technologies

5 CONCLUSIONS
This document has provided the description of the programming abstraction layer of the EUBra-BIGSEA
Platform. The objective of this document is to complement the deliverables D3.1, D4.1 and D6.1 that focus
on the definition of the QoS infrastructure, the Big Data ecosystem and the security strategy. Here the aim is
to identify the technologies that can be adopted by the users of the platform to transparently implement Big
Data applications.
The analysis of the user requirements has led to the definition of a set of specifications for the
implementation of the components. The basis of the abstraction layer is the COMPSs framework that
provides a programming model to define applications whose execution, where possible, is automatically
parallelized. COMPSs will be extended to be interoperable with the Mesos middleware thus benefitting from
the capability of automatically increasing resources based on the QoS driven mechanisms. COMPSs will be
used to implement new workflows on top of the data analytics layer that will be used as building blocks for
the end user applications. In order to complement COMPSs and to support existing ML algorithms, the Spark
programming model will be used.
The Lemonade tool will be adopted as graphical interface to compose applications and to generate code for
COMPSs and Spark.

6 REFERENCES
[R01] Badia RM, Conejero J, Diaz C, Ejarque J, Lezzi D, Lordan F, Ramon-Cortes C, Sirvent R. COMP Superscalar,
an interoperable programming framework. SoftwareX. 2015;3-4:32-36. Available from:
http://www.sciencedirect.com/science/article/pii/S2352711015000151.
[R02] Apache Parquet. Available from: https://parquet.apache.org/
[R03] S. Fiore, C. Palazzo, A. D’Anca, I. T. Foster, D. N. Williams, G. Aloisio, “A big data analytics framework
for scientific data management”, IEEE BigData Conference 2013: 1-8.
[R04] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott
Shenker, and Ion Stoica. 2011. Mesos: a platform for fine-grained resource sharing in the data center. In
Proceedings of the 8th USENIX conference on Networked systems design and implementation (NSDI'11).
USENIX Association, Berkeley, CA, USA, 295-308.


7 GLOSSARY
Acronym Explanation Usage Scope

AAA Authentication, Authorization and Accounting Security

API Application Programming Interface Interfacing

CSV Comma Separated Value Data type

EC3 Elastic Compute Cluster in the Cloud Elasticity

ETL Extraction, Transformation and Load Data Integration

HDFS Apache Hadoop Distributed File System Storage

JSON JavaScript Object Notation Data Type

JVM Java Virtual Machine Processing

MESOS A resource management platform that abstracts CPU, memory, storage, and other compute resources away from machines Resource Management

OLAP Online Analytical Processing Processing

QoS Quality of Service Scheduler
