Dissemination Level
X PU: Public
PP: Restricted to other programme participants (including the Commission)
RE: Restricted to a group specified by the consortium (including the Commission)
CO: Confidential, only for members of the consortium (including the Commission)
Abstract: Europe - Brazil Collaboration of BIG Data Scientific Research through Cloud-Centric
Applications (EUBra-BIGSEA) is a medium-scale research project funded by the European
Commission under the Cooperation Programme, and the Ministry of Science and Technology (MCT)
of Brazil in the frame of the third European-Brazilian coordinated call. The document has been
produced with the co-funding of the European Commission and the MCT.
The purpose of this report is to present the design of the software architecture of the EUBra-BIGSEA platform.
The document describes the overall functioning of, and the interactions between, the platform components,
and serves as a development roadmap for the developers of the project.
Copyright notice: This work is licensed under the Creative Commons CC-BY 4.0 license. To view a copy of this license, visit
https://creativecommons.org/licenses/by/4.0.
Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does not necessarily represent the
views expressed by the European Commission or its services.
While the information contained in the document is believed to be accurate, the author(s) or any other participant in the EUBra-
BIGSEA Consortium make no warranty of any kind with regard to this material including, but not limited to the implied warranties of
merchantability and fitness for a particular purpose.
Neither the EUBra-BIGSEA Consortium nor any of its members, their officers, employees or agents shall be responsible or liable in
negligence or otherwise howsoever in respect of any inaccuracy or omission herein.
Without derogating from the generality of the foregoing neither the EUBra-BIGSEA Consortium nor any of its members, their officers,
employees or agents shall be liable for any direct or indirect or consequential loss or damage caused by or arising from any
information advice or inaccuracy or omission herein.
TABLE OF CONTENTS
EXECUTIVE SUMMARY
1 Introduction
1.1 Scope of the Document
1.2 Target Audience
1.3 Structure
2 EUBra-BIGSEA Architectural Overview
3 Programming Abstractions Layer Requirements
3.1 Use Case and technical requirements
3.2 Types of users
4 Programming Abstractions Layer Design
4.1 Architecture Design
4.1.1 Programming frameworks
4.1.2 Support to QoS specification
4.1.3 Code generation and composition
4.2 Application lifecycle
4.3 Security aspects
4.4 Technology Analysis
5 Conclusions
6 References
7 GLOSSARY
LIST OF TABLES
Table 1 - QoS constraints specification
Table 2 - Lemonade supported operations
Table 3 - Requirements and technologies
LIST OF FIGURES
Figure 1 - High-level view of the EUBra-BIGSEA architecture
Figure 2 - Detailed view of the software architecture
Figure 3 - Architecture of the Abstraction Layer
Figure 4 - Detailed view of WP5 components
Figure 5 - JSON specification of the QoS
Figure 6 - Lemonade components and supported frameworks (in progress)
Figure 7 - Lemonade Citron user interface
Figure 8 - Application lifecycle diagram
Figure 9 - Relation between WP5 and WP6 (adapted from D6.1)
EXECUTIVE SUMMARY
The EUBra-BIGSEA project aims at developing a set of cloud services empowering Big Data analytics to ease
the development of massive data processing applications. EUBra-BIGSEA will develop models, predictive and
reactive cloud infrastructure QoS techniques, efficient and scalable Big Data operators and a privacy and
quality analysis framework, exposed to several programming environments. EUBra-BIGSEA aims at covering
general requirements of multiple application areas, although it will be showcased in the treatment of massive
connected society information, and particularly in route recommendation.
The abstractions layer provides functionalities that allow developers to transparently build applications
composed of data operators mapped to different Big Data frameworks. In addition, the integration with the
infrastructure layer makes applications scale effectively across the infrastructure, providing developers
with appropriate abstractions to specify QoS constraints and a unified programming interface that
includes computing, data analytics, and security APIs.
The programming layer will be based on COMPSs and Spark, which provide complementary capabilities to
satisfy the use case requirements. Spark is an open source data processing framework. COMPSs is a
programming framework that aims to facilitate the parallelisation of existing applications through a simple
programming model based on sequential development. The COMPSs runtime is in charge of exploiting the
inherent concurrency of the code, automatically detecting and enforcing the data dependencies between
tasks and spawning these tasks to the available resources; it also provides scalability and elasticity features
allowing the dynamic provision of resources. The COMPSs programming interface will be enhanced in the
project through the integration of QoS constraints and security hints; the COMPSs runtime will be extended
with support for Mesos [R04] in order to benefit from the proactive elasticity of the EUBra-BIGSEA
infrastructure.
The final goal is to have an integrated layer that provides building blocks, developed with COMPSs and Spark,
that can be imported into user applications.
1 INTRODUCTION
1.1 Scope of the Document
This document summarizes the architectural design to be implemented in the course of the EUBra-BIGSEA
project. We highlight the interrelations among the different work packages involved in the decisions adopted,
and also outline the reasoning behind the choices made.
This document is intended for general reference but mostly focuses on the design of the Programming Model
Abstraction Layer and its integration with the other layers for applications (WP7), Big Data Ecosystem (WP4),
QoS infrastructure (WP3) and security (WP6) whose architectures have been detailed in D7.1, D4.1, D3.1 and
D6.1 respectively.
1.3 Structure
The rest of the document is structured as follows: Section 2 contains a high-level summary of the BIGSEA
architecture. Section 3 analyses the use case requirements that are relevant to the programming
frameworks and proposes a list of requirements specific to the design of the abstractions layer. Section 4
addresses the main objective of the document, the design of the software architecture, with an analysis of
the components selected to implement the layer. Section 5 concludes the document, providing a timeline for
the remaining implementation activities.
Figure 2 highlights the separation between the infrastructure components for the management of the
resources (described in D3.1) and the software components that are targeted in this document. In particular,
WP5 focuses on how applications are composed using the abstractions provided by the programming models
and how those applications are deployed benefitting from the high-availability and reliability features of the
infrastructure.
As shown in Figure 2, resources of the infrastructure are managed by a Cloud Management Framework
(OpenNebula or OpenStack) that deploys or undeploys Virtual Machines (VMs) when requested. These VMs
build up the Mesos cluster, providing the agents on demand. The cluster is configured by the Infrastructure
Manager (IM) and the elasticity at the level of the resources is managed through Elastic Computing Clusters
in the Cloud (EC3), which monitors the Mesos cluster to detect the need for resources and the opportunity
to power them off. The Mesos cluster is accessed through a scheduler.
On the side of the logical components, users write their applications using the Lemonade IDE, which
transforms them into Spark and COMPSs code. Users can also write the programs directly in Spark or COMPSs,
or port existing applications written in Java or Python. Proactive policies give an estimation of the resources
needed by the execution, which are readjusted by the Monitoring system as it checks QoS compliance.
This section provides a summary of the requirements, described in D7.1, relevant for the definition of the
abstraction layer.
● RE.1. Batch jobs. The infrastructure must support unrestricted batch execution of data analytic jobs,
unrestricted in the sense that the jobs have no QoS bounds and latency is not a key issue. Single jobs
will be those that normally could fit in memory.
● RE.2. Bag of tasks. The infrastructure must support unrestricted batch execution of a bag of data
analytics jobs. A Bag of jobs is a model that fits the High-Throughput Computing (HTC) paradigm.
● RE.3. QoS Batch jobs. The infrastructure must execute batch jobs with associated QoS. Executions
should be characterized in terms of time and could also require a bounded budget, expressed as the
maximum resource time to be spent. The scheduler should adjust the resources to meet the expected QoS.
● RE.4. Deadline-based jobs. The execution service of the infrastructure should support deadline-based
execution requests, which will have to finish at a given time and are characterized in terms of
resources and expected execution time. If the deadline is not feasible at submission time, the service
will notify the user and run the job as soon as resources are available. Otherwise, it will schedule the
execution for the future: the closer to the deadline the execution takes place, the more up-to-date the
data will be.
● RE.5. Self-adapting elasticity. The algorithms will be described in a way that the infrastructure can
dedicate more resources to fulfill the QoS. The infrastructure must be reactive in both allocated
computing resources and allocated memory. The infrastructure must be self-adapting in order to
accommodate the workload peaks that can appear in HTC applications.
● RE.6. Short jobs. The infrastructure must support the execution of short-jobs, finishing in interactive
time, which could arrive massively (hundreds per minute).
● RE.7. Workflows management. The infrastructure must support the execution of big data workflows,
where the input data and the products can be large (in the order of tens of GBs).
● RA.1. Authentication. The infrastructure must support end-user authentication for access control
and accounting purposes.
● RA.2. Authorization. The infrastructure must support end-user authorization for accessing the data
and the applications deployed with the infrastructure.
● R1.5. Data Access API. An API must be exposed to deal with the storage resources: authenticate,
populate data, retrieve and filter data, and update data. The same operations apply to metadata. Data
access should have a short latency (near real-time access).
The requirements analysis has already been performed in other technical WPs from different points of view,
leading to technical choices related to the definition of the QoS infrastructure, the Big Data ecosystem and
the security strategy. The findings of those activities are relevant for the definition and implementation of
the programming abstraction layer, and this document analyses the technological choices described in the
deliverables, focusing on how the WP5 components have to be selected and extended to be fully integrated
in the BIGSEA platform.
In particular, D3.1 identifies the services of the QoS cloud architecture for the Big Data analytics platform
developed in EUBra-BIGSEA. Mesos has been selected for the management of distributed resources, whose
availability will be monitored and elastically managed according to the QoS parameters defined in the
applications.
D4.1 describes the big data systems integrated to address multifaceted use cases requirements, including
fast data analysis over continuous streams from external data sources, general purpose data mining and
machine learning tools as well as OLAP-based systems for multidimensional data analysis. A relevant
outcome of the document is the proposal of a data access API (to address R1.5) that can be used directly by
the applications or through the tools developed in WP5.
D6.1 addresses security requirements related to the application development tools. In particular, the main
concern is the possibility to define privacy annotations in the programming model interface and to support
authentication, authorization and accounting mechanisms.
Based on this, the Abstraction Layer (AL) has the following requirements:
RAL1. Support for QoS batch jobs. The programming framework runtime must support the execution of
batch jobs with possible QoS constraints such as time and number of resources.
RAL2. Integration with Mesos. The programming framework runtime has to be able to schedule tasks to the
Mesos middleware.
RAL3. Support for reactive elasticity. The runtime must be aware of the changes in the QoS infrastructure in
order to adapt the scheduling policies.
RAL4. Support for HDFS data locations. The applications should be able to read and write data in HDFS
backends. This may require extending the runtime data manager.
RAL5. Definition of QoS constraints in the programming interface. A central topic of the project: the ability
to define QoS parameters both at application definition and at execution time, depending on the type of
metric.
RAL6. Support for data privacy in the definition of the algorithms. Include ways for the programmer to
express the privacy characteristics of the developed algorithms.
RAL7. Definition of Big Data workflows. Directly maps to RE.7; as explained later, the idea of the
Abstractions Layer is to provide building blocks for the use cases that can be composed as workflows.
● Developers use the Programming Abstraction Layer to develop end-user applications, to test the
underlying system and to perform the execution of ETL, data mining and analytics processing. Developers
may also define restrictions regarding QoS and AAA.
● Domain experts use higher levels of abstraction (e.g. workflows) to compose processing tasks,
generate new machine learning models and assess AAA policies.
● Students and practitioners learning about distributed algorithm processing, machine learning and
data science.
The programming frameworks enable the implementation of the use cases by providing modules and libraries
(building blocks in the figure) that abstract the big data technologies used to access and process the data
sources, and optimize their execution on the QoS infrastructure.
Abstractions for specifying QoS constraints (e.g., jobs execution deadlines, minimum throughput rate to the
storage sub-system) will be integrated with the programming model and will be translated into resource
management policies (see 4.1.2).
There is a strong integration with the tools provided in WP3 in order to make use of the execution and
deployment services and to adapt the runtimes to the changes in the available resources according to the
QoS policies.
According to the design of the Big Data Ecosystem described in D4.1, a minimum set of modules to be
implemented should provide support to the three use cases and related scenarios:
• Data Access and loading: Entity matching, data quality analysis, data ingestion. Applications that
periodically read sources of data and execute a potentially parallel algorithm that produces a big
output of data plus some indicators. Here QoS is critical to ensure that the data is obtained on time.
Other applications are basically related to real-time data inspection, which will need a scalable
persistent service that supports client requests. For this use case, data-parallel programming models
that support streaming will be adopted.
• Descriptive Models: Long-lasting execution tasks that run in parallel and train models with the new
data, running periodically with a deadline and producing models whose output may be classified
as more sensitive to privacy protection than the input. In this case, different options for clustering
algorithms should be provided by the abstraction layer.
• Predictive models: A continuously running service with scalability capabilities, running jobs
that produce the result of the prediction. The scenarios included in this use case are more
computing intensive than data intensive. Task-parallel approaches are well suited to
implement this use case, even though user stories include runs of near-real-time predictions.
In order to ease the composition of the programmed modules, a tool for the generation of code will be
introduced and extended to support the programming frameworks.
The following sections provide a detailed description of the components that implement each functionality.
Identification: COMPSs
Website: http://compss.bsc.es
Purpose: Programming model which aims to ease the development of applications for
distributed infrastructures, such as Clusters, Grids and Clouds. COMP superscalar
also features a runtime system that exploits the inherent parallelism of
applications at execution time.
The COMPSs runtime is implemented using the Java language, so the most natural
programming language for new COMPSs applications is Java. Nevertheless, to
simplify the porting of existing applications written in other languages, COMPSs
has support also for C/C++ and Python applications.
A central concept in COMPSs is that of a task, which represents the model's unit
of parallelism. A task is a method or a service called from the application code that
is intended to be spawned asynchronously and possibly run in parallel with other
tasks on a set of resources, instead of locally and sequentially. In the model, the
user is mainly responsible for identifying and selecting which methods/services
she wants to be tasks.
When the sequential code is executed, the COMPSs runtime intercepts the
method invocations and replaces them with calls to the runtime that create new
asynchronous tasks. Accesses to task data within the main code are also
instrumented, so that the runtime can fetch, if necessary, the correct data values
from the remote resource where the task was executed (synchronization).
This task selection is done by means of an annotated interface where all the
methods that have to be considered as tasks are defined with annotations
describing their data accesses and constraints on the execution of resources. At
execution time this information is used by the runtime to build a dependency
graph and orchestrate the tasks on the available resources.
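The effect of this model can be approximated in plain Python with futures: task calls return immediately, run asynchronously, and the main code only blocks when it actually needs a value. The sketch below is an analogy built on the standard library, not the COMPSs API; in COMPSs the submission and synchronization steps are transparent, driven by the annotated interface.

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def increment(x):
    return x + 1

# "Task" invocations: submitted asynchronously, returning futures
# instead of values (what the COMPSs runtime does transparently).
futures = [executor.submit(increment, i) for i in range(4)]

# "Synchronization": accessing the data forces the main code to wait,
# analogous to the runtime fetching results from remote resources.
results = [f.result() for f in futures]
print(results)  # → [1, 2, 3, 4]
```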
Dependencies: COMPSs dependencies are solved at installation time.
Interfaces and language support: COMPSs does not provide any specific API for the development of
applications. Supported languages are Java, Python and C/C++.
Data: Primitive types (integer, long, float, boolean), strings, objects (instances of user-defined
classes, dictionaries, lists, tuples, complex numbers) and files are supported in the definition
of a task.
High Level Architecture: In the latest version, 2.0.0, Spark supports 3 different views of data: RDD
(Resilient Distributed Dataset), DataFrames and DataSets. All structures are stored in memory if
possible and may be written to disk otherwise.
RDDs have been in Spark since version 1.0. The API provides a set of transformation methods,
such as map(), filter() and reduce(), for processing data. Each transformation creates a new
RDD representing the transformed data. Operations on RDDs are executed in a lazy
fashion; transformations are not performed until an action method, for example collect()
or count(), is called.
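The transformation/action distinction can be illustrated with Python's built-in map() and filter(), which are lazy in the same way. This is a plain-Python analogy, not PySpark code:

```python
data = range(1, 6)

# "Transformations": lazily composed, nothing is computed yet.
squared = map(lambda x: x * x, data)
large = filter(lambda x: x > 5, squared)

# "Action": forces evaluation of the whole pipeline, like collect().
result = list(large)
print(result)  # → [9, 16, 25]
```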
The DataFrame API was introduced in version 1.3.0 to improve Spark performance. A new
concept of schema was introduced to describe the data, enabling much more expressive
code to be built using efficient network communication and off-heap JVM memory
optimization. Catalyst, Spark's query optimizer, was built on top of the
DataFrame API and now allows users to write SQL 2003 compatible queries or use a
fluent-style API to process data.
The last API, DataSet, was introduced in Spark 1.6.0 and aims to provide the best of the
RDD and DataFrame worlds: the familiar object-oriented programming style of RDDs,
combined with the Catalyst optimizations available to DataFrames.
Under the hood, Spark is an in-memory engine for large-scale data processing. Apache
Spark is a fast and general-purpose cluster computing system with an optimized engine
that supports general execution graphs.
A Spark program is controlled by a driver program started by the user, which interacts
with a cluster manager to start worker nodes where data processing tasks are executed.
In a standalone cluster deployment, the cluster manager is a Spark master instance. When
using Mesos, the Mesos master replaces the Spark master as the cluster manager. Similarly,
when using YARN, the YARN scheduler takes that role.
Spark can be used for batch jobs through spark-submit, which can be used to launch
applications remotely. There is also spark-shell, a Scala interactive console, and PySpark,
a Python shell. This way, one can write data analytic operations and execute them
interactively on a remote system.
Dependencies: A bare metal, YARN or Mesos cluster. The use of an HDFS file server architecture is
optional.
Interfaces and languages supported: High-level APIs in Java, Scala, Python and R.
Security support: Spark currently supports authentication via a shared secret. Spark supports SSL for
the Akka and HTTP (broadcast and file server) protocols. SASL encryption is supported for the
block transfer service. Encryption is not yet supported for the WebUI.
Encryption is also not yet supported for data stored by Spark in temporary local storage, such
as shuffle files, cached data, and other application files; if encrypting this data is desired, it has
to be handled outside Spark, for instance by placing the local directories on an encrypted file system.
Data: Data stored in file systems (e.g., local, NFS, HDFS). There are many other connectors that
allow Spark to read/write data from/to other data sources and storage (e.g. cloud blobs like
Amazon S3 and Microsoft xxx).
Potential usage within BIGSEA: Spark is one of the supported programming models in EUBra-BIGSEA.
It provides a library, called ML, that supports the execution of different machine learning
techniques (e.g. linear regression, classification, clustering) in a distributed way, by using the
programming abstractions and infrastructure of Spark.
Metric value: the value and inequality that must be fulfilled, one for each target metric (mandatory).
Examples: <=10 (for CPU or container number); <=10 GB (for memory); <=10 min (application
execution time); >=4 completions/min (application throughput).
The set of metrics that will be initially considered are the number of CPUs/containers that support the
application execution, the total memory allocated in the infrastructure, application execution time and
throughput.
Constraints can predicate on multiple metrics (e.g., short jobs are characterised by a deadline but
also by a minimum throughput). Constraints will be specified as JSON files and stored in the Mesos master.
The information is coded within the application JSON description, as shown in the next figure.
{ "type": "CMD",
  "name": "my_job_name",
  "periodic": "R24P60",
  "QoS" : [
    { "metric": "deadline",
      "op": "==",
      "value": "2016-06-10T17:22:00Z+2",
      "priority": 0 },
    { "metric": "cpu",
      "op": "<=",
      "value": 10,
      "priority": 2 },
    { "metric": "memory",
      "op": "<=",
      "value": "10G",
      "priority": 1 },
    { "metric": "application_execution_time",
      "op": "<=",
      "value": "10M",
      "priority": 1 },
    { "metric": "application_throughput",
      "op": ">=",
      "value": "24d",
      "priority": 3 }
  ],
  "command" : "mycommand"
}
Figure 5 - JSON specification of the QoS
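A consumer of this specification (for instance, a scheduler front-end) could parse the JSON and order the constraints by priority before negotiating them with the resource manager. The sketch below is illustrative only; the field names follow the figure, while the `sorted_constraints` helper is hypothetical:

```python
import json

spec = json.loads("""
{
  "type": "CMD",
  "name": "my_job_name",
  "QoS": [
    {"metric": "cpu", "op": "<=", "value": 10, "priority": 2},
    {"metric": "deadline", "op": "==", "value": "2016-06-10T17:22:00Z+2", "priority": 0},
    {"metric": "memory", "op": "<=", "value": "10G", "priority": 1}
  ],
  "command": "mycommand"
}
""")

def sorted_constraints(spec):
    """Return the QoS constraints ordered by priority (0 = most important)."""
    return sorted(spec["QoS"], key=lambda c: c["priority"])

for c in sorted_constraints(spec):
    print(f'{c["metric"]} {c["op"]} {c["value"]} (priority {c["priority"]})')
```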
QoS constraints will be specified through Lemonade (see Section 4.1.3.1). Moreover, since EUBra-BIGSEA
envisions the definition and runtime support of applications encompassing multiple runtime environments
(e.g., applications including part of the data analysis workflow implemented in Spark and part in COMPSs,
possibly accessing Ophidia operators), task T5.3 will develop solutions to optimally split a global application
constraint (i.e., a constraint predicating, e.g., on the whole application execution time) into local constraints
that have to be enacted on the underlying runtime environments. In this way the definition of WP3
proactive runtime management policies can be simplified by specifying sets of adaptation rules predicating
on metrics and actuating mechanisms provided by the individual runtime frameworks that support the
application execution.
According to EUBRA-BIGSEA DoW, this latter activity will start at M13.
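As a rough illustration of such a split, a global deadline can be apportioned to the per-framework stages in proportion to their estimated runtimes. This is only a sketch of one possible policy, not the T5.3 design; the function name, the stage names and the proportional heuristic are all assumptions:

```python
def split_deadline(total_seconds, estimated_stage_runtimes):
    """Split a global execution-time budget across sequential stages,
    proportionally to each stage's estimated runtime (assumed heuristic)."""
    total_estimate = sum(estimated_stage_runtimes.values())
    return {
        stage: total_seconds * estimate / total_estimate
        for stage, estimate in estimated_stage_runtimes.items()
    }

# A workflow with a Spark stage and a COMPSs stage and a 600 s global deadline.
local = split_deadline(600, {"spark_etl": 200, "compss_model": 100})
print(local)  # → {'spark_etl': 400.0, 'compss_model': 200.0}
```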
4.1.3.1 Lemonade
Lemonade (Live Environment for Mining Of Non-trivial Amount of Data Everywhere) is a web application tool
in which users can drag and drop operations and data sources to compose different ETL and machine learning
workflows. Lemonade targets users that do not want to learn a programming language or that need to
develop workflows using the existing toolset.
Regarding the spectrum of users, Lemonade is a good fit for users from areas such as mathematics,
statistics and business administration, as well as those learning about data science.
The first component, Citron (Figure 7), is a web-based user interface to create workflows. Users can choose
among a set of predefined operations which will compose the workflow by dragging and dropping them into
the design area. Input data is specified by choosing one or more data sets from the toolbox. Workflows are
stored in Citron’s relational database and, when a user triggers the execution of a workflow, a JSON file
describing it is generated and sent to Juicer for processing.
Only data sets accessible by the logged-in user are made available through the interface. Operations are
the smallest unit of processing and represent a coarse-granularity task executed on one of the supported
backends. Currently, Lemonade supports ETL and some machine learning operations, as listed in Table 2.
Operation Purpose
Add Columns Adds columns from one data source to another
Aggregation Performs aggregation of data grouped by a set of fields
Apply math Applies mathematical operations to field values
Classification model Trains and applies a classification model
Clean missing Cleans or replaces missing values from fields
Clustering model Trains and applies a clustering model
Comment Adds a descriptive comment to the workflow
Correlation Identifies correlations between records
Data reader Reads data from a data set
New operations can be implemented if the underlying processing framework supports them.
The second component is called Tahiti and is responsible for keeping all operations’ metadata needed to run
the workflows. Metadata include operation name, description, parameters and ports. Ports are
communication points that have direction (input and output), multiplicity (how many supported connections)
and should “implement” interfaces in order to guarantee compatibility between operations. For example, if
an operation has only one output port that implements an interface “Algorithm”, it can only connect to an
input port that implements the same interface; it is not possible to connect it to an operation with an output
port that only implements the interface “Data”.
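The port-compatibility rule described above can be sketched as a simple interface check. The class and field names here are illustrative, not Tahiti's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Port:
    direction: str                      # "input" or "output"
    interfaces: set = field(default_factory=set)

def compatible(output_port, input_port):
    """Two ports can be connected only if the output implements at least
    one interface that the input also implements (illustrative rule)."""
    return (output_port.direction == "output"
            and input_port.direction == "input"
            and bool(output_port.interfaces & input_port.interfaces))

algo_out = Port("output", {"Algorithm"})
algo_in = Port("input", {"Algorithm"})
data_in = Port("input", {"Data"})

print(compatible(algo_out, algo_in))  # → True
print(compatible(algo_out, data_in))  # → False
```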
Each operation has a set of parameters grouped as forms. Forms are organized in 3 classes: execution
parameters, AAA parameters and QoS parameters. Execution parameters allow users to configure algorithms'
run-time arguments and behaviour. AAA parameters are related to security and privacy aspects and will be
aligned with WP6 guidelines. QoS parameters define infrastructure requirements to execute the workflow
and are related to WP3.
The third component, Limonero, is similar to Tahiti, but instead of keeping metadata about operations, it
keeps metadata information about data sources. Data sources can be input to workflows and also can be
created by them as output. Data source metadata includes:
● Location: where data are located and in which storage technology (for instance, HDFS).
● Data format and structure: If the data are in JSON format, what are the columns and their data types,
if any given column is optional, if it is a feature or a label.
● Access restrictions: ownership of data sets, authorization and privacy concerns.
● Statistics about the data: number of records, size in MB, column-specific information such as total of
missing records, min/max/average/median values, deciles distribution, etc.
Metadata are used by the web interface to enable or disable data visualisations and operations, according to
data/visualisation and data/operation compatibility. For example, a pie chart requires at least 2 fields:
one for the label and another for the value, and the value must be numeric. If the data set attributes do not
match the visualisation requirements, the visualization will not be available. As another example, a
classification operation is disabled in the interface if the input data set does not have a column specified
as a label, since otherwise the operation would not be able to learn how to classify the data.
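This compatibility check can be sketched as a comparison between a visualisation's requirements and the column metadata kept by Limonero. All names and the exact type vocabulary below are hypothetical:

```python
def pie_chart_available(columns):
    """A pie chart needs at least one label-like column and one numeric
    column (illustrative requirement from the example above)."""
    has_label = any(c["type"] == "string" for c in columns)
    has_numeric = any(c["type"] in ("integer", "float") for c in columns)
    return has_label and has_numeric

trips = [{"name": "line", "type": "string"},
         {"name": "passengers", "type": "integer"}]
coords = [{"name": "lat", "type": "float"},
          {"name": "lon", "type": "float"}]

print(pie_chart_available(trips))   # → True
print(pie_chart_available(coords))  # → False
```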
Metadata also enable Lemonade to load data in optimized formats. Instead of having to parse CSV or JSON
files into records, Lemonade can load data in binary formats such as Parquet [R02].
Under the hood, Lemonade generates code targeting a distributed processing platform, such as COMPSs
or Spark. The current version supports only Spark, and the generated code is executed in batch mode. Future
versions may add support for interactive execution. This kind of execution has advantages because
keeping the Spark context loaded avoids the overhead of starting the processing environment and loading
data at each step. This approach (keeping the context loaded) is used in many implementations of data
analytics notebooks, such as Jupyter, Cloudera Hue and Databricks notebooks.
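The code-generation step can be pictured as a simple template expansion from workflow operations to target-platform code. This toy sketch is not Lemonade's actual generator; the operation names and templates are assumptions:

```python
# Toy code generator: maps workflow operations to PySpark snippets.
# Operation names and templates are illustrative only.
TEMPLATES = {
    "read-parquet": "df = spark.read.parquet('{path}')",
    "filter": "df = df.filter('{condition}')",
    "save-parquet": "df.write.parquet('{path}')",
}

def generate_spark_code(workflow):
    """Emit one line of PySpark per operation, in workflow order."""
    return "\n".join(
        TEMPLATES[op["operation"]].format(**op["params"]) for op in workflow
    )

workflow = [
    {"operation": "read-parquet", "params": {"path": "/data/in"}},
    {"operation": "filter", "params": {"condition": "duration > 0"}},
    {"operation": "save-parquet", "params": {"path": "/data/out"}},
]
code = generate_spark_code(workflow)
```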
Figure 8 presents a high-level view of the WP5 components together with the specific concerns that should
be handled by WP6. Details on the represented roles, the colour code used, and the corresponding components
can be found in D6.1, while details on Spark, COMPSs and Lemonade can be found in Section 4.1.1.
According to the analysis performed in D6.1, the security concerns related to the architecture to be
proposed in WP5, together with WP4 and WP3, can be summarized in five main points, identified in red and
green in Figure 8, which are aligned with the security concerns of the project. WP5 will propose
programming abstractions that build on the underlying layers, and therefore the security concerns of
WP5 are also integrally connected with those of the underlying layers.
Regarding security, WP5 requires services for authentication, authorization and accounting for the
infrastructure and the applications, and it is also important to ensure the privacy and access control of the
data for the operators that work with WP4. Another concern is the security of the API provided for
application development, which should not allow developers to perform tasks that interfere with other
running applications, nor allow malicious users to exploit its inputs to subvert the functionality of the
applications.
To address these objectives, requirements were defined in D6.1. In the following we summarize the key
requirements; further details can be found in Section 4 of D6.1.
WP6 AAA corresponds to AAA provisioning. It will be necessary to develop two distinct AAA blocks,
with distinct functionality, as follows:
1. EUBra-BIGSEA Infrastructure AAA, which provides the AAA functionalities required for managing the
EUBra-BIGSEA framework (access to cloud resources), from both the Infrastructure and Platform
perspectives (focusing on infrastructure managers and application developers/providers). The scope
of this service is the whole EUBra-BIGSEA framework as it matches the nature of the services focused
on cloud infrastructure management.
2. EUBra-BIGSEA Applications AAAaaS, which provides AAA-as-a-Service for applications developed
and hosted in the EUBra-BIGSEA framework that need services for authenticating and
authorizing their end users. The scope of an AAAaaS instance is limited to the application making use of
it, and AAAaaS directly matches the nature of the set of services focused on end users and
enterprise/consumer applications.
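The difference between the two scopes can be sketched as below. The class and method names are hypothetical, chosen only to contrast framework-wide scope with per-application scope; they are not an API defined in D6.1:

```python
# Hypothetical sketch contrasting the scopes of the two AAA blocks.
class InfrastructureAAA:
    """Framework-wide AAA: infrastructure managers and app developers/providers."""
    scope = "eubra-bigsea-framework"

    def authorize(self, principal, action):
        # e.g. may this developer deploy onto the cloud resources?
        return (principal, action, self.scope)

class ApplicationAAAaaS:
    """Per-application AAA-as-a-Service for the application's end users."""
    def __init__(self, application_id):
        self.scope = application_id   # limited to the application using it

    def authorize(self, end_user, action):
        return (end_user, action, self.scope)
```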
WP6 Assurances corresponds to the security assurance requirements R6.3.1 and R6.3.2 (see D6.1), which
can be detailed as follows:
1. The End Users of the applications that will run inside the EUBra-BIGSEA infrastructure should not be
able to subvert, through the inputs of such applications, the functionalities implemented. To this end,
it is necessary to perform a detailed assessment of the APIs to be used in the development of
applications.
2. The Data App Developers should not be able to develop applications that interfere with the
remaining applications running inside the framework. For this, besides the assessment of the APIs
made available, it will also be necessary to propose a set of recommendations for development best
practices.
WP6 Privacy corresponds to the concerns regarding the security of the data, i.e. the protection of the
confidentiality, privacy and anonymity of the data. It is necessary to include in the WP5 programming
abstractions ways for the programmer to express the privacy characteristics of the developed algorithms.
These abstractions will be enforced in the underlying layers, through mechanisms to be implemented in WP4.
Lemonade (see 4.1.3.1) is the best candidate to allow users to express these preferences. In practice,
a set of operators will be developed to extend the Lemonade syntax in order to include information that
characterizes the algorithms according to the way their internals influence, in terms of privacy, the data
being processed.
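Such operators could take the shape of annotations attached to workflow operations. The sketch below is purely illustrative of the idea; the annotation name and its attributes are assumptions, since the actual Lemonade extension is yet to be designed:

```python
# Hypothetical privacy annotation for Lemonade operations; names are assumed.
def privacy(reads_personal_data=False, anonymizes=False, output_contains_pii=False):
    """Tag an operation with how its internals influence the privacy of
    the data being processed, so the underlying layers (WP4) can enforce it."""
    def decorate(func):
        func.privacy = {
            "reads_personal_data": reads_personal_data,
            "anonymizes": anonymizes,
            "output_contains_pii": output_contains_pii,
        }
        return func
    return decorate

@privacy(reads_personal_data=True, anonymizes=True)
def aggregate_trips(records):
    # aggregation removes direct identifiers from the output
    return len(records)
```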
RAL1. Support to QoS batch jobs
Description: Support to QoS bounded jobs.
COMPSs: COMPSs supports different backends through connectors that implement specific functionalities;
QoS provided by the user can be translated to resource constraints.
Spark: QoS constraints will be supported through command line options.
Lemonade: Provided by the underlying technology. Will provide mechanisms to annotate QoS requirements
for applications and their components.

RAL3. Support to reactive elasticity
Description: The runtime must adapt its usage of resources according to the changes in the QoS
infrastructure.
COMPSs: COMPSs adapts the resources according to the computational load. Reconfiguration of the pool of
resources will be evaluated in the project.
Spark: Spark provides basic mechanisms for adding and removing worker nodes. WP3 will implement
solutions and advanced policies for runtime cluster reconfiguration.
Lemonade: Provided by the underlying technology.

RAL5. Definition of QoS constraints in the programming interface
Description: The programming interface must provide ways to express QoS that will be translated to
resource constraints by the runtime.
COMPSs: It is part of the WP5 activities.
Spark: No. It is part of WP5 objectives.
Lemonade: Yes.

RAL6. Support to privacy in the definition of the algorithms
Description: Include ways for the programmer to express the privacy characteristics of the developed
algorithms.
COMPSs: No. It is part of WP5 objectives.
Spark: No. It is part of WP5 objectives.
Lemonade: Yes.
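As an example of the translation envisaged by RAL1 and RAL5, user-level QoS parameters could be mapped onto resource constraints for the runtime. The mapping rules below are illustrative assumptions, not the actual policy to be defined by WP3/WP5:

```python
def qos_to_constraints(qos):
    """Translate user-facing QoS parameters into resource constraints
    that a runtime such as COMPSs could enforce. The thresholds and
    rules of thumb below are assumptions for illustration only."""
    constraints = {}
    if "deadline_seconds" in qos:
        # tighter deadlines ask for more parallel workers
        constraints["min_workers"] = 4 if qos["deadline_seconds"] < 1800 else 1
    if "dataset_size_gb" in qos:
        # rule of thumb: keep a memory headroom over the dataset size
        constraints["memory_gb"] = int(qos["dataset_size_gb"] * 1.5) + 2
    return constraints
```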
5 CONCLUSIONS
This document has provided the description of the programming abstraction layer of the EUBra-BIGSEA
platform. It complements deliverables D3.1, D4.1 and D6.1, which focus on the definition of the QoS
infrastructure, the Big Data ecosystem and the security strategy, respectively. Here the aim is to identify
the technologies that users of the platform can adopt to transparently implement Big Data applications.
The analysis of the user requirements has led to the definition of a set of specifications for the
implementation of the components. The basis of the abstraction layer is the COMPSs framework, which
provides a programming model to define applications whose execution, where possible, is automatically
parallelized. COMPSs will be extended to be interoperable with the Mesos middleware, thus benefitting from
the capability of automatically increasing resources based on QoS-driven mechanisms. COMPSs will be
used to implement, on top of the data analytics layer, new workflows that will serve as building blocks for
the end-user applications. To complement COMPSs and to support existing ML algorithms, the Spark
programming model will also be used.
The Lemonade tool will be adopted as the graphical interface to compose applications and to generate code
for COMPSs and Spark.
6 REFERENCES
[R01] Badia RM, Conejero J, Diaz C, Ejarque J, Lezzi D, Lordan F, Ramon-Cortes C, Sirvent R. COMP Superscalar,
an interoperable programming framework. SoftwareX [Internet]. 2015;3-4:32-36. Available from:
http://www.sciencedirect.com/science/article/pii/S2352711015000151.
[R02] Apache Parquet. Available from: https://parquet.apache.org/
[R03] S. Fiore, C. Palazzo, A. D’Anca, I. T. Foster, D. N. Williams, G. Aloisio, “A big data analytics framework
for scientific data management”, IEEE BigData Conference 2013: 1-8.
[R04] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott
Shenker, and Ion Stoica. 2011. Mesos: a platform for fine-grained resource sharing in the data center. In
Proceedings of the 8th USENIX conference on Networked systems design and implementation (NSDI'11).
USENIX Association, Berkeley, CA, USA, 295-308.
7 GLOSSARY
Acronym Explanation Usage Scope