
Imagining the UK National Data Infrastructure
Connecting up Big Data in the UK

Project Directors Group (PDG)

Report of the UK National e-Infrastructure Project Directors Group workshop held at the Farr Institute,
London, 15th December 2014







Authors:
David Fergusson, Francis Crick Institute
David Colling, Imperial College / GridPP / WLCG
David de Roure, University of Oxford / ESRC
Martin Hamilton, Jisc (editor)
Brian Matthews, STFC
Jacky Pallas, University College London / eMedLab
David Salmon, Jisc
Jeremy Yates, University College London / STFC DiRAC



Contents
1. Purpose and scope
2. Integration
3. Capability
4. Connections
5. Infrastructure
6. Deliverables
7. An Imagined Data Infrastructure: Another Traditional View



1. Purpose and scope


The data ecosystem in the UK is expanding rapidly to cope with the demands of the UK's data-intensive research.
We recognise the key challenges ahead if we are to develop our world-leading data infrastructure in a sustainable
and innovative way. In response to these challenges, the National e-Infrastructure Project Directors Group
(NeI-PDG) brought together in December 2014 a large number of representatives from RCUK-funded Big Data
projects to imagine how the national data infrastructure could develop.


Figure 1 The UK National Data Landscape for Research

The working group have made a number of recommendations in the key themes of Integration, Capability,
Connections, and Infrastructure (as identified in the EPSRC e-Infrastructure roadmap [1]) and we outline some key
deliverables for 2015.

[1] http://www.epsrc.ac.uk/newsevents/pubs/e-infrastructure-roadmap/


2. Integration

Our aspiration is for the UK to have an integrated e-infrastructure: one that is run and managed as a whole
without silos or boundaries, where there are simple processes by which users can get access to the
e-infrastructure they need across the ecosystem, as appropriate for the type or stage of research they are
doing.
We do not envisage the UK data infrastructure as a single system but rather an integrated solution which reflects
the range of excellent science supported via both large-scale projects and research institutions. We propose to
build on existing resources and work towards better integration through best practice for sharing data coupled to
extensive training support.
The UK engages in a broad range of international projects such as EUDAT, ELIXIR, and SKA. There is a need for a
single voice for the UK in the international arena which can represent the academic community in large
collaborations.
Recommendation: Build on international activity (standards, policies etc.) in a more strategic and co-ordinated
way. There is a role for an RCUK coordinator to ensure that the UK gets value for money from its
involvement in, and subscriptions to, large-scale international collaborations.
There is an expectation that significant capital investment in the research e-Infrastructure should deliver benefits
for UK industry, especially allowing SMEs to benefit through access to big data and compute resources. Some of
these benefits can be realised through direct collaboration between industry and academic institution(s).
However, we believe that there are additional opportunities in leveraging funding with Innovate UK and
established (or future) Catapult Centres.
Recommendation: Identify funding opportunities within existing streams to allow academic institutions to
interact more effectively with the existing and projected future Catapult Centres, as a mechanism to engage
industry more effectively around key areas such as digital health and future cities/urban transformation.

We can only work effectively and share data with researchers, whether UK, international or industry, if datasets
are managed and discoverable.



Standards

• Datasets
• Metadata, e.g. schema.org, CKAN, DataCite and others. We need methods to capture metadata
automatically
• Internationally agreed, community driven
• Domain/project specific, regulatory (e.g. health)

De facto standards have often been driven by common hardware in instruments across domains, e.g. EXIF in
digital imaging. On top of those we then need to layer domain-specific metadata standards, including discovery
metadata. In some domains these are well established, such as Biosharing.org; however, this is not widely the
case.
Metadata is a key enabler of data management and discovery, and at big data scales its collection and
sometimes its use must be automated. However, there is a need to document the current metadata landscape
and best practice, and identify areas for further development, improvement and standardization. This
documentation will become a living document, maintained in collaboration with those organisations involved in
the Open Research Data and Data Transparency areas, e.g. the Digital Curation Centre and HE institutions.
Recommendation: Metadata is a key enabler of data management and discovery, which at big data scales
must be automated. However, there is a need to document the current metadata landscape and best practice,
and identify areas for further development, improvement and standardization.
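
To illustrate what automatic metadata capture might involve, the following is a minimal sketch in Python. It assumes a dataset is a single file and builds a record loosely modelled on the schema.org Dataset vocabulary from facts the file system already holds (name, size, modification time, checksum). The function name capture_metadata, the creator and project arguments and the "x-project" key are assumptions made for this example only; they are not part of any standard or of any project named in this report.

```python
import hashlib
import mimetypes
import os
from datetime import datetime, timezone


def capture_metadata(path, creator, project):
    """Build a schema.org-style Dataset record without asking the researcher to type it in."""
    stat = os.stat(path)
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Hash in 1 MiB chunks so large files are never held in memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    media_type, _ = mimetypes.guess_type(path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": os.path.basename(path),
        "creator": creator,
        "dateModified": modified.isoformat(),
        "distribution": {
            "@type": "DataDownload",
            "contentSize": f"{stat.st_size} B",
            "encodingFormat": media_type or "application/octet-stream",
            "sha256": digest.hexdigest(),
        },
        # Hypothetical key used only for this illustration; not a schema.org term.
        "x-project": project,
    }
```

A library of this kind could be called by an instrument or application at the moment a file is written, so that the record is created alongside the data rather than reconstructed later.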

In order to promote sharing at scale, researchers must see some benefit beyond compliance with RCUK and other
funder policies. Sharing of datasets should bring academic credit through data citations (for example via the
DataCite consortium), with DOIs or other persistent identifiers being associated with published datasets.
Publication of datasets should be captured as an impact outcome of funded research through metrics portals
such as Researchfish. Jisc are also reviewing proposals for innovations in data management in the Research Data
Spring initiative [2].
Recommendation: Recognise the impact of research datasets on the community through the use of DOIs
or other common identifiers and, equally, give credit to researchers for generating datasets. Metrics should be
captured via existing mechanisms such as Jisc, Gateway to Research and Researchfish.
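
As a small illustration of how a persistent identifier turns a dataset into a citable output, the sketch below formats a data citation from a handful of fields. The layout (creators, year, title, publisher, resolvable DOI link) loosely follows the pattern common in data citation guidance; the class and function names, and the DOI and publisher shown in the usage note, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    creators: tuple   # e.g. ("Smith, J.", "Jones, A.")
    year: int         # year the dataset was published
    title: str
    publisher: str    # the repository or data centre holding the dataset
    doi: str          # bare DOI without the resolver prefix


def format_citation(record):
    """Return a one-line citation ending in a resolvable https://doi.org/ link."""
    names = "; ".join(record.creators)
    return (
        f"{names} ({record.year}): {record.title}. "
        f"{record.publisher}. https://doi.org/{record.doi}"
    )
```

For example, a record with creators ("Smith, J.", "Jones, A."), year 2015, title "Example survey data", publisher "Example Data Centre" and the made-up DOI "10.1234/example" would be rendered as a single line that a metrics portal or reference manager could store and count.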


[2] http://www.Jisc.ac.uk/rd/projects/research-data-spring


3. Capability

There is broad recognition of the concept of research data management as an essential activity across the
project lifecycle rather than just a paper exercise at the time of grant submission, as illustrated in the DCC Data
Lifecycle model below. RCUK has driven the requirement for institutions to show leadership in research data
management, with a joint position on Data Management [3], and the EPSRC in particular is asking HEIs
to meet specific standards by May 2015 [4].


Figure 2 - The Digital Curation Centre Lifecycle Model

Training in research data management needs to speak to projects/centres, institutions and individual researchers
at all levels. There is a huge opportunity to reach Early Career Researchers in particular through existing Centres
for Doctoral Training via a "train the trainers" approach.


[3] http://www.rcuk.ac.uk/research/datapolicy
[4] http://www.epsrc.ac.uk/files/aboutus/standards/clarificationsofexpectationsresearchdatamanagement/



Recommendation: Training in data management. Build upon existing PDG, SSI and DCC activities to create a
concerted and coordinated approach to promoting best practice in data management. Capitalize on existing
activities to orchestrate this, e.g. "train the trainers", whereby the actual training is delivered by projects and
institutions.

Capacity building and skills training



The need for technology transfer between subject domains, in terms of staff experience rather than
commercialization, was recognised. While RCUK has a number of schemes for academic placements such as
Bridging the Gaps, there is no equivalent for technical staff. One possible activity was proposed: a cross-RCUK,
technology-specific big data scheme. Proposals to such a scheme would preferably be driven by an actual problem,
ideally across disciplines or e-Infrastructure projects, and provide potential for host institution staff to gain
management or supervisory experience.

Recommendation: Sharing excellence across domains, e.g. a cross-RCUK initiative buying out staff time (not just
academics) for a defined period to work on specific activities, with proposals from two subject domains as a minimum.



4. Connections
User management

User management systems are essential to enable researcher access to regional and national systems. This is
especially important for the health informatics and administrative data networks, which require additional
security and two-factor authentication systems. There are existing activities around Shibboleth, SAFE, VOMS,
Moonshot and Safe Share, but existing well-established services and facilities have their own approaches that
need to be taken into account. Pilots will lead to recommendations for common standards. There is a
particular role for Jisc and RCUK here in terms of international standards liaison, e.g. W3C, schema.org, Research
Data Alliance. This will require wider buy-in from the community as well as pump-priming funding.

Data Transfer and access



Many closely coupled systems have compute and storage co-located, and there are some examples of tiered
approaches when huge volumes of data are involved, e.g. WLCG. The group felt that these issues were typically
addressed as part of projects. Exemplars for researcher access to datasets (and compute) respecting trust
boundaries include EBI, UKDA, NERC data centres and GridPP data movement orchestration. The comparison was
made between LHC data (instrument in the stream) and the Twitter firehose for social sciences
studies.
There is a requirement for remote data access for researchers with the necessary control and orchestration, and
caching tiers. Examples range from a client running on an end user workstation (GridPP) to access mediated
through a website (EBI). We propose a new project to develop cross-discipline solutions to managing data
transfer through joint working with biomedical and physical science domains.
Recommendation: Orchestrating data transfer is a particular example: the problem is widely recognised, and
there are already understood approaches in some subject domains. We propose a joint Crick, EBI and GridPP
project on orchestrating data transfer.
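
The core of what "orchestration" means here can be sketched in a few lines: work through a list of transfer jobs, verify each copy against the source checksum, and retry on mismatch. The sketch below uses local file copies as a stand-in for a real wide-area transfer tool; the names TransferJob and run_transfers and the retry limit are assumptions for illustration and do not describe FTS, GridPP or any other existing service.

```python
import hashlib
import shutil
from dataclasses import dataclass
from pathlib import Path


def sha256_of(path):
    """Checksum a file in chunks so that large files do not exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class TransferJob:
    source: Path
    destination: Path
    attempts: int = 0
    done: bool = False


def run_transfers(jobs, max_attempts=3, copy=shutil.copyfile):
    """Copy each job until the destination checksum matches the source.

    Returns the list of jobs that still failed after max_attempts.
    The copy argument is a placeholder for whatever transfer mechanism is used.
    """
    for job in jobs:
        expected = sha256_of(job.source)
        while not job.done and job.attempts < max_attempts:
            job.attempts += 1
            copy(job.source, job.destination)
            job.done = sha256_of(job.destination) == expected
    return [job for job in jobs if not job.done]
```

Keeping the copy mechanism as a parameter reflects the point made above: different domains already have their own transfer tools, and the shared part is the control logic around them.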



5. Infrastructure
Networks

The group felt that with the recent investment in Janet6, the network had sufficient capacity and room for
expansion. However, access to high capacity for short periods would increasingly be required. A number of
points were raised about campus networks which would be challenging, difficult or expensive to address:
• Last mile, e.g. campus network to end user.
• Is the campus LAN fit for purpose for NeI users?
• Do campus firewalls have sufficient throughput?
• Is the campus Janet connection oversubscribed, or is a separate research connection required?
• What would a campus focal point look like? e.g. GridPP use of Squid cache.
• Estates constraints on many institutions: listed buildings, busy city streets etc.
• Investment in Janet6: improved connectivity to major research institutions and improved resilience for
day-to-day use.

Q: Do we need a new equivalent to the HEFCE LAN/MAN initiative?

Q: What would a NeI Network Appliance look like?


Would it be:
• a Virtual Machine (VM) image, or
• a tuned Transmission Control Protocol (TCP) stack, e.g. Maximum Transmission Unit (MTU)?

It would need to use AAAI and it should support scheduled file transfers.

Recommendation: The group felt that more flexible access to high capacity networking for defined periods
would increasingly be required. For example, the eMedLab project will be moving 2.5 PB of data from EBI at the
start of the project (April 2015).
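
To give a sense of scale for a transfer of this size, the short calculation below estimates how long 2.5 PB takes to move at a sustained rate. The link speeds and the efficiency factor are illustrative assumptions, not figures from the workshop, and decimal units are assumed (1 PB = 10^15 bytes).

```python
PB = 10**15          # bytes in a petabyte (decimal units, an assumption)
GBIT = 10**9         # bits per second in one gigabit per second
SECONDS_PER_DAY = 86_400


def transfer_days(volume_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move volume_bytes at a sustained fraction of the link rate."""
    seconds = volume_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for gbits in (1, 10, 100):
        days = transfer_days(2.5 * PB, gbits * GBIT)
        print(f"{gbits:>3} Gbit/s sustained: {days:7.1f} days")
```

At a fully sustained 10 Gbit/s the transfer takes roughly 23 days, and any protocol overhead or contention lengthens this further, which is why time-limited access to higher capacity matters for such projects.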



Archive

There was much discussion around archives, defined as long-term storage of immutable datasets. Some projects
have their own archives and some disciplines have international repositories (e.g. EBI). However, the RCUK data
sharing policy has specific requirements to make research data objects available for up to 10 years after the last
requested access. The group felt that it was difficult to focus on approaches offered by individual institutions and
proposed a survey of the data management landscape. Any institutional archive should provide DOIs or
persistent identifiers for datasets to allow discovery, and a means of crediting researchers for creating and
depositing datasets (as outlined earlier).
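
One practical consequence of a "10 years after the last requested access" rule is that the retention clock restarts every time a dataset is requested, so an archive needs to record access dates. The sketch below shows one reading of the rule as summarised above; the function names and the handling of 29 February are assumptions made for this example, not policy text.

```python
from datetime import date

RETENTION_YEARS = 10  # taken from the policy summary above


def retain_until(last_access, years=RETENTION_YEARS):
    """Date up to which a dataset must stay available, counted from its last access."""
    try:
        return last_access.replace(year=last_access.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year; fall back to 28 February.
        return last_access.replace(year=last_access.year + years, day=28)


def may_be_removed(last_access, today):
    """True only once the full retention period since the last access has passed."""
    return today > retain_until(last_access)
```

A dataset last requested on 1 April 2015 would therefore need to remain available until 1 April 2025, and a single further request in 2024 would move that date to 2034.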



6. Deliverables
Pre-Requisites

• The Data Analytics and Open Research Data activities in the data e-Infrastructure should be supported
by a simple layered middleware and software e-Infrastructure.
• This e-Infrastructure should consist of a Common Basic Layer (CBL) on which a Research Domain
Specific Layer (RDSL) would sit.
• The CBL should therefore be small and capable of generic use.
• The RDSL needs to be constructed at the same time.
• Key elements of the CBL are:
o The AAAI and security models: "I am who I am and I can use resources".
o Control of access to data: the RCUK AAAI project Safe Share is delivering aspects of this.
o Data in-flight security: "my data is going to flow OK and only the right people will get it and see it".
o Data at-rest security: "it's looked after and I am obeying the pertinent regulations". The data are
open to those who are allowed to see them; they are searchable and queryable.
o Cloud/Grid middleware to enable appropriate resources to be used. From the user perspective
this can be broken down into the following steps:
1. I can see resources.
2. I am allowed to use resources.
3. I am actually using resources.
4. Here is what I have used.
5. Here are my results in the place I asked them to be put.
o Wrapping compute around big data: use of virtualisation and containers to send our workflows
to where the data reside. The local compute simply executes the workflows we have
constructed or run on other machines.
o An Application Program Interface (API) that allows data policies (e.g. metadata requirements) to
be put into effect in applications.
o Simple tools and services to enable data discovery and exploration. Data can be accessed and
queried using published metadata and data transport tools.
• An RDSL would have elements such as:
o Applications or web portals that allow its researchers to use CBL services. These are the
user-friendly User Interface (UI) and would be the gateway to the NeI for the average researcher.
o If needed, extra security and AAAI requirements could be included here.
o Access to training resources could be included, such as online courses and tests.
o The interfaces and APIs to the Data Analytics and Open Research Data infrastructures would
reside in the RDSL.
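
The layering described above can be made concrete with a toy model: a small generic layer that handles entitlement checks, execution, accounting and result placement (the five user-facing steps), and a thin domain layer that offers researchers a friendlier entry point on top of it. All class, method and resource names below are hypothetical and chosen only to mirror the text; this is a sketch of the idea, not a proposed design.

```python
from dataclasses import dataclass, field


@dataclass
class CommonBasicLayer:
    """Toy stand-in for the CBL: generic access control, execution and accounting."""
    entitlements: dict                              # user -> set of usable resource names
    usage: dict = field(default_factory=dict)       # user -> list of (resource, units)
    results: dict = field(default_factory=dict)     # destination -> stored output

    def visible_resources(self, user):
        # Step 1: which resources can this user see?
        return sorted(self.entitlements.get(user, set()))

    def is_authorised(self, user, resource):
        # Step 2: is this user allowed to use the resource?
        return resource in self.entitlements.get(user, set())

    def run(self, user, resource, workflow, destination):
        if not self.is_authorised(user, resource):
            raise PermissionError(f"{user} may not use {resource}")
        # Step 3: use the resource by sending the workflow to it.
        output = workflow()
        # Step 4: record what was used (one unit per run in this toy model).
        self.usage.setdefault(user, []).append((resource, 1))
        # Step 5: put the results where the user asked for them.
        self.results[destination] = output
        return output


class DomainPortal:
    """Toy stand-in for an RDSL: a domain-specific front end over CBL services."""

    def __init__(self, cbl, default_resource):
        self.cbl = cbl
        self.default_resource = default_resource

    def mean_of(self, user, values, destination):
        # The researcher asks a domain question; the portal turns it into a CBL workflow.
        return self.cbl.run(
            user, self.default_resource, lambda: sum(values) / len(values), destination
        )
```

The point of the split is visible in the sizes of the two classes: everything about who may do what, and what was used, lives once in the common layer, while each domain layer only translates its researchers' tasks into calls on it.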



• Hardware will be domain and activity specific. However, object stores that can act as repositories could
be centralised and be a common activity between the RCs.

Current activities, and our progress in creating these Pre-Requisites, are listed in Table 1 below.

Table 1: Pre-Requisites for the Data Infrastructure

Infrastructure | Projects | Who is Responsible?
Authentication, Allocation and Authorisation Infrastructure with 2 factor Security Controls | Jisc-led Safe Share Project already underway | Jisc and partners from ESRC and MRC
Authentication, Allocation and Authorisation Infrastructure with 2 factor Security Controls | Research Domain aspects of AAAI need to be constructed | Research Domains
Data-in-flight Information Assurance | Jisc | Jisc, Research Domains
Data-at-rest Information Assurance | No overall description, or indeed none | NeI as a whole
Data abstraction layer development | NeI Projects | PDG members, RCs
Networks | High Capacity Networking | Jisc
Networks | Local Research Organisation | RO
Networks | Links to Business | Jisc
Advanced Compute | NeI Projects | PDG members, RCs
Data Storage Facilities | NeI Projects | PDG members, RCs
Cloud/Grid Infrastructure | GridPP, JASMIN2, EMBASSY CLOUD, eMedLab | Cloud WG, PDG
Tools and Software | Varied, no coherence | Big Data SIG, PDG and RCs




What needs to be tried out and tested?

The tools and software needed to discover data and move data around (needed for multiple data sources) need
to be developed into a coherent and simple package.

Table 2 lists a set of deliverables that can be achieved in 2015 to enable this. However, these are dependent
on the activities listed in Table 1, which is why the tests will be done in the field on live NeI systems.


Table 2: List of Deliverables

Recommendations | Action | Milestone (OWNER)
Training in data management | Projects to produce data management plans and run courses on data management for user communities and staff. CDTs to be involved. | DMPs and courses in place by June 2015 (PDG)
Document the current metadata landscape and best practice | RCs to document the relevant metadata standards and publish these standards | Publish standards and insist on their use, particularly when data are produced (RCUK)
Document the current metadata landscape and best practice | Create code libraries that applications can use to produce metadata when data are produced | Demonstrate on PDG Projects systems (PDG)
Develop data abstraction layer | Build, test and open source software tools for data abstraction and presentation of metadata | Integration of iRODS and OpenStack as a POC for data integration and presentation (PDG)
Co-ordination of International Projects to extract best value and influence agendas | Produce report on the various national and international projects the UK is involved in | Produce Strategy Document (RCUK)
Working with Catapult Centres | Work with Innovate UK to ensure that business has access to Janet | Simple contracts and portal: make sure business can book network access easily (Jisc)
Working with Catapult Centres | RCUK NeI Group to communicate to academic community opportunities to work with Catapult Centres | Adding to existing regular research bulletins (RCUK); organise joint academic/Innovate UK workshops to link academy to Catapults (RCUK)
Data Transfer 1: Data transport and orchestration | Make FTS a generic tool to act as an aggregator and orchestrator and link to the RCUK AAAI | Test on the DiRAC, JASMIN2 and eMedLab systems (PDG, Jisc)
Data Transfer 2: High Capacity Network Access | Secure transport of data to eMedLab and RAL WOS | Transfer of multi-PB EBI data to eMedLab and DiRAC@Durham data to RAL WOS (PDG, Jisc)
Data Transfer 3: Creating Single Name Spaces | Create WLAN and VLANs in projects to create single filesystems (global spaces) between distributed systems | Test on DiRAC systems between Durham and Edinburgh and between EBI and eMedLab (PDG, Jisc); test on wLHC and DiRAC (PDG, Jisc)
Knowledge Transfer and Consultancy | Produce work programme | Produce by April 2015




7. An Imagined Data Infrastructure: Another Traditional View
A schematic of what a National Data e-Infrastructure may look like. Note the ubiquitous presence of Janet.
Key: a Janet Connection

The Proposed CBL and RDSL would be the enabling middleware infrastructure for this e-Infrastructure.

[Figure: schematic of an imagined National Data e-Infrastructure. Labelled components: HEI 1, HEI 2, HEI 3;
DIAMOND; Sanger, EBI, ESRC, DiRAC, ARCHER; JASMIN2; National Deep Archive Service; National Tertiary Storage
Service; Meta Data Presented to World; Local Tertiary Storage Layer; Database Creation/Ingestion Layer and
Analytics; Parallel File System, HEI RDM/Repository; Data Generator (Experiments, Clusters, PCs). Inset: the
attributes and functional blow-up of a typical local system, the National Tertiary Storage Service and the
National Deep Archive Service.]





The principal components needed for such an e-Infrastructure are:
1. Local tertiary storage platforms for active data.
2. A database creator/ingestor widget to create structured data from unstructured data, together with policies
to tag such data with metadata, e.g. owner, project, grant number etc.
3. A national tertiary storage/metadata service to build up and store metadata from the other databases
in the National e-Infrastructure, as well as to store our major active databases.
4. A National Deep Archive Service to store data that has been produced by National Facilities and to
provide data replication services for the National e-Infrastructure.
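
The second component, an ingestor that both structures data and enforces a tagging policy, can be illustrated with a minimal sketch. It assumes the unstructured input is text containing "key = value" lines and that the policy is simply a list of tags that must be present; the function name, the tag names and the input format are assumptions for this example.

```python
REQUIRED_TAGS = ("owner", "project", "grant_no")  # example policy, following the list above


def ingest(raw_text, tags, required=REQUIRED_TAGS):
    """Turn loosely formatted 'key = value' text into rows, refusing untagged data."""
    missing = [name for name in required if not tags.get(name)]
    if missing:
        # The policy is enforced at ingestion time, before anything is stored.
        raise ValueError("missing required metadata tags: " + ", ".join(missing))
    rows = []
    for line in raw_text.splitlines():
        key, separator, value = line.partition("=")
        if not separator or not key.strip():
            continue  # skip blank lines and anything that is not a key = value pair
        rows.append({"key": key.strip(), "value": value.strip()})
    return {"metadata": dict(tags), "rows": rows}
```

Rejecting untagged input at this point is what allows the national metadata service in the third component to rely on every record carrying at least an owner, a project and a grant number.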
This is a traditional representation of a computing infrastructure. It is very much the end point of the work
proposed in this document, which is why it belongs at the end.
The work proposed in this document enables this infrastructure to exist in an effective way. The outputs we
propose are the real Data Infrastructure in that they enable data to be moved, selected, and queried. It is these
that give the data its form and value.
