Imagining The UK National Data Infrastructure

Connecting up Big Data in the UK

Project Directors Group (PDG)

Report of the UK National e-Infrastructure Project Directors Group workshop held at the Farr Institute,
London, 15th December 2014

Authors:
David Fergusson, Francis Crick Institute
David Colling, Imperial College / GridPP / WLCG
David de Roure, University of Oxford / ESRC
Martin Hamilton, Jisc (editor)
Brian Matthews, STFC
Jacky Pallas, University College London / eMedLab
David Salmon, Jisc
Jeremy Yates, University College London / STFC DiRAC

Contents
1. Purpose and scope
2. Integration
3. Capability
4. Connections
5. Infrastructure
6. Deliverables
7. An Imagined Data Infrastructure: Another Traditional View

1. Purpose and scope

The data ecosystem in the UK is expanding rapidly to cope with the demands of the UK's data-intensive research.
We recognise the key challenges ahead if we are to develop our world-leading data infrastructure in a sustainable
and innovative way. In response to these challenges, in December 2014 the National e-Infrastructure Project
Directors Group (NeI-PDG) brought together a large number of representatives from RCUK-funded Big Data
projects to imagine how the national data infrastructure could develop.

Figure 1 - The UK National Data Landscape for Research

The working group has made a number of recommendations under the key themes of Integration, Capability,
Connections, and Infrastructure (as identified in the EPSRC e-Infrastructure roadmap; see footnote 1), and we
outline some key deliverables for 2015.

1 http://www.epsrc.ac.uk/newsevents/pubs/e-infrastructure-roadmap/

2. Integration

Our aspiration is for the UK to have an integrated e-infrastructure: one that is run and managed as a whole
without silos or boundaries, where there are simple processes by which users can get access to the
e-infrastructure they need across the ecosystem, as appropriate for the type or stage of research they are
doing.
We do not envisage the UK data infrastructure as a single system but rather an integrated solution which reflects
the range of excellent science supported via both large-scale projects and research institutions. We propose to
build on existing resources and work towards better integration through best practice for sharing data coupled to
extensive training support.
The UK engages in a broad range of international projects such as EUDAT, ELIXIR, and SKA. There is a need for a
single voice for the UK in the international arena which can represent the academic community in large
collaborations.
Recommendation: Build on international activity (standards, policies, etc.) in a more strategic and co-ordinated
way. There is a role for an RCUK coordinator to ensure that the UK gets value for money from its
involvement in, and subscriptions to, large-scale international collaborations.
There is an expectation that significant capital investment in the research e-Infrastructure should deliver benefits
for UK industry, especially allowing SMEs to benefit through access to big data and compute resources. Some of
these benefits can be realised through direct collaboration between industry and academic institution(s).
However, we believe that there are additional opportunities in leveraging funding with Innovate UK and
established (or future) Catapult Centres.
Recommendation: Identify funding opportunities within existing streams to allow academic institutions to
interact more effectively with the existing and projected future Catapult Centres, as a mechanism to engage
industry more effectively around key areas such as digital health and future cities/urban transformation.

We can only work effectively and share data with researchers, whether UK, international or industry, if datasets
are managed and discoverable.

Standards
Datasets
Metadata, e.g. schema.org, CKAN, DataCite and others. We need methods to capture metadata automatically.
Internationally agreed, community driven
Domain/project specific, regulatory (e.g. health)
De facto standards have often been driven by common hardware in instruments across domains, e.g. EXIF in
digital imaging. We then need to layer domain-specific metadata standards, with Discovery Metadata, on top of
those. In some domains these are well established, as with Biosharing.org; however, this is not widely the
case.
Metadata is a key enabler of data management and discovery, and at big data scales its collection and
sometimes its use must be automated. However, there is a need to document the current metadata landscape
and best practice, and identify areas for further development, improvement and standardization. This will
become a living document, in collaboration with those organisations involved in the Open Research Data and
Data Transparency areas e.g. Digital Curation Centre and HE institutions.
Recommendation: Metadata is a key enabler of data management and discovery, which at big data scales
must be automated. However, there is a need to document the current metadata landscape and best practice,
and identify areas for further development, improvement and standardization.
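
As a minimal sketch of what automatic metadata capture could look like at the point where a file is produced, the Python function below derives a small record from the file itself. The function name `capture_metadata` is hypothetical, and the field names (`name`, `contentSize`, `dateModified`, loosely borrowed from schema.org, plus a `sha256` checksum) are assumptions for illustration only; they are not a standard agreed by the group.

```python
import hashlib
import os
from datetime import datetime, timezone


def capture_metadata(path, extra=None):
    """Build a small metadata record for one file without user input."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            sha256.update(chunk)
    stat = os.stat(path)
    record = {
        "name": os.path.basename(path),
        "contentSize": stat.st_size,  # size in bytes
        "sha256": sha256.hexdigest(),
        "dateModified": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
    }
    # Fields the software cannot infer (owner, project, grant number)
    # are supplied by the calling application and merged in.
    record.update(extra or {})
    return record
```

An application could call `capture_metadata(path, {"project": "..."})` each time it writes an output file and store the resulting record alongside the data, so that discovery metadata does not depend on researchers filling in forms afterwards.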

In order to promote sharing at scale, researchers must see some benefit beyond compliance with RCUK and other
funder policies. Sharing of datasets should bring academic credit through data citations (for example via the
DataCite consortium), with DOIs or other persistent identifiers being associated with published datasets.
Publication of datasets should be captured as an impact outcome of funded research through metrics portals
such as Researchfish. Jisc are also reviewing proposals for innovations in data management through the
Research Data Spring initiative (see footnote 2).
Recommendation: Recognise the impact of research datasets on the community through the use of DOIs
or other common identifiers and, equally, give credit to researchers for generating datasets. Metrics should be
captured via existing mechanisms such as Jisc, Gateway to Research and Researchfish.

2 http://www.Jisc.ac.uk/rd/projects/research-data-spring

3. Capability

There is broad recognition of the concept of research data management as an essential activity across the
project lifecycle rather than just a paper exercise at the time of grant submission, as illustrated in the DCC Data
Lifecycle model below. RCUK has driven the requirement for institutions to show leadership in research data
management, with a joint position on Data Management (see footnote 3) and the EPSRC in particular asking HEIs
to meet specific standards by May 2015 (see footnote 4).

Figure 2 - The Digital Curation Centre Lifecycle Model

Training in research data management needs to speak to projects/centres, institutions and individual researchers
at all levels. There is a huge opportunity to reach Early Career Researchers in particular through existing Centres
for Doctoral Training via a "train the trainers" approach.

3 http://www.rcuk.ac.uk/research/datapolicy
4 http://www.epsrc.ac.uk/files/aboutus/standards/clarificationsofexpectationsresearchdatamanagement/

Recommendation: Training in data management - Build upon existing PDG, SSI and DCC activities to create a
concerted and coordinated approach to promoting best practice in data management. Capitalize on existing
activities to orchestrate this, e.g. train the trainers whereby the actual training is delivered by projects and
institutions.
Capacity building and skills training

The need for technology transfer between subject domains, in terms of staff experience rather than
commercialization, was recognised. While RCUK has a number of schemes for academic placements such as
Bridging the Gaps, there is no equivalent for technical staff. One possible activity was proposed: a cross-RCUK
scheme specific to big data technology. Proposals to such a scheme would preferably be driven by an actual
problem, ideally across disciplines or e-Infrastructure projects, and would provide potential for host institution
staff to gain management or supervisory experience.

Recommendation: Sharing excellence across domains - e.g. a cross-RCUK initiative buying out staff time (not just
academics) for a defined period to work on specific activities, with each proposal involving at least two subject
domains.

4. Connections
User management

User management systems are essential to enable researcher access to regional and national systems. This is
especially important for the health informatics and administrative data networks, which require additional
security and two-factor authentication systems. There are existing activities around Shibboleth, SAFE, VOMS,
Moonshot and Safe Share, but existing, well-established services and facilities have their own approaches that
need to be taken into account. Pilots will lead to recommendations for common standards. There is a
particular role for Jisc and RCUK here in terms of international standards liaison, e.g. W3C, schema.org, Research
Data Alliance. This will require wider buy-in from the community as well as pump-priming funding.
Data Transfer and access

Many closely coupled systems have compute and storage co-located, and there are some examples of tiered
approaches when huge volumes of data are involved, e.g. WLCG. The group felt that these issues were typically
addressed as part of projects. Exemplars for researcher access to datasets (and compute) respecting trust
boundaries include EBI, UKDA, NERC data centres and GridPP data movement orchestration. A comparison was
made between LHC data (instrument in the stream) and the Twitter firehose for social sciences
studies.
There is a requirement for remote data access for researchers with the necessary control and orchestration, and
caching tiers. Examples range from a client running on an end user workstation (GridPP) to access mediated
through a website (EBI). We propose a new project to develop cross-discipline solutions to managing data
transfer through joint working with biomedical and physical science domains.
Recommendation: Orchestrating data transfer is a particular example - the problem is widely recognised, and
there are already understood approaches in some subject domains. We propose a joint Crick, EBI and GridPP
project on orchestrating data transfer.

5. Infrastructure
Networks

The group felt that, with the recent investment in Janet6, the network had sufficient capacity and room for
expansion. However, access to high capacity for short periods would increasingly be required. A number of
points were raised about campus networks, which would be difficult or expensive to address:
Last mile - e.g. campus network to end user.
Is the campus LAN fit for purpose for NeI users?
Do campus firewalls have sufficient throughput?
Is the campus Janet connection oversubscribed, or is a separate research connection required?
What would a campus focal point look like? e.g. GridPP use of Squid cache
Estates constraints on many institutions - listed buildings, busy city streets etc.
Investment in Janet6, improved connectivity to major research institutions and improved resilience for
day-to-day use.

Q: Do we need a new equivalent to the HEFCE LAN/MAN initiative?
Q: What would a NeI Network Appliance look like?

Would it be
a Virtual Machine (VM) image, or
a tuned Transmission Control Protocol (TCP) stack, e.g. with an adjusted Maximum Transmission Unit (MTU)?

It would need to use AAAI, and it should schedule file transfers.
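
To illustrate why TCP and MTU tuning matter for such an appliance, the sketch below computes two standard quantities: the bandwidth-delay product (the amount of data that must be in flight to fill a link, and so a lower bound on the TCP buffer size) and the fraction of Ethernet wire capacity that carries payload at a given MTU. The link speed and round-trip time in any call are assumptions chosen for illustration, not measurements of Janet, and the overhead figures assume IPv4 and TCP headers without options on untagged Ethernet.

```python
def bandwidth_delay_product_bytes(link_gbit_per_s, rtt_ms):
    """Bytes that must be in flight to keep a link of this speed full."""
    bits_in_flight = link_gbit_per_s * 1e9 * (rtt_ms / 1000.0)
    return bits_in_flight / 8


def payload_efficiency(mtu_bytes):
    """Fraction of wire capacity carrying TCP payload at a given MTU.

    Assumes 40 bytes of IPv4 + TCP headers inside the MTU and 38 bytes of
    Ethernet framing outside it (header, FCS, preamble, inter-frame gap).
    """
    ip_tcp_headers = 40
    ethernet_overhead = 38
    return (mtu_bytes - ip_tcp_headers) / (mtu_bytes + ethernet_overhead)
```

For example, `bandwidth_delay_product_bytes(10, 10)` gives 12.5 million bytes for a 10 Gbit/s path with a 10 ms round trip, far above typical default socket buffers, and `payload_efficiency(9000)` is a few percentage points higher than `payload_efficiency(1500)`.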

Recommendation: The group felt that more flexible access to high capacity networking for defined periods
would increasingly be required. For example, the eMedLab project will be moving 2.5 PB of data from EBI at the
start of the project (April 2015).
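
A rough calculation shows why a transfer of this size calls for dedicated capacity. The sketch below estimates the duration of a bulk transfer; the link speeds and the `efficiency` factor are assumptions for illustration and do not describe the actual eMedLab arrangements, and a petabyte is taken as 10^15 bytes.

```python
def transfer_days(volume_pb, link_gbit_per_s, efficiency=1.0):
    """Estimate the days needed to move volume_pb petabytes over a link.

    efficiency is the fraction of the nominal link rate actually achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    volume_bits = volume_pb * 1e15 * 8
    seconds = volume_bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 86400  # seconds per day
```

At a fully used 10 Gbit/s, `transfer_days(2.5, 10)` comes to roughly 23 days; at 100 Gbit/s it falls to a little over two days, which is the kind of short-lived, high-capacity access the recommendation describes.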

Archive

There was much discussion around archives, defined as long-term storage of immutable datasets. Some projects
have their own archives and some disciplines have international repositories (e.g. EBI). However, the RCUK data
sharing policy has specific requirements to make research data objects available for up to 10 years after the last
requested access. The group felt that it was difficult to focus on approaches offered by individual institutions and
proposed a survey of the data management landscape. Any institutional archive should provide DOIs or
persistent identifiers for datasets to allow discovery, and a means of crediting researchers for creating and
depositing datasets (as outlined earlier).
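
The retention requirement above behaves as a rolling window: every access restarts the 10-year clock. A minimal sketch of that rule follows; the function names are hypothetical and the handling of 29 February is an assumption, since the policy text quoted here does not address it.

```python
from datetime import date

RETENTION_YEARS = 10  # "up to 10 years after the last requested access"


def retention_expiry(last_access, years=RETENTION_YEARS):
    """Date until which a dataset must remain available."""
    try:
        return last_access.replace(year=last_access.year + years)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 28 February.
        return last_access.replace(year=last_access.year + years, day=28)


def may_delete(last_access, today):
    """True only once the retention window has fully passed."""
    return today > retention_expiry(last_access)
```

Because the window is measured from the last access rather than from deposit, an archive needs to record access events as well as deposit dates in order to apply the rule.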

6. Deliverables
Pre-Requisites

The Data Analytics and Open Research Data activities in the data e-Infrastructure should be supported
by a simple layered middleware and software e-Infrastructure.
This e-Infrastructure should consist of a Common Basic Layer (CBL) on which a Research Domain
Specific Layer (RDSL) would sit.
The CBL should therefore be small and capable of generic use.
The RDSL needs to be constructed at the same time.
Key elements of the CBL are:
o The AAAI and Security Models: "I am who I am and I can use resources."
o Control of access to data: the RCUK AAAI project Safe Share is delivering aspects of this.
o Data in-flight Security: "my data is going to flow OK and only the right people will get it and see it."
o Data at-rest Security: "it's looked after and I am obeying the pertinent regulations." The data are
open to those who are allowed to see them; they are searchable and query-able.
o Cloud/Grid middleware to enable appropriate resources to be used. From the user perspective
this can be broken down into the following stages:
1. Can I see resources?
2. You can use resources.
3. You are actually using resources.
4. Here is what you have used.
5. Here are your results, in the place you asked for them to be put.
o Wrapping compute around big data: use of virtualisation and containers to send our workflows
to where the data reside. The local compute simply executes the workflows we have
constructed or run on other machines.
o An Application Program Interface (API) that allows Data Policies (e.g. metadata requirements) to
be actualised in applications.
o Simple Tools and Services to enable data discovery and exploration. Data can be accessed and
queried using published metadata and data transport tools.
An RDSL would have elements such as:
o Applications or web portals that allow its researchers to use CBL services. These are the
user-friendly User Interface (UI) and would be the gateway to the NeI for the average researcher.
o If needed, extra security and AAAI requirements could be included here.
o Access to training resources could be included, such as online courses and tests.
o The interfaces and APIs to the Data Analytics and Open Research Data infrastructures would
reside in the RDSL.
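
The layering described above can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not a design from the workshop: the class names `CommonBasicLayer` and `DomainPortal`, the metadata fields `owner`, `project` and `grant_no`, and the shape of a "workflow" (a callable returning a result and the CPU hours used) are all hypothetical. Comments map each step to the five user-perspective stages listed for the CBL.

```python
from dataclasses import dataclass, field


@dataclass
class CommonBasicLayer:
    """Hypothetical in-memory stand-in for the CBL services."""
    resources: dict  # resource name -> set of permitted users
    required_metadata: tuple = ("owner", "project", "grant_no")
    usage: dict = field(default_factory=dict)  # user -> [(resource, cpu_hours)]

    def visible_resources(self, user):
        # Stage 1: which resources can this user see?
        return sorted(self.resources)

    def authorised(self, user, resource):
        # Stage 2: an AAAI decision on whether the user may use the resource.
        return user in self.resources.get(resource, set())

    def run(self, user, resource, workflow, output_metadata):
        # Data policy check: refuse outputs lacking required metadata fields.
        missing = [k for k in self.required_metadata if k not in output_metadata]
        if missing:
            raise ValueError(f"missing metadata: {missing}")
        if not self.authorised(user, resource):
            raise PermissionError(f"{user} may not use {resource}")
        # Stage 3: actually use the resource by executing the workflow.
        result, cpu_hours = workflow()
        # Stage 4: record what was used.
        self.usage.setdefault(user, []).append((resource, cpu_hours))
        # Stage 5: hand back the results together with their metadata.
        return {"result": result, "metadata": dict(output_metadata)}


class DomainPortal:
    """Hypothetical RDSL front end wrapping CBL calls for one domain."""

    def __init__(self, cbl, domain_defaults):
        self.cbl = cbl
        self.domain_defaults = domain_defaults

    def submit(self, user, resource, workflow, **metadata):
        # Domain-wide defaults are filled in so researchers supply less by hand.
        merged = {**self.domain_defaults, **metadata}
        return self.cbl.run(user, resource, workflow, merged)
```

The point of the split is that access control, accounting and policy checks live once in the small generic layer, while each research domain only supplies its own defaults and user interface on top.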

Hardware will be domain and activity specific. However, object stores that can act as repositories could
be centralised and be a common activity between the RCs.

Our progress in creating these Pre-Requisites through current activities is listed below.

Table 1: Pre-Requisites for the Data Infrastructure

Infrastructure | Projects | Who is Responsible?
Authentication, Allocation and Authorisation Infrastructure with 2-factor Security Controls | Jisc-led Safe Share Project already underway. Research Domain aspects of AAAI need to be constructed. | Jisc and partners from ESRC and MRC; Research Domains
Data-in-flight Information Assurance | Jisc | Jisc, Research Domains
Data-at-rest Information Assurance | No overall description, or indeed none | NeI as a whole
Data abstraction layer development | NeI Projects | PDG members, RCs
Networks | High Capacity Networking | Jisc
Networks | Local Research Organisation | RO
Networks | Links to Business | Jisc
Advanced Compute | NeI Projects | PDG members, RCs
Data Storage Facilities | NeI Projects | PDG members, RCs
Cloud/Grid Infrastructure | GridPP, JASMIN2, EMBASSY CLOUD, eMedLab | Cloud WG, PDG
Tools and Software | Varied; no coherence | Big Data SIG, PDG and RCs

What needs to be tried out and tested?

The tools and software needed to discover data and move data around (needed for multiple data sources) need
to be developed into a coherent and simple package.

Listed below is a set of deliverables that can be achieved in 2015 to enable this. However, these are dependent
on the activities listed in Table 1, which is why the tests will be done in the field on live NeI systems.

Table 2: List of Deliverables

Recommendations | Action | Milestone (OWNER)
Training in data management | Projects to produce data management plans and run courses on data management for user communities and staff. CDTs to be involved. | DMPs and Courses in place by June 2015 (PDG)
Document the current metadata landscape and best practice | RCs to document the relevant Metadata standards and publish these standards. Create code libraries that applications can use to produce metadata when data are produced. | Publish Standards and insist on their use, particularly when data are produced (RCUK). Demonstrate on PDG Projects systems (PDG)
Develop data abstraction layer | Build, test and open source software tools for data abstraction and presentation of meta-data | Integration of iRODS and OpenStack as a POC for data integration and presentation (PDG)
Co-ordination of International Projects to extract best value and influence Agendas | Produce report on the various national and international projects the UK is involved in | Produce Strategy Document (RCUK)
Working with Catapult Centres | Work with Innovate UK to ensure that business has access to Janet. RCUK NeI Group to communicate to academic community opportunities to work with Catapult Centres | Simple Contracts and portal to make sure Business can book network access easily (Jisc). Adding to existing regular research bulletins (RCUK). Organise joint academic/Innovate UK workshops to link academy to Catapults (RCUK)
Data Transfer 1: Data transport and orchestration | Make FTS a generic tool to act as an aggregator and orchestrator and link to the RCUK AAAI | Test on the DiRAC, JASMIN2 and eMedLab systems (PDG, Jisc)
Data Transfer 2: High Capacity Network Access | Secure Transport of Data to eMedLab and RAL WOS | Transfer of multi-PB EBI data to eMedLab and DiRAC@Durham Data to RAL WOS (PDG, Jisc)
Data Transfer 3: Creating Single Name Spaces | Create WLAN and VLANS in projects to create single filesystems (global spaces) between distributed systems | Test on DiRAC systems between Durham and Edinburgh and between EBI and eMedLab (PDG, Jisc). Test on wLHC and DiRAC (PDG, Jisc)
Knowledge Transfer and Consultancy | Produce Work programme | Produce by April 2015

7. An Imagined Data Infrastructure: Another Traditional View
A schematic of what a National Data e-Infrastructure may look like. Note the ubiquitous presence of Janet.
Key: a Janet Connection
The Proposed CBL and RDSL would be the enabling middleware infrastructure for this e-Infrastructure.

[Figure: schematic of a National Data e-Infrastructure. Labelled components: HEI 1, HEI 2, HEI 3; DIAMOND;
Sanger, EBI, ESRC; DiRAC, ARCHER, JASMIN2; National Deep Archive Service; National Tertiary Storage Service;
Meta Data Presented to World; Local Tertiary Storage Layer; Database Creation/Ingestion Layer and Analytics;
Parallel File System, HEI RDM/Repository; Data Generator (Experiments, Clusters, PCs). Inset: the attributes and
functional blow-up of a typical Local System, the National Tertiary Storage Service and the National Deep
Archive Service.]

The principal components needed for such an e-Infrastructure are:
1. Local tertiary storage platforms for active data.
2. A Database Creator/Ingestor widget to create structured data from unstructured data, and policies to
tag such data with metadata, e.g. owner, project, grant no. etc.
3. A National tertiary storage/metadata service to build up and store metadata from the other databases
in the National e-Infrastructure, as well as to store our major active databases.
4. A National Deep Archive Service to store data that has been produced by National Facilities and to
provide data replication services for the National e-Infrastructure.
This is a traditional representation of a computing infrastructure. It is very much the end point of the proposed work
in this document, which is why it belongs at the end.
The work proposed in this document enables this infrastructure to exist in an efficacious way. The outputs we
propose are the real Data Infrastructure in that they enable data to be moved, selected, and queried. It is these
that give the data its form and value.
