Imagining the UK National Data Infrastructure
Connecting up Big Data in the UK
Report of the UK National e-Infrastructure Project Directors Group workshop held at the Farr Institute, London, 15th December 2014
Authors:
David Fergusson, Francis Crick Institute
David Colling, Imperial College / GridPP / WLCG
David de Roure, University of Oxford / ESRC
Martin Hamilton, Jisc (editor)
Brian Matthews, STFC
Jacky Pallas, University College London / eMedLab
David Salmon, Jisc
Jeremy Yates, University College London / STFC DiRAC
Contents
1. Purpose and scope
2. Integration
3. Capability
4. Connections
5. Infrastructure
6. Deliverables
7. An Imagined Data Infrastructure: Another Traditional View
1. Purpose and scope

Figure 1 - The UK National Data Landscape for Research

The working group have made a number of recommendations in the key themes of Integration, Capability, Connections, and Infrastructure (as identified in the EPSRC e-Infrastructure roadmap [1]), and we outline some key deliverables for 2015.

[1] http://www.epsrc.ac.uk/newsevents/pubs/e-infrastructure-roadmap/
2. Integration

Our aspiration is for the UK to have an integrated e-infrastructure: one that is run and managed as a whole without silos or boundaries, where there are simple processes by which users can get access to the e-infrastructure they need across the eco-system, as appropriate for the type or stage of research they are doing.

We do not envisage the UK data infrastructure as a single system but rather an integrated solution which reflects the range of excellent science supported via both large-scale projects and research institutions.

We propose to build on existing resources and work towards better integration through best practice for sharing data coupled to extensive training support.
The UK engages in a broad range of international projects such as EUDAT, ELIXIR, and SKA. There is a need for a single voice for the UK in the international arena which can represent the academic community in large collaborations.

Recommendation: Build on international activity (standards, policies etc.) in a more strategic and co-ordinated way. There is a role for an RCUK coordinator to ensure that the UK gets value for money from its involvement in, and subscriptions to, large-scale international collaborations.
There is an expectation that significant capital investment in the research e-Infrastructure should deliver benefits for UK industry, especially allowing SMEs to benefit through access to big data and compute resources. Some of these benefits can be realised through direct collaboration between industry and academic institution(s). However, we believe that there are additional opportunities in leveraging funding with Innovate UK and established (or future) Catapult Centres.

Recommendation: Identify funding opportunities within existing streams to allow academic institutions to interact more effectively with the existing and projected future Catapult centres, as a mechanism to engage industry more effectively around key areas such as digital health and future cities/urban transformation.

We can only work effectively and share data with researchers, whether UK, international or industry, if datasets are managed and discoverable.
Standards

o Datasets
o Metadata, e.g. schema.org, CKAN, DataCite and others. We need methods to capture metadata automatically.
o Internationally agreed, community driven
o Domain/project specific, regulatory (e.g. health)
o De facto standards have often been driven by common hardware in instruments across domains, e.g. EXIF in digital imaging.

We then need to layer Discovery Metadata on top of those domain-specific metadata standards. In some domains these are well established, such as Biosharing.org; however, this is not widely the case.
Metadata is a key enabler of data management and discovery, and at big data scales its collection and sometimes its use must be automated. However, there is a need to document the current metadata landscape and best practice, and identify areas for further development, improvement and standardization. This will become a living document, in collaboration with those organisations involved in the Open Research Data and Data Transparency areas, e.g. the Digital Curation Centre and HE institutions.

Recommendation: Metadata is a key enabler of data management and discovery, which at big data scales must be automated. However, there is a need to document the current metadata landscape and best practice, and identify areas for further development, improvement and standardization.
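As an illustration of what automated metadata capture could look like at its simplest, the following sketch derives a small schema.org-style Dataset record from a file on disk. It is not a method proposed by the working group: the function name `capture_metadata`, the choice of fields and the use of a SHA-256 checksum are all assumptions made for this example, and the property names should be checked against the schema.org vocabulary before any real use.

```python
import hashlib
import json
import mimetypes
import os
from datetime import datetime, timezone


def capture_metadata(path, chunk_size=1 << 20):
    """Derive a minimal schema.org-style Dataset record from a file on disk.

    Hypothetical helper for illustration only; the selection of fields
    is an assumption, not a recommendation from the report.
    """
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)

    stat = os.stat(path)
    media_type, _ = mimetypes.guess_type(path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)

    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": os.path.basename(path),
        "dateModified": modified.isoformat(),
        "distribution": {
            "@type": "DataDownload",
            "contentSize": f"{stat.st_size} B",
            "encodingFormat": media_type or "application/octet-stream",
            "sha256": sha256.hexdigest(),
        },
    }


if __name__ == "__main__":
    import sys

    # Print one JSON record per file named on the command line.
    for filename in sys.argv[1:]:
        print(json.dumps(capture_metadata(filename), indent=2))
```

Everything in the record is computed from the file itself, which is the point of the sketch: descriptive fields such as owner or project cannot be derived this way and would still need to come from the researcher or from project-level defaults.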
In order to promote sharing at scale, researchers must see some benefit beyond compliance with RCUK and other funder policies. Sharing of datasets should bring academic credit through data citations (for example via the DataCite consortium), with DOIs or other persistent identifiers being associated with published datasets. Publication of datasets should be captured as an impact outcome of funded research through metrics portals such as Researchfish. Jisc are also reviewing proposals for innovations in data management in the Research Data Spring initiative [2].

Recommendation: Recognise the impact of research datasets on the community through the use of DOIs or other common identifiers and, equally, give credit to researchers for generating datasets. Metrics should be captured via existing mechanisms such as Jisc, Gateway to Research and Researchfish.
[2] http://www.Jisc.ac.uk/rd/projects/research-data-spring
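To make the data citation recommendation above more concrete, the sketch below formats a citation string for a published dataset from a handful of metadata fields. The pattern used (creators, year, title, publisher, DOI link) is along the lines of common data citation practice, but the exact layout here is an assumption, as are the class name `DatasetRecord`, the function `format_citation` and every value in the usage example, including the placeholder DOI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Minimal citation metadata for a dataset (illustrative field set)."""
    creators: tuple  # e.g. ("Smith, J.", "Jones, A.")
    year: int
    title: str
    publisher: str
    doi: str  # bare DOI, without the resolver prefix


def format_citation(record):
    """Return a human-readable citation ending in a resolvable DOI link."""
    creators = "; ".join(record.creators)
    return (
        f"{creators} ({record.year}): {record.title}. "
        f"{record.publisher}. https://doi.org/{record.doi}"
    )


if __name__ == "__main__":
    # All values below are made up for the example.
    example = DatasetRecord(
        creators=("Smith, J.", "Jones, A."),
        year=2015,
        title="Example survey data",
        publisher="Example University",
        doi="10.5072/example-dataset",
    )
    print(format_citation(example))
```

Because the identifier is rendered as a link, a citation in this form gives both credit to the creators and a route back to the dataset.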
3. Capability

There is broad recognition of the concept of research data management as an essential activity across the project lifecycle rather than just a paper exercise at the time of grant submission, as illustrated in the DCC Data Lifecycle model below. RCUK has driven the requirement for institutions to show leadership in research data management, with a joint position on Data Management [3] and the EPSRC in particular asking HEIs to meet specific standards by May 2015 [4].

Figure 2 - The Digital Curation Centre Lifecycle Model

Training in research data management needs to speak to projects/centres, institutions and individual researchers at all levels. There is a huge opportunity to reach Early Career Researchers in particular through existing Centres for Doctoral Training via a "train the trainers" approach.

[3] http://www.rcuk.ac.uk/research/datapolicy
[4] http://www.epsrc.ac.uk/files/aboutus/standards/clarificationsofexpectationsresearchdatamanagement/
Recommendation: Training in data management - Build upon existing PDG, SSI and DCC activities to create a concerted and coordinated approach to promoting best practice in data management. Capitalize on existing activities to orchestrate this, e.g. "train the trainers", whereby the actual training is delivered by projects and institutions.
4. Connections

User management

User management systems are essential to enable researcher access to regional and national systems. This is especially important for the health informatics and administrative data networks which require additional security and two-factor authentication systems.

There are existing activities around Shibboleth, SAFE, VOMS, Moonshot and Safe Share, but existing well-established services and facilities have their own approaches that need to be taken into account. Pilots will lead to recommendations for common standards.

There is a particular role for Jisc and RCUK here in terms of international standards liaison, e.g. W3C, schema.org, Research Data Alliance. This will require wider buy-in from the community as well as pump-priming funding.
5. Infrastructure

Networks

The group felt that with the recent investment in Janet6, the network had sufficient capacity and room for expansion. However, access to high capacity for short periods would increasingly be required.

A number of points were raised about campus networks which would be challenging, difficult or expensive to address:

o Last mile - e.g. campus network to end user. Is the campus LAN fit for purpose for NeI users?
o Do campus firewalls have sufficient throughput?
o Is the campus Janet connection oversubscribed / is a separate research connection required?
o What would a campus focal point look like? e.g. GridPP use of Squid cache
o Estates constraints on many institutions - listed buildings, busy city streets etc.
o Investment in Janet6, improved connectivity to major research institutions and improved resilience for day-to-day use.

Q: Do we need a new equivalent to the HEFCE LAN/MAN initiative?

Recommendation: The group felt that more flexible access to high capacity networking for defined periods would increasingly be required. For example, the eMedLab project will be moving 2.5PB of data from EBI at the start of the project (April 2015).
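A rough calculation shows why a transfer of this size calls for high capacity over a defined period. The sketch below estimates how long 2.5PB would take at a few link rates; the link rates, the decimal definition of a petabyte and the `efficiency` parameter are assumptions for illustration and do not come from the report.

```python
PETABYTE = 10 ** 15  # decimal petabyte; an assumption, the report does not specify


def transfer_days(size_bytes, link_bits_per_second, efficiency=1.0):
    """Estimate transfer time in days for a sustained bulk transfer.

    `efficiency` is the fraction of the nominal link rate actually
    achieved (protocol overhead, contention); 1.0 is the ideal case.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = (size_bytes * 8) / (link_bits_per_second * efficiency)
    return seconds / 86_400


if __name__ == "__main__":
    dataset = 2.5 * PETABYTE
    # Hypothetical link rates chosen to show the scaling.
    for gbps in (1, 10, 100):
        days = transfer_days(dataset, gbps * 10 ** 9)
        print(f"{gbps:>3} Gbit/s: {days:6.1f} days")
```

Under these assumptions the ideal transfer takes roughly 231 days at 1 Gbit/s, 23 days at 10 Gbit/s and 2.3 days at 100 Gbit/s, before any allowance for overhead or competing traffic.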
Archive

There was much discussion around archives, defined as long-term storage of immutable datasets. Some projects have their own archives and some disciplines have international repositories (e.g. EBI). However, the RCUK data sharing policy has specific requirements to make research data objects available for up to 10 years after the last requested access.

The group felt that it was difficult to focus on approaches offered by individual institutions and proposed a survey of the data management landscape. Any institutional archive should provide DOIs or persistent identifiers for datasets to allow discovery, and a means of crediting researchers for creating and depositing datasets (as outlined earlier).
6. Deliverables

Pre-Requisites

The Data Analytics and Open Research Data activities in the data e-Infrastructure should be supported by a simple layered middleware and software e-Infrastructure. This e-Infrastructure should consist of a Common Basic Layer (CBL) on which a Research Domain Specific Layer would sit. The Common Basic Layer (CBL) should therefore be small and capable of generic use. The Research Domain Specific Layer (RDSL) needs to be constructed at the same time.
Key elements of the CBL are:

o The AAAI and Security Models: I am who I am and I can use resources.
o Control access to data: the RCUK AAAI project SAFE SHARE is delivering aspects of this.
o Data In-flight Security: my data is going to flow OK and only the right people will get it and see it.
o Data at-rest Security: it is looked after and I am obeying the pertinent regulations. The data are open to those who are allowed to see them; they are searchable and queryable.
o Cloud/Grid middleware to enable appropriate resources to be used. From the user perspective this can be broken down into the following attributes:
   1. Can I see resources?
   2. You can use resources.
   3. Actually using resources.
   4. Here is what you have used.
   5. Here are your results, in the place you asked them to be put.
o Wrapping compute around big data: use of virtualisation and containers to send our workflows to where the data reside. The local compute simply executes the workflows we have constructed/run on other machines.
o An Application Program Interface (API) that allows Data Policies (e.g. metadata requirements) to be actualised in applications.
o Simple Tools and Services to enable data discovery and exploration. Data can be accessed and queried using published metadata and data transport tools.
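One way to picture the API element in the list above, in which data policies such as metadata requirements are actualised in applications, is a small check that an application runs before accepting a deposit. The sketch below is hypothetical throughout: `DataPolicy`, `deposit`, the in-memory list standing in for a store, and the field names (owner, project, grant number) are assumptions for illustration rather than part of any existing CBL interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPolicy:
    """A named set of metadata fields that every deposit must carry."""
    name: str
    required_fields: frozenset

    def missing_fields(self, metadata):
        """Return the required fields that are absent or blank, sorted."""
        missing = []
        for key in sorted(self.required_fields):
            value = metadata.get(key)
            if value is None or not str(value).strip():
                missing.append(key)
        return missing


def deposit(metadata, policy, store):
    """Append `metadata` to `store` only if it satisfies `policy`.

    `store` is a plain list here; a real service would sit behind it.
    Returns the index of the new entry.
    """
    missing = policy.missing_fields(metadata)
    if missing:
        raise ValueError(
            f"policy '{policy.name}' not met; missing: {', '.join(missing)}"
        )
    store.append(dict(metadata))
    return len(store) - 1


if __name__ == "__main__":
    # Hypothetical policy and record.
    policy = DataPolicy("basic-deposit", frozenset({"owner", "project", "grant_number"}))
    archive = []
    deposit({"owner": "A. Researcher", "project": "Demo", "grant_number": "X/0000/1"}, policy, archive)
    print(policy.missing_fields({"owner": "A. Researcher"}))
```

Keeping the policy as data rather than code means the same check could be shared by several domain-specific applications, which fits the idea of a small, generic common layer.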
An RDSL would have elements such as:

o Applications or web portals that allow its researchers to use CBL services. These are the user-friendly User Interface (UI) and would be the gateway to the NeI for the average researcher.
o If needed, extra security and AAAI requirements could be included here.
o Access to training resources could be included, such as online courses and tests.
o The interfaces and APIs to the Data Analytics and Open Research Data infrastructures would reside in the RDSL.
Hardware will be domain and activity specific. However, object stores that can act as repositories could be centralised and be a common activity between the RCs.

In terms of current activities, our progress in creating these Pre-Requisites is also listed below.
Table 1: Pre-Requisites for the Data Infrastructure
Infrastructure | Projects | Who is Responsible?
Jisc
Data-at-rest
Information assurance
No overall description, or indeed NeI as a whole
none
Data abstraction layer development
NeI Projects
Networks
Jisc
RO
Links to Business
Jisc
Advanced Compute
NeI Projects
NeI Projects
Cloud/Grid Infrastructure
Varied no coherence
What needs to be tried out and tested?

The tools and software needed to discover data and move data around (needed for multiple data sources) need to be developed into a coherent and simple package. Below are listed a set of deliverables that can be achieved in 2015 to enable this. However, these are dependent on activities listed in Table 1. This is why the tests will be done in the field on live NeI systems.
Table 2: List of Deliverables

Recommendation: Co-ordination of International Projects to extract best value and influence Agendas
Action: Produce report on the various national and international projects the UK is involved in
Milestone (OWNER): Produce Strategy Document (RCUK)
Action: Make FTS a generic tool to act as an aggregator and orchestrator and link to the RCUK AAAI
Milestone (OWNER): Test on the DiRAC, JASMIN2 and eMedLab systems (PDG, Jisc)
7. An Imagined Data Infrastructure: Another Traditional View

The proposed CBL and RDSL would be the enabling middleware infrastructure for this e-Infrastructure.

[Figure: an imagined data infrastructure. Labels: HEI 1, HEI 2, HEI 3, DIAMOND; National Deep Archive Service; JASMIN2; National Tertiary Storage Service; Meta Data Presented to World; Local Tertiary Storage Layer; Data Generator: Experiments, Clusters, PCs...]
The principal components needed for such an e-Infrastructure are:
1. Local tertiary storage platforms for active data.
2. Data Base Creator/Ingestor widget to create structured data from unstructured data, and policies to metadata-tag such data, e.g. owner, project, grant no. etc.
3. A National tertiary storage/metadata service to build up and store metadata from the other databases in the National e-I, as well as store our major active databases.
4. A National Deep Archive Service to store data that has been produced by National Facilities and to provide data replication services for the National e-Infrastructure.
This is a traditional representation of a computing infrastructure. It is very much the end point of the proposed work
in this document, which is why it belongs at the end.
The work proposed in this document enables this infrastructure to exist in an efficacious way. The outputs we
propose are the real Data Infrastructure in that they enable data to be moved, selected, and queried. It is these
that give the data its form and value.