
Research Data Curation and

Lifecycle Management
REPORT
















Submitted to: Pennington Biomedical
Research Center
Submission date: April 24, 2014
Submitted by: Just in Time Data Solutions



Prepared by: Jennifer Clark
jlclark6@illinois.edu

Katie Schmitt
kmschmi2@illinois.edu

Troy Babbs
babbs2@illinois.edu

Katrina Durance
kduran2@illinois.edu















Table of Contents

Executive Summary

The Curation Lifecycle

Getting Started
Educate & Plan
Receive & Pre-process
Appraise & Select
Secure
Quality Assurance
Store & Preserve
Access, Use & Reuse
Transform

Glossary
Workflow Diagrams

Appendices

Appendix A: Data Management Plan
Appendix B: Recommended Metadata Schemas and Tools
Appendix C: Descriptive Metadata Template
Appendix D: Deposit Agreements
Appendix E: Repository Software Recommendations
Appendix F: Budget Tools






Executive Summary
This report has been prepared for Pennington Biomedical Research Center by Just in Time Data Solutions
(JTDS) to assist in the storage and preservation of research data for current and future analytical use. It
assigns responsibility for implementation and the subsequent workflows to Pennington Biomedical's
Library and Information Center staff. The report features nine sections, each relating to a phase of the
lifecycle, as well as a glossary defining all bolded terms, workflow diagrams, and six appendices. In this context, data
is defined as the digital output of researchers and can include manuscripts, images, research data sets,
supplemental computer code, and any supplemental lab reports/documentation.
The report presented is based in part on the Digital Curation Centre (DCC) Lifecycle Model (Figure 1).
While JTDS has used the DCC Lifecycle Model as a backbone for the plan, recommendations presented
in the report are considered to be the best option for Pennington Biomedical, regardless of their
conformance to the DCC Model.


Figure 1. Digital Curation Centre (DCC) Lifecycle Model.

The proposed data curation plan and lifecycle have been designed specifically for Pennington Biomedical
research data and more accurately reflect the action steps to take place in each phase (Figure 2).


Figure 2. Pennington Biomedical Data Curation Lifecycle.










The Curation Lifecycle: Getting Started



Conducting a Data Audit

Before a data curation plan can be implemented, the Library and Information Center staff should conduct
a data audit. This will allow the staff to better understand Pennington Biomedical's potential digital
holdings as well as benchmark Researchers' inclinations to transfer responsibility for research data. To
conduct the audit, JTDS recommends Library and Information Center staff begin with the Data Asset
Framework developed by HATII at the University of Glasgow in conjunction with the DCC. The Data
Asset Framework provides example interview questions and web surveys to give institutions "the
means to identify, locate, describe and assess how they are [currently] managing their research data
assets." The data audit will also begin an open dialogue between Researchers and the Library and
Information Center staff and will ensure a successful implementation of the curation plan outlined in this
document.

The Implementation Guide: http://www.data-audit.eu/docs/DAF_Implementation_Guide.pdf
More on the Data Asset Framework: http://www.data-audit.eu/index.html
Audit Trail
JTDS strongly suggests an audit trail be created immediately upon the implementation of this data
curation plan. A large portion of the curation process takes place outside of the repository software;
therefore, each step should be tracked upon completion. This includes, but is not limited to:
• Date/time a digital item is received by the Library and Information Center staff
• Date/time the item is processed
• Name of staff member who processed the item
• Date/time a repository record is created
• Name of staff member who created the record

The audit trail will ensure Library and Information Center staff are completing the workflow in a timely
and accurate manner. It will also provide detailed information in the case that a digital item is either lost
or corrupted. JTDS recommends that specific audit trail information be captured both in an item's
administrative metadata (see Appendix B) and in a secure spreadsheet.
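
As an illustration only, the tracked steps listed above could be appended to a CSV file that backs the secure spreadsheet. The following is a minimal sketch under stated assumptions: the column names, the event labels, and the function name append_audit_entry are hypothetical and are not part of any tool named in this report.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column names mirroring the audit trail items listed above.
AUDIT_FIELDS = ["item", "event", "staff_member", "timestamp"]


def append_audit_entry(log_path, item, event, staff_member, when=None):
    """Append one audit trail row to a CSV file, writing the header if the file is new."""
    log_path = Path(log_path)
    when = when or datetime.now(timezone.utc)
    row = {
        "item": item,
        "event": event,  # assumed labels, e.g. "received", "processed", "record_created"
        "staff_member": staff_member,
        "timestamp": when.isoformat(timespec="seconds"),
    }
    is_new = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=AUDIT_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Each call records one step, so a single item would typically accumulate one row per event (received, processed, record created), each with the responsible staff member and a timestamp.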



Educate & Plan
DCC Lifecycle Step: Conceptualise





At the beginning of the data lifecycle, it is important that Library and Information Center staff ensure data
is created or collected by Researchers in an efficient manner. Researchers should be made aware of
Pennington Biomedical's digital curation workflow (Figure 2) and the digital repository managed by the
Library and Information Center staff. The roles of staff and Researchers in the workflow should be clearly
defined. It is critical that Researchers know what is expected before they create or collect data and
throughout the digital curation process for the workflow to be a success. JTDS stresses the need for open
and consistent communication between Researchers and Library and Information Center staff, as the
curation process will be new and will evolve over time.

JTDS recommends that Pennington Biomedical publish a website to facilitate communication regarding
the data lifecycle. The site should give Researchers information on policies and procedures for the new
repository, answer frequently asked questions, and ensure consistent processes. Contact information and
hours for Library and Information Center staff should be easily accessible.

Examples of similar websites:
University of Michigan - ICPSR: http://www.icpsr.umich.edu/icpsrweb/deposit/index.jsp
Purdue University Research Repository (PURR): https://purr.purdue.edu/about

Data Management Plan

JTDS encourages Library and Information Center staff to provide a customized sample Data
Management Plan (DMP) to its Researchers. DMPs are required for many funding sources and are an
integral part of open science. A sample DMP with details specific to Pennington Biomedical's repository
will not only save Researchers time, but will also assist in the grant application process (see Appendix A).





Receive & Pre-process
DCC Lifecycle Step: Create or Receive


Immediately following the completion of a research or analysis project, all data and supplemental files
(including code) should be transferred to a secured shared drive that both the project's Researchers and
Library and Information Center staff may access. While this step may occur well before project results are
published, JTDS believes that this proactive transfer of responsibility will encourage Researchers to make
curation part of their workflow. If publication of the research/analysis occurs at a later time, the
pre-publication peer-reviewed manuscript, as well as any other supplemental figures and images, should also be
transferred to the Library and Information Center staff at the notification of publication. Library and
Information Center staff should request Researchers sign a Deposit Agreement (see Appendix D) at the
time Pennington Biomedical takes ownership of the digital files, if they have not done so already.

File Formats

Files should be delivered to Library and Information Center staff in a non-proprietary, uncompressed
format. Research data should be stored in comma separated files (.csv), manuscripts and all
accompanying documentation should be stored as plain text (.txt), and images should be stored as TIFF
files (.tiff). As the majority of Pennington Biomedical research items are stored in Microsoft Office
proprietary file formats (.doc, .docx, .xls, .xlsx), the Library and Information Center staff may decide to
encourage the transfer of data in proprietary formats which will later be processed into the recommended
formats. If this decision is enacted by the Library and Information Center staff, it should be well-
documented, included in deposit agreements and audit trails, and originals should be kept for quality
assurance purposes later in the lifecycle.

File Names

All digital files that are deposited by Researchers should follow Pennington Biomedical's file naming
conventions. If the research item is not properly named, a member of the Library and Information Center
staff should rename the file before proceeding. For file naming conventions, JTDS recommends that the
Researcher's name or initials, at least a portion of the title, and a date associated with the item be included
in the file name. JTDS also recommends that the length of the name not exceed 207 characters. The file
name should not include spaces or periods, except for one period before the file extension. The file
naming convention/rules should be clearly documented on Pennington Biomedical's repository website maintained by
the Library and Information Center staff.
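
The mechanical parts of these rules (length, spaces, periods) lend themselves to an automated check before a file is accepted. The sketch below is one possible approach, not a prescribed tool: the function name file_name_problems is hypothetical, the 207-character limit is assumed to apply to the full name including the extension, and the presence of a name, title, and date is not checked because it cannot be verified mechanically.

```python
MAX_NAME_LENGTH = 207  # limit recommended in this report; assumed to include the extension


def file_name_problems(file_name):
    """Return a list of rule violations for a proposed file name (empty if none)."""
    problems = []
    if len(file_name) > MAX_NAME_LENGTH:
        problems.append("name exceeds 207 characters")
    if " " in file_name:
        problems.append("name contains spaces")
    if file_name.count(".") != 1:
        problems.append("name must contain exactly one period, before the extension")
    elif file_name.startswith(".") or file_name.endswith("."):
        problems.append("period must separate a base name from an extension")
    return problems
```

For example, a name such as JClark_SleepStudy_2014-04-24.csv (an invented example) passes all three checks, while draft.v2.txt is flagged for its extra period.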




Resources for best practices of file naming:
Stanford University Libraries:
http://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming
University of Leicester: http://www2.le.ac.uk/services/research-data/organise-data/naming-files
University of Illinois at Urbana-Champaign:
http://www.library.illinois.edu/dcc/pdfs/best_practicespdfs/02_best_practices_for_file_naming_opt.pdf

Metadata

Metadata facilitates the management, use, and retrieval of a digital object or record, and it is a critical part
of the digital curation process. JTDS strongly recommends that metadata be created and validated soon
after an object is first received to ensure accuracy, a complete audit trail, and the availability of the
depositing Researcher. Metadata schema recommendations, as well as links to specifications for these
schemas are included in Appendix B.

The majority of descriptive metadata will be provided by the Researcher upon transfer of data to Library
and Information Center staff. At a minimum, the Researcher should provide:

• Title
• Author
• Publisher (if applicable)
• Journal Name (if applicable)
• Abstract/Description
• Related Publications/Datasets
• Grant Support
• Researcher ID
• Null value (for data sets)
• Embargo Periods

This information can be provided in a plain text template transferred with the research item. An example
has been provided in Appendix C. Library and Information Center staff should review the submission
form to find missing or inconsistent fields. Researchers should then be contacted to clarify any potential
confusion before the research item moves on to the Appraise & Select phase.
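
The review for missing fields can be partly automated. The sketch below assumes a simple "Field: value" layout for the plain text template; that layout is an assumption made for illustration and may differ from the template in Appendix C. The function names are hypothetical, and the fields marked "(if applicable)" or limited to data sets are treated as optional.

```python
# Minimum descriptive fields listed above. Fields marked "(if applicable)" and the
# data-set-only null value are treated as optional in this sketch.
REQUIRED_FIELDS = [
    "Title",
    "Author",
    "Abstract/Description",
    "Related Publications/Datasets",
    "Grant Support",
    "Researcher ID",
    "Embargo Periods",
]


def parse_template(text):
    """Parse 'Field: value' lines into a dict (assumed layout, not necessarily Appendix C's)."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)  # split once so values may contain colons
            fields[name.strip()] = value.strip()
    return fields


def missing_fields(text):
    """Return the required field names that are absent or left blank."""
    fields = parse_template(text)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]
```

A non-empty result gives staff a concrete list of fields to raise with the Researcher before the item moves on.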

File Fixity

For preservation and validation purposes, a checksum should be generated for any digital items at the
time they are transferred to the Library and Information Center staff. Checksums assist with fixity checks
throughout the preservation process. Many online tools exist to generate these numbers, including Online
MD5|SHA1, available at http://onlinemd5.com/. The checksum should be stored in the plain text metadata
template for later use (see Appendix C).
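
A checksum can also be generated locally with a short script instead of an online tool. The following minimal sketch uses Python's standard hashlib module; the function name file_checksum is hypothetical, and MD5 is used as the default only because it is one of the algorithms offered by the tool named above.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading in chunks so large files are not loaded whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Passing algorithm="sha1" or algorithm="sha256" selects a different algorithm; whichever is chosen should be recorded alongside the checksum so later fixity checks use the same one.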

Appraise & Select
DCC Lifecycle Step: Appraise & Select


The Library and Information Center staff should create an official written collection policy to define
which data files will be stored in the Pennington Biomedical repository. JTDS recommends that Library
and Information Center staff choose to store, at minimum, de-identified data and supplemental files
related to studies conducted by Pennington Biomedical. Pennington Biomedical may also choose to store
pre-publication, post-peer-review manuscripts and supplemental images. Distinctions that the
collection policy may define include:

Scope
Purpose
Documentation
Accuracy
Acceptable file formats
Federal Mandates

Because Pennington Biomedical is a federally funded organization, it may be affected by the Open Data
Policy mandated by Executive Order on May 9, 2013 as well as the Open Access Policy mandated by the
National Institutes of Health. With the help of Pennington Biomedical legal counsel, the collection policy
should ensure Pennington Biomedical remains in compliance in order to receive future federal monies.
Deposited materials which meet the terms of the collection policy should move on to the Secure phase.

The collection policy should also define steps that will be taken when deposited materials do not meet the
requirements for long-term storage. JTDS recommends that Researchers be made clearly aware of the steps
for rejected materials and that those steps be followed exactly. Grounds for rejection include personally
identifiable or HIPAA-protected data, which should be either returned to the Researcher or securely
disposed of.

Examples of a collection policy include:
Purdue University Research Repository (PURR): https://purr.purdue.edu/legal/collection-policy
Western Australia Department of Health:
http://www.health.wa.gov.au/CircularsNew/attachments/664.pdf




Secure
DCC Lifecycle Step: Ingest


During the Secure phase of the lifecycle, each data file should be moved to a secured file folder or
network drive which only the Library and Information Center staff may access. This process ensures that
any future manipulation of the object may only be done by Library and Information Center staff. This
folder or drive should be backed up frequently - preferably once every 24 hours, or more often.

At this point, JTDS recommends Library and Information Center staff pause to confirm that each data file
is ready to move on in the lifecycle. Each file should be accompanied by a plain text metadata or
description file as outlined in the Receive and Pre-process step. The Library and Information Center staff
should also have a signed deposit agreement from the Researcher (more information about deposit
agreements and why they are an essential part of the curation lifecycle can be found in Appendix D). If a
staff member confirms that the previous steps have been completed, the data files are now ready for
Quality Assurance.


Quality Assurance
DCC Lifecycle Step: Preservation Actions


To begin the preservation process, all data files should be reviewed to make sure that they are saved in
uncompressed and non-proprietary formats and conform to Pennington Biomedical's file naming
conventions. If the file was migrated from a proprietary format into a recommended format, the Library
and Information Center staff should validate the results against the original files by opening both files and
comparing the new copy to the original. Additionally, as outlined previously, JTDS recommends audit
trails are consistently recorded to make certain any changes and quality assurance are fully documented.

Library and Information Center staff will create administrative and preservation metadata for each item.
JTDS recommends METS for administrative metadata and PREMIS for preservation metadata (links to
specifications for these schemas are included in Appendix B). All metadata created should be stored in
XML. The Researcher's descriptive metadata submission should be entered into the descriptive schema
chosen and then wrapped inside the METS record, which will also include the METS administrative
metadata fields. The PREMIS record should be kept as a separate file. Both file names should match the
file name of the data file with either _METS.xml or _PREMIS.xml added to the end.
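
The naming rule for the two metadata files can be expressed as a small helper. This sketch assumes that the suffix replaces the data file's extension, which keeps the result consistent with the one-period rule in the file naming conventions; the report does not state this explicitly, and the function name metadata_file_names is hypothetical.

```python
from pathlib import Path


def metadata_file_names(data_file):
    """Return the METS and PREMIS file names that accompany a data file."""
    stem = Path(data_file).stem  # file name without its extension
    return f"{stem}_METS.xml", f"{stem}_PREMIS.xml"
```

Under this assumption, an invented data file named JClark_SleepStudy_2014-04-24.csv would be accompanied by JClark_SleepStudy_2014-04-24_METS.xml and JClark_SleepStudy_2014-04-24_PREMIS.xml.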

Finally, a fixity/checksum check should be performed to ensure that the transfer did not affect the file. If
changes were made to the file, a new checksum should be generated and stored in the PREMIS file.
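
A fixity check amounts to recomputing the checksum and comparing it with the stored value. The sketch below is one way to do this with Python's standard hashlib module; the function name verify_fixity is hypothetical, and the stored checksum is assumed to be a hexadecimal string produced with the same algorithm.

```python
import hashlib


def verify_fixity(path, stored_checksum, algorithm="md5"):
    """Recompute a file's checksum and report whether it matches the stored value."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    # Normalise the stored value so stray whitespace or upper case does not cause a false alarm.
    return digest.hexdigest() == stored_checksum.strip().lower()
```

A result of False after a transfer indicates that the file differs from the version originally received and should be investigated before a repository record is created.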






Store & Preserve
DCC Lifecycle Step: Store


Ultimately, Pennington Biomedical's research items should be stored in an institutional repository with
preservation capabilities for long-term management of these items. However, it may be necessary in the
interim to simply create a drive on the Pennington Biomedical network to act as a temporary repository.
If an interim storage space is created, JTDS strongly recommends Library and Information Center staff
back up all files frequently - preferably every 24 hours or more often.

Repository Software

A full explanation of recommendations for repository software is included in Appendix E. Many options
and resources exist, and only JTDS's recommended options have been presented in this report.
The storage location and repository software that Pennington Biomedical chooses to implement will
dictate a specific workflow for creating a repository record to store each data file with its metadata. For
all options, Library and Information Center staff should ensure the workflow is repeatable and consistent.
Library and Information Center staff should consider the following:

Pennington Biomedical Hosted Repository - Once items are prepared, a member of the Library
and Information Center staff will create a record in the repository software, enter the required
metadata into the web form or upload the XML files, and attach the object to the record. These
steps should be documented as part of the data lifecycle website that the Library and Information
Center maintains. This part of the website should include which staff have the administrative
rights to create repository records (e.g., only full-time staff and management, or full-time staff,
management, and graduate school trainees), the timeline for record creation (i.e., how long
after an item is prepared/processed a repository record will be created), and if/how the researcher
responsible for creation of the item will be notified that a repository record has been created, as
well as screenshots with an explanation of the process for knowledge transfer purposes.

Louisiana State University (LSU) Repository - The questions included in Appendix E should
be answered by LSU staff and documented in a standard operating procedure that outlines the
entire process. This procedure should be maintained by both LSU and Pennington Biomedical staff.

Third Party Hosted Repository - Once items are prepared, a member of the Library and
Information Center staff will create a record in the repository software, enter the required
metadata into the web form or upload XML files, and attach the object to the record. These steps
should be documented in the online procedures manual that the Library and Information Center
maintains. This part of the manual should include which staff have the administrative rights to
create repository records (e.g., only full-time staff and management, or full-time staff,
management, and graduate school trainees), the timeline for record creation (i.e., how long
after an item is prepared/processed a repository record will be created), and if/how the researcher
responsible for creation of the item will be notified that a repository record has been created, as
well as screenshots with an explanation of the process for knowledge transfer purposes.

Additionally, depending on repository infrastructure implementation, it may not be possible to upload
XML metadata files directly into the repository software. If this is the case, JTDS recommends storing the
XML metadata files separately on the Pennington Biomedical network in a file folder or drive that is
backed up frequently - preferably every 24 hours or more often.
Many of the repository infrastructures/software packages recommended in Appendix E require additional
modules or add-on architecture to properly preserve objects in the repository. Links to and overviews of
certain modules/add-ons are also included in Appendix E. This list is not comprehensive but presents a
few of the best options for Pennington Biomedical.
Review Schedules

In addition to the add-on architecture, a review schedule of file formats, file fixity, and deposit
agreements should be implemented. Some of the add-on architecture for the repository software includes
configurable automatic alerts for format obsolescence. JTDS recommends that this process occur, at
minimum, yearly.
File fixity checks, using the stored checksums, will also need to be performed. JTDS recommends that
fixity checks are performed every time a research item is transferred or moved. Additionally, one percent
of research items should receive random sample checks every month, at minimum, to verify that items
remain stable.
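
The monthly one percent sample could be drawn with a short script. This is a minimal sketch under stated assumptions: the function name monthly_fixity_sample is hypothetical, the sample size is rounded up, and at least one item is always selected so that small collections are still checked.

```python
import math
import random


def monthly_fixity_sample(item_ids, fraction=0.01, seed=None):
    """Pick a random sample of items for a fixity check (at least one item if any exist)."""
    items = list(item_ids)
    if not items:
        return []
    sample_size = max(1, math.ceil(len(items) * fraction))
    # A seed may be supplied to make a given month's selection reproducible for the audit trail.
    return random.Random(seed).sample(items, sample_size)
```

Each selected item would then have its checksum recomputed and compared with the stored value, with the outcome recorded in the audit trail.
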
Backups/Redundancy

Backups of each repository implementation should also be considered. If a Pennington Biomedical-based
repository option is chosen, Pennington Biomedical will need to select either a cloud-based service or
a second geographic location for backups of the entire repository and its contents. If the LSU repository or
a third-party hosted solution is chosen, backups should be addressed with the host. The following questions
provide high-level considerations:

• Where are the backups (e.g., in a different geographical location and/or cloud-based)?
• How often do the backups occur?
• How many copies are made?
• What is the maximum down-time in case of an outage?
• How do these features affect the final cost of the repository solution?



Access, Use & Reuse
DCC Lifecycle Step: Access, Use & Reuse


As most of Pennington Biomedical's research items are subject to the directives of the
United States Office of Science and Technology Policy (OSTP) and National Institutes of Health (NIH),
these items must be discoverable and reusable with free access to metadata. Open access should drive
continual public-private collaboration and add to public knowledge without compromising
confidentiality or proprietary interests. In order for these items to be reusable, they must be
stored in a machine-readable format (.txt, .csv, .tiff) and have appropriate metadata, as outlined in the
Receive & Pre-process phase. Care must be taken to ensure that all open data has been de-identified and
that embargo periods are respected.
All recommended repository software includes a web access component, allowing all items with the proper
permissions to be available through the web (see Appendix E). Users will access the items via these
simple web forms, allowing Pennington Biomedical to monitor usage statistics and check for proper
citations. Only copies of items will be distributed to users.
All versioned or manipulated data files should trigger the beginning of the curation lifecycle for the new
file.



Transform
DCC Lifecycle Step: Transform


As outlined in the Store and Preserve phase, a review schedule should be implemented for format
obsolescence. If items require format migration, a new copy of the item in the updated format should be
created. This copy should be treated as a new item and trigger the beginning of the curation lifecycle.



Glossary
Checksum: An algorithmically-computed numeric value for a file or a set of files used to validate the
state and content of the file for the purpose of detecting accidental errors that may have been introduced
during its transmission or storage. The integrity of the data can be checked at any later time by re-
computing the checksum and comparing it with the stored one. If the checksums match, the data has not
been altered.
Collection Policy: A formal collection policy defines specific criteria to determine the value of data
deposits. The policy ensures appraisal decisions are made in an open, consistent, and lawful manner. For
more information, visit http://www.dcc.ac.uk/resources/how-guides/appraise-select-data#5
Controlled Lots of Copies Keep Stuff Safe (CLOCKSS): A global long-term archive committed to
open access. Scholarly publishers have agreed to make content available for free under a creative
commons license in the event that they can no longer supply it. The archive is distributed across 12
geopolitically and geographically diverse long-lived steward libraries that have agreed to take on an
archival role on behalf of the wider international community. For more information, visit
http://www.clockss.org/clockss/Home.
Crosswalk: A table or schema that maps one metadata standard to another, showing equivalent fields.
Data: The digital output of researchers, which may include manuscripts, images, research data sets,
supplemental computer code, and any supplemental lab reports/documentation. According to the DCC
Lifecycle Model, data is any information in binary digital form and can include simple digital objects
(discrete digital items such as text files, image files or sound files, along with their related identifiers and
metadata) or complex digital objects (discrete digital objects made by combining a number of other
digital objects, such as websites), as well as databases, which are structured collections of records or
data stored in a computer system.
Data Asset Framework: A system of interview questions and web surveys used by institutions to audit
and assess data holdings and data management procedures. For more information, visit
http://www.data-audit.eu/index.html
Data Management Plan (DMP): A plan that is generated before a scientific study begins and is often
included with grant applications. The plan states what data are to be created and managed and describes
the specific plans for preservation and access. For more information, visit
http://www.dcc.ac.uk/resources/data-management-plans.
Deposit Agreement: A receipt of transfer signed at the time custody of digital files is transferred from
researchers to the digital repository staff (see Appendix D for recommendations).
Fixity: The property of a digital object being fixed or unchanged. Fixity information, such as checksums,
provides evidence for the integrity and authenticity of the digital objects and is essential to enabling
trust.
Institutional Repository (IR): A set of services that a university offers to the members of its
community for the management and dissemination of digital materials created by the institution and its
community members. It is most essentially an organizational commitment to the stewardship of these
digital materials, including long-term preservation where appropriate, as well as organization and access
or distribution. For more information, visit
http://www.arl.org/storage/documents/publications/arl-br-226.pdf.
Metadata: Structured, descriptive data or information about digital and physical objects or records. For
more information, visit http://www.niso.org/publications/press/UnderstandingMetadata.pdf.
Administrative Metadata: Metadata that describes how the information in a digital record is
organized. This can include management metadata such as when and how the record was created,
file types, digital object identifiers (DOIs) and other technical info, and intellectual property
metadata including how proprietary data is protected and who can have access to it.
Descriptive Metadata: Metadata that captures important characteristics about an object for
discovery and identification. It can include elements such as title, abstract, author, and keywords.
Preservation Metadata: Metadata that supports and documents the process of digital
preservation. Usually reserved for metadata that specifically supports the functions of maintaining
the fixity, viability, renderability, understandability, and/or authenticity of digital materials in a
preservation context. For more information, visit
http://www.dcc.ac.uk/sites/default/files/documents/resource/curation-manual/chapters/preservation-metadata/preservation-metadata.pdf.
Open Access Policy: The National Institutes of Health enacted a Public Access Policy in 2008. This
policy ensures that the public has access to the published results of NIH-funded research. [It] requires
that these final peer-reviewed manuscripts be accessible to the public on PubMed Central to help advance
science and improve human health. For more information, visit
http://publicaccess.nih.gov/FAQ.htm#821.
Open Data Policy: The Executive Office of the President defines open data as publicly available data
structured in a way that enables the data to be fully discoverable and usable by end users. To read the
Policy, visit http://www.whitehouse.gov/sites/default/files/omb/memoranda/2013/m-13-13.pdf.
Repository Software: The technical infrastructure, or software package, for an institutional repository.
Most repository software includes architecture for a web access portal, a database, as well as
administrative portals for data management.




Workflow Diagrams














Workflow for Receive & Pre-process Phase:
Receive File -> Migrate File Format -> Rename File -> Review Metadata -> Generate Checksum

Workflow for Quality Assurance and Store & Preserve Phases:
Review File -> Create Administrative and Preservation Metadata -> Move Descriptive Metadata to XML -> Perform Checksum Check -> Create Repository Record

Appendix A: Data Management Plan
A Researcher who is applying for funding will almost always need to write a data management plan with
specific details on their repository. JTDS recommends the Library and Information Center staff provide a
sample DMP to encourage Researchers to deposit data through Pennington Biomedical and to ease the
burden of the grant application process.

Although each funding agency will have its own unique set of requirements, a majority of DMPs will
include the following points:

• What types of data will be created or collected?
• Which data will be retained?
• How will the data be managed and preserved (short and long term)?
• How will the primary data be shared?
• What factors may affect the ability to manage data (e.g., legal or ethical restrictions on non-aggregated data)?
• What other information should be preserved (e.g., code, supplemental files, metadata)?
• What formats will data be stored in?
• How will data be disseminated?
• Any additional data management requirements

JTDS recommends Library and Information Center staff use one of the following to create the sample DMP:
the DMP Tool, created by the California Digital Library; DMPonline, developed by the Digital Curation
Centre; or the templates provided by Columbia University Libraries.

DMP Tool: https://dmp.cdlib.org/
DMPonline: https://dmponline.dcc.ac.uk/
Templates from Columbia University Libraries:
http://scholcomm.columbia.edu/data-management/data-management-plan-templates/

The following are sample plans published by other institutions:

http://www.northumbria.ac.uk/static/5007/ceispdf/dmpfull.pdf
http://rci.ucsd.edu/_files/DMP%20Example%20Nitz.pdf
http://www.irss.unc.edu/odum/contentSubpage.jsp?nodeid=570






Appendix B: Recommended Metadata Schemas and Tools
Descriptive Metadata Schemas
Dublin Core - Dublin Core, maintained by the Dublin Core Metadata Initiative, is a metadata schema that
features a small set of descriptive terms for web resources. For more information, visit
http://dublincore.org/documents/dcmi-terms/.
MODS - The Metadata Object Description Schema (MODS) is a schema for a bibliographic element set
that may be used for a variety of purposes, and particularly for library applications. For more
information, visit http://www.loc.gov/standards/mods/mods-outline-3-5.html.
Metadata Crosswalk from Dublin Core to MODS:
http://www.loc.gov/standards/mods/dcsimple-mods.html
NISO JATS - The Journal Article Tag Suite (JATS) is a National Information Standards Organization
(NISO) standard that defines a set of XML elements and attributes for tagging journal articles... JATS is
a continuation of the National Library of Medicine Archiving and Interchange DTD work begun in 2002
by the National Center for Biotechnology Information. For more information, visit
http://jats.nlm.nih.gov/archiving/tag-library/1.1d1/ and http://jats.niso.org/.
DataCite - The DataCite Metadata Schema is a list of core metadata properties chosen for the accurate
and consistent identification of a resource for citation and retrieval purposes, along with recommended
use instructions. For more information, visit
http://schema.datacite.org/meta/kernel-3/doc/DataCite-MetadataKernel_v3.0.pdf.
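
To show what descriptive metadata in one of these schemas can look like once it is moved to XML, the following is a minimal sketch using simple Dublin Core and Python's standard library. The namespace URI is the standard Dublin Core element namespace; the `record` wrapper element, the function name, and the sample values are hypothetical and are not taken from any Pennington Biomedical system.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core_xml(fields):
    """Serialize a dict of simple Dublin Core elements to an XML string.

    Values may be a single string or a list of strings (repeatable elements).
    """
    root = ET.Element("record")  # hypothetical wrapper element
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            element.text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative values only.
example = to_dublin_core_xml({
    "title": "Example dataset title",
    "creator": ["Researcher, A.", "Researcher, B."],
    "date": "2014-04-24",
})
```

A production workflow would more likely rely on the ingest forms of the chosen repository software; the sketch only illustrates the shape of the output.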
Descriptive Metadata Recommendations
Table 1 and Table 2 feature the best schema choices for descriptive metadata for
manuscripts/publications, images, and datasets, as well as crosswalks between the schemas. Each item's
metadata should include, at minimum, the fields listed in the tables. For publications and images, JTDS
recommends that a more robust schema like NISO JATS be used in order to capture fields that will
support the special uses and reuses of Pennington Biomedical research items. For datasets, JTDS
recommends either NISO JATS or DataCite. While NISO JATS is usually reserved for
manuscripts/publications, using the schema to describe datasets and/or images [1] will assist with
consistency of workflows.

[1] Contextual metadata for images, such as the related publications/dataset field, should always be included. This
type of information describes why a digital object was created and how it relates to or is distinguished from other
digital objects, which is especially important for graphs and other images that can be misunderstood out of context.


Table 1. Publication and Image Descriptive Metadata Crosswalks

| Suggested Fields | Dublin Core Fields | MODS | NISO JATS |
| --- | --- | --- | --- |
| Title | Title | Title | Title |
| Author | Creator | Name | Contributor / Contributor Group |
| Publisher | Publisher | Publisher | Publisher |
| Journal Name | | | Journal Title |
| Volume | | | Volume |
| Issue | | | Issue |
| Date | Date | Date Issued; Date Created; Date Captured; Date Other | Date |
| Page Range | | | Page Range |
| DOI/Unique Identifier | Identifier | Identifier | Object ID |
| Abstract | Description | Abstract | Abstract |
| PubMed ID | | | Article ID |
| PubMed Central ID | | | Article ID |
| MeSH Terms | Subject | Subject | Kwd |
| Related Publication or Dataset | Relation | Related Item | Related Article / Related Object |
| Grant Support | | Note can be used with Note Type: funding | Funding Source |
| Researcher ID | | | Contributor Identifier |
| File Type | Type | Type of Resource | Custom Meta |
| File Format | Format | Physical Description | Custom Meta |
| Copyright | Rights | Access Condition | Copyright Holder / Copyright Statement |
| Embargo Period | | Note can be used with Note Type: restriction | Custom Meta |
| Open Access Statement | | | Open Access |

Table 2. Dataset Descriptive Metadata Crosswalks

| Suggested Fields | Dublin Core | MODS | NISO JATS | DataCite |
| --- | --- | --- | --- | --- |
| Title | Title | Title | Title | Title |
| Author | Creator | Name | Contributor / Contributor Group | Creator |
| Publisher, if applicable | Publisher | Publisher | Publisher | Publisher |
| Date | Date | Date Issued; Date Created; Date Captured; Date Other | Date | Publication Year (required); Date |
| Abstract/Description | Description | Abstract | Abstract | Description |
| DOI/Unique Identifier | Identifier | Identifier | Object ID | Identifier |
| MeSH Terms | Subject | Subject | Kwd | Subject |
| Related Publication or Dataset | Relation | Related Item | Related Article / Related Object | Related Identifier |
| Grant Support | | Note can be used with Note Type: funding | Funding Source | Description can be used |
| Researcher ID | | | Contributor Identifier | Name Identifier |
| File Type | Type | Type of Resource | Custom Meta | Description can be used |
| File Format | Format | Physical Description | Custom Meta | Format |
| Copyright | Rights | Access Condition | Copyright Holder / Copyright Statement | Rights |
| Embargo Period | | Note can be used with Note Type: restriction | Custom Meta | |
| Open Access Statement | | | Open Access | |
| NULL value | | | Custom Meta | Description can be used |
| Version of Dataset | | Note can be used with Note Type: version identification | Custom Meta | Version |
| Size of dataset | | | Custom Meta | Size |
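
A crosswalk like Table 2 can be treated as a lookup table when records are converted from one schema to another. The sketch below encodes only a handful of rows, as read from the table, and uses hypothetical names (`CROSSWALK`, `crosswalk_record`); it is an illustration of the idea, not a complete or authoritative mapping.

```python
# A few rows of Table 2, keyed by suggested field. None marks an empty cell,
# i.e. the target schema has no equivalent field listed in the table.
CROSSWALK = {
    "Title": {"dublin_core": "Title", "mods": "Title", "datacite": "Title"},
    "Author": {"dublin_core": "Creator", "mods": "Name", "datacite": "Creator"},
    "MeSH Terms": {"dublin_core": "Subject", "mods": "Subject", "datacite": "Subject"},
    "Researcher ID": {"dublin_core": None, "mods": None, "datacite": "Name Identifier"},
    "Size of dataset": {"dublin_core": None, "mods": None, "datacite": "Size"},
}


def crosswalk_record(record, schema):
    """Rename a record's suggested fields to the target schema's field names.

    Returns (mapped, unmapped): the converted record and the list of
    suggested fields that have no equivalent in the target schema.
    """
    mapped, unmapped = {}, []
    for field, value in record.items():
        target = CROSSWALK.get(field, {}).get(schema)
        if target is None:
            unmapped.append(field)
        else:
            mapped[target] = value
    return mapped, unmapped
```

Reporting the unmapped fields separately makes it visible when a conversion would silently drop information, such as a Researcher ID when moving to Dublin Core.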



Administrative Metadata Schema
METS - The Metadata Encoding and Transmission Standard (METS) builds upon the work of the Making of
America II project (MOA2), which provided an encoding format for descriptive, administrative, and
structural metadata for textual and image-based works. METS provides an XML document format for
encoding the metadata necessary for both management of digital library objects within a repository
and exchange of such objects between repositories (or between repositories and their users). For more
information, visit http://www.loc.gov/standards/mets/METSOverview.v2.html.
Preservation Metadata Schema
PREMIS - The PREMIS Data Dictionary for Preservation Metadata is the international standard for
metadata to support the preservation of digital objects and ensure their long-term usability. The
PREMIS Editorial Committee coordinates revisions and implementation of the standard, which consists
of the Data Dictionary, an XML schema, and supporting documentation. For more information, visit
http://www.loc.gov/standards/premis/v2/premis-2-2.pdf.
Additional Metadata Resources
Minimum Information for Biological and Biomedical Investigations (MIBBI), a portal to a large
number of minimum information guidelines for various biological disciplines:
http://www.dcc.ac.uk/resources/metadata-standards/mibbi-minimum-information-biological-and-biomedical-investigations
Ontology for Biomedical Investigations (OBI): http://obi-ontology.org/page/Main_Page
The DCC Digital Curation Reference Manual Installment on Scientific Metadata, which provides
an overview to help determine a scientific institution's metadata needs:
http://www.dcc.ac.uk/sites/default/files/documents/Scientific%20Metadata_2011_Final.pdf




Appendix C: Descriptive Metadata Template
Name and Department: ________________
Researcher ID: _______________________
Today's Date: ________________________
(Please highlight) Are you transferring: a manuscript / a dataset / supplemental images / other?
If you highlighted "other", please describe:

If you highlighted "dataset", what value is used for nulls in the data?

What is the title of the item?

Please list the author(s):

If the item has been accepted for publication, please list the Publisher and Journal Name:

Please provide the abstract or a description: (Be as complete as possible.)

Is the item related to a publication or a dataset (e.g., is this a supplemental image for a publication, or
does this manuscript have accompanying data)? If yes, please list the title, author, and date for the related
publication or dataset:

Was this item created with funding from the National Institutes of Health? If yes, please list all NIH grant
numbers: (This will help ensure that your research items remain in compliance with open access
requirements.)

Does this item have any embargo periods that the Library and Information Center staff should be aware
of? If yes, please describe the type and length:


___________________________________________________________________________
**For Administrative Use Only** | Checksum value:

Appendix D: Deposit Agreements
Once the data audit process is completed and the technical infrastructure for this plan has been
implemented, the Library and Information Center staff will need all Pennington Biomedical Researchers
to sign a deposit agreement that transfers responsibility for their data files to the Library and
Information Center staff and covers storage of those files in the new institutional repository, regardless
of the choice in repository software. This agreement should include a high-level outline of workflows for
metadata creation and preservation processing, and should explain how copyright will be handled,
especially for items that are open access. This may require additional investigation into the Researchers'
publisher agreements and how publisher restrictions can or will be handled. The agreement should also
include a statement granting the Library and Information Center staff proxy to upload the processed data
files into the repository.
The length of time for which the deposit agreement is valid should be based on the Researcher's position.
Since Pennington Biomedical's tenure appointments are five years in length, tenured faculty should sign
agreements based on the time remaining in their appointment. For example, a tenured researcher with
three years remaining should sign a three-year agreement, and one who has just started a new tenure
appointment should sign a five-year agreement. Post-docs and adjunct researchers should sign shorter
agreements, lasting only one or two years.
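
The term rule above amounts to a small calculation, sketched here for illustration. The five-year appointment length comes from the text; the function name, the role labels, and the default of one year for post-docs and adjuncts are assumptions made for the example.

```python
TENURE_APPOINTMENT_YEARS = 5  # appointment length stated in the text above


def agreement_length_years(role, years_into_appointment=0, short_term_years=1):
    """Suggest a deposit agreement length in years.

    role: "tenured", "post-doc", or "adjunct" (labels are illustrative).
    years_into_appointment: completed years of the current tenure appointment.
    short_term_years: term for post-docs and adjuncts; the report allows 1 or 2.
    """
    if role == "tenured":
        if not 0 <= years_into_appointment < TENURE_APPOINTMENT_YEARS:
            raise ValueError("years_into_appointment must be between 0 and 4")
        return TENURE_APPOINTMENT_YEARS - years_into_appointment
    if role in ("post-doc", "adjunct"):
        if short_term_years not in (1, 2):
            raise ValueError("short_term_years must be 1 or 2")
        return short_term_years
    raise ValueError(f"unknown role: {role}")
```

For instance, `agreement_length_years("tenured", years_into_appointment=2)` yields a three-year agreement, matching the example in the paragraph above.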
JTDS recommends that Pennington Biomedical consult with its legal counsel for the exact wording, but
the following examples may help:
http://www.lib.cam.ac.uk/repository/deposit_agreement.html
http://www.unimelb.edu.au/copyright/umeragreement13August07.pdf



Appendix E: Repository Software Recommendations
A full explanation of recommendations for repository software follows. Many options and resources exist,
and only the strongest candidates are presented here.
Before getting started, reviewing a high-level repository guide will help gauge which features are
important to Pennington Biomedical and answer additional questions:

JISC Digital Repositories InfoKit: http://tools.jiscinfonet.ac.uk/downloads/repositories/digital-repositories.pdf
LEarning About Digital Institutional Repositories (LEADIRs) Workbook:
http://dspace.mit.edu/bitstream/handle/1721.1/26698/Barton_2004_Creating.pdf?sequence=1
The repository software recommendations are broken into three categories:
Pennington Biomedical Hosted Repository - All hardware and all technical infrastructure will
be housed by Pennington Biomedical. Additionally, all IT support will be Pennington
Biomedical's responsibility (Table 1).
Louisiana State University Hosted Repository - All hardware and all technical infrastructure
will be housed by Louisiana State University. Additionally, IT support may be a combination of
Louisiana State University's and Pennington Biomedical's responsibility, depending on the
agreement signed between the two institutions (Table 2).
Third-Party Hosted Repository - All hardware and all technical infrastructure will be housed by
a third party. These types of repositories can be either a stand-alone Pennington Biomedical
repository or a shared repository in which researchers from many institutions deposit (Table 3).
High-level pros and cons are presented for each repository software option.
More information on comparing and contrasting repository software features can be found in the
following guides:
DCC's Preservation and Curation in Institutional Repositories:
http://www.dcc.ac.uk/sites/default/files/documents/reports/irpc-report-v1.3.pdf
Institutional repository software comparison: DSpace, EPrints, Digital Commons, Islandora and
Hydra:
https://circle.ubc.ca/bitstream/handle/2429/44812/Castagne_M_LIBR596_IR_comparison_2013.pdf?sequence=1
United Nations Educational, Scientific, and Cultural Organization (UNESCO) Institutional
Repository Software Comparison: http://unesdoc.unesco.org/images/0022/002271/227115E.pdf



Table 1. Pennington Biomedical Hosted Repository.

| Islandora | DSpace |
| --- | --- |
| Documentation: https://wiki.duraspace.org/display/ISLANDORA6131/Islandora | Documentation: https://wiki.duraspace.org/display/DSDOC/All+Documentation |
| Islandora Modules/Add-on Architecture: http://islandora.ca/resources/modules | |

Pros

| Islandora | DSpace |
| --- | --- |
| Robust both in customization and documentation | Easy, out-of-the-box implementation |
| Open Source & Free Download | Open Source & Free Download |
| Well-documented organization/support structure with 79 implementations worldwide | Well-documented organization/support structure with over 1000 implementations worldwide |
| Built-in relationship functionality to support links between Pennington Biomedical open access publications, supplementary files, and data, while still allowing each to stand on its own as an object | N/A |
| Requires persistent identifiers, which can benefit grant applications and renewals that require open access | Requires persistent identifiers, which can benefit grant applications and renewals that require open access |
| Easy-to-use interface for adding objects, changing metadata, and other administrative tasks | Easy-to-use interface for adding objects, changing metadata, and other administrative tasks |
| Features batch ingest and workflow add-on tools | Supports batch importing and METS package imports, as well as metadata ingest from PubMed |
| New modules include support for: (1) BagIt - This module provides a "Create Bag" option that allows the packaging of the datastreams in Islandora objects. (2) Checksum - A simple module to allow repository managers to enable the creation of a checksum for objects. If enabled, the following checksum algorithms are available: MD5, SHA-1, SHA-256, SHA-384, SHA-512. Note: This will checksum all datastreams. (3) Basic PREMIS - This module produces XML and HTML representations of PREMIS metadata for objects in your repository. Currently, it documents all fixity checks performed on datastreams, includes agent entries for your institution and for the Fedora Commons software, and maps contents of each object's rights elements in DC datastreams to equivalent PREMIS rightsExtension elements. (4) Checksum Checker - This module verifies the checksums derived from Islandora object datastreams and adds a PREMIS fixity check entry to the object's audit log for each datastream checked. | Allows tasks to be run on the items stored in the repository that assist in long-term preservation efforts. Some examples include: applying a virus scan to item bitstreams; identifying a collection based on format types, which can help assist in format migrations; ensuring a given set of metadata fields are present in every item; ensuring all item bitstreams are readable and their checksums agree with the ingest values |
| Scholar Module allows for ingest from PubMed, as well as setting embargo periods and citation suggestions: http://islandora.ca/sites/default/files/Islandora%20Scholar%20Module%20-%20Islandora%20Camp%20NY.pdf | Supports embargo periods |
| N/A | Allows versioning of items, but currently has restrictions on the versioning functionality: https://wiki.duraspace.org/display/DSDOC4x/Item+Level+Versioning |

Cons

| Islandora | DSpace |
| --- | --- |
| Implementation and installation would require heavy IT involvement. Any customizations would most likely require additional IT time or Library and Information Center staff familiar with XML. | Implementation and installation would require heavy IT involvement. Any customizations would most likely require additional IT time or Library and Information Center staff familiar with XML, the command line, and SQL: http://www.dspace.org/sites/dspace.org/files/dspacehowtoguide.pdf |
| Out-of-the-box descriptive metadata support is only for Dublin Core and MODS. Though automatic generation of technical metadata is supported through an additional add-on, it is limited only to integration with FITS. New Content Models would need to be created for any other metadata standards. | Out-of-the-box support for descriptive, administrative, and structural metadata uses a custom DSpace schema. Other metadata schemas would require customized ingest forms. |
| Out-of-the-box format support is limited to PDFs, video, audio, image, and books (TIFFs). This creates a need to build custom content models for Pennington Biomedical Excel/CSV data files, as well as text manuscript items. | Default bitstreams do not include comma-separated files (.csv), only Excel spreadsheets (.xls). |
| No automatic support for any preservation actions. These steps would need to happen outside of the repository, requiring additional policies, workflows, and staff time. | Only built-in preservation action is checksums. No automatic support for any other preservation actions. These steps would need to happen outside of the repository, requiring additional policies, workflows, and staff time. |

Table 2. Louisiana State University (LSU) Hosted Repository.

HUBzero
Documentation: http://hubzero.org/documentation
Requires a different deposit agreement for researchers and a guarantee of open access by Louisiana State University.
Basic considerations to review before agreeing to deposit: http://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-and-certifying/core-re

Additional questions to consider:
How is LSU going to handle preservation workflows, specifically the steps recommended in this document?
What metadata is being captured for each item? Is it based on a trusted and well-known schema? Can HUBzero support preservation and technical metadata? Are there places to capture grant information and open access statements?
Are batch uploads possible? Can metadata be imported from PubMed?
Are persistent identifiers being implemented?
Which modules are being installed? How will those support and/or add on to these recommendations?
Is the implementation of HUBzero and the workflow of ingest and long-term preservation being documented by LSU staff?
Will a member of Pennington Biomedical Library and Information Center staff be allowed to have manager privileges? If not, how will the storage workflow/transfer of research items from Pennington to LSU take place? How will updates to Pennington Biomedical items be handled? This should be well-documented and included in an agreement between Pennington Biomedical and LSU.
Will researchers maintain copyright? Or will some sort of open license (like Creative Commons) be required?
Are embargoes supported?
Which formats are supported by default? Are customizations to this default list being considered? Ensure that all formats listed in data audit results are being accounted for.
What backups/redundancy is being implemented? Is at least one of these backups occurring in a different geographical location?
What costs will Pennington be expected to cover? First year? Five years? Seven to ten years?
Are review schedules in place for hardware? For format monitoring? Is format monitoring occurring automatically?

Pros
Implementation would be Louisiana State University's responsibility.
IT involvement may be Louisiana State University's responsibility if outlined in the agreement between the two institutions.

Cons
Little control over customizations.
May have to compromise on metadata and long-term preservation workflows.

Table 3. Third-Party Hosted Repository.

DSpace - Pennington Biomedical Stand Alone
Documentation: http://dspacedirect.org/
DSpace also offers a hosted option, in which libraries and small institutions can pay a subscription fee.
Pros
In addition to the pros listed under a local implementation of DSpace, costs would be shifted to a subscription fee rather than IT involvement, hardware, and staff time for implementation.
Cons
In addition to the cons listed under a local implementation of DSpace, customizations may not be available. If they are, they will require additional fees.
May have to compromise on metadata and long-term preservation workflows.

Dryad - Shared Repository
Documentation: http://datadryad.org/pages/repository
Pros
Costs would be shifted to a subscription fee rather than IT involvement, hardware, and staff time for implementation. Pricing information: http://datadryad.org/pages/pricing
Repository infrastructure is built upon the DSpace software and partners with CLOCKSS to ensure long-term access.
DOIs are assigned to each item.
Versioning of items is supported, as well as automatic monitoring for format obsolescence.
Cons
May have to compromise on metadata and long-term preservation workflows.
Repository will not feature institutional branding.
All items are under a Creative Commons license.

figshare - Shared Repository
Documentation: http://figshare.com/about
Pros
Supports unlimited space for free, as long as items are made public.
Format agnostic.
Repository infrastructure partners with CLOCKSS to ensure long-term access.
DOIs are assigned to each item.
Supports a variety of metrics.
A separate institutional repository space can be claimed. For example, penningtonbiomedical.figshare.com.
Cons
May have to compromise on metadata and long-term preservation workflows.
Repository will not feature institutional branding.
All items are under a Creative Commons license.




Appendix F: Budget Tools
Before Pennington Biomedical implements any of the recommendations in this plan, a complete budget
should be created. JTDS recommends the following resources to help Pennington Biomedical project
costs, including hardware, software, and staff:
DCC's suggestions for creating a business plan and understanding the costs of implementing
Data Management Services: http://www.dcc.ac.uk/resources/how-guides/how-develop-rdm-services#Business-plans
The espida model helps to make business cases for proposals that may not necessarily offer
immediate financial benefit to an organisation, but rather bring benefit in more intangible
spheres: http://www.gla.ac.uk/services/library/espida/
The Life Cycle Information for E-Literature (LIFE) Project has developed a methodology to
model the digital lifecycle and calculate the costs of preserving digital information for the next 5,
10 or 20 years: http://www.life.ac.uk/
The Transparent Approach to Costing (TRAC) provide[s] information on the income and
expenditure of universities... TRAC has been the standard methodology used by Higher
Education Institutes (HEIs) in the UK for costing their main activities (teaching, research and
other core activities): http://www.jcpsg.ac.uk/guidance/
The Keeping Research Data Safe (KRDS) cost/benefit studies, funded by JISC, feature tools
and methodologies that focus on the challenges of assessing costs and benefits of curation and
preservation of research data: http://beagrie.com/krds/
