
Final Report of Data Services Task Force

July 25, 2007

Prepared by: Harrison Dekker, Gary Peete, Chuck Eckman, James Church, Jon
Stiles (UC DATA)
Other participants: Professor Merrill Shanks (Political Science), David Greenbaum
(IS&T), Jesse Silva, Frank Lester
Teaching and learning in the social sciences at UC Berkeley for undergraduate and beginning
graduate students increasingly requires students to understand and employ data analysis and
statistical applications. Present programs and facilities such as UC DATA, the Social
Sciences Computing Lab (SSCL) and the Econometrics Lab, and the CEDA/Demography
Lab have evolved successfully to serve doctoral student and faculty needs. However, the need
exists to provide a wider range of Berkeley students with enhanced and innovative programs
and facilities. This report proposes that the Library address this need through the creation of a
Digital Scholarship Center wherein users could gain access to an appropriate level of
software, hardware, data resources, and technical and subject expertise.

The first charge of the task force was to identify models and gather ideas related to data
services, spaces, and staffing at research institutions with educational missions similar to that of UC
Berkeley. Information was gathered from library websites and informal interviews with data
librarians in attendance at the International Association of Social Science Information
Services and Technology (IASSIST) 2007 Annual conference.

As currently practiced, data services might include access to online repositories, acquisition
of requested data, discovery tools, training in those tools, direct consultation and assistance in
finding data and resources, creation of analysis files through subsetting, variable
transformation, or file matching and manipulation, training in methods, training in specific
software, provision of software, and provision of hardware. Providers of such services
might include library staff, non-library archive staff, faculty, departmental staff, fellow
students (graduate/non-graduate), or seminar or short-course presenters. The recipients of those
services could include undergrads, grad students, students in particular courses/programs/
departments, faculty, university staff, affiliated researchers, or the general public. The setting
for service delivery might be in an indirect online context (ranging from basic informational
web pages, to wikis, to email-available staff, to instant-messenger on-demand help desks), in
decentralized physical in-person settings (e.g. multiple labs, shared training facilities, ad-hoc
departmental computer rooms), in centralized units or spaces which focus on particular kinds
of data (GIS centers), particular kinds of researchers (grad student labs, faculty research
centers, researchers in a particular field), or which provide a specialized type of service (e.g.
hardware, statistical consultation, specialized software). Finally, data services might be
considered active or passive, depending on whether there is active outreach or simply
resources and support made available.

A summary of the data services characteristics at comparable institutions compiled from
library websites is given below.

Institution | Workstations w/Stats/GIS software in library | Number of data staff in library | Other non-Library Data Services Unit(s) | ICPSR in Lib
Harvard | 5 PC's | 2 FTE | IQSS/Harvard-MIT Data Center/Murray Research Archive | N
Yale | < 10 GIS PC's in various locations | 1 PT (Documents Librarian) - StatLab is service point | StatLab | N
Stanford | dedicated lab 10+ PC's | 3 FTE | SIQSS | Y
MIT | 1 PC | 1 FTE | Harvard-MIT Data Center | Y
Princeton | 12 PC (GIS in another unit in library) | 2.8 FTE + 3 PT (Business/Finance and Cataloging Librarians) | Center for Health and Well Being/Research Program in Development Studies Data Archive, Office of Population Research | Y
Illinois | None (?) | PT (Documents Librarian) - CRESS is service point | CRESS | N
Virginia | 40+ | 10 FTE (including 2 campus IT staff) | Scholars Lab is partnership between Library and campus IT | Y
Michigan | 8 PC | 1 FTE | ICPSR | Y
SUNY Buffalo | 200+ | 1 PT (Documents Librarian) | Cybrary program includes hundreds of workstations in 2 libraries and other campus locations (campus IT?) | N

This analysis revealed little evidence of any universally accepted best practices in terms of
staffing, space, and equipment. Most data service providers interviewed revealed a more or
less ad hoc approach to service delivery. All worked in environments where library data
services overlapped to some extent with those offered by campus departments. Staff skills
ranged from individuals with no social science/statistical training to PhD holders with
extensive quantitative training. Facilities ranged from institutions with a single data
workstation to campuses with almost universal access to statistical software from campus
computer labs. Perhaps the sole commonality is that data service provision in each of the
comparison institutions is shared by both the library and other data service providers on
campus.

The second charge of the task force was to identify the types of traditional and innovative
data services activities that may take place within in-person and on-line support for campus
undergraduates, graduate students, and faculty. In addition, we were asked to describe the
characteristics and functions necessary within the physical space to support those activities;
describe the types of staffing and skills needed to support the recommended services;
describe the workstation hardware and software necessary to fulfill the campus academic
needs. The task force broke this down into three major areas of service: data processing
assistance, collection development coordination and access control, and collaboration.

With respect to data processing assistance, the task force agreed that there exists a need for
all students (including undergraduates) and faculty, regardless of their field of study, to have
access to the following services and resources. First is training, namely how to use the
various software applications available for the retrieval, manipulation, and analysis of
numeric data. Second is consulting services related to locating and assessing data for a
particular research need. Third is the provision of the necessary computing equipment and
software needed for data storage and analysis. Fourth is the provision of space to allow for
collaborative projects, course-related workshops, and, if at all possible, after-hours access.
Also related to space is the need to provide a location where data sources, documentation,
and applications are readily available and the occasional requirement for safe and secure
access to data resources as dictated by licensing terms.

The task force identified desirable characteristics of space for data services. It should provide
sufficient room for both equipment and consulting. Equipment should include high-end
workstations with a full range of analytical software such as SAS, SPSS, Stata, TSP,
MatLab, Stat/Transfer, Atlas TI, and ArcGIS. Two models for space were considered
acceptable. The first is a dedicated data lab (e.g. Stanford) model which provides a service
point and computing facility exclusively for data with enough workstations to accommodate
peak demand. The second model is a consulting office with only a few public computers, but
in close proximity to a computing lab or “commons” in which a large number of computers
provide the requisite access to data and analysis software in addition to common productivity
applications. The first model provides a more distinct identity for data services, as well as
promoting a full range of data services, but has high startup and maintenance overhead, and a
larger footprint. The second offers more flexibility, particularly with regard to off-hours
access. It would still allow for a unique space and identity for data services and, if staffed and
promoted accordingly, might highlight the training, consulting, and other service aspects that
the library is positioned to provide rather than just providing a “place to use statistical
software”. A potential compromise solution would be a laptop checkout service through
which students could borrow high-end laptops loaded with analytical software for use
throughout the library.

The second major area of data services need identified by the task force was collection
development coordination and access control. Highlighting this need are several factors. First
is a lack of a central agent coordinating purchases and securing adequate funding. Second is
the perceived existence of numerous unshared or under-shared data resources on campus. A
typical unshared resource might be one acquired by an individual to meet his or her own
need, but not shared with the greater campus community even when licensing or other
restrictions might allow it. Last is the absence of adequate discovery tools and finding aids.

The objective of a coordinated collection development and access control policy would by
necessity be multi-faceted. A survey to discover campus resources would be necessary.
Creation and maintenance of finding and indexing tools would also be required. Campus and
system-wide funds might need to be procured. Shared needs would have to be identified, and
purchases made accordingly. Documentation would have to be maintained. A means for
coordinating purchases with other U.C. campuses might be necessary. And, finally, a flexible
payment mechanism would need to be explored given the unique and varied nature of data
publishing, licensing, and distribution.

The third and final area of data services activity identified by the task force concerns
collaboration. The library is just one of many data service providers on campus. While it is
critical that the library defines and carves out its own unique niche with respect to data, it's
equally important that we develop and maintain alliances and collaborations across the
Berkeley campus and, potentially, the UC system. The most obvious collaboration partners
are UC DATA and the Information Services and Technology Data Services division, both of
which were represented on the Task Force. Some possible areas of collaboration discussed
included needs assessment, infrastructure issues, metadata standards,
development of Web 2.0 tools, undergraduate statistical literacy programs, and the development
of a campus web portal for data services, including web-based data extraction and analysis
tools. This task force served as a good starting point for an ongoing exploration of
collaboration opportunities. What remains to be defined, however, is how future joint projects
could be managed and what roles different organizations could play given their relative areas
of expertise and specialization. In addition, faculty/departmental involvement is critical if we
want to integrate library data services into students' learning experiences and faculty research
needs.

The final charge of the task force was to develop recommendations appropriate for the UC
Berkeley Library to consider as it continues to evolve its data services in support of campus
research and teaching needs. There was general consensus among the task force that
collection development was of utmost importance. Our students and faculty need access to
data and the library should make every possible effort to acquire these materials for our
collections. Building data collections will introduce extra costs, both in acquisitions, as data
licenses are often quite different from those of more familiar e-resources, and in service
delivery, given that data is not always delivered in the formats needed by the user.
Metadata is another associated cost if we want to make our collections findable and usable.
Again, data differs from the bibliographic material with which we are most familiar, and we
may not be able to tap our technical services staff for help in this area without some
additional training.

Hand in hand with the need to improve our collections and collection development
processes goes the need to maintain and develop staff with both the domain knowledge and
technical expertise necessary to provide consultation services and technical assistance to data
users. The more developed library data services programs such as those at Virginia,
Princeton, and Stanford have all devoted three or more FTE to data services. Ultimately, the
hardest decisions we need to make boil down to what we are willing to support in terms of
positions and salary. There was agreement on the task force that complementary service
points might allow us to deliver a wide variety of service even without significant staff
additions. As discussed earlier, a consulting center located adjacent to a “computing
commons” might deliver this sort of benefit as many technical issues could be resolved by
lower paid undergraduate assistants.

The task force agreed upon the final estimates for staff, space, computing, and software
budgets. With respect to staff, the task force felt it important to address five primary needs:
sufficient hours of staff availability, in-house technical support, social science and
quantitative methods domain expertise, integration with traditional library collections and
services, and integration with teaching and learning on the broader campus. Four roles were
defined. First is the Data Librarian position to manage and coordinate the service, as well as
fulfill other traditional library roles such as collection development and reference. Second is
a technology support position capable of filling a wide range of roles including hardware and
software installation and troubleshooting, Linux and Windows server administration, web site
maintenance, and experience with such programming technologies as XML/XSLT, Perl,
PHP, Python, SQL, and shell programming. Third is a domain specialist (Data Consultant)
with doctoral level training in a social science and a strong background in quantitative
methods. This position could be of fixed length, perhaps a post-doc or fellowship
position. Fourth are graduate assistant positions to provide sufficient staff as demand for data
services grows. The primary responsibilities of these students would be to provide technical
assistance in the use of statistical and other analytical software. Salaries for these positions
should be higher than the current library graduate student rate so as to be competitive with
salaries offered to Graduate Student Instructors and Graduate Student Researchers.

At a minimum, 6 hours per day of staff availability should be provided by the Data Librarian,
Data Consultant, and 2 Graduate Assistants. The Task Force also discussed the possibility of
some collaborative staffing options, such as the UC DATA Data Archivist assisting in the
Library and the Data Librarian assisting in the Social Sciences Computing Lab. Another idea
discussed was to make a shared in-library office space environment available to qualified
Graduate Student Instructors in exchange for their providing a certain amount of data service
consulting time.

With respect to space, the task force recommends the consulting space model to provide the
greatest flexibility. It is recommended that the space provide three high-end workstations to
accommodate access to IP restricted resources and to allow for mediated assistance. The task
force also proposes a laptop checkout service featuring 10 high-end laptops offering a full
suite of analytical applications to ensure that students have access to the appropriate hardware
and software. This solution would have a number of advantages over a traditional lab. First, it
would allow students the flexibility to work in parts of the library in which they or their study
groups can be most productive. Second, it would facilitate scaling up, scaling down, or
repurposing the service and space. Third, it would allow the office to have a smaller footprint,
thereby giving library planners more flexibility with respect to space allocation. Fourth, it
would promote the use of the space for more intensive consultation and make it somewhat
less busy and more conducive to office work. Last, it might offer a solution to the staffing
issues associated with extended hours, if laptop checkout were based at, or could at peak times
of the semester be moved to, an appropriate service point.

Ideally the space could be adjacent to a large computing lab or commons area with extended
hours in which computers are loaded with at least one or two statistical applications. This
would promote drop-in use of the consulting service as well as provide opportunities to
leverage the staff dedicated to each space. The first floor of Moffitt Library is an obvious
choice for consideration given the large amount of unused space and the existing computer
lab. Another possibility is the first floor of Doe Library, with Room 188 as a possible
consulting space. However, as currently configured, the adjacent public computing space has
no staffing, application software, or extended access hours. Another possible permanent
location is the current temporary space in Doe Graduate Services. It has the advantage of
having slightly extended access hours and an existing desk staff who already provide some
Data Lab public service (directional assistance, ID checking, reserves checkouts, etc.).
However, the increased foot traffic and noise that an expanded service point would bring are
in conflict with Graduate Services' goal of providing an extra-quiet study space. Also, the
adjacent public computers on the second floor of Doe suffer from the same limitations as
those on the first floor.

APPENDIX I: Startup cost estimates


Salaries
Data Librarian 40,008 - 106,620 (midpoint 62,556)
Programmer/Analyst II 43,920 - 86,988 (midpoint 65,454)
Post-Doc 32,314 - 80,112 (midpoint 50,112)
Student Assistant IV 16.00 – 35.00/hr (midpoint 25/hr) x 32 weeks x 20 hrs.

ANNUAL TOTAL: $194,122

Computing Equipment (includes 3 staff workstations):

Item | Unit price | Quantity | Cost
Workstation, 3.4 GHz, 2 GB, 250 GB, 24" monitor (3 year replacement) | 1200 | 6 | 7,200
Laptop, 17", 1.7 GHz Core Duo, 2 GB, 200 GB (assume 1-2 yr lifespan) | 2100 | 10 | 21,000

TOTAL: $27,800

Software | Price | License | First-year cost
SAS | 50.74 | annual | 811.84
SPSS | 45.00 | annual | 720.00
Stata | 175.00 | perpetual | 2800.00
MatLab | 200.00 | annual | 3200.00
Stat/Transfer | 50.00 | annual | 800.00
Atlas/TI | 100.00 | annual | 1600.00
ArcGIS | 100.00 | annual | 1600.00

TOTAL: $11,531

Furniture (4 cubicles + 3 workstations)


4 cubicles, furnished @ $4000 = $16,000
3 computer workstation desks @ $0.00 (current Data Lab furniture)

TOTAL: $16,000
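
The category totals in this appendix can be re-derived from the listed line items. The following is a minimal sketch in Python, not part of the task force's own work. It assumes that the software prices are per-seat prices and that 16 seats are licensed (6 workstations plus 10 laptops); the report does not state this, but each listed first-year cost equals the price multiplied by 16. The names SEATS and startup_totals are hypothetical.

```python
from decimal import Decimal

# Assumption: 16 licensed seats (6 workstations + 10 laptops). The appendix
# does not state a seat count; each first-year cost equals price x 16.
SEATS = 16

# Annual salary midpoints as listed in the appendix.
SALARIES = {
    "Data Librarian": Decimal("62556"),
    "Programmer/Analyst II": Decimal("65454"),
    "Post-Doc": Decimal("50112"),
    # Student Assistant IV: $25/hr midpoint x 32 weeks x 20 hrs per week.
    "Student Assistant IV": Decimal("25") * 32 * 20,
}

# (unit price, quantity) for each hardware line item.
EQUIPMENT = {
    "Workstation": (Decimal("1200"), 6),
    "Laptop": (Decimal("2100"), 10),
}

# Listed software prices, treated here as per-seat prices (an assumption).
SOFTWARE = {
    "SAS": Decimal("50.74"),
    "SPSS": Decimal("45.00"),
    "Stata": Decimal("175.00"),
    "MatLab": Decimal("200.00"),
    "Stat/Transfer": Decimal("50.00"),
    "Atlas/TI": Decimal("100.00"),
    "ArcGIS": Decimal("100.00"),
}

# 4 furnished cubicles at $4,000 each; the workstation desks are reused at no cost.
FURNITURE = Decimal("4000") * 4


def startup_totals():
    """Return each category subtotal computed from the listed line items."""
    return {
        "salaries": sum(SALARIES.values()),
        "equipment": sum(price * qty for price, qty in EQUIPMENT.values()),
        "software": sum(price * SEATS for price in SOFTWARE.values()),
        "furniture": FURNITURE,
    }


if __name__ == "__main__":
    for category, amount in startup_totals().items():
        print(f"{category:<10} {amount:>12,.2f}")
```

Under these assumptions the salary, software, and furniture subtotals agree with the printed totals ($194,122, $11,531 after dropping cents, and $16,000). The two equipment line items sum to $28,200, whereas the printed equipment total is $27,800; the sketch reports the line-item sum and does not indicate which figure was intended.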

APPENDIX II: Collections Needs


Overview:
The Data Lab currently uses the DATAM and DATAS funds to pay for certain acquisitions
and subscriptions. In addition, Data Services coordinates collaborative purchases involving
various selectors. At least annually, we are asked to consider large ($10K+), one-time
purchases, such as the China Census and the Mexican urban geospatial data set. As our
service grows, it is safe to assume that these requests will increase too. While it will never be
possible to fulfill all of our users' data needs, a larger base fund will help us develop a broader
and richer collection.
Currently Data Services pays a share for the subscription and update costs for the following
resources: Latinobarometro, Sociometrics, Thomson Datastream, Unicon CPS Utilities and
SIPP, CEIC Database. This currently amounts to approximately $18,000/yr.
In addition, selectors have identified a variety of data resource needs for consideration. These
are discussed below.
Data Collections for International Government Information

At present most of the international government organization (IGO) statistical databases have
been acquired and funded via DLIB and the International Documents funds (docis and docim
in the case of one-time purchases for CDs).

However, we still lack the following government statistical subscriptions, some of which have
been recently released:

IndiaStat - $1,900 annual subscription

IMF Balance of Payments Statistics Online - $1,440 annual subscription

Global Economic Monitor -- one-stop shop portal for analysis of current economic trends,
and economic and financial indicators -- $3,200 annual subscription

WiserTrade -- international trade database with U.S. state-level export series - $3,000 annual
subscription

Estimate of new subscription funds required -- approximately $9,600
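
The estimate above can be checked against the four listed subscription prices. The following is a minimal sketch in Python; the assumption that the estimate was rounded up to the next $100 is not stated in the report, and the helper name round_up_to_hundred is hypothetical.

```python
# Annual subscription prices (USD) as listed above.
SUBSCRIPTIONS = {
    "IndiaStat": 1900,
    "IMF Balance of Payments Statistics Online": 1440,
    "Global Economic Monitor": 3200,
    "WiserTrade": 3000,
}


def round_up_to_hundred(amount):
    """Round a whole-dollar amount up to the next multiple of 100."""
    return -(-amount // 100) * 100


# Exact sum of the listed prices.
exact_total = sum(SUBSCRIPTIONS.values())
# Assumption: the report's "approximately" figure is the sum rounded up to $100.
rounded_total = round_up_to_hundred(exact_total)

if __name__ == "__main__":
    print(exact_total, rounded_total)
```

The listed prices sum to $9,540, which is consistent with the approximate figure of $9,600.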

Economic Survey Data

One of the key needs of the development economics students (a major constituency in the
econ department) is for household, employment, and industrial survey data. These are
typically requests for a one-time purchase of surveys conducted by foreign governments.

Examples I have received include surveys of manufacturing firms in South Africa; floating
population surveys from Shanghai, China; the annual survey of enterprises from India; data
sets from India's National Sample Survey Organisation (NSSO); household survey data
from Brazil; and so on.

For an excellent example of a pro-active library that has made a point of acquiring and
providing access to this kind of data, please see Princeton University's Data & Statistical
Services web site. The following subject page on "Income and Employment" details the kind
of survey data that a well funded data center can acquire, including:

Brazil's National Household Sample Survey; Palestinian Child Labour Surveys; Chinese
Household Income Project; Encuesta nacional de empleo urbano [Mexico]; Family income
and expenditure survey [Philippines]; Labor force survey [Philippines]; South African Labor
Force Survey 2000-2002; Taiwan Labor Force Survey; Thailand Socio-Economic Survey;
etc.
There are also examples of Census Microdata (e.g. Census of India 2001: housing microdata
sample, the Thailand Population and Housing Census for 1990) that would be very good to
have here.

Costs for such data sets range from $500 apiece to several thousand dollars, the most
expensive ever requested being 8,000 euros.

Subscriptions -- N/A

Monographs funding -- estimate $20,000 per year, including possibility of admin travel funds
needed to acquire difficult purchases.

Respectfully submitted,

Jim Church
International & Foreign Documents Librarian
Librarian for Development Economics & Economic History

Business/Economics/Finance:
Over the course of the last several years, it has been increasingly obvious that the campus has
not been providing adequate access to corporate, finance, and economic data sources.
Repeated requests from undergraduates, graduates, and faculty for this data have been
unfilled because the library has no funding to support such access and the departments in
need of this information cannot or will not fund. While the Business School has been the
largest consumer and monetary contributor, many other departments including economics,
law, agricultural economics, sociology, history, political science, and engineering have asked
for this data. Unfortunately, these requests have come with no funding support.
A couple of factors can account for this trend. Data sources that were once only accessible to
researchers with advanced programming skills are now available through much more
user-friendly, web-based interfaces. Consequently, we are seeing a number of undergraduates and
less computer-literate users asking for this information. Also, since Enron and other
corporate scandals, there has developed a large, interdisciplinary interest in quantifying
corporate and financial operations.
In addition to limiting the research possibilities for our students and faculty, the recruitment
of professors and graduate students has been impaired. I am aware of several recent instances
where prominent scholars from other schools have been reluctant to accept positions at Cal
because they fear a lack of support for their data needs.
Examples of needed databases include Thomson’s Bank One, SDC Platinum, VentureExpert,
VentureOne, Compustat/CRSP Merged Data, EVENTUS, CISDM Hedge Fund, Board
Analyst, and ComScore. A fund in the neighborhood of $100,000 will be required to
purchase access to these databases.
Funding Estimates
One-time: $50,000… Recurring: $150,000/yr.

APPENDIX III: Job responsibilities grid and new and emerging service inventory

Data Librarian
• planning
• collection development
• licensing negotiations
• library liaison
• outreach
• consultations
• web content planning and production
• research and development

Data Consultant
• consultations
• curriculum development
• outreach
• classroom instruction
• research and development

Programmer/Analyst
• PC support
• server administration
• web programming
• dataset preparation
• software installation
• data product installation

Graduate Assistants
• consultations
• web content production
• classroom instruction

Traditional and emerging data services inventory

Existing
• statistical software assistance
• data retrieval software assistance
• location of data
• acquisition of data
• acquisition of data and statistics related print resources (manuals, textbooks, journals)
• GIS software assistance
• instruction and workshops
• web and print guides

Emerging
• data visualization
• statistical literacy programs
• services for new data citation standards
• data mashup/remix services
• data-specific metadata creation (DDI, digit identifiers)
• develop web-based analysis and discovery applications
• scanning and digitization projects (print to digital numeric data)
• preservation projects
