
4. Implement wrappers for various data sources and describe the resulting wrappers and associated data sources as HTML documents.

5. Implement a Java based extension to a browser for displaying structured data as tables, graphs, or images.

6. Study the feasibility of a language for the description of dependencies between data sources, data, and meta data that results from computation of new data based on existing data.

Acknowledgments

Thanks to Luc Bouganim and Marcin Skubiszewski for comments on this text. Thanks to Gregory Becker, Matthew Jackson, Keiran Millard, Claire Waelbroeck, and Robert Woodley for discussions on scientific systems.

References

[1] C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, and Michael F. Schwartz. Harvest: A scalable, customizable discovery and access system. Technical Report CU-CS-732-94, Department of Computer Science, University of Colorado, Boulder, Colorado, August 1994.

[2] Dienst Project. Dienst protocol version 4.0. http://www.ncstrl.org/Dienst/htdocs/Info/protocol4.html, October 1995.

[3] Luis Gravano, Hector Garcia-Molina, and Anthony Tomasic. The effectiveness of GlOSS for the text database discovery problem. In Proceedings of 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, MN, 1994. ftp://db.stanford.edu/pub/gravano/1994/stan.cs.tn.93.002.sigmod94.ps.

[4] Joachim Hammer, Hector Garcia-Molina, Kelly Ireland, Yannis Papakonstantinou, Jeffrey Ullman, and Jennifer Widom. Information translation, mediation, and Mosaic-based browsing in the TSIMMIS system. In Proceedings of SIGMOD '95, 1995. SIGMOD System Demonstration.

[5] Proceedings of the First IEEE Metadata Conference, NOAA, Silver Spring, Maryland, 1996. http://www.nml.org/resources/misc/metadata/proceedings/meta home.html.

[6] Won Kim. Modern Database Systems. ACM Press, New York, NY, 1995.

[7] Carl Lagoze, Clifford A. Lynch, and Ron Daniel. The Warwick framework: A container architecture for aggregating sets of metadata. Technical Report TR96-1593, Cornell University, 1996. http://cs-tr.cs.cornell.edu:80/Dienst/UI/2.0/Describe/ncstrl.cornell

[8] Alberto O. Mendelzon, George A. Mihaila, and Tova Milo. Querying the world wide web. In Proceedings of Parallel and Distributed Information Systems, Miami Beach, Florida, 1996. To appear.

[9] Anthony Tomasic, Louiqa Raschid, and Patrick Valduriez. A data model and query processing techniques for scaling access to distributed heterogeneous databases in Disco. IEEE Transactions on Computers, 1996. To appear.

[10] R. Y. Wang, M. P. Reddy, and H. B. Kon. Toward quality data: An attribute-based approach. Decision Support Systems, 13:349-372, 1995.

sists of a pair: an HTML document describing the component, and the collection of objects that implement the metadata, data, and computation. The HTML document provides a means (through indexing engines) for locating the corresponding objects. A data source, for instance a database system, exports metadata (a schema), data, and computation (query processing) encapsulated as objects. All of these objects are described in the associated HTML document. The document has sufficient information to permit direct browsing of the data by an (intelligent) browser that understands the query language supported by the data source. A translator provides conversion of queries between two different query languages: the language 1 supported by the data source, and the language 2 generated as subqueries by the mediator. This functionality is again encapsulated as objects and described in an HTML document. For instance, the signatures of the functions supported by the translator are described. Thus, a browser that generates queries in language 1 can browse data in a data source accepting language 2 given the appropriate translator. The mediators encode the tasks of consolidation, aggregation, analysis, and interpretation. The associated HTML document describes the scientific models used for the task and the object describes the metadata and data of the results. Some mediators may support the invocation of the computation used to generate the data. All mediators conform to the same language for queries, metadata, and data.

Returning to the example in the previous section, we describe each part of Figure 3. The dotted box a around a data source represents a database of historical wave data, including wave height, force of a wave, etc. for every beach in France. This data is generated by a government authority and exported directly to the WWW. The dotted box b around another data source represents current meteorological data, also produced by a government authority and also exported directly to the WWW. Both of these data sources are managed by Jim Data Provider. Dotted box c represents the work of Jack Scientist, who has written translators of the two data sources (into the standard mediator language) and constructed a mediator which exports predictions of wave data based on the historical database and the current weather patterns. Thus, the analysis task is presented here. The fourth dotted box d represents shark attack reports in French newspapers. In this case, Jim Data Provider has also provided the translation program. Finally, dotted box e represents a mediator that describes safe beaches for windsurfing, based on the predicted wave data and recent reported shark attacks. Components a through d consist of a mix of code from the underlying systems and the proposed architecture. The mediator e can be written entirely within the proposed system. For Joe Windsurfer, the task of finding a safe beach consists of surfing to the HTML page associated with e. Jane Mayor will examine the metadata of e to determine the analysis that leads to a judgment of a safe beach.

Our goal is to create an environment of "mix and match" mediators, each of which documents a step in the production of environmental data.

4 Conclusion

In this position paper, we have presented a model for the tasks involved in the generation of scientific data. We showed that this model induces a hierarchy of data. Finally, we proposed an architecture to support these activities.

Our proposed architecture draws on research and commercial heterogeneous distributed databases [9, 6] and on work on semi-structured data sources [4, 1, 3]. Our focus will be on the crossing of object-oriented technology with information retrieval protocols [2, 8, 7].

The construction of a first prototype breaks down into several tasks.

1. Adapt an existing data model and metadata model into a language for describing environmental data. Instances of the resulting language are parts of an HTML document.

2. Adapt the Java applet interface specification into a language for the specification of interfaces to wrappers. Instances of the resulting language are also parts of an HTML document. The applets can read data located at sources and display the results in a World Wide Web browser.

3. Validate the language by describing various data sources in the language.
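
As a minimal sketch of how the three component types of Section 3 (data sources, translators, mediators) might fit together, the following Python models the safe-beach example. Everything named here is a hypothetical illustration and not part of the proposed system: the class names, the field names, the 2.0 m threshold, the sample rows, and the one-clause Query type, which merely stands in for whatever query language the mediators would actually share. For brevity, a single wave source stands in for the wave predictions exported by mediator c.

```python
from dataclasses import dataclass
from html import escape
from typing import Any, Dict, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class Query:
    """Hypothetical one-clause query: rows whose `field` is <= `at_most`."""
    field: str
    at_most: float


@dataclass
class DataSource:
    """Exports metadata (a schema), data, and computation (query processing)."""
    name: str
    schema: Dict[str, str]  # metadata: field name -> textual description
    rows: List[Row]         # data

    def run(self, query: Query) -> List[Row]:
        # Query processing, expressed in the source's own vocabulary.
        return [r for r in self.rows if r[query.field] <= query.at_most]

    def describe_html(self) -> str:
        # The HTML document that describes the component for indexing engines.
        items = "".join(
            f"<li>{escape(f)}: {escape(d)}</li>" for f, d in self.schema.items()
        )
        return f"<h1>{escape(self.name)}</h1><ul>{items}</ul>"


@dataclass
class Translator:
    """Rewrites mediator-language queries into the source's language."""
    source: DataSource
    field_map: Dict[str, str]  # mediator field name -> source field name

    def run(self, query: Query) -> List[Row]:
        native = Query(self.field_map[query.field], query.at_most)
        inverse = {src: med for med, src in self.field_map.items()}
        # Rename result fields back into the mediator's vocabulary.
        return [
            {inverse.get(k, k): v for k, v in row.items()}
            for row in self.source.run(native)
        ]


@dataclass
class SafeBeachMediator:
    """Combines translated subquery results into a derived data set."""
    waves: Translator
    sharks: Translator
    max_wave_m: float

    def metadata(self) -> str:
        # Documents the analysis so that a user can inspect the definition.
        return (f"safe = predicted wave height <= {self.max_wave_m} m "
                "and no recent reported shark attacks")

    def safe_beaches(self) -> List[str]:
        calm = {r["beach"] for r in
                self.waves.run(Query("wave_height_m", self.max_wave_m))}
        quiet = {r["beach"] for r in self.sharks.run(Query("shark_attacks", 0))}
        return sorted(calm & quiet)


# Made-up sample data; the two sources deliberately use different field names.
wave_source = DataSource(
    "waves", {"site": "beach name", "wave_m": "predicted wave height in metres"},
    [{"site": "Beach A", "wave_m": 1.2},
     {"site": "Beach B", "wave_m": 3.5},
     {"site": "Beach C", "wave_m": 1.0}],
)
shark_source = DataSource(
    "shark reports", {"place": "beach name", "attacks": "recent reported attacks"},
    [{"place": "Beach A", "attacks": 0},
     {"place": "Beach B", "attacks": 0},
     {"place": "Beach C", "attacks": 1}],
)
wave_translator = Translator(wave_source, {"beach": "site", "wave_height_m": "wave_m"})
shark_translator = Translator(shark_source, {"beach": "place", "shark_attacks": "attacks"})
mediator = SafeBeachMediator(wave_translator, shark_translator, max_wave_m=2.0)

print(mediator.metadata())
print(mediator.safe_beaches())  # ['Beach A']
```

In this sketch the mediator never sees the sources' native field names; only the translators do. That is one way to read the requirement that all mediators conform to the same language for queries, metadata, and data.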

is to generate data by consolidation, aggregation, analysis and interpretation. These activities are interrelated, as shown in Figure 1.

Each arrow in the figure indicates a precedence relationship between the tasks. The tasks are defined as follows.

Store: Raw data is defined, measured, and stored by Jim Data Provider. In addition, other users store the results of other tasks here.

Locate and Extract: Data is located and extracted by Joe Windsurfer, Jane Mayor and Jack Scientist.

Consolidate and Aggregate: Jack Scientist consolidates data by creating new sets of data that approximate missing data in the raw data source. In addition, data is aggregated into larger measurements.

Analyze and Interpret: Jack Scientist analyzes and interprets data and generates new data from the results of this task. Both this task and the previous task are accomplished through the use of scientific models.

Each iteration around the four tasks produces new data that is available for further iterations. The successive iterations produce data that is increasingly processed. Thus, the iterations naturally organize the data into a graph based on the tasks used to produce the data. Figure 2 shows an example of such a graph. Circles represent data and thin boxes, or transitions, represent the application of tasks. Directed arcs represent the transformational flow of data. On the bottom, base and derived data (resp. circles marked "B" and "D") are the input of two transitions implementing locate and select tasks. The result of these transitions is stored as derived data, which in turn is the input of a transition implementing a consolidate and aggregate task. The result, stored as derived data, is taken as an input of a last transition which returns the result of an analysis and interpretation (e.g., high-level indicators). Quality factors are indicated as annotations for the transitions: Qi and Qb mean interpretable and believable, respectively, while Q means all quality factors. In the next section, an architecture is proposed for supporting these activities and the resulting hierarchy.

Figure 2: A flow of processes.

3 Architecture

Figure 3: The proposed architecture. Dotted boxes show process (and administration) boundaries. Lines show exchange of queries and updates.

Figure 3 shows a diagram of our proposed architecture; it is similar in structure to heterogeneous distributed database systems [9]. The architecture consists of three types of components: data sources, translators, and mediators. Each component con-

are represented as dots of varying colors according to their degree of safety. Or Joe locates a WWW server running an ad-hoc program that delivers useful documents to windsurfers, which include descriptions of the safety of beaches.

Policy-Maker: Jane Mayor (e.g., the Mayor of a city)

Jane needs to locate appropriate data servers to retrieve data of the desired level of quality. For example, Jane accesses the rating of beaches in her town. Then, she asks why her town is not considered a safe beach. As a result, she gets a definition of a safe beach that is understandable to her, i.e., at the appropriate level of detail, and the data that the definition depends on. For instance, safety may be defined as a collection of criteria such as the expected height of waves, presence of rocks, presence of sharks, and water quality.

Since Jane may never have heard of shark attacks on her beach, she then may want to find out who collected the data about the presence of sharks near her beaches, and when. Compared to Joe Windsurfer, Jane has a higher requirement on data quality: she puts a high value on accessibility, interpretability, and usefulness. She wants to query the unified view that is presented to her by means of easy-to-use interfaces, and consequently the server must be able to process ad hoc queries.

Scientist: Jack Scientist (e.g., Environmental Scientist)

Jack constructs the servers for Jane Mayor and Joe Windsurfer. Jack writes programs which read measurement databases, administrative enquiries, remote sensing data, and geographical databases, to construct a map of France that indicates the quality of beaches. Also, he writes programs to improve the reliability of data.

Generally, Jack must find the data required for each new program that he writes. In addition, each new program uses multiple data sources. Each data source requires a unique program to extract the data for the new program from a data source.

Data Provider: Jim Data Provider (e.g., Biologist)

Jim collects data, and he wants to distribute it as widely and as easily as possible. Jim may manually add his data to an existing database through a standard form-based entry system. Data can also be collected using automatic sensors that directly transmit their data to an associated system. In this case, Jim has to verify the quality of data, and eliminate erroneous measurements. To do this, Jim needs to use specific programs for data analysis and interpretation and access other data systems for comparing his data with other related data.

Figure 1: Four activities for producing environmental data.

In general, any individual in the real world may play the role of multiple generic users described here. Current data system technology, at least in the environmental domain, is very simple. Data is stored in files. Metadata is stored as textual descriptions in associated files (that is, metadata does not have any formal language). Extraction of data consists of ad-hoc programs (that is, query languages are not used). Metadata for the results of analysis is described in scientific papers. Thus, there is no formal language for describing the data itself.

From a task point of view, Joe and Jane's principal activity is to locate data servers and select relevant data. Jack and Jim's principal activity

Improving Access to Scientific Data
Position Paper, Second DELOS Workshop, Bonn, Germany

Anthony Tomasic        Eric Simon
INRIA Rocquencourt     INRIA Rocquencourt*
September 19, 1996

* INRIA Rocquencourt, 78153 Le Chesnay, France; mailto:Anthony.Tomasic@inria.fr; http://rodin.inria.fr/.

1 Introduction

Users who interact with scientific data systems face many problems accessing scientific data. Because of the heterogeneous nature of these systems, data is difficult to locate, poorly documented, difficult to understand, and difficult to use. By "data" we mean both data resulting from probes of phenomena and information resulting from scientific models.

Anecdotal evidence shows that current data system technology in scientific research consists of a wide variety of tools and techniques, but some generalizations are possible. Data storage consists of large collections of files. Each file records, usually in tabular format, a series of measurements from probes of phenomena. File sizes are generally small, in the low megabyte range. (A major exception to this generalization is raw satellite image data.) Metadata consists of text descriptions of each field, located either at the beginning of the file or in another file in an associated directory. Documentation of results derived from raw data is typically accomplished by publication in the scientific literature. Some areas of scientific research do use databases [5], or at least are converging on standard metadata descriptions, but here again, the documentation of the metadata usually resides in text files.

A better architecture for scientific data would support two major goals: (a) diminishing the work required of each user to participate in or interact with scientific data systems, and (b) improving the quality of data delivered to users. By "quality," we mean [10] that data must be: accessible, i.e., delivered efficiently; interpretable, i.e., easily and unambiguously understood; useful, i.e., relevant and timely; and believable, i.e., complete, consistent, and accurate.

This position paper provides a conceptual analysis of these problems for users of scientific data systems. The analysis describes the four major tasks in the production of scientific data (from the technology point of view). The analysis then describes the organization of the data that results from these tasks. Based on this analysis, we propose an architecture for scientific data systems that formally models metadata and focuses on the two major goals described above.

Section 2 describes a conceptual model of tasks for the production of scientific data. Section 3 proposes an architecture. Section 4 concludes the paper.

2 Conceptual Analysis

The conceptual analysis of scientific data systems consists of three parts: the users, the tasks for production of data, and the resulting data organization. To make the discussion more concrete, we focus on a hypothetical example of an environmental data system.

We distinguish between several categories of users based on the data each user needs from an environmental data system.

End User: Joe Windsurfer

Joe needs to locate data that matches his interest. For example, Joe locates a WWW page that shows the map of France where beaches
