
Data Catalog

1. Introduction
The Data Catalog service supports registration of data collections crawled from external THREDDS catalogs or provided as LEAD Metadata Schema (LMS) documents. The registered data collections are indexed and can be searched through a web-service API. The Geo-GUI is a graphical portlet interface for querying the LEAD Data Catalog. A command-line client is also available for registering LMS documents and THREDDS catalogs, and for querying the Data Catalog.

The THREDDS Data Server (TDS) is a repository for data from various sensors and instruments. Data arriving periodically through the Local Data Manager (LDM) infrastructure is organized hierarchically as files and directories according to predefined configurations. THREDDS builds XML documents describing the contents of each directory using manually defined templates. These documents are THREDDS catalogs; they describe spatiotemporal attributes, data format and access protocols, publisher, and so on. Because the directories are hierarchical, the catalogs themselves can point to child catalogs, with the leaves listing data products. The catalogs are accessed through an HTTP URL and form one of the main sources of public data for the Data Catalog. Data products can also be published from the myLEAD personal catalog to be shared among the LEAD community; these products are already described using the LMS and can be added directly to the Data Catalog.

2. Architecture

[Architecture diagram: the Science Portal's Geo-GUI submits queries to the Data Catalog, which transforms them to the Lucene query API. Inside the Data Catalog, a crawl scheduler thread works off a bucket list to launch crawl spider threads over external THREDDS catalogs; a crosswalk-to-Lucene index thread converts the crawled metadata, which is indexed and stored as data resources in the Lucene index.]

The Data Catalog service can either have data product metadata in the LMS pushed to it for publishing, or have THREDDS catalogs registered with it as a source for pulling in data product metadata. The former scheme is used for shared data published from myLEAD and the latter is the source of external data entering the LEAD community. This LMS metadata is indexed and stored by the Data Catalog to allow for complex queries based on attributes defined over the LMS.

CRAWLING
The Data Catalog uses a register-and-crawl approach to pull information from the THREDDS catalogs. One or more THREDDS catalogs, encompassing community data products of interest to LEAD, are registered with the Data Catalog along with a crawl period that approximates the rate at which data arrives in the THREDDS catalogs, usually every half hour. At the prescribed frequency, the Data Catalog launches spider threads that crawl recursively from each registered THREDDS catalog to the catalogs linked from it, searching for data products at the leaves of the catalog tree. When a data product is found, its THREDDS metadata description is extracted and added to a separate crosswalk queue in the Data Catalog. Several crosswalk threads wait on this queue for new THREDDS metadata objects and asynchronously convert the THREDDS metadata to the LEAD Metadata Schema. Once converted, these LMS documents are indexed using the Apache Lucene library and stored.

When THREDDS catalogs are registered with the Data Catalog, they are dropped into different buckets based on the frequency at which they need to be crawled. A scheduling thread starts a crawler over all THREDDS catalogs in a bucket at the frequency specified for that bucket; a newly registered THREDDS catalog is put in the bucket whose crawl frequency is closest to its requested crawl frequency. This optimizes crawling by using a batch crawl mode instead of crawling each THREDDS catalog independently, as in the sketch below.
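A minimal sketch of this bucket-based scheduling is shown below, using standard java.util.concurrent scheduling. The class names, the bucket map, and the spider task are illustrative assumptions, not the actual Data Catalog classes.

import java.net.URL;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BucketCrawlScheduler {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    // buckets: crawl frequency in minutes -> THREDDS catalog URLs registered in that bucket
    public void start(Map<Integer, List<URL>> buckets) {
        for (Map.Entry<Integer, List<URL>> bucket : buckets.entrySet()) {
            final List<URL> catalogs = bucket.getValue();
            Runnable batchCrawl = new Runnable() {
                public void run() {
                    // one batch crawl covers every catalog registered in this bucket
                    for (URL catalogUrl : catalogs) {
                        new Thread(new SpiderTask(catalogUrl)).start();   // spider thread per catalog
                    }
                }
            };
            scheduler.scheduleAtFixedRate(batchCrawl, 0, bucket.getKey(), TimeUnit.MINUTES);
        }
    }

    // Hypothetical spider task: would walk the catalog tree and queue leaf datasets for crosswalk.
    static class SpiderTask implements Runnable {
        private final URL rootCatalog;
        SpiderTask(URL rootCatalog) { this.rootCatalog = rootCatalog; }
        public void run() { /* recursively follow catalogRef links, queue THREDDS metadata */ }
    }
}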

PUBLISHING
Data products and collections published from the myLEAD catalog are already described using the LEAD Metadata Schema. While data is pulled from the THREDDS catalogs, users push data products and experiments they wish to share from myLEAD to the Data Catalog. Publishing from myLEAD is also supported for collections of related data products and experiments, not just individual data products. These cohesive collections, while composed of several individual LMS documents, can also be retrieved as a single collection with all related LMS instances for, say, import back into myLEAD by a different user.

INDEXING
The Data Catalog needs to support storing and querying over hundreds of thousands of data products and collections. Usually, a trade-off occurs between the efficiency of querying and the generality of queries. Given the scalability requirements of the Data Catalog, we have opted for an index-and-store strategy that shreds and indexes important, commonly queried metadata attributes from the LMS XML document while also retaining the original LMS document instance to be returned as the result of queries. A configurable set of 22 commonly queried, typed metadata attributes that map to attributes present in the LMS was identified (see section 4), and an index is built on each of those dimensions using the Apache Lucene index library. Some of the attributes, like temporal and spatial ranges, are multi-dimensional, while others, like data format and citation, have a single dimension.

When an LMS document instance needs to be indexed and stored in the Data Catalog, an XML parser extracts the attributes configured for indexing and passes them through special tokenizers written for specific attribute types, since Lucene natively supports only String types. For example, a numeric data type like latitude or longitude has to be signed and padded with zeros to enable lexical indexing and comparison of numbers. The tokens generated for each attribute value are added to the index for that attribute and associated with the unique resource ID of the LMS document. In addition, all attributes are tokenized as strings and added to a separate index that allows free-text search over the LMS document. The index for each attribute is a B-tree that is persisted in the file system by Lucene, and each crawled bucket maintains its own set of B-trees for documents from its crawl.

The Data Catalog maintains two sets of indices for each bucket so that an index remains available for querying while a new index is built from the periodic crawl. While a live index is used for answering queries, another background index may be built from a fresh crawl. Once the crawl completes and a new index has been built for the THREDDS catalogs in a bucket, the background index is swapped with the live index and henceforth used to answer queries. The separate index maintained for published data is only appended to with additional documents when data products are published directly to the Data Catalog; no new index is built for it. Since correlating existing documents in the index with those in the THREDDS catalog is costly, such a complete recrawl strategy is effective when the number of data products updated in each catalog update is high. However, an incremental approach is preferred and more scalable when the delta change between crawls is smaller.
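To make the zero-padding idea concrete, the following is a minimal sketch against the Lucene 2.x-era API. The field names are taken from section 4; the padding scheme, class name, index path, and sample values are illustrative assumptions rather than the actual Data Catalog tokenizers.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PaddedAttributeIndexer {
    // Pad a signed latitude so that lexicographic order matches numeric order:
    // shift into a non-negative range and use a fixed-width, fixed-precision format.
    static String padLatitude(double lat) {
        return String.format("%08.4f", lat + 90.0);   // -90.0 -> "000.0000", 35.5 -> "125.5000"
    }

    public static void main(String[] args) throws Exception {
        Document doc = new Document();
        // Resource ID is stored so the raw LMS document can be fetched after a query hit.
        doc.add(new Field("resourceID", "urn:uuid:example-resource-id",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Padded numeric attribute: indexed for range comparison, not stored.
        doc.add(new Field("data.geospatial.idinfo.spdom.bounding.northbc", padLatitude(40.1),
                Field.Store.NO, Field.Index.UN_TOKENIZED));
        // Catch-all field supporting free-text search over the whole LMS document.
        doc.add(new Field("misc.AnyField", "NEXRAD Level II reflectivity",
                Field.Store.NO, Field.Index.TOKENIZED));

        IndexWriter writer = new IndexWriter("/tmp/datacatalog-index", new StandardAnalyzer(), true);
        writer.addDocument(doc);
        writer.close();
    }
}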

QUERYING
The Data Catalog allows queries over the predefined metadata attributes in addition to a free-text search over all of the LMS metadata. Basic comparison and range queries can be defined over individual attributes, and these can be combined recursively with AND, OR, and NOT operators to form arbitrarily complex boolean queries. The comparison operators supported are equals, greater than, less than, substrings, and wildcards, while range queries for spatial and temporal attributes allow matching on subsets, supersets, and intersections of value ranges. Queries are expressed by the user in an XML schema and translated by the Data Catalog into query terms compatible with the Lucene API. This involves mapping the attribute names in the query to the attribute names used in Lucene, tokenizing the attribute values to match the tokenization of the indexed attribute values, and recursively converting the XML query predicates to equivalent Lucene query terms in Java. The resource IDs of LMS documents matching the query are returned by the Lucene query; the raw LMS documents are then retrieved using these IDs and returned to the query client. Currently, the top 100 data products (a configurable result-set size) are returned by the Data Catalog, sorted according to their temporal ranges. Work is ongoing to allow incremental return of matching documents.
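As a rough illustration of the translation target, the sketch below builds a conjunctive Lucene query over a free-text keyword and a padded numeric range and resolves the matching resource IDs, again assuming the Lucene 2.x-era API. The field names come from section 4; the index path, field values, and class name are assumptions.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.TermQuery;

public class QueryTranslationSketch {
    public static void main(String[] args) throws Exception {
        // AND of a free-text keyword predicate and a range predicate over a padded latitude.
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("misc.AnyField", "reflectivity")),
                  BooleanClause.Occur.MUST);
        query.add(new RangeQuery(
                      new Term("data.geospatial.idinfo.spdom.bounding.northbc", "120.0000"),
                      new Term("data.geospatial.idinfo.spdom.bounding.northbc", "135.0000"),
                      true),                                   // inclusive range over padded values
                  BooleanClause.Occur.MUST);

        IndexSearcher searcher = new IndexSearcher("/tmp/datacatalog-index");
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length() && i < 100; i++) {   // configurable result-set size
            String resourceId = hits.doc(i).get("resourceID");
            // the raw LMS document would be fetched from the store using this resource ID
            System.out.println(resourceId);
        }
        searcher.close();
    }
}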

3. Service API, methods & parameters


3.1 Data Catalog command-line interface
The Data Catalog provides a command-line client for administering and querying the catalog. The client shell supports the following operations:

run DataCatalogShell -url <resource_catalog_wsdl_url> <options>

where <options> are one of:

[-llsThredds | -llsThreddsFiles | -lsThreddsCols | -llsThreddsCols | -llsDirect]
    List files/collection data products. The different forms of ls filter the results.
    llsThredds returns metadata of data products crawled in the registered THREDDS catalogs.
    llsThreddsFiles returns metadata of all files crawled in THREDDS. lsThreddsCols returns
    only the titles of collections crawled in THREDDS, while llsThreddsCols returns the entire
    metadata. llsDirect returns metadata of files/collections exported (added directly) to the
    resource catalog.

[-lls]
    Returns ALL data products available in the resource catalog. WARNING: This can run into
    the thousands. Side effects include an OutOfMemoryException and severe strain on the
    resource catalog.

[-listCats]
    Returns the THREDDS catalogs that are registered with the resource catalog for crawling.

[-addThredds] -threddsUrl <thredds_catalog_url> -desc <description> [-freq <frequency_of_crawling_in_mins>]
    Adds the given THREDDS catalog to the list of catalogs to be crawled by the resource
    catalog. Specify a description and, optionally, the frequency in minutes at which the
    catalog is to be crawled. The default is 60 minutes.

[-addData] -file <LEAD_metadata_file_path>
    Adds (exports) an individual LEAD Metadata document to the resource catalog for indexing.

-queryTimeFalls [-from <absolute_or_relative_time>] [-to <absolute_or_relative_time>]
    [-intersect {PERIOD_INTERSECTS_DATA | DATA_ENCLOSES_PERIOD | PERIOD_ENCLOSES_PERIOD |
    PERIOD_INTERSECTS_DATA_BEGIN | PERIOD_INTERSECTS_DATA_END | PERIOD_NOT_INTERSECTS_DATA}]
-queryKeywords <list_of_keywords>
    Queries for data products. queryKeywords does a keyword search over the given list of
    keywords (grouped in quotes). queryTimeFalls does a temporal query for data that intersects
    the given range. The 'from' and 'to' limits are either a signed integer giving the number
    of hours relative to the current local time (e.g. '-5' for five hours back) or an absolute
    UTC time formatted as yyyy-mm-ddThh:mm:ss (e.g. '2006-06-06T12:30:05'). The default is 0.
    The default intersect mode is DATA_ENCLOSES_PERIOD.

-queryId <resource_uuid>
    Queries for data products by their resource UUID.

-queryFile <xml_query_file>
    Sends a query defined by a queryContainer.

[-help]
    This help screen.
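For example, a combined keyword and temporal query for data from the last six hours might look like the following; the WSDL URL and keywords are placeholders, and only flags documented above are used:

run DataCatalogShell -url http://host.example.org:8080/datacatalog?wsdl -queryKeywords "radar reflectivity" -queryTimeFalls -from -6 -to 0 -intersect PERIOD_INTERSECTS_DATA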

3.2 Data Registration Service API


The Data Catalog provides two data registration interfaces. LMS documents can be registered directly from myLEAD, or THREDDS catalogs can be registered so that their data is crawled and registered periodically.

3.2.1 Published data through myLEAD
When data is published by myLEAD, the method importLEADMetadata() in the LuceneMetadataCrosswalk class is invoked; it accepts ContextMetadata and LEADResourceType as arguments.

Operation: registerDataProduct(RegisterDataProductType request)
Description: Registers a single LMS data product directly.

Operation: registerCollection(RegisterCollectionType request)
Description: Registers a collection of LMS data products directly.

3.2.2 Crawled data through THREDDS catalogs
The THREDDS URL is added to the Data Catalog service using a service or command-line client. The service persists the URL to its catalog index (in case the Data Catalog needs to be restarted) and also adds an entry in CatalogSpace for the appropriate IndexTrigger scheduling thread that launches the crawlers periodically. The crawlers traverse the THREDDS catalog tree recursively and, once they reach the data products at the leaves of the catalog tree, waiting crosswalk threads convert the THREDDS metadata to the LMS. The method consume() in the crosswalk thread (MetadataProcessor class) takes LMS documents as input and adds them to Lucene in a compatible form; a sketch of this hand-off is shown below.

Operation: RegisterDataCatalogResponseType registerDataCatalog(RegisterDataCatalogType request)
Description: Registers the given THREDDS catalog URL for crawling at the suggested frequency. The actual crawl frequency, which depends on the closest bucket the catalog is placed in, is returned.
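The producer/consumer hand-off between the spider threads and the crosswalk threads can be pictured with a small sketch. This is an illustrative outline only: the queue, the CrawledRecord type, and the worker class shown here are assumptions, not the actual MetadataProcessor implementation.

import java.util.concurrent.BlockingQueue;

// Hypothetical record holding the THREDDS metadata extracted by a spider thread.
class CrawledRecord {
    final String threddsMetadataXml;
    CrawledRecord(String threddsMetadataXml) { this.threddsMetadataXml = threddsMetadataXml; }
}

// Sketch of a crosswalk worker: it blocks on the shared queue, converts each THREDDS
// record to an LMS document, and hands the result to the Lucene indexer (details elided).
// Several workers would typically be started over one shared java.util.concurrent queue.
class CrosswalkWorker implements Runnable {
    private final BlockingQueue<CrawledRecord> queue;
    CrosswalkWorker(BlockingQueue<CrawledRecord> queue) { this.queue = queue; }

    public void run() {
        try {
            while (true) {
                CrawledRecord record = queue.take();        // waits for new THREDDS metadata
                String lmsXml = crosswalkToLms(record);     // THREDDS -> LMS conversion
                indexLmsDocument(lmsXml);                   // add to the Lucene index and store
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();             // allow a clean shutdown
        }
    }

    private String crosswalkToLms(CrawledRecord record) { /* conversion logic */ return record.threddsMetadataXml; }
    private void indexLmsDocument(String lmsXml) { /* Lucene indexing, as in section 2 */ }
}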

3.3 Data Search Service API


The Data Catalog exposes a method named searchDataProduct() which accepts a query as an argument. It is usually invoked by user requests from the Geo-GUI or by a simple client program for testing. The query message is generated and handled using the SearchDataProductDocument class, which extends org.apache.xmlbeans.XmlObject. Through the searchDataProduct() method, the Data Catalog retrieves the raw LMS documents from the Lucene store and returns the results to users as output.

Operation: SearchDataProductResponseType searchDataProduct(SearchDataProductType request)
Description: Returns a list of LMS data products that match the given query over the LMS metadata attributes.

Operation: GetDataProductResponseType getDataProduct(GetDataProductType request)
Description: Returns a single LMS data product (if present) that matches the given LMS resource ID.

Operation: ListDataCatalogsResponseType listDataCatalogs(ListDataCatalogs request)
Description: Lists all THREDDS catalogs that have been registered for crawling.

4. LMS Attributes used in Data Catalog Search Queries


Data Catalog Attribute: LMS attribute XPath, or description if a non-LMS term

misc.CatalogType: Type of external catalog that was this metadata's source (Direct or THREDDS).
misc.DataType: File or Collection.
misc.CatalogUrl: URL of the external catalog that was the source for this metadata.
misc.IndexTimestamp: Timestamp at which the metadata was crawled or registered.
misc.AnyField: Free-text keywords in the complete LMS document.
misc.LeadXml: The LMS document as XML.
resourceID
data.idinfo.citation.title
data.idinfo.descript
data.idinfo.citation.pubinfo.publish
data.idinfo.citation.onlink
data.idinfo.keywords.theme
data.geospatial.idinfo.timeperd.timeinfo.begin
data.geospatial.idinfo.timeperd.timeinfo.end
data.geospatial.idinfo.spdom.bounding.northbc
data.geospatial.idinfo.spdom.bounding.southbc
