
WARSAW UNIVERSITY OF TECHNOLOGY
Faculty of Electronics and Information Technology

Ph.D. THESIS

Stanisław Adaszewski, M.Sc.

Virtualization of neuroimaging data access and processing for multisite population brain studies

Supervisor
Professor Piotr Bogorodzki, Ph.D., D.Sc.

Warsaw, 2015
Dedication

I dedicate this work to my closest Family – my mother Krystyna, father Ryszard, sister Barbara and
grandmother Jadwiga. I love you all.  

Acknowledgements

I’d like to express my deepest gratitude to my supervisor – Professor Piotr Bogorodzki, for
introducing me to the world of neuroimaging and always supporting my efforts towards completion
of this thesis.

I’d like to thank Professor Henryk Rybiński for providing priceless guidance in the computer science
aspects of this work.

I’d like to thank as well my supervisors and colleagues from Laboratoire de Recherche en
Neuroimagerie – Professor Richard Frackowiak, Professor Bogdan Draganski, Doctor Ferath Kherif
and Doctor Juergen Dukart. Our collaboration is a lasting source of inspiration to me.

Last but not least, I'd like to express my appreciation to Doctor Ewa Piątkowska-Janko for proofreading and editorial corrections.

Abstrakt
Diagnostic imaging methods offer a unique opportunity to identify and study selected neuronal processes in the living human brain. The use of multisite imaging studies, combined with modern information technologies for collecting and exploring distributed data sets, opens the way both to basic population studies and to the search for biomarkers of the expression of disease-related changes. This work concerns the problem of data virtualization in the context of multisite imaging studies. It states a thesis on the features necessary for carrying out analyses of such data sets with database systems and presents a new model implementation of part of such an architecture, named by the author WEIRDB (from the words Wide, External, Imaging, Research, DataBase). WEIRDB provides a mechanism for accessing data of arbitrary size in its original form, without the need for loading it into or synchronizing it with the database system, bypassing streaming client/server communication, and it allows queries to be formulated in SQL. A comparison of WEIRDB with existing solutions in terms of external data handling and query execution speed showed a performance gain of 2 to 200 times over the other databases. The remainder of the work discusses the practical applications of the proposed software to date, and the above observations are discussed in the light of earlier expectations as well as the goals of future research and development.

Abstract
Diagnostic imaging methods create a unique possibility for identification and in-vivo study of selected neuronal processes in the human brain. Multisite imaging studies combined with modern information technologies make it possible to acquire and explore distributed datasets, creating the opportunity for population brain studies as well as for the search for biomarkers of disease-related changes. This work concerns the use of Data Virtualization techniques in multisite neuroimaging studies and presents a new model implementation of such an architecture, called WEIRDB (an acronym for Wide, External, Imaging, Research, DataBase). WEIRDB implements a mechanism for accessing data of arbitrary size in its source form, without the need for an import procedure or synchronization with the database system. Furthermore, it demonstrates means of improving performance by eliminating streaming client/server communication, and it allows queries to be formulated in SQL. A comparison of WEIRDB's efficiency against existing solutions is presented with respect to external data handling capabilities and query execution speed, showing a 2- to 200-fold improvement over prior art. Subsequently, this work lists current practical applications of the proposed architecture, followed by a discussion of the above observations with emphasis on expectations, goals and future research.

Contents
1. Introduction 13
1.1. Preface 13
1.2. Motivation 16
1.3. Thesis and Goals 17
1.4. Structure 19
2. Foundations and Problem Definition 21
3. Related Work 29
3.1. Relational Databases 29
3.1.1. PostgreSQL 29
3.1.2. MySQL 33
3.1.3. Teiid 36
3.1.4. HSQLDB 39
3.1.5. H2 40
3.1.6. SQLite 42
3.1.7. MonetDB 44
3.2. NoSQL Databases 47
3.2.1. CouchDB 49
3.2.2. Cassandra 52
3.2.3. Neo4j 54
3.2.4. Redis 57
3.3. Summary 60
4. WEIRDB - Complex Queries on External Neuroimaging Data 62
4.1. Architecture Description 63
4.2. Query Rewriting Engine 68
4.2.1. SQL Parser Implementation 71
4.2.2. Query Rewriting Algorithm 72
4.2.3. Query Execution and External Data Access 78
4.3. Random Access Data Model 79
4.4. Network Access Protocols 81
4.4.1. Native WEIRDB Protocol 81
4.4.2. PostgreSQL Network Protocol v3 82
4.5. Summary 85
5. Evaluation 87
5.1. Datasets and Test Queries 87

5.2. Benchmarking against Existing Solutions 90
5.3. Results 91
6. Practical Applications 96
6.1. Combining Classification Pipeline with Data Virtualization 101
6.2. Preparation for Statistical Analysis and Modelling using Data Virtualization 102
6.3. Novel Uses of Data Virtualization in Cerebral Cartography 103
7. Conclusions and Discussion 107
8. Disclosure Statement 111
9. Glossary 112
10. References 114
11. Appendix A 122

List of Tables
Table 1.1.1. List of projects following the "Brain CERN" collaborative multi-site approach by making data publicly available using Web sites and services 14
Table 1.1.2. Summary of some among the largest collaborative neuroimaging projects publicly available 14
Table 2.1. Summary of characteristics of Experimental Data from [19], supplemented with characteristics of sMRI and fMRI experiments 22
Table 4.1. CSV Access Algorithm 63
Table 4.1.1. Comparison of Index Data Structures 64
Table 4.1.2. Indexing methods support comparison in WEIRDB and existing SQL DBMS 65
Table 4.1.3. Limits comparison of WEIRDB and existing SQL DBMS with support for external data virtualization 65
Table 4.4.2.1. Data mapping between some Qt and SQLite types and PostgreSQL Network Protocol v3 predefined type OIDs 85
Table 5.2.1. Overview of software/hardware configurations in performed benchmarking tests and corresponding results tables 90
Table 5.3.1. Performance comparison of WEIRDB and HSQLDB/H2/PostgreSQL 93
Table 5.3.2. Performance comparison of WEIRDB and HSQLDB/H2/PostgreSQL using single subject Allen Brain Atlas data 93
Table 5.3.3. Query time comparison in microbenchmark similar to the first one from [86], without WHERE condition 93
Table 5.3.4. Query time comparison in microbenchmark based on first one from [86], with WHERE condition 94
Table 5.3.5. First benchmark for true in-situ CSV implementations, without WHERE clause 94
Table 5.3.6. First benchmark for true in-situ CSV implementations, with WHERE clause 94
Table 5.3.7. WEIRDB performance on a table with 400000 columns and 1000 rows (dummy); dummy2 has 10000 columns and 1000 rows 95
Table 6.1. Basic WEIRDB operations used for preparing a dataset for handling by data clustering method "H" 100
Table 6.3.1. Operations realized using external scripts taking advantage of WEIRDB's in-situ operation maintaining native file formats 104

List of Figures
Figure 1.1.1. Data Flow and Stages of Data Sharing Opportunities 15
Figure 2.1. General Architecture of Human Brain Project Data Access 24
Figure 2.2. SQL/MED Foreign Data Wrapper architecture 26
Figure 2.3. SQL/MED datalink type architecture 27
Figure 4.1.1. Functional architecture of WEIRDB 64
Figure 4.1.2. High-level architecture of WEIRDB 67
Figure 4.2.1. Examples of static and dynamic column references 69
Figure 4.2.2. Query rewriting pipeline 69
Figure 4.2.3. Block diagram of Query Rewriting Engine architecture with partitioning of functionality into compilation units of WEIRDB 70
Figure 4.2.3.1. WEIRDB External Data Access Model 78
Figure 6.1. Illustration of data sets available in ADNI phases 1, 2 and GO 97
Figure 6.2. Schematic of external data sources linked to WEIRDB 99
Figure 6.2.1. Voxel-wise intersections of healthy aging and changes observed in AD in MRI (a) and FDG-PET (b) 103
Figure 6.3.1. Multiple modality rule set for cognitively declined and healthy individuals 105

1. Introduction

1.1. Preface

Neuroscience is a rapidly developing field of knowledge, currently entering a stage of exponential growth and characterized by the inherently multidisciplinary nature required to study the complex problem of understanding brain physiology and function. Diagnostic imaging methods create a unique possibility for identification and in vivo study of selected neuronal processes in the human brain. Multisite imaging studies combined with modern information technologies make it possible to acquire and explore distributed datasets, creating the opportunity for population brain studies as well as for the search for biomarkers of disease-related changes.

Brain CERN (named after the high-energy physics research center) is a trend in neuroscience based on collaborative, multi-site research projects. Its pragmatism and efficiency are illustrated by the numerous successful endeavours presented in Table 1.1.1. The largest databases produced by these efforts contain terabytes of raw data, which translates into hundreds of terabytes of processed data, intermediate and final results (Table 1.1.2).

All of these projects share heterogeneous sets of data covering imaging, social, behavioral, electrophysiological, transcriptomic and protein-interaction data, as well as disorder knowledge bases, publication data and semantic meta-information. These data are used in complex research procedures, as depicted in Figure 1.1.1, where researchers can make use of multiple such data sources for the needs of their experiments.

Neuroimaging is one of the most demanding research methods in terms of data storage, access speed and processing efficiency. Depending on modality, region of interest, resolution and number of subjects, the size of the data produced by brain imaging protocols can nowadays reach 1 petabyte and will keep growing steadily in the future. At the same time, interactive retrieval and processing times need to be maintained in order to ensure effective data exploration, exchange and, increasingly, computer-aided diagnosis (e.g. [1]) and/or forecasting (e.g. [2]). Additionally, most of the software used in neuroscience nowadays relies on specialized file formats for storing geometry (voxels, surfaces) and accompanying measurements (scalars, vectors, tensors, timestamps, etc.). Some data (behavioral, social) are stored more straightforwardly as flat text files; others (disease knowledge bases, publications) tend to become organized in the form of databases. At the same time, the temporal resolutions involved range from momentary data used in cross-sectional studies, through several time points frequently used for illustrating brain plasticity and disease recovery, to multi-decade longitudinal research protocols. These traits place neuroscience among the most demanding disciplines with respect to data accumulation and processing. With such a broad range of requirements, neuroscience is becoming increasingly dependent on developments in information technology.

Table 1.1.1. List of projects following the "Brain CERN" collaborative multi-site approach by making data publicly available using Web sites and services.

Name - Description
Open Access Series of Imaging Studies (OASIS) [23] - a project aimed at making MRI data sets of the brain freely available to the scientific community
MuTrack [24] - a genome analysis system for large scale mouse mutagenesis
Neuroscience Information Framework (NIF) [25] - a repository of global neuroscience web resources
Neurocommons [26] - ontology specifications for the neuroscience semantic web
PIMRider [27] - exploration of protein pathways
Inferred Biomolecular Interaction Server (IBIS) [28] - a web server to analyze and predict protein interacting partners and binding sites
MassBank [29] - high quality mass spectral database
Mouse Phenome Database (MPD) [30] - collaborative collection of baseline phenotypic data on inbred mouse strains
Autism Genetic Database (AGD) [31] - a comprehensive database including autism susceptibility gene-CNVs integrated with known noncoding RNAs and fragile sites
Huntington's Disease Database (HDBase) [32] - a research community website
Alzheimer's Disease Neuroimaging Initiative (ADNI) [33] - a worldwide project that provides reliable clinical data for the research of Alzheimer's Disease
Parkinson's Progression Markers Initiative (PPMI) [34] - observational clinical study aiming to identify biomarkers of Parkinson's disease progression
BrainMap [127] - database of published neuroimaging experiments with coordinate results
NeuronDB [128] - searchable database of neuron properties
CoCoMac [129] - macaque macroconnectivity database
PubMed [130] - biomedical literature database
NeuroMorpho.org [35] - database of neuron morphologies

Table 1.1.2. Summary of some among the largest collaborative neuroimaging projects publicly available. DTI - Diffusion Tensor Imaging, MRI - Magnetic Resonance Imaging, PET - Positron Emission Tomography, fMRI - Functional Magnetic Resonance Imaging, CT - Computed Tomography, SPECT - Single-Photon Emission Computed Tomography, 7T (f)MRI - high resolution 7-tesla (f)MRI. Total Data Volume estimates were obtained by multiplying the average size of sample images in each modality by the number of datasets. * For OpenFmri.org the total data volume estimate was obtained partially using the method above and partially using estimates provided for some of the datasets on the website.

Name - Modalities - Number of Acquisition Sites - Number of Datasets - Estimated Total Data Volume
Alzheimer's Disease Neuroimaging Initiative (ADNI) [33] - DTI, MRI, PET, fMRI - 59 - 94311 - 2.36 TB
OpenFmri.org [37] - MRI, fMRI, 7T (f)MRI - 32 - 8732 - >1 TB*
Parkinson's Progression Markers Initiative (PPMI) [34] - CT, DTI, MRI, PET, SPECT, fMRI - 33 - 5413 - 135 GB


Figure 1.1.1. Data Flow and Stages of Data Sharing Opportunities. Neuroimaging data analysis is a complex and multistage process which requires access to all information preceding and including Derived Data before final conclusions can be drawn. Steps involved in such a workflow include: (1) Obtaining demographic and clinical information about subjects, (2) Preparing an Experimental Design, (3) Acquiring the Data using selected imaging modalities and/or functional tasks, (4) Processing Raw Data before delivering it to the analysis pipeline, (5) Performing modality- and study-specific analyses, (6) Summarizing information from stages 1-5, leading to conclusions and novel findings published in peer-reviewed journals. Historically, data could be shared between researchers at any of these stages. The recent explosion of computation, storage and communication technologies has facilitated such exchanges and created new opportunities for data sharing using globally Distributed Databases and Web Services.

1.2. Motivation

The number of brain images collected daily in clinical and research establishments around the world is enormous. Most of these data are represented as images that serve to help medical decision-making and are viewed once before being archived on departmental or laboratory servers for a finite number of years. The idea of recycling the medical data locked in hospitals and making them available to research communities is essential to the Human Brain Project (HBP) [3]. Public availability of very high-resolution research datasets and of a multitude of anonymized clinical examination results are the first steps towards a truly database-oriented neuroscience.

Storing data in the form of a database confers numerous advantages, such as the ability to access it remotely (eliminating the need for storing multiple copies and providing convenient means of sharing data with collaborators), to filter and aggregate data on the server side (thus reducing bandwidth requirements), and to share processing and analysis scripts among researchers, providing excellent reproducibility by referencing persistent public data stores. Furthermore, databases provide a single point of update, thus removing the need for distributing the data to the consumers each time a modification is made. On the other hand, access to databases is easy from existing languages such as MATLAB, Python, R, Java or C++. Data interchange between different programming platforms therefore becomes straightforward, allowing analytical algorithms from different environments to be combined: Bayesian, multivariate and timeseries analyses, Principal Components Analysis (PCA), Independent Components Analysis (ICA) and nonparametric methods, e.g. permutation-based fMRI/VBM analysis, as well as signal processing routines in MATLAB, miscellaneous statistical functions in R and image processing in Python.

Some of the existing database solutions support data virtualization. In practice, however, they're not capable of representation, integration and analysis of data as rich and diverse as those characteristic of neuroimaging. This shortcoming stems from numerous practical limitations. Firstly, the file format support is usually limited to comma-separated values (CSV). This functionality is inadequate in the face of the needs of neuroimaging, which utilizes an exceptionally wide spectrum of formats including: Nifti [4] - multidimensional data format, Gifti [5] - geometry format, MINC [6] - medical imaging format, Surf [7] - brain surface format, MAT [8] - Matlab file format, HDF5 [9] - hierarchical scientific file format, Bfloat - tractography streamlines format used by the Camino Diffusion MRI Toolkit [10], MRML [11] - Medical Reality Modeling Language, TrackVis [12] - track format for tractography data, GraphML [13] - Graph Markup Language, GEXF [14] - Graph Exchange XML Format, CIFTI [15] - Connectivity Informatics Technology Initiative, Neurolucida [16] - advanced biological structure format, MorphML [17] - neuron morphology description language and SWC [18] - simple neuron morphology data format.

Implementing support for missing data formats can be a tedious and complicated task, but it is not the only problem severely limiting the usage of database software for neuroimaging data analyses. The size of the data is another big hindrance to database adoption in cerebral research. A scan in a typical 3T MRI structural sequence can contain about 128 million values. Furthermore, typical experiments involve hundreds of participants and can effectively involve from a few to a few hundred time points, quickly adding to the overall volume of a data set. This magnitude poses various problems for existing solutions, both when the data are handled as specialized data types (such as vector, array, volume, tree or blob) and when they are spread as primitive numbers across multiple columns. In the former case, additional concerns arise because the functions and operators available for specialized data types typically lack the expressiveness necessary for scenarios such as computation of aggregate functions, joins, filtering and other interactions with classical tabular data. The latter case, on the other hand, often exceeds the fixed technical limits on the number of columns of particular software, or simply is not abstract enough to serve as a reversible 1:1 mapping of the underlying data. Moreover, in both cases the raw performance of current solutions is insufficient.

An ideal database system accepts a high-level request for information from the user and relieves them of the burden of precise planning of access to storage, communication and computation resources. Such an architecture is crucial to successful neuroimaging research in the future and should take advantage of solutions such as: MapReduce-style algorithm execution in collaboration with cluster computing environments, sharding, load balancing, integration with scientific computing software packages (e.g. MATLAB), data structures optimized for in-memory operation and, most crucially, data virtualization (also referred to as in-situ data access) with efficient caching and lookup mechanisms.

1.3. Thesis and Goals

Data virtualization is a consensus solution that makes it possible to keep legacy file-based access for existing neuroimaging workflows, while exposing database interfaces to the current generation of software without the need for storing two copies of the data. Using data virtualization with SQL gives immediate benefits in terms of having relational access to raw data, intermediate and final results from all stages of a typical neuroimaging pipeline, including: alignment, temporal/spatial smoothing, coregistration, normalization, experimental design results, model estimates, thresholding, clustering, surface reconstruction/inflation/registration, etc. Ad-hoc database solutions allow domain-specific data formats to be kept while at the same time abstracting them as tables with automatically inferred schemas. This allows for rapid data filtering, grouping, aggregating and joining according to specific criteria. Meanwhile, the user is not burdened with ahead-of-time planning of the database architecture and is left to approach existing files in a familiar way.
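As a minimal sketch of what such ad-hoc access could look like (all file, table and column names below are hypothetical), a demographics CSV and a set of image-derived values exposed as virtual tables can be filtered, aggregated and joined with plain SQL:

```sql
-- "subjects" is assumed to be backed by a CSV file and "gm_values" by
-- image-derived measurements; both schemas are inferred automatically.
SELECT s.subject_id,
       s.diagnosis,
       AVG(g.value) AS mean_gm
FROM   subjects  s
JOIN   gm_values g ON g.subject_id = s.subject_id
WHERE  s.age BETWEEN 60 AND 80
  AND  g.region = 'hippocampus'
GROUP  BY s.subject_id, s.diagnosis;
```

Neither table needs to be imported first; the underlying files stay in place and the query runs against the inferred schemas.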

The design of queries cross-referencing multiple studies, modalities and scales is facilitated by high-level query languages, while the raw performance necessary to execute them benefits from distributed computing mechanisms such as MapReduce-style algorithms. A database framework can serve as one of the building blocks for implementing the latter, providing convenient synchronization, data sharing and load balancing primitives.

For smoothly navigating, comparing, integrating and interpreting the entirety of available knowledge, further performance enhancements should be taken into account. Close coupling with existing scientific computing packages (e.g. MATLAB, R) could result in the database planner being informed by computational metrics of, for example, MATLAB's algorithms. Recognizing the full depth of user intention and taking advantage of special cases such as sparse matrices and distributable computation that maintains data locality are the keys to optimal performance. Finally, for storing temporary results or helper tables, in-memory data structures offer the best performance, which goes hand in hand with the efficiency-oriented character of research software. In-memory storage should therefore be the method of choice for all transient data, and application-specific file formats appropriate to a given output should be used whenever data need to be stored permanently.

The thesis of this work is that successful adoption of databases as a primary research tool for neuroimaging projects requires that the software under consideration supports: MapReduce-style algorithm execution in collaboration with cluster computing environments, sharding, load balancing, integration with scientific computing software packages (e.g. MATLAB), high-level query languages (e.g. SQL), data structures optimized for in-memory operation and, most crucially, data virtualization (also referred to as in-situ data access or ad-hoc database) with efficient caching and lookup mechanisms.

One of the goals of this work was to implement a model database software package supporting three of the above characteristics: the SQL query language, data structures optimized for in-memory operation and data virtualization with efficient caching and lookup mechanisms. Furthermore, the author aimed to illustrate the usage of such software in synthetic data benchmarks as well as in practical neuroimaging research scenarios. The final objective was to situate the above solution and experimental results on the roadmap to large-scale adoption of such approaches by individual researchers and by scientific communities at large. To this end, the author hypothesizes about the implementation of the additional mechanisms stated in the thesis and their impact on the applicability spectrum of the software.

1.4. Structure

This work is further structured as follows. In chapter 2, a historical background is provided on the topic of scientific databases and the characteristics of the data sets involved, ranging from experimental physics data to structural Magnetic Resonance Imaging (MRI). The chapter discusses challenges related to using a database approach as the primary tool for handling research data and the main reasons for the low adoption of such solutions in the field. Subsequently, traits specific to neuroimaging are discussed in the context of database software, highlighting the advantages and issues involved. In-situ data access is recognized as one of the core components necessary for successful adoption of such architectures. Current efforts towards a data virtualization standard are introduced. The chapter is concluded with a sketch of problems regarding the implementation of standard and non-standard data virtualization mechanisms in existing Database Management Systems (DBMSes).

Chapter 3 contains a review of existing DBMSes with emphasis on their current applications in research and industry. Data virtualization support is discussed or, where it is lacking, the author theorizes on how it could best be implemented following the given software's architectural paradigm. The chapter is divided into two subsections discussing relational and non-relational DBMSes separately. The capabilities and efficiency of each implementation are discussed. The chapter is concluded with a summary underlining problems common to current approaches to in-situ data access.

In chapter 4, the author presents a model implementation called WEIRDB which proves this work's thesis. It implements a mechanism for accessing data of arbitrary size in its source form without the need for an import procedure or synchronization with the database system. Furthermore, it demonstrates means of improving performance by eliminating streaming client/server communication and allows expressive queries to be formulated in SQL.

Chapter 5 introduces data sets and queries representative of typical neuroimaging research workflows. Subsequently, a comparison of WEIRDB's performance against existing solutions is presented with respect to external data handling capabilities and query execution speed using the aforementioned data sets. In most of the tests, WEIRDB's results were better than those of the other solutions by 2 to 200 times. Special emphasis is put on data sets exceeding the technical limitations of prior art software; in that case, only WEIRDB's performance results are presented. The chapter is concluded by naming the advantages of WEIRDB over existing DBMSes with respect to neuroimaging data processing.

In chapter 6, examples of practical applications of WEIRDB are presented in the context of neuroimaging studies concluded by article publications in renowned research journals. The selected case studies involve multisite datasets and from one to five medical modalities, such as magnetic resonance imaging (MRI), positron emission tomography (PET), biochemical sampling of blood and cerebrospinal fluid, and genetic data. In all instances, the software implementation developed in this thesis constituted an integral part of the process of data consolidation, processing and analysis, demonstrating capabilities, performance and efficiency of use unattainable with alternative solutions.

Chapter 7 contains a summary characterization of the presented model implementation with respect to classical database systems and alternative "new generation" solutions. The discussion also includes further development avenues for WEIRDB, such as: improvements in supported network protocols, performance improvements for local programming environments, introduction of access control mechanisms, standardization of syntax according to established norms, integration with the MATLAB environment and expansion of the portfolio of supported neuroimaging formats. The goal of WEIRDB's future growth is to increase the presence of database systems in research applications, improve work efficiency and the quality of results, create data and result sharing mechanisms and reduce system resource requirements.

Chapters 8, 9 and 10 contain, respectively, a Disclosure Statement, a Glossary and References to the literature.

This work includes one appendix. Appendix A contains source tree listings used by the author to
orientate himself in the architecture of some of the existing DBMS solutions.

2. Foundations and Problem Definition
As early as 1984 [19], efforts were being made to study, analyze and identify requirements and limiting factors for the adoption of database systems in various scientific disciplines. Shoshani et al. were able to outline the commonalities between different research domains and suggest a sub-categorization of data into Experimental and Associated Data (further grouped into Configuration, Instrumentation, Analyzed, Summary and Property data). They also proposed basic metrics pertaining to the notion of the Identifier of a record in a database to classify the character of Experimental Data: Regularity, Density and Time Variation. They named other characteristics of such data as well: Access Pattern, Access Type, Access Sequence and Database Size. Furthermore, they discussed the meaning of the aforementioned properties with regard to Associated Data and named its more specific characteristics, such as Data Modelling and Non-standard Data Types. They followed with the example of the Time Projection Chamber experiment and analyzed the characteristics of its Experimental Data, which they later summarized for numerous other research setups, such as Limited Track Reconstruction, Hydrodynamics Simulation, NMR Spectroscopy, Heavy Ion Spectroscopy, Passive Solar Simulation, Turbulent Flow Simulation and Laser Isotope Separation. Data from their table in [19] are contained in Table 2.1, supplemented with analogous information regarding multisite neuroimaging population brain studies, exemplified by the two most common modalities - structural and functional MRI (sMRI and fMRI, respectively).

In [20], Maier et al. recognized the fact that scientific applications resort to custom I/O handling and/or to using existing libraries for direct file mapping to/from program structures rather than relying on database architectures. They identified four obstacles hindering the adoption of database technology in the domain of scientific computing.

The first among those was data sizes exceeding the operational limits of existing DBMSs. While Table 4.1.3 shows that the theoretical limits of current solutions seem to accommodate the needs of the examples in Table 2.1, it is still the case nowadays that theoretical and practical limitations can differ quite vastly. Moreover, operations which are possible using custom I/O are frequently much more efficient than running a database on a similarly configured computer, biasing researchers' preferences towards existing, better-performing solutions.

Table 2.1. Summary of characteristics of Experimental Data from [19], supplemented with characteristics of sMRI and
fMRI experiments. * - applies often, (*) - applies sometimes, E - experiment, S - simulation

Columns: Structural MRI, Functional MRI, Time Projection Chamber, Limited Track Reconstruction, Hydrodynamics, NMR Spectroscopy, Heavy Ion Spectroscopy, Passive Solar, Turbulent Flow (Vortex), Laser Isotope Separation

EXPERIMENT / SIMULATION: E E E E S E E S S E
IDENTIFIER (KEY) - Regular: * * * * * *
IDENTIFIER (KEY) - Dense: * * * * * * * *
IDENTIFIER (KEY) - Time Variation: (*) * (*) *
ACCESS TYPE - Exact: * * * * * *
ACCESS TYPE - Range: * * (*) (*) (*) (*)
ACCESS TYPE - Proximity: * * * * * * * (*)
ACCESS SEQUENCE - Local: * * (*) (*) * (*) (*)
ACCESS SEQUENCE - Non-Local: (*) (*) (*) *
ACCESS SEQUENCE - Linear: * * (*)
ACCESS SEQUENCE - Arbitrary: * * * * (*) * (*)
SIZE PER UNIT (bytes): 10^8 10^6 10^(4..5) 10^(3..4) 10^7 − 10^(3..4) 10^(3..4) 10^6 −
SIZE PER COLLECTION (bytes): 10^9 10^8 10^11 10^10 10^(8..9) 10^(6..7) 10^(10..11) 10^(6..7) 10^8 10^(7..8)
TOTAL SIZE (bytes): 10^(10..13) 10^(9..12) 10^12 10^11 10^(11..12) 10^(9..10) 10^12 10^(9..10) 10^10 10^10
ASSOCIATED DATA - Configuration: * * * * * * * * * *
ASSOCIATED DATA - Instrumentation: * * * * * * *
ASSOCIATED DATA - Analyzed: * * * * * * * * *

Secondly, data access patterns in scientific computing differ significantly from the usual transactional processing of existing DBMSs. Research resources are frequently read-only or append-only and require fetching and storing large amounts of data at once for processing in external software packages. These are scenarios in which state-of-the-art database solutions suffer performance penalties for multiple reasons (data organization, network protocols).

The lack of convenient interfaces to existing scientific programming environments (e.g. FORTRAN, MATLAB) was listed as the third possible reason. It is still the case to date that, outside of ODBC- or JDBC-compliant databases, many domain-specific database solutions stay out of reach of mainstream research because of the lack of proper interfaces. It should be mentioned that even with such interfaces in place, the interaction between the programming environment and the database usually lacks the flexibility necessary to give it an advantage over direct data access. Such flexibility could involve, e.g., "out of the box" support for mixing programming calls into database queries. A good illustration of such an approach, with the capability of making a broad and meaningful difference, is the announced introduction of R [21] programming language support directly into Microsoft SQL Server 2016 [22].
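As a sketch of that kind of mixing (based on the sp_execute_external_script procedure that shipped with SQL Server 2016 R Services; the table and column names are hypothetical), an R model can be fitted directly against the result of an SQL query:

```sql
-- The inner SELECT is evaluated by SQL Server and handed to R as the
-- InputDataSet data frame; the OutputDataSet data frame returned by the
-- script is exposed to the caller as an ordinary result set.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        fit <- lm(gm_volume ~ age, data = InputDataSet)
        OutputDataSet <- data.frame(intercept = coef(fit)[1],
                                    slope     = coef(fit)[2])',
    @input_data_1 = N'SELECT age, gm_volume FROM regional_volumes';
```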

Last but not least, Maier et al. mention limitations in the mapping between data types used in scientific computations and those supported by DBMSs. This is true both at the more detailed level linked to the aforementioned issue of programming language interoperation and with respect to the simple mapping of numeric values to a database table. In the latter case, many state-of-the-art solutions (especially mainstream ones, e.g. PostgreSQL, MySQL) suffer deficiencies linked to a limited number of columns per table and/or result row. Even in the absence of such theoretical limitations, performance is often unsatisfactory when working with real-life scientific data (Table 5.3.1, Table 5.3.3).

As can be seen in Table 2.1, clinical neuroimaging, represented by the sMRI and fMRI modalities, is among the most demanding experimental disciplines with regard to both dataset sizes and characteristics. Scientific neuroimaging uses microscopy techniques pushing total dataset sizes even further, up to and beyond the petabyte range. Following the creation of international brain study efforts such as the Human Brain Project [3], even mainstream MRI data used in neuroscience studies will grow in size by orders of magnitude as, e.g., hundreds of thousands of anonymized pre-existing hospital scans are made available to the public (Figure 2.1).

Other examples of rich network databases include: Open Access Series of Imaging Studies (OASIS)
[23], a genome analysis system for large-scale mutagenesis in the mouse (MuTrack) [24],
Neuroscience Information Framework (NIF) [25], ontology specifications for neuroscience semantic
web - Neurocommons [26], exploration of protein pathways - PIMRider [27], Inferred Biomolecular
Interaction Servers (IBIS) [28], high quality mass spectral database - MassBank [29], Mouse
Phenome Database (MPD) [30], Autism Genetic Database (AGD) [31], Huntington's Disease
Database (HDBase) [32], Alzheimer's Disease Neuroimaging Initiative (ADNI) [33], Parkinson's
Progression Markers Initiative (PPMI) [34], database of published neuroimaging experiments with
coordinate results - BrainMap, searchable database of neuron properties - NeuronDB, macaque
macroconnectivity database - CoCoMac, biomedical literature database - PubMed or database of
neuron morphologies - NeuroMorpho.org [35].

At the same time, this data explosion is met by efforts such as BIDS [36] or OpenFMRI [37], aiming to standardize directory layouts, file formats, experiment descriptions, semantic knowledge and other accompanying metadata in a fashion that does not require a database for performing future analyses. This approach reinforces the aforementioned custom-I/O trend in science, as not requiring a database system is quoted as one of its advantages.

Figure 2.1. General Architecture of Human Brain Project Data Access. Hospital Infrastructure
consists of systems supporting HIS (Hospital Information System), PACS (Picture Archiving and
Communication System) and other domain-specific standards (e.g. PACS-EEG) in health
informatics providing means of storing respectively administrative, imaging and modality-specific
data. Data are extracted from those systems into temporary storage where they're further processed
using dedicated workflows for modality-specific feature extraction. Examples of such workflows
include 3D (volume) brain imaging (e.g. T1-weighted MRI, ASL, DWI) and 4D (time series) brain
imaging (e.g. functional MRI, resting state MRI, EEG) workflows. Processing may include
realignment, coregistration, time slicing, smoothing and other steps commonly referred to as
preprocessing. Normalisation is the process of registering subject level results to a predefined atlas
space, such as MNI or Talairach space. Corresponding atlas is subsequently used to extract regional
feature averages. Modality-specific algorithms (e.g. TIC quantification) may follow this step.
Finally, only an anonymous feature vector is exposed to data mining toolkits used by researchers
with original patient data erased from temporary storage.

The strengths of SQL DBMSs as storage and sharing mechanisms for neuroimaging data, and the use of complex data queries as applied to neuroimaging analytics, have been recognized previously in [38]. The need to support management of massive amounts of data, rapid and precise analysis and collaborative research is named as the leading motivation for advancement in this domain. The authors therefore propose that importing data into an existing DBMS's native storage format is the preferred way to handle neuroimaging data within the framework of a database, as demonstrated by [39]. Effectively, the goals of BIDS/OpenFMRI and of Gray et al./Hassona et al. are as much complementary as they are in conflict with one another.

Storing data in the form of a database confers numerous advantages, such as the ability to access it remotely (eliminating the need for storing multiple copies and providing convenient means of sharing data with collaborators), to filter and aggregate data on the server side (thus reducing bandwidth requirements), and to share processing and analysis scripts among researchers, providing excellent reproducibility by referencing persistent public data stores. Furthermore, databases provide a single point of update, thus removing the need for distributing the data to the consumers each time a modification is made. A proper notification mechanism is sufficient to inform the recipients that they're now working with an updated data source. Databases usually provide load balancing capabilities, which makes them ideal for use in high-performance parallel computing setups as well.

For reconciling the conflicting approaches of custom I/O and database systems, data virtualization appears to be the best consensus solution: it allows file-based access to be retained for I/O-intensive analytical software, while offering the possibility of high-level, database-driven insights for new applications without data duplication or the need for database modelling and import. In such a scenario, data are kept in their original format and accessed via middleware responsible for abstracting their internal organization into a typical database table. Existing data virtualization solutions include, e.g., Teiid, NoDB, CSV tables in MySQL and H2, virtual tables in SQLite or, in a very simplified case, even the Microsoft Log Parser.

Data virtualization efforts have been recognized by the database community in the creation of the ISO/IEC 9075 SQL standard part Management of External Data (SQL/MED) [40]. It is an addition to the SQL standard that defines foreign-data wrappers and datalink types to allow an SQL DBMS to manage external data, such as those present at different stages of neuroimaging processing pipelines in their native formats.

The foreign-data wrapper concept allows non-SQL data to be accessed from an SQL DBMS and combined with existing SQL data or other types of non-SQL data, all within the scope of the SQL query language, as illustrated in Figure 2.2.

Figure 2.2. SQL/MED Foreign Data Wrapper architecture. SQL/MED specification uses the word
"foreign" to designate external data. These two words are used interchangeably in this work. The
SQL/MED API specifies the interface between SQL Server query execution logic (e.g. get number
of results, begin scan, fetch row) and external data access algorithm (Foreign Data Wrapper). The
Foreign Data Wrapper can use an arbitrary API (e.g. file access routines from standard C library,
BSD sockets or higher level libraries) to access external data either on local system or via network.
In the latter case a Foreign Server is implicated in the data access chain.

The second part of the standard defines a DATALINK type, as illustrated in Figure 2.3, and a set of operations on this type. It allows local files and external URLs to be referenced from within the fields of a table, under the assumption that the actual data access will be implemented using native operating system file access or network APIs outside of the database. The DATALINK specification does, however, provide some additional functionality compared to storing a purely textual reference to external files. These are: referential integrity (the database uses OS mechanisms to lock files which are being pointed to from the tables so that they cannot be renamed or deleted), access control (read/write access control mediation through the database) and point-in-time recovery of file changes (a basic versioning system).
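A rough sketch of how a DATALINK column might be declared and populated is shown below; it loosely follows the DATALINK syntax as implemented in IBM DB2 (the table, column and URL are hypothetical, and option names and their ordering vary between products):

```sql
-- The database keeps only a controlled reference; the image itself remains an
-- ordinary file read through OS or network APIs, but while the row links to it
-- the file cannot be renamed or deleted.
CREATE TABLE scans (
    subject_id VARCHAR(32) NOT NULL,
    t1_image   DATALINK LINKTYPE URL FILE LINK CONTROL MODE DB2OPTIONS
);

INSERT INTO scans
VALUES ('sub-001', DLVALUE('file://imgserver/data/sub-001/t1.nii'));
```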

The DATALINK extension doesn't allow execution of SQL queries on such remotely referenced data; therefore it does not apply to the neuroimaging use case discussed previously. It could, however, provide a useful storage and access control layer.

Figure 2.3. SQL/MED datalink type architecture. The SQL Server contains tables with DATALINK type values. The datalinker has two roles. The first is to register, for a given file manager, every file that is linked to and unlinked from any DATALINK on any SQL Server. This comprises the necessary metadata for the second role of the datalinker, which is intercepting direct file manager calls from external software. The external process needs to request a token from the SQL Server prior to issuing the call, otherwise the interception routine will cancel the call and report an error. The datalinker might also arrange for backups of referenced files so that they correspond to backups of the tables containing the references. These facilities provide referential integrity, database-mediated access control and basic versioning for the DATALINK type.

Using data virtualization (further described interchangeably as ad-hoc databases) gives immediate benefits in terms of having relational access to raw data, intermediate and final results from all stages of a typical neuroimaging pipeline: alignment, temporal/spatial smoothing, coregistration, normalization, experimental design estimation (beta coefficients), thresholding, clustering, surface reconstruction/inflation/registration, etc. Ad-hoc database solutions allow the usage-specific data structure to be kept while abstracting it as a database with an automatically inferred schema. This in turn allows for rapid data selection according to specific criteria, crafting of queries cross-referencing different studies, modalities and scales, and smoothly navigating, comparing, integrating and interpreting the entirety of available knowledge. Furthermore, having a database-oriented point of access constitutes a touching point with the abovementioned existing public databases and allows for retrieval of highly specific cross-sectional data from Internet sources.
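A short sketch of such cross-referencing is given below (the virtual tables stand for two separately stored modalities and are hypothetical, with schemas inferred from the underlying files):

```sql
-- Cross-reference two modalities kept in separate external sources:
-- structural MRI regional volumes and FDG-PET regional uptake values.
SELECT m.subject_id,
       m.volume AS hippocampal_volume,
       p.uptake AS hippocampal_fdg_uptake
FROM   mri_regions m
JOIN   pet_regions p ON p.subject_id = m.subject_id
                    AND p.region     = m.region
WHERE  m.region = 'hippocampus';
```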

Access to databases is easy and frequently part of a standard framework in existing languages such as MATLAB, Python, R, Java or C++. It is therefore often easier to access such a data source than to implement a reader/writer for a very specific data format. It also facilitates straightforward interchange of data between different programming platforms, allowing the analytical algorithms present in each of them to be mixed, such as Bayesian/multivariate/timeseries analyses, Principal Components Analysis (PCA), Independent Components Analysis (ICA) and non-parametric methods, e.g. permutation-based fMRI/VBM analysis, as well as signal processing routines in MATLAB, miscellaneous statistical functions in R, image processing in Python and C++, etc.

Existing virtualization solutions are theoretically capable of representation, integration and analysis of complex, diverse and distributed data; in practice, however, they're subject to numerous limitations. Firstly, the file format support for ad-hoc access is usually limited to comma-separated values (CSV). This is far from addressing the needs of neuroimaging, which requires at the very least support for formats like Nifti [4], Gifti [5], MINC [6], Surf [7], MAT [8], HDF5 [9], etc. However, even assuming such support, it is not enough in most cases to handle the actual research data. The reason for that is size. A scan in a typical 3T MRI structural sequence can contain about 128 million values. This is beyond the capacity of most of the existing solutions to handle as columns, and array-oriented extensions such as SciQL [41] are missing from most existing systems. In fact, to the best of my knowledge, as of today no system exists that combines SciQL-like array handling capabilities with SQL/MED-like data virtualization.

Furthermore, empirical data show that even in the absence of technical limitations, the performance when handling such a number of columns is usually prohibitive. Of course, the data can be transposed, but this raises multiple other concerns, such as computation of aggregate functions, joining of tables, filtering by specimen, etc. Even assuming horizontal variants of those were supported, in the case of longitudinal population studies the column-wise capacity of existing solutions could be exceeded whether the data are transposed or not.

In the following chapter, I look more closely at implementations of SQL/MED foreign-data wrappers and other analogous mechanisms in existing database software products in the context of neuroimaging studies. I illustrate their pros and cons and discuss their shortcomings with respect to the aforementioned use cases. Afterwards, I propose an original solution - WEIRDB - which on top of SQL/MED-like capabilities addresses the specific needs of intense neuroimaging research in a novel, better-performing manner, avoiding the limitations of state-of-the-art architectures. Subsequently, I present a number of practical applications of WEIRDB in neuroimaging research. Then, I proceed to draw conclusions and discuss the implications of WEIRDB and data virtualization in neuroimaging, followed by hypotheses about possible future development trajectories for WEIRDB and for data virtualization in general.

3. Related Work

3.1. Relational Databases

3.1.1. PostgreSQL

PostgreSQL is an object-relational database management system (ORDBMS) first released in 1996. It is jointly developed by the PostgreSQL Global Development Group, which comprises many companies and private individuals. It is written in C and is cross-platform, supporting most Unix-like systems and Windows. PostgreSQL boasts support for the majority of the SQL:2011 standard, ACID compliance, transactional support for most data definition language (DDL) statements, multiversion concurrency control (MVCC) for avoiding locking issues, dirty read protection, full serializability, handling of complex SQL queries, updatable views, materialized views, triggers, stored functions/procedures and extensibility mechanisms allowing for many third-party plugins. It is distributed with a permissive free-software license and is suitable for a range of workloads, from small single-machine servers up to big database clusters. It is used by many companies and software products, including Yahoo!, BASF, Reddit, Skype, Sun xVM, Instagram and Disqus. At the time of this writing, PostgreSQL was the only database among the tested open-source solutions to support SQL/MED according to the ISO standard; competing products offered non-standard equivalents of SQL/MED foreign-data wrappers. In the next section, the PostgreSQL implementation of SQL/MED is discussed in more detail, followed by a description of the CSV foreign-data wrapper and an analysis of its performance.

SQL/MED implementation

SQL/MED (Management of External Data) is an extension to the SQL standard providing means of handling external data. Data are considered external if they're accessible but not managed by the database. The SQL/MED standard has been supported in PostgreSQL since version 9.1. What follows is a discussion of the foreign-data wrapper (FDW) implementation in PostgreSQL.

The key element is the FdwRoutine structure, which contains pointers to various functions responsible for providing information about, processing and extracting external data. The GetForeignRelSize function calculates relation size estimates for a foreign table. It is called at the beginning of planning a query which requires scanning such a table. The arguments passed to it are the global and table-specific planner information. The GetForeignPaths function is called after GetForeignRelSize and receives the same arguments. Its role is to generate all possible foreign scan paths along with their accompanying costs and any private data further required by the FDW for performing the scan. A scan path contains information about sorting order and required outer relations. The GetForeignPlan routine is then called at the end of planning and creates a ForeignScan node based on the provided path, the list of scan output targets (columns) and the scan restriction clauses (e.g. WHERE conditions) to be enforced by the executor. GetForeignPlan can decide which of the clauses can be enforced by the foreign scan executor and which need to be processed by the core executor, and set the ForeignScan node parameters accordingly. The execution is started with a call to BeginForeignScan, which is supposed to perform all the necessary initialization. Depending on the presence of the EXEC_FLAG_EXPLAIN_ONLY flag, this might be limited to the minimum necessary to make the node state valid for further calls to ExplainForeignScan and EndForeignScan. In the absence of EXEC_FLAG_EXPLAIN_ONLY, the next routine called is IterateForeignScan. It returns one row from the foreign table, with columns matching the layout of the table. If unnecessary columns are optimized out, their values are set to NULL. IterateForeignScan can be called repeatedly until no more rows are left, in which case it should return NULL. ReScanForeignScan can be used to restart the scan from the beginning. EndForeignScan is called to release resources at the end of scanning.

FDWs can also provide write access to external data by means of table update functions.

AddForeignUpdateTargets defines extra target expressions used to uniquely identify rows to be


updated by FDW, e.g. primary keys. Those values will be retrieved during an update scan.

PlanForeignModify can perform additional actions and provide private data attached to a table INSERT, UPDATE or DELETE plan. Its input arguments are the global planner information, the table modification plan, the subplan index and the range table index of the result relation.

BeginForeignModify is called during executor startup and is supposed to perform any initialization
needed before actual table modification.

Invocations of ExecForeignInsert, ExecForeignUpdate or ExecForeignDelete follow for each tuple


to be inserted, updated or deleted respectively. States of the plan and target relation are provided as
arguments, as well as private data set in PlanForeignModify. Subplan index and flags (in particular
EXEC_FLAG_EXPLAIN_ONLY) are also provided.

ExecForeignInsert inserts one row into the foreign table. A tuple of columns corresponding to the table layout is provided, as well as a tuple extended with the additional fields specified by AddForeignUpdateTargets. The global execution state of the query and a description of the resulting relation are also given as arguments. ExecForeignInsert returns the tuple that was inserted, or NULL if no insertion was performed (e.g. due to triggers). The returned data are used if the INSERT query has a RETURNING clause or the table has an AFTER ROW trigger.

ExecForeignUpdate is responsible for updating one row in a foreign table. It takes the same
arguments as ExecForeignInsert, this time using the additional targets for identifying the destination
rows. The return value semantics follow the same rules as for ExecForeignInsert.

ExecForeignDelete deletes one row from the foreign table. Its behavior is identical to
ExecForeignUpdate except that the row is removed rather than updated. EndForeignModify ends the
table update and frees any allocated resources. IsForeignRelUpdatable is an informative function,
reporting which update operations the specified foreign relation supports.

Two additional functions are provided for handling the EXPLAIN queries involving foreign tables.
ExplainForeignScan prints additional EXPLAIN output for foreign scan. ExplainForeignModify
prints additional information for foreign table modification.

The AnalyzeForeignTable routine is called when the ANALYZE command is executed on a foreign table. If supported, it should return a function for sampling rows from the target table and an estimated table size. The sampling function receives the maximum number of rows to be sampled and should write their content into the provided destination variable.

Foreign Data Wrapper for Comma-Separated Values files

The FDW for comma-separated values (CSV) included with PostgreSQL is called file_fdw and
provides read-only access to flat files with configurable delimiter, quote and escape characters,
optional header and binary and textual format support. It re-uses the COPY FROM implementation
for reading the files.
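For reference, a minimal example of attaching a CSV file through file_fdw could look as follows (the file path, table name and columns are hypothetical):

```sql
CREATE EXTENSION file_fdw;

CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

-- A strongly typed table layout has to be declared up front for the external file.
CREATE FOREIGN TABLE subjects (
    subject_id text,
    age        integer,
    diagnosis  text
) SERVER csv_files
  OPTIONS (filename '/data/subjects.csv', format 'csv', header 'true');

-- Queried like a regular table; EXPLAIN shows a Foreign Scan node for it.
SELECT diagnosis, count(*) FROM subjects GROUP BY diagnosis;
```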

The fileGetForeignRelSize function is implemented based on ANALYZE results or on faked constant values. In the presence of previous ANALYZE information, the tuple density is obtained as the ratio between the number of tuples and the number of pages in the given relation, scaled to the current relation file size in pages. Furthermore, the number of rows is scaled by the selectivity of the provided restriction clause list. Otherwise, the number of tuples is estimated as the file size divided by the tuple size in the internal representation as specified by the table definition. This value is also scaled by the selectivity value.

The fileGetForeignPaths specifies only one possible access path and propagates convert_selectively
option in the fdw_private field of the path into the plan node. The convert_selectively option carries
information for COPY FROM regarding which columns need to be parsed from textual to binary
form.

As the foreign scan executor for CSV has no native ability to evaluate restriction clauses, fileGetForeignPlan puts all non-pseudoconstant clauses into the plan node for the core executor to check. A pseudoconstant is a clause which contains no variables or volatile functions; such clauses are handled elsewhere in the PostgreSQL pipeline.

fileBeginForeignScan does nothing in the presence of the EXEC_FLAG_EXPLAIN_ONLY flag; otherwise it initializes the COPY FROM state and creates a FileFdwExecutionState structure holding it along with the filename and options.

fileIterateForeignScan executes a read of one row using the NextCopyFrom routine. All columns are read; however, the values of those not marked for binary conversion are ignored and set to NULL in the result array.

fileReScanForeignScan restarts the COPY FROM logic by calling EndCopyFrom followed by BeginCopyFrom again.

fileEndForeignScan finalizes COPY FROM algorithm by calling EndCopyFrom.

fileExplainForeignScan provides foreign file name and size to the regular EXPLAIN output.

fileAnalyzeForeignTable provides a file_acquire_sample_rows function in order to analyze sample rows.

file_acquire_sample_rows reads the requested number of rows from the beginning of the file and then reads all consecutive rows, each time replacing at random one of the previously read rows (Vitter's algorithm [42]). The resulting rows are subjected to PostgreSQL's ANALYZE pipeline.

Analysis of PostgreSQL's Foreign Data Wrapper for CSV files

The COPY FROM algorithm uses stdio.h routines for accessing the underlying files; therefore, the only caching mechanisms available are those of the C library and the operating system buffers. In the case of repeated scans of very large tables, this can result in pages being evicted from the cache before being reused, so that the scans do not benefit from caching at all.

More informed caching of row offsets, separator positions or values converted to binary form is maintained neither by the FDW nor by higher levels of the PostgreSQL execution pipeline. Therefore, the full CSV parsing logic, as well as PostgreSQL's string-to-datum conversion routines, has to be invoked on each query. This approach is particularly inefficient when retrieving entire rows (i.e. converting all columns), especially if the user's goal is simply to pass the original data of the filtered rows on to further processing.

Since foreign data access in PostgreSQL remains part of the regular query plan, and the data might need to be placed in temporary tables (for example to drive a join), it is subject to the same limitations on the maximum number of columns as the rest of the pipeline (e.g. 250-1600 columns per table depending on the data types, and at most 1664 targets in a SELECT query).

The necessity to define strongly typed table layouts in order to attach external CSV files is an unnecessary burden in most typical use cases. Furthermore, data conversion is frequently superfluous in scenarios that use only filtering/joining/sorting functionality to redirect otherwise unmodified raw data to further external processing utilities accepting CSV input.

With the abovementioned shortcomings in mind, PostgreSQL doesn't seem like the optimal choice for
handling neuroimaging data. This conclusion is partly confirmed by the types of foreign data
wrappers currently available, which consist chiefly of database proxies, text file formats, REST API
wrappers (e.g. Facebook, Google, Twitter), Apache Hadoop, Apache Hive, some geographic formats
(GDAL/OGR, Geocode/GeoJSON) and physics (ROOT) and genetics (VCF) file formats. No
imaging file format is supported. The biggest limiting factor appears to be the cap on the maximum number of columns.

3.1.2. MySQL

MySQL is the most popular open-source relational database management system (RDBMS). It was initially released in 1995 and its development was sponsored at the time by a single Swedish for-profit company - MySQL AB. Subsequently, MySQL AB was acquired by Sun, which was later bought by Oracle. Currently Oracle offers a free version of the MySQL software along with multiple paid versions with different added features. MySQL, similarly to PostgreSQL, supports workloads scaling from amateur webservers to replicated database farms. Some of the well-known software packages using MySQL are Joomla, WordPress, phpBB and Drupal. It's also used by companies such as Google, Facebook, Twitter, Flickr and YouTube. One of the most interesting characteristics of MySQL is its plugin-based storage engine system, which allows a storage mechanism to be chosen per table. This enables foreign-data-wrapper-like functionality, although it is not named this way and doesn't follow the SQL/MED standard. The following paragraphs present analyses of MySQL's dynamically loadable storage engine architecture and of its CSV storage engine.

Analysis of MySQL's dynamically loadable storage engine interface

MySQL, similarly to PostgreSQL with its SQL/MED support, implements a powerful table storage engine abstraction layer encapsulated in a single handler class.

A number of operations can be pushed down to the table storage engine, including filtering
(cond_push() method), sorting (multi_range_read_* methods), unordered scan (ha_rnd_*), index
scan (ha_index_*), table creation (ha_create()), modification (ha_*_alter_table), removal
(drop_table(), ha_truncate()), analysis (analyze()), optimization (ha_optimizer()), repair (ha_repair()),
row insertion (write_row()), deletion (delete_row()), update (update_row()), transaction handling
(ha_start_consistent_snapshot(), ha_commit_*, ha_rollback_trans()), multi-range reads
(multi_range_read_*), autoincrement (*_auto_increment), semi-consistent reads
(was_semi_consistent_read()), foreign key handling (*_fk_*, *_foreign_key_*) and locking
(*_lock_*).

Not all of the above operations have to be supported by each engine. The available capabilities can be announced using the handlerton::flags field and the handler's table_flags(), alter_table_flags() and partition_flags() methods. Furthermore, some query-specific facts can be communicated from the respective methods, e.g. from multi_range_read_info_const() using the HA_MRR_* constants. Some capabilities are implemented by the handler class itself and can be left in their default form unless a more efficient engine-specific implementation is possible.

Architecture of MySQL's CSV storage engine

For the needs of the CSV storage engine implementation, 33 methods of the handler class were reimplemented. The basic engine description methods include: table_type(), which returns the corresponding table type name ("CSV"); index_type(), which returns "NONE", as indexing is not supported in CSV tables; bas_ext(), which returns an array of meta/data file extensions (used e.g. by the default implementations of rename_table() and delete_table()); and table_flags(), which returns a set of flags communicating that the CSV engine doesn't support transactions (HA_NO_TRANSACTIONS), that row positions are not a fixed offset apart (HA_REC_NOT_IN_SEQ), that autoincrement is not supported (HA_NO_AUTO_INCREMENT), that the engine is capable of row-format and statement-format logging (HA_BINLOG_ROW_CAPABLE and HA_BINLOG_STMT_CAPABLE respectively) and that it supports the REPAIR TABLE command (HA_CAN_REPAIR). Since indexes are not supported, index_flags() returns 0, while max_record_length(), max_keys(), max_key_parts() and max_key_length() return the maximum record length (65535), the maximum number of keys (0), the maximum number of key parts (0) and the maximum key length (0) respectively. The extra() method serves as a means of communicating additional knowledge about table usage in the opposite direction - from MySQL to the table engine. The CSV implementation handles the HA_EXTRA_MARK_AS_LOG_TABLE operation, marking the table as a log table.

Methods used for query optimization include: scan_time(), which returns the estimated scan time; fast_key_read(), which returns 1 but will never be called due to the absence of index support; and estimate_rows_upper_bound(), which is supposed to return an upper bound on the number of records in the table but returns HA_POS_ERROR to indicate that it is not implemented. Furthermore, the info() method provides statistics for the optimizer. Currently it sets only the number of records, and if the exact value has not been determined it reports 2 in order to avoid erroneous optimizations such as rewriting the query with the values of the single row as constants.

Methods used for core I/O in the table include: create(), responsible for creating a new table with the given name and options (e.g. charset); check_if_incompatible_data(), which determines whether the specified options are compatible with the engine and in the case of CSV always returns 1; open(), responsible for opening the database file and allocating the structure shared between instances of the CSV engine for the given table; close(), which closes the database file and frees the shared structure if it is no longer used; write_row() for inserting a row; update_row() for updating a row; delete_row() for deleting a row; and delete_all_rows() for deleting all rows (used when DELETE is called without a WHERE clause). Finally, for locking support, the store_lock() method stores the internal table lock address in the provided pointer.

Since indexes are not supported only unordered scan functions are implemented: rnd_init() for
initializing an unordered scan, rnd_next() for retrieving next record, rnd_pos() for retrieving a record
from provided position, rnd_end() for finalizing an unordered scan, position() for storing the current
position in a scan into the provided buffer (to be used with rnd_pos()).

For crash detection and recovery the following methods are provided: check_and_repair() is just a wrapper for the repair() call; check() reads the database file row by row and reports failure if either parsing fails or the expected number of rows cannot be read from the file, additionally marking the table as crashed, a state later reported by the is_crashed() method. The repair() method's implementation is very rudimentary: all it does is truncate the file after the last row read without error. The auto_repair() method returns 1 to acknowledge that the engine supports automatic repair.

Analysis of MySQL's CSV storage engine

MySQL's CSV storage engine implementation shares many problems with the other solutions. The foremost are the lack of caching of the CSV structure (row and column start/end positions), the need to read whole rows in order to retrieve only a few columns, and the lack of caching/indexing of columns that are frequently used in queries or are indexing candidates because they drive sorting, filtering or grouping operations.

Although write support is outside the scope of this work, it is worth noting that, because of the way it is implemented in MySQL's CSV tables, each UPDATE or DELETE statement requires rewriting of the entire data file, which causes significant overhead. The write mechanism resembles the HSQLDB implementation in that new rows are always inserted at the end of the file, and an update of a row consists of removing the row and reinserting its updated version. Deletion of a row, however, is implemented differently: the delete_row() method keeps track of chains of deleted rows and during the call to rnd_end() the entire data file is recreated with the deleted rows skipped.

Due to strong typing, a rigid table layout needs to be specified by the user, and the engine constantly checks whether the values read from the CSV file are valid for the respective column types, raising errors if that is not the case. Notably, although this approach adds some overhead to each query, a full data scan is still necessary if, after the CREATE TABLE statement, the user decides to replace the newly created CSV file with an existing one.
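A minimal sketch of that workflow, assuming a database named testdb and MySQL's default data directory layout (table and column names are illustrative; the CSV engine requires all columns to be declared NOT NULL):

CREATE TABLE subjects_csv (
    subject_id INT NOT NULL,
    age        INT NOT NULL
) ENGINE=CSV;
-- the engine creates an empty subjects_csv.CSV data file; to attach an
-- existing file, replace <datadir>/testdb/subjects_csv.CSV with it and run:
FLUSH TABLES;
SELECT COUNT(*) FROM subjects_csv WHERE age > 60;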

Similarly to PostgreSQL, MySQL's implementation isn't well tailored for efficient in-situ CSV file access. Additionally, its realization as a storage engine rather than as an explicit external data management mechanism forces users to manually replace table files after creation in order to access existing CSV files.

3.1.3. Teiid

Teiid is a data virtualization system which allows applications to use a standardized API to access data from multiple, heterogeneous data stores. It supports handling relational, XML, XQuery and procedural queries. Teiid constitutes a part of JBoss Developer Studio (JBDS) - a development suite for creating rich web applications. JBDS was first released in 2007 and is currently developed in collaboration between JBoss (a division of Red Hat) and Exadel (a software engineering company). Teiid is advertised as being used by some of the Fortune 500 companies and government intelligence agencies, as well as many independent software vendors. Support for external data management is central to Teiid's design, and SQL is one of several supported query languages. The details of Teiid's external data management with respect to the CSV file format and an assessment of its performance are presented in the following sections.

Architecture of Teiid's TEXTTABLE implementation

In the following paragraphs I briefly describe how the essential components enabling CSV file access are instantiated and logically connected to one another in the embedded configuration of Teiid.

The comma-separated values format is accessible in Teiid using the TEXTTABLE function. An example statement defining a view over an existing CSV file in Teiid's DDL language is presented in Listing 3.1.3.1.

Listing 3.1.3.1. View definition for existing comma-separated values file

CREATE VIEW "MyModel" AS
  SELECT d.* FROM
    (CALL BigData.getTextFiles('nodb_test.csv')) f,
    TEXTTABLE(f.file
      COLUMNS "c0 integer, c1 integer, ..."
      DELIMITER ';'
      HEADER) d;

Such a statement is valid as the schema definition of a virtual model in a Teiid configuration with a file model defined (BigData is the name programmatically given to the file model).

A file model is provided via SimpleConnectionFactoryProvider, which is a wrapper for a BasicConnectionFactory implementation returning a new instance of the FileConnectionImpl class. This mechanism is a result of using the "provider model" design pattern, which enables runtime replacement of connection factories; the SimpleConnectionFactoryProvider, however, always delivers one predefined connection factory.

The definition of the getTextFiles() procedure is provided by the FileExecutionFactory class, which is associated with the BigData model using the addSourceMapping() method. FileExecutionFactory carries a @Translator(name="file", ...) annotation, thus the "file" string must be provided as the translator name to addSourceMapping(). The connJndi argument provides the JNDI (Java Naming and Directory Interface) name of the connection factory provider added to the EmbeddedServer with an addConnectionFactoryProvider() call.

For proper resolution of the dependencies declared above, FileExecutionFactory has to be added to
EmbeddedServer deployment using an addTranslator() call.

When an EmbeddedServer instance is deployed (using a deployVDB() call) with the default EmbeddedConfiguration and the two models - MyModel and BigData - standard SQL queries on MyModel become possible (Listing 3.1.3.2).

Listing 3.1.3.2. Example SQL query on MyModel

SELECT c0, c1, c2 FROM MyModel;

The logic responsible for evaluating the view definition, in particular the TEXTTABLE() call, starts at the parser (SQLParser.jj) level, which defines the syntax of this special function and the means of storing its invocation specification in a TextTable instance. TextTable is a child class of TableFunctionReference and as such is handled successively by the RelationalPlanner and PlanToProcessConverter classes, resulting in the creation of a TextTableNode instance in the final query execution process definition.

Analysis of Teiid's TextTableNode implementation

The TextTableNode class is responsible for CSV file processing in Teiid. It supports a header row,
configurable number of skipped rows, per-table column delimiters, quotation and escape characters
and fixed or variable column width. Text data are retrieved using the BufferedReader interface. Full
lines are always read regardless of the projection columns specified in the query. No caching
mechanism is used apart from the one provided by the Java file input/output classes. Separate code
paths are used for parsing fixed-width (parseFixedWidth()) and variable-width
(parseDelimitedLine()) columns. The header row is processed using the processHeader() method and a column-name-to-index map is stored. Values retrieved from data rows are selected according to
indices kept in the projectionIndexes list. If a non-matching SELECTOR prefix was specified for any
column, the whole row is skipped. Each value is converted to its corresponding data type declared in
TEXTTABLE() call using DataTypeManager.transformValue() method. The above-mentioned
processing mechanism is implemented asynchronously using java.util.concurrent.Executor object
shared with the command execution context. Thanks to this solution nodes producing independent
batches of results can be executed in parallel with TextTableNode.

Similarly to the external CSV access implementations in other DBMS, the main overhead in Teiid's solution is caused by repetitive reparsing of text lines without a supporting offset lookup structure, the absence of a caching mechanism for frequently accessed columns, numerous calls to value conversion routines and, in the case of queries that cannot benefit from execution parallelism, by the asynchronous execution pipeline itself.

Furthermore, it's worth noting that, in contrast to other databases, CSV file support in Teiid is implemented by means of a special function rather than a table declaration. Therefore, the TextTableNode object life cycle is limited to a single query, precluding any optimizations that require persistence across calls and adding the overhead of node initialization to each call.

The latter is somewhat mitigated by the presence of a result set cache and can be further circumvented using materialized views. Both of these possibilities, however, defeat the original goal of no-copy data access and, in the case of materialized views, add significant space and time costs, as by default data materialization includes all rows and columns regardless of query-specific needs. When the data size exceeds the free storage space, view materialization is not possible at all.

3.1.4. HSQLDB

HSQLDB is a relational database management system (RDBMS) written in Java. It supports a large subset of the SQL-92 and SQL:2008 standards within a small and fast engine using in-memory and/or on-disk tables. Its initial release was published in 2001 under the BSD license, which remains its current distribution model. HSQLDB is developed in open-source fashion by a group of independent developers coordinated by the Project Maintainer, currently Fred Toussi. The project also accepts occasional contributions from other open-source projects, as well as from commercial software developers. HSQLDB is used as the database and persistence engine in many open-source projects, including OpenOffice, Kepler, Liferay and Jitsi. It is also used in commercial software such as Mathematica, Jira, Confluence and TeamCity. HSQLDB's support for external data management is limited to text tables (CSV files) implemented by means of a dedicated CREATE TEXT TABLE statement. The details of the handling of such tables, along with a performance discussion, are presented in the following section.
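A minimal sketch of how such a table is attached, with an illustrative file name and source options (fs sets the field separator, ignore_first skips the header row):

CREATE TEXT TABLE subjects (
    subject_id INTEGER,
    age        INTEGER,
    diagnosis  VARCHAR(64)
);
-- bind the table to an existing CSV file; as discussed in the analysis
-- below, the whole file is scanned at this point for type correctness
SET TABLE subjects SOURCE 'subjects.csv;fs=,;ignore_first=true';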

Analysis of HSQLDB text table implementation

The implementation of text tables in HSQLDB revolves mainly around three classes: TextTable, TextFileReader and TextCache. TextTable reimplements some methods of the Table class in order to use text files as the backing store for the data, most importantly: connect(), which is responsible for connecting to the data source, retrieving the persistent store object, initializing the reader and cache objects, establishing collaboration between the aforementioned classes and ensuring correctness of the text input; disconnect(), which releases the persistent store object; setDataSource(), for swapping data sources for the same table definition; and insertData(), for inserting a new row of data from CREATE TABLE AS queries.

The rest of the operations are handled by the PersistentStore interface, which for text tables is implemented by the RowStoreAVLDiskData class. This implementation is a wrapper for the actual calls to the TextCache object. The core methods are: add() for inserting a row specified as a CachedObject, get() for inserting a row from a provided RowInputInterface object, getNewCachedObject() for creating and inserting a new row from an array of columns, indexRow() for updating the indexes relevant to the new row, removeAll() for removing all rows from the indexes and clearing the accessors list, remove() for removing a specified row, getAccessor() for retrieving a cached row, commitPersistence() for writing out a row (called only by insertData()), commitRow() for committing a row in a transaction, rollbackRow() for rolling back a row in a transaction, and release() for closing the TextCache object and uninitializing the RowStoreAVLDiskData object.

The TextCache class manages text data persistence using the RowInputText and RowOutputText classes (optionally their variants for quoted text). The key characteristic is the way adding/deleting/updating rows is handled. The saveRow() method is used directly or indirectly by other internal classes for inserting a new row; the insertion always happens at the end of the text file. The removePersistence() function is used for deleting a row. What it in fact does is fill all of the storage space used by the row with spaces followed by a newline character; deleting a row therefore leaves a blank row in the storage file. Updating a row in HSQLDB consists of deleting the row and then re-inserting it with the modified columns. Since the text representation can have a different length than before, the new row is appended at the end of the text file, again leaving a blank row behind. The relevant indexes are updated according to the operation performed.

The TextFileReader class is responsible for reading text files line by line, while RowInputText provides methods for reading column values and converting them to their declared data types. A shortcoming of HSQLDB with respect to handling text files is strong typing, as it motivated the authors to scan the whole text table already during the attachment phase in order to ensure type correctness. This adds to the significant overhead HSQLDB already incurs when handling text tables. The implementation of RowInputText uses multiple conditionals and specialized method calls to correctly convert all values; furthermore, each value is wrapped in a Java class corresponding to its type. All of this constitutes a significant impediment to performance, clearly visible in the benchmarks.

3.1.5. H2

H2 is a relational database management system (RDBMS) written in Java and available as open-source software under the Mozilla Public License or the Eclipse Public License. It was first released in 2005 and continues to be developed in an open-source manner by a group of developers identifying themselves as the H2 Group. Commercial support is provided by a long-time user and developer of derived software, Steve McLeod. H2 supports a subset of an unspecified edition of the SQL standard (the precise features supported are listed on the software website and include: referential integrity and foreign key constraints, inner and outer joins, subqueries, read-only and inline views, triggers and stored procedures, many built-in functions, large object support, sequences, autoincrement and computed columns, collation, and users and roles).

H2 has been used in a number of projects, including: AccuProcess, District Health Information Software 2, Flux, ImageMapper and Magnolia. Similarly to HSQLDB, H2's external data support is limited to CSV files and is discussed in further detail in the next section.

Analysis of H2's CSV file support

Similarly to Teiid, CSV data access in H2 is achieved by means of a dedicated function: CSVREAD(). The result of that function is wrapped in the ValueResultSet class (representing the "result set" data type), which in turn holds a reference to a SimpleResultSet instance created by the text data access object.

Text data access is realized via the Csv class, which implements the SimpleRowSource interface. The abovementioned SimpleResultSet holds a reference to the Csv instance, which allows rows to be read on demand (the readRow() method). Only variable-width (delimited) CSV files are supported. All columns are retrieved and stored in the result set regardless of the query dependencies, unless column names are provided in the CSVREAD() call. In that case, only as many columns as there are specified names are stored; however, in such a scenario no header row may be present within the CSV file itself, and the dropped columns are still parsed according to the CSV syntax.

The simplistic nature of the CSV parser is reflected by the SimpleResultSet interface, which provides no means of random data access (getRowId(), absolute(), moveToCurrentRow(), first(), last(), etc. throw a "Feature not supported" exception) and only limited type conversion routines (getDouble(), getFloat(), getInt(), getLong(), getShort(), getString()). Some other conversions (e.g. getDate()) will fail at runtime because they expect the object corresponding to the given column in a row to already be derived from the Date class (simple type casting).

The rows produced by the Csv class consist of String objects - a reasonable choice for a file format that does not support strict typing. Non-trivial conversions can be achieved by calls to specific functions accepting string arguments, e.g. PARSEDATETIME() to obtain a Date object.
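A short sketch of this usage pattern (the file name, column names and date format are illustrative; the third CSVREAD() argument is the space-separated option string described in the H2 documentation):

SELECT "subject_id",
       PARSEDATETIME("visit_date", 'yyyy-MM-dd') AS visit_ts
FROM CSVREAD('subjects.csv', NULL, 'charset=UTF-8 fieldSeparator=;');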

There is no caching mechanism specifically for CSVREAD() function calls. Acceleration of repeated
queries can be achieved only by means of result set caching and file input/output buffering in
standard Java classes.

3.1.6. SQLite

SQLite is a relational database management system (RDBMS) written in C and intended for embedded usage. Its public domain source code exposes a direct database API and contains no network client/server functionality. SQLite's initial release dates back to 2000 and its original and primary author, Dwayne Richard Hipp, continues to develop it to this day. SQLite is ACID-compliant, features dynamic and weak data typing and supports a well-defined subset of the SQL-92 standard, the most important exceptions being: RIGHT and FULL OUTER joins, complete ALTER TABLE support, complete TRIGGER support, writing to views, and GRANT and REVOKE. With respect to the latter two commands, each SQLite database is encapsulated within a single file and its permission system is based on the user's operating system rights to read/write that file, therefore offering only per-database authorization; finer-grained access control is not available. SQLite is cited as arguably the most widely deployed database engine due to its use in several widespread browsers (Firefox, Chrome, Safari), operating systems (Mac OS X, Android) and embedded systems (set-top boxes, automotive multimedia). SQLite's support for external data management comes in the form of a mechanism called virtual tables. A virtual table is one that provides data to the engine by means of function calls rather than residing in SQLite's native on-disk or in-memory format. The collection of functions required to that end is described in the next section, followed by a short discussion of a third-party plugin providing external CSV data support using virtual tables.

SQLite has a very small footprint and is well known for its use in embedded applications (e.g. in the Android operating system, where it forms part of the core system API). Internally, its logic is organized into the following core modules: the tokenizer, the parser, the code generator, the virtual database engine (VDBE) executing bytecode programs, the B-tree layer, the pager and the operating system interface.

Analysis of SQLite's Virtual Tables implementation

The collection of routines describing a virtual table is held together by sqlite3_module structure. The
list of functions (some of them optional) includes xCreate() for creating the virtual table schema in
response to CREATE VIRTUAL TABLE statement, xConnect() for opening a new connection to
already existing virtual table, xBestIndex() for determining best index to be used for given
combination of WHERE and ORDER BY clauses, xDisconnect() for releasing connection to a virtual
table, xDestroy() for releasing the connection and undoing the work of xCreate(), xOpen() for
allocating new cursor object for given virtual table, xClose() for releasing previously allocated cursor,
xEof() for determining whether the given cursor points to a valid data row, xFilter() for initiating a search in the virtual table, xNext() for advancing the search initiated by xFilter() to the next result row,
xColumn() for retrieving data from given column in the current result row, xRowid() for retrieving
row ID for the current result row, xUpdate() for (depending on the exact semantics) inserting,
updating or deleting rows in a virtual table, xFindFunction() for optional overloading of functions using virtual table columns, xBegin() for starting a transaction, xSync() for starting a two-phase
commit operation, xCommit() for finalizing the commit operation, xRollback() for handling
transaction rollback, xRename() for notifying about virtual table name change, xSavepoint() for
adding a rollback point within the current transaction, xRelease() for releasing a rollback point within
a transaction and finally xRollbackTo() for rolling back to the precise savepoint within the current
transaction.
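As a user-facing illustration of this interface, a virtual table backed by such a module is declared and queried with SQLite's standard syntax (the module name and its arguments below are placeholders, not a specific existing extension):

CREATE VIRTUAL TABLE external_data USING some_module('argument1', 'argument2');
SELECT * FROM external_data WHERE c0 > 10;  -- exercises xBestIndex/xFilter/xNext/xColumn
DROP TABLE external_data;                   -- invokes xDestroy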

The code referring directly or indirectly to those routines or otherwise pertinent to virtual tables
implementation is present in the following files: where.c (xBestIndex), vdbeaux.c (xClose), vdbe.c
(xOpen, xClose, xFilter, xEof, xColumn, xNext, xRename, xUpdate, xRowid), vtab.c (xConnect,
xDisconnect, xCreate, xDestroy, xSync, xBegin), update.c (generates Virtual Database Engine
(VDBE) operation codes (opcodes) for performing a virtual table update), tokenize.c (special
handling for calls from within xCreate), parse.y (parsing of CREATE VIRTUAL TABLE statement),
prepare.c (releasing of virtual table locks), main.c (disconnecting and rollback of transactions on
exit), loadext.c (loading of virtual table extensions), insert.c (generating opcodes for virtual table
inserts), expr.c (function overloading, i.e. xFindFunction), delete.c (generates opcodes for deleting
rows from a virtual table), build.c (handling of CREATE TABLE, DROP TABLE, CREATE
INDEX, DROP INDEX, BEGIN TRANSACTION, COMMIT, ROLLBACK for virtual tables),
auth.c (authentication exemption for calls from within xCreate) and alter.c (rename notification).

Within the SQLite codebase, the full-text search and R-Tree index features are implemented using the virtual tables mechanism. However, there is no built-in extension for handling external CSV files; such functionality is provided by the external SQLITEODBC project [43]. The csvtable.c file within that project can be used as a separate module with functionality corresponding closely to the similar implementations in HSQLDB, Teiid, etc. This implementation has been tested against the other in-situ CSV solutions in Table 5.3.5 and Table 5.3.6. Due to its short-lived, row-based (as opposed to random access) column look-up cache and the lack of a long-term cache or indexing, the results of SQLite+CSVTable are comparable to Teiid's, despite the seemingly more sophisticated two-level optimization (SQL and VDBE) and SQLite's excellent performance with native tables.

Another example of usage of SQLite's virtual tables implementation is the GNOME DB project and
its libgda library. In this case however, the virtual table modules support connections to other
databases (MySQL, Firebird, PostgreSQL, Oracle, JDBC, etc.) rather than external non-DB data.

3.1.7. MonetDB

MonetDB is an open-source column-oriented database management system developed by Centrum Wiskunde & Informatica (CWI) in the Netherlands. MonetDB's roots reach back to the 1990s, but in its current form it was first created in 2002 by the doctoral student Peter Alexander Boncz and professor Martin L. Kersten. Its first open-source release (under the Mozilla Public License) took place in 2004. MonetDB was designed to provide high performance for complex queries against large databases. Its lowest-level native language, MAL (MonetDB Assembly Language), is founded on a binary association table (BAT) algebra. Binary tables are tables that consist of only two columns; by providing efficient mechanisms for combining such tables horizontally into multiple association tables (MAT), MonetDB is able to store and process tables of arbitrary dimensions with the typical 2D layout known from other DBMS. On top of MAL, MonetDB contains a module which can translate SQL semantics for data organization (i.e. classical tables) and querying into MAL and execute them. The reference standard for MonetDB was SQL:2003, with less frequently used features left unimplemented and some functionality added beyond the standard. MonetDB has found uses in online analytical processing (OLAP), data mining, GIS, streaming data processing, text retrieval and sequence alignment processing. Significant projects advertising their use of MonetDB include: LEO (Linked Open Earth Observation Data for Precision Farming), RETHINK big, COMMIT/, CoherentPaaS and, last but not least, the Human Brain Project. The module most similar to SQL/MED within MonetDB is called Data Vaults and aims to provide transparent integration with distributed/remote file repositories. It is, however, a work in progress and currently only a limited number of Data Vaults exist, with the most common format among competitors - CSV - actually missing. The following section contains a discussion of the mechanism, its current limitations and pending future improvements.

Analysis of MonetDB Data Vault implementation

At the moment, the Data Vault implementation in MonetDB is in its early stages. The interface doesn't allow for in-place data access or on-demand data loading. It exposes a number of functions in the vault, netcdf, fits, mseed, geotiff and gdal modules, which correspond roughly to the bulk copying functionality of COPY INTO, albeit for file types other than CSV.
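For reference, the bulk-loading behaviour that these modules mirror is that of MonetDB's COPY INTO statement, sketched below for a CSV file (table name and path are illustrative; the target table must already exist with a matching column layout):

COPY INTO subjects
FROM '/data/subjects.csv'
USING DELIMITERS ',', '\n', '"';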

The vault module provides generic remote file staging functionality using the cURL library. The netcdf module allows for importing NetCDF (Network Common Data Form) array-oriented scientific data files. The fits module allows for the import of FITS (Flexible Image Transport System) format images. mseed provides the same facilities for the MSEED (mini Standard for the Exchange of Earthquake Data) seismographic data format, geotiff for TIFF files with embedded georeferencing information and, finally, gdal supports the Geospatial Data Abstraction Library. Each of them requires calling an import() or equivalent function, which starts a lengthy process of complete data import into MonetDB's SQL catalog.

In [44], the authors describe a means of achieving on-demand data materialisation by having the data vault optimizer recognize particular access patterns to SciQL arrays and invoke the data wrapper's import procedure only when the required data are not yet cached/materialised. This seems to be a good approach to coarse (file-at-a-time) on-demand data loading; however, it is not yet implemented in the publicly available MonetDB code repositories. At the time of this writing, the most up-to-date SciQL-2 branch of MonetDB does contain a SciQL array materialization code injection block in the sql/backends/monet5/sql_gencode.c file, yet neither the implementation of the sciql.materialize() routine nor any other place in the vaults module, the monet5 SQL backend or the SQL server modules seems to contain any references to actual data wrapper import invocations. Finally, the compiled version doesn't exhibit on-demand loading behavior.

To implement the above functionality, the optimizer would have to know which data wrappers are
responsible for materialising specific arrays. This kind of mapping could be part of the SciQL arrays
interface, e.g. specified in the CREATE ARRAY statement and stored alongside the array definition.
Alternatively, an array naming convention could indicate candidate data wrappers for the additional materialisation code (i.e. reading from files).

Regardless of the implementation, there are two conceptual problems with the above architecture. Firstly, access to a single pixel of an image triggers the import of the whole file (lack of granularity). Secondly, even though the authors are correct in writing that a single scientific analysis is usually limited to a small subset of all available data, after a couple of such analyses (or in the unlikely case of a query involving all data at once) the database would inevitably end up holding a copy of all the original data. Therefore, in the current implementation, Data Vaults are still closer to a variant of bulk copying than to external data management in the sense of SQL/MED.

The granularity problem could be resolved by pushing the materialisation responsibility up to the array slicing operators; however, this would require keeping a map of the regions which have already been retrieved from the files and cross-referencing it each time a not fully materialised array is accessed (or until the engine decides to import a frequently accessed array fully despite only slices being accessed). Dividing the input array into bulk regions rather than precise slices could further simplify the mapping approach.

Another way would be to depart from mapping external files to SciQL arrays and instead map them to entire tables. This method would provide natural column-wise data retrieval. The table-to-data-wrapper mapping would have to be stored as part of the CREATE TABLE definition, and the SQL optimizer would have to retrieve the non-materialised columns each time they appear as dependencies of a statement. The latter can be conveniently determined using the stmt_list_dependencies() routine with the COLUMN_DEPENDENCY parameter.

The second issue, however, would require a paradigm change from "bringing the data in" to "carrying the calculations out". This is the approach of the other DBMS which support in-situ access to data (e.g. PostgreSQL, Teiid), performing only the necessary type conversions on the fly while reading records directly from the source file.

3.2. NoSQL Databases

NoSQL databases provide means for the storage and retrieval of data using models other than tabular relations. Such approaches have existed since the 1960s but were named "NoSQL" only recently, while gaining popularity due to adoption by Web 2.0 companies such as Facebook, Google and Amazon. The term originally referred to "non-SQL" databases, highlighting the fact that existing implementations didn't support the SQL query language. More recently the term has been used to signify "not only SQL", as NoSQL solutions are beginning to implement query languages similar to SQL.

The principal characteristic of NoSQL databases is that they do not use the tabular relations model.
Instead they rely on simpler abstractions such as key/value stores, document stores, column-family
stores and graph stores. Further sub-variants of the above main approaches can include disk-based and
in-memory implementations (often referred to as caches), strongly consistent and eventually
consistent implementations, ordered and unordered key/value stores and special variants of key/value
models (such as triple/quad stores) and graph models (e.g. named graphs).

In key/value stores the data are modelled as opaque binary values of arbitrary length and their content
cannot be used by the database to filter the results of the query. The keys on the other hand consist of
binary values of arbitrary length and are indexed using a B-tree or similar structure with lexicographic
order. Due to properties of the indexing structure, keys can be used to efficiently select single entries
or ranges of entries from the database. Key/value stores are much simpler than relational databases
and in fact can serve as storage mechanism for relational DBMSes [45]. By themselves key/value
stores do not provide a query language. Instead an API is used to formulate the simple insertion,
deletion and lookup requests. As a consequence, there is no mechanism to reflect relations between records unless they share the same primary key, and no mechanism to formulate a JOIN query even if they do. Any JOIN operations must take place in the client code acting on results retrieved using the simple lookup operations. Examples of key/value stores include: BerkeleyDB [46], Redis [47], Riak [48] and Infinispan [49]. Key/value store concepts are also exposed in some multi-paradigm databases such as OrientDB [50].

Document stores resemble key/value stores in the fact that at their core they contain mappings
between unique document IDs (keys) and documents (values). Their defining characteristic however
is that they provide mechanisms for querying based not only on IDs but also on document contents.
The documents themselves refer usually to structured data in XML, YAML, JSON or BSON formats
and their contents can also be partially or completely indexed either explicitly or implicitly, usually
on an on-demand basis. Document store implementations often provide means of organizing documents into groups using collections, tags, non-visible metadata or directory hierarchies. With respect to RDBMSes, collections of documents vaguely correspond to tables whereas documents correspond to rows; however, as opposed to RDBMSes, each document can have a completely different structure. Similarly to key/value stores, relations do not exist as a predefined concept in document stores and as such JOIN operations have to be implemented by the client code. Some efforts have been made to standardize a query language for document stores (Unstructured Data Query Language - UnQL [51]), however those projects have come to a halt as of the time of this writing. Examples of document stores include: CouchDB [52], MongoDB [53], Clusterpoint [54] and Lotus Notes [55].

Column-family stores (also known as wide columnar stores) derive primarily from Google Bigtable
[56], where data are stored in column-oriented way. Bigtable maps two arbitrary string values (row
key and column name) and a timestamp into an associated arbitrary binary value. The original
architecture description [57] outlines a further organization of columns into column families (e.g.
"contents:" and "anchor:") which prefix the column name. In the illustrated use-case row key specifies
a website URL. The "contents:" column family contains only one column and refers to contents of a
website with multiple timestamps, whereas "anchor:" family contains one column per each website
referring to the website specified by row key and each such column contains the text of the hyperlink
used as reference. This constitutes the foundation for Google's website caching mechanism and
ranking algorithm based on references from other websites. Column-family stores can have arbitrary
number of column families. Bigtable as well as other column-family stores are designed to scale
horizontally and handle data in the petabyte range using clusters or cloud computing. Similarly to
key/value and document stores, column-family stores do not provide a notion of relation and JOIN
queries are not possible directly within the database engine. Some implementations support query
languages similar to SQL although without syntax for JOIN or nested queries. Examples of column-
family stores include: Bigtable [56], Cassandra [58], Druid [59] and HBase [60].

Graph stores refer to a class of database engines using data storage optimized towards representing
graph structures. Graphs are mathematical objects consisting of vertices, edges and attributes. An
edge connects two vertices and both vertices and edges can have an arbitrary number of attributes.
This entails a storage model completely different from that of key/value, document and column-family stores. An example organization of such storage could contain four separate key/value stores for IDs, vertices, edges and attributes respectively, with fixed-size keys and values (as they are used only for defining relationships between different entities), as well as an indexed block storage holding the values of vertex/edge names and the names and values of attributes. This amounts to a much more
complicated architecture which is suitable in scenarios such as social networking applications, pattern
recognition, dependency analysis, recommendation systems and solving path-finding problems in
navigational systems. However, graph databases are not as efficient as other NoSQL solutions in
scenarios other than handling graphs and relationships. JOIN queries are possible in graph databases
given that tabular data are correctly converted to a corresponding graph representation. SPARQL
(recursive acronym for SPARQL Protocol and RDF Query Language) is a standard Resource
Description Framework (RDF) query language supported by many graph databases. Examples of
graph databases include: Neo4j [61], HyperGraphDB [62], AllegroGraph [63] and InfiniteGraph
[64].

The topic of external data management is generally not explored by existing literature in the context
of NoSQL databases, as their particular storage models do not map easily to the variety of data
formats used in research and industry. Furthermore, the lack of standardized query languages and the absence of easily formulated JOIN operations hinder interactive data exploration and pose the additional challenge of writing optimal algorithms for data filtering and joining on the client side. Among the existing NoSQL implementations, none provides support for external data management as understood by this work and the SQL/MED specification.

The following sections discuss the most popular open-source NoSQL solutions in the context of the possibility of implementing external data management.

3.2.1. CouchDB

CouchDB [52] is an open-source NoSQL database based on the document store model. It was first
released in 2005 and its initial authors included Damien Katz, Jan Lehnardt, Noah Slater, Christopher
Lenz, J. Chris Anderson, Paul Davis, Adam Kocoloski, Jason Davies, Benoît Chesneau, Filipe
Manana and Robert Newson. It is currently developed within the Apache Software Foundation.
CouchDB is written in Erlang and supports cross-platform deployment. Official distribution packages
include versions for Windows, Mac OS X and Ubuntu Linux. CouchDB is released under the Apache License 2.0.

CouchDB uses JSON to store documents and JavaScript as its query language. Its API is exposed via
an HTTP server and views involving aggregate functions and filters are calculated in parallel in a
framework similar to MapReduce [65], where filters correspond to the "map" step and aggregation to the "reduce" step. The "map" step, apart from manipulating any document data, can also provide a new primary key for the view. Subsequently, views are indexed by the primary key and can be further queried for particular values or ranges of the primary key.

For indexing keys, CouchDB uses a custom implementation of a B+tree with added multiversion concurrency control (MVCC) support in order to avoid the necessity of locking the database during writes. The user is responsible for eventual conflict resolution using the revision information stored along with the documents. MVCC in CouchDB is implemented by means of append-only, log-based file storage with good performance and compatibility with both SSDs and magnetic hard drives. However, it can cause significant disk space overhead in the case of numerous update operations, as well as require periodic compacting operations. A compacting operation consists of writing out the "live" version of the database to a new file, swapping it in for the old one and deleting the latter.

CouchDB supports document-level ACID semantics with eventual consistency, incremental (based
only on those documents which have changed) MapReduce and incremental replication. One of its
characteristic features is multi-master replication which allows it to scale to many machines in order
to build high performance systems.

CouchDB is suitable for a range of applications from Web solutions involving data accumulation and
versioning (e.g. Customer Relationship Management, Content Management Systems) to mobile devices
due to CouchDB's ability to work offline and perform master-master synchronization when network
access is restored. CouchDB is used by a number of notable companies including: Amadeus IT
Group, Ubuntu, BBC, Credit Suisse, Meebo and Sophos.

In the following sections, CouchDB's support for SQL-like query language, JOIN queries and
external data management will be discussed.

Support for SQL-like query language

In 2011, advancements [51] were made towards establishing an Unstructured Data Query Language (UnQL) standard, originating from the efforts of Couchbase [66] and SQLite [67] developers. Currently, however, the project has come to a halt and the original specifications and syntax diagrams are no longer available online.

There is an unofficial UnQL wrapper available for CouchDB [68] created by Sam Gentle. It provides
support for the following semantics from the UnQL specification: SELECT [expression] FROM db
[WHERE condition], INSERT INTO db VALUE data, INSERT INTO db SELECT [expression]
FROM db [WHERE condition], UPDATE db SET foo=bar,foo2=bar2, DELETE FROM db
[WHERE condition], CREATE/DROP COLLECTION db, SHOW COLLECTIONS but lacks
implementation of others: UPDATE ... ELSE INSERT ..., EXPLAIN, CREATE/DROP INDEX,
BEGIN/COMMIT/ROLLBACK, GROUP BY/HAVING, ORDER BY/LIMIT/OFFSET, JOIN,
UNION, INTERSECT. Notably, no support for JOINs exists at the time of this writing, and this project has come to a halt as well.

Support for JOIN-like queries

An alternative way of obtaining results similar to a join in CouchDB is the Linked Documents functionality. When querying a view, setting the include_docs parameter to true normally requests that, along with each row, the document used to generate that row is fetched as well. However, if a "map" function emits rows containing an "_id" key and such a view is queried with the include_docs parameter set to true, CouchDB will fetch the documents indicated by the value corresponding to the "_id" key instead of the documents used to generate the rows. In this manner a simple single JOIN can be performed by the database engine. Further joins can be realized by querying the resulting view, provided that it too contains rows with "_id" keys. Alternatively, an intermediate step generating new views can always precede the querying step, introducing the values corresponding to any other key as the "_id" values. During each query, filtering by primary key can take place. These two mechanisms can account for SQL-like JOIN semantics but have to be managed and optimized by the user, which is significantly more challenging than in an RDBMS.

Possible Implementation of External Data Management

The current version of CouchDB doesn't provide any built-in mechanisms for external data management or data virtualization. The Linked Documents mechanism could be extended to fetch external JSON or arbitrary binary files when the "_id" field specifies a filesystem path rather than a document ID. Since the CouchDB REST API requires that results be returned as JSON-formatted objects, in the case of non-JSON sources a middleware responsible for converting them to a JSON representation would be necessary for each supported file format. Currently, JSON parsing within CouchDB happens one document at a time, which can pose an insurmountable obstacle when handling neuroimaging data. A JSON representation of a typical neuroimaging document would consist of an array containing hundreds of thousands to millions of numbers, and the time required for encoding and decoding JSON documents of this size is prohibitive compared to handling raw binary data. Crucially, for the "map" and "reduce" functions the JSON representation could be provided by means of a proxy object fetching the necessary values from the referenced files on demand. This would require changes in the existing JavaScript query server to support such lazy evaluation, as well as changes to the CouchDB Erlang core and the JavaScript query server in order to support passing such "pointers". The mechanism described would allow for efficient processing of external data sources within the CouchDB query servers, leaving the JSON-based REST API as the bottleneck for passing data to client software. This could be addressed by introducing an alternative binary API protocol combined with local interprocess communication mechanisms for locally deployed CouchDB instances.

3.2.2. Cassandra

Cassandra [58] is an open-source NoSQL database implementing the column-family model. Its original authors were Avinash Lakshman and Prashant Malik, who created Cassandra while working for Facebook in order to support its Inbox Search feature. Its first public release took place in 2008 on the Google Code platform. In 2009 it became an Apache Incubator project and nowadays, similarly to CouchDB, it is developed within the Apache Software Foundation. Cassandra is written in Java and the official distribution package contains cross-platform Java binaries. Cassandra is released under the Apache License 2.0.

Cassandra implements a distributed, decentralized row store where rows are organized into column families (referred to as tables since Cassandra Query Language 3 - CQL3). A table's primary key consists of the partition key and a number of additional (clustering) columns. Rows are distributed across nodes according to the partition key. The partitioning can happen either in a random fashion or in a way that preserves the natural order of the keys. Random partitioning ensures an equal distribution of key/value pairs, whereas order-preserving partitioning allows similar keys to stay together at the cost of an unequal distribution. Within a Cassandra node, rows are further clustered by the remaining columns of the key. Additional indexes can be created independently for arbitrary columns.
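A small CQL sketch of this layout (table and column names are illustrative): subject_id acts as the partition key driving row distribution, scan_date is a clustering column ordering rows within a partition, and a secondary index covers an arbitrary additional column.

CREATE TABLE scans (
    subject_id text,
    scan_date  timestamp,
    modality   text,
    file_path  text,
    PRIMARY KEY ((subject_id), scan_date)
);
CREATE INDEX scans_modality_idx ON scans (modality);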

Cassandra introduced Cassandra Query Language (CQL) as an SQL-like alternative to its native
Remote Procedure Call (RPC) interface. CQL drivers are implemented for Java (JDBC), Python
(DBAPI2), Node.js (Helenus), Go (gocql) and C++. CQL supports most of the SQL syntax with the
exception of JOINs and subqueries. This limitation is caused by the distributed nature of Cassandra
which precludes RDBMS-style JOINs as they would require all necessary data first to be assembled
on one of the nodes in order to use one of the classical JOIN algorithms. Suggested workarounds
include fetching all the necessary data in the user code and performing JOINs manually either on-
demand or by deploying a periodic joined view materialization task which would save the manual
JOIN results for reuse.

Other features of Cassandra include: support for replication and multi-data-center replication with
tunable number and location of replicas at both data-center and rack level, consistency level tunable
separately for each request letting the client decide how consistent the data must be (ranging from
complete consistency across all replicas to consistency local to one replica with different quorum
settings in-between), scalability (linear with the number of nodes), fault-tolerance (failed nodes can
be replaced with no downtime) and MapReduce integration.

Due to its highly robust and scalable nature, Cassandra is suitable for massively parallel Big Data processing solutions that cannot afford to lose data even when an entire data center goes down. Prominent users of Cassandra include: CERN (archiving the data acquisition system's monitoring information in the ATLAS experiment), Facebook (powering the Inbox Search feature using 200 nodes), Wikimedia (backend storage for the REST Content API) and Apple (reported to host 100,000 Cassandra nodes for use with unspecified products and services).

In the following sections, Cassandra's support for the SQL query language, workarounds for JOIN queries and a potential external data management implementation are discussed.

Support for JOIN queries and SQL

Cassandra doesn't natively support JOIN queries. Workarounds involve using an external database or computing software. MariaDB [69] (a fork of MySQL [70]) implements the Cassandra storage engine [71], which allows Cassandra tables to be represented within MariaDB. Such tables support efficient lookups by primary key and their data are fetched on demand from the Cassandra servers. Using MariaDB it is therefore possible to execute a JOIN on two Cassandra tables by first reading the relevant filtered data from one table and then, based on the retrieved records, reading data from the second table using primary key lookups where possible. This solution has the added benefit of MariaDB's SQL support, which can be freely used with Cassandra tables.

Another solution involves using DataStax [72] with Apache Spark [73]. DataStax is a Cassandra
distribution including a collection of extensions aiming to provide additional functionality and
interoperability with third party software packages. Apache Spark is an open-source cluster
computing framework providing distributed implementations for many algorithms used commonly in
data processing, including among many others: filtering, grouping and joins. Detailed instructions for
a setup using Spark algorithms to perform joins on Cassandra tables are available [74].

Possible Implementation of External Data Management

Cassandra doesn't provide any built-in mechanisms for external data management or data virtualization. Furthermore, due to its distributed nature, Cassandra doesn't natively implement JOINs, subqueries, or grouping and ordering of results. Therefore, without the support of the previously mentioned third-party software, it is of limited use for neuroimaging applications. When Cassandra is used in conjunction with MariaDB or Spark, a solution implementing external data access directly in those software packages could be both more maintainable and better performing. On the other hand, an external data management implementation in Cassandra would offer benefits for a certain class of queries, provided that it exploited the underlying model of column-family stores. If rows were defined as corresponding to neuroimaging files and the set of column families contained a one-column family for metadata and a large family of columns for imaging data, such a schema would offer easy cross-population lookups of selected voxel values. Since external imaging data are by default not clustered the way Cassandra clusters values based on the columns in the tail of the primary key, they would have to be stored on a special content-aware filesystem. Such a hypothetical filesystem could recognize the dimensions of neuroimaging files and arrange the corresponding voxels to stay close together on the storage media. Nevertheless, this novel approach would require nontrivial modifications to the Cassandra codebase and significant effort to develop the external filesystem, and at the time of this writing it remains purely speculative.

3.2.3. Neo4j

Neo4j [61] is the most popular graph database. It was developed by Neo Technology, Inc. based in
San Francisco Bay Area, US and Malmö, Sweden. Neo4j was first released in 2007, with version 1.0
following in 2010 and version 2.0 in 2013. It is written in Java and distributed under open-source
model including General Public License (GPL) v3 for the database itself and Affero General Public
License (AGPL) v3 for additional modules such as Online Backup and High Availability modules.
Alternatively, both database and modules are available under a commercial license in a dual-licensing
scheme.

Neo4j's data model consists of nodes, relationships, properties and labels. Nodes are connected by
relationships which can be traversed equally efficiently in both directions. Both nodes and
relationships can be assigned an arbitrary number of properties and labels. Properties are named
values and are used to describe node-specific data whereas labels are string values used to assign
roles or types to nodes and relationships and are used by the database to group them together. Labels
allow database queries to work with selected labeled sets instead of the whole graph making queries
easier and more efficient to execute.

Internally Neo4j stores all data using linked lists of fixed-size records on a disk. Elements of the
property chain contain key/value pairs and pointers to the next element. Short key/value strings/arrays
are stored together with an element, whereas longer strings/arrays are placed in a separate store and a
pointer is saved in the property chain. Each node references its first property record as well as first
element in its relationship chain. Elements in relationship chain contain references to start/end nodes
as well as the previous/next relationship for the start/end node respectively. This enables fast
relationship traversal without indirection via the participating nodes. It also allows relationship
traversal in both directions to be equally efficient. In the end, Neo4j consists of seven stores located
in separate files: node store, relationship store, relationship type (label) store, property store, property
key store (for long keys), string store (for long string values) and array store (for long array values).

Neo4j implements a REST API using JSON syntax to specify queries and retrieve resulting data.
Language-specific drivers wrap the same APIs in more convenient abstractions such as Object-
Graph-Mapping (OGM) and provide type conversion utilities, lifecycle management, automatic
persistence, etc. Language drivers exist for Java, Python, .NET, JavaScript, Ruby, PHP, Perl, R, Go,
Clojure and Haskell. By default queries are formulated using Neo4j's native Cypher query language
with additional plugins supporting Gremlin [75] graph traversal language. In addition to the REST
API Neo4j offers a Native Java API for executing operations by directly calling Neo4j kernel code.
Other features of Neo4j include: true ACID transactions, high availability support, scalability to
billions of nodes and relationships, high speed querying through traversal and fulltext indexing
support via Apache Lucene [76].

Thanks to its world-leading support for highly interconnected semi-structured data Neo4j is used by
numerous enterprises ranging from Global 500 companies to startups. Prominent users include: Cisco
(Content Management, Master Data Management), Lufthansa (Graph-Based Search), Tomtom
(Navigation), Ebay (Logistics) and Adidas (Real-time Recommendation Engine). Other application
areas include: fraud detection, identity and access management, network and IT operations and social
networking.

In the following sections, Neo4j's capabilities with respect to SQL-like querying and JOIN-like
queries will be discussed, followed by discussion of hypothetical external data management
implementation.

Support for JOIN queries and SQL

The property graph model implemented by Neo4j is more generic than the tabular relational model, and
graph traversal operations, which constitute the equivalents of JOIN operations in an RDBMS, are
natively supported in Neo4j. The Cypher query language was modelled after SQL and consists of clauses,
keywords and expressions, many of which have direct SQL equivalents (e.g. WHERE, ORDER
BY, LIMIT). Furthermore, Cypher implements a graph pattern matching syntax which makes it possible
to conveniently define one-to-one, many-to-one, one-to-many and many-to-many relationships without
intermediary structures, as opposed to RDBMSes (which require pivot tables for many-to-many
relations). In an example of such a Cypher query:

MATCH (p:Product {productName:"Chocolate"})<-[:PRODUCT]-(:Order)<-[:ORDERED]-(c:Customer)


RETURN DISTINCT c.companyName;

the pattern following the MATCH keyword is interpreted as: a :Customer node related to an :Order node
by an :ORDERED relationship, and the :Order node related by a :PRODUCT relationship to a :Product
node whose "productName" key holds the value "Chocolate". The RETURN DISTINCT clause instructs
Neo4j to return only unique values of the "companyName" key of the :Customer nodes matched by this
query. The query therefore returns the list of names of companies that have ever ordered chocolate. This
powerful paradigm and query language are well suited to handling highly interconnected semi-structured
data. Neo4j also supports grouping, aggregation and subqueries in a manner similar to SQL.

Possible Implementation of External Data Management

Neo4j doesn't provide any built-in mechanisms for external data management or data virtualization.
Furthermore, its specific data model, which offers the biggest advantages for highly interconnected semi-
structured data, applies better to certain classes of neuroimaging analysis results (e.g. tractography,
connectivity analysis, neuron tracing) than to raw data. The aforementioned types of files also require a
significant volume of storage space, thus a graph data virtualization mechanism would constitute an
efficient solution for interoperability between existing dedicated analytics software and database-driven
research. Implementation of external data management for Neo4j could accompany raw data
virtualization solutions and become the next stage of in-situ neuroimaging analytics.

Adding the proposed functionality would require defining mappings between Neo4j's internal
representation of graph data structures and existing types of analytic result file formats, such as
Bfloat (the tractography streamlines format used by the Camino Diffusion MRI Toolkit [10]), MRML
(Medical Reality Modeling Language [11]) or the TrackVis [12] track format for tractography data;
graph file formats such as GraphML (Graph Markup Language [13]), GEXF (Graph Exchange XML
Format [14]) and CIFTI (Connectivity Informatics Technology Initiative [15]) for connectivity data;
and finally formats used for neuron morphology reconstructions like the Neurolucida [16] file format,
the MorphML schema from the NeuroML [17] modeling language specification and one of the most
popular, SWC [18], used extensively by the NeuroMorpho.org online neuron morphology database [77].

Furthermore, reimplementation of Neo4j classes would be necessary for all types of stores to account
for graphs spread across multiple files and potentially introduce some implicit relations between
them. Ideally, the improved version would transparently integrate with existing storage mechanism to
allow write access and optional replication of external files transparently converted to Neo4j's native
format across multiple server instances. Conceivably additional type of store for holding references to
external documents would need to be introduced and a subrange of each ID class would have to be
reserved for external documents. IDs in Neo4j are 35-bit integers for nodes and relationships, 36- to
38-bit integers for properties depending on property type and 16-bit integers for relationship types.
Using 1-bit for denoting external entities would reduce the capacity of Neo4j to ~17 billion of internal
nodes and relationships, ~34 billion up to ~137 billion for properties and ~32000 for relationship
types. The same numbers would be available for external entities. While certainly sufficient for needs

56
of current studies, those limitations could soon become a bottleneck and require updates to Neo4j's
disk storage format. Nevertheless, the perspective of external data access from Neo4j seems like an
interesting venue, although purely speculative at the time of this writing.

3.2.4. Redis

Redis [47] is a networked, in-memory key/value store with optional durability. It is developed by
Salvatore Sanfilippo and sponsored by Redis Labs (a cloud database service provider) since June 2015.
In the past it was also sponsored by Pivotal Software and VMware. Redis was first released in 2009
and at the time of this writing the most recent releases of 2.x and 3.x branches date to September
2015. Redis is written in ANSI C and distributed under BSD License. It features official support for
Linux and Mac OS X, however an unofficial port by Microsoft Open Tech group exists for the 64-bit
Windows platform as well. Redis has been ranked by DB-Engines.com [78] as the most popular
key/value store.

Redis data model consists of keys and values where keys can be arbitrary strings, while values can be
not only strings but also abstract data types such as: lists of strings, sets of strings, sorted sets of
strings, hash tables and HyperLogLogs [79]. Redis allows additional operations to be performed on the
values themselves, depending on their data type. This characteristic is the main difference between Redis and
other key/value stores. Examples of high-level, atomic, server-side operations supported by Redis
include: intersection, union, and difference between sets and sorting of lists, sets and sorted sets.

Internally, Redis uses a C dynamic string library [80] for handling strings (to reduce the amount of
memory allocation necessary for appends). Linked lists are used for storing lists. Hash tables
represent sets and hashes. Finally, skip lists [81] are used for implementing sorted sets. As an in-
memory database, Redis uses the same implementation of hash tables to store the primary keys as
well as the values. Optional persistence in Redis is realized by means of on-disk storage using an append-
only file (journal) which logs all write operations performed by the server. This file is periodically
rewritten to remove redundant entries and prevent infinite growth. By default, Redis synchronizes the on-
disk storage every two seconds, resulting in only a few seconds of data loss in case of system failure.
Redis also features a master-slave cluster model allowing for data replication and sharding with
eventual consistency guarantees.

Redis employs REdis Serialization Protocol (RESP) to communicate with clients. It's a human-
readable text protocol which offers simplicity of implementation and high parsing speed. RESP is
used to serialize different data types supported by Redis and introduces separate type for
communicating errors. Thanks to RESP, Redis language bindings are easily implemented and
currently multiple versions exist for each of the languages such as: C, C++, Clojure, Common Lisp,
Erlang, Go, Haskell, Java, JavaScript, Lua, Objective-C, Perl, PHP, Python, R, Ruby, Scala and many
others.

Thanks to its excellent performance Redis has found numerous specialized applications in fields
where consistency and durability constraints are not mission critical. Prominent users of Redis
include: Instagram (used to run the main feed, activity feed and session store), GitHub (used for
exception handling, queue management, configuration management and as part of a data routing
mechanism), Stack Overflow (used as caching layer for the entire network) and Tumblr (for
dashboard notifications).

In the following sections, Redis's capabilities with respect to SQL-like querying and JOIN-like
queries will be discussed, followed by discussion of hypothetical external data management
implementation.

Support for JOIN queries and SQL

Redis doesn't natively support either JOINs or an SQL-like query language. However, due to the basic
nature of key/value stores, which makes all of them potential candidates for storage engines in
RDBMSes, Redis (similarly to Cassandra [71] and Berkeley DB [45]) could potentially be used in
this role in one of the existing relational databases such as MySQL or PostgreSQL. Nevertheless, no
such efforts exist at the time of this writing.

In 2012, a project called Thredis [82] emerged, aiming to integrate Redis with SQLite by using the
Virtual Tables mechanism of the latter. The existing implementation allows retrieval of a table containing
a pair of columns corresponding to keys and values from Redis. Abstract values, however, are not
further decoded and, depending on their type, either a binary string or a pointer to a Redis native data
structure is returned. As SQLite is usually used as an embedded database itself, it is possible to use the
Redis internal API to further handle the pointer. This, however, precludes handling of abstract values
using SQL syntax. Nevertheless, Thredis allows JOINs to be performed on keys and simple values within
a Redis table as well as between a Redis table and native SQLite tables. Currently, the project has come to
a halt.

Possible Implementation of External Data Management

Redis doesn't provide any built-in mechanisms for external data management or data virtualization.
Due to its ability to handle complex value types, a hypothetical implementation of external data
access could rely on new data types (such as an array) to represent the contents of specific file formats.
Since Redis is an in-memory database, such types could rely on direct memory-mapping of external
files into the Redis address space, as long as appropriate functions were provided to convert their
structures to the RESP format for output and optionally to Lua [83] values for handling within user-
defined Lua scripts used with Redis commands such as EVAL.

Alternatively, all supported file formats could be mapped to the existing set of data types. However, this
would require enhancing some of the current Redis data structure implementations, e.g. linked lists, as
they are not capable of representing tightly packed numbers in memory. Similarly, augmented
implementations of hash sets would be necessary to accommodate the data layout in external files or to
mix native Redis hash tables with values stored in memory-mapped regions.

External data management functionality in Redis promises good performance thanks to its low-level
implementation language, which allows for a direct memory-mapping approach. However, the usefulness
of such a mechanism for neuroimaging research remains questionable due to the lack of support for
joins, aggregates, grouping, math routines and array extensions within Redis, and the fact that
implementations of wrappers aiming to provide such functionality are on hold at the time of this
writing.

3.3. Summary

At the time of this writing, none of the discussed NoSQL solutions offer external data management
facilities, while all of the investigated relational solutions exhibit numerous problems with
accommodating the needs of in-situ data access in the context of modern neuroscience. The main
reason is that the latter systems were designed with dedicated file formats as the preferred and default
way of storing all data. The in-situ access modes were added without exception as a supplementary
mechanism, often presented as just a means of conveniently importing existing data archives into the
database. The direct consequence of this fact is that performance optimization efforts put into
handling external data were minimal or non-existent. This applies both to the mechanism of reading
the external data (e.g. parsing optimization, caching), as well as including the characteristics of
external data in the planning and optimization stages of the SQL execution pipeline.

Another problem is that existing DBMSes share a legacy of being tuned to accessing data in a
streaming (as opposed to random) manner from slow secondary storage devices. Their design is also
done under the assumption that primary storage is a scarce resource compared to volumes of data
processed during each transaction. This in most cases precludes such novel optimizations as memory-
mapping of data files and using memory sharing mechanisms combined with inter-process
communication to expose them simultaneously to multiple locally running agents without copying
the data. With the mainstream adoption of 64-bit processor architectures and typical main memory
sizes for desktop computers reaching values as high as 16 or 32GB, those assumptions are not
necessarily valid anymore. With neuroimaging unit data sizes in the range of single GBs, as indicated
in Table 2.1, a broad range of queries would experience better performance with this kind of
optimization, especially when they involve calls to external software (e.g. R, MATLAB) for
processing the data.
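A minimal POSIX sketch of the memory-mapping approach mentioned above is shown below; mapping a data file with MAP_SHARED lets multiple locally running processes that map the same file share the same physical pages, so no copy of the data is made (the file name is hypothetical):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char *path = "volume.nii";           // hypothetical data file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    // MAP_SHARED: other local processes mapping the same file see the same pages.
    void *addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // The OS pages data in on demand and decides how much stays resident.
    const unsigned char *bytes = static_cast<const unsigned char *>(addr);
    std::printf("first byte: 0x%02x\n", bytes[0]);

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}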

Though incomplete, the foundation mechanisms for external data access are present in many existing
solutions. However, conceivably partly because of the above limitations and partly because of a general
lack of the necessary time investment, the portfolio of external data formats supported by the
aforementioned solutions is very narrow compared to the needs of neuroscience research. This constitutes the third flaw
of existing architectures. Additionally, some of them (e.g. H2) impose unexpected limitations even on
supported external data formats (i.e. no possibility of maintaining index structures, parser
acceleration structures or table-oriented data caches). In such cases, the poor performance is further
degraded by forcing inner joins to perform naive cross joins before applying the inner join criteria,
rendering the execution plan very inefficient. The only way of circumventing this disadvantage is by
importing data to temporary tables, therefore abandoning the original goal of in-situ data access.

All of the discussed solutions apart from SQLite use strict data typing for table columns. Along with
the lack of orientation towards random data access, this makes current databases a much less
flexible option compared to the scripting languages broadly used in neuroscience. An example of an
operation that is fundamentally useful in formulating queries and operating on tabular data in general
is transposition. In some works [39] it is presented as an insightful and novel way of creating a
database schema for a particular problem. A system supporting this operation on the fly, as opposed
to at the table creation stage, would benefit in terms of adaptability.

In the following chapter, I will present WEIRDB - a proof-of-concept implementation of an SQL-based
external data management system oriented towards the needs of neuroimaging and neuroscience. I will
discuss how it addresses the previously identified shortcomings of existing DBMSes and compare its
performance to the abovementioned solutions.

4. WEIRDB - Complex Queries on External
Neuroimaging Data
WEIRDB (Wide External Imaging Research DataBase) was created with the aim of addressing all the
previously discussed shortcomings in existing DBMSes. It's a database solution designed with the
explicit goals of: (1) handling mostly external data, (2) aggressively taking advantage of surplus
random access memory, (3) supporting streaming and random access patterns with good
performance, (4) handling most of the data formats relevant to neuroscience, (5) a highly dynamic
typing system competing with the flexibility of scripting languages, (6) support for very large data files,
(7) efficient processing of very wide tabular data and (8) both network and local (memory sharing
based) access mechanisms.

Additionally, WEIRDB's proof-of-concept implementation was realized with several technical
objectives in mind: (1) written in an industry-standard language - C++, (2) using well-known and well-
tested libraries, i.e. boost [84], SQLite [67] and Qt [85], (3) supporting out-of-the-box deployment
without any installation procedure, (4) intelligent handling of data files, minimizing the amount of
necessary configuration ("zero-config" approach) and (5) automatic tuning (e.g. creation of indices)
to reduce user involvement and learning curve.

WEIRDB was designed to improve over other efforts with similar aims, e.g. NoDB [86]. It benefits
from established knowledge in the field, in particular with regards to auto-tuning - offline ([87], [88],
[89], [90], [91], [92], [93], [94], [95], [96]) and online ([97], [98]) and adaptive indexing ([99],
[100], [101], [102], [103], [104], [105]). WEIRDB realizes an online variant of both techniques
using a simple and efficient greedy strategy of indexing all columns used in the dynamic
(arithmetic/functional/conditional expressions, WHERE, ORDER BY, GROUP BY and JOIN
clauses) parts of the queries.
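The following sketch illustrates this greedy strategy under the assumption that the internal cache is an SQLite database (it is not WEIRDB's actual code; the table and column names are hypothetical): every column detected in a dynamic clause of an incoming query simply receives an index in the cache.

#include <sqlite3.h>
#include <string>
#include <vector>
#include <iostream>

// Greedy online indexing sketch: index every column that appears in a dynamic
// part of a query (WHERE, ORDER BY, GROUP BY, JOIN, expressions).
static void indexDynamicColumns(sqlite3 *db, const std::string &table,
                                const std::vector<std::string> &dynamicColumns) {
    for (const std::string &col : dynamicColumns) {
        std::string sql = "CREATE INDEX IF NOT EXISTS idx_" + table + "_" + col +
                          " ON " + table + "(" + col + ")";
        char *err = nullptr;
        if (sqlite3_exec(db, sql.c_str(), nullptr, nullptr, &err) != SQLITE_OK) {
            std::cerr << "indexing failed: " << (err ? err : "unknown") << std::endl;
            sqlite3_free(err);
        }
    }
}

int main() {
    sqlite3 *db = nullptr;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;
    sqlite3_exec(db, "CREATE TABLE subjects(age REAL, diagnosis TEXT)",
                 nullptr, nullptr, nullptr);
    // Columns referenced in the WHERE / ORDER BY clauses of an incoming query.
    indexDynamicColumns(db, "subjects", {"age", "diagnosis"});
    sqlite3_close(db);
    return 0;
}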

Information extraction for static parts of the query is done using lookup algorithms optimized for
individual data formats. An example of such an algorithm for the CSV format is described in Table 4.1.
WEIRDB solves the problem of integrating SQL semantics with unstructured text data retrieval, as
described in [106], by treating the entire text as a table and therefore subjecting it to processing using
standard SQL queries.

Table 4.1. CSV Access Algorithm

Step 1. Load or map CSV file as-is into memory.

Step 2. Perform initial parsing of the file, caching starting position of every 100th column in each row; note: this doesn't parse the numbers etc., it only traverses the file to cache column positions.

Step 3. Initialize cursor for each row to point to the first column.

Step 4. If data query accesses column pointed by cursor of the respective row, parse it starting from the cursor and advance cursor by one column.

Step 5. Else: look for the closest cached column position and find the destination column by dynamically parsing CSV, read the column and set cursor for that row to point to the next column.
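A minimal C++ sketch of the access pattern from Table 4.1 is given below; it assumes the row's bytes are already available in memory (e.g. via file mapping), ignores quoting and escaping, and uses illustrative class and member names rather than the actual WEIRDB implementation.

#include <cstddef>
#include <string>
#include <vector>

// Sketch of the CSV access strategy from Table 4.1: cache the offset of every
// 100th column per row, keep a per-row cursor and parse cells on demand
// starting from the nearest known position.
class CsvRowAccessor {
public:
    CsvRowAccessor(const char *data, std::size_t rowStart, std::size_t rowEnd)
        : data_(data), rowEnd_(rowEnd), cursorCol_(0), cursorPos_(rowStart) {
        // Step 2: initial traversal, caching the start of every 100th column.
        std::size_t pos = rowStart, col = 0;
        while (pos <= rowEnd) {
            if (col % 100 == 0) anchors_.push_back({col, pos});
            pos = skipCell(pos) + 1;   // move past the delimiter
            ++col;
        }
    }

    // Steps 4-5: read a cell, starting either from the cursor or from the
    // closest cached anchor, then advance the cursor.
    std::string cell(std::size_t wantedCol) {
        std::size_t col = cursorCol_, pos = cursorPos_;
        if (wantedCol != cursorCol_) {
            for (const Anchor &a : anchors_)               // closest anchor <= wantedCol
                if (a.col <= wantedCol) { col = a.col; pos = a.pos; }
            while (col < wantedCol) { pos = skipCell(pos) + 1; ++col; }
        }
        std::size_t end = skipCell(pos);
        cursorCol_ = col + 1;                              // cursor now at next column
        cursorPos_ = end + 1;
        return std::string(data_ + pos, end - pos);
    }

private:
    struct Anchor { std::size_t col, pos; };
    std::size_t skipCell(std::size_t pos) const {
        while (pos < rowEnd_ && data_[pos] != ',') ++pos;
        return pos;
    }
    const char *data_;
    std::size_t rowEnd_;
    std::size_t cursorCol_, cursorPos_;
    std::vector<Anchor> anchors_;
};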

In-situ processing as described in ([107], [108], [109], [110], [111], [112], [113]) is further
encouraged in WEIRDB compared to prior art by its nearly completely "zero-config" nature. E.g. for
CSV files, schemas are built automatically assuming that the first row of a CSV file contains column
names. New CSV files are introduced to the system by dragging and dropping them over the GUI,
using a classical file selection dialog or a text interface. Their names are converted to table names and
column names are inferred based on file content. SQL queries are instantly possible for all newly
attached data.
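A simple sketch of this attachment step, assuming Qt is used for file handling (the file name is hypothetical and quoting/escaping of CSV headers is ignored), could derive the table name from the file name and the column names from the first row:

#include <QFile>
#include <QFileInfo>
#include <QString>
#include <QStringList>
#include <QTextStream>
#include <iostream>

int main() {
    const QString path = QStringLiteral("demographics.csv");   // hypothetical file
    QFile file(path);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text)) return 1;

    QTextStream in(&file);
    const QString tableName = QFileInfo(path).baseName();      // "demographics"
    const QStringList columnNames = in.readLine().split(',');  // header row

    std::cout << "table: " << tableName.toStdString()
              << ", columns: " << columnNames.join(", ").toStdString() << std::endl;
    return 0;
}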

In the following section, I present a more detailed perspective on WEIRDB's software
architecture.

4.1. Architecture Description

The architecture of WEIRDB derives directly from the aforementioned requirements, as illustrated in
Figure 4.1.1.

For handling external data with simultaneous support for SQL queries, a Query Rewriting Engine
component was introduced. It is responsible for parsing queries and building an optimal plan taking
advantage of direct file access and a classical database system (SQLite) acting as internal caching
engine and indices retrieval mechanism. The choice of SQLite was dictated by a couple of reasons,
one of them being the way it handles column types. An optionally declared column type is treated
only as a hint for the execution engine, with effective types encoded separately in storage-efficient
manner along with each value. This characteristic helps to satisfy the dynamic typing system
requirement. Another important consideration was the set of supported index data structures (Table 4.1.1).

Figure 4.1.1. Functional architecture of WEIRDB. Filled ellipses represent implementation
components: Internal Execution Engine, Query Rewriting Engine, Random Access Model
Implementations, Native Network Protocol, PostgreSQL Network Protocol. Transparent ellipses
with dotted outlines represent interfaces for: Memory-mapping of Files, Random Data Access,
WEIRDB Embedding. Rectangles with solid outlines represent functional requirements and group
together components and interfaces most relevant to achieving them. Rectangles with dotted
outlines group functional requirements into higher level functional requirements with Embeddable
DBMS with External Data Support referring to embeddable self-contained part of WEIRDB with
exclusion of network access modules.

Table 4.1.1. Comparison of Index Data Structures

R-Tree: A hierarchical minimum bounding rectangle representation of neighboring objects. A balanced tree with defined maximum page size and minimum fill. The original insertion algorithm was to always choose the subtree which causes the least bounding rectangle enlargement.

R+Tree: A variant of R-tree avoiding overlap of internal nodes. To this end, an object may be inserted into multiple leaf nodes. Due to this fact the minimum fill is also not guaranteed.

Hash: For simple equality comparisons only. A hash function is used to determine the bucket in a lookup table where a given row reference is found.

Expression: An index built on an expression rather than a column. This type of index allows optimizing for expressions frequently used in queries.

Partial: An index built on part of the data filtered by an expression. This allows e.g. construction of an index just for records marked as active. It's useful if the same filtering expression is applied in a query.

Reverse: A regular B-tree index with reversed key values, e.g. 12345 becoming 54321. This operation spreads similar values uniformly across the entire index, thus reducing contention for concurrent sequential inserts.

Bitmap: Uses bit arrays and bitwise logical operations on them to answer queries on low cardinality columns. Worse than B-tree performance for frequent updates. Usually used in read-only tables.

GiST: Generalized Search Tree is a generalization of B+Tree. An implementation must provide 8 application-specific methods: same, consistent, union, penalty, picksplit, compress, decompress and distance. The core algorithm doesn't make any assumptions about the data type.

GIN: Generalized Inverted Index is a generic implementation of an inverted index. An inverted index is a mapping from content to location in the database file. It can be used for fast full text searches. GIN indices are slower to build than GiST but faster to look up. There are 6 application-specific methods: compare, extractValue, extractQuery, consistent, triConsistent and comparePartial.

Full Text: A full text index is an inverted index specialized for word-to-document mapping.

Spatial: A spatial index is used to optimize spatial queries. Example spatial queries are: nearest neighbor query, range query, intersection query, etc. Standard indexes do not perform well for spatial queries. Typical spatial indices include: R-tree, R+Tree, R* Tree, kd-tree, quadtree, octree.

FOT: Forest of Trees index is similar to a B-tree index but has multiple root nodes and possibly fewer levels. Multiple root nodes reduce contention when multiple users are simultaneously writing the index. It can also improve lookup speed by reducing the number of levels to traverse.

Among embeddable databases, SQLite supports the optimal subset of indices, including spatial ones -
a feature critically important for neuroscience investigations (Table 4.1.2).

Table 4.1.2. Indexing methods support comparison in WEIRDB and existing SQL DBMS. * Depends on the execution
backend. (2) MyISAM tables only. (3) MEMORY, Cluster (NDB) and InnoDB tables only.

Database | R-/R+ Tree | Hash | Expression | Partial | Reverse | Bitmap | GiST | GIN | Full-text | Spatial | FOT
WEIRDB | Yes* | Yes* | Yes* | Yes* | Yes* | Yes* | Yes* | Yes* | Yes* | Yes* | Yes*
HSQLDB | No | No | No | No | No | No | No | No | No | ? | ?
H2 | No | No | No | No | No | No | No | No | Yes | Yes | ?
Teiid | No | No | No | No | No | No | No | No | No | No | No
NoDB | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ?
wormtable | No | No | No | No | No | No | No | No | No | No | No
PostgreSQL | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | PostGIS | ?
MySQL | Yes(2) | Yes(3) | No | No | No | No | No | No | Yes | Yes(2) | ?
SQLite | Yes | No | No | Yes | Yes | No | No | No | Yes | Yes | ?

Finally, SQLite's data limits (Table 4.1.3) are on par with other solutions under consideration.

Table 4.1.3. Limits comparison of WEIRDB and existing SQL DBMS with support for external data virtualization. * Values
for static columns, dynamic column limits depend on the execution backend ** Data type not supported (3) Configurable
using environment variable (4) Java's limit for maximum array size of 2^31 applies (5) Limited to 2^63 (exa-scale) by Java's
length() method returning long type. Sources: [125], [120], [126].

Database | Max DB size | Max table size | Max row size | Max columns per row | Max Blob size | Max CHAR size | Max NUMBER size | Min DATE value | Max DATE value | Max column name size
WEIRDB | Unlimited* | Unlimited* | Unlimited* | Unlimited* | Unlimited* | Unlimited* | Unlimited* | N/A** | N/A** | Unlimited*
HSQLDB | 64TB | Unlimited(4) | Unlimited(4) | Unlimited(4) | 64TB | Unlimited(4) | Unlimited | 0001/01/01 | 9999/12/31 | 128
H2 | 64TB | Unlimited(4) | Unlimited(4) | Unlimited(4) | 64TB | Unlimited | 64 bits | 99999999 BC | 99999999 AD | Unlimited(4)
Teiid | Unlimited | Unlimited(4) | Unlimited(4) | Unlimited(4) | Unlimited(5) | 4000 (3) | 1000 digits | 1970/01/01 | 2922790 cent. | Unlimited(4)
NoDB | ? | ? | ? | ? | ? | ? | ? | ? | ? | ?
wormtable | 2TB-256TB | Unlimited | Unlimited | Unlimited | Unlimited | Unlimited** | Unlimited** | N/A** | N/A** | Unlimited
PostgreSQL | Unlimited | 32TB | 1.6TB | 250-1600 | 1GB-4TB | 1GB | Unlimited | 4713 BC | 5874897 AD | 63
MySQL | Unlimited | 256TB/64TB | 64KB | 4096 | 4GB | 64KB | 64 bits | 1000 AD | 9999 AD | 64
SQLite | 128TB | Limited by DB size | Limited by DB size | 32767 | 2GB | 2GB | 64 bits | N/A** | N/A** | Unlimited

The ensemble of the above traits, along with SQLite's high performance, rendered it the optimal choice for
the internal caching mechanism. Dynamic typing on the external data side is handled by using the Qt
framework's QAbstractItemModel class as a tabular data abstraction layer. It conveniently treats all
values as variants, allowing for fast operations native to a given data type when it is known and type
conversions when necessary. All external data sources supported by WEIRDB are implemented as
subclasses of QAbstractItemModel. To take advantage of abundant random access memory, disk-based
external data sources use the memory-mapped file mechanism to access the data.
This allows the operating system to decide how much of the data should be retained in memory at any
given time, resulting in good behavior when running WEIRDB along with other applications. At the
same time it offers significantly more buffering than the standard C library buffering mechanisms or
operating-system file-only buffering. The Query Rewriting Engine, together with efficient memory-
mapping-based implementations of QAbstractItemModel, provides support for very large data files,
wide tabular data (good performance with millions of columns), high streaming access performance
and - when using WEIRDB's programming API - high-performance random access (e.g. seamless
retrieval of arbitrary cells in any order from a SELECT query). The data processing core is wrapped
independently by two network access mechanisms - WEIRDB's native network protocol and the
PostgreSQL network protocol version 3. The former provides a big-data-oriented messaging format
(support for unlimited columns), while the latter focuses on achieving compatibility with database APIs
in existing software. A high-level collaboration diagram of WEIRDB components is presented in Figure
4.1.2. In the following sections, I present a closer look at the components and interfaces created and
implemented for WEIRDB: the Query Rewriting Engine, QAbstractItemModel, the Native
Network Protocol and the PostgreSQL Network Protocol v3.

Figure 4.1.2. High-level architecture of WEIRDB. The figure illustrates network access scenario
where researcher inputs a query using client terminal. The text is received by the chosen protocol
implementation - either WEIRDB native protocol or PostgreSQL network protocol which forwards
it to Query Rewriting and Execution Engine. The Query Rewriting Engine interfaces with Internal
Caching Engine and provides it with data necessary to execute a modified query subsequently used
to determine rows that need to be fetched from external data sources. The Random Access Model
provides abstraction layer for reading foreign data, both file-based (e.g. CSV files, Nifti-1 files) and
network-based (e.g. foreign SQL server). The SQL adapter can be used to access other WEIRDB
instances or different DBMSes. The file-based adapters can access files on local filesystems,
network filesystems or any other abstractions acting as filesystems (e.g. FUSE drivers). In particular
it can access files in a Cloud Storage. The only requirement is that files follow the respective
format supported by the adapter, e.g. CSV, Nifti-1, MINC, HDF5, etc. The Query Rewriting Engine
is responsible for synchronizing all the data retrieval pipes and serving the final result back through
the network interface to the client terminal and end-user.

4.2. Query Rewriting Engine

The purpose of the Query Rewriting Engine is to restructure (Figure 4.2.1, Figure 4.2.2) the original
query in such a way that only the dynamic parts are retained and row identifiers are added for the static
parts, which can later be used to fetch data from external files. For implementing the parser I used the
Boost Spirit parsing library and defined a grammar corresponding to the SELECT syntax of SQLite. The
parser takes a string containing an SQL query as input and produces the corresponding Abstract Syntax
Tree (AST), which specifies to a limited extent which parts of the query are dynamic. A further analysis
step is necessary to determine whether expressions which syntactically seem to be static are in fact
dynamic because they originate from dynamic columns in nested SELECT queries. The analysis module
detects these cases and for each dynamic identifier makes sure that the “id” column of each of the tables
used to produce that expression is included once in the list of SELECT values. These added
identifiers are named by convention “id___N”, where N is an increasing integer number for each newly
generated identifier. All “id___N” values are propagated across nested queries regardless of whether
they are used in the final output. The tables and columns for the dynamic parts of the query are
fetched into the internal caching and indexing engine only if they are required. This is the key feature
for obtaining high performance. In-memory tables can be used instead of the default disk-based tables to
increase speed even further, with a trade-off of losing persistency upon WEIRDB restarts. After
retrieving identifiers from the reformatted query results, the original AST is used to fill in the missing
static parts by directly accessing external files. Efficient support for large files is achieved by trying to
keep them mostly in memory (file mapping is used to this end; the percentage of the file loaded into
physical memory thus depends on the amount of memory available and the file usage pattern) with
minimum overhead for caching some of the column positions in separately allocated memory.
Afterwards, columns are accessed by piecewise on-the-fly parsing of the file, using cached positions
to amortize the search time for each particular column. This proves to be more efficient both
performance- and memory-wise than caching parsed data in arrays of strings or variant types. Finally,
results are either printed out as a CSV file, presented by means of a dedicated graphical user interface
(GUI) implemented using the Qt library and a special data model (similar to the external file wrappers
derived from the QAbstractItemModel class), or sent in CSV format over a network socket.
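To illustrate the parsing approach, the following minimal Boost Spirit sketch parses only a column list and a table name from a simplified SELECT statement; it is not WEIRDB's actual grammar, which covers the full syntax described in the next section, and the input query is hypothetical.

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;

int main() {
    const std::string input = "SELECT subject_id, age FROM demographics";

    std::vector<std::string> columns;
    std::string table;

    // An identifier: one or more alphanumeric characters or underscores,
    // parsed as a single lexeme (no skipping inside the token).
    qi::rule<std::string::const_iterator, std::string(), ascii::space_type> ident =
        qi::lexeme[+(ascii::alnum | ascii::char_('_'))];

    std::string::const_iterator first = input.begin(), last = input.end();
    const bool ok = qi::phrase_parse(first, last,
        qi::lit("SELECT") >> (ident % ',') >> qi::lit("FROM") >> ident,
        ascii::space, columns, table);

    if (ok && first == last)
        std::cout << "table: " << table
                  << ", columns parsed: " << columns.size() << std::endl;
    return 0;
}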

For a diagram of the Query Rewriting Engine architecture, see Figure 4.2.3. In the following section, I
describe the syntax of the SQL statements supported by WEIRDB.


Figure 4.2.1. Examples of static and dynamic column references. Dynamic references (marked with
dotted line) are kept in the reformatted query, whereas for static ones (marked with solid line) an
identifier column of the corresponding table is added to the modified query. The results for static
columns will be fetched directly from external files.

Figure 4.2.2. Query rewriting pipeline. In the first step, wildcard expressions are expanded and
necessary identifier columns added. In the second step, static values are filtered out providing the
query which is subsequently executed by Internal Caching Engine. Static columns are marked with
solid lines, dynamic columns are marked with dotted lines and their corresponding row identifiers
are marked with dash-dotted lines.


Figure 4.2.3. Block diagram of the Query Rewriting Engine architecture with partitioning of
functionality into compilation units of WEIRDB. The figure uses CSV files for illustrative
purposes; the respective blocks can be replaced with any supported file format. Rectangles with captions
above them represent compilation units (i.e. source files). Rectangles with captions inside represent
algorithms and data structures. The control flow and data flow are indicated interchangeably by
directed arrows with the corresponding counterparts specified implicitly by the context. The
handle.cpp unit is responsible for automatic import of all CSV files in the working directory. It's a
special legacy use case present only for CSV files. The canonical function of handle.cpp unit is to
process database queries. The query handling routine forwards the query text to SQL Parser
(trivialsql.cpp) which in turn passes an Abstract Syntax Tree (AST) to the analysis and rewriting
routines (analysis.cpp). Both of them use an ad-hoc schema inferred and provided by the CSV Model,
created either by the automatic scanning procedure or manually using the ATTACH statement. The
rewriting step provides a modified query and a list of columns which need to be present in the
Internal Caching Engine (SQLite) in order for the modified query to be executed correctly. Control
returns then to handle.cpp unit which uses CSV model to fetch the necessary columns from CSV
files and add them to the Internal Caching Engine (SQLite). Subsequently, Data Access
Specification is generated which defines the relation between modified query results and the desired
output of the original query. Based on the aforementioned specification, the WEIRDBModel.cpp unit
provides QAbstractItemModel implementation which uses the Data Access Specification along with
CSV Model and Internal Caching Engine to generate cells of final results. This implementation can
be called directly in the embedded usage scenario or wrapped using network access modules in
remote usage scenario. Both means are represented by the Output box.

4.2.1. SQL Parser Implementation

Currently, WEIRDB's proof-of-concept implementation offers read-only access to the data, thus
among the standard SQL statements it supports only the SELECT operation. Implementation of other
basic statements such as INSERT, UPDATE and DELETE has a varying level of difficulty depending on the
output data format. Formats like CSV are friendly for INSERT statements but pose challenges for
efficient UPDATE and DELETE operations because they require rewriting of large portions of the
file. On the other hand, the Nifti-1 format allows for easy UPDATE operations, while INSERT and
DELETE operations do not make sense unless the corresponding metadata about volume dimensions are
updated simultaneously. This would require an array-aware syntax such as SciQL [41], which is left
for future investigation.

The SELECT statement syntax supported by WEIRDB is presented below in Extended Backus-Naur
form:

<select-stmt> ::= SELECT <result-column-list>


FROM (<table-or-subquery-list> | <join-clause>)
WHERE <expr>
GROUP BY <expr>
ORDER BY <ordering-term-list>
LIMIT <int-literal>, [<int-literal>]
<table-or-subquery-list> ::= <table-or-subquery> | <table-or-subquery>, <table-or-subquery-list>
<join-clause> ::= <table-or-subquery> {[LEFT | OUTER | INNER | CROSS] JOIN <table-or-subquery>
<join-constraint>}
<join-constraint> ::= ON <expr> | USING "(" <column-name>, {<column-name>} ")"
<result-column-list> ::= <result-column> | <result-column>, <result-column-list>
<result-column> ::= "*" | <table-name>."*" | <expr> [[AS] <column-alias>]
<ordering-term-list> ::= <ordering-term> | <ordering-term>, <ordering-term-list>
<ordering-term> ::= <column-specifier> [ASC | DESC]
<column-specifier> ::= <table-name> | <table-name>.<column-name>
<table-or-subquery> ::= <table-name> [AS <table-alias>] | "(" (<table-or-subquery-list> | <join-clause>) ")"
| "(" <select-stmt> ")" [AS <table-alias>]
<expr> ::= <literal-value>
| <bind-parameter>
| <column-specifier>
| <unary-operator> <expr>
| <expr> <binary-operator> <expr>
| <function-name> "(" [<expr>, {<expr>} | "*"] ")"
| "(" <expr> ")"
| <expr> [NOT] (LIKE | GLOB | MATCH | REGEXP) <expr>
| <expr> IS [NOT] <expr>
<literal-value> ::= <numeric-literal>
| <string-literal>
| <blob-literal>
<unary-operator> ::= - | + | ~ | NOT
<binary-operator> ::= "||"
| "*" | / | %
| + | -
| << | >> | & | "|"
| <= | >= | < | >
| == | != | <> | =
| AND | OR
<int-literal> ::= <digit> {<digit>}

Furthermore, the list of keywords used by WEIRDB is limited to the following set:

SELECT, FROM, WHERE, GROUP, ORDER, BY, LIMIT, AND, OR, NOT, AS, ASC, DESC, LIKE, IN, IS, GLOB, MATCH, REGEXP

Notably, the following constructs were removed/modified from the expression grammar as compared
to SQLite:

<function-name> "(" [[DISTINCT] <expr>, {<expr>} | *] ")"


CAST "(" <expr> AS <type-name> ")"
<expr> COLLATE <collation-name>
<expr> [NOT] (LIKE | GLOB | REGEXP | MATCH) <expr> [ESCAPE <expr>]
<expr> (ISNULL | NOTNULL | NOT NULL)
<expr> [NOT] BETWEEN <expr> AND <expr>
<expr> [NOT] IN ("(" [<select-stmt> | <expr>, {<expr>}] ")" | <table-name>)
[[NOT] EXISTS] "(" <select-stmt> ")"
CASE [<expr>] {WHEN <expr> THEN <expr>} [ELSE <expr>] END
<raise-function>

The lack of support for expressions specific to NULL value handling stems historically from the lack
of a NULL value in the CSV format. Empty cells in CSV files are represented by empty strings. Similarly,
Not a Number (NaN) and Infinity (Inf) values in Nifti-1 files are represented by their textual
representations in CSV output, or their native values are kept when using WEIRDB programmatically.

WEIRDB also supports three special SQL statements, specific to external data handling:

<attach-stmt> ::= ATTACH <file-name> [AS <table-name>] TYPE [CSV | NIFTI]

<detach-stmt> ::= DETACH <table-name>

<into-stmt> ::= INTO <file-name> <select-stmt>

The ATTACH statement is used to connect existing files to the engine. Their contents are
subsequently accessible via the SELECT statement. The DETACH statement does the reverse: it
disconnects an existing file-backed table from the engine. The INTO statement followed by a SELECT
statement saves the query results into the specified file in CSV format.

In the following section, I will describe how the Query Rewriting Engine handles the SELECT statement.

4.2.2. Query Rewriting Algorithm

Upon receiving the SELECT query text, the Query Rewriting Engine divides expressions into simple
column references (SCRs) and others. A simple column reference is defined as one that is used in the
list of column expressions following the SELECT keyword, but not as an argument to a function call or
an arithmetic/logic expression, and is not used in any other SQL clause (e.g. JOIN, GROUP BY,
ORDER BY). Non-SCR expressions cannot be subject to data virtualization in the current
implementation.

The simple column reference trait of an expression can be determined by recursive analysis of the
query and its nested subqueries. The need for recursion arises from the fact that expressions
syntactically qualified as SCR may refer to aliases of non-SCR expressions in nested subqueries and
effectively be non-SCR themselves.

The algorithm rewriting an SQL query for data virtualization of SCRs consists of 3 phases presented
in Algorithm 4.2.2.1. Phase I consists of recursive execution of the algorithm for subqueries. Phase II
is responsible for creating two copies of the output column expressions from original query, one with
added row identifiers for SCRs, the other - containing only non-SCR expressions and row identifiers.
Phase III propagates row identifiers created in recursive calls from Phase I to the list of non-SCR
expressions and creates two copies of the original query with output column expressions replaced by
the above two lists respectively.

Algorithm 4.2.2.1. Prepare query for data virtualization - prepareForDataVirtualization


Input: Select(Values, From, Joins, Conds, GroupBy, OrderBy) as root node of SQL SELECT
statement abstract syntax tree, containing: Values - list of expressions following the SELECT
keyword, From - list of table names and subqueries following the FROM keyword, Joins - list of
table specifications following occurrences of JOIN keyword, Conds - list of expressions following the
WHERE keyword, GroupBy - list of expressions following the GROUP BY keyword and OrderBy -
list of columns following the ORDER BY keyword, (optional) RowIdentifiers - list of already
generated row identifier columns, Tables - dictionary of tables attached to the database

Output: Root nodes of two abstract syntax trees: Full representing the original query with row
identifier meta data added for all simple column references, Filtered created from Full by removing
simple column references and adding row identifier columns.

Phase I - run prepareForDataVirtualization recursively for subqueries


if RowIdentifiers is not specified then
RowIdentifiers ← {}
end if
Full ← Select, Filtered ← Select
for Subquery ∈ From do

(FullR, FilteredR) ← prepareForDataVirtualization(Subquery, RowIdentifiers)

Full.From[k : From[k] = Subquery] ← FullR

Filtered.From[k : From[k] = Subquery] ← FilteredR


end for

Phase II - from Values create AllExprs - a copy with row identifiers added for simple column
references and NonColumns - a set of expressions that are not simple column references
AllExprs ← {}, NonColumns ← {}, RowIdentifiers ← {}
ColumnNamespace ← createColumnNamespace(From, Joins, Tables)
AliasToRowId ← Dictionary<Alias, RowId>
for Expr ∈ Values do
if Expr is not Column then
AllExprs ← AllExprs ∪ {Expr}
NonColumns ← NonColumns ∪ {Expr}
else
(AllExprs, NonColumns, RowIdentifiers, AliasToRowId) ←
processColumn(Expr as Column, AllExprs,
NonColumns, RowIdentifiers, ColumnNamespace, AliasToRowId)
end if
end for

Phase III - add row identifiers meta data to existing columns in Full.Values, remove simple column
references from Filtered.Values and add row identifier columns to Filtered.Values
Full.Values ← AllExprs
Filtered.Values ← NonColumns
FilteredOutColumns ← outputColumns(Filtered, Tables)
for RowId ∈ RowIdentifiers do
if RowId ∉ FilteredOutColumns then
Filtered.Values ← Filtered.Values ∪ {RowId}
end if
end for

The subroutine processColumn (Algorithm 4.2.2.2) in algorithm prepareForDataVirtualization is
responsible for handling SCRs, in particular for generating row identifiers which allow output
columns to be linked with external data during query execution. processColumn looks for columns matching
a given column specifier in the namespace of input columns (i.e. columns originating from FROM and
JOIN clauses). For each of the matches, non-SCRs are passed through without modification to both the
SCR and non-SCR lists, whereas SCRs are appended to the SCR list only. Furthermore, for SCRs the
presence of a prior row identifier is checked. If a row identifier is not in place, the existence of an
alias-to-row-identifier mapping is verified. If no such mapping is present for the given column alias, a new row
identifier is generated and a corresponding mapping created. Row identifiers are added to the non-SCR
expressions list and assigned to the given column so that references from parent queries would already
contain the row identifier.

Algorithm 4.2.2.2. Handle simple column references in Phase II - processColumn
Input: Column - column specifier, AllExprs - list of expressions with row identifier meta data added
for simple column references, NonColumns - list of expressions that are not simple column
references, RowIdentifiers - set of already generated row identifier columns, ColumnNamespace -
namespace of columns defined by FROM and JOIN clauses, AliasToRowId - dictionary containing
mappings from column aliases to row identifiers

Output: AllExprs - updated list of expressions with row identifier meta data added for simple column
references, NonColumns - updated list of expressions that are not simple column references,
RowIdentifiers - updated set of already generated row identifier columns, AliasToRowId - updated
dictionary containing mappings from column aliases to row identifiers

Matches ← {m : m ∈ ColumnNamespace and m.Matches(Column)}


for Match ∈ Matches do
if not Match.isSimple() then
AllExprs ← AllExprs ∪ {Match}
NonColumns ← NonColumns ∪ {Match}
else
if not Match.hasRowIdentifier() then
if Match.Alias ∈ AliasToRowId then
RowId ← AliasToRowId[Match.Alias]
else
RowId ← new row identifier column : RowId ∉ RowIdentifiers
RowIdentifiers ← RowIdentifiers ∪ {RowId}
AliasToRowId[Match.Alias] ← RowId
NonColumns ← NonColumns ∪ {RowId}
end if
Match.setRowIdentifier(RowId)
end if
AllExprs ← AllExprs ∪ {Match}
end if
end for

Algorithms prepareForDataVirtualization and outputColumnNames rely on the subroutine
createColumnNamespace (Algorithm 4.2.2.3) for creating the input column namespace (i.e. the dictionary of
columns originating from FROM and JOIN clauses) for a given query. To this end,
createColumnNamespace iterates over the list of expressions in FROM and JOIN clauses. The list of
input columns will contain columns from the list of output columns of subqueries in the FROM
clause, therefore the outputColumnNames (Algorithm 4.2.2.4) subroutine is called in such cases. For the
remaining types of expressions (i.e. table references in FROM and JOIN clauses), the columns
defined by their schemas become part of the input columns as returned by the given invocation of
createColumnNamespace. Finally, table name aliases used in FROM and JOIN clauses are
propagated to the input column namespace.

Algorithm 4.2.2.3. Create column namespace in Phase II and in outputColumnNames - createColumnNamespace
Input: From - list of table names and subqueries following the FROM keyword, Joins - list of table
names following occurrences of JOIN keyword, Tables - dictionary of tables attached to the database

Output: ColumnNamespace - namespace of columns accessible from expressions following the SELECT keyword

ColumnNamespace ← {}
for Expr ∈ From ∪ Joins do
if Expr is Select then
OutColumns ← outputColumns(Expr as Select, Tables)
else
OutColumns ← Tables[Expr.Table].Columns
end if
for Column ∈ OutColumns do
Column.TableAlias ← Expr.TableAlias
ColumnNamespace ← ColumnNamespace ∪ {Column}
end for
end for

The outputColumnNames algorithm (Algorithm 4.2.2.4) is part of a mutually recursive pairing with the
createColumnNamespace subroutine. Its role is to create the list of columns produced by a given
SELECT statement (i.e. output columns). This is accomplished by iterating over all expressions in the
SELECT clause and performing different actions depending on their types. Syntactically non-SCR
expressions are marked as such and appended to the list without further modification. To handle
SCRs, an input column namespace is created from the expressions in FROM and JOIN clauses and
columns matching the SCR are found. The cardinality of the set of matches is evaluated against
erroneous conditions (e.g. no matching columns or too many matching columns). Subsequently,
matches are marked as SCR and the row identifier obtained from the column specifier is assigned to the
match. At this stage the row identifiers (if present) come from this query's subqueries processed by
recursive calls to prepareForDataVirtualization. This ensures that the same columns originating from
different subqueries have separate row identifiers. Finally, SCRs are also appended to the output
columns list.

Algorithm 4.2.2.4. Create output columns list in Phase III and in createColumnNamespace -
outputColumnNames
Input: Select(Values, From, Joins, ...) as root node of SQL SELECT statement abstract syntax tree,
containing: Values - list of expressions following the SELECT keyword, From - list of table names
and subqueries following the FROM keyword, Joins - list of table names following occurrences of
JOIN keyword, Tables - dictionary of tables attached to the database

Output: OutColumns as list of columns produced by given SELECT statement

OutColumns ← {}
ColumnNamespace ← createColumnNamespace(From, Joins, Tables)
for Expr ∈ Values do
if Expr is Column then
Column ← Expr as Column
Matches ← {m : m ∈ ColumnNamespace and m.Matches(Column)}
if |Matches| = 0 then
raise Exception(No matching columns found)
else if |Matches| > 1 and not Expr.isWildcard() then
raise Exception(Ambiguous column name)
else
for Match ∈ Matches do
Match.setSimple(TRUE)
Match.setRowIdentifier(Column.getRowIdentifier())
OutColumns ← OutColumns ∪ {Match}
end for
end if
else
Expr.setSimple(FALSE)
OutColumns ← OutColumns ∪ {Expr}
end if
end for

After performing the query rewriting procedure, non-SCR columns are imported to the internal
caching engine so that it can execute the non-SCR part of the new query. This concludes the Query
Rewriting stage and is followed by query execution and direct data retrieval discussed in the
following section.

4.2.3. Query Execution and External Data Access

After the internal engine has been updated, according to the respective analysis, to a valid state for running
the rewritten query, the latter is issued and remains active throughout the life of the corresponding
WEIRDB query. When retrieving rows for file, network or internal programmatic output, WEIRDB
interleaves calls to the internal engine API for retrieving dynamic values (as well as row identifiers for
static columns) with calls to QAbstractItemModel (discussed in the following section) methods for
accessing the external data. Currently two classes - FastNiftiModel and FastCsvModel - implement
the QAbstractItemModel interface in order to access external Nifti-1 and CSV files
respectively. When using WEIRDB programmatically (e.g. as a library) rather than as a separate process
or network server, completely random access to any resulting cells is possible by rewinding the
internal engine results handle to an arbitrary row and using the intrinsically random-access semantics of
QAbstractItemModel to retrieve the static data. Figure 4.2.3.1 illustrates this simple and efficient
external data access mechanism.

Figure 4.2.3.1. WEIRDB External Data Access Model. Dynamic Expressions are evaluated by the
Internal Caching Engine (SQLite) and returned in its Result Row along with Row Identifiers
inserted there by the Query Rewriting Engine. Consequently the Row Identifiers are used to access
External Files and retrieve foreign data for Static Columns of a WEIRDB Result Row. The Dynamic
Expressions are copied from the Internal Engine Result Row to WEIRDB Result Row without
modification.
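The sketch below illustrates the interleaving shown in Figure 4.2.3.1 under the assumption that the internal caching engine is accessed through the SQLite C API; the rewritten query, the column layout (row identifier in column 0, one dynamic expression in column 1) and the external model are hypothetical and serve only to demonstrate the access pattern.

#include <sqlite3.h>
#include <QAbstractItemModel>
#include <QVariant>
#include <iostream>

// Illustrative interleaving of internal-engine rows with external data access.
// 'externalModel' stands for a wrapper such as a CSV- or Nifti-backed model.
void streamResults(sqlite3 *db, const char *rewrittenSql,
                   const QAbstractItemModel &externalModel, int staticColumn) {
    sqlite3_stmt *stmt = nullptr;
    if (sqlite3_prepare_v2(db, rewrittenSql, -1, &stmt, nullptr) != SQLITE_OK) return;

    while (sqlite3_step(stmt) == SQLITE_ROW) {
        // Row identifier inserted by the Query Rewriting Engine.
        sqlite3_int64 rowId = sqlite3_column_int64(stmt, 0);

        // Dynamic value computed by the internal caching engine (SQLite).
        double dynamicValue = sqlite3_column_double(stmt, 1);

        // Static value fetched directly from the external file via the model.
        QVariant staticValue =
            externalModel.data(externalModel.index(static_cast<int>(rowId), staticColumn));

        std::cout << staticValue.toString().toStdString()
                  << "\t" << dynamicValue << std::endl;
    }
    sqlite3_finalize(stmt);
}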

In the following section, I describe in more detail QAbstractItemModel, the class central to WEIRDB's
random data access.

4.3. Random Access Data Model

The QAbstractItemModel class defines an abstract interface for item models in the Qt [85]
framework. Its architecture enables representation of both table- and tree-like item models. The most
important virtual method is index() which has the following signature:

QModelIndex index(int row, int column, const QModelIndex &parent);

In the case of a flat table, the parent argument passed to the call will always be the so-called "invalid"
QModelIndex instance, which is created by the default QModelIndex() constructor. The row and
column arguments reference respectively rows and columns in a table. The index() method's role is to
return a QModelIndex encapsulating a user-defined value which uniquely identifies the specified
item.

In case of a typical tree model, elements in column 0 can have children, which is indicated by the following method:

bool hasChildren(const QModelIndex &parent);

returning true for a parent argument referring to an item in column 0.

The rowCount(parent) and columnCount(parent) methods provide respectively numbers of rows and
columns for a given parent. In case of top-level items, the parent passed is an invalid QModelIndex
instance.

For traversing the trees, the parent() method needs to be implemented as well, which for any given
QModelIndex returns a QModelIndex corresponding to its parent item (or invalid QModelIndex for
top-level items).

Although the QAbstractItemModel abstraction is very powerful and makes it possible to recursively represent tables in which each cell can contain a child table, in practice it is used to represent either classical flat tables or trees with multiple data columns shared among items. This situation corresponds well to the available assortment of view classes (QTreeView, QTableView), which at the moment are not capable of displaying item models customized in more sophisticated ways.

The final method required for read-only QAbstractItemModel to function has the following signature:

QVariant data(const QModelIndex&, int role)

The data() method returns the data corresponding to a given item and role in the form of a variant type. The role differentiates whether the item is about to be displayed or edited, whether a tooltip/status/"what's this" tip is about to be shown for it, etc. Similarly, it can refer to decoration (icons) and meta data about the item (e.g. size hint, text alignment). In simple models the value returned is the same for the edit and display roles, and an invalid QVariant instance is returned for the other roles, indicating that no meta data are available.
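For reference, the following minimal read-only flat-table model implements exactly the methods discussed above. It is a generic, self-contained illustration holding its values in memory; it is not the code of FastCsvModel or FastNiftiModel.

// A minimal read-only, flat-table QAbstractItemModel holding its values in memory.
#include <QAbstractItemModel>
#include <QVariant>
#include <QVector>
#include <utility>

class MemoryTableModel : public QAbstractItemModel
{
public:
    explicit MemoryTableModel(QVector<QVector<QVariant>> rows, QObject* parent = nullptr)
        : QAbstractItemModel(parent), m_rows(std::move(rows)) {}

    QModelIndex index(int row, int column, const QModelIndex& parent = QModelIndex()) const override
    {
        // Flat table: only the invalid parent has children.
        if (parent.isValid() || row < 0 || column < 0)
            return QModelIndex();
        return createIndex(row, column);
    }

    QModelIndex parent(const QModelIndex&) const override
    {
        return QModelIndex();               // no item has a parent in a flat table
    }

    int rowCount(const QModelIndex& parent = QModelIndex()) const override
    {
        return parent.isValid() ? 0 : m_rows.size();
    }

    int columnCount(const QModelIndex& parent = QModelIndex()) const override
    {
        return (parent.isValid() || m_rows.isEmpty()) ? 0 : m_rows.first().size();
    }

    QVariant data(const QModelIndex& idx, int role = Qt::DisplayRole) const override
    {
        // Same value for the display and edit roles, invalid QVariant otherwise.
        if (!idx.isValid() || (role != Qt::DisplayRole && role != Qt::EditRole))
            return QVariant();
        return m_rows[idx.row()][idx.column()];
    }

protected:
    QVector<QVector<QVariant>> m_rows;
};

Such a model can be displayed directly by a QTableView or, in a setting like WEIRDB's, serve as the random-access layer described above.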

For full read/write support additional methods need to be implemented: setData(), insertRows(),
insertColumns(), removeRows(), removeColumns(), moveRows(), moveColumns() which are
responsible for respectively: setting the item data, adding rows, adding columns, deleting rows,
deleting columns and moving rows/columns. Notably, only a subset of those methods can be
implemented, reflecting capabilities of the model.

To notify the observers of the model, the setData() method needs to emit a dataChanged() signal upon successful completion, and the remaining methods must use the beginInsertRows()/endInsertRows(), beginInsertColumns()/endInsertColumns(), etc. notification mechanisms.
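A short sketch of what this write path looks like in practice is given below; it extends the hypothetical MemoryTableModel from the previous example (assuming both methods are declared with override in the class definition) and shows only the notification pattern required by the interface.

// Sketch of write support with the required notifications, extending the hypothetical
// in-memory model above (member m_rows as before).
bool MemoryTableModel::setData(const QModelIndex& idx, const QVariant& value, int role)
{
    if (!idx.isValid() || role != Qt::EditRole)
        return false;
    m_rows[idx.row()][idx.column()] = value;
    emit dataChanged(idx, idx, { role });            // observers are notified of the change
    return true;
}

bool MemoryTableModel::insertRows(int row, int count, const QModelIndex& parent)
{
    if (parent.isValid() || row < 0 || row > m_rows.size())
        return false;
    beginInsertRows(parent, row, row + count - 1);   // notify views before mutating
    m_rows.insert(row, count, QVector<QVariant>(columnCount()));
    endInsertRows();                                 // and after
    return true;
}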

Finally, QAbstractItemModel can also provide header section data using headerData()/setHeaderData() for read and write access respectively (this functionality is used e.g. by the QHeaderView class to display table row/column headers), symbolically link items using the buddy() method, support incremental population using the canFetchMore() and fetchMore() methods, provide drag-and-drop awareness using the canDropMimeData()/dropMimeData() methods, optimize sibling lookups with sibling(), and offer convenience methods for bulk data retrieval/setting across all roles: itemData()/setItemData().

In WEIRDB, QAbstractItemModel is used as the random, read-only access layer for the supported in-situ data formats (e.g. CSV, Nifti-1), transparently handling all of the related optimizations. It also makes it possible to use existing standard QAbstractItemModel implementations such as QDirModel, QFileSystemModel, QSqlTableModel, QSqlRelationalTableModel or QSqlQueryModel as tables within the WEIRDB query engine.

A flattening wrapper common to all tree-oriented QAbstractItemModel implementations would allow them to be handled correctly by the current SQL engine, which is unaware of tree structures.

In the future, full QAbstractItemModel handling could be incorporated directly into the engine, using custom SQL syntax.

Lastly, QAbstractItemModel can also be adjusted to describe directed acyclic graph models for use with certain basic queries, although not to the extent required of typical graph databases.

In the following chapter I will discuss WEIRDB's network access protocols as an alternative to using WEIRDB as a programming library.

4.4. Network Access Protocols

Programmatic usage of WEIRDB without writing additional bindings is limited to C++ with the Qt framework, so it can sometimes be impossible or impractical to use it this way. There are also situations which benefit explicitly from a remote access capability, such as establishing a public network database from existing external files, or accessing one of the existing network databases through a custom QAbstractItemModel implementation integrated with WEIRDB. For these cases WEIRDB exposes two network protocols: a simple native text protocol, which prioritizes ease of implementation and big data support over compatibility with existing access libraries, and a protocol compatible with PostgreSQL Network Protocol v3, in order to benefit from existing driver implementations such as ODBC, JDBC or native PostgreSQL access libraries.

In the following two chapters, I will present respectively the native network protocol and PostgreSQL
network protocol.

4.4.1. Native WEIRDB Protocol

WEIRDB's native network protocol is text-based. The client is responsible for initiating all communication, and each reply from the server corresponds to one request from the client. Upon establishing a connection with WEIRDB the client can proceed to send commands terminated with a newline character. Incoming commands are parsed and executed sequentially, always generating a response consisting of at least a newline character. The general response format is CSV. In case of SELECT queries it is the natural form of the resulting data. In case of control statements such as ATTACH, DETACH or INTO ... SELECT, either an empty response (a single newline character) is sent, signalling success, or a single-cell CSV file containing the error message. In practice a single-cell CSV file is indistinguishable from a simple line of text. Examples of a successful and an erroneous query look as follows:

Query:

SELECT date, gender, name FROM patients;\n

Reply:

date;gender;name\n
1984-05-11;M;Alice\n
1975-01-25;M;Bob\n
1967-12-01;M;Elody\n

Query:

ATTACH "non_existing_file" AS "test" TYPE CSV;\n

Reply:

File "non_existing_file" does not exist\n

4.4.2. PostgreSQL Network Protocol v3

WEIRDB supports PostgreSQL Network Protocol v3 in order to provide access for applications not supporting its native protocol. Thanks to PostgreSQL support via ODBC, JDBC or native PostgreSQL APIs, usage of WEIRDB becomes possible within any existing database software. It comes however with a restriction on the maximum number of columns returned in a result set - 32767. This limit is imposed by the protocol specification and might be even lower depending on the specific API used for access. Notably, this doesn't prevent issuing multiple requests to retrieve all of the columns in horizontal batches, and special commands could be envisioned to hint WEIRDB not to discard temporary structures before all columns have been retrieved in this batched manner.

The PostgreSQL network protocol follows a client-server architecture using mixed text/binary messaging. It uses big endian (i.e. most significant byte at the lowest address) ordering for multi-byte binary data types. Each regular message starts with a 1-byte message type code followed by a 4-byte message length field (the startup packet is an exception and carries no type code). The database acts as the server, while the database user connects as a client. The communication is initiated with the following StartupPacket structure sent by the client.

typedef struct StartupPacket
{
ProtocolVersion protoVersion; /* Protocol version */
char database[SM_DATABASE]; /* Database name */
/* Db_user_namespace appends dbname */
char user[SM_USER]; /* User name */
char options[SM_OPTIONS]; /* Optional additional args */
char unused[SM_UNUSED]; /* Unused */
char tty[SM_TTY]; /* Tty for debug output */
} StartupPacket;

It contains the protocol version number and five 64-byte strings defining respectively: database name,
user name, additional option switches, an unused string, and the terminal name for debug output. The
protocol version number can be used as well to designate special startup packets. There are two such
special cases of importance to WEIRDB, defined by the NEGOTIATE_SSL_CODE ("version"
1234.5679) and CANCEL_REQUEST_CODE ("version" 1234.5678) constants.

The NEGOTIATE_SSL_CODE constant indicates an attempt to negotiate SSL encryption of the database connection. WEIRDB doesn't support this scenario, therefore it sends a 1-byte response "N". This is a special response and is not prefixed by message type or length. Upon its reception, the client sends the StartupPacket again, this time without attempting SSL negotiation.
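The following server-side sketch illustrates how such special startup packets can be recognized and answered; the constants follow the PostgreSQL convention of packing the two version parts into a single 32-bit value, while the function shape itself is an assumption rather than WEIRDB's actual code.

// Sketch of server-side handling of the special startup codes described above.
#include <QTcpSocket>
#include <QtEndian>

static const quint32 NEGOTIATE_SSL_CODE  = (1234 << 16) | 5679;  // "version" 1234.5679
static const quint32 CANCEL_REQUEST_CODE = (1234 << 16) | 5678;  // "version" 1234.5678

// Called with the 4-byte big-endian protocol version read from the startup packet.
// Returns true when the packet was a special one and has been fully handled.
bool handleSpecialStartup(QTcpSocket& client, quint32 protoVersionBigEndian)
{
    const quint32 proto = qFromBigEndian(protoVersionBigEndian);

    if (proto == NEGOTIATE_SSL_CODE) {
        client.write("N", 1);       // SSL not supported; bare response, no header
        return true;                // client will resend a plain StartupPacket
    }
    if (proto == CANCEL_REQUEST_CODE) {
        return true;                // cancellation is ignored by WEIRDB at the moment
    }
    return false;                   // regular startup packet, continue normally
}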

The CANCEL_REQUEST_CODE constant indicates an attempt to cancel an ongoing query. Upon
retrieval of such protocol version, the server should interpret received data as a CancelRequestPacket
structure specified below.

typedef struct CancelRequestPacket
{
/* Note that each field is stored in network byte order! */
MsgType cancelRequestCode; /* code to identify a cancel request */
unsigned int backendPID; /* PID of client's backend */
unsigned int cancelAuthCode; /* secret key to authorize cancel */
} CancelRequestPacket;

This structure specifies the identification (PID) of the internal database process executing the query to
be killed and a request cancellation authorization code, which is provided by the server using the
BackendKeyData message during startup phase. WEIRDB doesn't support request cancellation at the
moment and therefore ignores CANCEL_REQUEST_CODE messages and doesn't send cancellation
authorization messages.

After receiving a regular StartupPacket the server can send a number of messages related to authentication - Authentication(Ok, KerberosV5, CleartextPassword, MD5Password, SCMCredential, GSS, SSPI, GSSContinue). All of them except the AuthenticationOk message require further action on the client side in order to authenticate itself. WEIRDB doesn't support authentication at the moment and admits all clients immediately with an AuthenticationOk.

Upon receiving AuthenticationOk, the client can still expect one of the following messages:
BackendKeyData, ParameterStatus, NoticeResponse before the final ReadyForQuery or
ErrorResponse. The ParameterStatus message can provide information about current values of server
parameters and NoticeResponse is responsible for issuing warning messages. The ErrorResponse
message signifies an error in the startup procedure, whereas ReadyForQuery marks a successful
finalization of the startup.
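As an illustration of the startup phase as handled by WEIRDB (immediate admission), the sketch below frames the two messages using the 1-byte type / 4-byte length layout; the message codes 'R' (Authentication) and 'Z' (ReadyForQuery) and their payloads follow the protocol specification, while the helper itself is only illustrative.

// Sketch of the two startup-phase responses: AuthenticationOk ('R' with a zero status)
// followed by ReadyForQuery ('Z' with transaction status 'I' = idle). Lengths include
// the 4-byte length field itself, per the protocol specification.
#include <QByteArray>
#include <QtEndian>

static QByteArray pgMessage(char type, const QByteArray& payload)
{
    QByteArray msg;
    msg.append(type);
    quint32 len = qToBigEndian<quint32>(payload.size() + 4);
    msg.append(reinterpret_cast<const char*>(&len), 4);
    msg.append(payload);
    return msg;
}

QByteArray startupReply()
{
    QByteArray authOkPayload(4, '\0');   // Int32 0 = authentication successful
    QByteArray readyPayload(1, 'I');     // 'I' = idle (not in a transaction)
    return pgMessage('R', authOkPayload) + pgMessage('Z', readyPayload);
}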

After receiving a ReadyForQuery message, the client can commence using either the simple or the extended query interface messages to execute SQL queries. The difference between the simple and the extended interface is that the latter allows reusable prepared statements and parameter placeholders. At the moment, WEIRDB supports only the simple query interface, therefore all values have to be provided within the query text and will be parsed each time the query is issued. Also, the query must be sent using the simple query messages; WEIRDB reports an error when extended query interface usage is detected.

The simple query interface is invoked with a Query message from the client side. The payload of the
Query message contains just the desired null-terminated string corresponding to the query. Responses
to SELECT queries involve seven potential messages: EmptyQueryResponse, RowDescription,
DataRow, CommandComplete, NoticeResponse, ErrorResponse and ReadyForQuery.
EmptyQueryResponse is sent when an empty query string has been recognized, therefore not yielding
any valid result rows.

RowDescription's role is to provide column names and data types before sending the actual data.
RowDescription's payload contains the number of columns in the result (specified as signed short,
hence the 32767 limit) and for each column - its name (null-terminated string), table identifier (signed
integer) if the column values come directly from a table otherwise zero, similarly column id (signed
short) and finally a type specifier consisting of type OID (Object ID), length (signed short), type
modifier (signed integer) and format (signed short). The type OID can refer to one of the predefined
type OIDs (e.g. TEXTOID, FLOAT8OID, INT4OID, etc.) or user-defined type OIDs. The length
specifies the number of bytes in the internal representation for the given type or -1 if type has variable
representation length (e.g. text). The type modifier refers to type-specific modifier value (e.g.
maximum length for VARCHAR). Lastly, the format field specifies whether the column data will be
transferred in textual or binary mode. In case of binary mode, the internal byte representation in big
endian order is used. In case of textual mode, all values are converted and transferred as null-
terminated human readable strings.

The DataRow message supplies, for successive rows, the actual values of the columns specified with RowDescription. DataRow's payload again includes the number of columns in the row, followed by the column values in their respective formats. After the transfer of the last DataRow, a CommandComplete message is sent. CommandComplete contains a null-terminated string specifying the command tag (e.g. SELECT) of the command which finished. If the query consisted of only one command, processing is finished and a ReadyForQuery message follows. However, if multiple commands separated with semicolons were present in the query, the process described above is repeated for each command before finally replying with a ReadyForQuery message.

If a warning is generated during the processing of any command, it is transferred using the NoticeResponse message. If, on the other hand, an error is encountered, it is communicated with an ErrorResponse message and further processing of the query is aborted. In both cases - after the final CommandComplete message or after an ErrorResponse message - the server communicates the end of processing using a final ReadyForQuery message.
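The response sequence for a minimal one-column, one-row SELECT can be sketched as follows; OID 25 (TEXTOID) and the textual transfer format are used for simplicity, and the helpers are illustrative rather than WEIRDB's implementation.

// Sketch of the RowDescription/DataRow/CommandComplete/ReadyForQuery sequence.
#include <QByteArray>
#include <QtEndian>

static void putInt32(QByteArray& b, quint32 v) { v = qToBigEndian(v); b.append(reinterpret_cast<const char*>(&v), 4); }
static void putInt16(QByteArray& b, quint16 v) { v = qToBigEndian(v); b.append(reinterpret_cast<const char*>(&v), 2); }

static QByteArray pgMessage(char type, const QByteArray& payload)
{
    QByteArray msg;
    msg.append(type);
    putInt32(msg, payload.size() + 4);      // length includes the length field itself
    return msg + payload;
}

QByteArray selectResponse(const QByteArray& columnName, const QByteArray& value)
{
    // RowDescription: one column, type TEXT, textual transfer format.
    QByteArray rd;
    putInt16(rd, 1);                        // number of columns
    rd.append(columnName).append('\0');     // column name (null-terminated)
    putInt32(rd, 0);  putInt16(rd, 0);      // table id and column id (not from a table)
    putInt32(rd, 25); putInt16(rd, 0xFFFF); // type OID (TEXTOID), length -1 (variable)
    putInt32(rd, 0);  putInt16(rd, 0);      // type modifier, format 0 = text

    // DataRow: one column, explicit value length followed by the value bytes.
    QByteArray dr;
    putInt16(dr, 1);
    putInt32(dr, value.size());
    dr.append(value);

    // CommandComplete carries the command tag, ReadyForQuery ends the cycle.
    QByteArray cc("SELECT 1"); cc.append('\0');
    QByteArray rq(1, 'I');

    return pgMessage('T', rd) + pgMessage('D', dr) + pgMessage('C', cc) + pgMessage('Z', rq);
}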

WEIRDB uses the binary format to transfer data to clients. Data types are mapped between Qt and the PostgreSQL protocol using the conversions illustrated in Table 4.4.2.1. In case of CSV files, numbers are always mapped to double precision floating point values and all other values to text.

Table 4.4.2.1. Data mapping between some Qt and SQLite types and PostgreSQL Network Protocol v3 predefined type
OIDs

Qt SQLite PostgreSQL Protocol v3

QString TEXT TEXTOID

float REAL FLOAT4OID

double REAL FLOAT8OID

int INTEGER(4) INT4OID

unsigned int INTEGER(8) INT8OID

long long INTEGER(8) INT8OID

unsigned long long REAL FLOAT8OID

char TEXT CHAROID

bool INTEGER(1) BOOLOID
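A possible way to express the mapping from Table 4.4.2.1 in code is sketched below; the numeric OID values are the standard ones predefined in the PostgreSQL catalog, while the fallback to TEXTOID for unlisted types is an assumption.

// Sketch of the Qt-to-protocol type mapping from Table 4.4.2.1.
#include <QVariant>

enum PgTypeOid : int {
    BOOLOID   = 16,
    CHAROID   = 18,
    INT8OID   = 20,
    INT4OID   = 23,
    TEXTOID   = 25,
    FLOAT4OID = 700,
    FLOAT8OID = 701
};

int oidForVariant(const QVariant& v)
{
    switch (v.type()) {
    case QVariant::Bool:      return BOOLOID;
    case QVariant::Char:      return CHAROID;
    case QVariant::Int:       return INT4OID;
    case QVariant::UInt:      return INT8OID;    // unsigned int promoted to 8-byte integer
    case QVariant::LongLong:  return INT8OID;
    case QVariant::ULongLong: return FLOAT8OID;  // per Table 4.4.2.1, mapped through REAL
    case QVariant::Double:    return FLOAT8OID;
    case QVariant::String:    return TEXTOID;
    default:                  return TEXTOID;    // fall back to text for other types
    }
}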

4.5. Summary

The current proof-of-concept implementation of WEIRDB improves over the state of the art in a number of ways. It is built from the ground up with the aim of handling external data, similarly to Teiid. The latter, however, spreads its efforts among accessing external files, JDBC data sources, LDAP, the Salesforce.com API and other Web Services, whereas WEIRDB focuses on external files with particular emphasis on formats used in neuroscience. This results in a cleaner and more easily extensible API compared to Teiid's hard-coded CSVTABLE() and XMLTABLE() functions.

Furthermore, WEIRDB is architected to target performance objectives exceeding those achievable with Teiid or other existing classical DBMS solutions such as SQLite, PostgreSQL or MySQL running with data virtualization extensions. To this end, it tries to benefit maximally from the available memory resources by using memory mapping for access to external files, as well as numerous format-specific optimizations in its external data adapters.

WEIRDB's programming interface allows for random access to query results with amortized
processing time thanks to data caching and efficient retrieval algorithms. With network access it
boasts great performance for bulk data transfers using simplified native protocol and good
compatibility with existing software when using PostgreSQL network protocol.

The proof-of-concept implementation handles Nifti-1 and CSV - the two most frequently used file formats in neuroscience - and is designed in a way that allows minimum-effort implementation of the other relevant formats - DICOM, MINC, HDF5, MAT, etc.

WEIRDB's unique characteristic is its ability to immediately handle very big files with non-
normalized table representation which can result in millions of columns (e.g. for genetic
investigations).

Lastly, the typing system used in WEIRDB derives directly from its internal engine - SQLite - and is fully dynamic, which offers more flexibility when handling heterogeneous external data.

The combination of these characteristics made it possible to successfully validate WEIRDB against existing solutions and to compete with them in terms of benchmark performance. In the following chapters, I present and discuss the results of this comparison.

5. Evaluation

5.1. Datasets and Test Queries

In order to illustrate the full spectrum of WEIRDB's characteristics, the testing procedure uses
multiple datasets and test queries.

Dataset 1 (a.k.a. "nodb_test") is modelled after the microbenchmark dataset from [86] and consists of a CSV file containing 7.5 million rows and 150 columns. Each cell in this dataset contains a random integer number from the [0, 10^9) range. The total size of the resulting file is about 10 GB. The purpose of dataset 1 is to illustrate performance using a generic big dataset with a typical normalized layout.
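For readers who wish to reproduce a file with the same characteristics, a simple generator might look as follows; the column names c0-c149 and the comma delimiter are assumptions consistent with the queries described below, not a description of the original microbenchmark files.

// Illustrative generator for a dataset with the characteristics of dataset 1:
// 7.5 million rows, 150 columns, random integers drawn from [0, 10^9).
#include <cstdio>
#include <random>

int main()
{
    const long long rows = 7500000;
    const int cols = 150;

    std::mt19937_64 rng(42);
    std::uniform_int_distribution<long long> dist(0, 999999999);  // [0, 10^9)

    std::FILE* f = std::fopen("nodb_test.csv", "w");
    if (!f) return 1;

    // Header: c0, c1, ..., c149 (column names assumed to match queries 1 and 2).
    for (int c = 0; c < cols; ++c)
        std::fprintf(f, c ? ",c%d" : "c%d", c);
    std::fputc('\n', f);

    for (long long r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c)
            std::fprintf(f, c ? ",%lld" : "%lld", dist(rng));
        std::fputc('\n', f);
    }
    std::fclose(f);
    return 0;
}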

Dataset 2 (a.k.a. "ADNI_Clin_6800_geno") is modelled after Alzheimer's Disease Neuroimaging Initiative (ADNI) [33] clinical data combined with preprocessed structural MRI and single nucleotide
polymorphism (SNP) genetic data. The clinical data include subject identifier, date of birth, gender,
diagnosis and other basic information as well as results of different cognitive tests in a total of about
140 columns. The structural MRI data contain initially about 400000 values per subject. This number
is reduced to 6800 using an atlasing approach based on subparcellation of Automated Anatomical
Labeling (AAL) [114] atlas volume. The genetic data contain about 20000 SNP values. The overall
number of columns in the dataset is close to 30000. This dataset was used with its original content for
the tests. For replication purposes, a generation script is provided which keeps the characteristics
(column names, row sizes, some content essential for benchmark queries) of the original. The goal of
dataset 2 is to test performance with a typical medium size mixed modality neuroscience study.

Dataset 3 (a.k.a. "PTDEMOG") contains extended clinical information not included in dataset 2. It is
meant as companion file to dataset 2 for testing JOIN queries.

Dataset 4 (a.k.a. "MicroarrayExpression_fixed") is modelled after single human subject gene expression data from the Allen Brain Atlas [115]. It contains 60000 rows and 1000 columns resulting in a 900 MB CSV file. Similarly to dataset 2, this dataset was used with its original content for the tests and an analogous generation script is provided for replication. The purpose of dataset 4 is to illustrate WEIRDB's behaviour in the context of genetic data.

Dataset 5 (a.k.a. "Probes") is a companion file for dataset 4. It contains additional genetic probe information. Each row in dataset 4 is associated with one probe in dataset 5 (and vice versa) using an identifier stored in its first column. The number of rows in dataset 5 is therefore equal to that of dataset 4 (i.e. 60000).

Dataset 6 (a.k.a. "dummy") is modelled after the original preprocessed structural MRI data from ADNI, i.e. images containing about 400000 voxels. The resulting file is stored again in CSV format with 1000 rows and 400000 columns. The columns contain single-byte values to reflect the characteristics of preprocessed imaging data (8-bit). The file's size is about 800 MB. The purpose of dataset 6 is to test the "wide" data handling capabilities of WEIRDB with neuroimaging data.

Dataset 7 (a.k.a. "dummy2") is meant as a companion file for dataset 6 in order to test the JOIN query. It contains 1000 rows and 10000 columns resulting in about 20 MB of size.

Query 1 is the microbenchmarking query using dataset 1. It can be written as pseudo regular
expression:

SELECT c{0-149}(, c{0-149}){9} FROM nodb_test ;

where {0-149} signifies a random number from range [0, 150) picked independently for each
repetition of the (, c{0-149}) group and for each execution of the query. The natural language
description of the query is selecting 10 random columns from dataset 1 without filtering.

Query 2 is another microbenchmarking query using dataset 1, this time with filtering. It can be
written as pseudo regular expression:

SELECT c{0-149}(, c{0-149}){9} FROM nodb_test WHERE c{0-149}>500000000 ;

using the same pseudo syntax for the {0-149} symbol. The added filtering condition c{0-149}>500000000 means that, on average, half of the rows will be retrieved due to the characteristics of dataset 1.
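Since queries 1 and 2 are defined only as pseudo regular expressions, each benchmark run has to expand them into concrete SQL; a possible generator is sketched below (column names c0-c149 assumed, as for dataset 1).

// Sketch of expanding the pseudo regular expressions of queries 1 and 2 into SQL.
#include <iostream>
#include <random>
#include <sstream>
#include <string>

std::string makeMicrobenchQuery(std::mt19937& rng, bool withWhere)
{
    std::uniform_int_distribution<int> col(0, 149);   // {0-149} picked independently

    std::ostringstream q;
    q << "SELECT c" << col(rng);
    for (int i = 0; i < 9; ++i)                       // nine more random columns
        q << ", c" << col(rng);
    q << " FROM nodb_test";
    if (withWhere)
        q << " WHERE c" << col(rng) << ">500000000";  // keeps ~half of the rows on average
    q << " ;";
    return q.str();
}

int main()
{
    std::mt19937 rng(std::random_device{}());
    std::cout << makeMicrobenchQuery(rng, false) << "\n";  // query 1
    std::cout << makeMicrobenchQuery(rng, true)  << "\n";  // query 2
    return 0;
}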

Queries 3 through 6 illustrate different questions a researcher might ask with respect to datasets 2 and
3.

Query 3:

SELECT * FROM ADNI_Clin_6800_geno ;

retrieves indiscriminately all records from dataset 2.

Query 4:

SELECT * FROM ADNI_Clin_6800_geno
WHERE Diagnosis=" AD " ;

retrieves records of subjects with diagnosis of Alzheimer's disease.

Query 5:

SELECT * FROM ADNI_Clin_6800_geno
WHERE Diagnosis=" AD " OR PTGENDER=1 OR RID%2==0 ;

performs more elaborate filtering by selecting subjects that satisfy any of the following criteria:
diagnosed Alzheimer's disease, gender male, identifier is an even number.

Query 6:

SELECT * FROM ADNI_Clin_6800_geno AS a
INNER JOIN PTDEMOG AS b USING(RID)
WHERE a.Diagnosis=" AD " OR a.PTGENDER=1 OR a.RID%2==0 ;

in addition to applying the same filtering criteria as query 5, joins records from dataset 2 with the corresponding records from dataset 3 using the subject identifier field as the join criterion.

Query 7:

SELECT * FROM MicroarrayExpression_fixed AS a
INNER JOIN Probes AS b ON(a.c1=b.probe_id) ;

performs an unfiltered JOIN operation on datasets 4 and 5, using the c1 column and the probe_id column respectively as the join criterion.

Queries 8 and 9 test performance of "wide" data handling capabilities using datasets 6 and 7.

Query 8:

SELECT * FROM dummy ;

retrieves indiscriminately all data from dataset 6.

Finally, query 9:

SELECT * FROM dummy AS a INNER JOIN dummy2 AS b USING(c0) ;

retrieves all data from both datasets 6 and 7 using JOIN operation and common column c0 as join
column.

In the following chapter I will describe how WEIRDB was benchmarked against existing solutions
using the aforementioned datasets and queries.

5.2. Benchmarking against Existing Solutions

Overall, WEIRDB has been benchmarked against the following existing DBMSes: MySQL [70], PostgreSQL [116], HSQLDB [117], H2 [118], wormtable [119], Teiid [120] and SQLite [67]. Some of them have been benchmarked more than once using different software and hardware configurations. Many of the following results come from [121]; others (Table 5.3.5, Table 5.3.6) haven't been published before. Each test has been executed 10 times and the reported values are the mean and standard deviation of the execution times, either alone (for queries 3-9) or together with the separate results of each run (for queries 1-2).

An overview of the tests performed is given in Table 5.2.1:

Table 5.2.1. Overview of software/hardware configurations in the performed benchmarking tests and the corresponding results tables. * - solutions implementing real external data access; other DBMSes required data import and their results are presented only as a reference point. X - no tests performed. (2) - indexing failed. (3) - the bulk of the data had to be stored in a single column because of the limitation on the number of columns.

Platform: PC Core i5-2400, 16GB RAM, Windows 7, GNU GCC 4.6.2
Queries 1-2: MySQL, wormtable, Teiid*, WEIRDB* (Table 5.3.3, Table 5.3.4)
Queries 3-6: WEIRDB*, PostgreSQL(3), HSQLDB*(2), H2 (Table 5.3.1)
Query 7: WEIRDB*, PostgreSQL(3), HSQLDB*(2), H2 (Table 5.3.2)
Queries 8-9: WEIRDB* (Table 5.3.7)

Platform: Macbook Air Core i7 1.7 GHz (Turbo Boost 3.3 GHz), 8GB RAM, Mac OS X 10.9.5, clang 600.0.57 (Apple LLVM 6.0)
Queries 1-2: SQLite (with CSVExtension)*, MySQL (with Federated engine)*, PostgreSQL (with file_fdw)* (Table 5.3.5, Table 5.3.6)
Queries 3-9: X

Of particular interest are the solutions marked with *, as they implement external data access in the sense described in the previous chapters, i.e. without requiring data import. Tests of the remaining solutions were performed after loading the data into native database storage and are provided as a point of reference only. Upon completion of the import procedure, or upon attaching external data sources, indices were created for the columns used in WHERE or ON/USING clauses whenever possible. HSQLDB didn't finish indexing the requested columns within 10 minutes, therefore this step was considered failed and the results for HSQLDB are reported without indexing.

In case of PostgreSQL, the limit on the number of columns made it necessary to store most of the data in a single TEXT column in tests 3-6 and 7, with only the columns used for filtering and joining stored separately. The PostgreSQL results should therefore be treated as a rough reference only.

PostgreSQL was subsequently tested using external data access extension (file_fdw) along with
SQLite (with CSVExtension [43]) and MySQL (with Federated engine) using queries 1 and 2 which
didn't conflict with the number of columns limit.

The compiler listed in hardware column on Windows platform refers to WEIRDB only, as the other
DBMSes were used as provided in precompiled vendor binaries. On Mac OS X platform, all software
was compiled from sources using the compiler specified.

In the following chapter, the results of the aforementioned battery of tests are presented.

5.3. Results

The results of aforementioned benchmark tests provide and support answers to a number of
questions: (1) to what extent are existing DBMSes capable of processing neuroimaging data in
general (indiscriminately of support for external data access), (2) what level of performance can be
expected in existing solutions with and without support for external data, (3) how does WEIRDB
compare in terms of performance to both types of existing solutions within their operational limits,
(4) what level of performance can be expected of WEIRDB when processing external data exceeding
limits of existing DBMSes.

(1) To address the first issue regarding the extent of support for neuroimaging data, Table 5.2.1 illustrates that among queries 3 through 9, which reflect a range of plausible research inquiries on realistic data sources, the existing solutions (represented by PostgreSQL, HSQLDB and H2) can handle only queries 3 through 7. Queries 8 and 9, involving a table with close to half a million columns, were beyond the fixed technical limitation on the number of columns in PostgreSQL (Table 4.1.3), whereas in case of HSQLDB and H2 they were impossible to execute on the tested hardware because of excessive memory requirements. With the amount of RAM specified in Table 5.2.1, HSQLDB and H2 caused intense usage of the swap file and CPU, rendering the whole computer unresponsive - a state from which the PC could be recovered only by killing the database process. Some of the other existing solutions (e.g. MySQL, SQLite) share the limitations of PostgreSQL (Table 4.1.3), while others exhibit similar performance shortcomings (Table 5.3.3, Table 5.3.4, Table 5.3.5, Table 5.3.6) and a similar excess in required resources relative to WEIRDB as HSQLDB and H2. This leads to the conclusion that the real-life capabilities of existing DBMSes with regards to neuroimaging data are limited.

(2) The question of performance in existing database solutions is broadly covered by Table 5.3.1, Table 5.3.2, Table 5.3.3, Table 5.3.4, Table 5.3.5 and Table 5.3.6, which demonstrate query processing efficiency on a variety of workloads ranging from medium-sized neuroimaging-specific tasks (queries 3-7) to typical generic queries (1, 2). In Table 5.3.1 and Table 5.3.2, the databases using data import prior to testing (PostgreSQL, H2) exhibit an advantage over HSQLDB using its external CSV file support. PostgreSQL performs an order of magnitude better than HSQLDB and H2 in Table 5.3.1 and is again on par with H2 in Table 5.3.2, suggesting that the merging of excess columns into one TEXT column in PostgreSQL for Table 5.3.1, as mentioned in the previous chapter, gives it an unfair advantage, while the growing size of the TEXT column in Table 5.3.2 significantly lowers PostgreSQL's performance. The generic benchmarks in Table 5.3.3, Table 5.3.4, Table 5.3.5 and Table 5.3.6 support the observation that data import works better in existing DBMSes than external data support, illustrated in particular by the performance of MySQL with data import (Table 5.3.3, Table 5.3.4) and MySQL with an external data adapter (Table 5.3.5, Table 5.3.6) - the former reporting execution times of 184.305±5.015s (Table 5.3.3) and 103.530±4.366s (Table 5.3.4) respectively, the latter 395.623±16.051s (Table 5.3.5) and 322.324±6.009s (Table 5.3.6) respectively. Wormtable is a notable exception because, in spite of data import, it performs twice as slowly as Teiid - the slowest solution using external data access. Wormtable is advertised as beta software and could conceivably contain bottlenecks explaining this behaviour. These observations lead to the conclusion that, within the framework of the discussed benchmarks, existing solutions report query execution times from tens up to hundreds of seconds.

(3) WEIRDB was included in all previously mentioned tables for comparison purposes. In Table 5.3.1, an order of magnitude advantage over the second-best PostgreSQL can be seen for query 3, which subsequently drops to about a 50% advantage for query 4 and then increases again, reporting a 5-fold advantage over PostgreSQL in query 5 and a 6-fold advantage in query 6. In Table 5.3.2 the advantage of WEIRDB over PostgreSQL and H2 is much stronger, at slightly over two orders of magnitude. Notably, HSQLDB didn't finish that benchmark within the allotted time. In the generic query 1 (Table 5.3.3, Table 5.3.5) WEIRDB compares very well against all existing solutions, reporting a one (against PostgreSQL with file_fdw) to two orders of magnitude advantage (against the remaining DBMSes). The most interesting observation arises from the benchmark results of query 2 (Table 5.3.4, Table 5.3.6), where WEIRDB performs worse (227.630±11.120s) than MySQL with data import (103.530±4.366s) and PostgreSQL with external data access (89.535±6.177s), but better than MySQL with external data access (322.324±6.009s). These results underline the advantage of MySQL's data import and expose a technical shortcoming of the current implementation of WEIRDB - the internal caching and indexing engine, which requires importing an additional column each time the WHERE clause in query 2 changes.

(4) Finally, Table 5.3.7 shows the results of benchmarking queries 8 and 9 using WEIRDB as the
only DBMS capable of handling the necessary data on the given hardware. The reported results of
6.298±0.047s and 7.391±0.123s respectively are very fast compared to some of the other benchmarks
and highlight WEIRDB's excellent capability of handling "wide" data typical for many neuroimaging
scenarios.

Table 5.3.1. Performance comparison of WEIRDB and HSQLDB/H2/PostgreSQL. All execution times in milliseconds. * For PostgreSQL only the columns necessary for testing the WHERE/JOIN/etc. conditions were created in the respective tables; the remaining columns were preserved in CSV format in one column called "rest". The time required for parsing the CSV column is not included in the measurements, as it would be negligibly small compared to the query execution time. ** For HSQLDB the actual CSV support was used.

Query | WEIRDB | PostgreSQL | HSQLDB | H2

SELECT * FROM ADNI_Clin_6800_geno | 347.9±8.79 | 3978.1±30.27 | 43579.0±1432.11 | 35412.0±1611.24

SELECT * FROM ADNI_Clin_6800_geno WHERE Diagnosis=" AD " | 436.1±10.88 | 702.7±13.0618 | 30513.0±5064.80 | 7543.2±167.83

SELECT * FROM ADNI_Clin_6800_geno WHERE Diagnosis=" AD " OR PTGENDER=1 OR RID%2==0 | 689.2±12.02 | 3046.6±33.58 | 38849.5±889.83 | 20650.1±946.79

SELECT * FROM ADNI_Clin_6800_geno AS a INNER JOIN PTDEMOG AS b USING(RID) WHERE a.Diagnosis=" AD " OR a.PTGENDER=1 OR a.RID%2==0 | 771.9±6.22 | 4830.6±48.37 | 262336.7±9080.58 | 11206.3±527.54

Table 5.3.2. Performance comparison of WEIRDB and HSQLDB/H2/PostgreSQL using single subject Allen Brain Atlas
data. All execution times in milliseconds.

Query | WEIRDB | PostgreSQL | HSQLDB | H2

SELECT * FROM MicroarrayExpression_fixed AS a INNER JOIN Probes AS b ON(a.c1=b.probe_id) | 310.9±14.77 | 72158.9±1312.98 | Didn't finish | 75680.8±20192.18

Table 5.3.3. Query time comparison in microbenchmark similar to the first one from [86], without WHERE condition. All
times in seconds. * Without Initial Load time.

Database Initial Load 1st run 2nd run 3rd run 4th run 5th run

WEIRDB ~120 34.165 8.441 7.949 7.780 5.373

MySQL ~285 178.706 185.047 175.109 185.564 189.746

Wormtable ~583 1357.733 1343.813 1342.653 1456.482 1418.348

Teiid ~0 602.970 625.252 665.062 637.091 657.034

Database 6th run 7th run 8th run 9th run 10th run Mean±SD*

WEIRDB 5.364 5.219 5.321 4.583 4.408 8.860±8.549

MySQL 185.712 180.901 183.235 193.692 185.334 184.305± 5.015

Wormtable 1614.710 1469.741 1412.251 1348.159 1347.477 1411.137± 81.902

Teiid 692.183 644.607 625.928 616.335 642.654 640.912± 24.529

Table 5.3.4. Query time comparison in microbenchmark based on first one from [86], with WHERE condition. All times in
seconds.

Database 1st run 2nd run 3rd run 4th run 5th run

WEIRDB 235.406 215.238 221.219 223.249 227.089

MySQL 100.214 106.100 99.777 98.810 99.280

Wormtable 1378.221 1329.387 1372.538 1419.567 1644.511

Teiid 571.458 585.293 620.953 622.774 632.024

Database 6th run 7th run 8th run 9th run 10th run Mean±SD*

WEIRDB 237.116 238.105 248.184 210.987 219.704 227.630± 11.120

MySQL 102.006 104.829 113.525 103.399 107.366 103.530± 4.366

Wormtable 1488.179 1641.428 1872.519 1510.913 1646.998 1530.426± 160.910

Teiid 615.900 775.444 835.995 847.338 713.459 682.064± 98.207

Table 5.3.5. First benchmark for true in-situ CSV implementations, without WHERE clause. * Reused results from Table
5.3.3.

Query: SELECT (random 10 columns) FROM test; - all times in seconds.

SQLite (with CSVTable extension): initial load 0; runs 639.963, 663.023, 662.496, 662.979, 662.033, 675.018, 687.721, 641.671, 683.165, 660.547; mean 663.862 ± 15.445

MySQL (with Federated engine): initial load 456.930; runs 384.208, 381.199, 386.423, 414.544, 414.227, 408.720, 416.499, 392.464, 382.134, 375.809; mean 395.623 ± 16.051

PostgreSQL (file_fdw): initial load 0; runs 94.336, 95.512, 106.378, 104.128, 93.544, 91.443, 91.869, 92.091, 94.503, 91.300; mean 95.511 ± 5.078

WEIRDB*: initial load ~120; runs 34.165, 8.441, 7.949, 7.780, 5.373, 5.364, 5.219, 5.321, 4.583, 4.408; mean 8.860 ± 8.549

Table 5.3.6. First benchmark for true in-situ CSV implementations, with WHERE clause. * Reused results from Table 5.3.4.

Query: SELECT (random 10 columns) FROM test WHERE (random column) > 500000000; - all times in seconds.

SQLite (with CSVTable extension): initial load 0; runs 691.075, 697.634, 648.847, 669.064, 669.431, 669.039, 667.400, 656.622, 647.656, 670.760; mean 668.752 ± 15.256

MySQL (with Federated engine): initial load n/a; runs 318.349, 321.754, 319.055, 318.467, 333.070, 322.902, 321.983, 314.567, 319.208, 333.884; mean 322.324 ± 6.009

PostgreSQL (file_fdw): initial load 0; runs 87.316, 97.740, 93.578, 93.094, 96.291, 80.564, 80.017, 86.203, 95.346, 85.205; mean 89.535 ± 6.177

WEIRDB*: initial load ~120; runs 235.406, 215.238, 221.219, 223.249, 227.089, 237.116, 238.105, 248.184, 210.987, 219.704; mean 227.630 ± 11.120

Table 5.3.7. WEIRDB performance on a table with 400000 columns and 1000 rows (dummy); dummy2 has 10000 columns
and 1000 rows. Execution time in milliseconds.

Query WEIRDB

SELECT * FROM dummy 6297.8±47.04

SELECT * FROM dummy AS a INNER JOIN dummy2 AS b USING(c0) 7391.2±123.261

In addition to WEIRDB's mostly superior results compared to state-of-the-art solutions, the software has been tested in practice by neuroscience research teams. In the following chapter, WEIRDB is presented through the prism of its practical applications in the investigation of neurodegenerative disorders.

6. Practical Applications
One of the major fields in neuroscience is the study of the mechanisms of pathological neurodegeneration caused by diseases such as Parkinson's, Alzheimer's or Huntington's. Such a research process usually consists of analyzing features specific to people suffering from a given disorder as compared to the general population. Sources of features may include imaging techniques (e.g. MRI, PET), blood
tests, cerebrospinal fluid (CSF) tests, genetic examination, electrophysiological recordings (e.g. EEG,
MEG) and others. The goal of the analysis is identification of so-called biomarkers which refer to
measurable indicators of some biological state or condition. Examples of the latter may include e.g.
diagnosis, neuropsychological scores, age, gender or handedness of a subject. Depending on the
timepoint the measure of state refers to, we can differentiate between using biomarkers for estimation
(of state at the same time as the biomarker is obtained) and prediction (of state at a time following
that of the biomarker acquisition). Most of the time, ensembles of raw (measured) biomarkers are
used to define multi-variate biomarkers. Depending on the type of state variable, such multi-variate
biomarkers are used in classification (for discrete states, e.g. diagnosis) and regression (for
continuous states, e.g. age) algorithms.

The goal pursued by WEIRDB in this domain is to provide a theoretical, methodological and technical tool to extract features from different neuroimaging modalities, biospecimen measures and genetic data. The emphasis is on a mega-variate approach (data fusion) able to combine multiple biomarkers from different types of MRI/PET scanning protocols, blood/CSF/genetic tests and clinical data. One example is the integration of high-resolution morphological information obtained from T1 structural scans with functional blood oxygenation level dependent (BOLD) images obtained from functional Magnetic Resonance Imaging (fMRI), fitted to the genetic/clinical/behavioural observations using approaches ranging from simple correlation to discriminative analysis (SVM, LDA), in order to create reliable predictive models of neurodegenerative diseases. A good predictive model is one able to predict the data obtained in one modality from data in another (e.g. predicting clinical outcome from MRI scans).

WEIRDB was used so far mostly in Alzheimer's Disease studies where its data fusion features
constituted part of several pipelines used to extract features and biomarkers, analyse them statistically
against previously diagnosed cases and characterize the patient’s state with respect to cognitively
normal, through mildly cognitively impaired, to the dementia stage. The ultimate aim of such
pipelines is to establish a generative model of the pathophysiology of neurodegeneration-associated
cognitive decline, weighing in informed ways the impact of anatomical changes, genotype, clinical
phenotype, etc. and be deployed within hospital environments, providing clinicians with information
about patient's diagnosis, management and treatment options.

Figure 6.1. Illustration of data sets available in ADNI phases 1, 2 and GO. Wavy rectangles contain full names (top) and file
names (bottom) of respective data sets. Circles represent phases of the ADNI study. Data sets are illustrated as colored dots
belonging to one or more circles. Intersection areas designate data sets available for multiple phases of the study. ADAS-Cog
- Alzheimer's Disease Assessment Scale - Cognition, ApoE - Apolipoprotein E, ASL - Arterial Spin Labeling, AV45 -
florbetapir, BSI - Brief Symptom Inventory, CBF - Cerebral Blood Flow, CSF - Cerebrospinal Fluid, DTI - Diffusion Tensor
Imaging, fMRI - Functional Magnetic Resonance Imaging, NMRC - National Medical Research Council, NYU - New York
University, ROI - Region of Interest, sPAP - Syngo PET Amyloid Plaque, SPM - Statistical Parametric Mapping, SyN -
Symmetric Diffeomorphic Image Normalization method, TBM - Tensor-Based Morphometry, UA - Universiteit Antwerpen,
UC - University of California, UCD - University College Dublin, UCL - University College London, UCLA - University of
California Los Angeles, UCSF - University of California San Francisco, UPENN - University of Pennsylvania, UPitt -
University of Pittsburgh, UU - Utrecht University, VBM - Voxel-Based Morphometry.

The Alzheimer's Disease Neuroimaging Initiative (ADNI) is a public-private partnership study aiming to increase the pace of discovery in Alzheimer's Disease (AD) research. The three principal goals of ADNI are: to detect AD at the earliest stages possible and identify biomarkers allowing the progress of the disease to be tracked; to support advances in AD intervention, prevention and treatment at the earliest stages; and to continually develop ADNI's data sharing model, which grants access to the data to all interested investigators. ADNI data servers contain brain scans, genetic profiles and biomarkers in blood and CSF. In addition to raw data, the results of different processing pipelines are also made available as part of the effort. ADNI is an ongoing study that began in October 2004. During its lifetime it has transitioned from the phase called ADNI1 to ADNI GO in 2010, and from ADNI GO to the current ADNI 2 in 2011. The different phases of ADNI are characterized by differences in the biomarker acquisition protocols used, which causes additional heterogeneity in data availability - Figure 6.1 (note the reversed figure/caption order for layout reasons).

A typical study of ADNI data integrates imaging data with genetic, neuropsychological, biochemical, haematological and clinical information, and any other biomarkers, over a very large number of elderly (>60 years), cognitively impaired and unimpaired subjects. The first challenging issue is the vast amount of data that must be fused. WEIRDB serves as the unifying layer which, thanks to its in-situ data handling, allows for both efficient in-database data selection and analysis and out-of-database processing using custom scripts (e.g. MATLAB). Figure 6.2 illustrates how different external data sources were linked to WEIRDB in their raw and processed forms. Furthermore, the schematic illustrates access to data from the MATLAB environment using both custom I/O and the WEIRDB database interface - a situation which is often necessitated by preexisting processing algorithms supporting file input only.

Typically, before handing data over to scripting environments a number of data curation operations
are performed using WEIRDB and SQL. Examples of such basic procedures are presented in Table
6.1. Custom code is used only for study-specific processing (data fitting, classification, regression) as
illustrated in the following sections.

The next three sections contain examples of research projects using WEIRDB as part of their
processing pipelines. The studies have been concluded respectively with publications: [1], [2] and
[122].


Figure 6.2. Schematic of external data sources linked to WEIRDB. (1) ADNI tabular data
(demographic, clinical, imaging meta data and other meta data) (2) ADNI FDG-PET raw imaging
data (3) Three City Study (3C) imaging meta data (4) ADNI MRI raw imaging data (MP-RAGE
sequence) (5) Three City Study (3C) MRI raw imaging data (6) results of segmentation of (4) and
(5) using SPM8 (7) Results of normalization of (6) into MNI standard brain space (8) Results of
extracting ~6800 custom features from (7) using tailor-made atlas. MATLAB is capable of
accessing the data directly or via WEIRDB.

Table 6.1. Basic WEIRDB operations used for preparing a dataset for handling by data clustering method "H". The
following notation is used: bold capital letter, e.g. T - table, bold lower case letter, e.g. c - column/vector, italic upper case
letter N - set of natural numbers, regular upper case letter, e.g. P - string, italic lower case letter e.g. p - function, mod -
modulo division operator, ROWID - function returning unique row identifier corresponding to provided argument, NULL -
pseudovalue representing absence of value. For brevity of notation, the following assumptions are made: cells/rows within a
column are ordered by ROWID value, ROWID ∈ N ∪ {0}, the first ROWID is equal to 0 and is monotonically incremented
by 1, columns listed at least once together in the same table have the same cardinality.

Add Column
Input: tables T1 = {c1, c2, ..., cn}, T2 = {c*}
Output: table T* = {c1, c2, ..., cn, cn+1 = c*}
SQL: SELECT c1, ..., cn, c* FROM T1 INNER JOIN T2 ON (ROWID(T1) = ROWID(T2))

Copy Columns
Input: tables T1 = {c1,1, ..., c1,n}, T2 = {c2,1, ..., c2,m}, indices i = {i1, ..., ik : ∀i ∈ i, i ∈ [1, m]}
Output: table T* = {c1,1, ..., c1,n, c2,i1, ..., c2,ik}
SQL: SELECT c1,1, ..., c1,n, c2,i1, ..., c2,ik FROM T1 INNER JOIN T2 ON (ROWID(T1) = ROWID(T2))

Count Values
Input: table T = {c1, ..., cn}
Output: vector n = {nk : k ∈ [1, n] ∧ nk = |m : ∀m ∈ m, ck[m] ≠ NULL|}
SQL: SELECT COUNT(c1), ..., COUNT(cn) FROM T

Extract Every Nth Row
Input: table T = {c1, ..., cn}, number n ∈ N
Output: table T* = {c1*, ..., cn* : ∀c ∈ T*, ∀c ∈ c, ROWID(c) = n * m, m ∈ N ∪ {0}}
SQL: SELECT * FROM T WHERE ROWID(T) mod n = 0

Extract Columns / Change Order of Columns
Input: table T = {c1, ..., cn}, indices i = {i1, ..., ik : ∀i ∈ i, i ∈ [1, n]}
Output: table T* = {ci1, ..., cik}
SQL: SELECT ci1, ..., cik FROM T

Filter Columns
Input: table T = {c1, ..., cn}, pattern string P
Output: table T* = {ck : k ∈ [1, n] ∧ ∀k, column name of ck matches P}
SQL: SELECT (column names matching P) FROM T

Filter Empty Rows
Input: table T = {c1, ..., cn}
Output: vector r = {r1, ..., rk : ∀r ∈ r, ∀c ∈ T, c[r] ≠ NULL}, table T* = {c1*, ..., cn* : ∀c ∈ T*, ∀c ∈ c, ROWID(c) ∈ r}
SQL: SELECT * FROM T WHERE c1 ≠ NULL AND ... AND cn ≠ NULL

Filter Rows
Input: table T = {c1, ..., cn}, vector of predicates p = {p1(c1, ..., cn), ..., pk(c1, ..., cn)}
Output: table T* = {c1*, ..., cn* : ∀r = ROWID(c ∈ c1*), p1(c1*[r], ..., cn*[r]) ∧ ... ∧ pk(c1*[r], ..., cn*[r])}
SQL: SELECT * FROM T WHERE p1(c1, ..., cn) AND ... AND pk(c1, ..., cn)

Fix Missing Values
Input: table T = {c1, ..., cn}
Output: table T* = {c1*, ..., cn* : ∀r = ROWID(c ∈ c1*), c1*[r] = c1[r] if c1[r] ≠ NULL else c1[r − 1], ..., cn*[r] = cn[r] if cn[r] ≠ NULL else cn[r − 1]}
SQL: SELECT CASE A.c1 WHEN A.c1 = NULL THEN B.c1 ELSE A.c1 END, ..., CASE A.cn WHEN A.cn = NULL THEN B.cn ELSE A.cn END FROM T AS A INNER JOIN T AS B ON (ROWID(A) = ROWID(B) + 1)
Additional assumption for brevity: the first row never contains missing values

Union Select
Input: tables T1 = {c1,1, ..., c1,n}, T2 = {c2,1, ..., c2,n}
Output: table T* = {c1,1 ∪ c2,1, ..., c1,n ∪ c2,n}
SQL: SELECT * FROM T1 UNION SELECT * FROM T2

Sort
Input: table T = {c1, ..., cn}, column ck, k ∈ [1, n]
Output: table T* = {c1*, ..., cn* : ∀a, b, ∀ca ∈ ca*, cb ∈ cb*, ROWID(ca in ca) = ROWID(cb in cb) ∧ ∀r ∈ [0, |ck| − 2], ck[r] < ck[r + 1]}
SQL: SELECT * FROM T ORDER BY ck

6.1. Combining Classification Pipeline with Data Virtualization

Publication [1] involves usage of WEIRDB as part of a pipeline consisting of automated machine
learning, Support Vector Machine (SVM) classification, feature selection and relational data
processing (joining, filtering, grouping, conversion, stratification, output formatting). Raw input data
size is equal to ~325 GB out of which ~137 MB are processed using WEIRDB and the remaining data
combined with output from WEIRDB are handled using a MATLAB toolbox.

Adaszewski, Dukart et al. investigate the diagnostic accuracy of automated machine learning classification of Alzheimer's disease from preclinical stages to clinical dementia. To this end, they use longitudinal data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The data include date of birth, gender, behavioral test results (in particular the Mini-Mental State Examination (MMSE)) and neuroimaging scans using 1.5T T1-weighted structural magnetic resonance imaging (sMRI). All measurements and imaging data are accompanied by the corresponding acquisition dates. The data come in the form of separate CSV files (~137 MB) for patient, behavioral examination and imaging meta data, with image volumes stored as Nifti-1 files (~325 GB).

In this first published study involving WEIRDB, the practical application consisted of in-situ access to CSV files in order to perform a number of joining, filtering, grouping, conversion and formatting tasks. The first stage was the stratification of subjects into four diagnostic groups: AD patients (n=108), mild cognitive impairment (MCI) converters (cMCI, n=142), MCI nonconverters (ncMCI, n=61) and healthy controls (n=137). Stratification was followed by selecting subjects who had baseline and at least 2 years of follow-up scans. Subsequently, MRI scans were binned and paired with the closest behavioral test results at baseline and at the 6, 12, 24, 36, 48 and 60-month follow-up visits. For usage with the Support Vector Machine (SVM) classifier testing framework, the data were split into training/testing sets and converted into suitable output files to be used with custom MATLAB scripts. The latter relied on the LIBSVM [123] implementation of SVM to perform a series of cross-validations demonstrating longitudinal changes in the predictive accuracy of computer-based diagnosis of AD.

Main findings of the study include detection of systematic differences in classification accuracy of
cMCI that were dependent on closeness to conversion time. Whole-brain SVM classification 3–4
years before conversion revealed chance-level accuracy for AD detection. After feature selection
restricting analysis to structures of medial temporal lobe and parietal cortex, authors observed
accuracies significantly higher than chance level already 4 years before conversion, bringing strong
evidence that sMRI-based information can be used to predict conversion from cMCI to AD at disease
stages earlier than previously suggested.

Usage of WEIRDB in this scenario was crucial for efficiently establishing a robust local database
combining multiple CSV files with imaging data for usage within MATLAB.

6.2. Preparation for Statistical Analysis and Modelling using Data Virtualization

Study [2] employs WEIRDB as one of the input stages for a biomedical modelling workflow.
WEIRDB is used for relational data processing and consistency cross-validation between two virtual
data sources representing CSV files and imaging data. Raw input data size is equal to ~400 GB all of
which are processed using WEIRDB prior to being handed over to a MATLAB toolbox and scripting
environment.

Dukart et al. observe that prior studies using both 18F-fluorodeoxyglucose positron emission
tomography (FDG-PET) and structural magnetic resonance imaging (sMRI) report controversial
results about time-line, spatial extent and magnitude of respectively glucose hypometabolism and
atrophy in Alzheimer's disease (AD) with inconsistencies depending on clinical and demographic
characteristics of subject cohorts. Subsequently, Dukart et al. present a generative anatomical model
of glucose hypometabolism and atrophy progression in AD based on FDG-PET and sMRI data of 80
patients and 79 healthy controls from Alzheimer's Disease Neuroimaging Initiative (ADNI) database.
The aim of the model is to illustrate age- and symptom severity-related changes differentially against
baseline provided by healthy aging - Figure 6.2.1.

To derive the model, Dukart et al. extract from the ADNI database sMRI and FDG-PET data from
multiple centers of 80 patients and 79 healthy controls including: subject ID, diagnosis, date of sMRI
scan, IDs of MRI study, series and image, MRI sequence name, date of FDG-PET scan, IDs of FDG-
PET study, series and image and FDG-PET sequence name. WEIRDB was used for in-situ access to
raw ADNI CSV files in order to create a flat file joining corresponding pieces of information from
separate sources. Furthermore, it was used in conjunction with filesystem access middleware to
cross-check for presence of all necessary sMRI and FDG-PET unprocessed volumes as well as
required processed (SPM8) output in the local computing cluster environment and network
filesystem.

Figure 6.2.1. Voxel-wise intersections of healthy aging and changes observed in AD in MRI (a) and
FDG-PET (b). Colour bars represent the intersection age. AD Alzheimer's disease, FDG-PET
[18F]fluorodeoxyglucose positron emission tomography, MRI structural magnetic resonance
imaging. With sMRI, we see the highest intersection ages bilaterally in hippocampus, anterior and
posterior thalamus, posterior and midcingulate, parietal, temporal, cerebellar, prefrontal and
premotor regions. With FDG-PET, regions with the highest intersection ages are bilaterally
restricted to precuneus, cerebellum, anterior and posterior cingulate, posterior parietal, temporal,
lateral occipital, primary motor, premotor and prefrontal cortices and the left dorsal caudate nucleus.
Source: Dukart J, Kherif F, Mueller K, Adaszewski S, Schroeter ML, Frackowiak RS, Draganski B,
Alzheimer's Disease Neuroimaging Initiative (2013) Generative FDG-PET and MRI model of aging
and disease progression in Alzheimer's disease. PLoS Computational Biology. Vol. 9(4). pp.
e1002987.

The resulting model deriving from multi-modal data describes, integrates and predicts characteristic
patterns of AD related pathology in dissociation from healthy aging effects. It further provides a
cross-modal explanation for findings suggesting distinction between early- and late-onset AD. The
generative model constitutes a reference for development of individualized biomarkers allowing for
accurate early diagnosis and treatment evaluation in future studies.

Usage of WEIRDB in this research was useful for quickly cross-referencing two virtual data sources
in their original form without the need for building a database schema.

6.3. Novel Uses of Data Virtualization in Cerebral Cartography

Study [122] places WEIRDB at the center of input processing for an unbiased data clustering method
"H". Its role involves relational data processing, integration and formatting. Input encompasses ~115
GB of data consisting of tabular text and biomedical images all of which are handled using WEIRDB.
Subsequently, processed data are interfaced with standalone third-party analytical software. Study-
specific data processing involved the types of operations listed in Table 6.3.1.

Table 6.3.1. Operations realized using external scripts taking advantage of WEIRDB's in-situ operation maintaining native
file formats.

Format data for "H": Prepare CSV and XML files for processing using "H". Thanks to WEIRDB using CSV as its temporary table storage format, the CSV portion of the processing packages didn't have to be explicitly prepared at all, with a little Python script generating the XML meta data.

Modify Column Names: Changing the names of columns usually requires calls to DDL. Using WEIRDB it can be done directly within file formats supporting column names, such as CSV, by replacing the relevant fragments of the first file line (header).

Replace Character: Replacing a character in all fields using SQL requires an unnecessarily complex statement to be executed. Having direct access to CSV files allows text processing to be used instead.

Apply Rules: Clustering rules obtained from "H" can be equally easily applied directly to CSV files using existing tools or by piping data through the database interface.

Prepare Rules x Instances Graph: Preparing a graph representation of rules sharing matching data instances was possible using pre-existing file-based scripts.

Prepare Rules x Variables Graph: Preparing a graph representation of rules sharing variables used for classification was possible using pre-existing file-based scripts.

Following is a brief summary of the neuroimaging aspect of the study. It is intended as an illustrative
bird's-eye view of aforementioned functions in the broader context of brain research. It doesn't
constitute essential part of this work and its analysis is left to reader's discretion.

Frackowiak and Markram define modern cerebral cartography as static neuroanatomical knowledge augmented with temporal, biochemical, psychological and behavioral observations. Currently those observations are being analysed independently by about 100000 neuroscientists worldwide, sometimes organized within consortia or collaborative networks. This traditional model results in an astounding amount of information. Consequently, they point out the need for data-driven multi-scale frameworks for relating individual observations and discoveries. In particular, they forecast that the next quarter century will bring the discovery of principles governing brain organization at different levels and of how different spatio-temporal scales relate to each other. This advancement will stem from the analysis of large volumes of neuroscientific and clinical data by a process of reconstruction, modelling and simulation - capitalizing on recent developments in informatics and computer science.

As part of their vision, Frackowiak and Markram outline a strategy for the classification of brain
diseases. They recognize hospital records as a locked-up wealth of information and propose methods
for accessing them while adhering to patient data privacy policies. Furthermore, they envision applying
the big data analysis and data mining tools currently used in many industries (e.g.
aircraft/spacecraft design, weather forecasting, entertainment) to patient records in Europe's hospitals,
with the aim of finding disease signatures. The latter are defined as sets of quantifiable, parametrized
biological and clinical variables common to homogeneous groups of patients. Multiple such groups
can be differentiated within cohorts previously classified clinically into much coarser units.

Figure 6.3.1 (source: [122], courtesy of HBP sub-project 8, Dr F Kherif et al.) illustrates 22 disease
signatures identified within a cohort classically divided into only two groups of participants - those
suffering from cognitive impairment symptoms and the unaffected ones. The results are based on 500
datasets provided by the 3C consortium based in Bordeaux, France, courtesy of Professors Dartigues
and Orgogozo. They included clinical, imaging, proteomic, cerebrospinal fluid (CSF) protein and
genotyping data, some of them incomplete and from different centres. The groupings were obtained
using an unbiased data clustering method. Input for the method's software implementation was
prepared using WEIRDB. This involved in-situ access to the original multi-modal data, and filtering,
grouping, integration and formatting of the resulting unified dataset according to the clustering
software requirements.

Figure 6.3.1. Multiple modality rule set for cognitively declined and healthy individuals. The figure
represents subgroups of individuals of advanced age with (red) and without (blue) cognitive
symptoms, obtained using an unbiased data clustering method. The datasets contained clinical,
imaging, proteomic, cerebrospinal fluid (CSF) protein and genotyping (550–1000 K SNPs) data,
some of them incomplete and from three different centres. The method divided the individuals into
different subgroups in each category depending on their similarity. The numbers in the subgroups
represent the number of individuals belonging to each one. The profile of each subgroup represents
specific SNPs (in black), MRI (green) and PET (orange) patterns of focal brain atrophy or
hypometabolism, proteins (purple) and CSF proteins (sky blue). The subgroups in the normal
cognition category may represent patients with compensated pathology that goes on to cognitive
decline as well as completely normal individuals of different biological ages. The cognitive decline
category also has a number of subgroups. The largest subgroup (n = 92) is associated with Abeta 42
in the CSF and the ApoE4 homozygotic genotype, suggesting it may represent typical Alzheimer
disease. The specificity of associations is remarkable. This is an initial, unpublished categorization
based on 500 datasets provided by the 3C consortium based in Bordeaux France courtesy of
Professors Dartigues and Orgogozo. Although not truly a big data analysis, which would consist of
many hundreds of thousands of individuals, the result is striking. (Figure courtesy of HBP sub-
project 8, Dr F Kherif and colleagues). Source: Frackowiak R. Markram H. (2015) The future of
human cerebral cartography: a novel approach. Philosophical Transactions of the Royal Society of
London B: Biological Sciences. Vol. 370 (1668).

Furthermore, WEIRDB was able to serve as a pipeline for preparing a proof-of-concept unified
clustering input file from close to 10,000 datasets coming from the same source. The disease signature
search within such an expanded space is the subject of future research.

The application of WEIRDB in this project was essential for the verification, selection and
transformation of the input data into a form compatible with the destination clustering software.
Direct access to binary data in their native form and automatic schema inference were crucial for
effortless definition of the necessary operations and their efficient execution.
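
The automatic schema inference mentioned above can be conveyed with a simplified sketch limited to CSV
input; WEIRDB's own inference is implemented in its C++ engine and is not restricted to this format, so
the fragment below (with a hypothetical file name) only illustrates the idea of deriving column names
and coarse types directly from the source file.

import csv

def infer_csv_schema(path, sample_rows=100):
    # Guess column names and coarse SQL types (INTEGER/REAL/TEXT) from a CSV sample.
    def coarse_type(value):
        for caster, sql_type in ((int, "INTEGER"), (float, "REAL")):
            try:
                caster(value)
                return sql_type
            except ValueError:
                continue
        return "TEXT"

    rank = {"INTEGER": 0, "REAL": 1, "TEXT": 2}
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        types = ["INTEGER"] * len(header)
        for i, row in enumerate(reader):
            if i >= sample_rows:
                break
            for j, value in enumerate(row[:len(header)]):
                guess = coarse_type(value)
                if rank[guess] > rank[types[j]]:
                    types[j] = guess  # widen the type when a value does not fit
    return list(zip(header, types))

# e.g. infer_csv_schema("clinical_scores.csv")  (hypothetical file name)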

7. Conclusions and Discussion
Based on the presented results, WEIRDB exhibits good performance characteristics, in most cases
offering more efficient query execution than existing DBMSes, whether those use internal storage or
external data access. As a DBMS designed from the ground up for in-situ operation, WEIRDB requires a
less complicated setup procedure for accessing external data sources than existing solutions; in
particular, the database schema is inferred directly from the source files without the need for manual
specification. Furthermore, WEIRDB's random access model, consisting of memory-mapped files and
allocated lookup structures, does not impose significant memory overhead. This characteristic allowed
the practical execution of queries which, while technically possible, proved unrealistic with competing
solutions on the tested hardware. WEIRDB offers a low-complexity codebase that is efficient for
retrieval and stream-based processing of data in third-party applications. To this end, it makes use of
an internal indexing engine instead of implementing its functions from scratch. In the discussed
proof-of-concept implementation SQLite served in this role, but it can be replaced by arbitrary
existing solutions depending on data processing needs. Finally, WEIRDB has been successfully applied in
a number of neuroscience studies using datasets of varying complexity - from lightweight to challenging
for existing DBMSes. Results of these studies are published in peer-reviewed journals, as discussed in
the Practical Applications chapter.
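
The random access model described above - memory-mapped files combined with lightweight lookup
structures - can be conveyed with the following simplified sketch. The real engine is implemented in C++
and indexes considerably more than line offsets; the Python fragment below (with a hypothetical file
name) only demonstrates how a memory map plus a small offset index gives row-level random access to a
CSV file without loading it into the database.

import mmap

class MappedCsv:
    # Row-level random access to a CSV file via mmap and a line-offset index.
    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        # Lookup structure: byte offset of every line start, built once.
        self._offsets = [0]
        pos = self._map.find(b"\n")
        while pos != -1:
            self._offsets.append(pos + 1)
            pos = self._map.find(b"\n", pos + 1)

    def row(self, n):
        # Return row n (0 = header) without touching the rest of the file.
        start = self._offsets[n]
        end = self._map.find(b"\n", start)
        end = len(self._map) if end == -1 else end
        return self._map[start:end].decode().rstrip("\r").split(",")

    def close(self):
        self._map.close()
        self._file.close()

# table = MappedCsv("roi_volumes.csv")   # hypothetical file name
# print(table.row(0), table.row(12345))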

A few critical remarks surface based on observable but non-measured behavior of WEIRDB in the
benchmark tests. Although not enough to seriously influence the performance measurements,
WEIRDB's parsing time grew more than linearly with the number of columns in a query. The cause of
this behaviour is the use of dictionary structures for accessing column properties by name. Since these
dictionaries have O(log n) lookup complexity, a parser that checks all columns has O(n log n)
complexity. Although not an issue in the benchmarks at hand, and still exceeding the performance of the
other discussed solutions, it could pose a problem in hypothetical scenarios of non-normalised
multi-million-column tables, especially with many levels of nested queries. Furthermore, WEIRDB has so
far been used mostly as a data access engine, whereas performing in-database processing and analytics
would require a more convenient syntax for the array-like data structures broadly represented in the
neuroimaging domain. An attempt at such an extension, allowing columns to be specified by ranges rather
than by names, has been implemented in the proof-of-concept software and proved to be a valuable
addition pending practical verification. In addition, the initial loading times reported for queries 1
and 2 could be eliminated by building the lookup structures on the fly instead of the current
ahead-of-time approach.
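
The O(n log n) parsing behaviour can be reproduced with a toy model in which every column referenced by
a query is resolved through an ordered, O(log n) lookup. WEIRDB's parser is written in C++ and uses
tree-based dictionaries; the sketch below (with hypothetical column names) merely mirrors the
asymptotics using a binary search over a sorted schema.

from bisect import bisect_left

def resolve_columns(referenced, schema):
    # schema is a sorted list of column names; each lookup costs O(log n),
    # so resolving n referenced columns costs O(n log n) in total.
    resolved = []
    for name in referenced:
        i = bisect_left(schema, name)
        if i == len(schema) or schema[i] != name:
            raise KeyError("unknown column: " + name)
        resolved.append(i)
    return resolved

# schema = sorted("col%d" % i for i in range(1000000))   # a very wide table
# resolve_columns(["col42", "col999999"], schema)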

Finally, the use of an existing solution as the internal indexing engine, although sound from a
software architecture point of view, causes bottlenecks and unnecessarily rigidifies the query
pipeline, having a significant impact on processing and analytics queries which require many columns to
be imported into the engine.

Reaching back to the originally stated problem of low database adoption in scientific environments,
Maier et al. [20] named four main reasons for this situation: (1) data sizes, (2) data access patterns,
(3) interfaces to scientific programming environments and (4) mapping of data types between scientific
software and databases. WEIRDB addresses the problem of data size by offering a solution capable of
handling sets of arbitrary volume with outstanding efficiency, further underlined by its external data
access mechanisms eliminating the need for data import. These characteristics allow multi-gigabyte
databases to be created in tens of seconds and complex queries on them to be executed within single
seconds. The external data access also reflects the typical access pattern requirement of a scientific
setup, with read-only access provided by the SELECT query, whereas output can be achieved by custom
software accessing the database interface and writing out the results in a domain-specific format. In
terms of interfaces to scientific programming environments, WEIRDB offers three main methods: embedding
WEIRDB using plugin mechanisms (e.g. MEX for Matlab, Cython for Python) and accessing its functionality
directly, in a manner similar to SQLite; network access using WEIRDB's native protocol, which supports
an unlimited number of columns and, due to its simplified design, carries a low implementation cost
within existing software; and finally, network access using the PostgreSQL protocol, which allows
existing software to interface with WEIRDB via existing drivers (e.g. ODBC, JDBC, native PostgreSQL
drivers and alternative drivers supporting the protocol), all of which are easily accessible from all
mainstream scientific programming platforms thanks to the popularity of PostgreSQL. Lastly, with regard
to the mapping of data types between different environments, WEIRDB does not impose a particular
solution but relies on dynamic typing to provide the flexibility necessary to leave the decision in the
hands of end users. This approach allows the necessary conversions to be made only on demand (e.g. when
an expression requiring particular data types is used within a database query, or when user software
writes out the results retrieved from WEIRDB in a custom format). In many cases this means that most of
the data never need to be converted and none of the data need to be converted ahead of time (i.e.
during creation of the database). This lifts a heavy burden from researchers considering adoption of a
DBMS, as no architecture planning is involved, and allows them to start working with their data
immediately rather than consider technical details first.
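
Because WEIRDB speaks the PostgreSQL wire protocol, a standard driver can connect to it without any
WEIRDB-specific code. The following minimal sketch uses the psycopg2 driver from Python; host, port,
credentials and the table and column names are placeholders, and a running WEIRDB instance exposing the
protocol is assumed.

import psycopg2

# Connection parameters are placeholders for a WEIRDB instance speaking
# the PostgreSQL protocol.
conn = psycopg2.connect(host="localhost", port=5432,
                        dbname="study", user="researcher", password="secret")
try:
    with conn.cursor() as cur:
        # Read-only SELECT access matches the typical scientific access pattern;
        # table and column names are hypothetical.
        cur.execute("SELECT subject_id, age FROM subjects WHERE age >= %s", (65,))
        for subject_id, age in cur:
            print(subject_id, age)
finally:
    conn.close()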

As stated in chapter 2, the advent of publicly available online neuroscience databases, such as ADNI
[33], PPMI [34] and many others, as well as (in the near future) the Human Brain Project [3], poses new
challenges for data access and integration methods. These online resources will increasingly become
part of research pipelines, calling for mechanisms of seamless integration between network databases,
remotely stored files, local data and locally controlled processing and analytics algorithms. Existing
solutions for data virtualization prove inadequate in many typical research scenarios, leaving the
challenge open for WEIRDB and other next-generation external data management architectures. WEIRDB's
potential to lead this race was realistically demonstrated in the Practical Applications chapter, where
it was used to handle the ADNI database files and the Three City Study [124] data.

The current proof-of-concept implementation of WEIRDB accounts for and simplifies many tasks
identified as beneficial to neuroimaging studies. It is a DBMS supporting SQL semantics [38]. It offers
rapid creation of relational databases [39] out of native neuroimaging data formats. It can be used in
strictly local as well as distributed computing environments [39]. Furthermore, due to the supported
remote access protocols, WEIRDB can serve as a convenient solution for storing and sharing data via a
network. It enables server-side filtering and aggregation, provides a single point of update and can be
coupled to load balancing mechanisms. WEIRDB's focus on external data access contributes to the
reduction of data duplication. Due to its simple random data access abstraction, implementing support
for new data formats in addition to the existing selection is easier than in competing DBMSes with
external data support. Combined with cross-platform and cross-language capabilities, this provides the
possibility of choosing "best of breed" algorithms for each task in a mix-and-match approach, with data
transformations taking place flexibly on either the database or the software end. Consequently, WEIRDB
can be employed at any stage of a neuroimaging pipeline while offering good practical performance on
demanding datasets.
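
Server-side filtering and aggregation mean that only reduced result sets ever cross the network. A
hedged example of such a query, reusing the PostgreSQL-protocol access shown earlier and with
hypothetical table and column names, could look as follows; only one summary row per group is
transferred to the client.

import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="study",
                        user="researcher", password="secret")  # placeholders
with conn, conn.cursor() as cur:
    # Hypothetical table and columns; the grouping and averaging happen on
    # the server, so only the per-group summary rows travel over the network.
    cur.execute("""
        SELECT diagnosis, COUNT(*) AS n, AVG(hippocampus_volume) AS mean_volume
        FROM roi_volumes
        WHERE scan_quality = 'ok'
        GROUP BY diagnosis
    """)
    for diagnosis, n, mean_volume in cur.fetchall():
        print(diagnosis, n, mean_volume)
conn.close()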

From the results of this work and the previously identified requirements for modern scientific database
solutions, and neuroimaging databases in particular, a number of directions emerge for the future
development of WEIRDB. A strategic decision needs to be taken between three possible approaches to
executing SQL queries. The first is the current decoupled approach, in which query execution and
external data access are separated, which implies the performance penalties discussed in the Results
section. Another approach would be to mix the data virtualization mechanisms into the internal engine
used by WEIRDB, alleviating the on-demand column import costs, combined with native evaluation of
output expressions. The final alternative is to implement a full query execution pipeline within WEIRDB
from scratch, and in the author's current conviction this should be the gold standard offering the best
performance gains.

Several more immediate modifications are possible as well, e.g. extended handling of the PostgreSQL
protocol to allow retrieval of an unlimited number of columns without the overhead of executing a
completely new query. Consecutive queries referring to different columns in the same set of rows
should be identified and corresponding optimizations of the data retrieval procedure carried out on the
server side. Furthermore, a fourth interface to WEIRDB using shared memory is possible in local
environments. It would offer the same performance benefits as the embedding approach, without
duplicating the memory allocated for lookup structures across different processes on a local machine.
Multi-user access is another avenue; it currently functions in a very simplified manner, delegating
authentication and authorization to external proxies. Introduction of access control mechanisms into
WEIRDB would allow more fine-grained control of access rights and further simplify the setup procedure
for some multi-user research scenarios. Standardization is yet another strategic area for scientific
databases. The current goals of WEIRDB are to support SQL/MED [40] as the external data specification
method and SciQL [41] for referencing array-like data, which is highly prevalent in neuroimaging use
cases. The former requires less work, as it maps quite closely onto the existing semantics for external
file attachment, while the latter is more ambitious and related to another goal of WEIRDB - to offer
support for MATLAB calls within SQL queries, similarly to the announced R [21] support in Microsoft SQL
Server 2016 [22]. Last but not least, expansion of the portfolio of supported neuroimaging data formats
should be on WEIRDB's roadmap, in particular including such emerging standards as BIDS [36] and
OpenFMRI [37]. Support for these formats, originally intended as custom I/O standards for multisite
neuroimaging, will help bridge the gap between scientific software and DBMS usage in neuroscience.
Successive implementation of the aforementioned features will result in progressive adoption of
database solutions in research environments and offer increased productivity, robustness and sharing
opportunities while reducing the necessary storage and computational resources.
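
For orientation, the SQL/MED-style attachment of an external CSV file, as implemented today by
PostgreSQL's file_fdw foreign data wrapper, can be scripted as below. This illustrates the standard that
WEIRDB intends to adopt rather than WEIRDB's own (not yet defined) syntax; server, file, table and
column names are placeholders, and a PostgreSQL instance with file_fdw available is assumed.

import psycopg2

DDL = [
    "CREATE EXTENSION IF NOT EXISTS file_fdw",
    "CREATE SERVER IF NOT EXISTS csv_files FOREIGN DATA WRAPPER file_fdw",
    """CREATE FOREIGN TABLE IF NOT EXISTS subjects_ext (
           subject_id text,
           age integer,
           diagnosis text
       ) SERVER csv_files
         OPTIONS (filename '/data/subjects.csv', format 'csv', header 'true')""",
]

conn = psycopg2.connect(host="localhost", dbname="study",
                        user="researcher", password="secret")  # placeholders
with conn, conn.cursor() as cur:
    for statement in DDL:
        cur.execute(statement)          # attach the external file via SQL/MED DDL
    cur.execute("SELECT count(*) FROM subjects_ext WHERE age >= 65")
    print(cur.fetchone()[0])            # the CSV file is queried in place
conn.close()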

8. Disclosure Statement
Chapters 4 and 5 contain text fragments, tables and figures from [121]. The author retains rights to
those materials according to original journal's general publishing license.

9. Glossary
Abbreviation Full Form / Meaning

AAL Automated Anatomical Labelling

ABA Allen Brain Atlas

ACID Atomicity, Consistency, Isolation, Durability

ADNI Alzheimer's Disease Neuroimaging Initiative

AGD Autism Genetic Database

API Application Programming Interface

ASL Arterial Spin Labeling

AST Abstract Syntax Tree

BAT Binary Association Table

BIDS Brain Imaging Data Structure

cMCI Converting Mild Cognitive Impairment (into Alzheimer's Disease)

CPU Central Processing Unit

CSV Comma Separated Values

CWI Centrum Wiskunde & Informatica

DARTEL Diffeomorphic Anatomical Registration Through Exponentiated Lie Algebra

DBMS Database Management System

DDL Data Definition Language

DWI Diffusion Weighted Imaging

EDW Enterprise Data Warehouse

EEG Electroencephalography

ETL Extract, Transform, Load

FDG-PET 18F-FluoroDeoxyGlucose Positron Emission Tomography

FDW Foreign Data Wrapper

FITS Flexible Image Transport System

fMRI Functional Magnetic Resonance Imaging

GB Gigabyte

GEOS Geometry Engine, Open Source (a library)

GUI Graphical User Interface

HBP Human Brain Project

HDF5 Hierarchical Data Format Version 5

HIS Hospital Information System

HSQLDB HyperSQL Database

IBIS Inferred Biomolecular Interaction Server

ICA Independent Component Analysis

JBDS JBoss Developer Studio

JNDI Java Naming and Directory Interface

JSON JavaScript Object Notation

KB Kilobyte


LDAP Lightweight Directory Access Protocol

LEO Linked Open Earth Observation Data for Precision Farming

MAL MonetDB Assembly Language

MAT Multiple Association Table

MCI Mild Cognitive Impairment

MINC Medical Image NetCDF

MMSE Mini-Mental State Examination

MPD Mouse Phenome Database

MSEED Mini Standard for the Exchange of Earthquake Data

MVCC Multiversion Concurrency Control

ncMCI Non-converting (i.e. stable) Mild Cognitive Impairment

NDB Network Database (clustering mechanism in MySQL)

NetCDF Network Common Data Form

NIF Neuroscience Information Framework

OASIS Open Access Series of Imaging Studies

ODBC Open Database Connectivity

OID Object Identifier

OLAP Online Analytical Processing

OpenFMRI Open Functional Magnetic Resonance Imaging (standard for storing Functional Magnetic Resonance Imaging experimental data)

ORDBMS Object Relational Database Management System

PACS Picture Archiving and Communication System

PCA Principal Component Analysis

PGSQL PostgreSQL

PID Process Identifier

PPMI Parkinson's Progression Markers Initiative

RAM Random Access Memory

RDBMS Relational Database Management System

RID Row Identifier

ROOT Object Oriented program library developed by CERN

sMRI Structural Magnetic Resonance Imaging

SQL Structured Query Language

SVM Support Vector Machine

SWC One of the most widely used neuron morphology formats

TIC Total Ion Current

URL Uniform Resource Locator

VBM Voxel Based Morphometry

VCF Variant Call Format

VDBE Virtual Database Engine

Win32 Low-level Windows Programming API

XML Extensible Markup Language

10. References
1. Adaszewski S, Dukart J et al. (2013) How early can we predict Alzheimer's disease using
computational anatomy? Neurobiology of Aging. Vol. 34(12). pp. 2815-26.
2. Dukart J, Kherif F, Mueller K, Adaszewski S, Schroeter ML, Frackowiak RS, Draganski B,
Alzheimer's Disease Neuroimaging Initiative (2013) Generative FDG-PET and MRI model of
aging and disease progression in Alzheimer's disease. PLoS Computational Biology. Vol. 9(4).
pp. e1002987.
3. Human Brain Project (n.d.) Available: https://www.humanbrainproject.eu/. Accessed: 4 August
2015.
4. NIFTI-1 Format Website (n.d.). Available: http://nifti.nimh.nih.gov/nifti-1. Accessed 2 July
2014.
5. Gifti Format Website (n.d.) Available: https://www.nitrc.org/projects/gifti/. Accessed: 4 August
2015.
6. Medical Imaging NetCDF 2.0 (n.d.) Available:
https://en.wikibooks.org/wiki/MINC/Reference/MINC2.0_File_Format_Reference. Accessed: 4
August 2015.
7. FreeSurfer Surface File Formats Website (n.d.) Available:
http://www.grahamwideman.com/gw/brain/fs/surfacefileformats.htm. Accessed: 4 August 2015.
8. MAT File Format Website (n.d.) Available:
https://www.mathworks.com/help/pdf_doc/matlab/matfile_format.pdf. Accessed: 4 August 2015.
9. Hierarchical Data Format v5 - HDF5 (n.d.) Available: https://www.hdfgroup.org/HDF5/.
Accessed 28 August 2015.
10. UCL Camino Diffusion MRI Toolkit (n.d.) Available: http://cmic.cs.ucl.ac.uk/camino/.
Accessed: 27 September 2015.
11. Medical Reality Modeling Language (n.d.) Available: http://slicer.org/archives/mrml/index.html.
Accessed: 27 September 2015.
12. TrackVis (n.d.) Available: http://trackvis.org/. Accessed: 27 September 2015.
13. GraphML (n.d.) Available: http://graphml.graphdrawing.org/. Accessed: 27 September 2015.
14. Graph Exchange XML Format (n.d.) Available: http://gexf.net/format/. Accessed: 27 September
2015.
15. CIFTI Connectivity File Format (n.d.) Available: https://www.nitrc.org/projects/cifti/. Accessed:
27 September 2015.
16. Neurolucida (n.d.) Available: http://www.mbfbioscience.com/neurolucida. Accessed: 27
September 2015.

17. A Model Description Language for Computational Neuroscience (n.d.) Available:
https://www.neuroml.org/. Accessed: 27 September 2015.
18. SWC File Format (n.d.) Available: http://research.mssm.edu/cnic/swc.html. Accessed: 27
September 2015.
19. Shoshani A. (1984) Characteristics of scientific databases. VLDB’84. pp. 147-160.
20. Maier D. (1993) A Call to Order. Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART
symposium on Principles of database systems. pp. 1-16.
21. R Project (n.d.) Available: https://www.r-project.org/. Accessed: 3 August 2015.
22. In-database R coming to SQL Server 2016 (n.d.) Available:
http://blog.revolutionanalytics.com/2015/05/r-in-sql-server.html. Accessed: 3 August 2015.
23. Marcus DS, Wang TH, Parker J, Csernansky JG, Morris JC, Buckner RL (2007) Open Access
Series of Imaging Studies (OASIS): cross-sectional MRI data in young, middle aged,
nondemented, and demented older adults. Journal of Cognitive Neuroscience. Vol. 19(9) pp.
1498-507.
24. Baker EJ, Galloway L, Jackson B, Schmoyer D, Snoddy J (2004) MuTrack: a genome analysis
system for large-scale mutagenesis in the mouse. BMC Bioinformatics. Vol. 5.
25. Gardner D et al. (2008) The Neuroscience Information Framework: A Data and Knowledge
Environment for Neuroscience. PMC Neuroinformatics. Vol. 6. pp. 149-160.
26. Ruttenberg A, Rees JA, Samwald M, Marshall MS (2009) Life sciences on the Semantic Web:
the Neurocommons and beyond. Briefings in Bioinformatics. Vol. 10. pp. 193-204.
27. Colland F, Jacq X, Trouplin V, Mougin C, Groizeleau C, Hamburger A, Meil A, Wojcik J,
Legrain P, Gauthier JM (2004) Functional Proteomics Mapping of a Human Signaling Pathway.
Vol. 14. pp. 1324-1332.
28. Shoemaker BA, Zhang D, Tyagi M, Thangudu RR, Fong JH, Marchler-Bauer A, Bryant SH,
Madej T, Panchenko AR (2012) IBIS (Inferred Biomolecular Interaction Server) reports, predicts
and integrates multiple types of conserved interactions for proteins. Nucleic Acids Research.
Vol. 40. pp. 834-840.
29. Horai et al. (2010) MassBank: a public repository for sharing mass spectral data for life sciences.
Journal of Mass Spectrometry. Vol. 45. pp. 703-714.
30. Maddatu et al. (2012) Mouse Phenome Database (MPD). Nucleic Acids Research. Vol. 40. pp.
887-894.
31. Matuszek G, Talebizadeh Z (2009) Autism Genetic Database (AGD): a comprehensive database
including autism susceptibility gene-CNVs integrated with known noncoding RNAs and fragile
sites. BMC Medical Genetics. Vol. 10.
32. Goodman N et al. (2003) Plans for HDBase—a research community website for Huntington's
Disease. Clinical Neuroscience Research. Vol. 3. pp. 197-217.

33. Alzheimer's Disease Neuroimaging Initiative (n.d.) Available: http://www.adni-info.org/.
Accessed 31 August 2015.
34. Parkinson’s Progression Markers Initiative (n.d.) Available: http://www.ppmi-info.org.
Accessed: 7 September 2015.
35. Ascoli GA, Donohue DE, Halavi M (2007) NeuroMorpho.Org: A Central Resource for Neuronal
Morphologies. Journal of Neuroscience. Vol. 27. pp. 9247-9251.
36. Brain Imaging Data Structure (n.d.) Available: http://bids.neuroimaging.io/. Accessed: 4 August
2015.
37. OpenFMRI (n.d.) Available: https://openfmri.org/. Accessed: 4 August 2015.
38. Gray J, Liu DT, Nieto-santisteban M, Szalay AS, Dewitt D (2005) Scientific Data Management
in the Coming Decade. ACM SIGMOD Rec.
39. Hassona U, Skipper JI, Wildec MJ, Nusbaumb HC, Smalla SL (2008) Improving the analysis,
storage and sharing of neuroimaging data using relational databases and distributed computing.
NeuroImage. Vol. 39, Issue 2, pp. 693–706
40. ISO/IEC 9075 Management of External Data SQL/MED (n.d.) Available:
http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=38643.
Accessed: 4 August 2015.
41. Kersten M. (2011) SciQL, a query language for science applications. Proceedings of the
EDBT/ICDT 2011 Workshop on Array Databases. pp. 1-12.
42. Vitter J.S. (1985) Random Sampling with a Reservoir. ACM Transactions on Mathematical
Software. Vol. 11. pp. 37-57.
43. Werner Ch. (n.d.) SQLite ODBC Driver. Available: http://www.ch-werner.de/sqliteodbc/.
Accessed: 24 May 2015.
44. Ivanova M. Kersten M. Manegold S. (n.d.) Data Vaults: a Symbiosis between Database
Technology and Scientific File Repositories. Available:
http://oai.cwi.nl/oai/asset/19922/19922B.pdf. Accessed 31 August 2015.
45. Oracle Berkley DB SQL API (n.d.) Available:
http://www.oracle.com/technetwork/products/berkeleydb/overview/sql-160887.html. Accessed:
26 September 2015.
46. Oracle BerkeleyDB (n.d.) Available: http://www.oracle.com/us/products/database/berkeley-
db/index.html. Accessed: 26 September 2015.
47. Redis (n.d.) Available: http://redis.io. Accessed: 26 September 2015.
48. Riak (n.d.) Available: http://basho.com/products/#riak. Accessed: 26 September 2015.
49. Infinispan (n.d.) Available: http://infinispan.org/. Accessed: 26 September 2015.
50. OrientDB (n.d.) Available: http://orientdb.com/. Accessed: 26 September 2015.
51. SQL-like language holds promise of NoSQL database interoperability (n.d.) Available:
http://www.couchbase.com/press-releases/unql-query-language. Accessed: 26 September 2015.
52. Apache CouchDB (n.d.) Available: http://couchdb.apache.org/. Accessed: 26 September 2015.
53. MongoDB (n.d.) Available: https://www.mongodb.org/. Accessed: 26 September 2015.
54. Clusterpoint (n.d.) Available: http://www.clusterpoint.com/. Accessed: 26 September 2015.
55. Lotus Notes (n.d.) Available: http://www-03.ibm.com/software/products/en/ibmnotes. Accessed:
26 September 2015.
56. Google Bigtable (n.d.) Available: https://cloud.google.com/bigtable/. Accessed: 26 September
2015.
57. Chang F. (2006) Bigtable: A distributed storage system for structured data. ACM Transactions
on Computer Systems (TOCS). Vol. 26(2). pp. 4.
58. Apache Cassandra (n.d.) Available: http://cassandra.apache.org/. Accessed: 26 September 2015.
59. Druid (n.d.) Available: http://druid.io/. Accessed: 26 September 2015.
60. Apache HBase (n.d.) Available: http://hbase.apache.org/. Accessed: 26 September 2015.
61. Neo4j (n.d.) Available: http://neo4j.com/. Accessed: 26 September 2015.
62. HyperGraphDB (n.d.) Available: http://www.hypergraphdb.org/. Accessed: 26 September 2015.
63. AllegroGraph (n.d.) Available: http://www.franz.com/agraph/allegrograph. Accessed: 26
September 2015.
64. InfiniteGraph (n.d.) Available: http://www.infinitegraph.com. Accessed: 26 September 2015.
65. Writing and Querying MapReduce Views in CouchDB (n.d.) Available:
http://oreilly.com/catalog/0636920018247. Accessed: 26 September 2015.
66. Couchbase (n.d.) Available: http://www.couchbase.com/. Accessed: 26 September 2015.
67. SQLite Website (n.d.). Available: http://www.sqlite.org/. Accessed 2 July 2014.
68. UnQL for Node and CouchDB (n.d.) Available: https://github.com/sgentle/unql-node. Accessed:
26 September 2015.
69. MariaDB (n.d.) Available: https://mariadb.org/. Accessed: 27 September 2015.
70. MySQL Website (n.d.). Available: http://www.mysql.com/. Accessed 2 July 2014.
71. Cassandra Storage Engine Overview (n.d.) Available:
https://mariadb.com/kb/en/mariadb/cassandra-storage-engine-overview/. Accessed: 27
September 2015.
72. DataStax (n.d.) Available: http://www.datastax.com/. Accessed: 27 September 2015.
73. Apache Spark (n.d.) Available: http://spark.apache.org/. Accessed: 27 September 2015.
74. DataStax Enterprise - Joining tables with Apache Spark (n.d.)
https://academy.datastax.com/fr/demos/datastax-enterprise-joining-tables-apache-spark.
Accessed: 27 September 2015.
75. Gremlin (n.d.) Available: https://github.com/tinkerpop/gremlin/wiki. Accessed: 27 September
2015.

76. Apache Lucene (n.d.) Available: https://lucene.apache.org/core/. Accessed: 27 September 2015.
77. Cannon RC (1998) An on-line archive of reconstructed hippocampal neurons. Journal of
neuroscience methods. Vol. 84(1). pp. 49-54.
78. Knowledge Base of Relational and NoSQL Database Management Systems (n.d.) Available:
http://db-engines.com/. Accessed: 28 September 2015.
79. Flajolet P (2007) HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm.
AOFA ’07: Proceedings of the 2007 International Conference on the Analysis of Algorithms.
80. Simple Dynamic Strings library for C (n.d.) Available: https://github.com/antirez/sds. Accessed:
28 September 2015.
81. Pugh W (1990) Skip lists: A probabilistic alternative to balanced trees. Communications of the
ACM. Vol. 33(6). pp. 668.
82. Thredis (n.d.) Available: https://github.com/grisha/thredis. Accessed: 28 September 2015.
83. Lua (n.d.) Available: http://www.lua.org/. Accessed: 28 September 2015.
84. Boost Website (n.d.). Available: http://www.boost.org. Accessed 2 July 2014.
85. Qt Project Website (n.d.). Available: http://qt-project.org/. Accessed 2 July 2014.
86. Alagiannis I, Borovica R, Branco M, Idreos S, Ailamaki A (2012) NoDB: efficient query
execution on raw data files. Proc 2012 ACM SIGMOD Int Conf Manag Data: 241–252.
doi:10.1145/2213836.2213864.
87. Agrawal S, Chaudhuri S, Narasayya V (2000) Automated Selection of Materialized Views and
Indexes in SQL Databases. VLDB.
88. Agrawal S, Narasayya V, Yang B (2004) Integrating vertical and horizontal partitioning into
automated physical database design. Proceedings of the 2004 ACM SIGMOD international
conference on Management of data - SIGMOD ’04. p. 359. doi:10.1145/1007568.1007609.
89. Agrawal S, Chaudhuri S, Kollar L, Marathe A, Narasayya V, et al. (2005) Database tuning
advisor for microsoft SQL server 2005: demo. Proceedings of the 2005 ACM SIGMOD
international conference on Management of data SE - SIGMOD ’05. pp. 930–932.
doi:10.1145/1066157.1066292.
90. Bruno N, Chaudhuri S (2005) Automatic physical database tuning: a relaxation-based approach.
Proceedings of the 2005 ACM SIGMOD international conference on Management of data. pp.
227–238. doi:10.1145/1066157.1066184.
91. Chaudhuri S, Narasayya VR (1997) An Efficient Cost-Driven Index Selection Tool for Microsoft
SQL Server. Proceedings of the 23rd International Conference on Very Large Data Bases. pp.
146–155.
92. Dageville B, Das D, Dias K, Yagoub K, Zait M, et al. (2004) Automatic SQL tuning in oracle
10g. VLDB ’04: Proceedings of the Thirtieth international conference on Very large data bases -
Volume 30 SE - VLDB '04. pp. 1098–1109.

93. Dash D, Polyzotis N, Ailamaki A (2011) CoPhy: a scalable, portable, and interactive index
advisor for large workloads. Proc VLDB Endow: 362–372.
94. Papadomanolakis S, Ailamaki A (2004) AutoPart: automating schema design for large scientific
databases using data partitioning. Proceedings 16th Int Conf Sci Stat Database Manag 2004.
doi:10.1109/SSDM.2004.1311234.
95. Valentin G, Zuliani M, Zilio DC, Lohman G, Skelley A (2000) DB2 advisor: an optimizer smart
enough to recommend its own indexes. Proc 16th Int Conf Data Eng (Cat No00CB37073).
doi:10.1109/ICDE.2000.839397.
96. Zilio DC, Rao J, Lightstone S, Lohman G, Storm A, et al. (2004) DB2 design advisor: integrated
automatic physical database design. VLDB ’04. pp. 1087–1097.
97. Bruno N, Chaudhuri S (2006) To tune or not to tune?: a lightweight physical design alerter.
VLDB ’06: Proceedings of the 32nd international conference on Very large data bases. pp. 499–
510.
98. Schnaitter K, Abiteboul S, Milo T, Polyzotis N (2006) COLT: continuous on-line tuning.
SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD international conference on
Management of data. pp. 793–795. doi:10.1145/1142473.1142592.
99. Graefe G, Kuno H (2010) Adaptive indexing for relational keys. Proceedings - International
Conference on Data Engineering. pp. 69–74. doi:10.1109/ICDEW.2010.5452743.
100. Graefe G, Kuno H (2010) Self-selecting, self-tuning, incrementally optimized indexes.
Proceedings of the 13th International Conference on Extending Database Technology - EDBT
’10. p. 371. doi:10.1145/1739041.1739087.
101. Graefe G, Idreos S, Kuno H, Manegold S (2011) Benchmarking adaptive indexing. Lecture
Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics). Vol. 6417 LNCS. pp. 169–184. doi:10.1007/978-3-642-18206-
8_13.
102. Idreos S, Kersten ML, Manegold S (2007) Database Cracking. Proc CIDR 369: 68–78.
doi:10.1002/per.
103. Idreos S, Kersten ML, Manegold S (2009) Self-organizing Tuple Reconstruction in Column-
stores. SIGMOD ’09: 297–308. doi:10.1145/1559845.1559878.
104. Idreos S, Manegold S, Kuno H, Graefe G (2011) Merging what’s cracked, cracking what's
merged: adaptive indexing in main-memory column-stores. Proc VLDB.
105. Idreos S, Kersten ML, Manegold S (2007) Updating a cracked database. Proc 2007 ACM
SIGMOD Int Conf Manag data - SIGMOD ’07: 413. doi:10.1145/1247480.1247527.
106. Jain A, Doan A, Gravano L (2008) Optimizing SQL queries over text databases. Proceedings -
International Conference on Data Engineering. pp. 636–645. doi:10.1109/ICDE.2008.4497472.
107. Ailamaki A, Kantere V, Dash D (2010) Managing scientific data. Commun ACM 53: 68.
doi:10.1145/1743546.1743568.
108. Cohen J, Dolan B, Dunlap M, Hellerstein JM, Welton C (2009) MAD Skills : New Analysis
Practices for Big Data. Proc VLDB Endow 2: 1481–1492.
109. Gray J, Liu DT, Nieto-santisteban M, Szalay AS, Dewitt D, et al. (2005) Scientific Data
Management in the Coming Decade. ACM SIGMOD Rec.
110. Idreos S, Alagiannis I, Johnson R, Ailamaki A (2011) Here are my Data Files. Here are my
Queries. Where are my Results? 5th Biennial Conference on Innovative Data Systems Research
CIDR’11. pp. 1–12.
111. Kersten M, Idreos S, Manegold S, Liarou E (2011) The researcher’s guide to the data deluge:
Querying a scientific database in just a few seconds. Proc VLDB Endow 4: 1474–1477.
112. K Lorincz, K Redwine JT (2003) Grep versus FlatSQL versus MySQL. Available:
http://www.eecs.harvard.edu/~konrad/projects/flatsqlmysql/final_paper.pdf. Accessed 2 July
2014.
113. Stonebraker M (2009) Requirements for Science Data Bases and SciDB. 4th Biennial
Conference on Innovative Data Systems Research CIDR’09. pp. 173-184. doi:10.1.1.145.1567.
114. Tzourio-Mazoyer N (2002) Automated anatomical labeling of activations in SPM using a
macroscopic anatomical parcellation of the MNI MRI single-subject brain. Neuroimage. Vol. 15.
pp. 273-289.
115. Allen Human Brain Atlas Website (n.d.). Available: http://human.brain-map.org/. Accessed 2
July 2014.
116. PostgreSQL Website (n.d.). Available: http://www.postgresql.org/. Accessed 2 July 2014.
117. HSQLDB Website (n.d.). Available: http://hsqldb.org/. Accessed 2 July 2014.
118. H2 Database Website (n.d.). Available: http://www.h2database.com/. Accessed 2 July 2014.
119. Wormtable Website (n.d.). Available: https://github.com/wormtable/wormtable. Accessed 2 July
2014.
120. Teiid Website (n.d.). Available: http://teiid.jboss.org/. Accessed 2 July 2014.
121. Adaszewski S (2014) Mynodbcsv: lightweight zero-config database solution for handling very
large CSV files. PLoS ONE. Vol. 9(7).
122. Frackowiak R. Markram H. (2015) The future of human cerebral cartography: a novel approach.
Philosophical Transactions of the Royal Society of London B: Biological Sciences. Vol. 370
(1668).
123. Chang C (2001) LIBSVM: a library for support vector machines. Available:
http://www.csie.ntu.edu.tw/~cjlin/libsvm. Accessed 8 January 2012.
124. Three City Study (n.d.) Available: http://www.three-city-study.com/. Accessed 28 August 2015.
125. Comparison of relational database management systems (n.d.). Available:
http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems.
Accessed 4 April 2015.
126. Python bindings for Berkeley DB. Database limits. (n.d.). Available:
http://pybsddb.sourceforge.net/ref/am_misc/dbsizes.html. Accessed 4 April 2015.
127. The Brainmap Project (n.d.) Available: http://www.brainmap.org/. Accessed: 06 October 2015.
128. NeuronDB (n.d.) Available: https://senselab.med.yale.edu/neurondb/. Accessed: 06 October
2015.
129. CoCoMac - Macaque macro connectivity 2nd edition (n.d.) Available: http://cocomac.g-
node.org/main/index.php. Accessed: 06 October 2015.
130. PubMed - NCBI (n.d.) Available: http://www.ncbi.nlm.nih.gov/pubmed. Accessed: 06 October
2015.

Appendix A
The following listings were created by the author to orient himself in the architecture of existing
DBMS solutions with support for external data access. They do not contain information pertaining to
the novel solution presented here, but may be of value to investigators interested in the research area
of databases.

Listing 1. PostgreSQL code architecture summary

access (containing implementation of various built-in indices and the transam transactional
engine)
bootstrap (used for creating the initial template database)
catalog (for manipulating the system catalog data, e.g. database schemas)
commands (for handling parsed text commands, e.g. database creation, table alteration)
executor (the query executor responsible for carrying out provided execution plan nodes)
foreign (provides support for foreign data, servers and user mappings according to the
SQL/MED "SQL Management of External Data" standard)
libpq (client library for accessing the PostgreSQL database)
main (launch utility for starting a PostgreSQL instance, allows to specify data directory, etc.)
nodes (node manipulation functions, e.g. copying, textual representation, comparison, creation,
etc.)
optimizer (query optimization engine, consisting of further submodules):
plan (generates the output plan)
path (enumerates all possible ways to join tables)
prep (handles preprocessing steps for special cases)
util (houses utility routines)
geqo ("genetic optimization" planner which considers only a random subset of the join
space)
parser (parses SQL queries and generates query structures to be passed to optimizer and
executor)
port (holds portability code for different supported platforms, e.g. UNIX, Win32)
postmaster (the main PostgreSQL process responsible for launching the following services):
autovacuum (schedules starting autovacuum workers)
bgworker (background workers execution)
bgwriter (writing out dirty shared buffers)
checkpointer (automatic checkpoint creation)
pgarch (Write-Ahead log archiver)
pgstat (statistics collector)
startup (server launch and recovery operations)
syslogger (catches all stderr output and redirects it to a set of log files)
walwriter (Write-Ahead log writer)
regex (regular expressions implementation)
replication (pair of WAL sender/receiver daemons for replicating data in a cluster)
rewrite (rewriting rules management and execution engine)
snowball (word stemming engine)
storage (storage related modules):
buffer (Shared Buffers)
file (management of buffered files)
freespace (free space mapping for quick page allocation)
ipc (inter-process communications)
largeobject (large object application interface routines)
lmgr (locking manager)
page (data pages management)
smgr (storage managers framework, currently only "magnetic disk" manager)
tcop (interface to postgres backend, i.e. user-defined function calls, reading commands from
interactive shell or sockets, retrieving query output)
tsearch (text search engine)
utils (miscellaneous utility routines)

Listing 2. MySQL code architecture summary

BUILD (containing building instructions and scripts)
cmake (CMake build system scripts)
client (source code used in majority of client-side utilities as well as the mysql client)
storage (contains different storage engine directories):
myisam (MyISAM storage engine) - rt_index.cc (R-Tree index) - fulltext.h
(declarations for fulltext indices)
heap (HEAP storage engine)
innodb (InnoDB transactional storage engine) - row0purge.cc / trx0purge.cc (purge
operation handling, known as VACUUM in PostgreSQL) - dict0stats.cc (index
statistics routines)
ndb (ndb cluster engine)
bdb (Berkeley DB storage engine, 3rd party)
innobase (InnoBase storage engine, 3rd party)
csv (CSV storage engine)
mysys (MySQL system library implementing miscellaneous low-level routines)
sql (lexer, parser, handling programs for different storage engines and different SQL statement
routines, e.g. for INSERT, UPDATE, etc., as well as built-in and user-defined SQL functions
handling):
main.cc (contains entry point routine for mysqld)
sql_planner.cc (SQL query planner)
sql_optimizer.cc (plan optimizer)
sql_executor.cc (plan executor)
rpl_master.cc / rpl_slave.cc (replication handling)
spatial.cc (spatial data types)
sql_insert.cc, sql_update.cc, sql_select.cc, ... (respective SQL statement handlers)
sql_udf.cc (user-defined functions)
transaction.cc (transaction handling)
ha_ndb_index_stat.cc (NDB index statistics)
vio (Virtual I/O wrapper calls for different network protocols)
dbug, pstack, regex, strings, zlib, libevent (containing respective 3rd party open-source
libraries)
packaging (UNIX/Linux package building directory)
include (common header files)
cmd-line-utils (command line utilities)
extra, mysql-test, repl-tests, support-files, tests, tools (stand-alone utility and test programs)
libmysql (client library for accessing standalone MySQL server)
libmysqld (library for embedding MySQL server within an application)

Listing 3. Teiid code architecture summary

org.teiid.query.processor.relational
TextTableNode (handles text file processing, e.g. row reading, column extraction)
SelectNode (handles select query execution)
NestedTableJoinStrategy (variant of loop join strategy for nested tables)
JoinNode (handles execution of table join)
RelationalNode (handles relational query execution)
ProjectNode (handles projection of given model row according to a list of expressions
and column indices)
org.teiid.query.processor
BatchIterator (an interface for BatchProducer)
QueryProcessor (processing plan execution)
BatchCollector (retrieves all BatchProducer rows into a buffer)
ProcessorPlan (represents an abstract processing plan)
org.teiid.dqp.internal.process
RequestWorkItem (represents a task taking more than one processing task to complete)
DQPCore (implements the DQP processing core)
DQPWorkContext (stores DQP context information, e.g. username, virtual database
name, VDB version, request ID, client address for socket transport, security context
and domain, etc.)
DQPConfiguration (Distributed Query Processing configuration description, provides
information such as maximum number of concurrent plans, maximum number of
threads, maximum row fetch sizes, sizes of Large Object chunks, etc.)
org.teiid.transport
LocalServerConnection (represents connection to a local Teiid server)
org.teiid.jdbc
ResultSetImpl (client-side Teiid result set retrieval implementation)
BatchResults (client-side batched results retrieval implementation)
org.teiid.query.sql.lang
Select (SQL SELECT query representation)
TextTable (TEXTTABLE function call description)
org.teiid.query.sql.symbol
Expression (Type signifying SQL expression)
org.teiid.query.parser
QueryParser (Converts SQL query string into object representation)
org.teiid.query.resolver
QueryResolver (Resolves variable names to fully qualified names using metadata and
variable types to types defined in the metadata)
org.teiid.query.validator
Validator (verifies correctness of object representation of an SQL query string, e.g.
command types, presence of subqueries in GROUP BY clauses, lack of unique keys
when necessary, etc.)
org.teiid.query.optimizer.relational
RelationalPlanner (generator of a relational query execution plan, relying on WITH
clauses, cardinality estimation, JOIN conditions, sorting, etc.)
org.teiid.query.optimizer
QueryOptimizer (produces a data processing plan according to the specified query
plan, optimizing explicitly for (batched) updates, table modifications and SELECT
queries separately for queries involving XML or not, in the latter case the
RelationalPlanner is used)
org.teiid.adminapi.impl
ModelMetaData (stores model metadata information, e.g. type (physical, virtual,
function, other), source/path, schema)
org.teiid.runtime
EmbeddedConfiguration (a subclass of DQPConfiguration for scenario with embedded
server)
EmbeddedServer (minimal server for running in embedded configuration)
org.teiid.resource.spi
BasicConnectionFactory (factory class for Teiid's Common Client Interface client
connection implementation)
org.teiid.adminapi
Model.Type (model type enumeration)
org.teiid.translator
FileConnection (interface for accessing local file data)
org.teiid.translator.file
FileExecutionFactory (factory class for file-based command execution manager)
org.teiid.resource.spi
BasicConnectionFactory (base class for connection factory)
org.teiid.resource.adapter.file
FileConnectionImpl (local file access implementation)

Listing 4. HSQLDB code architecture summary

org.hsqldb.server
Server (HSQL protocol network server)
ServerConnection (client connection representation)
org.hsqldb.result
Result (communication unit between Connection, Server and Session; encapsulates all
types of requests, e.g. ALTER, SELECT, etc.; implements wire protocol for
transferring such requests over network)
ResultMetaData (metadata for a result set, e.g. column names)
org.hsqldb.navigator
RowSetNavigatorClient (class for transferring a subset of result rows to the client)
RowSetNavigator (navigation functionality for a list of rows)
org.hsqldb.persist
TextFileReader (line by line reader for text files, supporting quoted strings)
TextCache (handles read/write operations for text tables, acting as a buffer manager)
TextFileSettings (encapsulates text table settings)
LobManager (large objects manager)
RowStoreAVLDiskData (implementation of PersistentStore for text tables)
RAFile (random access file wrapper used with text tables)
org.hsqldb.rowio
RowInputText (handles reading separate fields from a text table row retrieved by TextCache)
RowOutputText (handles writing data fields to a text table row put into TextCache)
RowInputTextQuoted (RowInputText variant for quoted text)
RowOutputTextQuoted (RowOutputText variant for quoted text)
org.hsqldb
Database (root HSQL database class, encapsulates database settings, holds instances of
SchemaManager and LobManager)
SchemaManager (management of schema related objects)
Session (implementation of an SQL session - settings, query processing, holds
reference to StatementManager, transaction management)
DatabaseManager (handles initial connection attempts to a database, starting it if
necessary, used to share a single Database instance between multiple servers, used for
locking and notifications)
ParserDQL (parser for Data Query Language)
Statement (base class for compiled statement objects)
StatementDMQL (statement implementation for Data Manipulation Language and Data
Query Language)
StatementQuery (statement implementation for query expressions)
StatementManager (manages reuse of prepared statement objects within a Session)
QueryExpression (SQL query expression representation, i.e. QuerySpecification but
without functionalities of SELECT)
QuerySpecification (SQL query specification, holds information about grouping,
ordering, aggregate functions, etc.)
Expression (base expression class)
ExpressionColumn (column, variable, parameter access operations)
ExpressionTable (table conversion implementation)
ExpressionValue (value access operation)
ExpressionArithmetic (arithmetic expression operations, e.g. type conversions,
evaluation)
Table (encapsulates database table information and operations)
TableBase (base table implementation)
TextTable (extends Table class to provide a table object backed by a text file)
Row (base class for row implementation)
RowAVLDisk (row implementation for TEXT and CACHED tables)
org.hsqldb.jdbc
JDBCConnection (JDBC connection implementation for HSQLDB)
JDBCDriver (JDBC driver for HSQLDB)
JDBCResultSet (JDBC ResultSet implementation for HSQLDB)
JDBCStatement (JDBC Statement implementation for HSQLDB)
JDBCDataSource (JDBC DataSource implementation for HSQLDB)

Listing 5. H2 code architecture summary

org.h2.server
TCPServer (native H2 network protocol server implementation)
TCPServerThread (a thread for serving one client connection)
org.h2.schema
Schema (encapsulates SQL schema as created by the CREATE SCHEMA statement)
SchemaObject (represents any object stored in database schema)
SchemaObjectBase (base class for implementations of SchemaObject)
TriggerObject (SchemaObject implementation for triggers)
Sequence (SchemaObject implementation for sequences)
Constant (SchemaObject implementation for constants)
org.h2.message
Trace (common trace module for all HSQLDB components)
org.h2.command
Parser (SQL parser for converting a statement string into object representation)
Command (object representation of SQL statement)
CommandContainer (wrapper for prepared statement)
Prepared (prepared statement representation)
org.h2.command.dml
Query (represents simple or union SELECT query)
Select (represents simple SELECT query)
SelectUnion (represents union SELECT query)
SelectOrderBy (represents single ORDER BY term in a SELECT query)
TransactionCommand (transactional statement object)
Optimizer (implementation of optimal query execution plan search using brute force
and genetic testing of different table filters permutations)
org.h2.table
Table (base class for tables)
TableBase (base class for regular tables)
TableFilter (representation of a table used in a query, contains a list of conditions
resolvable using index as well as expression part which cannot)
TableView (virtual table defined by a CREATE VIEW statement)
RegularTable (a concrete implementation of regular table)
MetaTable (responsible for creation of database pseudo tables)
Column (represents column in a table)
org.h2.result
Row (row in a table)
RowList (a list of rows with disk buffering if necessary)
ResultColumn (result set column)
SimpleRow (a simple row)
SimpleRowValue (a simple row with only one column)
SearchRow (interface for rows stored in a table and partial rows stored in index)
org.h2.index
Cursor (a cursor interface for iterating over an index)
PageDataIndex (scan index implementation)
PageBTreeIndex (B-Tree index implementation)
SpatialTreeIndex (R-Tree implementation for spatial indexing)
org.h2.store
PageStore (paged file store implementation)
Page (representation of a page)
FileStore (abstraction of a random access file)
LobStorageFrontend / LobStorageBackend (LOB storage implementation)
org.h2.mvstore
MVStore (a persistent storage for maps, this class is used to store e.g. spatial tree
index)
MVRTreeMap (an R-Tree implementation for MVStore)
MVSpatialIndex (an index for MVRTreeMap)
org.h2.jdbc
JdbcConnection, JdbcStatement, JdbcResultSet, etc. (JDBC implementation for H2)
org.h2.tools
Csv (methods for reading and writing from/to CSV files)
SimpleRowSource (interface for classes creating rows on demand)
SimpleResultSet (basic result set implementation supporting SimpleRowSource)
org.h2.value
ValueResultSet ("result set" data type representation)

Listing 6. SQLite code architecture summary

src/main.c (contains implementations of main routines of the SQLite API, e.g. sqlite3_initialize, sqlite3_shutdown, sqlite3_open, sqlite3_close)
src/tokenize.c / complete.c (SQL tokenizer)
src/parse.y (SQL parser)
src/resolve.c (this module is responsible for walking an SQL syntax tree and resolving all
symbols used to specific tables and columns in a database)
src/prepare.c (implementation of prepared statements interface)
src/legacy.c (contains the sqlite3_exec routine - a wrapper for sqlite3_prepare_v2, sqlite3_step
and sqlite3_finalize)
src/vdbe.c (implementation of Virtual Database Engine for running bytecode of a prepared
statement)
src/vdbeapi.c (APIs for interacting with VDBE)
src/vdbeaux.c (code used for creation, initialization and destruction of a VDBE virtual
machine)
src/vdbeblob.c (blob handling within VDBE)
src/vdbemem.c (routines for manipulating sqlite_value a.k.a. Mem type which is SQLite's
implementation of variant data type)
src/vdbesort.c (implementation of multi-threaded out-of-core merge sort used for creating
indices or handling SELECT statements without useful index and no limit clause)
src/vdbetrace.c (responsible for embedding user-defined values into the output of sqlite3_trace
function)
src/vtab.c (helper functions for implementing virtual tables interface)
src/wal.c (implementation of write-ahead log)
src/walker.c (routines for visiting the parser tree of an SQL statement)
src/where.c (VDBE code generator for handling SQL WHERE statement, indices used for
lookups are selected here, therefore it's roughly equivalent to an optimizer module)
src/rowset.c (implementation of a row set - a collection of row ids)
src/pager.c / pcache.c / pcache1.c (implementation of a page caching system)
src/btree.c (a B-Tree disk-based database implementation)
src/hash.c (implementation of hashtables used in SQLite)
src/memjournal.c (in-memory rollback journal implementation)
src/journal.c (journal file implementation)
src/func.c (implementation of SQL functions)
src/date.c (implementation of date and time SQL functions)
src/fkey.c (foreign key support in SQL statements)
src/expr.c (expression analysis and code generation for VDBE)
src/delete.c (code generation for DELETE statements)
src/callback.c (access to user-defined functions and collations)
src/build.c (handling of CREATE, DROP, BEGIN TRANSACTION, COMMIT and
ROLLBACK statements)
src/auth.c (authorization callback support, authorization mechanisms must be provided by the
user)
src/attach.c (handling of ATTACH, DETACH commands)
src/analyze.c (handling of ANALYZE command)
src/alter.c (VDBE code generation for ALTER command)
src/update.c (handling of UPDATE statements)
src/trigger.c (implementation of triggers)
src/status.c (interface for querying SQLite runtime performance)
src/vacuum.c (implementation for the VACUUM command)
ext/fts1, ext/fts2, ext/fts3 (full text search implementations)
ext/rtree (R-Tree spatial index implementation)

Listing 7. MonetDB code architecture summary

clients/ (client-side code)
mapilib (native Monet API client library)
mapiclient (native Monet API client application)
odbc (ODBC driver)
perl, php, python2, python3, ruby (MAPI wrappers for respectively Perl, PHP, Python
2/3 and Ruby)
common/
options (command-line options parser)
stream (generic stream wrapper)
utils (miscellaneous cryptography, UUID, etc. related functions)
gdk/ (Goblin Database Kernel, the innermost library of MonetDB)
gdk_aggr.c (grouped aggregates using BATs)
gdk_align.c (BAT alignment routines for making certain joins faster)
gdk_bat.c (implementation of binary association tables - the core building block of
MonetDB)
gdk_batop.c (basic operations on the BATs, e.g. slicing, sorting, reversing)
gdk_bbp.c (management of directory of BATs, BBP maintains a hierarchy of
directories with a maximum of 64 BATs per node, responsible for loading/saving BATs
to disk, atomic commit/recovery protocol)
gdk_calc.c (arithmetic operations for all combinations of input types)
gdk_cross.c (BAT cross product routines)
gdk_delta.c (commit/undo implementation for BATs)
gdk_firstn.c (routines for retrieving OIDs for n smallest/largest values in a BAT)
gdk_group.c (grouping of one BAT according to another)
gdk_imprints.c (implementation of column imprints)
gdk_select.c (handling of SELECT statement on BATs)
gdk_join.c (handling different variants of JOIN statement on BATs)
gdk_logger.c (transaction support)
gdk_qsort.c (quicksort implementation)
gdk_ssort.c (timsort implementation)
gdk_search.c (accelerated search implementation)
gdk_setop.c (set operations)
gdk_tm.c (transaction manager implementation)
geom/ (Spatial extension for MonetDB)
lib/libgeom.c (simple geometric library, wrapper for GEOS)
monetdb5/geom.c (routines for handling spatial types in MonetDB SQL)
sql/40_geom.sql (SQL statements for registering geometric functions and types)
monetdb5/mal (implementation of MonetDB5 assembly language - MAL, the lowest level
textual interface to MonetDB5)
mal_parser.c (parser implementation for MAL textual representation)
mal_builder.c (utilities for constructing MAL program from MAL or higher level
language text)
mal_resolve.c (resolves operators according to definitions in the current module)
mal_linker.c (responsible for dynamically looking for and loading functions used in
MAL programs)
mal_profiler.c (MAL performance tracing engine)
mal_dataflow.c (out of order execution mechanism)
mal_factory.c (management of MAL execution contexts)
mal_scenario.c (abstraction for handling different high level languages using MAL, a
scenario consists of reader, parser, optimizer, scheduler and engine)
mal_session.c (MAL bootstrap code and MAL execution engine)
mal_recycle.c (MAL instruction and data recycler)
mal_runtime.c (runtime profiling of high level counters - bytes read/written, time used)
mal_resource.c (heuristic mechanism for postponing threads eligible for parallel
execution to avoid resource contention)
mal_sabaoth.c (cluster support, responsible for redistributing queries among multiple
mutually-aware MonetDB servers)
mal_type.c (mapping between BAT types and MAL types)
mal_module.c (module management routines, a module holds symbol definitions
visible to statements executed within the context of that module)
mal_listing.c (creates textual representation of a MAL program)
mal_interpreter.c (core execution engine for MAL)
mal_function.c (functions and polymorphic functions management)
mal_instruction.c (handling of MAL instructions and blocks)
mal_exception.c (exception handling utility routines)
mal_debugger.c (MAL debugger)
mal_authorize.c (authorization management between tables and their backing BATs)
mal_atom.c (management of MAL atom types)
mal.c (MAL context initialization and destruction)
monetdb5/modules/atoms/ (functions and operators for the different atom types)
batxml.c (XML document mapping functions, e.g. to/from string)
blob.c (blobs)
color.c (24-bit RGB color)
identifier.c (string identifier)
inet.c (IPv4 addresses)
json.c (JSON documents)
mcurl.c (calls to CURL file transfer library, uses string atoms as input/output)
mtime.c (functions for atoms: date, datetime, timestamp, timezone, zrule)
str.c (strings)
streams.c (IO streams)
url.c (URLs)
uuid.c (UUIDs)
xml.c (XML document atom handling functions, e.g. traversal)
monetdb5/modules/kernel/ (core execution functions for MAL)
aggr.c (grouped aggregates handling)
alarm.c (timers and timed interrupts handling)
algebra.c (most common algebraic BAT manipulation)
bat5.c (BAT updates, property management, basic I/O, persistence and storage)
batcolor.c (map operations for color string primitives)
batstr.c (map operations for atom string primitives)
logger.c (transaction logging)
mmath.c (math commands)
status.c (system state information access routines)
monetdb5/modules/mal/ (all fundamental functions and operators available for calling from
MAL are defined here)
monetdb5/optimizer/ (optimizer implementation)
opt_accumulators.c (reuses storage space of unused temporary variables)
opt_aliases.c (redundant aliases removal)
opt_centipede.c ()
opt_cluster.c (reduces size of hash tables by grouping columns on both sides of the
join)
opt_coercion.c (removes unnecessary coercions)
opt_commonTerms.c (common subexpression elimination)
opt_constants.c (one time evaluation of expressions using only constant values)
opt_costModel.c (results size estimation heuristic)
opt_dataflow.c (dependency determination and parallel execution of eligible side-effect
free instructions)
opt_deadcode.c (removes variables not used as arguments and side-effect free
instructions not used in the final result)
opt_emptySet.c (propagates empty set property to eliminate resulting noops)
opt_evaluate.c (constant expression evaluation)
opt_factorize.c (splits query code into two parts, one independent of and one dependent on the
input arguments; this way the independent code block can be cached and reused for other
queries)
opt_garbageCollector.c (garbage collection of temporary variables on return from
function call)
opt_generator.c (optimizer for generated series, substituting materialization of series
with another series where appropriate)
opt_groups.c ()
opt_inline.c (function inlining)
opt_joinpath.c (manages materialization of common join subpaths)
opt_json.c (replaces rendering results of affectedRows() and rsColumn() with JSON)
opt_macro.c (code expansion/contraction, i.e. macro/orcam transformation)
opt_mapreduce.c (experimental map-reduce transformation to be executed on the
systems in a cloud)
opt_matpack.c (unrolling of mat.pack() into an incremental sequence)
opt_mergetable.c (propagates the use of merge association table (MAT) within a
function without materializing a temporary BAT)
opt_mitosis.c (horizontal fragmentation for parallel execution)
opt_multiplex.c (optimizes applying scalar functions to elements in a container)
opt_octopus.c (another distributed processing optimization relying on mitosis and
mergetable optimizers and executing resulting parallelizable blocks in a cloud)
opt_pipes.c (optimizer pipeline management)
opt_prelude.c (definitions used across different optimizer modules)
opt_pushranges.c (deprecated)
opt_pushselect.c (pushes select operations down past projections)
opt_qep.c (deprecated)
opt_querylog.c (collect SQL query statistics)
opt_recycler.c (caching of all materialized intermediates as long as deemed useful -
decided by a heuristic)
opt_reduce.c (removes all MAL variables which do not contribute)
opt_remap.c (remapping function calls to their multiplex variant)
opt_remoteQueries.c (uses remote environment to execute instructions until first result
is locally needed)
opt_reorder.c (reorders side-effect-free basic blocks in order to minimize the lifetime of
variables and maximize parallel execution; in particular, it transforms a breadth-first
mergetable plan into a depth-first one)
opt_statistics.c (optimizer statistics for dynamic optimization and offline analysis)
opt_strengthReduction.c (performs strength reduction optimization by moving eligible
expressions outside the loop)
opt_support.c (support optimizer routines)
opt_wrapper.c (prepares environment for optimizer execution and removes the
optimizer call to avoid recursion)
optimizer.c (entry functions for calling the optimizer)
monetdb5/scheduler/ (scheduler implementation)
run_isolate.c (creates isolated copy of MAL block for further manipulations)
run_memo.c (memo structure based query execution - ordering of execution by lowest
cost and discarding redundant operations)
run_octopus.c (scheduler for the octopus optimization, responsible for delegating MAL
execution to remote sites)
run_pipeline.c ()
srvpool.c (server pool implementation for octopus/centipede)
sql/ (SQL language implementation for MonetDB)
common/ (common SQL backend routines)
scripts/ (SQL scripts for creating necessary tables and registering all the available
functions)
backends/monet5 (Monet5 SQL backend implementation)
bam/ (BAM/SAM DNA sequence alignment file format support)
datacell/ (MonetDB's DataCell continuous query streaming support)
generator/ (series generation)
gsl/ (GNU Scientific Library support)
LSST/ (spatial routines)
rest/ (REST interface for jsonstore HTTP daemon)
UDF/ (user-defined functions)
vaults/ (Data Vault module with support for FITS and MSEED file formats)
mal_backend.c ()
rel_bin.c (translation of relational to binary statements)
sql_bat2time.c (time conversion between BAT and SQL)
sql_cast.c (miscellaneous casts between BAT and SQL atoms)
sql_fround.c (fround() implementation for Windows)
sql_gencode.c (SQL to MAL code generation)
sql_optimizer.c (backend optimizer implementation, currently it relies on the
default SQL implementation)
sql_rdf.c (RDF data file shredding support)
sql_readline.c (word completion and command input mechanism for SQL)
sql_result.c (SQL result row construction routines)
sql_round.c (round() implementation)
sql_scenario.c (MAL session scenario implementation for SQL)
sql_statement.c (SQL statement building and traversal routines)
sql_statistics.c (SQL statistics module)
sql_user.c (SQL user and authorization implementation)
sql.c (core SQL support module)
server/ (implementation of SQL server)
sql_atom.c (SQL atom type routines)
sql_datetime.c (SQL date/time routines)
sql_decimal.c (SQL DECIMAL type routines)
sql_env.c (SQL environmental variables controlling the server's behavior)
sql_mvc.c (multi-version catalog implementation)
sql_parser.tab.c (SQL parser)
sql_privileges.c (SQL privileges implementation)
sql_qc.c (query cache implementation)
sql_scan.c (SQL scanner and tokenizer)
sql_semantic.c (semantic representation of SQL text)
sql_symbol.c (building blocks for sql_semantic.c)
rel_distribute.c (data replication in cluster environment)
rel_dump.c (utilities for printing SQL expressions)
rel_optimizer.c (relational optimizer)
rel_planner.c (relational scheduler)
rel_prop.c (properties manager)
rel_psm.c (persistent stored modules implementation)
rel_schema.c (schema handling)
rel_select.c (SELECT query implementation)
rel_semantic.c (distribution of different SQL commands to appropriate modules)
rel_sequence.c (sequences support)
rel_trans.c (transaction support)
rel_updates.c (INSERT/UPDATE commands implementation)
rel_xml.c (XML type handling)
storage/
bat/ (implementation of storage necessary for SQL using BATs)
sql_catalog.c (implementation of SQL catalog access)
store.c / store_*.c (implementation of SQL store for schemas, types, functions,
tables, sequences, dependencies, connections, triggers, columns, keys, indices,
objects)
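
The MAPI client library (clients/mapilib above) is the native entry point through which external applications, including the language wrappers listed at the top of this summary, send SQL text to a MonetDB server and retrieve result rows. The fragment below is a minimal sketch of this usage; the host, port, credentials and database name are illustrative assumptions, and error handling is reduced to printing the server's explanation.

/* Minimal MAPI client sketch (illustrative only); link against mapilib. */
#include <stdio.h>
#include <mapi.h>

int main(void){
  /* Connection parameters below are assumptions made for the example. */
  Mapi dbh = mapi_connect("localhost", 50000, "monetdb", "monetdb",
                          "sql", "demo");
  if( dbh==NULL || mapi_error(dbh) ){
    if( dbh ){ mapi_explain(dbh, stderr); mapi_destroy(dbh); }
    return 1;
  }

  /* Send one SQL statement and iterate over the result rows. */
  MapiHdl hdl = mapi_query(dbh, "SELECT name FROM sys.tables LIMIT 5");
  if( hdl==NULL || mapi_error(dbh) ){
    mapi_explain(dbh, stderr);
    mapi_destroy(dbh);
    return 1;
  }
  while( mapi_fetch_row(hdl) > 0 )
    printf("%s\n", mapi_fetch_field(hdl, 0));   /* first (and only) column */

  mapi_close_handle(hdl);
  mapi_destroy(dbh);
  return 0;
}

The language wrappers listed under clients/ expose the same connect/query/fetch pattern on top of this library.
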