Data Warehouse Requirements Engineering
A Decision Based Approach

Naveen Prakash · Deepika Prakash
Naveen Prakash, ICLC Ltd., New Delhi, India
Deepika Prakash, Central University of Rajasthan, Kishangarh, India
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd., part of Springer Nature.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
To
Our Family
Preface
That requirements engineering is part of the systems development life cycle, and is about the first activity to be carried out when building systems, is today considered basic knowledge in computer science/information technology. Requirements engineering produces requirements specifications that are carried through to system design and implementation. It is assumed that systems automate specific activities carried out in the real world. These activities are transactions, for example reservations, cancellations, buying, and selling. Thus, requirements engineering produces requirements specifications of transactional systems.
So long as systems were not very complex, preparing a requirements specification was feasible and did not compromise system delivery times. However, as systems became more and more complex, iterative and incremental development came to the fore. Producing a full requirements specification up front is now frowned upon; instead we produce, in the language of Scrum, user stories for small parts of the system.
About the time requirements engineering was developing, data warehousing also
became important. Data warehouse development faced the same challenges as
transactional systems do, namely determination of the requirements to be met and
the role of requirements engineering in the era of agile development. However, both
these issues have been taken up relatively recently.
Owing to this recent interest in the area, requirements engineering for data warehousing is relatively unknown. We fear that there is a widespread paucity of understanding of the nature of data warehouse requirements engineering, how it differs from traditional transaction-oriented requirements engineering, and what new issues it raises.
Perhaps, the role of agility in data warehouse development is even more crucial
than in transactional systems development. This is because of the inherent com-
plexity of data warehouse systems, long lead times to delivery, and the huge costs
involved in their development. Indeed, the notion of data marts and the bus approach to data warehouse development are early responses to these challenges.
This book is our attempt at providing exposure to the problem of data warehouse
requirements engineering. We hope that the book shall contribute to a wider
fragments than data marts and the problem of inconsistency and costs is even more
severe.
Given the severity of the problem, we do not consider it advisable to wait for the
problem to appear and then take corrective action by doing consolidation. It is best
to take a preventive approach that minimizes fragment proliferation. Again, keeping
in mind that for us requirements are the fulcrum for data warehouse development,
we consolidate requirements granules even as they are defined.
This book is a summary of research in the area of data warehouse requirements
engineering carried out by the authors. To be sure, this research is ongoing and we
expect to produce some more interesting results in the future. However, we believe
that we have reached a point where the results we have achieved form a coherent
whole from which the research and industrial community can benefit.
The initial three chapters of the book form the backdrop for the last three. We
devote Chap. 1 to the state of the art in transactional requirements engineering, whereas Chap. 2 does the same for data warehouse requirements engineering. The salient issues
in data warehouse requirements engineering addressed in this book are presented in
Chap. 3.
Chapter 4 deals with the different types of decisions and contains techniques for
their elicitation. Chapter 5 is devoted to information elicitation for decisions and the
basic notion of a requirements granule is formulated here. Chapter 6 deals with
agility built around the idea of the requirements granules and data warehouse
fragments. The approach to data warehouse consolidation is explained here.
The book can be used in two ways. For those readers interested in a broad-brush
understanding of the differences between transactional and data warehouse
requirements engineering, the first three chapters would suffice. However, for those
interested in deeper knowledge, the rest of the chapters would be of relevance as
well.
About the Authors
Naveen Prakash started his career with the Computer Group of Bhabha Atomic
Research Centre Mumbai in 1972. He obtained his doctoral degree from the Indian
Institute of Technology Delhi (IIT Delhi) in 1980. He subsequently worked at the
National Center for Software Development and Computing Techniques, Tata
Institute of Fundamental Research (NCSDCT, TIFR) before joining the R&D group
of CMC Ltd where he worked for over 10 years doing industrial R&D. In 1989, he
moved to academics. He worked at the Department of Computer Science and
Engineering, Indian Institute of Technology Kanpur (IIT Kanpur), and at the Delhi
Institute of Technology (DIT) (now Netaji Subhas Institute of Technology (NSIT)),
Delhi. During this period he provided consultancy services to Asian Development
Bank and African Development Bank projects in Sri Lanka and Tanzania,
respectively, as well as to the Indira Gandhi National Centre for the Arts (IGNCA)
as a United Nations Development Programme (UNDP) consultant. He served as a
scientific advisor to the British Council Division, New Delhi and took up the
directorship of various educational institutes in India. Post-retirement, he worked on
a World Bank project in Malawi.
Prof. Prakash has lectured extensively in various universities abroad. He is on
the editorial board of the Requirements Engineering Journal, and of the
International Journal of Information System Modeling and Design (IJISMD). He
has published over 70 research papers and authored two books.
Prof. Prakash continues to be an active researcher. Besides Business Intelligence
and Data Warehousing, his interests include the Internet-of-Things and NoSQL databases. He also lectures at the Indira Gandhi Delhi Technical University for
Women (IGDTUW), Delhi and IIIT Delhi.
Deepika Prakash obtained her Ph.D. from Delhi Technological University, Delhi
in the area of Data Warehouse Requirements Engineering. Currently, she is an
Assistant Professor at the Department of Big Data Analytics, Central University of
Rajasthan, Rajasthan.
Dr. Prakash has five years of teaching experience, as well as two years of
experience in industrial R&D, building data marts for purchase, sales and inventory
and in data mart integration. Her responsibilities in industry spanned the complete
life cycle, from requirements engineering through conceptual modeling to extract-
transform-load (ETL) activities.
As a researcher, she has authored a number of papers in international forums and
has delivered invited lectures at a number of institutes throughout India. Her current
research interests include Business Intelligence, Health Analytics, and the
Internet-of-Things.
Chapter 1
Requirements Engineering for Transactional Systems
and/or total rejection of software. The Standish group [4] reported that one of the
reasons for project failure is “incomplete requirements”. Clearly, the effect of
poorly engineered requirements ranges from outright systems rejection by the
customer to major reworking of the developed system.
The Software Hall of Shame [5] surveyed around 30 large software development
projects that failed between 1992 and 2005 to try to identify the causes of this
failure. It was found that failures arise either because projects went beyond actual needs or because the scope of the original project expanded. This implied that requirements changed over the course of product development and that this change was difficult to handle.
The foregoing suggested that new methods of software development were
needed that delivered on time, on budget, met their requirements, and were also
capable of handling changing requirements. The response was twofold:
• An emphasis on incremental and iterative product development rather than
one-shot development of the entire product. Small, carefully selected product
parts were developed and integrated with other parts as and when these latter
became available. As we shall see this took the form of agile software
development.
• The birth of the discipline of requirements engineering in which the earlier
informal methods were replaced by model-driven methods. This led to the
systematization of the requirements engineering process, computer-based
management of requirements, guidance in the requirements engineering task,
and so on.
We discuss these two responses in the rest of this chapter.
The System Development Life Cycle, SDLC, for transactional systems (TSDLC) starts from gathering system/software requirements and ends with the deployment of the system. One of the earliest TSDLC models is the waterfall model, which has six sequential phases. Each phase has different actors participating in it. The output of one phase forms the input to the next phase; this output is documented and used by the actors of the next phase. The documentation produced is voluminous and time-consuming to prepare.
Since the model is heavy on documentation, it is sometimes referred to as document driven. Table 1.1 shows the actors and the document produced for each phase of the life cycle.
The process starts with identifying what needs to be built. There are usually
several stakeholders of a system. Each stakeholder sits down with the requirements
engineer and details what s/he specifically expects from the system. These needs are
referred to as requirements. A more formal definition of the term requirements is
available in the subsequent sections of this chapter. These requirements as given
1.1 Transactional System Development Life Cycle
Being sequential in nature, a working model of the product is released only at the end of the life cycle. This leads to two problems. First, feedback can be obtained from the stakeholder only after the entire product is developed and delivered. Even slightly negative feedback means that the entire system has to be redeveloped; the considerable time and effort invested in delivering the product is wasted.
The second problem is that these systems suffer from long lead times for product delivery. This is because the entire requirements specification is prepared before the system is taken up for design and implementation.
An alternative method of system development is to adopt an agile development model. The aim of this model is to provide an iterative and incremental development framework for delivery of a product. An iteration is defined by clear deliverables identified by the stakeholder. Deliverables are pieces of the product usable by the stakeholder. Several iterations are performed to deliver the final product, making the development process incremental. Iterations are also time-boxed, with the time allocated to each iteration remaining almost the same until the final product is delivered.
One of the popular approaches to agile development is Scrum. In Scrum, iterations are referred to as sprints. There are two actors, the product owner and the developer. The product owner corresponds to the stakeholder of the waterfall model. Requirements are elicited in the form of user stories. A user story is a single sentence that identifies a need. User stories have three parts: “Who” identifies the stakeholder, “What” identifies the action, and “Why” identifies the reason behind the action. A good user story is actionable, meaning that the developer is able to use it to deliver the need at the end of the sprint.
Wake [6] introduced the INVEST test as a measure of how good a user story is. A good user story must meet the following criteria: Independent, Negotiable (not too specific), Valuable, Estimable, Small, and Testable. One major issue in building stories is determining when a story is “small”. Small is defined as a piece of work that can be delivered in a sprint. User stories as elicited from the product owner may not fit in a sprint. Scrum uses the epic–theme–user story decomposition approach to deal with this. Epics are stories identified by the product owner in the first conversation; they require several sprints to deliver. To decompose an epic, further interaction with the product owner is performed to yield themes. A theme by itself may still take several sprints, though fewer than its epic; therefore, a theme is further decomposed into user stories of the right size.
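The epic–theme–user story hierarchy and the “small enough for a sprint” test can be sketched as plain data structures. This is our own minimal illustration in Python, not part of Scrum or of this book's method: the class names, the day-based effort estimate, and the example story are invented for the sketch.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UserStory:
    # The three parts of a user story: Who, What, and Why.
    who: str
    what: str
    why: str
    estimate_days: int  # rough effort estimate; an illustrative device

    def text(self) -> str:
        return f"As a {self.who}, I want to {self.what} so that {self.why}."

    def fits_in_sprint(self, sprint_days: int) -> bool:
        # "Small" means deliverable within a single time-boxed sprint.
        return self.estimate_days <= sprint_days

@dataclass
class Theme:
    # A theme groups stories; it may itself still need several sprints.
    name: str
    stories: List[UserStory] = field(default_factory=list)

@dataclass
class Epic:
    # An epic from the first conversation with the product owner,
    # decomposed into themes, and themes into right-sized stories.
    name: str
    themes: List[Theme] = field(default_factory=list)

story = UserStory("frequent flyer", "cancel a reservation online",
                  "I need not phone the call centre", estimate_days=5)
epic = Epic("Online booking", [Theme("Reservations", [story])])
print(story.text())
print(story.fits_in_sprint(sprint_days=10))  # True: 5 days fit in a 10-day sprint
```

A story that fails the fit test would be decomposed further, mirroring the epic-to-theme-to-story refinement described above.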
When comparing the agile development model with the waterfall model, there are two major differences:
1. In Scrum, sprints do not wait for the full requirements specification to be produced. Further, the requirements behind a user story are also not fully specified but follow the 80–20 principle: 80% of the requirements need to be clarified before proceeding with a sprint, and the remaining 20% are discovered during the sprint. Thus, while in the waterfall model stakeholder involvement in the requirements engineering phase ends with a sign-off from the stakeholder, in Scrum the stakeholder is involved during the entire life cycle. In fact, iterations proceed with the feedback of the stakeholder.
Let us start with some basic definitions that tell us what requirements are and what
requirements engineering does.
Requirements
A requirement has been defined in a number of ways. Some definitions are as
follows.
Definition 1: A requirement as defined in [7] is “(1) a condition or capability needed by a user to solve a problem or achieve an objective, (2) a condition or capability that must be met or possessed by a system or system component to satisfy a contract, standard, specification or other formally imposed documents, (3) a documented representation of a condition or capability as in (1) or (2)”.
According to this definition, requirements arise from user, general organization,
standards, government bodies, etc. These requirements are then documented.
A requirement is considered a specific property of a product by Robertson and by Kotonya, as shown in Definitions 2 and 3 below.
Definition 2: “Something that the product must do or a quality that the product
must have” [8].
Definition 3: “A description of how the system shall behave, and information about
the application domain, constraints on operations, a system property etc.” [9].
Definition 4: “Requirements are high level abstractions of the services the system
shall provide and the constraints imposed on the system”.
Requirements have been classified as functional requirements, FR, and
non-functional requirements, NFR. Functional requirements are “statements about
what a system should do, how it should behave, what it should contain, or what
components it should have” and non-functional requirements are “statements of
quality, performance and environment issues with which the system should con-
form” [10]. Non-functional requirements are global qualities of a software system,
such as flexibility, maintainability, etc. [11].
Requirements Engineering
Requirements engineering, RE, is the process of obtaining and modeling require-
ments. Indeed, a number of definitions of RE exist in the literature.
Definition 1: Requirements engineering (RE) is defined [7] as “the systematic process of developing requirements through an iterative cooperative process of analyzing
[Fig. 1.1 The requirements engineering process: four stages (requirements elicitation, analysis and negotiation, specification and documentation, and verification and validation) with their actors (stakeholders, requirements engineer, facilitator, system analysts, domain experts, project manager), inputs to elicitation (users, literature, existing software), documentation means (formal specification languages, knowledge representation languages), and feedback loops between stages for new or conflicting requirements and for inconsistencies, missing requirements, and ambiguous requirements. A dotted line divides the process into the early and late RE phases.]
system must have and what is the goal of building the system to-be. This gives rise
to conflicts. In this step, an agreement between the various stakeholders on the
requirements of the system to-be is established with the help of a facilitator. Notice the red arrow in Fig. 1.1 from this step to the previous requirements elicitation step: it may happen that during resolution of the conflicts a new and different understanding emerges, of the system entirely or of a part of it. By going back to the requirements elicitation stage, new requirements are elicited.
“Specification and Documentation”: Once all conflicts are resolved, require-
ments are documented for use in subsequent stages of system development. As
shown in Fig. 1.1, the document may be in formal specification languages,
knowledge representation languages, etc. System analysts and domain experts are involved in this task. It is possible that during documentation new conflicting requirements are found. To accommodate this, there is a feedback loop, shown by the red arrow, from this stage to the “analysis/negotiation” stage. It may also happen that more information in the form of requirements is needed for the system to be built. For this, there is a loop (red arrow in the figure) back to the “requirements elicitation” stage.
“Verification and Validation” (V&V): The main goal here is to check if the
document meets the customers’/clients’ needs. The input into this stage is the
documented requirements. The project manager along with the stakeholder is
involved in this task. Consistency and completeness are some aspects of the doc-
ument that are checked. Once the requirements have been verified and validated, the
RE process is considered completed and other phases of the TSDLC described in
Sect. 1.1 are executed. However, if any inconsistencies, missing requirements or
ambiguous requirements are found, then the entire process is repeated.
There are two phases of RE, an early phase and a late phase. The dotted line running diagonally across Fig. 1.1 divides the RE process into these two phases. The early RE phase focuses on whether the interests of stakeholders are being addressed or compromised. Requirements elicitation and analysis/negotiation form the early RE phase. The late RE phase focuses on consistency, completeness, and verification of requirements [15].
(i) Interviews—Interviews are held between the requirements engineer and the stakeholder. Interviews are commonly acknowledged to be a stimulus–response interaction [16], based on some (usually unstated) assumptions [16]. The requirements engineer prepares a set of “relevant questions” in order to learn what the stakeholders want. It is assumed that the questions will be read without variation, will be interpreted in an unambiguous way, and will stimulate a valid response. Suchman and Jordan [17] argue that this validity is not assured. A further problem is that the requirements engineer may impose his or her own views through the questions asked.
(ii) Analyzing existing documentation—Documentation such as organizational charts, process models, standards, or existing manuals can be analyzed to gather requirements for systems that closely resemble, or are a replacement for, an old system. If the documentation is well done, it can form a rich source of requirements [10]. However, great caution has to be exercised while analyzing the documentation: a tendency to over-analyze existing documentation often leads to the new system being too constrained [18].
(iii) Questionnaires—This technique is generally used for problems that are fairly concrete and for understanding the external needs of a customer [18]. The method has several advantages: one can quickly collect information from large numbers of people, administer the questionnaire remotely, and collect attitudes, beliefs, and characteristics of the customer. However, the technique also has disadvantages [16]. The questionnaire can have simplistic (presupposed) categories providing very little context, and it limits the room for users to convey their real needs. One must also be careful while selecting the sample in order to prevent bias.
(iv) Group elicitation techniques—Focus groups are a kind of group interview [16]. They overcome the rigidity of the interviewing technique by exploiting the fact that a more natural interaction between people helps elicit richer needs [14]. The groups are generally formed ad hoc based on the agenda of the day and usually consist of stakeholders and requirements engineers. Group elicitation techniques are good for uncovering responses to products, and they overcome the disadvantages of interviews. However, they have not been found effective in uncovering design requirements. Two popular group approaches are Joint Application Development (JAD) and Rapid Application Development (RAD).
(v) Brainstorming [19]—Here, highly specialized groups consisting of actual users and middle- and/or top-level stakeholders brainstorm in order to elicit requirements. The process has two phases: idea generation and idea reduction. In the idea generation phase, as many ideas as possible are generated; these ideas may then be mutated or combined. The idea reduction phase involves pruning the ideas that are not worthy of further discussion and grouping similar ideas into one super topic. The ideas that survive the idea reduction phase are
The first part of the question is answered by Goal Analysis and the second part by Goal Evolution. In the former, goals, stakeholders, actors, and constraints are identified, giving a preliminary set of goals. Once validated by the stakeholders, this initial set can be refined.
It has been observed by Antón and Potts [29] that identifying the goals of a system is not an easy task. GORE is subjective, depending on the requirements engineer's view of the real world from which goals are identified [28]. Horkoff and Yu [30] also point out that such models are “informal and incomplete” and “difficult to precisely define”. Horkoff and Yu [31] observe that “goal modeling is not yet widely used in practice” [32] and note that constructs used in KAOS are not used in practice.
Agents have been treated in software engineering as autonomous units that can
change state and behavior. They can be humans, machine, or any other type. Agents
have the following properties [15, 22, 33]:
(i) Agents are intentional in that they have properties like goals, beliefs, abili-
ties, etc. associated with them. These goals are local to the agent. It is
important to note that there is no global intention that is captured.
(ii) Agents have autonomy. However, they can influence and constrain one another. This means that they are related to each other at the intentional level.
(iii) Agents are in a strategic relationship with each other. They are dependent on
each other and are also vulnerable w.r.t. other agents’ behavior.
Agents help in defining the rationale and intentions behind building the system. This enables asking and answering the “why” question. Agent-oriented RE focuses on early RE (see Fig. 1.1).
The central concept is that “goals belong to agents” rather than the concept in
GORE where “agents fulfil goals”. Notice that even though it is possible to have
goals without agents and agents without goals, goals and agents complement each
other.
We now discuss the i* framework, which was developed for modeling and reasoning about an organizational environment and its information system. The central concept of i* is the intentional actor. The framework has two main components, the Strategic Dependency Model (SDM) and the Strategic Rationale Model (SRM). Both early and late phase requirements can be captured through this framework.
The SDM component describes the actors in their organizational environments and captures the intentional dependencies between them. The freedom and the constraints of the actors are shown in terms of different dependency types: goal, task, softgoal, and resource dependencies. The SRM is at a much lower level of abstraction than the SDM. It captures the intentional relationships that are internal
Scenarios have been used in requirements engineering [34], particularly for eliciting, refining, and validating requirements, that is, in the late RE phase. Scenarios have also been used to support goals formulated in the early requirements phase: they show whether the system satisfies (fulfillment) or does not satisfy (non-fulfillment) a goal. In other words, scenarios “concretise” goals.
Holbrook [35] states that “Scenarios can be thought of as stories that illustrate
how a perceived system will satisfy a user’s needs.” This indicates that scenarios
describe the system from the viewpoint of the user. They have a temporal com-
ponent as seen in the definition given by van Lamsweerde and Willemet [36]: “a
scenario is a temporal sequence of interaction events between the software to-be
and its environment in the restricted context of achieving some implicit purpose(s)”.
Scenarios have also been defined with respect to agents. Plihon et al. [37] say that
scenario is “…possible behaviours limited to a subset of purposeful…communi-
cations taking place among two or several agents”.
A meta schema was proposed by Sutcliffe et al. [34] showing the relationship between goals, scenarios, and agents. A scenario is a single instance of a use case. Use cases are composed of actions that help in the fulfillment of goals; one use case fulfills one goal, and a single action “involves” one or more agents.
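The relationships of this meta schema can be sketched as plain data structures. The sketch below is our own hedged illustration in Python, not the notation of Sutcliffe et al.; the example use case, events, and agent names are invented.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    # A single action "involves" one or more agents.
    description: str
    agents: List[str]

@dataclass
class UseCase:
    # One use case fulfills one goal and is composed of actions.
    goal: str
    actions: List[Action] = field(default_factory=list)

@dataclass
class Scenario:
    # A scenario is a single instance of a use case: a temporal
    # sequence of interaction events.
    use_case: UseCase
    events: List[str] = field(default_factory=list)

uc = UseCase("Reservation is made", [
    Action("request seat", ["Customer"]),
    Action("confirm booking", ["Booking system", "Customer"]),
])
s = Scenario(uc, ["Customer requests seat", "System confirms booking"])
print(s.use_case.goal)  # Reservation is made
```

Walking from a scenario back to its use case and goal is exactly the "concretisation" link described above: the scenario's event sequence shows whether the goal is fulfilled.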
Several elicitation techniques exist, two of which are described below:
• SBRE [35]: There are two worlds, the users' world and the designers' world. The goal set is defined in the users' world. It contains information regarding the goals and constraints of the system; goals are refined into subgoals. The design set is in the designers' world and consists of design models that represent the system. The goal set and the design set communicate with each other through scenarios in the scenario set. This set shows how a specific design meets a goal. Scenarios have a one-to-one relationship with the design models, and a specific scenario may satisfy many goals. Any issue that arises is captured in the issue set. A feedback cycle captures the user's response to issues and designs. Scenarios form part of the specification of the required system.
1.5 Model-Driven Techniques
Proposals for goal–scenario coupling also exist in the literature [38–41]. The coupling can be unidirectional, from goals to scenarios, or bidirectional. Unidirectional coupling says that goals are realized by scenarios, revealing how goals can be achieved. Bidirectional coupling additionally considers going from scenarios to goals: scenarios can be sources of subgoals of the goal for which the scenario is written.
1.6 Conclusion
References
Chapter 2
Requirements Engineering for Data Warehousing

There are two perspectives to data warehousing, the organizational and the technological. From the organizational standpoint, data warehouse technology is for providing service to the organization: it provides Business Intelligence, BI. The
© Springer Nature Singapore Pte Ltd. 2018, https://doi.org/10.1007/978-981-10-7019-8_2
versatile way of querying the data warehouse is needed. In other words, we need a model of data different from the database model. This OLAP model enables data to be viewed and operated upon in ways that promote the analysis of business data.
A data warehouse provides a multidimensional view of data. Data is viewed in
terms of facts and dimensions, a fact being the basic data that is to be analyzed,
whereas dimensions are the various parameters along which facts are analyzed.
Both facts and dimensions have their own attributes. Thus, sales data expressed as
number of units sold or in revenue terms (rupees, dollars) is basic sales data that can
be analyzed by location, customer profile, and time. These latter are the dimensions.
The n-dimensions provide an n-dimensional space in which facts are placed.
A three-dimensional fact is thus represented as a cube (see Fig. 2.1). The X-, Y-, and
Z-axes of the cube represent the three dimensions, and the cells of the cube contain
facts. For our example, the three axes correspond to location, customer profile, and
time, respectively. Each cell in the cube contains sales data, i.e., units sold or
revenue. Of course, facts may have more than three dimensions; these form hypercubes, but in data warehouse terminology the words cube and hypercube are often used interchangeably.
It is possible for attributes of dimensions to be organized in a hierarchy. For
example, the attributes month, quarter, half year, and year of the dimension time
form a hierarchy. Monthly facts can be aggregated into quarterly, half-yearly, and
yearly facts, respectively. Such aggregations may be computed “on the fly” or, once
computed, may be physically materialized. In the reverse direction, one can obtain
finer grained information by moving from yearly facts to half-yearly, quarterly, and
monthly facts, respectively.
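The placement of facts in a dimension space, and the roll-up of monthly cells into quarterly ones, can be illustrated with a small sketch. Python is used here purely as an illustration language, and the sales figures and dimension values below are invented for the example.

```python
from collections import defaultdict

# A toy fact table: each fact sits at a point in the 3-dimensional
# space (location, customer_profile, month), with "units sold" as
# the measure. The data values are invented for illustration.
facts = {
    ("Delhi", "retail", "2024-01"): 120,
    ("Delhi", "retail", "2024-02"): 90,
    ("Delhi", "retail", "2024-03"): 150,
    ("Mumbai", "corporate", "2024-01"): 200,
    ("Mumbai", "corporate", "2024-04"): 80,
}

def month_to_quarter(month: str) -> str:
    # One level of the time hierarchy: month -> quarter.
    year, m = month.split("-")
    return f"{year}-Q{(int(m) - 1) // 3 + 1}"

def roll_up_to_quarter(facts):
    # Aggregate monthly cells into quarterly cells "on the fly".
    out = defaultdict(int)
    for (loc, profile, month), units in facts.items():
        out[(loc, profile, month_to_quarter(month))] += units
    return dict(out)

quarterly = roll_up_to_quarter(facts)
print(quarterly[("Delhi", "retail", "2024-Q1")])  # 360 = 120 + 90 + 150
```

Moving in the reverse direction, from the quarterly cell back to its three monthly cells, corresponds to obtaining the finer-grained information described above.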
The multidimensional structure comes with its own operations for Online
Analytical Processing, OLAP. These operations are as follows:
Data warehouses have been developed for the last 25 years or so. The broad
learning is that data warehouse projects are complex, time-consuming, and
expensive. They are also risky and have a high propensity to fail. Often they do not meet expectations and have a poor record of delivering the promised products. As a result, there are issues concerning data warehouse project management as well as the DWSDLC.
2.2 Data Warehouse Development Experience
A number of studies have found that data warehouse projects are expensive in
financial terms as well as in terms of the effort required to deliver them. In [2], we
find a number of specific indicators:
• One company hired the services of a well-qualified systems integrator. It cost two million USD, and after 3 years they had a single report but not a working system.
• Having bought a ready to use data model, another company found that it needed
2 years to customize the model to their needs and 3 years to populate it with
their data. Only after that would they get their first outputs.
• Yet another company required 150 people for over 3 years and the project got so
expensive as to hurt their share price.
Aside from being expensive, data warehouse projects are risky. Ericson [3] cites
a survey showing that data warehouse projects whose cost averaged above $12M
failed 65% of the time. Hayen et al. [4] refer to studies that indicate the typical cost
of a data warehouse project to be one million dollars in the very first year and that
one-half to two-thirds of most data warehouse projects fail.
Loshin [1] points out that data warehouse projects generate high expectations but
bring many disappointments. This is due to failure in the way that DW projects are
taken from conception through to implementation. This is corroborated by [2] who
concludes that the lessons learnt from data warehouse projects covered the entire
development life cycle:
• Requirements: The business had little idea of what they wanted because they
had never experienced a data warehouse. Further, requirements gathering should
have been done better.
• Design: Design took a very long time and conflicting definitions of data made
designs worthless.
• Coding: It went slowly, testing came under pressure, and crucial defects were not
caught.
The implications of the foregoing were deviations from initial cost estimates,
reduced number and value of the delivered features, and long delivery time.
The conclusion of Alshboul [5] is that one of the causes of data warehouse
project failure is inadequate determination of the relationship of the DW with
strategic business requirements.
From the foregoing, we observe in Table 2.1 that there are three main causes of
data warehouse project failure. The first is inadequate DW-business alignment. It is
necessary to ensure that the data warehouse brings “value” to the organization. This
is possible if the data warehouse delivers information relevant to making business
decisions.
The second issue is that of requirements gathering. In its early years, data
warehouse requirements engineering was de-emphasized. Indeed, requirements
were the last thing to be discovered [6]. However, considerable effort has been
invested over the last 15 years or so to systematize requirements engineering.
The activities that are to be performed in order to develop a data warehouse are laid
out in the DWSDLC. The manner in which these activities are performed is defined
in process models. For example, in the waterfall model, the activities comprising
the SDLC are performed in a linear manner: an activity is performed when the
previous one is completed. It is possible to do development in an iterative and
incremental manner in which case this strict linear ordering is not followed.
It is interesting to see the evolution of DWSDLC over the years. In its early
years, data warehouse development was largely concerned with implementation
issues. Therefore, upstream activities of conceptual design and requirements
engineering were de-emphasized. Data warehouse development started from anal-
ysis of data in existing databases. The nature of this analysis was largely experience
based and data warehouse developers used their expertise to determine the needed
facts and dimensions. This was viewed in [7], in terms of the requirements, con-
ceptual design, and construction stages of the DWSDLC. In the requirements stage,
facts and preliminary workload information are obtained starting from a database
schema. Subsequently, in conceptual design, the database schema, workload, and
facts are all used to obtain the dimensional schema.
Notice that there is no real conceptualization of the data warehouse as no con-
ceptual schema is built. The process of identifying dimensions is not apparent and
seems to rely on designer insight and understanding of the information requirements of
the organization.
Golfarelli et al. [10] developed a process that can be used for converting an ER
diagram into multidimensional form. This is partially automated and requires
developers to bring additional knowledge to decide on the final multidimensional
structure.
Moving to the conceptual design stage did present one major advantage over
database schema-based approaches. This move was based on the argument that the
process of discovery of data warehouse concepts should be rooted in an analysis of
the conceptual schema. Thus, it provided a foundation for obtaining facts,
dimensions, etc. It contributed to somewhat de-mystifying this process.
Conceptual schema/ER-driven techniques have been criticized on several
grounds:
• Limited data: If reverse engineered from operational databases, then the information carried by ER schemas is limited to that in the database schema. It is
difficult to identify external sources as well as other internal sources [7].
• Modeling deficiencies: ER schemas are not designed to model historical
information or aggregate information, both of which are central to
data warehousing.
• Ignoring the user: ER-based techniques do not give primary importance to the
users’ perspective [10, 11]. As a result, the DW designer ends up deciding on
the relevance of data, a decision that should be taken by the user rather than
the designer.
The Requirements Engineering Stage
The introduction of the requirements engineering stage in the DWSDLC addressed
the concerns raised in conceptual schema-driven techniques. A clear effort was
made to take into account needs of stakeholders in the data warehouse to-be. The
definition of multidimensional structures was based on understanding of business
goals, business services, and business processes. This understanding was gained by
interaction with decision-makers. Thus, the context of the data warehouse was
explored and the requirements of the data warehouse to-be were seen to originate in
the business. Since ab initio investigation into decision-maker needs is carried out
in the requirements engineering stage, existing data sources and/or their concep-
tualization in conceptual schemas did not impose any limitations. Rather, the
determined requirements could use data from existing data sources or come up with
completely new data not available in these sources.
With the introduction of the requirements engineering stage in DWSDLC, there
is today no difference between the stages of the TSDLC of transactional systems
and stages of the DWSDLC. However, the tasks carried out in these stages are
different. This is brought out in Table 2.2.
It is important to notice that the problem of data warehouse requirements
engineering, DWRE, is that of determining the information that shall be contained
in the data warehouse to-be. On the other hand, requirements engineering for
transactional systems, TRE, aims to identify the needed functionality of the
transactional system to-be.
Methods for developing data warehouses need to go through the stages of the
DWSDLC. There are two possibilities: to traverse the DWSDLC breadth-first or
depth-first. The breadth-first approach calls for the three stages in the DWSDLC to
be done sequentially, construction after conceptual design after requirements
engineering. The depth-first approach breaks down the task of data warehouse
development into small pieces or vertical slices, and the DWSDLC is followed for
each slice produced.
Breadth-first traversal can be done based on two different assumptions. The first
assumption is that the deliverable is the complete data warehouse. Hence,
requirements of the entire data warehouse must be identified; the multidimensional
model for the enterprise must be designed and then taken into implementation.
Thereafter, specific subject-oriented data marts are defined so as to make appro-
priate subsets of the data warehouse available to specific users. This is shown in
Fig. 2.3 where sales, purchase, and production data marts are built on top of the
enterprise-wide data warehouse. Defining data marts in this manner is analogous to
construction of subschemas on the schema of a database. The main idea in both is to
provide a limited view of the totality, limited to that which is relevant to specific
users.
This approach of constructing the monolithic data warehouse follows the
waterfall model; each stage of the DWSDLC must be completed before moving to
the next stage. This implies that lead time in delivering the project is very high.
There is danger that the requirements might change even as work is in progress. In
short, monolithic development is prone to all problems associated with the waterfall
model of development. However, the likely benefit from the waterfall model is that
it could produce a long-lasting and reliable data architecture.
A different process model results if the assumption of delivering the full data
warehouse is relaxed. Rather than building the entire monolithic data warehouse,
this approach calls for first building data marts and then integrating them by putting
them on a common bus. This bus consists of conformed dimensions, that is, dimensions
that are common across data marts, which therefore allow the drill-across operation to
be performed. Consequently, data held in different data marts can be retrieved. This
approach is shown in Fig. 2.4.
Data marts are built independently of one another. Since the size of a data mart is
smaller than that of the entire data warehouse, the lead time for release is shorter. Therefore,
business value can be provided even with the release of the first data mart. Freshly
built data marts can then be added on to the bus. Thus, the data warehouse consists
of a number of integrated, self-contained data marts rather than a big centralized
data warehouse. Evidently, the bus approach promotes iterative and incremental
development and no complete plan is required upfront. The risks are that data marts
may contain missing or incompatible measures and that dimensions may contain
replicated data and display inconsistent results.
The success of the bus architecture is crucially dependent on conforming facts
and dimensions. Thus, if one data mart contains product information in number of
cases shipped and another keeps product information as units sold, then moving
across these data marts yields incompatible information. Such facts must be conformed, for example by keeping unit data along with shipping data. This allows units
shipped to be compared with units sold. Dimensions need to be conformed too. If
one data mart has attributes day, quarter, and year for the dimension time and
another has day, month, and half-year, then drill across becomes difficult. The
dimension attributes must be made to conform and the lowest granularity attribute
kept in both the dimensions. The product information must also be available on a
daily basis in our example.
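As a sketch of the above, assume two hypothetical data marts that have been conformed to share the day-level time key; a drill-across query then joins them on that key so that units shipped and units sold become directly comparable:

```python
# Two hypothetical data marts whose facts are kept at the conformed
# day-level granularity of the time dimension.
shipping = {"2024-01-01": {"cases_shipped": 10, "units_shipped": 120},
            "2024-01-02": {"cases_shipped": 8, "units_shipped": 96}}
sales = {"2024-01-01": {"units_sold": 110},
         "2024-01-02": {"units_sold": 100}}

def drill_across(mart_a, mart_b):
    """Join two data marts on their shared (conformed) dimension key."""
    common = mart_a.keys() & mart_b.keys()
    return {day: {**mart_a[day], **mart_b[day]} for day in common}

combined = drill_across(shipping, sales)
# With the facts conformed to units, shipped and sold are comparable.
gap = {day: row["units_shipped"] - row["units_sold"]
       for day, row in combined.items()}
```

Had the shipping mart kept only cases and the time dimensions used different granularities, the join above would be impossible; conforming facts and dimensions is what makes it well defined.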
There are two possibilities for conforming dimensions. The first is to do it “on
the fly”, as each new data mart is added on to the bus. This may involve reworking
existing data marts to make them conform. This can be adopted so long as the effort
to bring about conformity is within limits and does not offset the benefits involved
in doing early release. When this boundary is crossed, then attention must be paid to
designing for conformity. This means that the bus of conformed dimensions must
be determined either all upfront in the waterfall model style or enough investigation
should be carried out to get a modicum of assurance that the bus is well defined.
The trade-off between the two is apparent, delayed release versus the risk of rework.
Iterative and incremental development that forms the basis for the bus architecture
is at the core of agile methods. Indeed, agility has been extended to data warehouse
development as well. However, agile methods for data warehousing, which we refer
to as DW-agile methods, differ from agile methods for transactional systems or
T-agile methods. We bring these out by considering two DW-agile methods.
Using Scrum and User Stories
In Hughes [2], we see the adoption in DW-agile methods of the notions of sprints and
user stories from T-agile methods. Recall that user stories are not complete requirements specifications but identify the needs with the details left to be discovered as
the sprint progresses. Defining user stories is an “art” and defining good stories
requires experienced story writers. Story writing follows the epic–theme–story
trajectory and the INVEST test is applied to test if a story is appropriately defined or
not. Over the years, in order for teams to better author user stories, agile
practitioners have devised a number of strategies and tools. Since a user story aims
to answer the “who,” “what,” and “why” of a product, a more detailed examination
of these components is suggested. Strategies like user role modeling, vision boxes,
and product boards have also been devised. Finally, Hughes also introduced the
T-agile roles of product owner, Scrum master, and the development team in
DW-agile methods.
The point of real departure in DW-agile methods is reached when defining
sprints for doing data integration. During this stage, data from disparate sources is
(a) brought together in a staging area, (b) integrated, (c) converted into dimensional
form, and (d) presented in dashboards. This is to be done for each fact and dimension
comprising the multidimensional schema. Thus, we get four sprints, one each for
(a) to (d): one sprint that does (a) for all facts and dimensions of the schema,
another that does (b), and so on, for all the four stages. If we now ask the question,
what is the value delivered by each sprint and to whom, then we do not get a
straightforward answer. Indeed, no business value is delivered at the end of sprints
for (a) to (c) to any stakeholder. The only role aware that progress in the task of
delivering the data warehouse is being made is the product owner but this role is not
the end user.
The Data Warehouse Business Intelligence, DWBI, reference data architecture
shown in Fig. 2.5 makes the foregoing clearer. This architecture separates DWBI
data and processes into two layers, back end and front end. Within the back-end
part, we have sub-layers for staging, integration, and the part of the presentation
sub-layer relevant to integration, whereas the front-end layer comprises the
presentation sub-layer's interfaces to the semantic sub-layer, the semantic sub-layer
itself, and the dashboard sub-layer.
Delivering a dashboard requires the preceding four layers to be delivered and
can be likened to delivery of four applications rolled into one. Delivering such a
large application is unlikely to be done in a single sprint of a few weeks in duration
and needs to be broken down into sub-deliverables.
To deal with this, Hughes introduces the idea of developer stories. A developer
story is linked to a user story and is expressed in a single sentence in the who–
what–why form of user stories. However, these stories have the product owner as
the end user and are defined by the development team. A developer story provides
value to the product owner and is a step in delivering business value to the
stakeholder. It defines a sprint. Developer stories must pass the DILBERT’S test:
2. Layered: Each developer story must show progress in only one layer of the
DWBI reference data architecture. This promotes independence of developer
stories (the I in DILBERT’S).
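The Layered rule lends itself to a simple mechanical check. The sketch below assumes a hypothetical representation of a developer story as a list of (task, layer) pairs over the sub-layers of the DWBI reference architecture:

```python
# Sub-layers of the DWBI reference architecture, back end to front end.
LAYERS = ["staging", "integration", "presentation", "semantic", "dashboard"]

def is_layered(story_tasks):
    """A developer story passes the Layered rule if all of its tasks
    show progress in exactly one layer of the reference architecture."""
    layers = {layer for _, layer in story_tasks}
    return len(layers) == 1 and layers <= set(LAYERS)

# A story confined to the staging layer passes; one that mixes staging
# and presentation work does not.
ok_story = [("load customer extract", "staging"),
            ("load product extract", "staging")]
bad_story = [("load extract", "staging"), ("build cube", "presentation")]
```

Confining each story to one layer is what keeps developer stories independent of one another, since a story then has no cross-layer dependencies.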
Introduction of developer stories requires a number of additional roles, other
than the three roles of product owner, Scrum master, and development team. These
are as follows:
• Project architect: This role is for conceptualizing the application and commu-
nicating it to both business stakeholders as well as to technical people. The job
involves relating source data to target data in a presentation layer and for-
mulating the major functions of the dashboards.
• Data architect ensures that the semantics of the data are clear, manages the data
models of the various layers, implements normalization, etc.
• Systems analyst: Starting from user stories, determines the transformations of
source data required to meet business needs. In doing so, the systems analyst
will need to look at developer stories to determine the transformations across the
multiple layers of the DWBI reference architecture. This role may also need to
work with the data architect to define any integrity constraints that must be
satisfied by the data before it is accepted into the next layer.
• Systems tester: To ascertain if the build is correct and complete. This is done at
the end of each day, at the end of each iteration, and when a release is issued.
In the DW-agile method considered above, the issue of how the conformed bus
is built is not addressed. Presumably, it is to be built “on the fly” since no provision
has been made in the method to build the bus.
Using Agile Data Modeling: Data Stories
Yet another approach to developing data warehouses in an agile manner is that of
Business Event Analysis and Modeling, BEAM* [12]. This method is based on
Agile Data Modeling. The argument behind using Agile Data Modeling is that
techniques like Scrum and user stories will improve BI application development but
only once the data warehouse is already in position. However, not much guidance is
available in such techniques for developing the data warehouse per se. Therefore,
we must move towards building the dimensional models in an agile manner. This is
where the role of Agile Data Modeling lies.
Agile Data Modeling [13] is for exploring data-oriented structures. It provides
for incremental, iterative, and collaborative data modeling. Incremental data
modeling refers to availability of more requirements when they are better under-
stood or become clear to the stakeholder. The additional requirements are obtained
“on the fly” when the developer needs them for completing the implementation task
at hand. Iterative data modeling emphasizes reworking to improve existing work.
As requirements become better understood and as need for changing data schemas
is felt, correcting errors, including missing information just discovered and other
such rework, referred to as refactoring in the data warehouse community, is carried
out. Collaborative data modeling calls for close interaction between the devel-
opers and stakeholders in obtaining and modeling data requirements. Thus, it
moves beyond merely eliciting and documenting data requirements with
stakeholder participation to also include stakeholder participation in the modeling
of data.
BEAM* uses the notion of data stories that are told by stakeholders to capture
data about business events that comprise business processes. These data stories
are answers to seven types of questions about events and each answer provides a
fact or dimension of the multidimensional schema. These questions, called 7W, are
(1) Who is involved in the event? (2) What did they do? To what is it done?
(3) When did it happen? (4) Where did it take place? (5) Why did it happen?
(6) How did it happen—in what manner? (7) How many or much was recorded—
how can it be measured? Out of these, the first six supply dimensions whereas the
last one supplies facts. As an example, the event, order delivered, can have three
“who”-type dimensions, namely, Customer, Product, Carrier and two “when”-type
dimensions, order date and shipment date.
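A minimal sketch of this classification, using the hypothetical "order delivered" story above; the first six W-types supply dimensions and the "how many" type supplies facts:

```python
# 7W answer types: the first six yield dimensions, the last yields facts.
DIMENSION_TYPES = {"who", "what", "when", "where", "why", "how"}
FACT_TYPE = "how many"

# A hypothetical data story for the event "order delivered".
story = [("who", "Customer"), ("who", "Product"), ("who", "Carrier"),
         ("when", "Order Date"), ("when", "Shipment Date"),
         ("how many", "Quantity Delivered")]

def to_schema(event, answers):
    """Split 7W answers into the dimensions and facts of the event."""
    dims = [a for t, a in answers if t in DIMENSION_TYPES]
    facts = [a for t, a in answers if t == FACT_TYPE]
    return {"event": event, "dimensions": dims, "facts": facts}

schema = to_schema("order delivered", story)
```

The "Quantity Delivered" answer is illustrative; any measurable "how many or much" answer of the stakeholders would end up as a fact in the same way.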
Even as facts and dimensions are being discovered, stakeholder–developer
interaction attempts to make them conform. The key issue is ensuring that identi-
fication of conformed facts and dimensions is done in an agile manner. To do this,
an event matrix is built. This matrix has business events as rows and dimensions as
columns. A special column, labeled Importance, contains a number showing the
importance of each event and, similarly, a special row labeled Importance contains
the importance of each dimension (Table 2.3).
When associating dimensions with events, the product owner initiates discus-
sions to make sure that there is agreement on the meaning of dimensions across the
different events to which these are applicable. As a result, conformed dimensions
are entered into the event matrix.
It is possible to follow the waterfall model and build the event matrix for all
events in all processes in the organization. However, agility is obtained when just
enough events have been identified so as to enable defining the next sprint. Further,
a prioritization of the backlog on the basis of the Importance value is done. Thus,
the event matrix is the backlog.
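The event matrix and its use as a prioritized backlog can be sketched as follows; the events, dimensions, and Importance values are invented for illustration:

```python
# Hypothetical event matrix: each row is a business event with the
# dimensions it uses and an Importance score.
event_matrix = {
    "order placed":     {"dimensions": {"Customer", "Product", "Time"},
                         "importance": 8},
    "order delivered":  {"dimensions": {"Customer", "Product", "Carrier",
                                        "Time"},
                         "importance": 9},
    "payment received": {"dimensions": {"Customer", "Time"},
                         "importance": 5},
}

def backlog(matrix):
    """Order events by descending Importance: the matrix is the backlog."""
    return sorted(matrix, key=lambda e: matrix[e]["importance"],
                  reverse=True)

def conformed(matrix):
    """Dimensions shared by every event are candidates for conforming."""
    dims = [row["dimensions"] for row in matrix.values()]
    return set.intersection(*dims)
```

In this sketch the most important event defines the next sprint, and the dimensions shared across events are those whose meaning the product owner must get agreed and entered into the matrix as conformed.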
Events have event stories associated with them. Since conformed dimensions
have already been identified, it is expected that event stories will be written using
these. An event is expressed as a table whose attributes are (a) those specific to the
event and (b) the conformed dimensions already obtained from the event matrix.
The table is filled in with event stories; each event story is a row of the event table.
An event table is filled in with several event stories so as to ensure that all stake-
holders agree on the meaning of each attribute in the event table. If there is no
agreement, then attributes that are homonyms have been discovered and separate
attributes for each meaning must be defined.
Reports that are desired by stakeholders are captured in report stories. These
stories are taken further to do data profiling and then on to development in a sprint.
This is the BI application aspect of data warehouse development.
As seen in the previous section, there are essentially two kinds of data marts as
follows:
• Dependent data marts: These are built from an already operational data ware-
house and so data for the data mart is extracted directly from the latter.
Therefore, such data marts have data which has already been integrated as part
of developing the data warehouse. Data in a dependent data mart will also, quite
naturally, be consistent with the data in the enterprise data warehouse. The
enterprise data warehouse represents the “single version of the truth” and these
data marts comply with this.
• Independent data marts: Developed independently from the enterprise data
warehouse, these are populated with data often directly from an application, an
OLTP database or operational data sources. Consequently, data is not integrated
and is likely to be inconsistent with the data warehouse. Independent data marts
are built by several different teams using technologies preferred by these teams.
Therefore, there is a proliferation of tools, software, hardware, and processes.
Clearly, the foregoing happens if conformity across data marts is handled on the
fly. Notice, however, that it can happen even if consolidation is designed for, as in
BEAM*, because post-design, data marts are developed independently and
separate teams work on the several data marts.
As already discussed, building independent data marts results in early delivery.
This mitigates two pressures that development teams are under: (a) meet
information needs early and (b) show that the financial investment made is providing
returns. As a result, there is great momentum behind building data marts as and
when needed, with minimal concern for the enterprise data warehouse. Since
departments gain early benefit from this, data mart proliferation has come to be
widely accepted. Further, since data marts are developed taking only departmental
requirements into account, they facilitate departmental control and better response
times to queries. However, the downside of this is that independent data marts lead
to the creation of departmental data silos [14–16]. That is, data needs of individual
departments are satisfied but the data is not integrated across all the departments.
This leads to data having inconsistent definitions, inconsistent collection and update
times, and difficult sharing and integration.
Data mart proliferation raises a number of issues as follows:
• A large number of data marts imply increased hardware and software costs as
well as higher support and maintenance costs.
• Each data mart has its own ETL process and so there are several such processes
in a business.
• Same data existing in a large number of data marts leads to redundancy and
inconsistency between data.
• There is no common data model. Multiple data definitions, differing update
cycles, and differing data sources abound. This leads to inconsistent/inaccurate
reports and analyses.
• Due to lack of consistency between similar data, it could happen that
decision-making is inaccurate or inconsistent.
Data mart proliferation can be a drain on company resources. Industry surveys
[14] show that the number of data marts maintained by 59% of companies is 30.
There are companies that maintain 100 or more data marts. Maintenance of a single
data mart can cost between $1.5 million and $2 million annually. Out of these
costs, 35–70% are redundant costs.
The foregoing implies that there is a tipping point beyond which independent
data mart proliferation becomes very expensive. Beyond this stage, an
enterprise-wide data warehouse supporting dependent data marts can meet demand
better. This is because such a data warehouse has enterprise scope, and therefore
(a) supports multiple work areas and applications across the business, and (b) has
consistent definitions of the data. The dependent data mart approach enables faster
delivery than building yet another independent data mart does because new
applications can leverage the data already in the data warehouse. It follows that at this
tipping point, consolidating the disparate data marts together starts to create value.
Data mart consolidation involves building a centralized enterprise data ware-
house (EDW). Data from multiple, disparate sources is centralized or consolidated
into a single EDW. Anyone in the organization authorized to access data in the
EDW will be able to do so. Thus, consolidation allows business to (a) retain
functional capabilities of the original sources, and at the same time (b) broaden the
business value of the data. Data mart consolidation provides benefits as follows:
• A centralized EDW results in common resources: hardware, software and
tools, processes, and personnel. This results in a significant reduction in cost
per data mart.
• Since it is easier to secure centralized data than data distributed across different
platforms in multiple locations, better information security can be provided.
This also aids in being compliant with regulatory norms.
• There is a “single version of the truth”, which enables better decision-making by
providing more relevant information. Enterprise managers, as different from
department managers, require data from all departments and this is made pos-
sible by data consolidation.
There are two factors in consolidation, the data warehouse implementation
platform and the data model. This yields four possible approaches to doing
consolidation:
1. Platform change but no change in data models: This addresses only the issues of
consolidating the platform. All existing data marts are brought to the same
platform. We get common procedures for backup, recovery, and security.
Proliferation of platforms and associated hardware/software costs is mitigated.
Further, the business gets cost savings in support and maintenance staff.
However, this is a mere re-hosting of existing data models and several data
models continue to exist though on a centralized platform.
This form of consolidation is relatively easy to carry through. The main effort
lies in redoing those procedures that might have used platform-specific features.
However, with this approach, multiple ETL processes continue to be needed and
there is no metadata integration.
2. No platform change but changed data model: This type of consolidation
integrates the data of the several data marts. As a result, problems of inconsistency, redundancy, missing data, etc. are removed. BI applications give better
results and costs in keeping redundant data are minimized. This approach
requires the construction of the bus of conformed dimensions. These may not have
been determined earlier, as is likely in the approach of using Scrum and user
stories, or consolidation may have been designed for, as in the approach of
BEAM*. Clearly, the former shall require more work than the latter.
To the extent that conformed dimensions are used, some standardization of
metadata does occur in this approach. However, non-conformed data continues
to have different metadata. There could be changes in schemas due to conformed
dimensions and these may require changes in the ETL processes. However, such
changes are minimal. Similarly, there may be changes in the code that produces
reports to take into account the changed schema.
Note, however, that due to diverse platforms, the cost savings of using a
common platform do not accrue. According to [16], organizations use this as a
first step in moving to consolidation as per approach (4) below.
3. No platform change, no change in data model: This leads to no consolidation
and can be discarded.
4. Changed platform and changed data model: In this case, we get benefits of both
a common platform and integrated data models. As mentioned earlier, this is the
culminating step in data mart consolidation and is usually preceded by following
approach (2) above.
There are two ways in which this kind of consolidation can be done. These are
as follows:
a. Consolidate by merging with primary: Two data marts, a primary and a
secondary data mart, are selected out of the several data marts that exist. The
secondary data mart is to be merged with the primary. As a first step, the
primary data mart is moved to the new platform. The secondary data mart is
then migrated to the new platform and conformed to the primary or, in other
words, conformed dimensions and facts are determined. Once merging is
completed, the secondary data mart can be discarded. This migration to the
new platform and integration with the primary is repeated for all remaining
data marts to yield the enterprise-wide data warehouse.
The “merge with primary” approach works well if the schema of the primary
does not have to undergo major changes in accommodating the independent
data marts. If this condition is not satisfied, then the approach considered
below is deployed.
b. Consolidate by doing a redesign: In this case, a fresh design is made keeping
in mind the common information across independent data marts. Existing
data marts are not used except to gain some understanding of the department
view of the business, thereby laying a basis for development of the
enterprise-wide data warehouse schema. Evidently, this approach can require
large effort and time before delivery.
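The merge-with-primary loop of option (a) can be sketched as follows, under the simplifying assumptions that a data mart is represented just by its set of dimension names and that conforming means renaming secondary dimensions to the primary's names; the marts and the mapping are hypothetical:

```python
# Hypothetical data marts: name -> set of dimension names.
marts = {"sales":   {"Customer", "Product", "Day"},
         "returns": {"Client", "Product", "Day"},
         "billing": {"Customer", "Day"}}

# Mapping of secondary dimension names onto the primary's conformed
# names, assumed to be worked out by the designers beforehand.
conform_map = {"Client": "Customer"}

def consolidate(primary, marts, conform_map):
    """Merge each secondary mart into the primary, conforming dimensions."""
    edw = set(marts[primary])        # primary moved to the new platform
    for name, dims in marts.items():
        if name == primary:
            continue
        # migrate the secondary, renaming its dimensions to conform;
        # once merged, the secondary mart can be discarded
        edw |= {conform_map.get(d, d) for d in dims}
    return edw

edw = consolidate("sales", marts, conform_map)
```

In the real setting each migration also moves the data itself, and the loop is worthwhile only while the primary's schema absorbs the secondaries without major change; otherwise option (b), a fresh design, applies.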
To sum up, the simplest form of data mart consolidation saves cost of software
and hardware infrastructure. More complex forms of consolidation can further help
by integrating data across the business.
1. Data quality: During consolidation, severe data quality issues emerge that need
to be resolved. Senior management needs to impose data standards and,
additionally, overcome any resistance to changing current practice.
2. Alignment between data warehouse plan and business plan: The vision of
the data warehouse in the business needs to be defined. If this is only a
short-term vision, then a lower budget will be allocated, early delivery shall be
required, and the independent data mart approach shall be adopted. If on the
other hand full organizational control is needed, then an enterprise-wide data
warehouse would be required. The strategy for this may be through building
data marts and then doing consolidation.
3. Flexibility in data warehouse planning: If there is likelihood that business
strategy changes even as data warehouse development is ongoing, then a change
in the business requirements of the data warehouse could be necessitated. In
such a situation, iterative development with short lead times to delivery may be
the answer.
4. Technical integration of the data warehouse: The business case for a data
warehouse must first be established and business needs determined before
opting for a particular technology. Selecting technology is based on its ability to
address business and user requirements. The inability of the organization to
absorb large amounts of new technology may lead to failure. Conversely,
deploying old technology may not produce the desired results. Similarly,
dumping huge quantities of new data in the lap of users may be as negative as
providing very little new data.
5. Business user satisfaction: End-user participation is essential so as to both
manage user expectations and satisfy their requirements. The selection of
appropriate users in the project team is crucial.
The data warehouse community has responded to the need for alignment in
several ways. One is the adoption of agile techniques. The agile manifesto consists of four statements:
• Individuals and interactions over processes and tools,
• Working software over comprehensive documentation,
• Customer collaboration over contract negotiation, and
• Responding to change over following a plan.
It can be seen that this manifesto addresses the five factors discussed above.
Agile development, as we have already seen, provides a broad developmental
approach to data warehouse development but does not provide techniques by which
the various stages shall be handled in a real project. To realize its full potential, it
relies on models, tools, and techniques in the area of requirements, design, and
construction engineering. Of interest to us, in this book, is requirements engi-
neering. All work in the area of data warehouse requirements engineering, DWRE,
is predicated upon close requirements engineer–stakeholder interaction.
2.7 Data Warehouse Requirements Engineering
The importance of requirements gathering was highlighted in Sect. 2.2. The area of
data warehouse requirements engineering, DWRE, aims to arrive at a clear
requirements specification on which both organizational stakeholders and the
development team agree. As already seen, this specification may be a complete
enterprise-wide specification if the DWSDLC is being followed breadth first, or it
may be partial if the DWSDLC is being sliced vertically.
The first question that arises is, “what is a data warehouse requirement?”
Notionally, this question can be answered in two ways:
(a) What shall the data warehouse do?
(b) What information shall the data warehouse provide?
Data warehouse technology does not directly address the first question. One
answer that is provided is that the data warehouse supports analysis of different
forms: analyze sales, analyze customer response, and so on. The second answer is that the
data warehouse can be queried, mined, and Online Analytical Processing (OLAP)
operations can be performed on it. It follows that a data warehouse per se does not provide
value to the business. Rather, value is obtained because of the improved
decision-making that results from the better information that is available in it. This
situation is different from that in transactional systems. These systems provide
functionality and can perform actions. Thus, a hotel reservation system can do room
bookings, cancelations, and the like. On the other hand, by providing capabilities to
query, mine, and do OLAP, a data warehouse can be used by decision-makers to
make decisions about what to do next. Therefore, data warehouse requirements
cannot be expressed in terms of the functionality they provide, because they are not
built to provide functionality. Asking what data warehouses do is the wrong
question to ask.
The second question is of relevance to data warehousing. If the information to be
kept in the data warehouse is known, then it is possible to structure it in multidi-
mensional form and thereafter, to query it, mine it, and do OLAP with it. Thus, the
data warehouse requirements engineering problem is that of determining the
information contents of the data warehouse to-be. Again, notice the difference with
transactional systems, where supplying information is not the priority and asking
what information to keep would be the wrong question.
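To make this concrete, consider a minimal sketch (the sales example and all names below are ours, purely illustrative): once the information content is known, it is structured as a fact table referencing dimension tables, after which OLAP-style roll-ups become possible.

```python
from collections import defaultdict

# Hypothetical dimension tables: descriptive context for the facts.
dim_product = {1: {"name": "Laptop", "category": "Electronics"},
               2: {"name": "Desk", "category": "Furniture"}}

# Fact rows reference dimensions by key and carry numeric measures.
fact_sales = [
    {"product_id": 1, "time_id": 10, "amount": 1200.0},
    {"product_id": 2, "time_id": 10, "amount": 300.0},
    {"product_id": 1, "time_id": 11, "amount": 800.0},
]

def rollup(facts, dim, key, attr, measure):
    """Aggregate a measure along one dimension attribute (a simple roll-up)."""
    totals = defaultdict(float)
    for row in facts:
        totals[dim[row[key]][attr]] += row[measure]
    return dict(totals)

print(rollup(fact_sales, dim_product, "product_id", "category", "amount"))
# {'Electronics': 2000.0, 'Furniture': 300.0}
```

The point of the sketch is only that, once the information (products, sales amounts) is determined, the multidimensional structuring and querying follow routinely; determining that information is the requirements engineering problem.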
Now, the information to be kept in the data warehouse cannot be determined in
isolation and requires a context within which it is relevant. Thus, information for a
human resource data mart is different from that of the finance data mart. Due to this,
requirements engineering techniques explore the context and then arrive at the
information that is to be kept in the data warehouse. There are several proposals for
exploring the context and determining information relevant to the context.
Broadly speaking, there are two approaches as shown in Fig. 2.6. On the left
side of the figure, we see that interest is in the immediate concern that motivates
obtaining information from the data warehouse. This may be a requirement for
analyzing sales, forecasting sales, or simply asking questions about sales. Once
these needs are identified, then it is a matter of eliciting the information that should
be kept in the data warehouse. The important point to note is that though the
immediate context may be derived from an organizational one, the latter is not
modeled and is only informally explored.
The second approach is shown on the right side of Fig. 2.6. Here, the organi-
zational context that raises the immediate concern is also of interest and is, con-
sequently, modeled. There is a clear representation of the organizational context
from which the immediate context can be derived. For example, forecasting sales
may be of interest because the organization is launching a variant of an existing
product. It may also be of interest to know trends of sales of existing products. The
organizational context then provides the rationale for the immediate context. It
provides a check that the immediate context is indeed relevant to the organization
and is not merely a fanciful analysis.
How many levels deep is the organizational context? We will show that there are
proposals for organizing data warehouse requirements engineering in multiple
levels of the organizational context.
Immediate Context
Hughes [2] makes a case for agile data warehouse engineering and builds user
stories that form the basis for subsequent data warehouse development. User stories
principally specify the analysis needs of decision-makers, for example, analyze
sales. This is determined by moving down the epic–theme–story levels of Scrum.
The technique is completely based on interviewing and deriving stories.
Paim and Castro [18] proposed the DWARF technique and used traditional
techniques like interviews and prototyping to elicit requirements.
Winter and Strauch [19] propose a cyclic process which maps the information
demand made by middle-level managers and knowledge workers with information
supplied in operational databases, reports, etc. They have an “initial” phase, an “as
is” phase, and a “to be” phase. In the first phase, they argue that since different users
can result in different data models, the dominant users must be identified. This helps
[Fig. 2.6 Two approaches to determining data warehouse information: from the immediate context alone (left) and from an immediate context derived from a modeled organizational context (right)]
target a specific business process. In the “as is” phase, an information map is
created by analyzing (a) existing information systems and (b) reports that the users
commonly use. According to the authors, analyzing the latter helps identify more
sources of information that one is not commonly aware of. In the “to be” phase,
information demand is elicited from the user by asking business questions. The
information supply and information demand are compared and inconsistencies
analyzed. Finally, information requirements are modeled using semantic models.
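In much simplified form, the comparison of information demand with information supply amounts to set operations; the item names below are illustrative and not taken from Winter and Strauch.

```python
# Hypothetical information map ("as is") and user demand ("to be").
information_supply = {"order date", "order amount", "customer id", "ship date"}
information_demand = {"order amount", "customer id", "customer segment", "return rate"}

satisfied = information_demand & information_supply  # demand already covered
gaps = information_demand - information_supply       # demand with no source yet
unused = information_supply - information_demand     # supplied but not asked for

print(sorted(gaps))  # ['customer segment', 'return rate']
```

The inconsistencies the method asks to analyze correspond to the gaps: demanded information for which no operational source has yet been identified.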
Organizational-Immediate Context
Traditionally, there are two major approaches: one is for setting the organizational
context with goal modeling and the other with business process modeling. There is
a third group of techniques that modify goal orientation by introducing additional
business-related concepts. We refer to these as goal-motivated techniques.
As we have already seen, there is much interest in RE for transactional systems
on goal-oriented [20, 21] and scenario-oriented techniques [22, 23]. These were
coupled together to yield the goal–scenario coupling technique [24, 25]. Goal
orientation uses means–ends analysis to reduce goals and the goal hierarchy
identifies the goals that are to be operationalized in the system. Notice the near
absence of the data/information aspect in goal orientation. Scenario orientation
reveals typical functionality and its variations by identifying typical interaction
between the system and the user. Even though example data is shown to flow across
the system–user interface, focus is not on the data aspect; data and its modeling are
largely ignored in scenario-oriented RE. Goal–scenario coupling allows develop-
ment of a scenario for a goal of the goal hierarchy. Consequently, variations of
goals are discovered in its scenario. Any new functionality indicated by the scenario
is then introduced in the goal hierarchy. Thus, a mutually cooperating system is
developed to better discover system goals. Again, notice that data is largely
ignored.
A number of proposals for goal-oriented data warehouse requirements engi-
neering, GODWRE, are available and all of these link goals with data, that is, all are
aimed at obtaining facts and dimensions of data warehouses from goals [7, 26–31].
We consider each of these in turn.
The second approach takes business processes as the basis for determining the
organizational context. An example of a business process is order processing.
Events that take place during a business process generate/capture data and a
business would like to analyze this data. Thus, data may be, for example, logs of
web service execution, application logs, event logs, resource utilization data,
financial data, etc. Interest is in analyzing the data to optimize processes, resource
allocation, load prediction and optimization, and exception understanding and
prevention.
When starting off from business processes, the several processes carried out in a
business are first prioritized and the process to be taken up next is selected.
Requirements of the business process are then obtained and taken into one or more
dimensional models.
The data resulting from events of business processes is essentially performance
metrics and can be mapped to facts of the multidimensional model, whereas
parameters of analysis become dimensions. Therefore, business intelligence can be
applied to this data.
There are also a number of hybrid approaches that follow from goal-oriented
approaches, one of which is to couple goals and processes. Others are for example
to couple goals with key performance indicators and to couple goals with decisions.
We refer to these as approaches that are motivated by goal modeling.
We consider these three types of DWRE techniques in the rest of this section.
(b) Variation factor: The relevant question to ask here is, “What factors can
influence quality focus?” Examples of such factors are customers, time, work
center, etc. Again, eliciting variation factors requires considerably skilled and
experienced requirements engineers who ask the right questions and understand
the responses.
(c) Baseline hypothesis: What are the values assigned to the quality focus of
interest? These are the typical queries that shall be asked when the warehouse
becomes operational, for example, average cost of activities of a certain type.
(d) Impact on baseline hypothesis: How do baseline hypotheses vary the quality focus?
These tell us the query results that the data warehouse will produce once it
becomes operational.
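The baseline hypothesis of (c) amounts to a query template. A minimal sketch, with invented activity data, of the "average cost of activities of a certain type" query:

```python
# Illustrative operational data; field names and values are assumptions.
activities = [
    {"type": "setup", "work_center": "A", "cost": 40.0},
    {"type": "setup", "work_center": "B", "cost": 60.0},
    {"type": "inspect", "work_center": "A", "cost": 10.0},
]

def average_cost(rows, activity_type):
    """The baseline query: average cost of activities of a given type."""
    costs = [r["cost"] for r in rows if r["type"] == activity_type]
    return sum(costs) / len(costs) if costs else None

print(average_cost(activities, "setup"))  # 50.0
```

The variation factors of (b), such as work center or time, would appear as additional grouping conditions on the same query.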
The requirements engineering aspect is over once abstraction sheets are built.
However, just to complete the description of the technique, we consider the manner
in which the star schema is constructed. First, using the information obtained from
abstraction sheets, ideal star schemas are constructed. Thereafter, in the bottom-up
analysis phase, step (ii) above, entity relationship diagrams of existing operational
databases are obtained and converted to star schemas. Finally, in step (iii) the ideal
star schemas and those of step (ii) are matched. A metric for selection is applied
and the star schemas are ranked. The designer then chooses the best fit for system
design.
Notice that in this technique, the organization context is the goal structure,
whereas the immediate context is the abstraction sheets.
Yet another goal-oriented technique is due to Mazón et al. [30], who base their
approach on the i* methodology. They relate goals supported by the DW with information
requirements. Facts and dimensions are discovered from information requirements.
An intentional actor refers to a decision-maker involved in the decision-making
process. For each intentional actor, there are three intentional elements: goals, tasks,
and resources. Goals can be of three kinds:
• Strategic goals are at the highest level of abstraction. These goals are the main
objectives of the business process and cause a beneficial change of state in the
business. Thus, increase sales is a strategic goal.
• Decision goals are at the next lower level of abstraction. These goals are for
achieving strategic goals. As an example, “open new store” is a decision goal
that achieves the strategic goal “increase sales”.
• Information goals are at the lowest level of abstraction. These goals identify the
information required to achieve a decision goal. For example, “analyze pur-
chases” is an information goal.
Information is derived from information goals and is represented as tasks that
must be carried out to achieve information goals.
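The three goal levels can be pictured as a small hierarchy; the sketch below uses the example goals above but an invented encoding.

```python
# Illustrative encoding of the strategic/decision/information goal levels.
goal_hierarchy = {
    "strategic": "increase sales",
    "decisions": {
        "open new store": {                      # decision goal achieving the strategic goal
            "information": ["analyze purchases"] # information goals for this decision
        },
    },
}

def information_goals(hierarchy):
    """Collect every information goal reachable from the strategic goal."""
    return [ig for d in hierarchy["decisions"].values() for ig in d["information"]]

print(information_goals(goal_hierarchy))  # ['analyze purchases']
```

In the method itself, it is from such information goals that the tasks, and ultimately the facts and dimensions, are derived.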
The requirements process starts with identification of decision-makers and a
strategic dependency model is built that shows the dependency between different
decision-makers. In the Strategic Rationale model, SR of i*, specific concepts for
multidimensional structures are introduced. These are business process, measures,
goals. This association arises because facts are the data to be kept when a goal is
achieved. Finally, facts are augmented with their attributes.
In the decisional phase, the organizational model is reviewed but with the
decision-maker as the actor. The focus in this phase is in determining the analysis
needs of the decision-maker and goals like analyze sales are established. Such goals
are decomposed to yield their own goal hierarchy. Facts are normally imported
from the organizational perspective but some additional facts may be obtained
when the analyst investigates the goal model of the decisional phase. Dimensions
are obtained by considering the leaf goals of the decision-maker goal hierarchy and
the facts in the upper layers of this hierarchy.
The techniques discussed in the previous section associate facts and dimensions
with goals. There are other approaches that start out with goals but introduce an
intermediate concept using which facts and dimensions are obtained.
The goal-process approach of Boehnlein and Ulbricht [7, 26] relies on the
Semantic Object model, SOM, framework. After building a goal model for the
business at hand, the business processes that are performed to meet the goals are
modeled. The business application systems resulting from these are then used to
yield a schema in accordance with the Structured Entity Relationship Model,
SERM. Business objects of the business processes get represented as entities of
SERM, and dependencies between entities are derived from the task structure.
Thereafter, a special fourth stage is added to SOM in which only those attributes
that are relevant for information analysis required for decision-making are identi-
fied. Thereafter, the developer converts the SERM schema to facts and dimensions;
facts are determined by asking the question, how can goals be evaluated by metrics?
Dimensions are identified from dependencies of the SERM schema.
The Goal-Decision-Information, GDI, technique [28, 29] associates decisions
with business goals. A decision is a selection from a choice set of alternatives. Each
alternative is a way of achieving a goal. The decision-maker needs information in
order to select an alternative. For each decision, relevant information is obtained by
writing informational scenarios. These scenarios are sequences of information
requests expressed in an SQL-like language. An information scenario is thus a
typical system–stakeholder interaction to identify information required for a deci-
sion. Once information for all decisions is elicited, an ER diagram is built from
which the multidimensional schema is constructed.
Typical information retrieval requests use the rather fuzzy notion of “relevant
information”. What constitutes “relevance” is not spelt out.
Though the DWRE area is highly oriented toward goals, techniques that start off
from notions other than goals do exist.
One such example is that of BEAM*. This approach [12] gives prominence to
business events that comprise a business process. Each business event is repre-
sented as a table and the RE problem now is to identify the table attributes. This is
done by using the 7W framework that provides for asking questions of seven types,
namely, (1) Who is involved in the event? (2) What did they do? To what was it done?
(3) When did it happen? (4) Where did it take place? (5) Why did it happen?
(6) How did it happen—in what manner? and (7) How many or much was recorded
—how can it be measured? Out of these, the first six supply dimensions, whereas
the last one supplies facts.
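A sketch of how the 7W answers for a single event separate into dimensions and a fact; the event record and its field names are illustrative, not BEAM* notation.

```python
# Hypothetical 7W record for one "order placed" business event.
event = {
    "who": "customer C42",   # 1 who is involved
    "what": "product P7",    # 2 what did they do / to what
    "when": "2017-03-01",    # 3 when did it happen
    "where": "store S3",     # 4 where did it take place
    "why": "promotion",      # 5 why did it happen
    "how": "online",         # 6 how, in what manner
    "how_many": 2,           # 7 how many: the measurable quantity
}

# The first six Ws supply dimensions; the seventh supplies the fact.
dimensions = {k: v for k, v in event.items() if k != "how_many"}
facts = {k: v for k, v in event.items() if k == "how_many"}

print(sorted(dimensions))  # ['how', 'what', 'when', 'where', 'who', 'why']
print(facts)               # {'how_many': 2}
```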
Yet, another proposal kicks off from use cases [34]. Use cases are used for
communication between stakeholders, domain experts, and DW designers. The
authors propose an incremental method to develop use cases. Facade iteration is the
first iteration where use case outlines and high-level descriptions are captured. Its
purpose is to identify actors for other major iterations. The information gathered is
regarding names and short descriptions of actor interactions with DW system.
During the next iteration, ideas of use cases are broadened and deepened. They
generally include “functional” information requirements plus requirement attributes.
Since the requirements gathered can be too large, use cases are first indi-
vidually evaluated for errors and omissions, then prioritized and pruned. This is
done so that at the end only the use cases that provide sufficient information to build
DW system are left. Thereafter, conflicting/inconsistent use cases are identified and
reassessed. Finally, use cases are used for obtaining relevant information.
The use of key performance indicators has also formed the basis of DWRE
techniques. References [35, 36] model business indicators as functions and identify
the needed parameters and return type. That is, input and output information needed
to compute a business indicator is determined.
It can be seen that there is a clear attempt to obtain the context in which facts and
dimensions of interest carry meaning. This context is explored through a variety of
concepts like goals, decisions, business processes, business events, and KPIs.
Thereafter, attention turns to obtaining data warehouse information. The techniques
for this second part are summarized in Table 2.4.
The primary difficulty with Boehnlein and Ulbricht is the absence of any model
or guideline to discover the attributes relevant to the analysis. The authors do not
indicate how stakeholders articulate the analysis to be performed. Consequently,
attribute identification becomes an unfocused activity. Further, the approach is for
2.8 Conclusion
As with transactional systems, the two issues have been treated independently of
one another. That is, the manner in which requirements engineering can support an
efficient, iterative, and incremental development strategy has not been addressed.
It is evident, however, that there is a fundamental difference between require-
ments engineering for transactional systems and that for data warehousing. The
former is oriented toward discovering the functionality of the system to-be. The
discovered functionality is then implemented or operationalized in the system to be
built. In contrast, the problem of DWRE is to determine the information contents of
the data warehouse to-be. However, our analysis of information elicitation tech-
niques shows that these are rather ad hoc, and provide little guidance in the
requirements engineering task. We need models, tools, and techniques to do this
task better.
References
1. Loshin, D. (2013). Business intelligence the savvy manager’s guide (2nd ed.). Elsevier.
2. Hughes, R. (2013). Agile data warehousing project management business intelligence
systems using scrum. Morgan Kaufman.
3. Ericson, J. (2006, April). A simple plan, information management magazine. http://www.
information-management.com/issues/20060401/1051182-1.html. Accessed September 2011.
4. Hayen, R., Rutashobya, C., & Vetter, D. (2007). An investigation of the factors affecting data
warehousing success. Issues In Information Systems, VIII(2), 547–553.
5. Alshboul, R. (2012). Data warehouse explorative study. Applied Mathematical Sciences, 6
(61), 3015–3024.
6. Inmon, B. (2005). Building the data warehouse (4th ed.). New York: Wiley.
7. Boehnlein, M., & Ulbrich vom Ende, A. (1999). Deriving initial data warehouse structures
from the conceptual data models of the underlying operational information systems. In
Proceedings of Workshop on Data Warehousing and OLAP (pp. 15–21). ACM.
8. Hüsemann, B., Lechtenbörger, J., & Vossen, G. (2000). Conceptual data warehouse design. In
Proceedings of the International Workshop on Design and Management of Data Warehouses
(DMDW’2000), Stockholm, Sweden, June 5–6.
9. Moody, D. L., & Kortink, M. A. R. (2000). From enterprise models to dimensional models: A
methodology for data warehouses and data mart design. In Proceedings of the International
Workshop on Design and Management of Data Warehouses, Stockholm, Sweden (pp. 5.1–
5.12).
10. Golfarelli, M., Maio, D., & Rizzi, S. (1998). Conceptual design of data warehouses from E/R
schemes. In Proceedings of the Thirty-First Hawaii International Conference on System
Sciences, 1998 (Vol. 7, pp. 334–343). IEEE.
11. Prakash, N., Prakash, D., & Sharma, Y. K. (2009). Towards better fitting data warehouse
systems. In The practice of enterprise modeling (pp. 130–144). Springer, Berlin, Heidelberg.
12. Corr, L., & Stagnitto, J. (2012). Agile data warehouse design. UK: Decision One Press.
13. Ambler, S. www.agiledata.org.
14. CMP. Data mart consolidation and business intelligence standardization. www.
businessobjects.com/pdf/investors/data_mart_consolidation.pdf.
15. Muneeswara, P. C. Data mart consolidation process, What, Why, When, and How, Hexaware
Technologies white paper. www.hexaware.com.
16. Ballard, C., Gupta A., Krishnan V., Pessoa N., & Stephan O. Data mart consolidation: Getting
control of your enterprise information. redbooks.ibm.com/redbooks/pdfs/sg246653.pdf.
17. Bansali, N. (2007). Strategic alignment in data warehouses two case studies (Ph.D. thesis).
RMIT University.
18. Paim, F. R. S., & de Castro, J. F. B. (2003). DWARF: An approach for requirements
definition and management of data warehouse systems. In 11th IEEE Proceedings of
International Conference on Requirements Engineering, 2003 (pp. 75–84). IEEE.
19. Winter, R., & Strauch, B. (2003). A method for demand-driven information requirements
analysis in data warehousing projects. In Proceedings of the 36th Annual Hawaii
International Conference on System Sciences, 2003 (p. 9). IEEE.
20. Antón, A. I. (1996, April). Goal-based requirements analysis. In Proceedings of the Second
International Conference on Requirements Engineering (pp. 136–144). IEEE.
21. Lamsweerde, A. (2000). Requirements engineering in the year 00: A research perspective. In
Proceedings of the 22nd International Conference on Software Engineering (pp. 5–19). ACM.
22. Sutcliffe, A. G., Maiden, N. A., Minocha, S., & Manuel, D. (1998). Supporting
scenario-based requirements engineering. IEEE Transactions on Software Engineering, 24
(12), 1072–1088.
23. Lamsweerde, A., & Willemet, L. (1998). Inferring declarative requirements specifications
from operational scenarios. IEEE Transactions on Software Engineering, 24(12), 1089–1114.
24. CREWS Team. (1998). The CREWS glossary, CREWS Report 98-1. http://SUNSITE.
informatik.rwth-aachen.de/CREWS/reports.htm.
25. Liu, L., & Yu, E. (2004). Designing information systems in social context: A goal and
scenario modelling approach. Information systems, 29(2), 187–203.
26. Boehnlein, M., & Ulbrich vom Ende, A. (2000). Business process oriented development of
data warehouse structures. In Proceedings of Data Warehousing 2000 (pp. 3–21). Physica
Verlag HD.
27. Bonifati, A., Cattaneo, F., Ceri, S., Fuggetta, A., & Paraboschi, S. (2001). Designing data
marts for data warehouses. ACM Transactions on Software Engineering and Methodology, 10
(4), 452–483.
28. Prakash, N., & Gosain, A. (2003). Requirements driven data warehouse development. In
CAiSE Short Paper Proceedings (pp. 13–17).
29. Prakash, N., & Gosain, A. (2008). An approach to engineering the requirements of data
warehouses. Requirements Engineering Journal, Springer, 13(1), 49–72.
30. Mazón, J. N., Pardillo, J., & Trujillo, J. (2007). A model-driven goal-oriented requirement
engineering approach for data warehouses. Advances in Conceptual Modeling–Foundations
and Applications (pp. 255–264). Springer, Berlin, Heidelberg.
31. Giorgini, P., Rizzi, S., & Garzetti, M. (2008). GRAnD: A goal-oriented approach to
requirement analysis in data warehouses. Decision Support Systems, 45(1), 4–21.
32. Leal, C. A., Mazón, J. N., & Trujillo, J. (2013). A business-oriented approach to data
warehouse development. Ingeniería e Investigación, 33(1), 59–65.
33. Giorgini, P., Rizzi, S., & Garzetti, M. (2005). Goal-oriented requirement analysis for data
warehouse design. In Proceedings of the 8th ACM International Workshop on Data
Warehousing and OLAP (pp. 47–56). ACM.
34. Bruckner, R., List, B., & Scheifer, J. (2001). Developing requirements for data warehouse
systems with use cases. In AMCIS 2001 Proceedings, 66.
35. Prakash, N., & Bhardwaj, H. (2014). Functionality for business indicators in data warehouse
requirements engineering. Advances in conceptual modeling (pp. 39–48). Springer
International Publishing.
36. Bhardwaj, H., & Prakash, N. (2016). Eliciting and structuring business indicators in data
warehouse requirements engineering. Expert Systems, 33(4), 405–413.
37. Nasiri, A., Wrembel, R., & Zimányi, E. (2015). Model-based requirements engineering for
data warehouses: From multidimensional modelling to KPI monitoring. In International
Conference on Conceptual Modeling (pp. 198–209). Springer.
Chapter 3
Issues in Data Warehouse Requirements Engineering
In this chapter, we consider the three issues that emerge from the previous chapter,
namely,
1. The need for a central notion that forms the focus of data warehouse require-
ments engineering. Just as the notion of a function is central to transactional
requirements engineering, we propose the concept of a decision as the main
concept for data warehouse requirements engineering.
2. The development of information elicitation techniques. The absence of sys-
tematic information elicitation makes the data warehouse requirements engi-
neering process largely ad hoc. We propose to systematize these techniques.
3. The tension between rapid DW fragment development and consolidation. This
tension arises because consolidation is treated as a project in itself separate from
the DW fragment development process. We propose to integrate it with the
requirements engineering process of the data warehouse development life cycle.
Since a data warehouse system is used to provide support for decision-making, our
premise is that any model that is developed must be rooted in the essential nature of
decision-making. Therefore, in this section, we first consider the notion of a
decision process. Thereafter, we consider the role of data warehousing in
decision-making and highlight the importance of basing data warehouse require-
ments engineering on the notion of a decision.
The decision process, see Fig. 3.1, can be seen as transforming information as input
into decisions as outputs. The input/output nature of a decision process suggests
information transfer to and from elements that are external to the process. These
elements may be sub-processes within the same decision system or may be separate
elements. These elements are collectively referred to as the environment of the
decision process. It is important for the environment and the decision process to be
appropriately matched so as to achieve the intended purpose.
The decision-making process has been described in terms of (a) its essential
nature and (b) the steps that constitute it. According to the former, decision-making
is [1] the intellectual task of selecting a particular course of action from a set of
alternative courses of action. This set of alternatives is often referred to as the
choice set. Turban [2] takes this further. Not only is an alternative selected but there
is also commitment to the selected course of action. Makarov et al. [3] formulate the
decision-making problem as an optimization problem: the idea is to pick the
optimal alternative from the choice set. Given C, a set of alternatives, and Op
as an optimality principle, the decision-making problem is represented as the pair
<C, Op>. The decision-maker is the person, a manager in an organization perhaps,
who formulates this pair and thereafter finds its solution. The solution to <C, Op> is
the set of alternative(s), COp, that meet the optimality principle Op.
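The <C, Op> formulation can be rendered directly; the scoring function and alternatives below are invented for illustration.

```python
# C is the choice set, Op an optimality principle; the solution C_Op is
# every alternative whose score satisfies Op.
def solve(choice_set, score, optimality):
    """Return C_Op, the alternatives meeting the optimality principle."""
    alternatives = list(choice_set)
    best = optimality(score(a) for a in alternatives)
    return {a for a in alternatives if score(a) == best}

# Illustrative example: alternatives scored by expected profit, Op = maximize.
expected_profit = {"open store": 120, "online push": 150, "franchise": 150}
c_op = solve(expected_profit, expected_profit.get, max)
print(sorted(c_op))  # ['franchise', 'online push']
```

Note that C_Op need not be a single alternative: two alternatives tied under Op both belong to the solution set.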
The second view of the decision-making process emphasizes the steps to be
followed in performing this task. Simon [4] considers these steps as
decision-making phases. There are three phases:
• Intelligence: This is for searching conditions that need decisions,
• Design: Possible courses of action are identified in this phase. This may require
inventing, developing, and analyzing courses of actions, and
• Choice: Here, selection of a course of action is done from those available.
A more detailed description of the decision-making process is available in [5].
The decision-making process is organized here in five steps as follows:
(1) Define the problem,
(2) Identify the alternatives and criteria, constituting the decision problem. That is,
C and Op of the <C, Op> pair are identified,
(3) Build an evaluation matrix for estimating the alternatives criteria-wise,
(4) Select method to be applied for doing decision-making, and
(5) Provide a final aggregated evaluation.
a_i^{WS} = Σ_{j=1}^{n} w_j a_{ij}, for all i,

where w_j is the weight of criterion j and a_{ij} is the value of alternative i on criterion j.
When looking for maximization, the alternative with the highest weighted score
is the best one.
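The weighted score computation can be sketched as follows; the weights and the evaluation matrix are illustrative (integer weights are used so the arithmetic is exact, since the weights need not sum to one).

```python
# Weighted score of each alternative: sum over criteria of weight * value.
weights = [5, 3, 2]      # w_j, one weight per criterion
matrix = {               # a_ij: value of alternative i on criterion j
    "A": [7, 9, 4],
    "B": [8, 6, 7],
}

def weighted_score(values, weights):
    return sum(w * a for w, a in zip(weights, values))

scores = {alt: weighted_score(vals, weights) for alt, vals in matrix.items()}
best = max(scores, key=scores.get)  # maximization: highest score wins

print(scores)  # {'A': 70, 'B': 72}
print(best)    # B
```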
(5) Fuzzy methods [9, 10] use fuzzy set theory to enrich decision-making methods
for defining decision criteria and/or attributes values. As a result, uncertainty
and interdependence between criteria and alternatives can be addressed.
However, preferences determined by this type of method can be inexact.
information. In all cases, a data warehouse is the source of information and it is for
the external decision process to use this source to produce the decision output.
Evidently then, the data warehouse is neutral to whether the decision process is
structured, ill-structured, or semi-structured. In its role as information supplier, the
data warehouse must supply complete, unambiguous, and perfect information to
structured decision processes. If the decision process is not well structured and the
nature of information is a factor in this being so, then the information supplied may
be incomplete, imperfect, and ambiguous. The data warehouse is not impacted by
factors other than the nature of information and is thus neutral to unstructured
processes arising, for example, due to ill-defined decision-making goals.
Whereas from the decision process viewpoint interest in a data warehouse is as a
source of information to arrive at decisions, from the data warehouse requirements
engineering, DWRE, point of view interest is in determining the information
contents of the data warehouse to support decision-making. In other words, if we
can determine the decisions of interest, then the requirements engineering task
would be to discover the information relevant to these. The information may be
well defined to support structured decision processes, or it may be insufficiently
defined and support semi-structured decision processes (Fig. 3.2).
Since information discovery is motivated by decisions, we refer to this
requirements engineering process as decision-centric. We see the decision-centric
requirements engineering process in two parts:
1. Determination of decisions.
2. Elicitation of information relevant to each decision.
Evidently, it is not possible to claim that all possible decisions shall be deter-
mined. That is, the completeness of the set of decisions cannot be established but, in
keeping with requirements engineering practice, an agreement can be reached
among stakeholders and with the requirements engineer that the discovered ones are
adequate.
We see three kinds of decisions in a business, namely, decisions for formulating
business policies, for formulating policy enforcement rules, and for operational
decision-making. Policies are the broad principles/directions that the business shall
follow. Policies are relatively stable over time but change when the business or its
environment changes. Policy enforcement rules provide guidelines/directives on
possible actions that can be taken in given business situations and can be represented in the form IF x THEN y, where x is a business situation and y is a possible course
of action. Policy enforcement rules, PERs, can be formulated once policies are
Fig. 3.3 Business decisions: policy formulation decisions, PER formulation decisions, and operational decisions
agreed upon. These rules change relatively more often than policies do. Finally,
operational decisions are selections of possible courses of actions that can be
implemented/carried out in the business.
Figure 3.3 shows an architecture for these three kinds of decisions. The rectangle
at the top of the figure shows that a business consists of three different kinds of
decisions. For each of these, we have a separate data warehouse/mart, DWp for
policy formulation decisions, DWper for PER formulation, and DWop for opera-
tional decisions, respectively.
Taken individually, each data warehouse/mart has its own decisional motivation.
The kinds of decisions that are to be supported are different and each data ware-
house has its own information contents. There is no common data model and
common platform. However, from the point of view of an enterprise, the three kinds
of business decisions are interrelated as shown in Fig. 3.4: policy decisions lead to
decisions about policy enforcement rules that in turn lead to operational decisions.
Let us illustrate this through an example of a hospital service.
Policy decisions lead to formulation of policies of the hospital. For example, in
order to have an adequate number of doctors, the hospital defines patient: doctor
ratios, eight patients for every doctor, twelve per doctor, etc. Once the policy has
been laid down, rules to enforce it must be formulated. This is done through
decisions for formulating PERs. Thus, see Fig. 3.4, PER decisions are derived
from policy decisions. For example, in our hospital, patient demand for services,
resignations and superannuation of doctors, enhancing services, etc. lead to situa-
tions in which the patient: doctor policy could be violated. PERs articulate the
actions to be taken under given conditions, when to transfer doctors from one unit
to another, when to hire, etc.
Lastly, Fig. 3.4 shows that PER decisions drive operational decisions. The issue
in operational decision-making is that of first deciding the PER that shall be selected
and thereafter, about the parameters/individuals on which the action of the selected
rule shall be applied. For example, given the choice of hiring or transferring a
3.1 The Central Notion of a Decision 57
Fig. 3.4 Policy decisions derive PER decisions, which in turn drive operational decisions
doctor, a decision about which one is to be adopted is needed. Let us say that it is to
transfer. Thereafter, the particular doctor to be transferred is to be decided upon.
The foregoing suggests that policy and PER decisions do not cause changes in
transactional data whereas operational decisions do. To see this, consider our
hospital once again and refer to Fig. 3.5. As shown in the figure, transactional
information systems supply information that shall be kept in the data warehouse.
Before it is put in the data warehouse, the ETL (Extraction, Transformation, and
Loading) process is carried out.
Fig. 3.5 The transactional information system feeds the data warehouse through ETL; only operational decisions lead to changes in transactional data
Now, let it be required to take the decision that formulates the patient–doctor
policy. The decision-maker consults the data warehouse before an appropriate
patient: doctor ratio is adopted. As such, there is no effect on the transactional
information system. When the decision-maker derives PERs from the policy, then
again reference to the data warehouse is made. Yet again, there is no effect on the
transactional information system. During operational decision-making, the
decision-maker selects the appropriate PER to be applied to the situation at hand.
Again, the data warehouse needs to be consulted to comprehend the current situ-
ation of the business. Even at this stage, the transactional information system shall
not be affected. However, when the parameters/individuals of the action suggested
by the rule are determined from the data warehouse and then the action is actually
performed, then, and only then, shall the data in the transactional system be
changed (for example, the unit of the doctor shall change upon transfer). This is
shown in Fig. 3.5 by the arrow labeled change.
The relationship between the various kinds of decisions and therefore between
their data warehouses (Fig. 3.3) indicates that an all-encompassing enterprise-wide
data warehouse would support all three kinds of decision-making. In other words,
we could treat the three individual data warehouses as well as their different
components taken separately as DW fragments. The DW fragment proliferation
problem therefore gets even more severe and special attention to consolidation
needs to be paid.
Data Warehouse Fragments
It is to be noted that the motivation behind the decisional data warehouse is not mere
analysis but the idea of providing information support to a decision. This decision is
the reason for doing analysis and analysis per se is not interesting. That is, the data
warehouse is not developed based on perceived analysis needs; data warehousing is
not driven by such statements as “I want to analyze sales” but by explicit decisions
like “Arrest sales of mobile phones” and “Abandon land line telephony”. Such
decisions call for eliciting relevant information and making it available in the data
warehouse/mart.
This decision-centric nature of the DWRE process changes the perspective of a
DW and because of this shift, we postulate the notion of a data warehouse fragment.
We define a data warehouse fragment, DW fragment for brevity, as a decision-
oriented collection of time variant, nonvolatile, and integrated data. This is in
contrast to a DW fragment being a subject-oriented, time variant, nonvolatile, and integrated collection of data.
A subject-oriented data mart lays emphasis on the business unit (sales, purchase)
for which the data warehouse is to be built. Interest is in determining the infor-
mation relevant to the unit. During requirements elicitation, stakeholders identify
what they want to analyze and what information is needed for this analysis. The
reason behind doing this analysis remains in the background.
A DW fragment springs off from the decisions that it shall support. There are two
aspects to be considered, decisions and information. Decisions may be completely
local to a unit of an organization or may span across organizational units. In the
the bigger the subject, the less is the proliferation. Thus, a stores data mart has a
larger granularity than the sales data mart or the purchase data mart. We get one
mart for the subject store and two for the latter. Similarly, if a DW fragment is for a
decision of large granularity or for several decisions, then we will get a larger DW
fragment than if the fragment is for a small granularity decision. We will consider
the notion of decision granularity in detail in Chap. 6.
Bullen and Rockart [11] look upon a Critical Success Factor, CSF, as a key area of
work. Meeting a CSF is essential for a manager to achieve his/her goals. Evidently,
Ends achievement can be considered in two different ways, depending upon the
way one conceptualizes the notion of Ends. These are as follows:
(1) An End is a statement about what is to be achieved, a goal. In this view, one can
do Ends analysis by asking which Ends contribute to the achievement of which
other Ends. Notice that an End is different from a CSF in that the latter is a
work area where success is critical whereas End is that which is to be achieved.
(2) An End is the result achieved by performing a task or is the intended result of a
decision. Therefore, unlike view (1) above, interest is not in determining which
End achieves which given End. Rather, interest is in determining the infor-
mation needed to ensure the effectiveness of the End. In other words, Ends
analysis here is the identification of the needed information. We refer to it as
ENDSI elicitation.
Notice the difference between the notion of a CSF and this view of Ends.
Whereas a CSF is about success in a work area, an End is the expected result of
a decision. A CSF is at a more “macro” level, whereas an End is relatively more
focused and is at a “micro” level.
In our context, “Ends” refers to the result achieved by a decision. Therefore,
requirements engineering is focussed on determining the information for the
effectiveness of the result. The manager considers only those decisions that con-
tribute positively to Ends effectiveness.
As for CSF above, we see that this ensures that the Ends effectiveness technique
is close to the manager’s view of a business and that it directly relates to decisions
for promoting Ends effectiveness. This ensures continued manager interest in the
requirements engineering task.
62 3 Issues in Data Warehouse Requirements Engineering
A means is of as much interest in the business world as are the notions of Ends and
CSF. A means is an instrument for achieving an End and interest lies in determining
the efficiency of the deployed means. Thus, we need a technique for determining
means efficiency, thereby identifying information for evaluating the efficiency of
the means. We refer to the technique for obtaining this information about means as
MEANSI elicitation.
Just as for the Ends achievement technique, the means efficiency technique is
close to the manager’s view of the business. Since it directly relates to an important
concern of managers, it ensures relatively high manager involvement in the
requirements elicitation task.
Sterman [13] has shown that feedback plays an important role in the area of
dynamic decision-making. The business environment is changed by a decision. As
a result, the conditions of choice get changed and these eventually feed back into
the decision. A feedback cycle is formed. For example, let a manager take a
decision to increase production. This changes the price, profits, and demand of
goods. Consequently, there is an effect on the labor and materials market of the
business. Additionally, customers may also react to the changed environment. All
these affect future production decisions.
We interpret this feedback loop in terms of information. The manager needs
information about each element (price, profit, etc.) in the feedback loop so as to
make future production decisions.
3.2.5 Summary
This removes the problem with strategy (a) since there is no common pool of
(N + K) requirements granules. However, this strategy does not take into
account that as a result of KC2 comparisons, some related DW fragments could
3.3 Requirements Consolidation 65
It can be seen that all the three cases yield polynomial time complexity for the
expression of the total number of comparisons to be made. At low values of N and
K, doing the comparison may even be possible but as these values rise, the
problem starts to get out of hand. In other words, for large values of N and K, it is
worthwhile to consider the “consolidate by redesign” approach to consolidation.
However, for low values of N and K, we can still consider the incremental and
iterative approach to consolidation.
We can convert our polynomial time complexity problem into a linear problem
by making K = 1 and applying it to the expression in (c) above. As a consequence,
L becomes equal to 1 since one DW fragment has to be built. Therefore, we get
Number of combinations = N
Number of DW fragments = (N − M) + 1
earlier. The logical data models of the N − M DW fragments are not changed,
whereas the M + 1 DW fragments get consolidated into a single logical data model.
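The K = 1 arithmetic above can be sketched directly; the function names are ours:

```python
# K = 1 consolidation arithmetic: one new requirements granule is
# compared against the N existing ones, giving N comparisons; if it
# overlaps M of them, those M and the new granule consolidate into one,
# leaving (N - M) + 1 DW fragments.

def comparisons(n):
    """Pair-wise comparisons of one new granule against N existing ones."""
    return n

def fragments_after_consolidation(n, m):
    """(N - M) untouched fragments plus 1 consolidated fragment."""
    return (n - m) + 1

# e.g. 10 existing fragments, 3 of which share content with the new granule
remaining = fragments_after_consolidation(10, 3)  # 8 fragments
```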
As shown in Chap. 2, there are only two techniques for DW fragment consol-
idation, namely, DW fragment redesign and merge with primary. Adopting the
former violates the incremental and iterative development principle that, as stated
above, is our preferred approach. Thus, we are left with merge with primary as the
alternative.
Our consolidation process does pair-wise consolidation as shown in Fig. 3.7. Let
the organization start to build its first DW fragment, DWF1, with requirements
granule RG1. This is shown by the dashed line between DWF1 and RG1 in the
figure. There is no previous backlog of fragments and so RG1 can be directly taken
into development. When the organization starts on the second DW fragment
DWF2, then an attempt is made to integrate its requirements specification, RG2
with RG1. If these are disjoint, then we get two separate DW fragments in the
organization. If, however, these have commonalities, then we get the integrated
requirements specification, RG3, that is then taken through the development life
cycle to yield a physically and logically unified data warehouse fragment, DWF2.
For the third DW fragment, we either have two disjoint DW fragments or a con-
solidated one. In either case, the requirements of the new DW fragment are matched
with those of the existing ones, to, as before, either add to the backlog of disjoint
DW fragments or do further consolidation. This process continues for each DW
fragment to be developed. The figure shows the case where consolidation of RG3
with the new requirements granule RG4 can be performed.
Fig. 3.7 Pair-wise consolidation: the RG of the DW fragment to-be is matched by the Requirements Integrator, yielding (N − M) + 1 RGs
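The pair-wise consolidation loop described above might be sketched as follows, with a requirements granule modelled simply as a set of information items (our simplification, not the book's representation):

```python
# "Build by integrating" sketch: each new requirements granule is
# matched against the existing granules; if it shares information items
# with one, the two are consolidated into a single granule, otherwise it
# becomes a new disjoint DW fragment in the backlog.

def consolidate(granules, new_granule):
    """Return the granule backlog after integrating new_granule."""
    for i, g in enumerate(granules):
        if g & new_granule:                      # commonality found: merge
            return granules[:i] + [g | new_granule] + granules[i + 1:]
    return granules + [new_granule]              # disjoint: new fragment

backlog = []
backlog = consolidate(backlog, {"sales", "region"})    # RG1: first fragment
backlog = consolidate(backlog, {"sales", "product"})   # merges with RG1
backlog = consolidate(backlog, {"doctor", "ward"})     # disjoint fragment
```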
It can be seen that the underlying development principle of the process model is
to “build by integrating”. That is, rather than wait for inconsistency and cost issues
to arise and then treat consolidation as a separate process to be carried out, our
approach is to check out the possibility of integration during the requirements
engineering process itself. This forces a comprehensive look at the DW fragments
being developed in the organization by the IT department. The chances of having
multiple platforms therefore reduce. Additionally, DW fragment proliferation is
minimized since whatever can be integrated is in fact consolidated upfront in the
requirements engineering stage.
Additionally, the cost of developing new fragments that can be integrated with
others is minimized. We can see this by considering RG2 in Fig. 3.7. Under the
traditional approach, no effort would be made to match it against RG1. The
requirements in RG2 would result in an implemented data mart. However, the
development effort would be wasted since the design and implementation would be
discarded at the time integration, with RG1, eventually occurs. In the “build by
integration” approach, this effort would not be put in, as the possibility of inte-
gration is spotted during requirements engineering. However, if there is no com-
monality between RG1 and RG2, then effort would be put in development of RG2
and similar costs as with the traditional approach would be incurred.
This suggests that the “build by integration” approach provides best results when
the DW fragments that have common information are taken up for development in
the organization. This results in
3.4 Conclusion
References
Given the central role that the decision plays, the critical step in developing a DW
fragment is that of establishing the decision or collection of decisions that the
fragment shall support. Since the notion of a DW fragment is neutral to the nature
of decision that it supports, we need to define the different kinds of decisions that
there can be, where they originate from, and how they can be defined.
We have already seen that there are three levels of decisions in a business, for
formulating policies, policy enforcement rules, and also operational decisions. The
interesting issue now is that of determining these decisions. That is, support for the
task of obtaining decisions is to be provided.
In Sect. 4.1, we address the issue of formulating policies of an enterprise. There
is widespread recognition that the task of policy formulation is a complex one. In
order to do this, we consider the meaning of the term, policy. Thereafter, we
represent policies in a formalism based on the first-order predicate logic. We show
that a statement of this logic, therefore a policy, can be represented as a hierarchy.
The nodes of this hierarchy are components of the policy. Lastly, we associate the
possibility of selecting, rejecting, or modifying any node of the hierarchy. This
yields the set of decisions for formulating enterprise policies.
The issue of formulating policy enforcement rules is taken up in Sect. 4.2. Since
these rules are for policy enforcement, it is necessary that the policy corresponding
to the rule is already formulated. A representation system for policy enforcement
rules is presented and a collection of rules and guidelines is developed to obtain
policy enforcement rules from policies. The developed rules are associated with
operators to select, reject, and modify them. This yields the set of decisions using
which enforcement rules of the enterprise are formulated.
Finally, we consider operational decision-making in Sect. 4.3. Operational
decision-making first involves examining the policy enforcement rules applicable to
the situation at hand to select the most appropriate one. In Sect. 4.3, the structure of
• a constant,
• an SV, or
• an n-adic function symbol applied to n SVs.
A CT is:
• a CV or
• an n-adic function symbol applied to n CVs.
An atom is an n-place predicate P(x1, x2, …, xn) where any xi is either ST or CT.
There are standard predicates for the six relational operators named EQ (x, y), NEQ
(x, y), GEQ (x, y), LEQ (x, y), GQ (x, y), and LQ (x, y).
The formulae of the logic are defined as follows:
• Every atom is a formula.
• If F1 and F2 are formulae, then F1 AND F2, F1 OR F2, and Not F1 are
formulae.
• If F1 and F2 are formulae, then F1 → F2 is also a formula.
• If F1 is a formula, then ∃sF1 and ∀sF1 are formulae. Here, s is SV or CV.
• Parenthesis may be placed around formulae as needed.
• Nothing else is a formula.
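The formation rules above can be mirrored in a small recursive checker; the nested-tuple encoding is our own device for illustration, not the book's notation:

```python
# Recursive check of the formation rules: atoms are formulae; AND, OR,
# NOT, the implication, and the quantifiers build larger formulae.

def atom(pred, *args):
    return ("ATOM", pred, args)

def implies(f1, f2):
    return ("IMPLIES", f1, f2)

def forall(var, f):
    return ("FORALL", var, f)

def is_formula(f):
    tag = f[0]
    if tag == "ATOM":
        return True
    if tag in ("AND", "OR", "IMPLIES"):
        return is_formula(f[1]) and is_formula(f[2])
    if tag == "NOT":
        return is_formula(f[1])
    if tag in ("FORALL", "EXISTS"):       # quantifier over a variable
        return is_formula(f[2])
    return False

# Policy 1: forall x (doc(x) -> degree(x, MD))
policy1 = forall("x", implies(atom("doc", "x"), atom("degree", "x", "MD")))
```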
4.1 Deciding Enterprise Policies 75
where
AyH(a) says that a is an AYUSH hospital and
operate (a, OPD) says that a must operate an OPD.
Policy 2: A semi-private ward has an area of 200 ft² and two beds.
where
spw(s) says that s is a semi-private ward,
EQ(s,c1) says that s is equal to c1, and
B is a set of beds.
Using the well-formed formulae, a policy is expressed as its structural hierarchy.
For this, the structure of the formulae is decomposed into two parts, the part on the
left-hand side of the implication and the second part that is on the right-hand side of
the implication. These parts can themselves be reduced into formulae and this
decomposition ends when atoms are reached. This provides us a hierarchical
structure for each policy. Figure 4.1 shows a policy P decomposed into formulae
F1, F2, … Fn. Each of these is further decomposed and the process continues till
the leaves of the hierarchy are reached. These leaves are the atoms of the policy.
Fig. 4.1 Policy P decomposed into formulae F1, F2, …, Fn, which are further decomposed (F11, F12, …) down to atoms
76 4 Discovering Decisions
The algorithm starts with the full statement of the policy as the root.
Subsequently, it examines the root for the presence of quantifiers and removes them
giving us the child node which has the root node as its parent node. Subsequent
levels are added by splitting this node into two formulae, one on the right side of the
implication and the other on the left side. Thereafter, postfix trees are built for both
the sides. These subtrees are then attached to their respective parent nodes giving us
the policy hierarchy. The leaves of the final tree are atoms.
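The decomposition just described might be sketched as follows, again with formulae as nested tuples (our encoding, for illustration):

```python
# Hierarchy construction sketch: the root is the full policy; removing
# the quantifier yields its child; the implication splits into its LHS
# and RHS; decomposition stops at atoms, the leaves.

def children(f):
    tag = f[0]
    if tag in ("FORALL", "EXISTS"):
        return [f[2]]                    # strip the quantifier
    if tag in ("IMPLIES", "AND", "OR"):
        return [f[1], f[2]]              # left- and right-hand formulae
    if tag == "NOT":
        return [f[1]]
    return []                            # an atom: a leaf

def hierarchy(f):
    """Policy hierarchy as (node, [subtrees])."""
    return (f, [hierarchy(c) for c in children(f)])

def leaves(tree):
    node, subtrees = tree
    if not subtrees:
        return [node]
    return [leaf for t in subtrees for leaf in leaves(t)]

policy1 = ("FORALL", "x",
           ("IMPLIES", ("ATOM", "doc", ("x",)),
                       ("ATOM", "degree", ("x", "MD"))))
tree = hierarchy(policy1)   # leaves(tree) gives the atoms doc(x), degree(x, MD)
```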
As an example, take Policy 1, the policy “Every doctor must have a post
graduate degree”.
∀x[doc(x) → degree(x, MD)]
Fig. 4.2 Policy hierarchy for “Every doctor must have a post graduate degree”
Fig. 4.4 a Hierarchy for “Every doctor must have a postgraduate degree”. b The process of
adopting a policy
It can thus be seen that “given” policies are represented using first-order logic
and subsequently converted to a policy hierarchy. Choice sets {select node, modify
node, reject node} are associated with each node of the hierarchy.
Reusing Policies
Now the question is how these policy hierarchies are used by any organization to
formulate its own policies. First, the organization constructs the policy hierarchy.
Then starting from the leftmost leaf, decision-makers move up the tree in a
bottom-up manner. As each node is processed, an alternative from the choice set
{select, modify, reject} is picked. The tree is fully processed once the root node has
been selected, modified, or rejected. The algorithm is shown below.
In order to decide on an alternative from the choice set, the decision-maker will
require information. If relevant information is present in the data warehouse, then
this information can be consulted and an alternative from this choice set can be
selected so as to formulate the organizational policy.
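The book shows the traversal algorithm as a figure; the following is only our rendering of the bottom-up walk, with the decision-maker's choices supplied through a callback:

```python
# Bottom-up adoption walk: each node of the policy hierarchy, leaves
# first, is offered to the decision-maker, who answers "select",
# "modify" (with a new value), or "reject".

def process(tree, choose):
    """tree = (node, [subtrees]); choose(node) -> (decision, new_value)."""
    node, subtrees = tree
    kept = [process(t, choose) for t in subtrees]    # children first
    decision, value = choose(node)
    if decision == "reject":
        return None                                  # policy to be reframed
    if decision == "modify":
        node = value                                 # node value changed
    return (node, [t for t in kept if t is not None])

# Example: modify the degree atom from MD to PDCC, select everything else.
def choose(node):
    if node == "degree(x, MD)":
        return ("modify", "degree(x, PDCC)")
    return ("select", None)

tree = ("forall x: doc(x) -> degree(x, MD)",
        [("doc(x) -> degree(x, MD)",
          [("doc(x)", []), ("degree(x, MD)", [])])])
adopted = process(tree, choose)
```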
Using the policy hierarchy shown in Fig. 4.4a, let us first describe steps (1) and
(3). It is assumed here that step (2) is performed every time after step (1) is
performed.
A choice set using {select, modify, reject} is constructed for (a) the two atoms
doc(x) and degree(x, MD) respectively, (b) the next hierarchical level formula, and
(c) the full policy. Selection of doc(x) says that in our new policy this atom is
retained. Its rejection means that the policy is to be reframed and possibly a new
one formulated. Its modification implies the modification of the predicate doc(x)
and essentially results in a creation of a new policy. Assume that doc(x) is selected.
The left topmost corner of Fig. 4.4b shows that the decision SELECT (marked in
orange) is taken and so doc(x) is selected.
Now consider the second atom. Its selection means that the atom is retained;
rejection means that possibly a new policy is to be formulated; modification may
mean that the qualification required may be PDCC. Assume that this modification is
to be done. The choice “Modify” will be selected and therefore, marked with orange
in the figure and the value of the node will be changed. The figure also shows the
selection of the implication at the next level and the selection of the universal
quantifier at the root. Again, selection, modification, or rejection is possible. The
selection of the root node means that the entire policy has been selected. One way
of modifying is the replacement of the universal quantifier with the existential
quantifier. Rejection at the root level means that we are rejecting the binding of the
variable to a quantifier. This means that the entire policy is in fact rejected.
To conclude, the well-formed formulae of the first-order logic make it possible to examine each sub-formula recursively in order to decide whether to select, reject,
or modify as discussed above. By making an appropriate choice, we can formulate
modified policies, reject existing policies, or create new policies.
Such a policy hierarchy is constructed for each policy that may be defined by a
regulatory body, standardization body, or that may be available as best practice of
another organization or as a legacy policy of an organization.
4.2 Deciding Policy Enforcement Rules 79
Once policies have been formulated, the next problem is that of formulating policy
enforcement rules. When an action is performed in an organization, there are two
possibilities. One, that this action does not violate any policy. Two, that this action
violates a policy. Clearly, the latter situation needs to be handled so that there is
policy compliance. Thus, interest here is in formulating rules that specify corrective
actions to be taken.
Let us revisit the structure of policies, defined in Sect. 4.1.1. We can see that
policies can have either simple or complex formulae. Complex formulae are those
involving
• Conjunctions (AND) and disjunctions (OR) and
• n-adic functions.
Since a policy is of the form quantifier (IF Formula1 THEN Formula2), we
obtain four kinds of business policies, as shown in Table 4.1, where S stands for
simple and C for complex. These depend on the nature of Formula1 and Formula2.
A simple policy, SS policy, has both Formula1 and Formula2 as simple. Thus
the policy, “Every doctor must have an M.D. degree”, expressed as ∀x[doc(x) → degree(x, MD)], has no conjunction/disjunction or n-adic function on either its LHS or its RHS and thus is a simple policy.
Row numbers two, three, and four of Table 4.1 have at least one formula as
complex. The policy is of simple–complex (SC) type. Consider the policy,
∀y∃B∃N [GB(y) → ratio(count(N), count(B), 1, 8)]. Since the right-hand side uses
functions, it is complex. LHS is simple. This is an SC-type policy. Another example
is ∀x∃b [S(x) → LEQ(count(b), 3) AND GT(count(b), 1)]. The RHS contains both a
function and AND conjunction. LHS is a simple formula making the policy of SC
type.
Now, let us look at row three of Table 4.1. Here, the situation is reverse of that
in row two. The LHS is complex but the RHS is simple and we have a CS policy.
Consider the policy ∀x[housekeeper(x) OR nurse(x) → Offer(x, PF)]. The LHS of the implication uses the disjunction and is thus complex. The RHS is simple, making this a CS-type policy.
The last row of Table 4.1 considers a CC policy as having complex formulae on
both sides of the implication. The policy ∀x∃wtabSet ∃ftabSet [woodTable(x) OR fibreTable(x) → Sum(count(wtabSet), count(ftabSet), 2)] has the disjunction OR on the LHS of the implication and the function count() on the RHS, both of which are
complex. Thus, this policy is a complex–complex policy.
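A rough classifier along these lines might treat a side of the implication as complex when it contains AND/OR or a known function symbol; the plain-string encoding of sides and the fixed function list are our simplifications:

```python
# SS/SC/CS/CC classification sketch: a side is complex ("C") if it has a
# conjunction/disjunction or an n-adic function, otherwise simple ("S").

def side_kind(side, functions=("count", "ratio", "sum")):
    text = side.lower()
    if " and " in text or " or " in text:     # conjunction / disjunction
        return "C"
    if any(fn + "(" in text for fn in functions):  # n-adic function
        return "C"
    return "S"

def classify(lhs, rhs):
    return side_kind(lhs) + side_kind(rhs)

kind = classify("GB(y)", "ratio(count(N), count(B), 1, 8)")  # "SC"
```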
Consider the situation when in the general form of a policy, Quantifier(IF
Formula1 THEN Formula2), Formula1 is true and Formula2 false. This indicates a
policy is violated. Let there be an action A that causes Formula1 to be True. If
Formula2 is True, then no violation has occurred. On the other hand, if Formula2 is
False, then corrective action say B needs to be carried out so that Formula2
becomes True.
Let us assume now that another action A causes the Formula2 on the RHS to
become False. This implies that either an action C must be performed to make
Formula1 False or that the action A itself should be disallowed.
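The violation condition can be sketched as a predicate over a business state; the state dictionary and the hospital numbers below are invented for the illustration:

```python
# Violation sketch: a policy quantifier(IF F1 THEN F2) is violated in a
# given business state exactly when F1 holds and F2 does not.

def violated(f1, f2, state):
    return f1(state) and not f2(state)

state = {"patients": 90, "doctors": 10}
in_scope = lambda s: True                               # F1: policy applies
ratio_ok = lambda s: s["patients"] <= 8 * s["doctors"]  # F2: 1:8 ratio held

needs_correction = violated(in_scope, ratio_ok, state)  # 90 > 80: violated
```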
In the subsequent sections, we discuss (a) representation of policy enforcement
rules and (b) elicitation of actions A, B, and C above.
Policy enforcement rules are a type of business rules. Business rules have been
represented by either natural language-based approaches or logic-based approaches.
Natural language representation has been used in [12, 13]. Leite and Leonardi [12] define business rules expressed in natural language using specified patterns. Fu et al. [13] make use of templates. A template consists of different kinds of
expressions, for example, determiner and subject expressions. SBVR uses its own
definition of Structured English for expressing business rules. A predicate
logic-based Business Rules Language, BRL, was proposed by Fu et al. [13] but this
has only a limited number of built-in predicates.
Logic-based representation expressed in the IF-THEN form has been used by
Auechaikul and Vatanawood [14]. Two variants of this form have been proposed,
IF-THEN-ELSE by Muehlen and Kamp [15] and WHEN-IF-DO by Rosca et al.
[16].
We need to use the notion of an action in order to represent policy enforcement
rules. There can be two kinds of actions:
• Triggering: This type of action triggers a policy violation. The action could be on the THEN side of the implication and cause the IF side to be false. It can also be on the IF side, causing the THEN side to be false. Action A above is a triggering
action.
• Correcting: As stated, once there is a policy violation, suitable corrective action
has to be taken. Actions B and C above are correcting actions.
Since an activity fact type is absent in SBVR [10], we must explore a more direct
way to represent triggering and correcting actions. Indeed, a representation in logic
shall not yield a direct representation of triggers and actions which will need to be
derived from the functions/predicates comprising well-formed formulas of the
logic. Therefore, the WHEN part of a rule contains the triggering action; the IF part contains the condition to be checked when the triggering action has occurred; and the THEN part contains the correcting action to be taken. Thus, a policy-enforcing
rule is represented as
Notice the similarity of the policy enforcement rule with that of the notion of a
trigger in SQL. A trigger [17] is a stored program, a PL/SQL block, that is
fired when INSERT/UPDATE/DELETE operations are performed and certain
conditions are satisfied. There are thus three components to a trigger, an event, a
condition, and an action corresponding to the WHEN, IF, and THEN part,
respectively.
In SQL, a trigger is seen as an executable component. However, a policy enforce-
ment rule is a directive that governs/guides [11] a future course of action. Seeing this
similarity with SQL, we use here the basic idea behind a range variable of SQL.
The remaining question is about the representation of an action. Actions, both
triggering and correcting, are of the form <verb> <range variable>. To see this, let
us first consider the notion of a range variable.
A range variable denotes an instance of a noun. Before using it, a range variable
is declared using the form:
<OPD> <x>
<Ayurvedic Hospital> <y>
In the first example, OPD is a noun and x is its range variable. This says that x is
an instance of OPD. Similarly, in the second example, y is an instance of Ayurvedic
Hospital.
Now we can construct actions which, as mentioned above, are of the form
<verb> <range variable>. Using the range variables x and y declared above, we can
define actions, create x and operate y, respectively.
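As an illustration only, the constructs above, range variable declarations, actions of the form <verb> <range variable>, and the WHEN-IF-THEN rule form, can be sketched in Python. The classes and the sample rule below are our own illustrative encoding, not part of the method's notation, and the condition text is an assumption:

```python
# An illustrative encoding of range variables, <verb> <range variable>
# actions, and the WHEN-IF-THEN policy enforcement rule form.
from dataclasses import dataclass

@dataclass(frozen=True)
class RangeVariable:
    noun: str  # e.g. "OPD"
    name: str  # e.g. "x", an instance of the noun

@dataclass(frozen=True)
class Action:
    verb: str           # e.g. "create"
    var: RangeVariable  # the range variable acted upon

    def __str__(self) -> str:
        return f"{self.verb} {self.var.name}"

@dataclass(frozen=True)
class PolicyEnforcementRule:
    when: Action  # triggering action
    cond: str     # IF condition, kept as text in this sketch
    then: Action  # correcting action

    def __str__(self) -> str:
        return f"WHEN {self.when} IF {self.cond} THEN {self.then}"

# <OPD> <x> and <Ayurvedic Hospital> <y>, as declared in the text
x = RangeVariable("OPD", "x")
y = RangeVariable("Ayurvedic Hospital", "y")

# Illustrative rule: when a hospital y is created and does not run an
# OPD x, the correcting action is to create x (condition is assumed).
rule = PolicyEnforcementRule(Action("create", y), "!Run(y, x)", Action("create", x))
print(rule)  # WHEN create y IF !Run(y, x) THEN create x
```

Keeping the condition as text mirrors the fact that, at requirements time, the IF part is a formula fragment rather than executable code.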
The policy enforcement rules from above can now be written in this form.
In order to formulate a PER for a policy P, the requirements engineer has to decide
on the possible correcting actions for a triggering action. Let this be the set
{corrAction1, corrAction2, corrAction3, …}.
On examining this set closely, one finds that in fact with every action there is a
choice the requirements engineer has to make, whether to select, modify, or reject
the action. In other words, the choice set presented to the requirements engineer is
{select corrAction1, modify corrAction1, reject corrAction1, select corrAction2,
modify corrAction2, …}
The actions selected become part of the PER, rejected actions are not part of any
PER, and modified actions become part of the PER in their modified form. For
example, if corrAction1 and corrAction2 are selected and corrAction3 is rejected,
then the requirements engineer arrives at two PERs.
Note, the same action can be a correcting action for more than one kind of
triggering action. Also, a triggering action in one PER can be a correcting action in
another PER and vice versa.
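The choice set construction just described is mechanical and can be sketched as follows; the action names are the placeholder names used above:

```python
# Enumerate the {select, modify, reject} choice set over candidate
# correcting actions, and keep the selected/modified ones for the PER.
from itertools import product

OPTIONS = ("select", "modify", "reject")

def choice_set(corr_actions):
    """All option-action pairs offered to the requirements engineer."""
    return [f"{opt} {act}" for act, opt in product(corr_actions, OPTIONS)]

def per_actions(decisions):
    """Actions that survive into policy enforcement rules."""
    return [act for opt, act in decisions if opt in ("select", "modify")]

actions = ["corrAction1", "corrAction2", "corrAction3"]
print(choice_set(actions))
# corrAction1 and corrAction2 selected; corrAction3 rejected
print(per_actions([("select", "corrAction1"),
                   ("select", "corrAction2"),
                   ("reject", "corrAction3")]))  # two PER actions remain
```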
In order to elicit the required actions, the following two macro-guidelines are
used. They apply to all four types of policies: SS, SC, CS, and CC.
• Guideline I: The requirements engineer defines triggering actions to make LHS
true. Since policies are violated when the RHS becomes false, correcting actions
are elicited to make RHS true.
• Guideline II: The requirements engineer defines triggering actions to make
RHS false. Since a policy is violated, its left-hand side becomes true. Correcting
actions are elicited to make the LHS false.
Once the actions have been elicited, the policy enforcement rule is formulated by
filling the “WHEN-IF-THEN” form with the triggering and correcting actions.
Let us look at the application of the guidelines for the four kinds of policies.
SS-Type Policy
Consider the following policy:
Example I: Every Ayurvedic hospital must operate an outpatients department.
Since actions are of the form <verb> <range variable>, we start by defining
range variables as follows:
<Ayurvedic hospital> <h>
<OPD> <o>
SC-Type Policy
Recall that an SC-type policy has a simple LHS and a complex RHS. Since the
LHS is simple, actions for it are elicited in the same way as for the simple policy
types described above. Elicitation strategies are formulated for the complex
predicate (formula) on the RHS.
Unlike general-purpose languages, special-purpose languages do not have full
expressive power [15]. Consequently, recourse is taken to standard predicates, as in
[15], where predefined standard predicates are defined. These standard predicates
can be connected using AND/OR operators.
For complex predicates here, standard predicates along with the elicitation
strategy are defined as shown in Table 4.2. Consider row 1 that defines standard
predicate EQ(Function(x),c). This is complex due to function Function(x). If
Function(x) evaluates to a value less than constant c, then correcting action must
increase the value of the function so that it satisfies the predicate. If, however, its
value is greater than constant c, its value must be decreased by the correcting action.
This approach applies to all rows of the table.
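Row 1's strategy for EQ(Function(x), c) can be sketched as follows; the function name, messages, and values are illustrative placeholders, not the method's notation:

```python
# Elicitation strategy for the standard predicate EQ(Function(x), c)
# (row 1 of Table 4.2): if the function's value is below the constant,
# elicit an action that increases it; if above, one that decreases it.
def eq_strategy(value: float, c: float) -> str:
    if value < c:
        return "elicit correcting action to increase Function(x)"
    if value > c:
        return "elicit correcting action to decrease Function(x)"
    return "predicate satisfied; no correcting action needed"

# e.g. a private room of 150 sq ft against the constant 200
print(eq_strategy(150, 200))
print(eq_strategy(250, 200))
```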
Let us now consider an example.
Example I: Each private room must have an area of 200 ft², expressed as

∀x[privateRoom(x) → EQ(area(x), 200)]
Applying Guideline I
Let the requirements engineer elicit the triggering action “create private room”. This
makes the LHS true. At this moment, the RHS is false and therefore the first row of
Table 4.2 is applied. The elicitation strategy suggested is to elicit correcting actions to
1. Increase the value of area(x). The elicited actions are
(a) Rebuild private room and
(b) Expand private room.
2. Reduce the value of area(x). The elicited action is
(a) Partition private room.
Using the elicited triggering and correcting actions, the following rules are
obtained:
• WHEN create pr IF LT(area(pr),200) THEN Rebuild pr
• WHEN create pr IF LT(area(pr),200) THEN Expand pr
• WHEN create pr IF GT(area(pr),200) THEN Partition pr
Applying Guideline II
Suppose triggering action “partition room” causes the available area of a room to
reduce. This makes RHS false. A correcting action needs to be elicited to make
LHS false. Notice that LHS is a simple formula and so actions can be elicited as
with SS type of policy. Assume that the elicited correcting action is “relocate
private room”.
The policy enforcement rule obtained is
• WHEN partition pr IF !EQ(area(pr),200) THEN relocate pr
Example II: Consider the following example:
Applying Guideline I
Let the triggering action be “create spw”. Notice that correcting actions will be formed
by a combination of actions for the LEQ and GT predicates. Table 4.3 suggests four possibilities.
When a new semi-private ward is created, the number of beds is zero and so the
function count(b) gives the value zero. This makes the LEQ predicate evaluate to
true and the GT predicate to false. As suggested by the second row of Table 4.3, an
elicitation strategy to make GT true has to be explored in order to make the entire
RHS true. Thus, applying the fourth row of Table 4.2, the elicited actions may be
(a) Purchase bed and
(b) Transfer bed.
So the policy enforcement rules are
• WHEN create spw IF !GT(count(b),1) THEN Purchase b
• WHEN create spw IF !GT(count(b),1) THEN Transfer b
Applying Guideline II
Now, the removal of a bed from the ward may result in a bedless ward, thereby
violating GT(count(b),1). Correcting actions to make LHS false must be obtained.
Let this be “Relocate semi-private ward”.
The policy enforcement rule obtained is
• WHEN remove b IF !(GT(count(b),1)) THEN Relocate spw
CS-Type Policy
Example I: Consider the policy “Provide provident fund to all nurses and house-
keepers”. Its range variables are
<nurse> <n>
<housekeeper> <hk>
<Provident Fund> <pf>
Applying Guideline I
When either new nurses or new housekeepers are recruited, then a corresponding
correcting action is to be taken so that RHS becomes true. Let this action be to allot
provident fund. The policy enforcement rules obtained are
• WHEN recruit n IF !provide (n, pf) THEN Allot pf
• WHEN recruit hk IF !provide (hk, pf)THEN Allot pf
Applying Guideline II
Suppose provident fund is stopped for some employee. This makes RHS false. It is
now required to make LHS false. This may be done by the following correcting
actions:
CC-Type Policy
A CC policy is a combination of the CS and SC types of policies. Thus, the
elicitation strategies shown in Tables 4.2 and 4.4 can be applied.
Consider the following example.
Example: The total number of Wooden or Fibre Panchakarma tables must be
expressed as

∀x ∃wtabSet ∃ftabSet [fibreTable(x) OR woodTable(x) → Sum(count(ftabSet), count(wtabSet), 2)]

Here,
wtabSet is the set of wooden tables and ftabSet is the set of fibre tables.
Range variables are
<wooden table> <wt>
<fibre table> <ft>
Applying Guideline I
When a new wooden or fibre table is purchased, then this may disturb the total
number of wooden and fibre tables in the hospital. Both LHS and RHS are com-
plex. Applying row 9 of Table 4.2, we get
1. Elicit action to reduce the sum of tables
(a) Discard fibre table and
(b) Discard wooden table.
2. Elicit action to increase the sum of tables
(a) Purchase wooden table and
(b) Purchase fibre table.
Applying Guideline II
For an elicited triggering action that causes the sum to be unequal to 2, a correcting
action must be elicited to make the LHS false. Let these be
(a) Discard wooden table,
(b) Stop purchasing wooden table,
(c) Discard fibre table, and
(d) Stop purchasing fibre table.
The enforcement rules are
• WHEN add wt IF woodTable(wt) THEN Discard wt
• WHEN add wt IF woodTable(wt) THEN Stop Purchasing wt
• WHEN add ft IF fibreTable(ft) THEN Discard ft
• WHEN add ft IF fibreTable(ft) THEN Stop Purchasing ft
When a policy enforcement rule has been formulated, then the set of correcting
actions to be taken in the organization is known. This starts off the process of
arriving at operational decisions shown in Fig. 4.5. From the set of rules, correcting
actions are extracted and this forms the initial choice set of actions or the initial set
of decisions. Since a decision may involve other decisions, a decision is structured
as a hierarchy.
From the requirements point of view, our task is to discover the operational
decisions. The process shown in Fig. 4.5 is to be interpreted in this light. The
correcting actions are high-level decisions that can be decomposed into simpler
ones. The requirements engineering task is to elicit this structure.
The correcting action suggested by the policy enforcement rule may have its own
structure. This structure is an adaptation of the AND/OR tree used in Artificial
Intelligence for the reduction of problems into conjunctions and disjunctions.
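Assuming each decision node records whether its children are ANDed or ORed, the reduction of a correcting action into simpler decisions can be sketched as follows; the tree and its node names are illustrative:

```python
# AND/OR tree for decomposing a correcting action into simpler decisions.
# AND children must all be taken; OR children are alternatives.
from dataclasses import dataclass, field

@dataclass
class Decision:
    name: str
    link: str = "AND"  # how children combine: "AND" or "OR"
    children: list = field(default_factory=list)

    def alternatives(self):
        """Leaf-level ways of carrying out this decision (OR = choice,
        AND = every combination of the children's alternatives)."""
        if not self.children:
            return [[self.name]]
        child_alts = [c.alternatives() for c in self.children]
        if self.link == "OR":
            return [alt for alts in child_alts for alt in alts]
        combined = [[]]
        for alts in child_alts:  # AND: cross-product of children
            combined = [acc + alt for acc in combined for alt in alts]
        return combined

# Illustrative decomposition of the correcting action "Start private ward"
ward = Decision("Start private ward", "AND", [
    Decision("Choose Location", "OR",
             [Decision("Site A"), Decision("Site B")]),
    Decision("Choose Department"),
])
print(ward.alternatives())
```

Each inner list is one complete way of carrying out the root action, which is the choice that decision-making later selects between.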
[Fig. 4.5: the process of Extract Actions, then Decision Decomposition or
specialization-generalization, then Decision Extraction; illustrated by an example
decision hierarchy rooted at Start private ward, with choices such as Choose
Location and Choose Department.]
4.4.1 Architecture

Fetchit consists of a front end and a back end:

• Front end: This part supports the three kinds of decision-making, namely
formulating policies, formulating policy enforcement rules, and taking opera-
tional decisions, together with information elicitation.
• Back end: This part of Fetchit deals with information elicitation, maintenance of
the elicited information in the information repository, and providing guidance in
the information elicitation task.
The architecture of the front end of Fetchit is shown in Fig. 4.8. There are four
main components, one each for the three kinds of decision-making, formulating
policies, policy enforcement rules and operational decisions, and one for eliciting
information. The four components are accessed via a user interface as shown in the
figure. The four components interact with a repository. Let us look at the function
of each component in detail.
Policy formulation: Policy formulation requires two steps, construction of the
policy hierarchy and formulating the policy. The former is carried out by the policy
hierarchy maker, while the latter is done by the policy formulator. These two
components interact with the policy base.
The policy base is organized along two dimensions, namely policy type and
domain of the organization. An organization is treated as a function with policies
that govern its input, the outputs it produces, and the processing it performs to
obtain the output. Consequently, there are policies to regulate input like infras-
tructure, material, etc.; output policies regulate the amount of output and its nature;
process policies regulate the process. Besides these three policy types, there is a
fourth kind that specifies the impact of the output on the organization. These are
outcome policies. Now consider the domain dimension of the policy base. Here,
policies can be viewed as belonging to domains such as medical, life sciences,
engineering, etc.
[Fig. 4.8: the user interface sits above four components, Policy Formulation, PER
Formulation, Operations, and the Early Information elicitor, which interact with the
Policy Base, PER Base, Op Action Base, and Early Information Base in the
repository.]
Before starting policy formulation, the requirements engineer enters the domain
of the policy being formulated as well as the name of the business. The require-
ments engineer is given the option to view existing policies either domain-wise or
policy type-wise. The third option is to retrieve based on specific terms in the policy
like policies that contain the term “area”.
If the domain is a new one, then the policy base returns empty handed when the
domain is entered by the requirements engineer. In this case, or in the case when the
engineer does not want to use legacy policies, each policy is input as an expression
of the first-order logic form considered above. The logic expression is broken up
into its components and the policy hierarchy is constructed by the policy hierarchy
maker.
In either case, whether policies are reused from the database or input afresh, at this
stage we have the policy components of interest. Now, the policy formulator
presents, for each node of the policy hierarchy, the choice to select, reject, or
modify it. The formulated policies are kept in the policy base.
PER formulation: There are two parts, one the action elicitor and the other
the policy enforcement rule maker. Organizational policies retrieved from the
policy base are presented to the action elicitor. The action elicitor applies the two
macro-guidelines and correcting and triggering actions are elicited. The PER action
base is populated with elicited actions. The elicited actions are now used as input to
the policy enforcement rule maker where actions are filled into the
WHEN-IF-THEN form. These are then stored in the PER base. Notice that the
requirements engineer role requires familiarity with extended first-order logic. This
expertise may exist in the requirements engineer directly or it may be obtained
through appropriately qualified experts.
Operational decisions: The PER actions in the PER action base are presented to
the action hierarchy maker. New actions are discovered from the actions presented
and stored in the OP action base.
Early information elicitor: The fourth component of the tool is the early
information elicitor. The aim of this part is to elicit information for each decision
that has been built. As already discussed, this part shall be dealt with in Chap. 5
after the information elicitation techniques have been described.
Let us now consider the user interface offered by the three front-end components of
Fetchit, namely policy formulation, PER formulation, and operations.
Policy Formulation
There are two interfaces, one each for the policy hierarchy maker and the policy
formulator. These are shown in Figs. 4.9 and 4.10, respectively.
4.4 Computer-Aided Support for Obtaining Decisions 95
Once the policy has been formulated, the requirements engineer changes the
first-order expression of the policy to reflect the policy hierarchy formulated. This
statement is then stored in the policy base for later use.
PER Formulation
PER formulation consists of two components, action elicitor and the policy
enforcement rule maker. We consider the user interfaces offered by Fetchit for these
in turn.
The user interface of the action elicitor is shown in Fig. 4.11. The policy of interest
is shown at the top of the screen. There is provision to display existing range
variables, and a new range variable can be defined by using the “Enter new Range
Variable” button.
The middle panel in Fig. 4.11 is where Guideline I is applied and the right-hand
panel is where Guideline II is applied. Existing triggering and correcting actions can
be viewed by selecting the “Existing actions” button. A new action can be entered by
using the “Insert new Action” radio button, which opens a corresponding textbox.
Figure 4.11 shows the process of eliciting actions for the policy “∀x[Ayurvedic
(x) → Run(x, OPD)]”. In the center panel, the triggering action create x is dis-
played after being obtained from the action base. Similarly, construct y is obtained
and displayed. The panel also shows that the requirements engineer is entering a
new correcting action, start y, by selecting “Insert new Action”. Guideline II has not
yet been applied.
Let us now consider the second component of PER formulation, the policy
enforcement rule maker, or PER maker. This component formulates policy enforce-
ment rules once actions have been elicited and stores them in the PER base. The
input to the PER maker is the set of actions in the action base. The user interface for the PER
maker is shown in Fig. 4.12.
The policy for which the enforcement rules are being formulated is displayed on
the top left corner of the screen in Fig. 4.12. The requirements engineer can either
view already existing rules or insert a new rule. When the former is selected, a list of
the rules along with the range variables present in the PER base is displayed. When
the latter is selected, a panel partitioned into WHEN, IF, and THEN subsections is
displayed. This partitioning corresponds to the WHEN-IF-THEN representation of
the policy enforcement rules. Figure 4.12 shows the elicited triggering actions on the
WHEN side and the elicited correcting actions on the THEN side of the panel. The
requirements engineer selects the desired triggering action from the WHEN side and
the desired correcting action from the THEN side. The selected actions are high-
lighted. The requirements engineer keys in the IF condition. In order to save this
policy enforcement rule, the Generate Policy Enforcement Rule button found at the
bottom of the screen is clicked. The rule is saved in the PER base.
Operations
The Operations component consists of the action hierarchy maker. Its user interface
is shown in Fig. 4.13. The screenshot of the figure is divided into three sections.
The left-hand side of the screen shows the range variables and the PER actions that
are from the PER action base. The upper panel on the right-hand side of the screen
is where new range variables and new actions are defined. The bottom panel of the
screen is where the action hierarchy is constructed.
Defining new actions may require defining new range variables. For this, the “Enter
new Range Variable” button has been provided; on checking its radio button, new
range variables can be entered. If an existing range variable is to be used, the “Use
existing Range Variable” button is clicked and the necessary range variable is
selected from the list provided. New actions are then defined and ready for use.
4.5 Conclusion
There are three kinds of decisions: for policy formulation, for PER formulation, and
for operational actions. Policy formulation decisions are obtained from policy
hierarchies that display the hierarchical structure of policies. Each node of this
hierarchy is associated with operators to select, reject, and modify it. Therefore, we
get decisions that are three times the number of nodes in the policy hierarchy. These
decisions are put through the information elicitor part of Fetchit.
Policy enforcement rule formulation elicits triggering and correcting actions from
which rules are formulated. Associated with each rule are the three operators
select, reject, and modify. Thus, yet again, we get a collection of decisions that are
three times the number of rules. Information elicitation for these is to be carried out.
Operational decisions are obtained by constructing AND/OR trees for the
correcting actions of the policy enforcement rules of an organization. During
decision-making, choice exists in selecting between the ORed branches of the tree.
Again, information is to be elicited for the actions of the AND/OR tree.
We have also seen that information elicitation is to be done for each node of the
policy tree as well as for each node in the AND/OR tree. This elicitation must
address the managerial concerns identified in Chap. 3. A special approach is needed
for information elicitation for supporting decisions for policy enforcement rule
formulation. This is because the supporting information can be directly derived
from the statement of the policy for which the rule is formulated.
References
1. Lindbloom, C. E., & Woodhouse, E. J. (1993). The policy-making process (3rd ed.). Prentice
Hall.
2. Hillman, A. J., & Hitt, M. A. (1999). Corporate political strategy formulation: A model of
approach, participation, and strategy decisions. Academy of Management Review, 24(4),
825–842.
3. Park, Y. T. (2000). National systems of advanced manufacturing technology (AMT):
Hierarchical classification scheme and policy formulation process. Technovation, 20(3),
151–159.
4. Kelly-Newton, L. (1980). Accounting policy formulation: The role of corporate management.
Addison Wesley Publishing Company.
5. Ritchie, J. R. B. (1988). Consensus policy formulation in tourism: Measuring resident views
via survey research. Tourism Management, 9(3), 199–212.
6. Cooke, P., & Morgan, K. (1993). The network paradigm: New departures in corporate and
regional development. Environment and planning D: Society and space, 11(5), 543–564.
7. Ken, B. (2007). Social policy: An introduction (3rd ed.). Open University Press, Tata
McGraw-Hill.
8. Wies, R. (1994). Policy definition and classification: Aspects, criteria, and examples. In
International Workshop on Distributed Systems: Operations & Management, Toulouse,
France, October 10–12, pp. 1–12.
9. Anderson, C. (2005). What’s the difference between policies and procedures? Bizmanualz.
10. OMG. (2008). Semantics of business vocabulary and business rules (SBVR), v1.0, January
2008.
11. The Business Rules Group. (2010). The business motivation model: Business governance in a
volatile world, Release 1.4.
12. Leite, J. C. S. P., & Leonardi, M. C. (1998, April). Business rules as organizational policies.
In Software Specification and Design (pp. 68–76). IEEE.
13. Fu, G., Shao, J., Embury, S. M., Gray, W. A., & Liu, X. (2001). A framework for business
rule presentation. In Proceedings of 12th International Workshop on Database and Expert
Systems Applications, 2001 (pp. 922–926). IEEE.
14. Auechaikul, T., & Vatanawood, W. (2007). A development of business rules with decision
tables for business processes. In TENCON 2007–2007 IEEE Region 10 Conference (pp. 1–4).
IEEE.
15. Muehlen, M. Z., & Kamp, G. (2007). Business process and business rule modeling: A
representational analysis. In Eleventh International IEEE on EDOC Conference Workshop,
2007, EDOC07 (pp. 189–196). IEEE.
16. Rosca, D., Greenspan, S., Feblowitz, M., & Wild, C. (1997). A decision making methodology
in support of the business rules lifecycle. In Proceedings of the Third IEEE International
Symposium on Requirements Engineering, 1997 (pp. 236–246). IEEE.
17. Navathe, S. B., Elmasri, R., & James, L. (1986). Integrating user views in database design.
IEEE Computer, 19, 50–62.
Chapter 5
Information Elicitation
Having looked at the processes by which decisions can be determined, we now
move on to the second aspect of DWRE mentioned in Chap. 3, namely, that of
eliciting information relevant to each decision. Decisions represent the useful work
that is supported by the data warehouse. This work can be to formulate policies, to
formulate policy enforcement rules, or to take operational decisions. The decision–
information association is important because it defines the purpose served by the
data warehouse. We use this association to define the notion of a decision require-
ment, which forms the basis of our information elicitation technique.
We have seen in Chap. 3 that DWRE would benefit from systematization of the
information elicitation process. To explicitly identify the benefit, we now examine
in greater detail the methods used by different DWRE techniques. Since stake-
holder–requirements engineer interaction is paramount, we lay down four basic
principles that, if followed, lead to good-quality interaction. Thereafter, we formulate
our information elicitation approach.
5.1 Obtaining Multidimensional Structure

DWRE methods presented in Chap. 2 treat the definition of facts and dimensions of
the data warehouse to be built as part of the requirements engineering process.
These methods adopt a variable number of steps in the process of reaching the
multidimensional (MD) structure of the data warehouse.
Table 5.1 shows that DWRE methods use processes of between one and three
steps. One-step processes are based on the idea that enough information
should be obtained as early as possible to arrive at the MD structure. This is the
view of the methods in the fourth, fifth, sixth, and ninth rows of the table. That is, as
soon as a stakeholder determines that a piece of information is relevant in the data
warehouse, these methods immediately identify whether it is a fact or a dimension.
© Springer Nature Singapore Pte Ltd. 2018 101
N. Prakash and D. Prakash, Data Warehouse Requirements Engineering,
https://doi.org/10.1007/978-981-10-7019-8_5
There is an overriding concern with the data model of the data warehouse to-be. The
process of obtaining information and of converting the elicited information into facts/
dimensions is completely implicit, unarticulated, and unsystematic.
In contrast, multistep processes break up the complex requirements engineering
task into relatively more manageable pieces. Each step focuses on a single aspect of
the task: obtaining information, building a conceptual schema, or constructing the
MD structure. Consequently, it is possible to develop techniques to
address the concerns of each step. Let us now take up such multistep processes.
Two-step approaches are shown in the first, second, and third rows of Table 5.1.
The piece of information acquired from the stakeholder is initially represented in
some generic form in the first step, and is then converted into the MD structure. In
the method of row 1, service measures are obtained from stakeholders and mapped
to SERM in step one; the MD structure is then derived from the SERM
diagram. In the method of the third row, information obtained from abstraction
sheets, as quality focus, variation factors, etc., in step one is converted to the MD
structure in step two. The method of row 8 suggests obtaining information in the form
of tables before building the MD structure.
The two-step process separates the issues of obtaining information and its initial
representation from building the MD structure. The manner of conversion from the
former to the latter can be studied and systematized. Thus, for example, algorithms
like those developed in [1–3] for converting an ER schema to an MD structure can
guide this conversion task.
Three-step processes, see rows three and seven of Table 5.1, provide a process
step for the stakeholder to articulate the needed information. In the method of the
third row, a typical interaction between the decision-maker and the information
elicitation system is represented as an information scenario. During this interaction,
the decision-maker formulates typical queries to obtain information in an SQL-like
language. This information is then represented as an ER schema for later conversion
into MD structures. The method in the seventh row of the table obtains the key
performance indicators, KPIs, of a decision-maker, treats these as functions, and
then determines the input parameters of these functions and the output produced.
Inputs and outputs are the needed information. This method develops a variant of
use cases of UML. Again, an ER schema is built that is converted to MD structure.
Three-step processes move away from merely structuring information to
obtaining information from the stakeholder and then structuring it. They attempt to
systematize the entire task right from information elicitation through defining MD
structures.
However, these processes do not provide guidance and support in the task of
information elicitation. Thus, for example, in the informational scenario approach,
the manner in which the SQL-like query is identified and the response formulated is
not articulated.
5.3 The Decision Requirement Model

Our information elicitation technique uses the decision requirement model [4]. This
model captures the structure of a decision, the structure of information, and the
relationship between the two. Information that is relevant to a decision is
represented in the decision requirement model as a decision requirement. This
association is textually written as <decision, information>. In Fig. 5.1, it is modeled
as an aggregation of decision and information. The decision–information
relationship is N:M, since a decision may have more than one piece of information
associated with it and a given piece of information may be associated with more
than one decision.
[Fig. 5.1: the decision requirement model. Decision and Information are aggregated
into a decision requirement; a Decision is a member of a Choice Set; a Choice Set is
relevant to a Situation; and Information is expressed for a Situation.]
The first member of this choice set aims to reduce the physical rush of patients
on site, whereas the second member of the choice set is for handling of a larger
number of patients on site.
Figure 5.1 shows an N:M relationship between choice set and decision. The
choice set considered above consists of more than one decision. Further, the first
member of Reduce patient rush, namely, register patients online, can be a member
of another choice set, say, one for improving the medical services. This shows the
N:M relationship.
The figure also shows a 1:N relationship between choice set and situation. That
is, a choice set is applicable to many situations but a situation has only one choice
set associated with it. In our example, there is one choice set associated with the
situation to handle the rush of patients. However, this choice set can also be
applicable to the situation that requires improvement of medical services.
There are two constraints, namely, coherence and cardinality constraints on a
choice set. Coherence ensures that all elements of a choice set must achieve the
same purpose. Consider the choice set, CSET = {Increase bed count, Optimize bed
use, Increase units}. All members of this set have the same purpose, namely, to
reduce the rush of patients. This choice set is coherent. An example of an incoherent
choice set is CSET1 = {Increase bed count, Optimize bed use, Open research unit}.
The member, open research unit, does not help in reducing the rush of patients.
Therefore, CSET1 is incoherent.
The cardinality constraint says that the number of members of a choice set must be
greater than one. Clearly, a choice set with no alternatives, that is, with cardinality 0,
is undefined. If the cardinality is 1, then there is exactly one alternative and there is
no decisional problem. For a decision problem to exist, the cardinality of the choice
set must be greater than one. The cardinality constraint ensures that this is indeed so.
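If each member of a choice set is tagged with the purpose it serves, the two constraints can be checked mechanically. A sketch using the CSET and CSET1 examples above, where the purpose labels are our own annotation:

```python
# Check the coherence and cardinality constraints on a choice set.
# Members are (decision, purpose) pairs; coherence requires a single
# shared purpose, cardinality requires more than one member.
def is_coherent(choice_set) -> bool:
    return len({purpose for _, purpose in choice_set}) == 1

def is_valid(choice_set) -> bool:
    return len(choice_set) > 1 and is_coherent(choice_set)

CSET = [("Increase bed count", "reduce patient rush"),
        ("Optimize bed use", "reduce patient rush"),
        ("Increase units", "reduce patient rush")]
CSET1 = CSET[:2] + [("Open research unit", "research")]

print(is_valid(CSET))      # True: coherent, cardinality 3
print(is_valid(CSET1))     # False: incoherent
print(is_valid(CSET[:1]))  # False: cardinality 1
```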
Figure 5.1 shows the relationship between situation and information. This is an
M:N relationship.
The decision metamodel, expressed in UML notation, shows three kinds of
decisions: atomic, abstract, and complex. An atomic decision cannot be decomposed
into subdecisions. An abstract decision is arrived at by using generalization/
specialization principles to produce ISA relationships between decisions. Finally, a
complex decision is composed of other simpler decisions. Complex decisions form
an AND/OR hierarchy. This is captured in Fig. 5.2 by the concept Link, which can
be an AND or an OR; these subtypes of links are not shown in the figure.
[Fig. 5.2 and accompanying examples: a Decision is composed of other decisions;
an example AND/OR hierarchy for Set up New Assembly Line includes alterna-
tives such as Start New 1-tonne Assembly Line and Start New 13-tonne Assembly
Line.]
5.3.3 Information
The notion of information is also explained in Fig. 5.1. The detailed information
model [4] is shown in Fig. 5.5. Let there be a set of decisions D = {D1, D2, …, Dn}.
Each Di, 1 ≤ i ≤ n, is a member of its own choice set and is associated with its
relevant information Ii. Then, the set of information relevant to D, represented as
Information in Fig. 5.5, is defined as the union of these information sets:

I = I1 ∪ I2 ∪ … ∪ In
[Fig. 5.5: the information model. Information is categorized by Attribute (an N:M
relationship) and specialized into Detailed, Aggregate, and Historical information;
Aggregate is computed from other information; Historical has a Period and a
Temporal Unit; and Composition, with subtypes Report and Comparison, is
composed of information.]
Now, data warehouse technology tells us that there are three kinds of information:
detailed information, summarized or aggregated information, and historical
information. All these kinds of information have their own dimensions.
Figure 5.5 expresses this through a typology of information, shown by the ISA
links between Information and Detailed, Aggregate, and Historical information.
Detailed information is raw, unprocessed information. Aggregate information is
obtained by computation from other detailed, aggregate, or historical information;
this is modeled by the “Computed from” relationship between Aggregate and
Information. Historical information is defined by its two
properties, period or duration of the history (5, 10, etc.) and temporal unit that tells
us the time unit, month, quarter etc., for which history is to be maintained.
Figure 5.5 shows a special kind of information called composition. The idea of
composition is to define a meaningful collection of logically related information.
There are two kinds of compositions, namely, reports and comparisons, as represented by the ISA links between these and Composition. A report is built from logically related detailed, aggregate, or historical information, as well as from comparisons. A comparison is a composition of information that enables the decision-maker to compare organizational data. It may contain rankings like top ten or bottom ten, or may bring out the similarities/differences between a variety of information.
The specific properties of information are modeled as attributes. There is an N:M
relationship between attribute and information as shown in Fig. 5.5.
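The typology can likewise be sketched as classes: Detailed carries no extra structure, Aggregate records what it is computed from, and Historical carries its period and temporal unit. Composition, Report, and Comparison are omitted for brevity, and all names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Information:
    name: str
    attributes: List[str] = field(default_factory=list)  # N:M with Attribute

@dataclass
class Detailed(Information):
    pass  # raw, unprocessed information

@dataclass
class Aggregate(Information):
    computed_from: List[Information] = field(default_factory=list)

@dataclass
class Historical(Information):
    period: int = 1               # duration of the history, e.g. 5
    temporal_unit: str = "month"  # month, quarter, ...

# The waiting-time example of this chapter
waiting = Detailed("waiting time", attributes=["patient type"])
avg_wait = Aggregate("average waiting time", computed_from=[waiting])
history = Historical("average waiting time history", period=3, temporal_unit="month")
```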
[Fig. 5.6: CSFI Analysis screen, showing the CSF Prompt Medicine Delivery and the parameter Waiting time of patients.]
For the former, the needed information is the average waiting time categorized by patient type. A monthly record of this information is to be kept for a quarter.
The table implies that CSFI elicitation is done in three steps, (a) CSF association
with a decision is elicited, (b) CSF variables are obtained, and (c) information is
identified to populate the model shown in Fig. 5.5. The interface of the elicitation
tool for CSFI elicitation is shown in Fig. 5.6.
The figure shows that elicitation is being done for the decision add new phar-
macy. The interface allows (a) selection of a CSF from the list of CSFs displayed or
(b) to enter a new CSF. The figure shows that the CSF, Prompt medicine delivery,
has been selected and the variable, waiting time for patients, has been entered.
A subsequent screen, not shown here, then obtains the information detailed in the
fourth column of Table 5.3.
Again, consider the same decision of adding a new pharmacy but for estimating
achievement of the Ends. The End associated with it is full utilization. An effec-
tiveness variable for assessing the effectiveness of full utilization is the customer
service provided. The information needed for this variable is the total amount of
sales for each medicine sold by the pharmacy. The table shows that the variable can
also be estimated by keeping information of the number of transactions carried out
during each shift.
Though not brought out in the example here, it is possible for a decision to have
several Ends. Each End can have several variables.
efficiency variables for the former are shown in the table. These are the resources
that shall be created and include civil work, electrical work, fixtures and furniture,
equipment, etc. The information needed is the estimation of cost for each resource
that shall be provided. The second row of the table shows another variable, Time,
for the required time to establish the new pharmacy.
The second means is estimated by the extent of reuse of existing resources and
the time to become operational.
Feedback information elicitation aims to determine the impact of a decision and each
element impacted is seen to be a variable. As before, the information for the variable
is then elicited. As shown in Table 5.6, the three aspects of interest are the decision,
the feedback variable that captures the impact, and the information to be kept.
Consider, again, the same decision add new pharmacy. Adding the new phar-
macy has the effect of creating a better perception of our health service. This change
has resulted in an increase in the number of registered patients that in turn leads to
addition of a new pharmacy. The second variable says that in order to keep the new
pharmacy fully utilized, additional medical staff may be required that in turn affects
the number of pharmacies. This feedback cycle starts from the outcome of add new
pharmacy and returns back to it.
Table 5.6 shows the feedback variables and the information required.
The techniques described in Sect. 5.4 have their own elicitation process consisting
of two or three steps. We have provided details of this micro-level guided process.
However, as mentioned, the use of multiple elicitation techniques, corresponding to
the factors of interest, shall be beneficial. This implies that there is a global,
macro-level elicitation process that suggests interleaving of these micro-level
processes.
5.5 The Global Elicitation Process
Our global, multi-factor elicitation process takes as input the set of decisions D. Each decision of D guides the requirements engineer in determining the relevant achievement factors, namely, CSF, etc. The micro-level elicitation technique(s) associated with each factor are then deployed.
This determination can be done in different ways as follows:
(a) Decision-wise: This process of determining information picks up a decision
from D and then applies each technique, one after the other. After all the
techniques have been applied, then the next decision is taken up. The process
ends when information elicitation has been carried out for all decisions in the
set.
This process has the benefit that it minimizes the number of visits to the
stakeholder for obtaining the required information. This is because, in principle,
all information can be obtained about the decision in one sitting. Therefore, if
stakeholders can be available for long sessions, then this technique works well.
An additional session, in all probability, will be required for verification
purposes.
(b) Sliced decision-wise: This is a variant of the decision-wise process in which
information elicitation is done from only one of the four elicitation techniques.
The sliced decision-wise process needs several, relatively shorter duration
requirements engineer–stakeholder sessions. At the beginning of each session,
the previous work done could be verified.
(c) Technique-wise: This process gives primacy to the information elicitation
technique. The requirements engineer selects one of the four techniques and
applies it to the decisions of D, one by one. The process ends when all the
techniques have been applied to all decisions of D.
This process requires several sessions with each stakeholder. Each session is
shorter than the one in the decision-wise process. This process works well when
stakeholders cannot be available for a long interactive session but can give
several shorter duration appointments. At the beginning of each session, the
previous work done could be checked out for correctness.
(d) Sliced technique-wise: This process breaks up the technique-wise process into
smaller parts. When a stakeholder has a stake in more than one decision in D,
then there are two aspects to these processes that are interesting. These are as
follows:
• The stakeholder is guided to examine the relevance of all the factors and
encouraged to do complete requirements analysis.
• The stakeholder can prioritize the factors considered important.
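The decision-wise and technique-wise orderings differ only in which loop is outermost; the sliced variants break these sequences into shorter sessions. A sketch (decision and technique names are illustrative):

```python
decisions = ["Add new pharmacy", "Start OPD"]
techniques = ["CSFI", "ENDSI", "MEANSI", "Outcome Feedback"]

def decision_wise(decisions, techniques):
    # one long session per decision: apply every technique before moving on
    return [(d, t) for d in decisions for t in techniques]

def technique_wise(decisions, techniques):
    # several shorter sessions: apply one technique to every decision
    return [(d, t) for t in techniques for d in decisions]
```

Both orderings visit the same (decision, technique) pairs; only the session structure differs.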
Indeed, as we will see in the example later in this chapter, decisions for formulating policies rely heavily on CSFI and ENDSI. However, MEANSI is not so relevant, perhaps because the means in this case are the policy enforcement rules since, evidently, these rules enforce policies.
Let us illustrate the use of the four elicitation techniques for formulating policies. Consider the policy that for all in-patient departments, IPD, there must be at least one doctor with the degree MD and the number of doctors must be at least 2. Let the range variable of doctor be y and let the condition on doctors be captured by the predicates degree(y, MD) and GEQ(count(D), 2). The expression in first-order logic, as discussed in Chap. 4, is as follows:

∀x∃y[IPD(x) → doc(y) AND GEQ(count(D), 2) AND degree(y, MD)]
For our example policy shown in Fig. 5.7, let there be a CSF, “patient satisfaction”.
Now, the hierarchy is traversed in the left to right, bottom-up manner. For each
node thus obtained, the information needed to assess CSF achievement is elicited.
For the node IPD(x), we get patient satisfaction measures for the in-patient
department as follows:
• Rate of patient admission,
• Prompt provision of medical aid,
• Availability of paramedical services,
• Rapidity of discharge procedure, and
• Referrals to other hospitals.
Fig. 5.7 Policy hierarchy for ∀x∃y[IPD(x) → doc(y) AND GEQ(count(D), 2) AND degree(y, MD)]
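Under hypothetical sample data, the first-order expression can be checked mechanically: IPD ranges over in-patient departments, doc and degree over their doctors, and count(D) is the size of each department's doctor set. The data and function below are illustrative only.

```python
# Doctors per in-patient department (illustrative data)
departments = {
    "cardiology": [{"name": "A", "degree": "MD"}, {"name": "B", "degree": "MBBS"}],
    "neurology":  [{"name": "C", "degree": "MBBS"}],
}

def policy_holds(departments):
    # ∀x∃y[IPD(x) → doc(y) AND GEQ(count(D), 2) AND degree(y, MD)]
    return all(
        len(docs) >= 2 and any(d["degree"] == "MD" for d in docs)
        for docs in departments.values()
    )

policy_holds(departments)  # False: neurology has one doctor and no MD
```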
[Elicitation screen residue: a CSFI Analysis screen for the policy, showing Entity Doctor, Attribute Degree, Category specialization-wise, and Function Count (Max, Min, and Avg are the other options); an existing CSF may be chosen or a new one added.]
For our policy shown in Fig. 5.7, let one of the ends be to provide comprehensive
patient service. The next step is to get hold of the effectiveness measures of this end.
This is done by traversing the hierarchy as before.
Consider IPD(x). For our ends, we get measures
• Referrals to other medical services and
• Medical services that are rented.
For each of these, we now elicit the information for evaluating it. For the former,
we get
• Number of operations performed every day and their history for 1-year period;
• Number of referrals every day; history for 1 year;
• Case of each referral; and
• Inward referrals by other medical facilities.
Thereafter, the second effectiveness measure, medical services rented, is taken
up.
As before, the elicited information is used to populate the information model and
elicited information is stored in the early information base shown in Fig. 5.11.
Both F1 and F2 can be complex. As defined in Chap. 4, this means that F1 and F2
can both have conjunctions, disjunctions, and n-adic functions.
5.7 Eliciting Information for PER Formulation 119
∀x[spw(x) → …]
Recall that spw denotes a semi-private ward. Therefore, we can surmise that we
need information of all semi-private wards.
Now, let us assume a different quantification as follows:
∃x[spw(x) → …]
Here, we need to obtain information to ascertain that there is at least one spw
meeting the condition on the right-hand side. Again, we need to obtain information
about all semi-private wards in order to do so. Therefore, it can be seen that, irrespective of whether universal or existential quantification is used, information about all semi-private wards is of interest.
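The point can be made concrete with a sketch: evaluating either quantifier over the semi-private wards requires, in the worst case, examining every ward, so the warehouse must hold information about all of them. The data and predicates here are illustrative.

```python
wards = [{"id": 1, "beds": 4}, {"id": 2, "beds": 2}, {"id": 3, "beds": 5}]

def universal(wards, cond):
    # ∀x[spw(x) → cond(x)]: every ward must satisfy the condition
    return all(cond(w) for w in wards)

def existential(wards, cond):
    # ∃x: at least one ward satisfies it; in the worst case all are scanned
    return any(cond(w) for w in wards)
```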
A reading of the policy may also suggest the specific information required. For example, consider a policy that places a condition on the salary of every nurse. This says that information about all nurses is needed and, more specifically, we need their salary.
Lastly, when a formula contains a function, then either the value of the function
must be computed on demand or its value must be available in the data warehouse.
A full reading of the policy helps to determine the nature of the required information. For example, consider a policy that constrains the area of a semi-private ward x in terms of a collection B of the beds that belong to it.
Here, B is a complex variable; bed is a predicate that returns true if the collection
in B are all beds; area is a function that computes the area of x; count is a function
that counts the number of B; and belongs is a predicate that returns true if B belongs
to x. The quantification suggests that we need to keep information about
semi-private wards and collections of beds. However, the predicate, belongs,
clarifies that the collection of beds is to be for each semi-private ward, that is
spw-wise. We need to consider whether the area of a semi-private ward shall be
stored or shall be calculated each time it is needed. In the latter case, enough
information, for example, length and breadth, will need to be determined. A similar
view is to be adopted for the function, count: should we keep the count of beds, or should we compute it dynamically when needed?
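The store-versus-compute choice for area and count can be sketched as follows; the class is hypothetical, not the book's design. A pre-computed value is used when present; otherwise the function is evaluated on demand from length and breadth.

```python
class SemiPrivateWard:
    def __init__(self, length, breadth, beds):
        self.length = length
        self.breadth = breadth
        self.beds = beds            # the collection B, kept spw-wise
        self.stored_area = None     # pre-computed derived value, if any

    @property
    def area(self):
        # use the stored value if available, else compute on demand
        if self.stored_area is not None:
            return self.stored_area
        return self.length * self.breadth

    def bed_count(self):
        return len(self.beds)       # count computed dynamically
```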
It can be seen that information elicitation for PER formulation can be done
directly from an interpretation of the policy as expressed in the first-order logic of
Chap. 4. The only issue is of determining whether historical information is needed
or not. This cannot be ascertained from the first-order expression. Therefore,
interaction with the PER decision-maker should be carried out to determine any
requirements for history. There is no need to deploy the four elicitation techniques
here.
Information elicitation for the first step is illustrated below. As shown, unlike
information elicitation for PER formulation, the techniques developed in Sect. 5.4
are used here. We illustrate this for eliciting information for the second rule above,
that is, for the action re-designate a, where a is an instance of AYUSH hospital.
1. CSFI Analysis: As discussed earlier, a CSF is to provide patient satisfaction. To
assess this factor, one piece of information needed is the total yearly count of
patients. This gives rise to the decision requirement <Re-designate x, annual
number of patients>. Applying the information model, Patient becomes Entity.
2. ENDSI Analysis: The objective or result of re-designate a can be to maximize
economic return. The effectiveness of this end can be assessed by Revenue
generated, that is estimated by cost per lab test, number of tests, service fees of
nurses, and consultancy fees of doctors. Applying the information model, lab
test, doctor, and nurse become entities with service fees and consultancy fees as
attributes for nurse and doctor entity, respectively.
3. MEANSI Analysis: Again consider the action “Re-designate a”, where a is an
instance of AYUSH hospital. One means to perform this action is to
re-designate the hospital by choosing another speciality. The efficiency with
which this is done is determined by the expertise available already in the hos-
pital. If enough expertise is available, then the re-designation shall be efficiently
carried out. Early information needed is about number of doctors having the
specialized qualification, number of patients with disease of the speciality, and
current equipment in the hospital in the area of the speciality, among others.
Again, applying the information model, doctor, patient, disease, and equipment
become entities.
The results of performing steps 1–3 above are summarized in Table 5.7. The
table has two columns, the first column for the information elicitation techniques
being applied to the action, re-designate x, and the second column describes the
information base.
The early information base contains variables obtained from CSFI, ENDSI, and
MEANSI. For each measure, entity, attribute, history, category, and function are
identified as part of the information base.
To illustrate, consider the PERs as follows. In these rules, priv refers to private
rooms. The rules are triggered when a new private room is to be created. If the area
of a currently available private room is less than 200, then the first two rules ask for
rebuilding it and expanding it, respectively. The third rule is for the case where the
area is greater than required, in which case the action is to partition the private
room.
(a) WHEN create priv, IF LT(area(priv), 200) THEN Rebuild priv
(b) WHEN create priv, IF LT(area(priv), 200) THEN Expand priv
(c) WHEN create priv, IF GT(area(priv), 200) THEN Partition priv
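The three rules follow the WHEN-IF-THEN shape of event-condition-action rules, so PER selection can be sketched as a small matcher that collects every rule whose event and condition hold; the decision-maker then chooses among the candidate actions. This is an illustration, not the PER machinery of Chap. 4.

```python
def lt(x, y): return x < y
def gt(x, y): return x > y

# PERs as (event, condition, action) triples
rules = [
    ("create priv", lambda area: lt(area, 200), "Rebuild priv"),
    ("create priv", lambda area: lt(area, 200), "Expand priv"),
    ("create priv", lambda area: gt(area, 200), "Partition priv"),
]

def triggered_actions(event, area):
    return [action for (when, cond, action) in rules
            if when == event and cond(area)]

triggered_actions("create priv", 150)  # ['Rebuild priv', 'Expand priv']
```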
Table 5.8 contains the information elicited for PER selection for rule (b). For the
Ends, service more patients, information is about the revenue that shall be generated
and for this purpose information about patients using the private room and their
income bracket is needed; the diseases for which treatment was offered by disease
type; and the 1-month history of diseases treated. Similarly, for the means, remodel
room, information about the resources to be raised for remodeling, time and cost as
shown are needed.
Now, let the second rule be selected after referring to the information available in
the data warehouse fragment for the three PERs. Expand priv is now processed
from the point of view of its implementation. Let Expand priv be atomic; it has no
AND/OR decomposition.
Information required to commit to it for operationalization is shown in
Table 5.9. The concern in ENDSI is now about the increased capacity that shall be created. For this, a history of the daily influx of patients in the ward is required. The new capacity to be created must be compared with the needed capacity to generate the required revenues and justify the decision. Similarly, in MEANSI, details of the
Table 5.8 Elicited information for selecting PER for Expand priv

Elicitation method | Factor | Variable | Entity | Attribute | History | Category | Function
ENDSI | Ends: Service more patients | Effectiveness: Revenue generated | Patient | Income | – | – | Count
 | | | Disease | Name | Month | Type | –
MEANSI | Means: Remodel room | Efficiency: Resources needed | Private room | Build time, Build cost, Rental cost, Space, Labor cost | – | – | –
remodeling to be carried out are needed. The costs of constructing new walls and
breaking any existing barriers need to be factored into the decision-making process.
Therefore, information for this is relevant and is to be kept in the data warehouse
fragment.
We notice that the information in Tables 5.8 and 5.9 differs in the level of detail required to finally select the action Expand priv. The Ends for both tables is the same, “Service more patients”. However, the effectiveness measure is different, with “revenue generated” at the PER level and “capacity of patients” at the operational level. Similarly, whereas the means at the PER level is “Remodel room” with resources needed as its efficiency measure in Table 5.8, in Table 5.9 the Means are “Construct barrier” and “Break barrier”.
Even though the level of detail at which information is elicited is different in the
two situations, the application of the techniques, the guidance provided, the nature
of information model, etc. remain the same.
Complex
If the action of the selected PER in step 1 of the operational decision-making
process can be decomposed into AND/OR hierarchy, then in the second step, the
decision-maker has to deal with a complex action.
Since during operational decision-making, it is possible that any of AND or OR
branch may be selected, the information elicitation process must be applied to all
nodes in the AND/OR hierarchy. Consider once again that rule (b) has been
selected and it is now required to carry out the second step of the operational
decision-making process. Further, Expand priv is now a complex action consisting
of two sub-actions, Remodel priv and Extend priv, as shown in Fig. 5.9. Let these
be in an OR relationship with each other.
The information to be elicited for Expand priv is now that for Remodel priv and Extend priv, respectively. The process followed in carrying out information elicitation is the same as before, though the elicited information may be quite different for the two components of Expand priv.
5.9 The Late Information Substage
The late information substage defined in Table 5.2 takes early information as input
and produces the ER schema as output. Let us consider the three types of
decision-making, policy formulation, PER formulation, and operational
decision-making from the perspective of building ER diagrams.
Information elicited for PER formulation, even though it is not based on the elic-
itation techniques introduced in Sect. 5.4, is in accordance with the early infor-
mation model of this chapter. Construction of the ER schema is done following the
guidelines of the next section.
In contrast, information elicited for operational decision-making is from our four
elicitation techniques in accordance with the information model. Therefore, yet
again, the guidelines of the next section can be followed to build the ER schema.
The first step is to resolve any naming conflicts that might arise. It is necessary to
ensure, for example, that doctor of CSFI elicitation and doctor of ENDSI elicitation
are the same. If this is not the case, then the requirements engineer needs to resolve
the conflicting names and find an agreed name for the concept.
Assuming that name conflicts have been resolved, the requirements engineer
now picks up the entities and their attributes from the elicited early information. If
history is required, then additional attributes to hold it are defined in the entity.
Categorization is handled in two ways. The first is to define an attribute of the
entity being categorized, for example, when categorizing disease by its type, we can
introduce an attribute, type, in the entity disease. The second is by defining an entity
for each different category. A relationship is then set up with this entity and the
entity to be categorized. For example, if patients are to be categorized by disease
type, then two entity types, patient and disease, are defined. A relationship is then
set up between these two.
Finally, functions may be handled either dynamically, computed as and when needed, or their values may be pre-computed and stored as derived attributes of entities. In the former case, functions appear as annotations indicating that they have to be computed.
The foregoing does not uniformly identify the relationships between the entities
of the ER schema. The requirements engineer needs to elicit these from stake-
holders during interaction.
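A rough sketch of these guidelines, assuming a simplified flat rendering of the early information base. History becomes an extra attribute, and categorization uses the first option, an attribute of the categorized entity; the entity-per-category option and the elicitation of relationships are not shown.

```python
def build_er(early_info):
    """Map early information records to a dict of entity -> attribute set."""
    entities = {}
    for item in early_info:
        attrs = entities.setdefault(item["entity"], set())
        if item.get("attribute"):
            attrs.add(item["attribute"])
        if item.get("history"):
            # additional attributes hold the required history
            attrs.add(item["history"] + " history")
        if item.get("category"):
            # first option: categorization as an attribute, e.g. Type
            attrs.add("Type")
    return entities

er = build_er([
    {"entity": "Patient", "attribute": "Income"},
    {"entity": "Disease", "attribute": "Name", "history": "Month", "category": "Type"},
])
```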
We illustrate the use of these guidelines in the construction of the ER schema for the information elicited in Table 5.8. Applying the guidelines, we obtain three entities of the ER schema, namely, Patient, Disease, and Private Room. We introduce the attribute Type, as shown in Fig. 5.10, for categorizing Disease. Further, as shown, Name and Income are attributes of Disease and Patient, respectively. The attributes of Private Room are obtained directly from the attributes found in the elicited early information. These are shown in the ER schema in Fig. 5.10.
[Fig. 5.10: ER schema with the entities Private Room (Space, Rental cost, Build cost), Patient (Income), and Disease (Name, Type), and the 1:1 relationship occupies between Private Room and Patient.]
The ER schema shows the relationship occupies between Private Room and
Patient. This relationship is obtained from the stakeholder during requirements
engineer–stakeholder interaction.
There are four user interfaces corresponding to the four methods of information
elicitation. We consider each of these in turn.
[Fig. 5.11: tool architecture. The user interface (Policy Formulation, PER Formulation, Operations) interacts with the Policy Base, the PER Base, the Op Action Base, and the Early Information Base.]
CSFI Analysis
Consider the action “start y”, where y is an instance of OPD. One CSF is to provide
quality care. Information needed to estimate this factor is as follows:
• Count of doctor and their specialization,
• Count of patients,
• Disease type, and
• Disease name.
Following our rules, Doctor, Patient, and Disease become entities. Specialization
becomes an attribute of the entity Doctor. For the entity Disease, the attribute is the
name of the disease. Disease is categorized type-wise.
The user interface for CSFI analysis is shown in Fig. 5.12. The top of the figure
shows the action for which information is being elicited and its relevant range
variables. The requirements engineer can choose to either select an existing CSF or
to create a new one.
Upon selecting the former, a list of the existing CSFs is displayed and the
desired CSF can be identified for modification. The latter option is for creating a
new CSF and the figure shows the provision made for this. All relevant information
about entity, attribute, etc. are also entered in the places provided.
[Fig. 5.12: CSFI Analysis screen, with entry of the CSF name (e.g., Prompt Medicine Delivery) and selection of a function (Count, Max, Min, Avg).]
ENDSI Analysis
Consider once again the action “start y”. Its end is to treat patients using the
traditional system of medicine, AYUSH. The effectiveness of this end is measured
by the variable patient capacity. An indication of effectiveness can be obtained by
keeping information about the patients serviced every day.
The user interface for ENDSI is as shown in Fig. 5.13.
The top of the figure shows the action for which information is being elicited.
Again, we have the two options of selecting an already existing End or creating a new End. The former shows a list of existing Ends. For the latter, the End as well as its effectiveness measure are entered. As before, entity, attribute, etc. are all entered.
MEANSI Analysis
Again consider the action “start y” that can be performed by constructing a new
OPD. The efficiency variable is land required.
Figure 5.14 shows the user interface. As can be seen, the figure is similar to the
ones for CSFI and ENDSI.
Outcome Feedback
In Fetchit, the requirements engineer can enter the sequence of outcomes consti-
tuting the feedback loop. Parameters of outcomes are entered in the screen shown in
Fig. 5.15. When the initial decision is reached, then the feedback loop is termi-
nated. The starting decision is not re-entered in the sequence.
[Figs. 5.13–5.15: the ENDSI, MEANSI, and Outcome Feedback screens. Fig. 5.15 shows the outcome Increase in registered numbers with the parameter Additional Medical Staff.]
The elicitor of Fetchit interfaces with the early information base as shown in
Fig. 5.11. The basic purpose of the information base is to store elicited information.
Each piece of elicited information is associated with the decision it is elicited for. It
may happen that the same piece of information is elicited for more than one
decision. In this case, the information will be associated with each decision sepa-
rately. Therefore, three bases, the policy base, PER base, and the Op action base,
interact with the early information base as shown in Fig. 5.11.
The repository supporting the early information elicitor tool is in three parts, as shown in Fig. 5.16. The decision base contains the decisions; the factor and variable base contains factors and variables; and the information base contains information resulting from the population of the information model. These three parts are related to one another as shown in the figure.
The repository exploits the relationship between the different bases to provide
traceability. Information in the information base can be traced back to its source
decision either directly or transitively through factors and variables. It is also
possible to retrieve information relevant to given decisions as well as to factors and
variables. A query facility exists to support this.
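Traceability of this kind amounts to a join across the three bases. Here is a sketch using an in-memory SQLite database; the table and column names are assumptions for illustration, not the actual schema of Fetchit.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE decision(id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE factor(id INTEGER PRIMARY KEY, name TEXT, decision_id INTEGER);
CREATE TABLE information(id INTEGER PRIMARY KEY, name TEXT, factor_id INTEGER);
INSERT INTO decision VALUES (1, 'Add new pharmacy');
INSERT INTO factor VALUES (1, 'Prompt medicine delivery', 1);
INSERT INTO information VALUES (1, 'average waiting time', 1);
""")

# Trace a piece of information back to its source decision,
# transitively through the factor and variable base
row = con.execute("""
    SELECT d.name
    FROM information i
    JOIN factor f ON i.factor_id = f.id
    JOIN decision d ON f.decision_id = d.id
    WHERE i.name = 'average waiting time'
""").fetchone()
```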
[Fig. 5.16: the repository. The decision base affects the factor and variable base, which in turn is related to the information base.]
5.11 Conclusion
A decision is the crucial concept that gives rise to our approach to data warehouse
requirements engineering. We refer to this approach as decisional requirements
engineering. The notion of a decision is closely related to the achievement factors,
CSF, ENDS, MEANS, and Outcome Feedback. Therefore, in making the decision–achievement factor relationship explicit, we get
• Guidance: all techniques guide the requirements engineering process by introducing factors and variables before determining early information; further, there is an information model to populate;
• The decision–information association, as well as the information–factor association; and
• Traceability of information.
The first issue that we addressed is that of creating interest in stakeholders to
become active participants in the requirements engineering process in a sustained
manner. We achieve this by determining important managerial concerns and
developing elicitation techniques for each of these. The second issue is to identify a range of techniques that is as comprehensive as possible. This reduces
the possibility of missing information requirements while doing requirements
engineering. We achieve this by developing a suite of techniques to be used
collectively.
Requirements engineers emphasize that a consensus on the requirements spec-
ification among all stakeholders is crucial. Failing such an agreement, system
development may suffer from, for example, conflicting requirements, and wrongly
prioritized requirements. This goes well with Sect. 2.6 where five factors that affect
alignment of business with the data warehouse were presented.
References
1. Golfarelli, M., Maio, D., & Rizzi, S. (1998, January). Conceptual design of data warehouses
from E/R schemes. In Proceedings of the Thirty-First Hawaii International Conference on
System Sciences, 1998 (Vol. 7, pp. 334–343). IEEE.
2. Hüsemann, B., Lechtenbörger, J., & Vossen, G. (2000). Conceptual data warehouse design. In
Proceedings of the International Workshop on Design and Management of Data Warehouses
(DMDW2000), Stockholm, Sweden.
3. Moody, L. D., & Kortink, M. A. R. (2000). From enterprise models to dimensional models: a
methodology for data warehouses and data mart design. In Proceedings of the International
Workshop on Design and Management of Data Warehouses, Stockholm, Sweden (pp. 5.1–
5.12).
4. Prakash, D., & Prakash, N. (to appear). A multi-factor approach for elicitation of information
requirements of data warehouses. Requirements Engineering Journal.
5. Prakash, N., Prakash, D., & Sharma, Y. K. (2009). Towards better fitting data warehouse
systems. In The practice of enterprise modeling (pp. 130–144). Berlin Heidelberg: Springer.
Chapter 6
The Development Process
In the last few chapters, we saw that there are three sources of decisions and each source has information associated with it. Thus, the policy formulation layer has policy formulation decisions and early information associated with each decision; the PER formulation layer has PER formulation decisions and associated information; and the operational decision layer has actions as decisions and, similarly, its own early information.
In this chapter, we look at the development of our Data Warehouse fragment
from an agile point of view. For this, a model-driven technique to arrive at the user
stories is discussed. The instantiation of the model gives us a requirements granule.
Since the development process is agile, multiple iterations are carried out and
multiple requirements granules and therefore multiple DW fragments are obtained.
This means that in fact there is a need for agile development and consolidation to
proceed together. This is to remove problems of inconsistency that will arise if there
are multiple DW fragments in the enterprise. Thus, this chapter also discusses a
five-step process to consolidation. This is a semi-automated process, and our tool
Fetchit supports the consolidation process.
As discussed in Chaps. 1 and 2, agile techniques like Scrum aim for rapid and
piecemeal product delivery where a user story is an identified requirement. It is
therefore crucial to write good user stories. However, the essential difficulty in
scripting user stories is that it requires highly skilled and experienced project
architects/product owners. Consequently, it is ad hoc and experience based. There
is a need for making this more systematic and providing better guidance in the task
of writing user stories.
The decision application model (DAM), see Fig. 6.1, is the basis for doing
model-driven agile data warehouse development.
A decision application (DA) consists of a set of elements about which a decision
is to be taken. For example, a DA that
• Formulates policies of an organization has as elements the various policies that
need to be accepted, rejected, or modified;
• Decides upon policy enforcement rules of the enterprise has as elements the
various business rules of the enterprise; and
• Judges what action is to be taken next has these actions as elements.
Figure 6.1 shows that an element may be simple or complex. A simple element
is atomic and cannot be decomposed further, whereas a complex element is built
from other elements. Thus, complex business rules of Chap. 4 are built out of
simpler business rules connected by logical operators.
There are two kinds of relationships among elements, namely, compatible and
conflict. An element may be in conflict with zero or more elements as shown by the
cardinalities of conflict. An element is compatible with one or more elements as
shown in Fig. 6.1. Let us consider these two relationships in turn.
When an element E1 is in conflict with another element E2, the former is achieved at the cost of the latter. This cost may be financial or in terms of performance, flexibility, or security. Conflict may also arise from diverging stakeholder interests, or simply because one element interferes with the achievement of others.
DAM suggests a hierarchy of concerns. The topmost level is the application level. It deals with elements and their properties, namely, simple/complex and compatible/conflict, respectively. Below it is the decision level, at which choice sets of decisions are constructed for the elements taken into the development run. It can be inferred that a requirement at this level is the backlog of decisions that meet the business strategic requirement. Clearly, the decision level is concerned with the tactics to be adopted in the business to deliver the identified strategy.
At the information level, we get information requirements. The structure of the information is treated as relatively unimportant; rather, it is important to establish a relationship between the elicited, relatively unstructured information and a decision at the decision level. Thus, a requirement here is the information relevant to each decision of the decision application.
It is clear that DAM gives a stack of requirements. Now, there are two ways to do DWRE. One approach is to slice the stack horizontally and process a level completely before moving to the next one. In other words, we look at requirements level-wise: requirements engineering yields the totality of requirements granules at a given level before we consider requirements from the perspective of the subsequent level. Notice that this is in accordance with the waterfall model and corresponds to the breadth-first approach of Chap. 2. This approach suffers from (a) long lead times for delivering requirements as well as the project and (b) a high risk of instability if requirements change even as requirements engineering is being carried out.
In the second approach, we could cut the stack of requirements vertically, in accordance with the depth-first approach of Chap. 2. That is, the requirements engineer
identifies a subset of the set of elements at the application level and then moves
down the levels to elaborate these. At the decision level, a subset of elaborated
decisions is picked up for exploring. From decisions, the requirements engineering
process moves to the next level, the information level. Again information elicitation
is followed by selection of a subset of this information. From this selected infor-
mation, conceptual and multidimensional structures are determined. To summarize,
appropriate subsets are selected as the requirements engineering process moves
vertically down.
Let us take a detailed look at the second approach. The top left corner of Fig. 6.4
shows requirements engineering is at the application level where elements and the
relationships between these elements are being discovered. Now, even as new
elements are being discovered, the requirements engineer selects E1, E2, E3, E4, E9,
and E10 (marked in green) to be taken for elaboration to the decision level, leaving
out E7 and E8 (marked in red) for the time being. Notice that requirements engi-
neering is now being performed at two levels, at the application level where even
more elements are being discovered and at the decision level where decisions and
decision hierarchies are being obtained for the selected elements.
[Fig. 6.4: Agile movement down the requirements stack. At the application level, elements E1, E2, E3, E4, E9, and E10 are selected while E7 and E8 are left out. At the decision level, choice sets {Select Ei, Reject Ei, Accept modified Ei} are constructed for E1, E2, E9 (with components E91 and E92), and E10. At the information level, information is elicited for E1 and E2.]
The bottom left corner of Fig. 6.4 shows population of the decision level for E1,
E2, E9, and E10. The choice set of decisions for E9 is complex and so a decision
hierarchy is obtained. Decisions for elements E3 and E4 are yet to be discovered.
The requirements engineer may now decide that information to take a decision
on E1 and E2 is required and so, as shown in the top right corner of the figure, the
choice set of decisions for E1 and E2 are marked in green and taken down to the
information level. Notice again that requirements engineering is now being per-
formed at the application and decision levels where even more elements and even
more decisions are being discovered, and also at the information level where
information for E1 and E2 is being elicited.
Finally, the requirements engineer may decide that the information of both E1 and E2 goes to the conceptual and construction phase, or alternatively, s/he may choose only one of the two. At this point, we can define a requirements granule as the selected subset of information. Thus, in the former case, where information for both E1 and E2 is selected, the requirements granule consists of the information of E1 and E2; in the latter case, it contains only the information of, say, E1.
The bottom right side of the figure shows the selection of both information of E1
and E2 (marked in green). This requirements granule is taken down into the con-
ceptual design and the construction phase giving us a DW fragment. In this case,
the DW fragment addresses decisions for elements E1 and E2.
This selection makes up a single iteration of agile DWRE. Let us say we just
performed iteration 1.
There are several starting points for iteration 2. Recall that out of the selected
elements at the application level, decisions for E3 and E4 were not elaborated in
iteration 1. Thus, one starting point for iteration 2 is at the decision level where
again it may be decided that either decisions for both E3 and E4 will be looked at or
only one will be looked at. This iteration will proceed to the information level
where a selection will produce a requirements granule that will be taken into
development.
Another starting point for iteration 2 is at the application level where elements
not selected in any previous iteration are up for selection. The process moves down
the sub-levels as discussed in iteration 1.
It follows that there is in fact a third starting point for iteration 2, at the infor-
mation level. Here, a selection is made from previously unselected information and
the requirements granule is arrived at. Thus, subsequent iterations can in fact begin
at any level of our stack of requirements.
Notice that the size of the DW fragment depends on the size of the requirements
granule. In other words, if information for a large number of elements is selected for
a given iteration, then the DW fragment will also be large.
This strategy is similar to the epic–theme–story strategy of Scrum. In Scrum, epics are obtained and reduced to yield themes, from which stories are obtained; stories meeting the INVEST test are taken up for development.
Note carefully that the three levels, application, decision, and information, do
NOT correspond to epic, theme, and story, respectively. In fact, these levels reduce
the fuzziness associated with the ideas of epic, theme, and user story, by using
relatively concrete notions of (a) concepts for which decisions are taken, (b) the
decisions, and (c) the information relevant to the decisions.
In the development process, vertical slices of requirements to be elaborated are selected; the full specification is obtained only at the end of the project. Evidently, there is a trade-off between the thickness of a slice and the lead time to project delivery: maximum lead time occurs when the entire element graph and the complete decision hierarchy constitute the vertical slice. Agility is therefore promoted when computer-based support is provided for vertical slices of low thickness. We consider the issue of thickness at the three levels separately.
Application Level
First notice that there are two drives that can lead to initial identification of
elements:
• The subject drive: The stakeholder identifies a subject of interest, and the elements of interest are elicited.
• The element drive: The stakeholder directly identifies elements independent of any subject. The subject, if any, is identified only after element identification is done.
Irrespective of the drive, the vertical slice to be taken up for development must
be determined. There are three factors to be considered. These are as follows:
Factor I: Topological
We view this in terms of the element graph. An element graph has the following:
• An isolated node: such a node has no edges. That is, it neither conflicts with nor is compatible with any other node (though, by reflexivity, it is compatible with itself).
• A node cluster for a node N: this comprises N and all nodes with which N is connected by compatible or conflict edges.
From the element graph, a number of candidate elements for selection can be identified. A single-node cluster, for example, is a candidate. In Fig. 6.5, consider node cluster C2 for node E6. E6 is compatible with E2 and in conflict with E4. Thus, the node cluster for E6 comprises E6 together with E2 and E4.
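These definitions can be sketched in Python. The edge data below is our reconstruction of Fig. 6.5 from the surrounding text, and the isolated node E9 is purely illustrative:

```python
from itertools import chain

# Element graph reconstructed from the text around Fig. 6.5:
# edge labels are "compatible" or "conflict".
edges = {
    ("E6", "E2"): "compatible",
    ("E6", "E4"): "conflict",
    ("E2", "E1"): "conflict",
}

def neighbours(node, edges):
    """Nodes directly connected to `node` by a compatible or conflict edge."""
    out = set()
    for (a, b) in edges:
        if a == node:
            out.add(b)
        elif b == node:
            out.add(a)
    return out

def node_cluster(node, edges):
    """N together with all directly connected nodes."""
    return {node} | neighbours(node, edges)

def isolated_nodes(all_nodes, edges):
    """Nodes with no edges at all."""
    linked = set(chain.from_iterable(edges))
    return all_nodes - linked

print(sorted(node_cluster("E6", edges)))                       # ['E2', 'E4', 'E6']
print(sorted(isolated_nodes({"E1", "E2", "E4", "E6", "E9"}, edges)))  # ['E9']
```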
We can also build slices that span multiple node clusters. If node A belongs to the node cluster of N, then we can include the cluster formed by A in our slice as well. Such a cluster consists of nodes directly connected to N, and also of those indirectly connected to N by being two edges away (by being connected to A). This can be extended to include nodes three edges away, four edges away, and in general k edges away. Consider node cluster C1 shown in Fig. 6.5. Here, while considering the slice, node E1, which is in conflict with E2, is also included in the cluster of E6. In fact, E1 is two edges away from E6, whereas nodes E2 and E4 are directly connected to E6, that is, one edge away.
We can now define the notion of thickness of a node cluster. Consider two node
clusters C1 and C2 for the node N. C1 is thicker than C2 if one of the following
holds:
• C1 comprises nodes m edges away from N, whereas C2 comprises nodes (m − i), i ≥ 1, edges away from N. In Fig. 6.5, node cluster C1 has E1, which is two edges away from E6, so m = 2. Node cluster C2 has only nodes directly connected to E6, so i = 1. Since (m − i) = 1, node cluster C1 is thicker than C2.
• C1 and C2 both comprise nodes that are m edges away from N but the number of
nodes in the former is greater than in the latter. In other words, if a node cluster
is more heavily populated than another, then it is thicker than the latter.
Consider the two node clusters in Fig. 6.6. Both C1 and C2 contain only nodes directly connected to E6. However, C1 includes E7, E2, and E4, whereas C2 picks only E2 and E4. Since C1 has three nodes directly connected to E6 compared with two in C2, C1 is thicker than C2.
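The two thickness criteria can be sketched as a comparison over breadth-first distances. The adjacency map below is our reconstruction of Fig. 6.5, and the function names are ours:

```python
from collections import deque

def distances_from(n, adj):
    """BFS distances (in edges) from n over an undirected element graph."""
    dist = {n: 0}
    q = deque([n])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def thicker(c1, c2, n, adj):
    """True if node cluster c1 (a set of nodes) is thicker than c2 for node n."""
    d = distances_from(n, adj)
    m1 = max(d[x] for x in c1 if x != n)
    m2 = max(d[x] for x in c2 if x != n)
    if m1 != m2:
        return m1 > m2            # criterion 1: farther reach (larger m) wins
    return len(c1) > len(c2)      # criterion 2: same reach, more populated wins

# Fig. 6.5 reconstruction: E6-E2 (compatible), E6-E4 (conflict), E2-E1 (conflict)
adj = {"E6": {"E2", "E4"}, "E2": {"E6", "E1"}, "E4": {"E6"}, "E1": {"E2"}}
c1 = {"E6", "E2", "E4", "E1"}   # includes E1, two edges away, so m = 2
c2 = {"E6", "E2", "E4"}         # direct neighbours only, so m = 1
print(thicker(c1, c2, "E6", adj))  # True
```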
The notion of thickness suggests that, at the application level, isolated nodes are the best slices; thereafter, the thinner node clusters, as per the criteria outlined above, are the next best.
Whereas the foregoing provides a syntactic view of the thickness of a slice, it is
possible to bring in semantics as well. There are two semantic criteria as follows.
Factor II: Business Coherence
The notion of coherence can be used to reduce the thickness of slices. If nodes of a
candidate are strongly related, then the requirements engineer needs to investigate
whether the entire candidate is to be selected or whether it is possible to select a
suitable subset of it.
Coherence is defined on the basis of the shared properties of nodes. That is, a set
of nodes, S, is coherent if each member of the set has the same property P with
respect to a certain designated member of S. There are three properties of interest,
Compatible, Conflict, and Compatible OR Conflict.
Evidently, an isolated node is coherent under compatibility. This is because
compatibility is reflexive. Therefore, it remains the most preferred candidate even
under business coherence.
There are three possibilities for a node cluster of N as follows:
(a) The entire node cluster is coherent since every node in it satisfies the property
Compatible OR Conflict with respect to the node N.
(b) The subset of the node cluster formed from conflicting nodes only is coherent.
This is because the property Conflict is satisfied by all nodes in the cluster with
respect to N.
(c) The subset of the node cluster formed from compatible nodes only is also coherent. This is because the property Compatible is satisfied by all nodes in the cluster with respect to N.
Evidently, both node clusters (b) and (c) are thinner than node cluster (a), because they are less heavily populated, that is, they contain fewer nodes than node cluster (a). When a choice is to be made between node clusters (b) and (c), their populations can be determined and the cluster with the smaller number of nodes selected.
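The three coherent candidates, and the choice of the thinner of (b) and (c), can be sketched as follows. The edge data is illustrative, not taken from the book's figures:

```python
def coherent_candidates(n, labelled_edges):
    """Return the coherent candidates (a), (b), (c) for node n,
    plus the thinner of (b) and (c)."""
    conflict = {b for (a, b), kind in labelled_edges.items()
                if a == n and kind == "conflict"} | \
               {a for (a, b), kind in labelled_edges.items()
                if b == n and kind == "conflict"}
    compatible = {b for (a, b), kind in labelled_edges.items()
                  if a == n and kind == "compatible"} | \
                 {a for (a, b), kind in labelled_edges.items()
                  if b == n and kind == "compatible"}
    whole = {n} | conflict | compatible          # (a) Compatible OR Conflict
    # (b) and (c) are thinner than (a); pick the less populated of the two
    thinner = conflict if len(conflict) <= len(compatible) else compatible
    return whole, {n} | conflict, {n} | compatible, {n} | thinner

edges = {("E6", "E2"): "compatible", ("E6", "E4"): "conflict",
         ("E6", "E7"): "conflict"}
whole, conf, comp, pick = coherent_candidates("E6", edges)
print(sorted(pick))  # ['E2', 'E6'] -- the compatible subset is smaller here
```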
Factor III: Business Priority
Whereas topological and coherence considerations are derived from the element
graph, it is possible that there are other stakeholder concerns that influence the
choice of elements to be taken up. This may happen if selection of an element could
result in financial benefits, performance efficiencies, security, etc. Thus, business
priority plays a role in selecting the element to be taken up for data warehouse
support.
That said, topology and coherence first produce a menu of choices, from which the ones with higher business priority can then be picked.
Decision Level
Whereas the application level provides the strategic business view, the decision
level provides the “Why” component of our user story.
There are two essential questions to be answered at the decision level:
1. How do we construct decisions from the selected elements?
2. How do we select the vertical slice of the decision level that will be taken into
the information level?
Regarding the first question, the set of alternatives is available in the choice sets that result from the elements. We postulate that for a given element E, the interesting choice set is {Select E, Reject E, Accept modified E}. In other words, if an element is a policy, a business rule, or an action, then the policy, business rule, or action, respectively, may be selected, rejected, or accepted after some modification. Applying this to all elements, we get as many choice sets as the number of elements handed down by the application level.
Recall from Fig. 6.1 that elements may be simple or complex. For complex
elements, every component Ei of the element itself participates in the choice set
{Select Ei, Reject Ei, and Accept modified Ei}.
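Constructing choice sets for simple and complex elements can be sketched as follows; the rendering as strings is our illustration:

```python
def choice_set(element):
    """The postulated choice set for an element E."""
    return [f"Select {element}", f"Reject {element}", f"Accept modified {element}"]

def choice_sets(element, components=()):
    """Choice sets for E and, if E is complex, for each of its components Ei."""
    sets = {element: choice_set(element)}
    for part in components:
        sets[part] = choice_set(part)
    return sets

# Complex element E9 with components E91 and E92, as in Fig. 6.4
for elem, cs in choice_sets("E9", ["E91", "E92"]).items():
    print(elem, cs)
```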
Now, let us consider the second question, namely, that of selecting decisions for
determining the thickness of the vertical slice. Recall that decisions are organized in
a hierarchy. Selection therefore amounts to selecting the appropriate branch of the
hierarchy. There are several selection criteria as follows:
Feasibility: The first selection criterion is whether a decision lying on a branch is feasible or not. A feasible decision is one that can be considered for implementation, whereas a non-feasible decision is one that cannot be implemented in the given business situation. Clearly, only feasible decisions are candidates for selection.
Frequency: If an alternative is likely to be implemented with high frequency, then
it is a good candidate for data warehouse support. Consider an alternative Select
Replenish Stock. If this decision is to be taken with high frequency, then data
warehouse support would be a good idea. Similarly, frequency of Reject Replenish
Stock has a bearing on whether a data warehouse will be built or not.
Bottleneck: When delays occur for reasons like the absence of defined policies,
business rules, or unknown next action to be taken, then data warehouse support
would be helpful. This is because such delays in turn call for formulation/
reformulation of a policy, a business rule, or for identifying the next action to be
taken. Thus, if the decision to replenish stock causes such delay and is a bottleneck,
then Select Replenish Stock is a good candidate for providing data warehouse
support.
Performance: Alternatives that contribute toward organizational efficiency and/or
effectiveness are clearly good candidates for providing DW support.
Applying these criteria yields a subset of the set of constructed alternatives. This subset is now explored from the point of view of the relevant information.
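One hedged reading of how the four criteria combine is sketched below. The book does not prescribe an exact combination rule; we assume here that feasibility is mandatory and that at least one of the other criteria must hold:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    text: str
    feasible: bool      # can it be implemented in the given business situation?
    frequent: bool      # is it taken with high frequency?
    bottleneck: bool    # does taking it slowly cause delays?
    performance: bool   # does it further efficiency/effectiveness?

def select_for_dw(decisions):
    """Keep feasible decisions that score on at least one further criterion
    (our assumed combination rule)."""
    return [d for d in decisions
            if d.feasible and (d.frequent or d.bottleneck or d.performance)]

ds = [Decision("Select Replenish Stock", True, True, True, False),
      Decision("Reject Expand Ward", False, True, False, False)]
print([d.text for d in select_for_dw(ds)])  # ['Select Replenish Stock']
```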
Information Level
For completing our user story, we need to associate with a decision, the information
relevant to it to provide the “What” of the user story. The purpose of the
information level is to identify this information.
The starting point for identifying information is a selected decision. This information is obtained in "early" form by deploying the CSFI, ENDSI, MEANSI, and outcome feedback techniques described in Chap. 5.
Semantic criteria are deployed here for selection of information that is to form
the requirements granule. These include priority. Information with higher priority
will be selected over information with lower priority. Another semantic criterion is
bottleneck. This is similar to the bottleneck described above for selecting decisions.
Once the requirements granule is formed, development of the DW fragment
proceeds through the conceptual and remaining phases of DW life cycle.
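Putting the levels together, a DW user story pairs a decision (the Why) with the information relevant to it (the What). The following sketch is illustrative; the field names and the As-a/I-want rendering are ours, not the book's tool:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DWUserStory:
    """One row of the user-story backlog: a decision plus its early information."""
    role: str               # Who
    decision: str           # Why: the decision to be supported
    information: List[str]  # What: early information relevant to the decision

    def render(self) -> str:
        # Illustrative Scrum-style phrasing of the story.
        return (f"As a {self.role}, I want {', '.join(self.information)} "
                f"so that I can decide: {self.decision}")

story = DWUserStory("hospital administrator",
                    "Select 'duty must be 24 h'",
                    ["staff roster history", "consultant availability"])
print(story.render())
```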
6.5 Showing Agility Using an Example

[Figure: element graph over the policies P1–P5 of the example]
Topological: Several node clusters are possible. One node cluster is for node P2, with P1 and P3 as directly connected nodes; let us call this C1. Another node cluster can be C1 along with node P4; call this C2. Comparing C1 and C2, we find that C1 is thinner than C2 since it has fewer nodes. Similarly, cluster C3 contains the nodes of C1 plus node P5; again, C1 is thinner. Consider yet another cluster that contains all the nodes of the element graph; this is naturally the thickest cluster. Finally, there are the clusters that contain only an isolated node each; these are naturally the thinnest.
Business coherence: The node cluster of P2, containing P2, P1, and P3, is coherent since all its nodes are in a conflict relationship with P2. So this is a good candidate for selection.
Business priority: Let us say that policy P5 is given the highest priority and is to be implemented first, while policy P3 is given the second highest priority.
Now, the requirements engineer has a choice to make since the two criteria for
selecting the slice give different nodes to select. Since stakeholder priority is
important, s/he decides that policy P5 will be selected in the first iteration. Thus, P5
goes down to the decision level to be elaborated.
Decision level: Policy P5 at the decision level is converted into a decision
hierarchy based on the process described in Chap. 4. Here, the policy is expressed
in the logic form and a structural hierarchy tree is constructed. With each node of
the hierarchy, a choice set {select, modify, reject} is attached. This forms the
decision level.
Now, applying the criteria of feasibility, frequency, bottleneck, and performance, the decisions to be elaborated at the information level are selected. Suppose the two nodes selected are "consultant" and "duty must be 24 h".
Information level: Applying CSFI, ENDSI, MEANSI, and outcome feedback, early information is elicited. A subset of this information is selected to form the requirements granule; assume all the elicited information is selected.
We can now construct our DW user stories. Each row of Table 6.3 constitutes a
DW user story. This backlog is now available to development teams to take into
data warehouse development.
The differences between the DAM approach of arriving at the user story and the epic–theme–story approach are summarized in Table 6.4. The comparison is along various aspects, listed in the first column of the table. The second column displays the lessons learnt in applying the epic–theme–story approach, and the third column lists the way DAM deals with these aspects.
The first row of the table considers the speed of progress in the user story writing
against the level of detail that is to be reached. The risk in stakeholder interaction is
that the discussion can go into details of business processes, information life cycles,
and data transformation rules. It is necessary to avoid this kind of detail so that the
task of story writing proceeds at a good pace. In the DAM approach, elaboration of
the element graph, element reduction, and the decision hierarchy are the crucial
structures. It is important to limit the element graph to consider nodes directly
connected to a node; otherwise, the spread becomes too large. Element reduction
and construction of the decision hierarchy need to be taken into detail. The finer the
detail, the more specific is the user story that shall emerge.
The second row contains recommendations about how to keep the interaction
short. In column two of this row, we see the importance of staying at the business
level, i.e., to deal with measures and qualifiers. In our case, the focus is to consider
the full range in which early information can be expressed.
According to the third row, there is a difference in when the story is discovered. The epic–theme–story approach, with its focus on dashboards, detects
user stories when stakeholders talk about what they will do with the dashboards
once the data identified in the theme is in place. On the other hand, a DW user story
is detected the moment the decision is selected for taking into development.
The fourth row says that our model-driven approach balances the why and what aspects of the user story. In the epic–theme–story approach, the why and who aspects remain static over a long period of time while the focus is on the changing what aspect of a story; there is thus a need for reinforcement that the interview is still about the same why.
From the point of view of managing stories and their components, the
model-driven approach is computer supported. Element graphs, element reduction,
and decision hierarchies are all stored in the repository of our tool. Information
elicited for each decision–information association is also kept in the tool repository.
Lastly, there is a clear attempt at building the target business model in the
epic-theme approach. This is done even as the stakeholder interaction is being
carried out. The measures and qualifiers are laid out in what is essentially a mul-
tidimensional form. Simultaneously, epics, themes, and stories are being detected.
This overloading of the project architect is avoided in the model-driven approach
where a separation of concerns occurs due to the application, decision, and infor-
mation levels.
Figure 6.8 shows that DW fragments may arise at any of the policy formulation, PER formulation, and operational levels. This is a consequence of requirements granules populating these levels. Two cases arise:
A. DW fragment development is done level-wise, with the policy DW developed first, followed by the PER DW, followed by the operational DW. Here, the requirements granules produced at any time belong to the same level, that is, they are horizontal with respect to one another.
B. DW fragment development does not proceed level-wise but vertically across levels. Thus, for certain policies formulated, the requirements engineer may proceed to formulate enforcement rules for these policies and finally operational decisions. The requirements granules produced then span levels, that is, they are vertical with respect to one another.

[Fig. 6.8: requirements granules RG1 … RGn at the policy formulation, PER formulation, and operational decision levels; horizontal consolidation operates within a level and vertical consolidation across levels]
Recall that the data warehouse community has proposed data mart integration for
unifying data marts into one consolidated data warehouse. As a result, different
schemas are integrated and differences in data are sorted out.
In considering DW fragment integration, name conflicts are assumed to be resolved; that is, there is no confusion about which name refers to which concept. For example, all agree that employee means the same kind of employee. Since integration concerns the data model only, the interesting task is to take up the multidimensional models and integrate facts and dimensions. A number of approaches for matching facts and dimensions of data marts have been reported:
• It was demonstrated [1] that drill across operations performed over
non-conforming dimensions and facts are either not possible or produce invalid
results. The authors assumed that data marts are available in a variety of forms,
DB2, Oracle, SQL server, etc. and proposed an integration strategy of three
steps consisting of a (a) semi-automated part [2] to identify dimension com-
patibility, (b) verification of compatible dimensions, and (c) making incom-
patible dimensions compatible. Thus, the integration problem is a semantic
issue.
• The approach of [3] is based on the observation that in many practical situations,
the assumption that in aggregation hierarchies, levels, and their
inter-relationships are given does not hold. They infer these levels and
inter-relationships from their data and use them for integration.
• Golfarelli [4] positions fact/dimension conformity in the larger context of the
functional and physical architecture of the integrated DW and resolution of the
trade-off between technical and user priorities.
• In ORE [5], information requirements of the integrated DW are determined as a
matrix of facts and dimensions. Each fact row is considered to be an information
requirement and is to be realized in a single data mart. Thus, one gets as many
data marts as the number of fact rows in the matrix. This collection of data marts
is then integrated into the full DW by fact matching, dimension matching,
exploring new multidimensional design, and final integration.
The authors propose to use an ontology of the available data sources to identify relationships between concepts.
The underlying assumptions behind work on data mart integration [2] are as
follows:
(a) Data marts are structured in a uniform way; they use notions of facts and
dimensions only.
(b) Data quality in a data mart is usually higher than in a database because of the
ETL process.
Therefore, the interesting issue is to integrate facts and dimensions across data
marts for the purpose of providing a single logical schema for querying.
traceable to rule which is the rule identifier. An example is shown in Table 6.8.
Here, two decisions D1 and D2 are traceable to R1 and R2, respectively. Again, the
information obtained for each decision under different analysis types is available.
Here, TF is set to TRUE since D1 and D2 are traceable to PER R1 and R2,
respectively.
Correspondence Drafter
The aim of the correspondence drafter is to propose candidate EI for integration to the information mapper. The correspondence drafter can be based on a number of strategies, ranging from the brute force strategy to the strong correspondence strategy. These strategies are described below:
1. Brute force strategy: In this strategy, each row of EI of requirements granule RG1 is compared with every row of EI of RG2. This strategy is suitable when the number of comparisons is small.
2. Weak correspondence strategy (WCS): As the number of comparisons with
the brute force strategy becomes large, there is a need to deploy heuristics.
Consider early information EIR and EID. EIR and EID correspond to one
another provided TF = TRUE. Thus, for Tables 6.7 and 6.8, the weak corre-
spondences are shown in Table 6.9.
Assuming that the amount of EI of a rule and that for a decision derived from it
is not large, this strategy is suitable.
3. Average correspondence strategy (ACS): As the early information to be
considered in the WCS rises, there is need for a stronger heuristic. Formally, let
there be two elements say R and D; analysis types AT1 as well as AT2; and
early information EIR,AT1 and EID,AT2. ACS says that EIR,AT1 and EID,AT2
correspond to one another provided (i) TF is TRUE, and (ii) AT1 = AT2. Thus,
for Tables 6.7 and 6.8, the average correspondences are shown in Table 6.10.
Assuming that the amount of EI of an analysis type is not very large, this
strategy is suitable.
4. Strong correspondence strategy (SCS): Again, as the amount of early information to be considered in ACS rises, there is need for an even stronger heuristic. Let there be two elements R and D; analysis types and values AT1, V1 as well as AT2, V2, respectively; and early information EIR,AT1,V1 and EID,AT2,V2. Now, a strong correspondence occurs between EIR,AT1,V1 and EID,AT2,V2 provided (i) TF is TRUE, (ii) AT1 = AT2, and (iii) EQV(V1, V2) holds, that is, the analysis values V1 and V2 are equivalent.
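The brute force, weak, average, and strong strategies can be sketched as successively stronger filters. Here EI rows are plain dictionaries of our own design, and EQV is approximated by simple value equality, which is an assumption on our part:

```python
def weak(rows_r, rows_d):
    """WCS: every (EIR, EID) pair whose decision is traceable to the rule."""
    return [(r, d) for r in rows_r for d in rows_d
            if d["traceable_to"] == r["rule"]]

def average(rows_r, rows_d):
    """ACS: additionally require the same analysis type."""
    return [(r, d) for (r, d) in weak(rows_r, rows_d)
            if r["analysis_type"] == d["analysis_type"]]

def strong(rows_r, rows_d, eqv=lambda v1, v2: v1 == v2):
    """SCS: additionally require EQV over the analysis values.
    EQV is approximated here by plain equality, an assumption."""
    return [(r, d) for (r, d) in average(rows_r, rows_d)
            if eqv(r["value"], d["value"])]

rows_r = [{"rule": "R1", "analysis_type": "CSFI", "value": "occupancy"},
          {"rule": "R1", "analysis_type": "ENDSI", "value": "capacity"}]
rows_d = [{"traceable_to": "R1", "analysis_type": "CSFI", "value": "occupancy"}]
print(len(weak(rows_r, rows_d)), len(average(rows_r, rows_d)),
      len(strong(rows_r, rows_d)))  # 2 1 1
```

Note how each strategy prunes the candidate pairs further, trading completeness for fewer comparisons as the amount of EI grows.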
Information Mapper
Once the correspondence drafter reports the correspondences, attention shifts to a more detailed examination of early information. The notion of early information was elaborated in Chap. 5; a piece of early information has the following properties:
• Attribute,
• History: Whether or not its history is to be maintained,
• Categorization, and
• Function: use of a function like Count, Max, Min, etc.
To establish a mapping between correspondences generated by the correspon-
dence drafter, there is a need to ensure that information of one can be mapped to
that of the other. This is the job of the information mapper: it compares two pieces
of early information, EI1 and EI2, and reports their integration, EIintegrated.
Suppose EI1 has I1, A1, H1, C1, F1 and EI2 has I2, A2, H2, C2, F2, representing the information, attribute, history, categorization, and function of EI1 and EI2, respectively. While comparing EI1 and EI2, three possibilities can arise: EI1 and EI2 can be fully mapped, partially mapped, or not mapped. Partially mapped EI moves to the conflict resolution stage.
Conflict Resolver
EI which is partially mapped is sent to the conflict resolver. Two kinds of conflict arise:
1. A property is present in EI1 but not in EI2, or vice versa: When such a conflict arises, the proposed heuristic is to retain the property in EIintegrated. For example, suppose EI1 shows that history is required and EI2 shows that it is not. Then history is maintained in EIintegrated: the requirement of DW fragment DWF2 is satisfied by current data and that of DWF1 by current plus historical data.
2. A property is present in both EI1 and EI2 but with different property values: Table 6.12 shows the different scenarios that can arise. Notice that in the case of attribute, categorization, and function, EIintegrated contains A1 ∪ A2, C1 ∪ C2, and F1 ∪ F2, respectively. In the case of the temporal unit, the value having the lower grain is chosen, since roll-up operations can always be performed at the level of BI tools.
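The two conflict-resolution heuristics can be sketched as a merge function. The property names and the grain ordering below are our illustration of Table 6.12, not its verbatim content:

```python
def resolve(ei1, ei2, grain_order=("hour", "day", "month", "year")):
    """Merge two partially mapped pieces of early information."""
    merged = {
        # different values: take the union of attributes, categories, functions
        "attributes": ei1.get("attributes", set()) | ei2.get("attributes", set()),
        "categorization": ei1.get("categorization", set())
                          | ei2.get("categorization", set()),
        "functions": ei1.get("functions", set()) | ei2.get("functions", set()),
        # property present on one side only: retain it, so history is kept
        # if either side needs it
        "history": ei1.get("history", False) or ei2.get("history", False),
    }
    # temporal unit: keep the lower (finer) grain; roll-up is left to BI tools
    grains = [g for g in (ei1.get("temporal"), ei2.get("temporal")) if g]
    if grains:
        merged["temporal"] = min(grains, key=grain_order.index)
    return merged

ei1 = {"attributes": {"ward"}, "history": True, "temporal": "month",
       "functions": {"Count"}}
ei2 = {"attributes": {"bed"}, "history": False, "temporal": "day",
       "functions": {"Max"}}
out = resolve(ei1, ei2)
print(out["history"], out["temporal"], sorted(out["attributes"]))
# True day ['bed', 'ward']
```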
Once the requirements have been integrated, the integrated early information is converted to an ER diagram and subsequently into multidimensional structures. For this, we rely on the existing techniques of [6, 7].
Consider two requirements granules, one with a rule as its element and the other with an operational decision as its element. The two elements are as follows:
Element 1 WHEN start x IF !area(x, 200) THEN expand x
Element 2 Remodel x
where range variable is <private ward> <x>.
Applying the five-step pair-wise consolidation process:
1. Early Information Reader
Consider the output of the early information reader for requirements granule of
element R1 as shown in Table 6.13. Each row gives information about the element,
analysis type applied, analysis value obtained, and EI identifier. Observe from
Table 6.13 that EI was elicited using two CSFI factors, three ENDSI, and three
MEANSI analyses. Details of the early information in the last column of Table 6.13
are provided later when we consider information mapper because these details are
needed then.
Similar to Table 6.13, the next table, Table 6.14 shows the output of reading the
requirements granule of element D1. There is an additional column that shows the
PER from which the decision is derived. Also, observe for D1, EI was elicited using
two CSFI factors, two ENDSI, and four MEANSI analyses. Again, details of early
information are considered when dealing with information mapper.
2. Correspondence Drafter
The next step is to find correspondences between each row of Tables 6.13 and 6.14.
When applied, the brute force strategy led to a large total number of comparisons, since the amount of EI to be consolidated was large.
Applying WCS:
WCS says that EIR and EID correspond to one another provided
(i) TF is TRUE.
Table 6.14 shows that D1 is traceable to R1. There is a weak correspondence
between EI of Tables 6.13 and 6.14. The result is shown in Table 6.15. Note that neither the analysis type nor the analysis value is taken into consideration while drafting a correspondence.
Applying ACS:
ACS says that EIR,AT1 and EID,AT2 correspond to one another provided
(i) TF is TRUE and
(ii) AT1 = AT2.
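The WCS and ACS conditions translate directly into predicates. In this sketch, `traceable` is a hypothetical stand-in for the TF (traceability) check, and the row dicts are our own representation of one line of Tables 6.13 and 6.14:

```python
def wcs(ei_r: dict, ei_d: dict, traceable) -> bool:
    """Weak correspondence: only the traceability factor (TF) must hold."""
    return traceable(ei_r["element"], ei_d["element"])

def acs(ei_r: dict, ei_d: dict, traceable) -> bool:
    """Analysis-type correspondence: TF holds and AT1 = AT2."""
    return (traceable(ei_r["element"], ei_d["element"])
            and ei_r["analysis_type"] == ei_d["analysis_type"])

# In the running example, decision D1 is traceable to rule R1
trace = lambda r, d: (r, d) == ("R1", "D1")
row_r = {"element": "R1", "analysis_type": "CSFI", "analysis_value": "PS"}
row_d = {"element": "D1", "analysis_type": "ENDSI", "analysis_value": "Income"}
# WCS holds for this pair; ACS does not, since CSFI differs from ENDSI
```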
Consider the first and second rows of Table 6.14. Here, D1 is traceable to R1.
The analysis type is CSFI. In Table 6.13, row numbers 1 and 2 have rule as R1 and
analysis type as CSFI. Thus, the correspondence is ACS. Similarly, row numbers 3
Table 6.15 Correspondence between EIR and EID using WCS strategy
DWFper | DWFop
EIR1,CSFI,PS, EIR1,CSFI,QualC | EID1,CSFI,PatSat, EID1,CSFI,QC
EIR1,ENDSI,IncGrp, EIR1,ENDSI,Spat, EIR1,ENDSI,PC | EID1,ENDSI,Income, EID1,ENDSI,PatAtt
EIR1,MEANSI,NewRoom, EIR1,MEANSI,HireRoom, EIR1,MEANSI,RemodRoom | EID1,MEANSI,NewPvt, EID1,MEANSI,HirePvt, EID1,MEANSI,SplitPvt, EID1,MEANSI,AddSec
Table 6.16 Correspondence between EIR and EID using ACS strategy
DWFper | DWFop
EIR1,CSFI,PS, EIR1,CSFI,QualC | EID1,CSFI,PatSat, EID1,CSFI,QC
EIR1,ENDSI,IncGrp, EIR1,ENDSI,Spat, EIR1,ENDSI,PC | EID1,ENDSI,Income, EID1,ENDSI,PatAtt
EIR1,MEANSI,NewRoom, EIR1,MEANSI,HireRoom, EIR1,MEANSI,RemodRoom | EID1,MEANSI,NewPvt, EID1,MEANSI,HirePvt, EID1,MEANSI,SplitPvt, EID1,MEANSI,AddSec
and 4 of Table 6.14 and row numbers 3, 4, and 5 of Table 6.13 have ACS between
their EIs. The result is shown in Table 6.16. Notice that the analysis value is not
taken into consideration here.
Applying SCS:
Consider the first row of Tables 6.13 and 6.14. Applying the rules for SCS to EIR1,CSFI,PS and EID1,CSFI,PatSat:
(i) TF is TRUE.
(ii) Analysis type for both is CSFI.
(iii) EIR1,CSFI,PS has the same analysis value, "patient satisfaction", as EID1,CSFI,PatSat.
Thus, according to the rules above, there is equivalence, EQV(PS, PatSat).
All three conditions for a strong correspondence between EIR1,CSFI,PS and
EID1,CSFI,PatSat are satisfied. Similarly, for EIR1,CSFI,QualC and EID1,CSFI,QC, and for
EIR1,ENDSI,IncGrp and EID1,ENDSI,Income, a strong correspondence is found and shown
in the second and third rows of the table.
The fourth row of Table 6.17 shows no entry against EIR1,ENDSI,Spat. This is because there is no equivalent analysis value found in Table 6.14.
To obtain the fifth row of Table 6.17, consider the fifth row of Table 6.13 and
the fourth row of Table 6.14. Again, conditions (i) and (ii) are satisfied because D1 is traceable to R1 and the analysis type, ENDSI, is the same for both. Notice that
achievement of “Provide patient attention” contributes to achievement of “Improve
patient care”. Thus, there is EQV (PC, PatAtt).
The last two entries of Table 6.17, rows 8 and 9, are obtained because the MEANS "splitting room" and "adding section" of Table 6.14 contribute to the MEANS "remodel room" of Table 6.13. Thus, there is EQV (RemodRoom, SplitPvt) and EQV (RemodRoom, AddSec).
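A corresponding sketch for SCS. The EQV pairs below are the manually declared equivalences of this example; as in the rules above, condition (iii) also accepts identical values, and `traceable` again stands in for the TF check:

```python
# Equivalences declared by the requirements engineer in this example
EQV = {("PS", "PatSat"), ("QualC", "QC"), ("IncGrp", "Income"),
       ("PC", "PatAtt"), ("RemodRoom", "SplitPvt"), ("RemodRoom", "AddSec")}

def scs(ei_r: dict, ei_d: dict, traceable) -> bool:
    """Strong correspondence: TF holds, AT1 = AT2, and EQV(V1, V2)."""
    v1, v2 = ei_r["analysis_value"], ei_d["analysis_value"]
    return (traceable(ei_r["element"], ei_d["element"])
            and ei_r["analysis_type"] == ei_d["analysis_type"]
            and (v1 == v2 or (v1, v2) in EQV))

trace = lambda r, d: (r, d) == ("R1", "D1")
row_r = {"element": "R1", "analysis_type": "CSFI", "analysis_value": "PS"}
row_d = {"element": "D1", "analysis_type": "CSFI", "analysis_value": "PatSat"}
# scs holds here; for EIR1,ENDSI,Spat no equivalent value exists, so no
# strong correspondence is drafted for it
```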
3. Information Mapper
Information mapper checks to see if early information to be integrated is fully
mapped, partially mapped, or not mapped.
Mapping Information from WCS:
The information mapper picks one EI from requirements granule of element R1 and
the other from requirements granule of element D1 for integration at random. If
they are fully mapped, then only one set is maintained and integrated with the next
EI picked at random by the information mapper. If at any point there is a conflict,
then the conflict resolver resolves the conflicts and integrates EIs. If EIs are not
mapped, then both the copies are stored. This process is repeated till all the entries
of Table 6.15 have been processed and integrated.
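This repeat-until-done behaviour can be sketched as a loop over the correspondence pairs; the mapping test and the conflict resolver below are simplified stand-ins, not the method's full rules:

```python
def mapping_status(ei1: dict, ei2: dict) -> str:
    """Classify a pair as 'full', 'partial', or 'none' (simplified rules)."""
    if ei1["information"] != ei2["information"]:
        return "none"                        # different information: not mapped
    return "full" if ei1 == ei2 else "partial"

def integrate(pairs, resolve) -> list:
    """Process every correspondence pair from Table 6.15-style output."""
    result = []
    for ei_r, ei_d in pairs:
        status = mapping_status(ei_r, ei_d)
        if status == "full":
            result.append(ei_r)                 # fully mapped: keep one copy
        elif status == "partial":
            result.append(resolve(ei_r, ei_d))  # conflict resolver integrates
        else:
            result.extend([ei_r, ei_d])         # not mapped: keep both copies
    return result

resolve = lambda a, b: {**a, **b}               # placeholder conflict resolver
pairs = [({"information": "Disease"}, {"information": "Disease"}),
         ({"information": "Doctor"}, {"information": "Patient"})]
out = integrate(pairs, resolve)
```

The fully mapped Disease pair contributes one copy, while the unmapped Doctor and Patient are both carried into the next iteration.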
Mapping Information from ACS:
The first row of Table 6.16 has two entries from DWFper and two from
DWFop. The information mapper picks one from each DWFper and DWFop at
random, integrates, and then picks the remaining two for integration. After it finishes with the first row of Table 6.16, it proceeds to the second row and follows the
same process. Throughout, the rules for fully mapped, partially mapped, and not
mapped are followed.
Mapping Information from SCS:
Table 6.17 shows that EIR1,CSFI,PS and EID1,CSFI,PatSat have a strong correspondence. The process of information mapping for the CSFI analysis type is shown below.
Consider information for EIR1,CSFI,PS and EID1,CSFI,PatSat as shown in Table 6.18.
Clearly, information “Patient” is mapped. Now is the question of whether it is
fully or partially mapped. Notice here that while patient of EIR1,CSFI,PS is not
categorized, patient of EID1,CSFI,PatSat is categorized unit-wise, ward-wise, and
department-wise. Thus, they are partially mapped and this conflict is resolved by
the conflict resolver. EIintegrated obtained is shown in Table 6.19.
Consider the information for EIR1,CSFI,QualC and EID1,CSFI,QC shown in Table 6.20. The information Disease has the same attribute, history, category, and function values in both rows. Thus, for Disease, the EIs are fully mapped.
Doctor and Patient are unique to EIR1,CSFI,QualC and are not mapped. Thus, the EIintegrated obtained is shown in Table 6.21.
In the next iteration, Tables 6.19 and 6.21 are integrated, conflicts are resolved, and the resulting EIintegrated is shown in Table 6.22.
This process is repeated for all the entries of Table 6.17. After EIintegrated is
obtained, this information is converted to ER diagram and then to a star schema.
Table 6.22 EIintegrated after integrating from Tables 6.19 and 6.21
Information | Attribute | History | Category | Function
Disease | Name | Monthly | Type-wise |
Doctor | Speciality | Monthly | Daily |
Patient | Income | Monthly | Unit-wise, Ward-wise, Department-wise | Count
6.10 Tool Support
We have already introduced the tool Fetchit in earlier chapters. Fetchit also supports requirements granule consolidation. There are four components: the early information reader, the correspondence drafter, the information mapper, and the conflict resolver. The architecture is shown in Figs. 6.9 and 6.10. The first two components used in the process, namely the early information reader and the correspondence drafter, are shown in Fig. 6.9. The architecture involving the information mapper and conflict resolver is shown in Fig. 6.10.
Figure 6.9 shows that early information reader has the early information base as
input. It reads the early information of the requirements granules and sends it to the
correspondence drafter. The requirements engineer is presented with a list of four
strategies to select from, based on which the correspondence drafter finds correspondences between the requirements granules in a pair-wise manner and stores the
same.
The correspondences output from the correspondence drafter (shown in
Fig. 6.9), together with the early information in the early information base, form the
input to the information mapper (shown in Fig. 6.10). For each pair of information
[Fig. 6.9: architecture of the early information reader and correspondence drafter — the early information base feeds the early information reader; through the user interface one of the four strategies (Brute Force, WCS, ACS, SCS) is selected; the correspondence drafter outputs the list of correspondences. Fig. 6.10: the list of correspondences and the early information base form the input to the information mapper.]
being integrated, the information mapper finds if it is fully, partially, or not mapped.
For fully mapped information, one copy is taken into the next iteration of integration. Partially mapped information is sent to the conflict resolver. Once the conflict is resolved, one copy of the resolved information is taken into the next iteration of integration. Naturally, for not mapped information, both pieces of information are taken into the next iteration of integration. The final set of integrated information is stored in EIintegrated shown in Fig. 6.10.
As far as automation is concerned, no manual intervention is required in the
early information reader component of Fetchit. With respect to the correspondence
drafter, manual intervention is required to select a correspondence strategy. Once
this is selected, correspondences are generated automatically for WCS and ACS
strategies. For SCS, manual intervention is required for finding equivalence
between values of the analysis types. No manual intervention is required in the
information mapper component and the conflict resolver component. Thus, Fetchit is a semi-automated tool for integrating requirements granules, with manual intervention required only in the SCS strategy of the correspondence drafter.
Let us also examine the time taken for Fetchit to consolidate requirements
granules.
Time taken for drafting correspondences: In SCS, for defining EQV(V1, V2),
if the case is that V1 = V2, then the time taken to define equivalence is not high. No
interaction with the requirements engineer is required as the correspondence drafter
itself finds the equivalence. However, for the other cases, namely, V1 is computed from V2, V1 contributes to the achievement of V2, or V1 is a means that contributes to the means used to achieve V2, EQV has to be defined by the requirements engineer. This part is a manual process and a time-consuming one.
In ACS, since it is a direct text search, there is no intervention required by the
requirements engineer and therefore the time taken to form correspondences is
lower than SCS.
6.11 Conclusion
References
1. Cabibbo, L., & Torlone, R. (2004, June). On the integration of autonomous data marts. In 16th
International Conference on Scientific and Statistical Database Management, 2004.
Proceedings (pp. 223–231). IEEE.
2. Cabibbo, L., Panella, I., & Torlone, R. (2006, April). DaWaII: A tool for the
integration of autonomous data marts. In ICDE (p. 158).
3. Riazati, D., Thom, J. A., & Zhang, X. (2010, January). Inferring aggregation hierarchies for
integration of data marts. In Database and expert systems applications (pp. 96–110). Berlin:
Springer.
4. Golfarelli, M., Rizzi, S., & Turricchia, E. (2011). Modern software engineering methodologies
meet data warehouse design: 4WD. In Data warehousing and knowledge discovery (pp. 66–
79). Berlin: Springer.
5. Jovanovic, P., Romero, O., Simitsis, A., Abelló, A., & Mayorova, D. (2014).
A requirement-driven approach to the design and evolution of data warehouses. Information
Systems, 44, 94–119.
6. Golfarelli, M., Maio, D., & Rizzi, S. (1998, January). Conceptual design of data warehouses
from E/R schemes. In Proceedings of the Thirty-First Hawaii International Conference on
System Sciences, 1998 (Vol. 7, pp. 334–343). IEEE.
7. Moody, D. L., & Kortink, M. A. R. (2000). From enterprise models to dimensional models: A
methodology for data warehouse and data mart design. In Proceedings of the International
Workshop on Design and Management of Data Warehouses (pp. 5.1–5.12). Stockholm,
Sweden.
Chapter 7
Conclusion
From a global standpoint, this book has presented an approach to the unification of (a) agile methods in data warehouse development, (b) data warehouse consolidation, and (c) data warehouse requirements engineering. This is in contrast to the traditional attitude of the data warehouse community to these issues. Data warehouse developers consider data warehouse development methods, data mart consolidation, and data warehouse requirements engineering as three separate problems. To elaborate:
1. Agile development strategies for data warehouse development concentrate on cutting down product delivery times. The backlog of user stories leads to proliferation of product increments that are to be consolidated. However, techniques for consolidation like merge with primary do not find mention in agile methods. Thus, consolidation does not seem to be of direct concern in agile methods.
Agile data warehouse development methods also do not, by and large, take into account results obtained in the area of data warehouse requirements engineering. The agile community still believes that requirements engineering produces monolithic specifications and has not investigated the possibility of obtaining requirements of product increments from requirements engineering processes in an agile manner. In fact, in agile development, requirements of product increments are obtained by techniques that are specific to the individual agile process used. These techniques are interview based, are not model-driven, and rely heavily on the experience of the product owner and the development team.
2. Consolidation is merely a consequence of the incremental and iterative development processes adopted, and it lies completely outside the development process. It is only when users face problems due to data mart proliferation that the consolidation process kicks in. Approaches for consolidation therefore do not impact data warehouse development methods. The world of consolidation processes and that of development processes are isolated from one another.
the proposed suite of four techniques, and subsequently structured into multidimensional form. Thus, a requirements granule can be taken into development and yields a data warehouse fragment.
Consolidation of data warehouse fragments flows from integration of requirements granules. Requirements integration looks for the missing glue between granules being integrated. Since a consolidated requirements granule is being made, requirements of the consolidated increment are to be developed. Existing requirements granules going into the increment only contribute to these and do not entirely define these requirements.
Consolidation of requirements granules not only leads naturally to a consolidated data model but also naturally leads to data warehouse fragments residing on a common platform. This follows from unification of the consolidation process with the requirements engineering process. Whenever requirements engineering is done, an attempt is made to consolidate existing requirements granules. This requires a centralization of granules, and no granule, except the first, can be built unless an attempt at consolidation is made. This centralization in an organization facilitates use of a single platform and is a natural deterrent to multiple platforms.
The agile data warehouse development process resulting from the foregoing is
shown in Fig. 7.1. The requirements engineering process subsumes consolidation in
it and produces a requirements granule. This granule is then taken into conceptual
design and the resulting multidimensional design, Granular MD Design, is then
taken into the construction step. The result is a data warehouse fragment, DW
Fragment, for the requirements granule.
The requirements engineering process itself is decision-centric. That is, instead of starting out with statements like "we wish to analyze sales" and then reaching the information required to do this analysis, the decision-centric approach starts off with decisions like "we wish to stop erosion of customer base" and then determines the information that is relevant to this decision. Determining decisions is therefore the key.
[Fig. 7.1: the agile data warehouse development process — Requirements Engineering, which subsumes Consolidation, produces a Requirements Granule; Conceptual Design produces the Granular MD Design; Construction yields the DW Fragment. Fig. 7.2: the decision stack — Policies, Policy Enforcement Rules, and Operations.]
There are two ways in which decisions can be determined, by horizontal entry or by vertical entry. This is shown in Fig. 7.2. Vertical entry refers to using the stack of three types of decisions, policy, policy enforcement rule, and operational, thereby exploiting the relationship across them. That is, decisions are obtained for formulating a policy, then its enforcement rules are determined and decisions for the selection of appropriate rules are formulated, and finally operational decisions are formulated.
Horizontal entry refers to selecting the level directly. This is done when
(a) Policies are given and support for enforcement rules is required. The policy level is ignored and the requirements engineer enters at the policy enforcement rules level, as shown by the horizontal arrow in Fig. 7.2.
(b) Interest is only in operational decisions. In this situation, direct entry can be made into the operations level as shown in Fig. 7.2.
Having obtained the decision(s) of interest, the next task is that of eliciting information for each. There are four techniques for doing this. All of these have stakeholder buy-in and, when used as a suite of techniques, the possibility of missing information is mitigated.
The decision-centric approach provides a unit around which the entire requirements engineering process can be developed. This unit is specific; it is a decision that is required to be taken in an organization and is an alternative in the choice set. Elicitation of decisions is a part of the requirements engineering process.