
2015 IEEE/ACM 1st International Workshop on Big Data Software Engineering

Big Data System Development:
An Embedded Case Study with a Global Outsourcing Firm

Hong-Mei Chen, Department of IT Management, Shidler College of Business, University of Hawaii at Manoa, Honolulu, USA (hmchen@hawaii.edu)
Rick Kazman, Department of IT Management, University of Hawaii at Manoa & SEI/CMU, Pittsburgh, USA (kazman@hawaii.edu)
Serge Haziyev and Olha Hrytsay, SoftServe Inc., Austin, TX, USA ({shaziyev@softserveinc.com, ohrytsay@softserveinc.com})

Abstract— Big data system development is dramatically different from small (traditional, structured) data system development. As of the end of 2014, big data deployments are still scarce and failures abound. Outsourcing has become a main strategy for many enterprises. We therefore selected for our study an outsourcing company that has successfully deployed big data projects. Our research results from analyzing 10 outsourced big data projects provide a glimpse into early adopters of big data, illuminate the challenges for system development that stem from the 5Vs of big data, and crystallize the importance of architecture design choices and technology selection. We followed a collaborative practice research (CPR) method to develop and validate a new method, called BDD. BDD is the first attempt to systematically combine architecture design with data modeling approaches to address big data system development challenges. The use of reference architectures and a technology catalog are advancements to architecture design methods and are proving to be well-suited for big data system architecture design and system development.

Index Terms— big data, system engineering, software architecture, data system design methods, embedded case study methodology, collaborative practice research

I. INTRODUCTION

The big data phenomenon is unprecedented: data volume doubling rates are accelerating, and 90% of all the data in the world has been created in the past 2 years. Big data holds big promise: data—structured, semi-structured and unstructured—are being collected from everywhere by enterprises for real-time decision-making, operational intelligence, customer intelligence, business innovation and competitive advantage. Big data is no doubt in a “hype” phase, and has been touted as the new oil [17] and the new gold [15]. Most enterprises believe that managing big data is important to improve business value. However, how to develop systems for mining the new gold appears to be a daunting task.

First and foremost, the 5V (volume, velocity, variety, veracity, and value) characteristics of big data require capabilities beyond traditional data systems and challenge existing software engineering and system development principles, methods and tools. Second, the rapid proliferation and advancement of open source and proprietary technologies to address the 5Vs has added a new dimension of complexity to system development. Currently, big data technologies include distributed database processing, MPP (Massively Parallel Processing) databases, NoSQL databases, NewSQL databases, in-memory databases, real-time stream processing, advanced analytics, big data analytic clouds, and big data appliances, just to name a few. Each technology type has many vendors and products. For instance, there are over 150 products in the NoSQL space alone. The Hadoop ecosystem continues to expand and change: Hadoop MapReduce is falling out of favor while Spark and the new Hadoop 2.0 are taking its place. The challenges in technology selection (that is, the selection of each system component) are non-trivial and affect the selection of architecture patterns, data models, programming languages, query languages, and access methods, and these in turn affect system performance, latency, scalability, availability, consistency, modifiability, etc. Third, there are uncharted territories. For example, the death of traditional data warehouses has been predicted, as they are believed to be inadequate to support big data exploration. Many concepts such as the data lake, enterprise data hub, data refinery, Lambda architecture [13] and polyglot persistence [14] have been recently proposed. And the integration of these new systems with existing ones is a challenge for companies who cannot afford, or do not want, to start anew.

Big data system development in enterprises has a relatively short history: big data became a megatrend in 2011 when IBM created the hashtag #bigdata [16]. “2013 is the year of big data experimentation, 2014 too,” Gartner reports [10]. As of the end of 2014, big data deployments are still scarce [10] and failures abound. According to a CIO survey in 2013, 55% of big data projects were not completed. Technical roadblocks and inappropriate scope were among the major reasons for failure.

To help enterprises navigate through these uncharted waters and be better equipped for their big data endeavors, our research aims at developing a methodological framework for big data system development, which departs from “small data” system development—traditionally based on structured (relational) databases or data warehouses. Our research centers on how big data systems can be cost-effectively developed. Specifically, we ask:

1. How does big data system development (processes and methods) differ from “small” (traditional, structured) data system development? How should requirement analysis, architecting, data modeling (conceptual, logical and physical) and testing be done differently?

2. Given the importance of architecture design in developing complex big data systems (this point will be
elaborated in Section II), how can existing software architecture approaches be extended or modified to address new requirements for big data system design?

3. How can data modeling/design methods in traditional structured database/data warehouse development be extended and integrated with architecture methods for effective big data system design?

In answering these questions, we have uncovered methodological voids and identified practical guidelines. We employed an empirical research approach to answer our research questions, as system development, be it big or small data, cannot be separated from its organizational and business contexts. Furthermore, because outsourcing is an important and common means to realize a big data strategy, the research method we chose is an embedded case study [18] with a global software outsourcing company.

Given the changing technology landscape, we looked for an outsourcer that is open to innovation, quick to adapt, methodologically sound, and has successfully deployed a number of big data projects that can be triangulated as multiple case studies of early adopters. We eventually selected a company, SoftServe Inc. (hereafter called SSV), which met all our selection criteria and allowed us to conduct collaborative practice research (CPR) [12] on outsourcer-specific as well as general big data system development issues. The findings from this embedded case study have contributed toward the creation, refinement and validation of a big data system design method, called BDD (Big Data-system Design).

The remainder of this paper is organized as follows: Section II describes our research foundations, highlighting the importance of architecture design for big data system development. Section III details our embedded case study research method. Section IV presents the analysis of the 10 big data projects developed by SSV. Section V describes our CPR action research process and the resulting BDD method, and discusses the future research called for. Section VI concludes.

II. RESEARCH FOUNDATIONS

Traditional structured or “small” data system development benefits from the ANSI standard 3-tier DBMS architecture, which clearly defines data/program independence for data system development [9]. The data system development process is well-established, consisting of 7 phases: requirement analysis, conceptual design, selection of DBMS, logical design, physical design, prototyping/testing, implementation and performance evaluation [9]. Typically, a RAD or agile development lifecycle is employed for iteratively refining performance, accommodating new requirements, and managing system evolution. In each development phase there are models and methods to assist DB designers and developers.

Relational data modeling and its query language, SQL, have dominated the data management world in practice—relational DBMS technologies have over 95% market share. The popularity of the relational DB approach owes partially to the existence of a conceptual data modeling tool: Entity Relationship (ER) modeling has been a useful tool for conceptual design and can be transformed into a relational logical design in 3rd normal form [9], which simplified and facilitated the design tasks. The “5Vs” of big data have, however, challenged these established processes:

1. The sheer volume of big data requires distributed and parallel processing that most traditional DBMSs were not designed for. Hadoop MapReduce, for example, is a simple parallel processing algorithm. Traditional DB designers must therefore now pay attention to architectural issues such as parallelism, scaling up and scaling out for big data systems. The volume of big data also challenges how traditional prototyping is done so that it has fidelity of scale for testing.
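To make the MapReduce programming model concrete, the following minimal, single-process Python sketch mimics its map, shuffle, and reduce phases on a toy word count. It is illustrative only (the data and function names are ours) and does not use Hadoop itself, which would distribute these phases across a cluster and handle the shuffle, fault tolerance, and I/O.

    # Minimal single-process sketch of the MapReduce programming model (word
    # count). Illustrative only: a real Hadoop job distributes the map and
    # reduce tasks across a cluster.
    from collections import defaultdict

    def map_phase(record):
        # Emit (key, value) pairs for one input record.
        for word in record.split():
            yield (word.lower(), 1)

    def shuffle(pairs):
        # Group all emitted values by key, as the framework does between phases.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(key, values):
        # Aggregate all values observed for one key.
        return key, sum(values)

    records = ["big data holds big promise", "data volume doubling rates"]
    pairs = (pair for record in records for pair in map_phase(record))
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts)   # {'big': 2, 'data': 2, 'holds': 1, ...}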
2. The variety of big data created challenges in modeling and metadata management for access to data. NoSQL was invented to handle unstructured data. There are four popular NoSQL technologies: document-oriented, column-oriented, key-value store, and graph databases, for different types of data. These data models differ in their performance, and the CAP theorem spells out the major tradeoff [2]: consistency vs. availability vs. tolerance to network partitions. This is an ACID vs. BASE (Basically Available, Soft-state services with Eventual consistency) tradeoff. There are no “one size fits all” solutions so far. This, therefore, gave rise to polyglot persistence [14]—solving a complex problem by breaking it into segments and applying different database models. It is then necessary to aggregate the results into a hybrid data storage and analysis solution. Furthermore, the query languages are also different for each data model. In addition, NoSQL databases are often “schemaless” or use a “flexible schema,” which creates difficulty for queries that traditional databases can conveniently perform using SQL. SQL-on-Hadoop, NewSQL, and Spark SQL are all efforts to help address query issues, bringing back the familiar SQL syntax. However, without a schema, data modeling and metadata management are difficult to perform. How can conceptual modeling be performed and mapped into each of the NoSQL data models? It is hard to separate logical and physical design in the NoSQL world. The separation of concerns in traditional DB development is now undone, and DB designers need to know each NoSQL system in great detail (e.g., programming language, data modeling, metadata management) to develop the system. Obviously, this will result in systems that are harder to modify and evolve.
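As an illustration of polyglot persistence, the hedged Python sketch below stores the same order entity in a document database (for flexible ad-hoc queries) and in a key-value store (for low-latency lookups). It assumes locally running MongoDB and Redis instances; the client libraries (pymongo, redis) and the database, collection and key names are our illustrative choices, not drawn from the case projects.

    # Illustrative polyglot-persistence sketch: the same entity is written to a
    # document database (flexible ad-hoc queries) and to a key-value store
    # (low-latency lookups). Assumes local MongoDB and Redis instances; the
    # database, collection and key names are hypothetical.
    import json
    import pymongo
    import redis

    order = {"order_id": "o-1001", "customer": "c-42",
             "items": [{"sku": "sku-7", "qty": 2}], "total": 59.90}

    # Document model: keep the nested structure for flexible querying.
    mongo = pymongo.MongoClient("mongodb://localhost:27017")
    mongo.shop.orders.insert_one(dict(order))   # insert a copy so 'order' stays JSON-serializable

    # Key-value model: denormalize for a single fast read path.
    cache = redis.Redis(host="localhost", port=6379)
    cache.set("order:" + order["order_id"], json.dumps(order))

    # Each store answers the query pattern it is best suited for.
    by_customer = mongo.shop.orders.find_one({"customer": "c-42"})
    by_key = json.loads(cache.get("order:o-1001"))

Each store is queried along the access path it handles best; the price, as the CAP discussion above suggests, is that consistency across the stores and the lack of a unified query language must now be managed by the application.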
3. The velocity of big data requires real-time processing (ingestion, collection, preparation) of raw “fast” data to facilitate quick decision making. Existing complex event processing capabilities are challenged by the variety of data. The Lambda architecture [13] was proposed to address real-time stream processing of raw data combined with materialized views of stored data residing in batch-processing oriented storage such as Hadoop HDFS. This illustrates the importance of architectural design choices in big data system development.
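The following framework-free Python sketch illustrates the Lambda idea described above: a batch view computed over the full, immutable master dataset is merged at query time with a real-time view over recent events the batch layer has not yet absorbed. It is a schematic illustration under our own simplifying assumptions, not the implementation used in any of the case projects.

    # Schematic, framework-free sketch of the Lambda idea. A real deployment
    # would use e.g. HDFS/MapReduce for the batch layer and a stream processor
    # for the speed layer.
    from collections import Counter

    master_dataset = [("page_a", 1), ("page_b", 1), ("page_a", 1)]   # all raw events
    recent_stream  = [("page_a", 1), ("page_c", 1)]                  # not yet in the batch view

    def batch_view(events):
        # Recomputed periodically over the full master dataset (batch layer).
        return Counter(page for page, _ in events)

    def realtime_view(events):
        # Maintained incrementally over recent events only (speed layer).
        return Counter(page for page, _ in events)

    def query(page, batch, realtime):
        # Serving layer: merge both views to answer with low latency.
        return batch.get(page, 0) + realtime.get(page, 0)

    print(query("page_a", batch_view(master_dataset), realtime_view(recent_stream)))   # 3

The design point the sketch highlights is that raw data stays immutable and the views are recomputable, which is what makes the speed layer's incremental results safe to discard once the batch layer catches up.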
The architecture design for traditional data systems has been relatively straightforward. The biggest decisions—reference architecture and major patterns—are already made in traditional DB design: the N-tier client-server architecture is the norm.

4. The veracity of big data brings challenges in data validation, modeling data context, and governance. Big data
comes from everywhere—IoT, social media, web log files, etc.—and much of it could be “dirty” or illegal data whose collection would violate data privacy laws. In traditional DB system development, data collection is mostly internal and validation is relatively simple, as the data are structured. In big data, unstructured data needs contextual information for interpretation. This added complexity also highlights the need for architecture design in the early stages for understanding the data sources (and the cleanliness of each).

5. Extracting value from big data requires capabilities beyond the traditional data warehouse. Every Fortune 500 company now has a data warehouse. Data mining and analytics utilizing structured data in data warehouses have 20 years of established practice. Relational data warehouse design is relatively straightforward, even though the design of ETL (extract, transform, load) processes is not trivial. The 5th V of big data requires integration of existing master data and internal operational data as well as external reference data. The prior 4Vs of big data mean complex integration for extracting value. How to bridge the new and the old also highlights the criticality of architecture choices in big data system development.

Our research goal is to develop a big data system development methodology to help enterprises address these challenges, which are further complicated by rapid technology changes. Our research centers on the “how” questions. Our first research question is how to address these big data system development challenges cost-effectively to achieve business goals. We examine how big data system design tasks would be carried out in each design phase of the traditional DB development life cycle. As argued above, architecture design is critical to big data system development, much more so than for traditional data systems. Our second research question therefore focuses on how to extend existing architectural methods for big data system design. Our third research question is the combination of the first two: how can data modeling and design methods in traditional DB development be extended and integrated with architecture methods for effective big data system development? Based on a design science approach, we employed an empirical case study method to identify issues in practice, and collaboratively explored, developed and validated the results with practitioners. We now detail our research method.

III. RESEARCH METHOD

An embedded case study is a case study containing more than one sub-unit of analysis [18]. The identification of sub-units allows for a more detailed level of inquiry. A case study research method is an empirical inquiry aimed at revealing aspects of a contemporary phenomenon inseparable from its real-life context, and thus difficult to replicate in a laboratory environment. Our motivation for using this method rests on the following: 1) case studies are well suited for “how” research questions, which are our main research foci; 2) exploratory case studies are particularly appropriate for initial evaluation as they allow the course of the study to be adjusted along the way to account for what is learned; and 3) case studies are also the most satisfactory approach when there are many variables of interest and few data points, and where resources do not permit enough replications to isolate individual variables.

We selected a company in the outsourcing industry for the following reason. Finding talent, finding the right tools, and understanding platforms have been cited as the top challenges for big data system development [8]. These challenges, combined with the desire for fast deployment, have forced many firms to turn to outsourcing companies for their big data solutions. Outsourcing helps organizations focus on their core competencies while mitigating shortages of skills and expertise in the areas where they want to grow. Furthermore, big data professional services have been in great demand and have grown rapidly, from $2.8B in 2011 to $10.1B in 2014. A survey of the big data market [11] showed that professional services made up the largest market segment in 2013. This is not unexpected: enterprise practitioners require significant help from professional services organizations to identify use cases, design and deploy systems, and integrate the technology and output into business processes and workflows. This segment is predicted to grow to $17.2B in 2017, exceeding all other categories (hardware, software, cloud, etc.) in the big data market.

Our case company, SSV, provides global outsourcing solutions including Cloud, Mobile, UI/UX, and Big Data/Analytics. Since its founding in 1993, the company has grown to over 3,500 employees and opened multiple offices in the US and Europe. Since its inception, SSV has successfully completed thousands of outsourced projects.

The selection of SSV was somewhat opportunistic. SSV was suitable for our research objectives as it 1) has successfully deployed many big data projects that can be embedded in this case study as second-order multiple studies, and 2) is open to collaboration and accessible.

Our research method in this case study is a particular form of action research called collaborative practice research (CPR) [12]. Our research involved a collaboration between researchers and practitioners to devise an effective big data system development method. The authors are collectively: creators of the architectural methods in use at SSV, big data system architects, and system developers. SSV has been employing Attribute-Driven Design (ADD) [1] from the start for their projects. In early 2014, SSV invited the researchers to participate in the training of their architecture team. Thus, the researcher and practitioner teams came together to evolve the ADD method into a method appropriate for big data system development, called BDD. The researchers have first-hand knowledge of the initiation, planning, and execution of ADD and have worked with other practitioners to improve the method. The practitioner team has managed all the big data projects completed by SSV and had full access to all personnel involved in the projects as well as documentation, including contracts, meeting minutes, archived notes of workshops, and program files associated with each project.

There are two levels of analysis in our embedded case study research. The primary unit of analysis is the big data system development methodology employed by SSV. The second level of analysis is the set of challenges and solutions in each of
SSV’s big data projects. Data was collected for analysis from semi-structured interviews, documentary sources, workshops, meeting notes, e-mail communications, and dialogues with practitioners. We followed the steps of the CPR methodology to create the BDD method. An iteration in the CPR process includes 9 steps: 1) appreciate the problem situation, 2) study the literature, 3) develop a framework, 4) evolve the method, 5) take action, 6) evaluate experiences, 7) exit, 8) assess usefulness, and 9) elicit research results. Six projects started prior to 2014 (see Table 1) were used to understand the problem situation, and six projects started in 2014 or still ongoing were used for developing the research framework, evolving methods, action, evaluation, assessment and eliciting research results. We have gone through several iterations of the CPR cycle to create the BDD method.

IV. CASE ANALYSIS & FINDINGS

SSV started their big data projects in 2010, before the big data megatrend took off. In Table 1, 10 of SSV’s big data projects are listed in chronological order. The order is important to show the progression of technology available and what was selected at the time of development (or redevelopment). In analyzing these 10 projects, we found:

1. 50% of the client companies are large enterprises (>5,000 employees and annual revenue >US$1 billion); 50% are successful Internet companies with annual revenues in the range of US$8-180 million. This is not surprising, given that big data requires substantial upfront investment.

2. All of the client companies have increased their revenue after the deployment of the big data projects. This correlation invites two possible explanations: 1) early adopters of big data technology are more risk-taking and innovative, hence they will always take advantage of new technology to increase their competitive advantage; 2) the successful big data deployments indeed solve the client companies’ problems and achieve their business goals, resulting in increased revenue.

3. The 10 projects range over network security, network administration, email security and spam prevention, a social marketing platform, an e-coupon site, a web analytics platform, a cloud app development platform, and operational intelligence for healthcare fraud detection. These application areas are critical components of the clients’ enterprise systems. This indicates the changing role of outsourcing companies, which have increasingly become strategic partners of enterprises. It also shows that the early adopters of big data concentrate on operational intelligence and customer intelligence.

4. Cloud-based big data deployment is important to satisfy elasticity requirements and hence was adopted by 7 out of 10 projects. Deployment in the cloud is straightforward with US companies but not with European or multi-national companies, due to data governance laws. The public cloud, multi-tenant environment is a constraint on performance.

5. The project scope and use cases are substantial in all 10 cases but have been well-defined in the discovery phase. This is typical in outsourcing agreements and may imply a new big data development lifecycle for enterprises (see Figure 1).

6. Initially, data volume (and then velocity) were the major concerns of the clients. As the technology progressed, the design concerns shifted to data variety and veracity issues. Supporting data analysis with good performance is a common design requirement for all projects.

7. Key requirements center on cost and quality attributes (scalability, availability, consistency, elasticity, low latency, etc.). Initially, open source solutions were important to reduce cost for SSV’s clients, but more expensive components could be selected for performance reasons.

8. Design challenges throughout the projects center around technology selection issues, which affect both the cost and the quality attributes of the selected technologies.

9. SSV, in their early projects, learned that architecture choices cannot be separated from technology selection. This realization has motivated the collaborative development of the BDD method (see the next section and Figure 1).

10. Enterprise (traditional, legacy) data warehouses are not going away soon. The integration of the new and the old is a major design challenge in the vast majority of the projects.

11. A majority of the projects involved updating their technology (which was, in some cases, less than one year old), some staying with the same vendor, some changing to a different platform or vendor.

12. Data modeling to support real-time analytics was the most challenging. The Lambda architecture was employed, but the selection of technology for each component to instantiate the reference architecture remains a challenging task.

13. Data modeling is intertwined with technology selection.

V. ACTION RESEARCH AND RESULTS

In the projects prior to 2014 (the first iteration of the CPR process), to address design challenges, SSV employed an architecture method called ADD that was developed by the Carnegie Mellon Software Engineering Institute (SEI). The first version of ADD (ADD 1.0) was published in January 2000 and the second version (ADD 2.0) was published in November 2006 [1]. ADD is, to our knowledge, the most comprehensive architecture design method available. SSV’s architects took SEI training classes and are well-versed in ADD 2.0.

When ADD appeared, it was the first design method to focus specifically on quality attributes and their achievement through the selection of architectural structures and their representation through views. Another important contribution of ADD is that it includes architecture analysis and documentation as an integral part of the design process. In ADD, design activities may include refining the sketches that were created during early design iterations to produce a more detailed architecture, and also performing a more formal evaluation of the design, perhaps using a method such as the ATAM [3], a well-known architecture evaluation method developed by one of the researchers.
As discussed in Section II, architectural choices for big data systems and the technologies chosen for realizing conceptual system components are critical design considerations. Based on our literature review and experience, and as shown in SSV’s early big data projects, big data system development rests largely on the “orchestration” of a set of technologies. While ADD 2.0 was useful for linking quality attributes to design choices, there were three shortcomings that needed to be addressed for big data system development:

First, ADD 2.0 guides the architect to use and combine tactics and patterns to achieve the satisfaction of quality attribute scenarios. But patterns and tactics are abstractions; the method did not explain how to map these abstractions to concrete implementation technologies.

Second, ADD 2.0 was invented before agile methods became popular and thus did not offer guidance for architecture design in an agile setting.

Third, ADD 2.0 was meant to be general, and hence technology agnostic. While this is good for generality, ADD 2.0 did not explicitly promote the (re)use of reference architectures, which are an ideal starting point for big data system architects.

ADD 3.0—the latest version—was catalyzed by the creation of ADD 2.5 (our own coding notation; the 2.5 number is not used elsewhere). One researcher published ADD 2.5 in 2013 [4], which advocated the use of frameworks such as JSF, Spring, Hibernate, and Axis2 as first-class design concepts. This change was to address ADD 2.0’s shortcoming of being too abstract to easily apply. ADD starts with architecturally significant requirements—drivers and constraints—systematically links them to design decisions, and then links those decisions to implementation options available via frameworks. For agile development, ADD 3.0 promotes design iterations, with each iteration retaining the 6 design steps of ADD 2.5. In addition, ADD 3.0 explicitly promotes the (re)use of reference architectures and is paired with a “technology catalog”. This catalog includes tactics, patterns, frameworks, reference architectures, and technologies. Along with each technology is a rating of its quality attributes such as scalability, modifiability, availability, and performance.

In the second iteration of the CPR process, SSV adopted ADD 3.0 and used their own experience, prototyping, and benchmarking to obtain ratings for each quality attribute of each technology in the catalog. With this catalog, SSV’s design knowledge is preserved and reused for other projects.
knowledge is preserved and reused for other projects.
(including data source quality, data variety, data volume,
ADD 3.0 facilitates the technology selection and agile
velocity, read/write frequency, time to live, queries, OLTP or
orchestration for big data system development, however, the
OLAP, etc.) are captured as inputs for each data source. All
data modeling aspect (hence, the selection of data management
queries are considered in the design input. The big data
components) requires additional effort. As argued in Section II
template is used for recording these design inputs. Each design
and based on the development experiences in SSV, data
input has a direct implication on the subsequent architecture
modeling tasks are tightly coupled with architectural design. In
choices, data model selection, technology selection and data
the third iteration of CPR process, ADD 3.0 was thus
access patterns. In BDD, expert rules are developed and used
integrated with extended data modeling analysis techniques
in linking these design inputs to resulting system architecture
created to form the BDD method, as shown in Figure 1. Note
and data model selections, access pattern and query design for
that the design process of BDD starts with an enterprise’s vaule
meeting quality attributes requirements.
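The sketch below illustrates, under our own assumptions, how design inputs for one data source could be captured following the big data template idea, together with a toy rule that maps inputs to a storage choice. Only a few of the 14 elements are shown, the field names are ours, and the rule is a made-up example rather than one of SSV's expert rules.

    # Illustrative capture of design inputs for one data source, loosely
    # following the big data template elements named above (only a few of the
    # 14 are shown; the rule is a made-up example, not an SSV expert rule).
    from dataclasses import dataclass

    @dataclass
    class DataSourceInput:
        name: str
        variety: str                 # "structured" | "semi-structured" | "unstructured"
        volume_tb: float
        velocity_events_per_sec: int
        read_write_ratio: float      # reads per write
        workload: str                # "OLTP" | "OLAP"
        time_to_live_days: int

    clickstream = DataSourceInput("clickstream", "semi-structured", 50.0,
                                  20_000, 100.0, "OLAP", 365)

    def suggest_storage(src):
        # Toy expert rule linking design inputs to an architecture/storage choice.
        if src.velocity_events_per_sec > 10_000 and src.workload == "OLAP":
            return "land raw events on a distributed file system; serve from a columnar analytical store"
        return "single relational DBMS"

    print(suggest_storage(clickstream))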
In BDD, Design (Steps 1-3) starts with choosing a reference architecture, forming an architecture landscape [7] (encompassing all architecture choices from the big data
scenarios), and sketching data flows in the architecture landscape to form a DFD (Data Flow Diagram) context diagram. In Steps 4-9, the entire system is decomposed and modeled in detail in a number of iterations. Each design iteration establishes iteration goals (Step 4) and then selects a quality attribute, a driver, a concern, or a system component to focus on (Step 5). The selection of design concepts (such as patterns and tactics) and the selection of data models are performed simultaneously in Step 6. In Step 7, the instantiation of architectural elements includes the selection of databases, analytics platforms and other system components. BDD follows ADD’s emphasis on documenting design decisions; in Step 8, BDD adds metadata as part of the design documentation. All architecture views are then sketched and metadata is modeled. In each iteration, the design is evaluated against the iteration goals in Step 9. Once the design iterations are complete, an architecture analysis is performed on the entire system architecture to discover risks and identify design tradeoffs before implementation, in Step 10. In BDD, we employ the BITAM method [6], which extends the ATAM to include business and IT alignment assessment.
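For readers who prefer pseudocode to prose, the following schematic and deliberately simplified Python outline mirrors the iteration structure described above (Steps 1-10). All names, the toy catalog, and the decision record are our own illustrative scaffolding, not BDD tooling or SSV's implementation.

    # Schematic, simplified outline of BDD's design iterations. Illustrative
    # scaffolding only, not BDD tooling or SSV's implementation.
    CATALOG = {  # toy excerpt of a technology catalog: design concept -> candidates
        "stream processing": ["Apache Storm", "Spark Streaming"],
        "batch serving view": ["HBase", "Cassandra"],
    }

    def bdd_design(reference_architecture, drivers):
        # Steps 1-3: choose reference architecture, form landscape, sketch DFD context.
        decisions = {"reference architecture": reference_architecture}
        for goal in drivers:                                  # Step 4: iteration goal
            concern = goal["concern"]                         # Step 5: driver/element to address
            concept = goal["design concept"]                  # Step 6: design concept + data model
            candidates = CATALOG.get(concept, ["TBD"])        # Step 7: instantiate with a concrete technology
            decisions[concern] = {
                "concept": concept,
                "technology": candidates[0],
                "documented views": ["deployment", "data flow"],  # Step 8: views + metadata
                "evaluated against": goal["measure"],             # Step 9: check iteration goal
            }
        decisions["final analysis"] = "BITAM/ATAM review"     # Step 10: whole-architecture analysis
        return decisions

    drivers = [{"concern": "low-latency ingestion",
                "design concept": "stream processing",
                "measure": "p99 end-to-end latency < 1 s"}]
    print(bdd_design("Lambda", drivers))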
Our collaborative development and validation of BDD continues. The successful Phase 1 development of projects (Cases 7-10) validated the benefits of the initial version of BDD. Our method requires architects to have data modeling experience or to work closely with database designers. The continued development of the BDD method suggests several future research directions that may help reduce the design complexity:

1. A rule base for design concept and data model selection;
2. Automation for technology catalog updates;
3. New design patterns and tactics for big data systems;
4. Decision support systems for technology selection;
5. Conceptual modeling of NoSQL databases;
6. Metadata management tools for big data.

VI. CONCLUSION

The embedded case study method, which strikes a balance between relevance and rigor [18], is most appropriate to address our research questions for moving towards an effective big data system development method, even though the generalization of the results beyond the cases implemented by SSV may be limited and researchers’ biases are inevitably embedded in the analyses. The research method we employed allowed us to help SSV serve their clients while collaboratively developing a new method and gathering insights for improving big data system development knowledge and practice.

The output of our collaborative practice research is the BDD method. Big data system development is dramatically different from traditional database system development. Our research both illuminates the challenges for big data system development that stem from the 5Vs of big data and crystallizes the importance of architecture design choices.

We collaboratively developed the BDD method with SSV, extending the architecture design method ADD 3.0. The projects successfully developed by SSV not only provided insights and lessons learned for big data method development but also an empirical validation. The BDD method is the first attempt to systematically combine architecture design with data modeling approaches to address big data system development challenges. The use of reference architectures and a technology catalog are advancements to architecture design methods and are proving to be well-suited for big data system architecture design and system development.

REFERENCES

[1] ADD (Attribute-Driven Design Method), SEI, http://www.sei.cmu.edu/architecture/tools/define/add.cfm.
[2] Brewer, E., “CAP Twelve Years Later: How the ‘Rules’ Have Changed,” Computer, vol. 45, no. 2, pp. 23-29, Feb. 2012.
[3] Bass, L., Clements, P., and Kazman, R., Software Architecture in Practice, 3rd ed., Pearson, 2013.
[4] Cervantes, H., Velasco-Elizondo, P., and Kazman, R., “A Principled Way to Use Frameworks in Architecture Design,” IEEE Software, vol. 30, no. 2, pp. 46-53, 2013.
[5] Cervantes, H., and Kazman, R., Designing Software Architectures: A Practical Approach, Addison-Wesley, 2015, forthcoming.
[6] Chen, H.-M., Kazman, R., and Garg, A., “Managing Misalignments Between Business and IT Architectures: A BITAM Approach,” Journal of Science of Computer Programming, vol. 57, no. 1, pp. 5-26, 2005.
[7] Chen, H.-M., and Kazman, R., “Architecting for Ultra Large Scale Green IS,” Proceedings of GREENS 2012 at the 34th ICSE, Zurich, Switzerland, June 2-9, 2012.
[8] “CIOs & Big Data,” http://visual.ly/cios-big-data, retrieved Sep. 2014.
[9] Elmasri, R., and Navathe, S., Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010.
[10] Gartner, “Survey Analysis: Big Data Investment Grows but Deployments Remain Scarce in 2014,” http://www.gartner.com/document/2841519, September 2014.
[11] Kelly, J., “Big Data Vendor Revenue and Market Forecast 2013-2017,” http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017, Feb. 12, 2014.
[12] Mathiassen, L., “Collaborative Practice Research,” Information Technology & People (14:4), 321-345, 2012.
[13] Marz, N., and Warren, J., Big Data: Principles and Best Practices of Scalable Realtime Data Systems, Manning Publications, 2013 (advanced edition).
[14] Sadalage, P. J., and Fowler, M., NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Pearson Education Inc., 2013.
[15] Singh, A., “Is Big Data the New Black Gold?” Wired, http://www.wired.com/2013/02/is-big-data-the-new-black-gold/, Feb. 13, 2013.
[16] Winshuttle.com, “Big Data and the History of Information Storage,” http://www.winshuttle.com/big-data-timeline/, retrieved Nov. 2014.
[17] Yonego, J. T., “Data Is the New Oil of the Digital Economy,” Wired, http://www.wired.com/2014/07/data-new-oil-digital-economy/, July 23, 2014.
[18] Yin, R. K., Case Study Research: Design and Methods, 5th ed., Newbury Park: Sage Publications, 2009.
Newbury Park: Sage Publications, 2009

Table 1: 10 Outsourced Big Data Projects at Softserve

Case 1: Network Security, Intrusion Prevention (US MNC IT corp.; employees > 320,000)
Start: Late 2010; 8.5 months.
Business goals: Provide the ability for security analysts to improve intrusion detection techniques; observe traffic behavior and make infrastructure adjustments; adjust company security policies.
Big data: Machine-generated data: 7.5 billion event records per day collected from IPS devices; near real-time reporting; reports which "touch" billions of rows should generate in < 1 min.
Technologies: ETL: Talend; storage/DW: InfoBright EE, HP Vertica; OLAP: Pentaho Mondrian; BI: JasperServer Pro.
Challenges: High throughput and differing device data schemas (versions); keeping system performance at the required level when supporting IP/geography analysis (avoiding large table joins); keeping the required performance for complex querying over billions of rows.

Case 2: Anti-Spam Network Security System (US MNC networking equipment corp.; employees > 74,000)
Start: 2012-2013.
Business goals: Validation of the newly developed set of anti-spam rules against the large training set of known emails; detection of the best anti-spam rules in terms of performance and efficacy.
Big data: 20K anti-spam rules; 5M email training set; 100+ nodes in Hadoop clusters.
Technologies: Vanilla Apache Hadoop (HDFS, MapReduce, Oozie, Zookeeper); Perl/Python; SpamAssassin; Perceptron.
Challenges: MapReduce jobs were written in Python using Hadoop Streaming, and the challenge was to optimize job performance; finding the optimal Hadoop cluster configuration to maximize performance and minimize MapReduce processing time.

Case 3: Online Coupon Web Analytics Platform (US MNC; '14 revenue > US$200M)
Start: 2012, ongoing.
Business goals: In-house web analytics platform for conversion funnel analysis, marketing campaign optimization, and user behavior analytics; clickstream analytics and feature usage analysis.
Big data: 500 million visits a year; 25 TB+ HP Vertica data warehouse; 50 TB+ Hadoop cluster; near-real-time analytics platform (15 minutes is supported for clickstream data).
Technologies: Data lake: Amazon EMR/Hive/Hue/MapReduce/Flume/Spark; DW: HP Vertica, MySQL; ETL/data integration: custom, using Python; BI: R, Mahout, Tableau.
Challenges: Minimize transformation time for semi-structured data; data quality and consistency; complex data integration; fast-growing data volumes; performance issues with Hadoop MapReduce (moving to Spark).

Case 4: Social Marketing Analytical Platform (US MNC Internet marketing, user reviews; '14 revenue > US$168M)
Start: 2012, ongoing.
Business goals: Build an in-house analytics platform for ROI measurement and performance analysis of every product and feature of the e-commerce platform; provide analysis of end-users' interaction with service content, products, and features.
Big data: Volume: 45 TB; sources: JSON; throughput: > 20K events/sec; latency: 1 hour for static/pre-defined reports, real-time for streaming data.
Technologies: Lambda architecture; Amazon AWS, S3; Apache Kafka, Storm; Hadoop: CDH 5, HDFS (raw data), MapReduce, Cloudera Manager, Oozie, Zookeeper; HBase (2 clusters: batch views, streaming data).
Challenges: Hadoop upgrade (CDH 4 to CDH 5); data integrity and data quality; very high data throughput caused a challenge with data loss prevention (introduced Apache Kafka as a solution); system performance for data discovery (introduced Redshift, considering Spark); constraints: public cloud, multi-tenant.

Case 5: Cloud-based Mobile App Development Platform (US private Internet co.; funding > US$100M)
Start: 2013; 8 months.
Business goals: Provide a visual environment for building custom mobile applications; charge customers by usage; analysis of platform feature usage by end-users and platform optimization.
Big data: Data volume > 10 TB; sources: JSON; data throughput > 10K events/sec; analytics: self-service, pre-defined reports, ad-hoc; data latency: 2 min.
Technologies: Middleware: RabbitMQ, Amazon SQS, Celery; DB: Amazon Redshift, RDS, S3; Jaspersoft; Elastic Beanstalk; integration: Python.
Challenges: Schema extensibility; minimizing TCO; achieving high data compression without significant performance degradation was quite challenging; technology selection: performance benchmarks and price comparison of Redshift vs. HP Vertica vs. Amazon RDS.

Case 6: Telecom E-tailing Platform (Russian mobile phone retailer; 2013 revenue: 108B rubles)
Start: End of 2013 (discovery only).
Business goals: Build an omni-channel platform to improve sales and operations; analyze all enterprise data from multiple sources for real-time recommendations and cross-/up-selling.
Big data: Analytics on 90+ TB (30+ TB structured, 60+ TB unstructured and semi-structured data); elasticity through SDE principles.
Technologies: Hadoop (HDFS, Hive, HBase); Cassandra; HP Vertica/Teradata; MicroStrategy/Tableau.
Challenges: Data volume for real-time analytics; data variety: data science over data in different formats from multiple data sources; elasticity: private cloud, Hadoop as a service with auto-scale capabilities.

Case 7: Social Relationship Marketing Platform (US private Internet co.; funding > US$100M)
Start: 2013, ongoing (redesign of a 2009 system).
Business goals: Build a social relationship platform that allows enterprise brands and organizations to manage, monitor, and measure their social media programs; build an analytics module to analyze and measure results.
Big data: > 1 billion social connections across 84 countries; 650 million pieces of social content per day; MySQL (~11 TB), Cassandra (~6 TB), ETL (> 8 TB per day).
Technologies: Cassandra; MySQL 5.6/5.1; Elasticsearch; SaaS BI platform: GoodData; Clover ETL plus custom ETL in Java and PHP; Amazon S3, Amazon SQS; RabbitMQ.
Challenges: Minimize data processing time (ETL); implement incremental ETL, processing and uploading only the latest data.

Case 8: Web Analytics & Marketing Optimization (US MNC IT consulting co.; employees > 430,000)
Start: 2014, ongoing (redesign of a 2006-2010 system).
Business goals: Optimization of all web, mobile, and social channels; optimization of recommendations for each visitor; high return on online marketing investments.
Big data: Data volume > 1 PB; 5-10 GB per customer per day; data sources: clickstream data, webserver logs.
Technologies: Vanilla Apache Hadoop (HDFS, MapReduce, Oozie, Zookeeper); Hadoop/HBase; Aster Data; Oracle; Java/Flex/JavaScript.
Challenges: Hive performance for analytics queries, making it difficult to support real-time scenarios for ad-hoc queries; data consistency between two layers: raw data storage in Hadoop and aggregated data in the relational DW; complex data transformation jobs.

Case 9: Network Monitoring & Management Platform (US OSS vendor; revenue > US$22M)
Start: 2014, ongoing (redesign of a 2006 system).
Business goals: Build a tool to monitor network availability, performance, events and configuration; integrate data storage and collection processes with one web-based user interface; IT as a service.
Big data: Collect data in large datacenters (each: gigabytes to terabytes); real-time data analysis and monitoring (< 1 minute); hundreds of device types.
Technologies: MySQL; RRDtool; HBase; Elasticsearch.
Challenges: High memory consumption of HBase when deployed in single-server mode.

Case 10: Healthcare Insurance Operation Intelligence (US health plan provider; employees > 4,500; revenue > US$10B)
Start: 2014; Phase 1: 8 months, ongoing.
Business goals: Operation cost optimization for 3.4 million members; track anomaly cases (e.g., controlling Schedule 1 and 2 drugs, refill status control); collaboration tool between 65,000 providers (delegation, messaging, reassignment).
Big data: Velocity: 10K+ events per second; complex event processing: pattern detection, enrichment, projection, aggregation, join; high scalability, high availability, fault tolerance.
Technologies: AWS VPC; Apache Mesos, Apache Marathon, Chronos; Cassandra; Apache Storm; ELK (Elasticsearch, Logstash, Kibana); Netflix Exhibitor; Chef.
Challenges: Technology selection constrained by HIPAA compliance: SQS (selected) vs. Kafka; resource optimization: extending/fixing open source frameworks; 90% utilization ratio; constraints: AWS, HIPAA.

