Selecting the Right Cloud Data Warehouse for Analytics

Created by

Table of Contents

Getting Started
What is a data warehouse and why do you need one?

Evaluation Criteria
Data types

How Segment Can Help
How do you plan to Extract, Transform, and Load (ETL) data into your warehouse?

Selecting the Right Cloud Data Warehouse for Analytics Table of Contents

Getting Started

What is a cloud data warehouse?

You can think of a data warehouse as a home for all of your data. Companies use a data warehouse to aggregate data from a number of different data sources so it’s easy to analyze. A cloud data warehouse, as you may have guessed, is one that lives entirely online. Unlike on-premise data warehouses, cloud-based data warehouses don’t require any physical hardware and are much easier to implement and scale.

Data Sources: Where you collect customer data, like your website, app, servers, or cloud apps.

Data Warehouse: Where you store your raw customer data.

BI Tool: A tool used to analyze your raw data, typically with ad hoc or scheduled queries.

Clusters & Nodes: Components of a warehouse that increase storage and compute power. A cluster is typically made up of a number of nodes.


Why do I need one?

With a data warehouse, you have the ultimate flexibility for how you store and later query your data, better enabling you to answer any analytics questions. The reports and analyses you run can include elements from each and every one of the data sources you combine. You can analyze data from your website and app, as well as other platforms you use like Salesforce, Zendesk, Stripe, and others. When you have all of your data in one place, you can easily run queries directly in your warehouse or through a BI tool like Tableau, Looker, or Mode.

Standard analytics tools, like Google Analytics, give you a good sense of what actions customers are taking on your website or app. However, with an out-of-the-box tool, you’re limited to asking questions that can be answered with the number of variables, properties, and types of charts available.

You should consider a data warehouse if you want to:

• Centrally store all of your business-critical data
• Analyze your web, mobile, CRM, and other applications together in a single place
• Dive deeper than traditional analytics tools by querying raw data with SQL
• Provide multiple people access to the same data set simultaneously

With clean, accurate, and complete data, you’ll be prepared to answer business-critical questions: How much revenue is at stake from unanswered support tickets? What percent of customers renew after using a certain set of product features? How much revenue can we expect next quarter from a specific category or product?
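
As a rough illustration of the first question above, the sketch below joins billing data with support data in a single SQL query. It uses an in-memory SQLite database as a stand-in for a cloud warehouse. The table names (`accounts`, `tickets`), the column names, and every row are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory SQLite stands in for a cloud data warehouse.
# All tables, columns, and rows below are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, name TEXT, monthly_revenue REAL);
CREATE TABLE tickets (ticket_id INTEGER PRIMARY KEY, account_id INTEGER, status TEXT);
INSERT INTO accounts VALUES (1, 'Acme', 500.0), (2, 'Globex', 1200.0), (3, 'Initech', 300.0);
INSERT INTO tickets VALUES
    (10, 1, 'unanswered'), (11, 1, 'unanswered'),
    (12, 2, 'solved'), (13, 3, 'unanswered');
""")


def revenue_at_stake(connection):
    """Sum monthly revenue of accounts with at least one unanswered ticket.

    The IN subquery counts each account once, even if it has several
    unanswered tickets.
    """
    row = connection.execute("""
        SELECT COALESCE(SUM(monthly_revenue), 0)
        FROM accounts
        WHERE account_id IN (
            SELECT account_id FROM tickets WHERE status = 'unanswered'
        )
    """).fetchone()
    return row[0]


print(revenue_at_stake(conn))
```

The point of the sketch is that the billing rows and the support rows come from different sources, yet one query can combine them once they live in the same place.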

Armed with the right insights, you will be able to more effectively drive product design and development, evaluate marketing campaign effectiveness, and spot potential issues in your user experience.


Criteria for Your Data Warehouse

Once you’ve decided that a data warehouse is necessary for your team’s needs, there are a number of important factors to consider when making a selection.

• Data types: What type of data you want your warehouse to store
• Scale: The amount of data you plan to store
• Maintenance: How much engineering effort you’re willing and able to dedicate to your warehouse
• Performance: How quickly you need your data when you query it
• Cost: How much you are willing to spend on your data warehouse
• Community: How connected your warehouse is to other critical tools and services

Note: Segment supports the following cloud-based data warehouses: Amazon Redshift, Google BigQuery, IBM Db2 Warehouse, Postgres, Snowflake.

It’s important to consider your use case when selecting a data warehouse. Your specific use case will determine the importance of each of these factors. You should also keep in mind that many of the factors listed will directly influence one another and tradeoffs may be necessary. For example, opting for less scale may decrease performance but will typically be more cost-effective. Throughout the selection guide we’ll highlight use cases that will help you optimize for each factor.
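
One way to make these tradeoffs explicit is a simple weighted scoring matrix. The sketch below is not part of the guide's method; it only illustrates the idea that your use case sets the weight of each factor. The candidate names (`warehouse_a`, `warehouse_b`), the weights, and the 1-to-5 scores are all hypothetical placeholders you would replace with your own assessment.

```python
# The six evaluation criteria named in this guide.
CRITERIA = ("data_types", "scale", "maintenance", "performance", "cost", "community")


def weighted_score(weights, scores):
    """Return the weighted average of per-criterion scores."""
    total_weight = sum(weights[c] for c in CRITERIA)
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    return sum(weights[c] * scores[c] for c in CRITERIA) / total_weight


def rank(weights, candidates):
    """Order candidate names from best to worst weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(weights, candidates[name]),
        reverse=True,
    )


# Hypothetical use case: a small team that cares most about cost.
weights = {"data_types": 1, "scale": 1, "maintenance": 3,
           "performance": 1, "cost": 5, "community": 1}

# Hypothetical 1-5 scores; these say nothing about any real product.
candidates = {
    "warehouse_a": {"data_types": 3, "scale": 3, "maintenance": 2,
                    "performance": 3, "cost": 5, "community": 3},
    "warehouse_b": {"data_types": 4, "scale": 4, "maintenance": 5,
                    "performance": 4, "cost": 2, "community": 4},
}

print(rank(weights, candidates))
```

Changing the weights (for example, raising `maintenance` above `cost`) can flip the ranking, which is the tradeoff the paragraph above describes.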



Data types

The first step in understanding your data warehouse needs is to determine what “type” of data you will store in it. For a warehouse, there are two main “types” of data: structured and unstructured.

| Criteria | Relational | Non-Relational |
| --- | --- | --- |
| Type of data | Structured | Unstructured |
| Would fit in | Massive Excel sheet | Word doc |
| The schema | Stays the same | Changes often |
| Works well with data like | User data, inventory | Email content, photos, videos |
| For analysis like | User paths, funnel analysis | Text mining, language processing |
| Can query with | SQL | MapReduce, Python |

A relational database works well with structured data or data that fits nicely into rows and columns. If your data could be organized into one extra large spreadsheet, then a relational data warehouse would be a good fit for your company.

A non-relational database excels with extremely large amounts of semi-structured data. Classic examples of semi-structured data are emails, books, social media posts, audio/visual data, and geographical data. You should consider a data lake over a classic data warehouse if you are working with purely unstructured data.

Analyzing web or mobile behavior: A relational data warehouse works best when it knows exactly what kind of data to expect and how that data links together. This is why data that fits nicely into a table works well in a relational data warehouse. That type of data includes “user traits” such as names, emails, and billing plans, and “user events” such as clicks, page views, and purchases. If your data mostly stays the same or follows a specific set of user paths, a relational data warehouse will be the clear way to go.

Analyzing language, text, or images: If you’re doing a large amount of text mining, language processing, or image processing, you’ll need to consider a non-relational database. Depending on the type of data you are collecting, a data warehouse that supports semi-structured data, like Snowflake, would also work.
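
To make the structured versus semi-structured distinction concrete, the sketch below maps a nested JSON event onto a fixed set of columns, the shape a relational warehouse expects. The event payload, its field names, and the chosen columns are hypothetical; real event schemas will differ.

```python
import json

# Hypothetical fixed schema for a relational "events" table.
COLUMNS = ("user_id", "event", "plan", "page")


def flatten_event(raw_json):
    """Map a nested event payload onto the fixed columns above.

    Fields outside the schema are dropped and missing fields become None,
    which is the cost of forcing flexible data into rows and columns.
    """
    payload = json.loads(raw_json)
    traits = payload.get("traits", {})
    properties = payload.get("properties", {})
    return (
        payload.get("user_id"),
        payload.get("event"),
        traits.get("plan"),
        properties.get("page"),
    )


# Hypothetical semi-structured input: nested, with an extra "referrer" field.
raw = (
    '{"user_id": "u1", "event": "Page Viewed", '
    '"traits": {"plan": "pro"}, '
    '"properties": {"page": "/pricing", "referrer": "ad"}}'
)

row = flatten_event(raw)
print(dict(zip(COLUMNS, row)))
```

User traits and user events like these flatten cleanly. Free-form text, images, or payloads whose shape changes often do not, which is where non-relational storage becomes the better fit.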



Scale

The next consideration is how much data you’re accessing and the scale of data that your warehouse needs to support. Relational cloud data warehouses, like Redshift, Snowflake, BigQuery, and Db2 Warehouse, are able to store massive amounts of data without much overhead cost. Most companies won’t need scale beyond what those warehouses can deliver, especially if analytics is the primary use case. However, in cases where extreme scale is needed (greater than 2 petabytes of data), a non-relational warehouse will typically be a better fit because it won’t impose constraints on incoming data, allowing you to write faster.

There aren’t strict limitations as to how much data each warehouse can handle. However, we’ve found that each excels within certain bands.

Database Options by Scale

| Data Size | < 1 TB | 2 - 64 TB | 64 TB - 2 PB+ |
| --- | --- | --- | --- |
| Database That’s a Good Fit | Postgres, MySQL | Amazon Aurora | Amazon Redshift, Google BigQuery, Snowflake, IBM Db2 Warehouse |

You’ll also want to consider how a particular warehouse scales during times of demand. For example, Redshift can support massive amounts of data but will require you to manually add more nodes (for added storage and compute power). Snowflake, on the other hand, offers an auto-scale function which spins clusters up and down dynamically, as needed. BigQuery manages resources automatically, invisibly to the user, to meet additional needs, and also offers a free batch ingest method that doesn’t compete with query capacity. Each offers impressive scale but works slightly differently.

Collecting data from a few sources: Scale will be less important to you if you have a smaller data set and only a few people responsible for querying the data. As a benefit, lower scale means the warehouse will be more cost-effective.

Collecting data from all sources: You should consider a warehouse specifically built for large scale if you have over a terabyte of data and have multiple users querying your data simultaneously. Larger scale means that you won’t have any issues storing your data or keeping your queries fast.
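
The "Database Options by Scale" bands can be written as a small lookup. This is only a sketch of the table in this section, not a sizing rule. The function name `suggest_by_scale` is made up for the example, and two boundary choices are assumptions: the table leaves 1 to 2 TB unassigned and lists 64 TB in two bands, so the code places both in the middle band.

```python
def suggest_by_scale(size_tb):
    """Return the databases this guide lists for a given data size in terabytes.

    Assumptions (not stated in the guide): sizes from 1 TB up to and
    including 64 TB fall in the middle band.
    """
    if size_tb < 0:
        raise ValueError("data size cannot be negative")
    if size_tb < 1:
        return ["Postgres", "MySQL"]
    if size_tb <= 64:
        return ["Amazon Aurora"]
    return ["Amazon Redshift", "Google BigQuery", "Snowflake", "IBM Db2 Warehouse"]


for size in (0.5, 10, 500):
    print(size, "TB ->", suggest_by_scale(size))
```

Because the guide itself says there are no strict limits, treat the output as a starting point and weigh it against how each warehouse scales during peak demand.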



Maintenance

Typically, relational data warehouses take less overall management than non-relational ones. The smaller your overall team, the more likely it is that you’ll need your engineers focusing on building product rather than ETL pipelines and day-to-day management of your warehouse. For data warehouses that aren’t self-optimizing, you’ll need to have someone spend time vacuuming, resizing, and monitoring the cluster to ensure performance remains strong. However, maintaining a warehouse manually allows you to optimize it precisely for your company’s needs.

Working with structured data comes with another advantage: you can use SQL to query it. SQL is well-known among analysts and engineers, and it’s easier to learn than most programming languages. Running analytics on semi-structured data generally requires an object-oriented programming background or a code-heavy data science background. Even with the emergence of analytics tools, like Splunk for Hadoop or Slamdata for Mongo, analyzing these types of warehouses requires special skills and far more data maintenance.

You have the time and skill to customize: If you have a very technical team or someone experienced with maintaining a warehouse, a data warehouse requiring more maintenance could be a good fit. More time spent manually tuning and scaling your data warehouse will mean you have greater control over performance and cost. To an experienced warehouse admin, “more maintenance” means more flexibility and control.

No maintenance required: These warehouses will be a good fit for a smaller team or a team that doesn’t want to dedicate time to tuning. What’s sacrificed in customization is gained in ease of use and consistent performance.



Performance

The next thing to consider is how quickly you’ll need your data. This comes down to how fast your queries can run and how you maintain that speed in times of high demand. As you can imagine, performance and scale are closely connected. With some warehouses, such as Amazon Redshift, performance increases as you scale up the size of your warehouse or add nodes. Snowflake automatically adds more capacity as needed with its multi-cluster architecture. Other data warehouses, like BigQuery, offer APIs to help deliver real-time analysis.

A data warehouse’s architecture will have a big impact on how it’s able to handle running multiple queries at the same time. Some warehouses, like Snowflake and IBM Db2 Warehouse, are able to auto-scale to meet the demands of multiple team members querying data at the same time. Others, like Redshift, require you to manually add new nodes to expand a cluster in times of high demand.

While real-time analytics is critical for some use cases, most analyses don’t require real-time data or immediate insights. When you’re answering questions like what is causing users to churn or how people are moving from your app to your website, accessing your data with a slight lag is fine. Your data doesn’t change that much minute-by-minute and your ability to follow bigger trends won’t be impacted.

Fast, after-the-fact reporting: Any cloud data warehouse will enable you to understand customer behavior after the fact. If you plan on having multiple users querying your data simultaneously, you’ll need a data warehouse that offers tools that are specifically designed to keep queries fast.

Real-time reporting: Even the fastest warehouse is not well suited for real-time analytics. However, if you absolutely need real-time data for something like fraud reporting or system monitoring, you should look at an unstructured solution such as Hadoop. You can design it to load very quickly, though queries may take longer at scale depending on a number of factors including RAM usage, available disk space, and how you structure the data.



Cost

Cost will be one of the most important considerations for your data warehouse. It’s also one of the most variable, depending on your company’s specific use case for a warehouse. Typically, with a data warehouse, you’ll be required to pay for some combination of storage, size of the warehouse, run time, or queries.

If you are constantly running queries on the data, you’ll want to look for a solution that has a lower compute cost. If you have a lot of data but only have one team using it, you’ll want to look for a solution that has low storage costs. A benefit of all cloud-based, relational data warehouses is that storage costs are typically very low and there’s not a huge upfront cost to purchase, install, and configure the solution. Also, unlike a semi-structured warehouse or data lake solution that stores all your data in an unstructured way, far fewer resources are required to format your data in a way that enables you to analyze it.

Here’s the cost structure of five popular relational, cloud-based data warehouses:

| | Amazon Redshift | Google BigQuery | Snowflake | IBM Db2 Warehouse | Postgres |
| --- | --- | --- | --- | --- | --- |
| Cost Structure | Pay per hour based on nodes or per bytes scanned | Pay per query or flat rate | Pay for storage and compute time | Pay for storage and compute time | Free |

No budget and limited planned usage: This is a perfect example of when Postgres could be a good choice for a business. It does much of what the other warehouses listed above do, but you don’t have to pay to use it. Also, when it’s time to upgrade your solution, it’s fairly straightforward to switch (especially with a solution like Segment). You could also look into BigQuery’s free option.

Querying your data constantly: In the case where you’ll be querying your data all the time, you’ll want to look for a solution, like Redshift, that charges purely on storage rather than the number of queries or overall compute time. BigQuery’s flat rate pricing option may also be a good solution here. You could also consider Redshift’s Reserved Instance Pricing, which offers reduced compute pricing with longer contracts.
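
The choice between per-query and flat-rate pricing comes down to a break-even calculation. The sketch below shows that arithmetic only. The function names are invented for this example, and the prices (5.0 per terabyte scanned, 2000.0 per month flat) are hypothetical placeholders, not any vendor's actual rates. Check current pricing pages before deciding.

```python
def per_query_cost(tb_scanned_per_month, price_per_tb):
    """Monthly cost when billed by the volume of data your queries scan."""
    return tb_scanned_per_month * price_per_tb


def break_even_tb(price_per_tb, flat_monthly_fee):
    """Monthly scan volume at which per-query cost equals the flat fee."""
    return flat_monthly_fee / price_per_tb


def cheaper_plan(tb_scanned_per_month, price_per_tb, flat_monthly_fee):
    """Name the cheaper billing model for a given monthly scan volume."""
    on_demand = per_query_cost(tb_scanned_per_month, price_per_tb)
    return "flat rate" if flat_monthly_fee < on_demand else "pay per query"


# Hypothetical prices, for illustration only.
PRICE_PER_TB = 5.0
FLAT_MONTHLY_FEE = 2000.0

print(break_even_tb(PRICE_PER_TB, FLAT_MONTHLY_FEE))       # scan volume where plans tie
print(cheaper_plan(100, PRICE_PER_TB, FLAT_MONTHLY_FEE))   # light querying
print(cheaper_plan(1000, PRICE_PER_TB, FLAT_MONTHLY_FEE))  # constant querying
```

With these placeholder numbers, light querying favors paying per query and constant querying favors the flat rate, which matches the guidance in this section.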



Community

If you’re already using a number of tools in your organization (which we guess you are), you’ll want to make sure whatever you choose to incorporate works well with your existing technology stack. Not only will it make implementation easier, but it will also save your team from having to develop multiple custom ETL pipelines to get your data where it needs to be. (You still may need to write a custom ETL to get data into your warehouse.)

Inside the AWS ecosystem: Redshift is Amazon’s core data warehousing product and one of the most popular cloud data warehouses available. The benefit of staying within the Amazon ecosystem is that you could also easily use products like Kinesis Firehose for loading data into a warehouse, S3 for raw storage, and Redshift Spectrum for querying across Redshift and S3.

Inside the Google ecosystem: As expected, this would be a great use case for BigQuery. Other warehouses would work too, but the ease of use between Google products may make BigQuery the right choice for you.

Using multiple varied tools: As an open source tool, Postgres has one of the largest developer ecosystems around it and has connections built with thousands of other products, making it very extensible.


Quick Reference Guide

Compare the different cloud-based data warehouses that are available via Segment.

| Criteria | Redshift | BigQuery | Snowflake | IBM Db2 Warehouse | Postgres |
| --- | --- | --- | --- | --- | --- |
| Data Types | Structured | Structured and semi-structured | Structured and semi-structured | Structured | Structured |
| Scale | Scales horizontally by manually adding new nodes | Automatically resizes your warehouse without storage limits | Automatically scales to keep queries fast | Automatically scales by adding new nodes as needed | Requires manual data partitioning to scale effectively |
| Performance | You can manually add nodes to keep queries fast | Uses available resources as needed; no configs to improve speed | Clusters automatically spin up and down depending on usage | Auto-scales clusters when needed to keep queries fast | Generally run on a single machine |
| Maintenance | Requires some manual maintenance | Fully managed | Fully managed | Fully managed | Requires manual maintenance |
| Cost | Pay per hour based on nodes or per bytes scanned | Pay per query (more data drives cost per query up) and flat rate | Pay for storage and compute time | Pay for storage and compute time | Free |
| Community | AWS ecosystem | Google Cloud Platform ecosystem | Enables data sharing across Snowflake | IBM ecosystem | Large ecosystem of compatible products |


How Segment Can Help

Now that you have a better idea of what data warehouse to use, the next step is figuring out
how you’re going to get your data into the warehouse in the first place. Many people that
are just starting to use a data warehouse underestimate just how hard it is to build a
scalable data pipeline. You have to write your own extraction layer, data collection API,
queuing, and transformation layers. Each of those functions has to scale and typically
requires about 100 hours per month of maintenance work. Plus, you need to determine
the right schema down to the size and type of each column.
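
To give a sense of the moving parts, here is a deliberately tiny extract, transform, load sketch. It is an illustration under stated assumptions, not how Segment or any production pipeline works: it has no queuing, retries, or scaling, SQLite stands in for the warehouse, and the event fields, the `events` table, and its column types are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical raw events, as they might arrive from a collection API.
RAW_EVENTS = [
    {"user_id": "u1", "event": "Order Completed", "revenue": "19.99",
     "ts": "2024-01-05T10:00:00+00:00"},
    {"user_id": "u2", "event": "Page Viewed", "ts": "2024-01-05T10:01:00+00:00"},
    {"event": "Page Viewed", "ts": "2024-01-05T10:02:00+00:00"},  # no user_id
]


def extract():
    """Extraction layer: yield raw events from the source."""
    return iter(RAW_EVENTS)


def transform(event):
    """Transformation layer: validate and coerce types; None means rejected."""
    if not event.get("user_id") or not event.get("event"):
        return None  # keep "unclean" rows out of the warehouse
    revenue = float(event["revenue"]) if "revenue" in event else None
    ts = datetime.fromisoformat(event["ts"]).astimezone(timezone.utc).isoformat()
    return (event["user_id"], event["event"], revenue, ts)


def load(conn, rows):
    """Load layer: the schema fixes the type of every column up front."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "user_id TEXT NOT NULL, event TEXT NOT NULL, "
        "revenue REAL, received_at TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run all three stages and return the number of rows loaded."""
    rows = [row for row in map(transform, extract()) if row is not None]
    load(conn, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = run_pipeline(conn)
print(loaded, "rows loaded")
```

Even at this size, each stage encodes decisions (what counts as valid, which column types to use) that have to be maintained as sources change, which is where the ongoing effort comes from.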

With Segment, you can bypass these hurdles of turning on a new data warehouse. Once
integrated, Segment will automatically do the ETL and set up the schema for you.
Moreover, Segment also provides tools that help prevent unwanted or “unclean” data from
ever reaching your data warehouse in the first place. This way, your data warehouse stays
fast and your analytics stay accurate.

However you do it, getting your data into a format where it can be holistically analyzed is
essential. Only by having your raw user data in a flexible, SQL format can you answer
granular questions about what your customers are doing across all platforms while
accurately measuring attribution, and building company-specific dashboards.

