
Improving the performance of ad-hoc

analysis of large datasets


About this document
This document describes the approach and results of an evaluation of the capabilities of Infobright Community
Edition (ICE) versus a traditional MySQL InnoDB database when performing summary and group
analysis on large¹ data sets. This document is not a full evaluation of ICE, nor is it an
endorsement of the product.

Situation
Most organisations have at least one data warehouse or data mart containing business data
specific to a department. These databases typically feed management information (MIS) and/or
business intelligence (BI) solutions and, in larger organisations, are usually relational data stores
optimised to perform particular tasks².

Business users often want to perform additional analysis on the data in the warehouse or mart in
order to gain insights into customer or employee behaviour. Examples might be: “Who are my
top 10 customers buying widgets in the following regions over the past six months?”; “Which
employees over director grade and in the IT department spend the most on employee benefits?”;
“Which customers using the Safari browser who click on the Swedish landing page go on to spend
over 100 kronor?”.

The problem
This desire to perform ad-hoc analysis or data mining can lead to difficulties for the teams that own
and provide access to the data.

This is because data marts are usually optimised for a particular set of use cases and hence are
aggregated and indexed on the dimensions that match those use cases. A Sales data mart may be
built to query on dimensions of product code, region and sales manager, but may not be geared up to
answer queries about the marketing campaign code of a product. The data warehouse itself (if a
traditional warehouse) will not be optimised along any dimensions.

For this reason, users are often discouraged or prevented from performing this type of analysis on
data warehouses. If they are allowed access there are two opposing factors:

• Long response times to ad-hoc queries lead to a poor user experience

• Database optimisations (indexes and aggregate tables) greatly increase the amount of
storage required³

Reason for this evaluation


ICE is of potential interest to us because it provides a platform we can enhance with reporting and
alerting interfaces that perform deep, sophisticated analytics, giving users access to information
they could not previously obtain.

¹ 1 million plus rows of data.
² Smaller organisations often have their data warehouse made up of one or more spreadsheets.
³ This has a knock-on effect of increasing the time and complexity required to populate the
database.
Several of our current clients would benefit from being able to mine their data marts in an efficient
and productive (from a user experience perspective) manner.

About Infobright
Infobright is a database designed for analytical queries. It is built on MySQL but uses its own
storage engine, Brighthouse, rather than one of the standard storage engines (e.g. MyISAM,
InnoDB).

Infobright does not use indexes or aggregate tables. Instead it is a column-oriented (columnar)
database, storing data by column rather than by row, which makes it better suited to aggregate analytics.
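A toy illustration (in Python, not Infobright itself) of why a columnar layout favours aggregates: summing one column in a row store means touching every field of every row, whereas a column store reads just one contiguous array.

```python
# The same three orders stored row-wise and column-wise.
# An aggregate such as SUM(quantity) reads 12 fields in the
# row layout but only 3 values in the column layout.
rows = [
    {"customer": "A", "product": "widget", "quantity": 5, "revenue": 50.0},
    {"customer": "B", "product": "widget", "quantity": 3, "revenue": 30.0},
    {"customer": "A", "product": "gadget", "quantity": 2, "revenue": 80.0},
]

# Column-oriented layout: one array per column.
columns = {
    "customer": ["A", "B", "A"],
    "product":  ["widget", "widget", "gadget"],
    "quantity": [5, 3, 2],
    "revenue":  [50.0, 30.0, 80.0],
}

row_total = sum(r["quantity"] for r in rows)  # scans every row in full
col_total = sum(columns["quantity"])          # scans a single column
assert row_total == col_total == 10
```

The same effect at scale (plus per-column compression of similar values) is what lets a columnar engine answer "top 10 by quantity" queries without indexes.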

This is for the most part invisible to the user (depending on which edition is used) and Infobright can
be accessed through the same clients used for a regular MySQL instance.

Infobright comes in two flavours. The Community Edition (ICE) is Open Source Software and the
Enterprise Edition (IEE) is a commercial product. The chief differences between the two offerings are
support for data loading and DML (i.e. INSERT, UPDATE, DELETE).

Evaluation
We performed a limited evaluation to determine whether ICE would provide benefits in a real-life
situation.

We used data from a warehouse belonging to one of our clients and worked with them to
understand analysis they would like to perform but have so far been unable to.
The data and problem domain have been anonymised and made generic within this report to protect
client confidentiality.

The key principles for the evaluation were:

• Use real data volumes

• Ask real questions of the data

Aim
The aim of the evaluation was to understand how an Infobright Community Edition (ICE)
database compared to a standard MySQL database (using the InnoDB storage engine) along the
following dimensions:

• User response times to sample queries

• Storage space required by the database

Specifications
Tests were performed on a developer’s desktop machine:

• Pentium Dual-core 2.16 GHz, 3 GB RAM, Windows XP Professional

• MySQL Community Edition 5.1

o Using InnoDB
• Infobright Community Edition 3.3.1

• HeidiSQL was used to run the queries

• Approximately a year’s worth of historical data was loaded into the databases. This equated
to 1.3 million rows.

Approach
In both cases the databases were loaded with approximately a year’s worth of data – 1,291,062 rows.

The time taken to load the databases was not compared, as ICE only allows loading from flat files⁴,
although as a note it took 1 minute 29 seconds to load the data into ICE.

The queries
Four queries were used to compare the approaches. These were based on real conversations with our
clients.

1. Top 10 customers by quantity purchased

2. Top 10 customers by revenue

3. Top 10 customers with revenue between 300K and 600K

4. Top 10 customers by quantity between January 2009 and April 2009
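As an illustration, the first query ("top 10 customers by quantity purchased") is a standard GROUP BY / ORDER BY / LIMIT statement. The real warehouse schema is confidential, so the table and column names below (`sales`, `customer_id`, `quantity`, `revenue`, `sale_date`) are invented for illustration, and SQLite stands in for MySQL/ICE; the shape of the SQL is the same.

```python
import sqlite3

# In-memory stand-in for the warehouse fact table (invented schema).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    customer_id TEXT, quantity INTEGER, revenue REAL, sale_date TEXT)""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("C1", 5, 500.0, "2009-01-15"),
     ("C2", 9, 900.0, "2009-02-10"),
     ("C1", 3, 300.0, "2009-03-05"),
     ("C3", 2, 200.0, "2009-04-20")])

# Query 1: top 10 customers by quantity purchased.
top = conn.execute("""
    SELECT customer_id, SUM(quantity) AS total_qty
    FROM sales
    GROUP BY customer_id
    ORDER BY total_qty DESC
    LIMIT 10""").fetchall()
print(top)  # → [('C2', 9), ('C1', 8), ('C3', 2)]
```

Queries 2–4 follow the same pattern, swapping `SUM(quantity)` for `SUM(revenue)`, adding a `HAVING` clause for the revenue band, or adding a `WHERE` clause on `sale_date` for the January–April range.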

Test 1: Comparing storage requirements


In this case, the same data was loaded into both databases, but the InnoDB database had no
optimisations applied (i.e. no keys, indexes, aggregates, etc.). This was in order to limit the space
used to the data alone.

Although response times were noted and are shown in the results, the InnoDB database would be
expected to perform sub-optimally here, as every query requires a full-table scan.

Test 2: Comparing response times


The second test compared the performance of an ICE database against that of an optimised
InnoDB database. The optimisations made were crude; in real life, more time and experimentation
would be spent optimising the database.

Two separate summary tables were created: the first supported the customer queries (providing an
optimised route for three of the four queries) and the second was a summary date-range table
containing only the dates needed by the date-range query.

In reality, more data would be needed (as the date ranges would need to cover the whole year),
and it is unlikely that a summary table would exactly match an arbitrary date range in a query.
This weighted the comparison in favour of the traditional database.
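A sketch of the kind of per-customer summary table used in this test, again with invented names (the real schema is confidential) and SQLite standing in for InnoDB: the aggregate is computed once at load time so the customer queries avoid a full scan of the fact table.

```python
import sqlite3

# Toy fact table plus a pre-aggregated customer summary (invented schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (customer_id TEXT, quantity INTEGER, revenue REAL);
INSERT INTO sales VALUES ('C1', 5, 500.0), ('C1', 3, 300.0), ('C2', 2, 200.0);

-- Customer summary: one row per customer, built once when loading.
CREATE TABLE customer_summary AS
SELECT customer_id,
       SUM(quantity) AS total_qty,
       SUM(revenue)  AS total_revenue
FROM sales
GROUP BY customer_id;
""")

# The "top customers" queries now read the small summary table instead
# of scanning 1.3 million fact rows.
row = conn.execute(
    "SELECT total_qty, total_revenue FROM customer_summary "
    "WHERE customer_id = 'C1'").fetchone()
print(row)  # → (8, 800.0)
```

The cost of this approach is visible in the Test 2 storage figures: each summary table adds to the InnoDB footprint and must be maintained whenever new data is loaded.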

⁴ IEE allows population through more means (e.g. using DML, binary dumps rather than ASCII). See
more at http://bit.ly/aXQvKM
Results

Test 1: ICE compared to a non-optimised InnoDB database


Storage space
Infobright needed 17.7 MB to store the 1,291,062 rows versus 203.8 MB needed by InnoDB.

[Chart: storage comparison (MB) for 1.3 million rows, Brighthouse (ICE) vs InnoDB]

Response times
Query                                                Infobright (s)   InnoDB (s)   x faster
Top 10 customers by quantity                                  3.828      147.781         39
Top 10 customers by revenue                                   7.734      124.703         16
Top 10 customers with revenue between 300K and 600K           8.109      160.094         20
Top 10 customers by quantity between Jan and Apr              1.235       21.703         18
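As a check on the "x faster" column: it is simply the InnoDB time divided by the Infobright time, rounded to the nearest whole number, using the timings (in seconds) from the table above.

```python
# (query, Infobright seconds, InnoDB seconds) taken from the Test 1 table.
timings = [
    ("top 10 by quantity",           3.828, 147.781),
    ("top 10 by revenue",            7.734, 124.703),
    ("top 10 revenue 300K-600K",     8.109, 160.094),
    ("top 10 by quantity (Jan-Apr)", 1.235,  21.703),
]
speedups = [round(innodb / ice) for _, ice, innodb in timings]
print(speedups)  # → [39, 16, 20, 18]
```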


[Chart: query timing comparison in seconds between ICE and MySQL InnoDB for the four queries]

Test 2: ICE compared to an optimised InnoDB database


Storage space
[Chart: storage comparison (MB) for 1.3 million rows, Brighthouse (ICE) vs InnoDB, with the
InnoDB bar broken down into data, customer summary and date summary]

Response times
Query                                                Infobright (s)   InnoDB (s)   x faster
Top 10 customers by quantity                                  2.890       18.657          6
Top 10 customers by revenue                                   7.969       19.297          2
Top 10 customers with revenue between 300K and 600K           5.782       20.703          4
Top 10 customers by quantity between Jan and Apr              1.610        9.062          6


[Chart: query timing comparison in seconds between ICE and the optimised MySQL InnoDB
database for the four queries]

Conclusions
Infobright Community Edition was significantly faster and required less storage than an unoptimised
MySQL InnoDB database.

Where the database had been optimised, and optimised specifically for only the queries that were
run, ICE was still faster, although the speedup was an order of magnitude smaller than against the
unoptimised database.

In both cases, ICE used significantly less storage.

ICE does not allow data to be loaded through standard DML (INSERT, UPDATE, DELETE) and hence
is only a suitable choice when the data can easily be exported into CSV format. Where this is
possible, however, it is very easy to load the data and run queries against it in a short space of time.
This makes ICE suitable for ad-hoc analytical exercises (e.g. mining a dataset) as well as for
building and querying a warehouse.
