Sei sulla pagina 1di 41

Financial Informatics:

Startup, low-cost, dataload Challenges and Solutions

Presented by,
MySQL & O’Reilly Media, Inc.
What are we talking about today?
 Financial Data, more specifically stock market
data as an example
 The basic design of a MySQL database that
contains a daily history of stock prices
 Building a stock machine and some of the
challenges posed
 Some large data ‘gotchas’ and solves
 Some large mysql ‘gotchas’ and solves
Financial data

 csv records of what happened that day or a signal


 often have unexplaind anomalies
 daily arrival of row data which doesn’t conform to spec
1. Big Picture
 Who is your audience? Make your analytics and
application work with a small dataset first
 Market data rules: You can’t scrape Yahoo
 QA is not a bad word: Data Quality is key
 What’s a security?
 What’s a corporate action?
 OLAP: This is once a day processing
 Take performance of your dev boxes seriously:
Dell 2950 with 32GB of ram, 6 disks, RAID10.
Where does financial data come from?
 Thompson / Reuters
 McGraw Hill / Interactive Data
 Securities and Exchange Commission
 Dow Jones
 Standard and Poors
 Bloomberg
 Lots of ‘boutique’ $100M companies
Market data rules
 Information about a security’s trade on an exchange is owned by
the exchange and distributed to those who have made a license
agreement (Reuters, Interactive Data, et al.) Your license
agreement with these 3rd parties will start at $20k-$50k a year
 Scraping yahoo, msn money, Forbes or another site is infringement
 There are different license levels with financial data providers,
redistribution usually costs more than a quantitative black box
 After three days most data is less valuable / expensive, you may
get a bargain for dev phase
 Working with financial data providers is a slow process, it may take
you 8 weeks from your initial point of contact with a rep before
securing a license agreement. Work with your business decision
team to prepare for this
 Even indexes like S&P 500 and industry data is under license.
What Data Do You Need?
 Historical Price - Everyone needs this for charts,
models, etc
 Corporate Actions - Adjustments going forward
for historical data
 Real-time Price - You may want this for real-time
charts (100’s of Megs a Day)
 SEC Filings - You may want to decompose for
quant models or present reports to users
 3rd Party Quant Data - Black box trading
solution, quant box
Don’t load everything day 1
 AAPL, INTL, T, X, XOM, DVW, DELL, GE
 S&P 500
 Russell 3000
 FTSE
 APAC
 OTC / PINK / BB
 Mutual Funds
 Money Market
 Indexes
What’s a security?
 Stocks, bonds, mutual funds and more
 In this context traded on an exchange
 A note held for you by your broker
 Represents a debt to be paid by issuer -or-
 Represents a share of the issuer -or-
 Represents a bet on the issuer -or-
 Represents an index of multiple securities -or-
 Represents another abstraction of ownership or
bet
SECURITY
What’s a corporate action?
 A change to an attribute of a security or a security’s
price
 Split; reverse split
 Dividend
 Name change
 Listing; delisting;
 Exchange change
 Notes change
 Regional change
 Currency change
CORPORATE_ACTIONS (ABRIDGED) 70 Cols!!!
QA is not a bad word
 QA of financial data is much different than qa of software
 row data can arrive empty, wrong, portions missing
 row data can fail to arrive
 stocks may be priced wrong
 corporate actions may be for the wrong stock
 Canadian stock can be listed in the us with Canadian dollar prices
 all kinds of other fun
 You must have Excel jockeys to identify and explain noise to:
 Engineers
 Your data provider
 Your customers
2. Table Designs
2. Table Designs
 SECURITY - Attributes of a security
 RAW_PRICE - Attributes of a security’s trades from csv, unadjusted
 PRICE - Attributes of a security’s trades, adjusted for corporate actions
 CORPORATE_ACTIONS - Change records of a security or price attributes
 JOBS - Attributes of a job
 COUNTRY - A reference table for a security’s country
 EXCHANGE - A reference table for a security’s exchange
 REGION - A reference table for a security’s region
 SOURCE - A reference table of the data provider for a security
SECURITY
SECURITY
 security_id is your abstraction of data industry
identifiers
 SECURITY_ID, your identifier int NOT NULL AUTO_INCREMENT unsigned pk
 SECURITY_NAME, exchanges name for company
 SOURCE_ID, what data provides this char(1)
 CUSIP, us and canada unique identifier, char(9)
 TICKER, an identifier, a gotcha, varchar(14)
 SYMBOL, an identifier, a gotcha, varchar(14)
 EXCHANGE_ID, what exchange is it traded on
 REGION_ID, what region does this trade in int
 COUNTRY_ID, what country does this trade in int
SECURITY table
 Uses internal identifier SECURITY_ID

 If you’re experimenting with different providers,


SOURCE_ID should be added to pk

 Holds providers key for a security (ric, symbol,


ticker, cusip)

 500k rows max


RAW_PRICE
RAW_PRICE table (Load Everything)
 The rows just as they’ve come from the provider
with an artificial key

 Price corrections with asof_date in the past may


come in, check for these

 Sometimes attributes don’t exist in source files,


missing asof, open, etc, not null loses the whole
row and it might take days to get another one
resent
PRICE
PRICE table
 SECURITY_ID, your identifier int unsigned pk
 ASOF_DATE, what data provides this char(1) pk
 OPEN, the opening price decimal
 LOW, the low price for the day decimal
 CLOSE, the closing price for the day
 HIGH, the high price for the day
 VOLUME, how many shares sold that day
 SPLIT ADJUSTMENT: (REUTERS, not COMSTOCK) multiplier decimal
PRICE table
 Only one price per security per day

 Validation happens from RAW_PRICE to PRICE

 Instead of bouncing rows you may consider a


suspect data flag which bubbles up to UI
CORPORATE_ACTIONS
CORPORATE_ACTIONS table
 Comstock: Splits and Reverses are in this file
 Reuters: Splits and Reverses are in price file
 Denormalized - Boo!
 Much of this information is display information
 Changes to exchange or trading status are in
here (bankruptcy, emerging from bankruptcy,
changing from NASDAQ to OTC.BB, etc)
 Dividend information is in here too
COUNTRY, REGION, EXCHANGE
COUNTRY, REGION, EXCHNAGE tables
 COUNTRY, keeps track of what country a security trades in
 USA
 CANADA
 REGION, keeps track of what region a security trades in
 NORTH AMERICA
 APAC
 EXCHANGE, keeps track of what Exchange a security is traded on
 VANCOUVER
 NASDAQ
 NASDAQ OTC.BB
SOURCE
SOURCE table
 Keeps track of who provides what data in the
security table

 Good to side-by-side comparisons where data


comes from two different providers

 Helps build organizational knowledge over what


providers have good data-quality
Data Gotchas
 Do: load everything, don’t build constraints based on provider
specs prior to understanding the data
 Do: use 5.0.31 or above with innodb
 Do: wrap batches in BEGIN / END
 Do: set innodb_rollback_on_timeout = ON
 Do: stage feeds in raw tables b/c if you adjust for splits in the live
history table and make mistakes you’re be loading millions of rows
again
 Don’t run things like:
exec(“mysql -u user -e “source
/feed/load_statements.sql”);
 Don’t: foreign keys until process is hardened or never
3. Gears
 load_raw_prices();
 daily_price_clean();
 load_security();
 load_price();
 split(); Special sauce for you to write
 undo_split(); Ditto
An approach to data loads
 Daily load phase 1
 Get data from provider in csv or xml
 Don’t translate
 Import into raw tables
 Run variance checks to throw alerts (~50k securities)
 is ( yesterday n rows / today n rows ) between 99.99 and 100.01%?

 Daily load phase 2


 Load data into live tables
 Make adjustments for corporate actions
 Run your models
 Run variance checks to throw alerts
load_raw_prices()
function load_prices( $price_file ) {

$lines = file($price_file);
$counter = 0;
foreach ($lines as $line_num => $line ) {

$counter = $counter+1;
$row = explode(",",$line);
$cusip = $row[0];
$ric = $row[1];
$asof_date = $row[2];
$open = $row[3];
$high = $row[4];
$low = $row[5];
$close = $row[6];
$volume = str_replace( "\n", "", $row[7] );
$split_adjustment = str_replace( "\n", "", $row[8] );
$today = date('Y-m-d');

if($split_adjustment=='') {
$split_adjustment = '0.00000';
}
load_raw_prices() (cont’d)
$query = "INSERT INTO RAW_PRICE ( CUSIP, RIC, ASOF_DATE, OPEN, HIGH, LOW, CLOSE, VOLUME,
SPLIT_FACTOR, LOAD_DATE ) VALUES ( "
. "'" . $cusip . "',"
. "'" .$ric . "',"
. "'" .$asof_date . "',"
. $open . ","
. $high . ","
. $low . ","
. $close . ","
. $volume . ","
. $split_adjustment . ","
. "'" . $today . "')" ;

# put the rows in the raw_prices table


sm_query( $query );

if (($counter%100)==0) {
echo $counter . " lines processed.\n";
}
}
echo $counter . " total lines processed.\n";
}
daily_price_clean()
function daily_price_clean( $source_file, $new_file ) {
$lines = file($source_file);
foreach ($lines as $line_num => $line ) {
# strip "-9,999,401"
$line = str_replace("\"-9,999,401\"","NULL",$line);
# strip volume quotes and commas
$pieces = explode("\"",$line);
$pieces[1] = str_replace(",","",$pieces[1]);
$fixed_line = implode("",$pieces);
# do some more funky stuff to get the date re-arranged
$date_repair = explode(",",$fixed_line);
$date_digits = explode("/",$date_repair[2]);
$date_repair[2] = "20" . $date_digits[2] . "-" .
$date_digits[0] . "-" . $date_digits[1];
$fixed_line2 = implode(",",$date_repair);
# write out new file
if ( !file_exists($new_file)) {
touch ($new_file);
}
$handle = fopen ($new_file, 'a');
fwrite($handle, $fixed_line2);
fclose($handle);
}
}
load_secuirty()
function load_security( $security_file ) {

$lines = file($security_file);
$counter = 0;

foreach ($lines as $line_num => $line ) {

$counter = $counter+1;
$row = explode(",",$line);

$cusip = $row[0];
$ric = $row[1];
$ticker = $row[2];
$today = date('Y-m-d');

$query = "INSERT INTO SECURITY ( CUSIP, RIC, TICKER, CREATED_DATE ) VALUES ( "
. "'" . $cusip . "',"
. "'" . $ric . "',"
. "'" . $ticker . "',"
. "'" . $today . "')" ;
load_secuirty() (cont’d)
# put the rows in the raw_prices table
sm_query( $query );

if (($counter%100)==0) {
echo $counter . " lines processed.\n";
}
}
echo $counter . " total lines processed.\n";
}
load_prices()
function load_prices( $date ) {

        $query = "INSERT INTO PRICE
                  SELECT
                    S.SECURITY_ID, RP.ASOF_DATE, RP.OPEN, RP.HIGH,
                   RP.LOW, RP.CLOSE, RP.VOLUME, RP.SPLIT_FACTOR,
                   date(now())
                FROM
                   RAW_PRICE RP,
                   SECURITY S
                WHERE
                   S.RIC = RP.RIC
                AND
                   RP.ASOF_DATE = '" . $date . "'";

        echo $query ;

        sm_query( $query );

        }
Dependency Task Scheduling
 Php and shell scripts are useful tools to download and process price data
 But cron doesn’t do a very good job of keeping track in a database of when something
starts, finishes, fails, fails to start
 If email is broken or cron isn’t reporting correctly you may not know of problems until it’s
too late
 Often a layer of metadata fails b/c of failed or weird market data, a missing price can make
a graph or signal look weird to customers
 You can’t load prices if the ftp or feed fails
 You can’t process corporate actions until you know the price
 You can’t get accurate calculations against time-series if there’s holes in the series
 You can’t send signals or present accurate graphs if anything related to a security fails
 Keeping track of failed jobs gives you a flag that can also tell your users what they’re
seeing is questionable and will be corrected
 You can report on a jobs list and throw alerts on failed jobs
Tracking variances in data quality
 Price weirdness:
yesterday’s price / today’s price
 Row weirdness:
num rows yesterday / num rows today
 Range weirdness:
yesterday’s average of a sum / today’s average of a
Questions?

Acknowledgements
 Starmine: Tripp, Flanzer, Foster, Breffle, Miller
 Cake Financial: Reed

Potrebbero piacerti anche