
Part 1:

The first zip code grep is 'grep "^12962" data/2010_income_by_zipcode.tsv'


To get the number of zips that start with 3: 'grep "^3[0-9]\{4\}" data/2010_income_by_zipcode.tsv | wc -l'
The flag to do a reverse grep is '-v'
The awk command you'll need is: grep -v "^#" data/2010_income_by_zipcode.tsv | awk '{s=s+$4}; END{print s}'
A sed command that will work is: sed 's/^[0-9][0-9][0-9][0-9][TAB]/0&/' data/2010_income_by_zipcode.tsv
Note that [TAB] stands for a literal tab character, which you can enter on the command line with ^v [TAB] (Ctrl-V, then the Tab key).
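As a cross-check on the shell commands above, here is a minimal Python sketch of the same three steps. The sample text is made up for illustration; the real layout of 2010_income_by_zipcode.tsv is assumed to be tab-separated with the zip code in the first column and a numeric fourth column. Unlike awk, this sketch splits on tabs only and will raise an error on a non-numeric fourth field.

```python
import re

# Hypothetical sample, NOT the real file: zip code first, numeric 4th column.
SAMPLE = (
    "# zip\tcol2\tcol3\tcol4\n"
    "1001\ta\tb\t10\n"
    "30301\ta\tb\t20\n"
    "33101\ta\tb\t30\n"
    "12962\ta\tb\t40\n"
)

def count_zips_starting_with_3(text):
    # Same pattern as grep "^3[0-9]\{4\}": a 3 followed by four more digits.
    return sum(1 for line in text.splitlines() if re.match(r"3[0-9]{4}", line))

def sum_fourth_column(text):
    # Roughly mirrors: grep -v "^#" | awk '{s=s+$4}; END{print s}'
    total = 0.0
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comment lines and blank lines
        total += float(line.split("\t")[3])
    return total

def pad_four_digit_zips(text):
    # Mirrors the sed command: prefix a 0 to lines that start with
    # exactly four digits followed by a tab.
    return re.sub(r"^([0-9]{4}\t)", r"0\g<1>", text, flags=re.MULTILINE)

print(count_zips_starting_with_3(SAMPLE))
print(sum_fourth_column(SAMPLE))
print(pad_four_digit_zips(SAMPLE))
```

Five-digit zips are left untouched by the padding step because their fifth character is a digit, not a tab.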
Part 3:
To get the number of rows, the syntax is going to be something like "select count(*) from t2_postd_trxn_BDA;".
The min/max query is "select min(trxn_amt),max(trxn_amt) from t2_postd_trxn_BDA;".
To count the restricted table, try "select count(*) from t2_postd_trxn_BDA where tsys_tcat_cd = '1' and tsys_tbal_cd = '1' and debit_cr_cd = 'D' and mrch_cntry_cd = 'USA';"
To get a substring of the zip codes, the command is "substr(mrch_pstl_cd,1,5)".
To cast to int in Hive, the command would be "cast(mrch_pstl_cd as int)". To combine this with the substring command, you'd do "cast(substr(mrch_pstl_cd,1,5) as int)".
You can make the intermediate table as follows:
create table pyw137_fixed_trxns
stored as textfile
location '/user/pyw137/fixed_trxns/'
as select cast(substr(mrch_pstl_cd,1,5) as int) as mrch_pstl_cd,trxn_amt from t2_postd_trxn_BDA
where tsys_tcat_cd = '1' and tsys_tbal_cd = '1' and debit_cr_cd = 'D' and mrch_cntry_cd = 'USA';
Use "select count(*) from pyw137_fixed_trxns where mrch_pstl_cd = 1001;" to get the number of transactions in the 01001 zip code (the cast to int drops the leading zero).
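The substring-and-cast step is why the query above filters on 1001 rather than 01001. A rough Python analogue, as a sketch only: zip_to_int is a hypothetical helper name, and returning None stands in for the NULL that Hive is assumed to produce when a cast fails. Edge cases such as embedded whitespace may behave differently in Hive.

```python
def zip_to_int(mrch_pstl_cd):
    """Rough analogue of cast(substr(mrch_pstl_cd,1,5) as int)."""
    # substr(x,1,5) is 1-indexed in Hive; the Python slice [:5] takes
    # the same first five characters.
    first_five = mrch_pstl_cd[:5]
    try:
        return int(first_five)
    except ValueError:
        return None  # stand-in for a NULL result

print(zip_to_int("01001-1234"))
print(zip_to_int("12962"))
print(zip_to_int("ABCDE"))
```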
The command to get the median of a column of floats is percentile_approx([COLUMN_NAME],0.5).
The full query to get the medians for each zip code is something like this:
create table pyw137_median_trxn_by_zip
stored as textfile
location '/user/pyw137/median_trxn_by_zip/'
as select mrch_pstl_cd,percentile_approx(trxn_amt,0.5) as median_amt,count(*) as n_trxn
from pyw137_fixed_trxns
group by mrch_pstl_cd;
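To make the group-by concrete, here is a small Python sketch of the same idea on made-up rows. It uses an exact median from the standard library, whereas percentile_approx is an approximation, so the Hive numbers may differ slightly. The function name and sample rows are illustrative only.

```python
from collections import defaultdict
from statistics import median

def median_trxn_by_zip(rows):
    """rows: iterable of (mrch_pstl_cd, trxn_amt) pairs.

    Returns {zip: (median_amt, n_trxn)}, like the grouped Hive table.
    """
    amounts = defaultdict(list)
    for zip_code, amt in rows:
        amounts[zip_code].append(amt)
    return {z: (median(a), len(a)) for z, a in amounts.items()}

# Hypothetical transactions, not real data.
rows = [(1001, 5.0), (1001, 15.0), (1001, 10.0), (30301, 20.0), (30301, 40.0)]
result = median_trxn_by_zip(rows)
print(result)

# Analogue of "select max(median_amt)" over the grouped result.
print(max(med for med, _ in result.values()))
```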

To get the max median spend in a zip code the query is:
select max(median_amt) from pyw137_median_trxn_by_zip;
To join the two tables into a new one:
create table pyw137_median_trxn_with_income
stored as textfile
location '/user/pyw137/median_trxn_with_income/'
as select trxn.*,income.median_income,income.num_people
from pyw137_median_trxn_by_zip trxn
inner join pyw137_income_by_zip income
on trxn.mrch_pstl_cd = income.zip;
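The inner join keeps only zip codes that appear in both tables. A minimal Python sketch of that behavior, assuming each zip occurs at most once in the income table (a SQL join would emit one row per match otherwise). The tuples below are invented sample rows; the column order follows the query above.

```python
def inner_join(trxn_rows, income_rows):
    """trxn_rows: (zip, median_amt, n_trxn) tuples.
    income_rows: (zip, median_income, num_people) tuples.
    """
    # Index the income table by zip so each lookup is a dict access.
    income_by_zip = {row[0]: row[1:] for row in income_rows}
    # Keep a transaction row only if its zip has an income row.
    return [t + income_by_zip[t[0]] for t in trxn_rows if t[0] in income_by_zip]

# Hypothetical rows, not real data.
trxn = [(1001, 10.0, 3), (30301, 30.0, 2), (99999, 7.0, 1)]
income = [(1001, 52000.0, 17000), (30301, 41000.0, 900)]
joined = inner_join(trxn, income)
print(joined)
```

The row for 99999 is dropped because it has no match in the income table.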
To extract the data into a CSV:
insert overwrite local directory '/home/pyw137/project_data/joined_data'
row format delimited
fields terminated by ','
select * from pyw137_median_trxn_with_income;
Then (outside of Hive):
cd /home/pyw137/project_data/
mv joined_data/000000_0 joined_data.csv

Part 4: Detailed analysis.


A working 'load_data_from_file' looks like this:
import numpy as np

def load_data_from_file(filename):
    '''
    This function loads data from a file into an array, using np.loadtxt.
    Arguments:
        filename: A string representing the name of the data file.
    Returns:
        data_array: a 2D numpy array of the data.
    '''
    #Enter code here!
    data_array = np.loadtxt(filename,delimiter=',')
    return data_array
A working 'cut_array_on_column_min' looks like this:
def cut_array_on_column_min(data_array,column_to_slice_idx,column_min_value):
    '''
    This function returns the data array, where all rows having a column with a
    value below a min value have been removed.
    Arguments:
        data_array: a 2D numpy array of the data.
        column_to_slice_idx: The index of the column you want to slice.
        column_min_value: The minimum value of the column for a row to stay in the data.
    Returns:
        cut_data_array: The input data array with all 'bad' values removed.
    '''
    #Enter code here!
    # Boolean mask: keeps rows whose column value is strictly greater than the
    # minimum, so rows exactly equal to column_min_value are dropped as well.
    cut_data_array = data_array[data_array[:,column_to_slice_idx] > column_min_value]
    return cut_data_array
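The function above relies on NumPy boolean-mask indexing. A short sketch of what the mask does on a made-up array; the column meanings here (zip, median amount, transaction count) are assumptions for illustration.

```python
import numpy as np

# Hypothetical rows: zip, median transaction amount, number of transactions.
data = np.array([
    [1001.0, 12.5, 3.0],
    [30301.0, 40.0, 250.0],
    [12962.0, 22.0, 100.0],
])

# One True/False entry per row: is column 2 strictly above 100?
mask = data[:, 2] > 100.0
print(mask)

# Indexing with the mask keeps only the rows marked True.
cut = data[mask]
print(cut)
```

Because the comparison is strict, the row whose count equals 100 is removed along with the row below it.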

A working 'compute_spearman' looks like this:


from scipy.stats import spearmanr

def compute_spearman(median_trxn,median_income):
    '''
    Compute the Spearman rank-order correlation between the median transaction
    amount and household income.
    Arguments:
        median_trxn: A 1D numpy array of median transactions.
        median_income: A 1D numpy array of median household incomes.
    Returns:
        rho: The correlation coefficient.
        pval: The p-value of the correlation.
    '''
    #Enter code here!
    # Note that the Spearman function will be invoked as 'spearmanr([INPUTS])'.
    rho,pval = spearmanr(median_trxn,median_income)
    return rho,pval
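For intuition about what the Spearman coefficient measures, here is a standard-library sketch that computes it as the Pearson correlation of the ranks, with tied values sharing their average rank. It is illustrative only: it does not compute a p-value, the helper names are made up, and it is not a substitute for the scipy call above.

```python
from math import sqrt

def average_ranks(values):
    """Rank values starting at 1; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman_rho(x, y):
    """Spearman rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Monotonic but non-linear relationship: the ranks agree perfectly.
print(spearman_rho([1, 2, 3, 4], [10, 20, 25, 100]))
```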
A working 'fit_line' looks like this:
def fit_line(median_income,median_trxn):
    '''
    Compute the best fitting line to predict the median transaction amount as a
    function of household income.
    Arguments:
        median_income: A 1D numpy array of median household incomes.
        median_trxn: A 1D numpy array of median transactions.
    Returns:
        m: The slope of the line.
        b: The y-intercept.
    '''
    #Enter code here!
    # Design matrix with a column of incomes and a column of ones (for the intercept).
    A = np.vstack([median_income,np.ones(len(median_income))]).T
    # Note that the least-squares fitter should be invoked as 'np.linalg.lstsq([INPUTS])'.
    # rcond=None avoids the FutureWarning that some NumPy versions emit.
    m,b = np.linalg.lstsq(A,median_trxn,rcond=None)[0]
    return m,b
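For a single predictor, the least-squares line also has a closed form, which can serve as a sanity check on the fit above. This is a standard-library sketch, not the assignment's method; fit_line_closed_form is a hypothetical name and the sample points are made up.

```python
def fit_line_closed_form(x, y):
    """Slope and intercept of the least-squares line through (x, y) points."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    m = sxy / sxx
    # The fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b

# Points lying exactly on y = 2x + 1.
m, b = fit_line_closed_form([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)
```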
