
Analysis of the 2007 Survey of Business Owners

Public Use Microdata Sample



Arthur Wu

Supervisor: Professor Amber Tomas
STAT 4993 Independent Study
5/9/2013

Table of Contents
Introduction
Overview of the Survey of Business Owners Public Microdata Sample
    Primary Methodology
    Missing Data
    Nonresponse
    Inherent Differences between SBO and PUMS Information
Data Manipulation and Cleaning
    Subsetting to Small Businesses and One Owner Data
    Tabulation Weights
Fitting a Regression
    Regression One: All Variables Against Receipts
    Model Selection Techniques
    Regression Two: PROC GLMSELECT with the Schwarz Bayesian Criterion
    Regression Three: PROC GLMSELECT with Akaike's Information Criterion
    Regression Four: PROC GLMSELECT with AIC: Receipts with Logarithm Transform
    Regression Five: PROC GLMSELECT with AIC: Modified Binary Payroll and Employment with Other Character Variables against the Logarithm of Receipts
Analysis of Language
    Research Question 1a: Does the language spoken in transactions produce a difference in correlated receipts?
    Research Question 1b: Out of only existing Only English and Spanish businesses, which are the most popular industries for business?
    Visualizing the Percentage of Hispanic/Spanish-Speaking Businesses in the United States
Analysis of Capital Sources in California versus the United States
    Dataset Overview
    Research Question 2a: What sources of capital have the most positive relationship with receipts?
    Research Question 2b: How does the spread of receipts for certain capital sources compare between businesses in California and the general United States?
Conclusion
Lessons Learned
Appendix A: Full Output of Regression with log(Receipts) and Modified Payroll and Employment
Appendix B: Regression of All Available Variables
Appendix C: Full Tables of Capital Regression



Introduction
In selecting my topic for independent study, I wanted to combine the skills I had learned
in statistical computing software with my main passion and primary fields of study: business
and entrepreneurship. As a result, I originally intended to build a predictive model for new or
small business venture success. According to the U.S. Small Business Administration (SBA),
small businesses represent 99.7 percent of all employer firms. Since 1995, small businesses have
generated 64 percent of new jobs and paid 44 percent of the total United States private payroll,
according to the SBA. However, I quickly realized that there was a dearth of accurate, thorough,
and easily accessible data with a large enough sample size to satisfy the normality
assumption across all sectors and states. Moreover, attempting to answer this question would
require significant longitudinal data at the individual business level. Ultimately, I relied
on the U.S. Census Bureau's Survey of Business Owners (SBO) Public Use Microdata Sample
(PUMS), which I use to examine entrepreneurial activity and the relationships between business
characteristics such as access to capital, firm size, employer-paid benefits, minority ownership,
and firm age. In this report, I detail how I cleaned the 2007 SBO PUMS data, developed a
regression model, and conducted more in-depth analyses of the
relationships between specific variables.
Overview of the Survey of Business Owners Public Microdata Sample
Primary Methodology
The 2007 Survey of Business Owners (SBO) questionnaire, Form SBO-1, was mailed to
a random sample of 2.3 million businesses selected from a list of 27 million firms operating
during 2007 with receipts of $1,000 or more. The list of all firms (the sampling universe) was
derived from both official business tax returns and data collected on other economic census
reports. The Census Bureau obtained electronic files from the Internal Revenue Service (IRS) for
all companies reporting any business activity on 2007 IRS Tax Forms such as Form 1040 and
1065.
The SBO is part of the Economic Census
program, which the Census Bureau is required by law to conduct every 5 years, in years ending
in "2" and "7." The Census Bureau combines and crosschecks data from the SBO with data from
other economic surveys, economic censuses, and administrative records. The published data
include number of firms (both firms with paid employees and firms with no paid employees),
sales and receipts, number of paid employees, and annual payroll; they are presented by kind of
business, geographic area, and size of firm (employment and receipts). The results also
contain summary statistics on the composition of businesses in the United States by gender,
ethnicity, race, and veteran status. Additional demographic and economic characteristics of
business owners and their businesses are included, such as: owner's age, education level, hours
worked, and primary function in the business; family- and home-based businesses; types of
customers and workers; sources of financing for start-up, expansion, or capital improvements;
outsourcing; use of Internet and e-commerce; and employer-paid benefits.
The IRS provided certain identification, classification, and measurement data for
businesses filing those forms. For most firms with paid employees, the Census Bureau also
collected employment, payroll, receipts, and kind of business for each plant, store, or physical
location during the 2007 Economic Census.
For the 2007 SBO, firms could either report electronically by using Census Taker, the
Census Bureau's secure online interactive application, or return their completed form by mail.
Three report form re-mails to employer firms and two report form re-mails to nonemployer firms
were conducted at one-month intervals to all delinquent respondents. The returned forms
underwent extensive review and computer processing. All reports were geographically coded,
data-keyed, and edited.
This wealth of data provides a resource to many parties, from government officials to
industry organization leaders. For example, this data allows agencies such as the Small Business
Administration to identify and address the needs of small businesses in the United States. In the
private sector, consultants and researchers can analyze long-term economic and demographic
shifts, as well as differences in ownership and performance among geographic areas.
Survey Overview:
Form SBO-1, given to every sampled business, primarily asked basic information about
the business in general while focusing on the demographics and level of ownership for each
listed owner. There are only 9 numeric variables (tabulation weight, total revenues with injected
noise, payroll with injected noise, employment injected with noise, and general ownership
percentages for up to four owners) while the other hundreds are character variables (age,
education, startup capital type, race, ethnicity, etc.). These character variables usually adopt the
format of a binary yes/no answer to most questions, except for variables with multiple levels
(education, age, and race).
















General Owner and Business Characteristics
    Sector
    Employer status
    Random group (for variance estimation)
    Tabulation weight
    Measures of size (noise-infused for disclosure avoidance):
        Employment
        Payroll
        Receipts
    Individual owner information (for up to four owners):
        Percentage ownership
        Gender
        Ethnicity
        Race
        Veteran status

Additional Owner Characteristics
    How the owner initially acquired the business
    When the owner acquired the business
    Owner's primary function in the business
    Owner's average number of hours per week spent working in the business
    Whether the business provided the owner's primary source of personal income
    Whether the owner previously owned a business or had been self-employed
    Owner's educational background
    Owner's age
    Whether the owner was born in the United States
    If the owner was a veteran, whether the owner was disabled as the result of injury incurred during active military service

Additional Business Characteristics
    Year business was established
    Source(s) of start-up or acquisition capital
    Amount of start-up or acquisition capital
    Home-based business
    Operated as a franchise
    Owned by a franchise
    Source(s) of capital used to expand business
    Types of customers
    Percent of total sales exported
    Operations established outside the United States
    Outsourced any business function outside the United States
    Language(s) used in transactions
    Types of workers employed
    Employer-paid benefits offered
    Whether the company had a website
    Whether the company had e-commerce sales
    E-commerce as a percentage of total sales
    Whether the company made online purchases
    Business activity (e.g., seasonal or part-time)
    Whether the business currently operates
    Reasons for ceasing operations
    Joint ownership by husband and wife
    Family-owned business
    Number of owners

Missing Data
For the numeric variables of the cleaned data set (receipts, payroll, employment, and
percent of ownership), there were no missing values. However, for most character variables, the
percentage of missing data ranged from 20-40%. There were also several variables in which over
90% of the observations had missing values, such as whether or not the owner is retired, the
owner is deceased, or the business had low or inadequate sales.
Nonresponse
Approximately 62 percent of the 2.3 million businesses in the SBO sample responded to
the survey, compared to 75 percent for the 2002 survey. For the 2007 survey, 72 percent of the
companies in the SBO sample returned a questionnaire, but 10 percent of the returns did not
contain enough information to be considered a response for the estimates by race, gender,
ethnicity or veteran status. Many of these respondents were sole proprietors that answered "No"
to Item 8, "In 2007, did any individual own 10% or more of the rights, claims, interests, or stock
in this business?" Another identified issue was the overlap between ethnicity (Hispanic vs. non-Hispanic)
and race (White, Black, Asian, American Indian). Every Hispanic business owner also had
to identify at least one race, which may lead to an indication of mixed race when an
owner is solely Hispanic in heritage. This required additional variable manipulation to correct.
According to the U.S. Census, about 4 percent of the 2007 nonrespondents were selected
for and responded to the 2002 SBO. For these firms, data from the 2002 survey were used in
place of the missing 2007 responses. For the remaining nonrespondents, gender, ethnicity, race
and veteran status were imputed from donor respondents in the same sampling frame with
similar characteristics (state, industry, employment status, size). Because the assignment of
businesses to sampling frames relies heavily on administrative data, and there is a high level of
agreement between sampling frame assignment and tabulated race or ethnicity for responding
firms, the donor imputations are considered to be reliable. Estimates of sampling variability are
adjusted to account for nonresponse. Estimates with high error (relative standard error for sales
or receipts of 50 percent or more) are suppressed. Overall, imputed data accounted for
approximately 47 percent of the firm count estimates by gender, ethnicity, race, and veteran
status and approximately 20 percent of the estimates of sales.
Inherent Differences between SBO and PUMS Information
The Public Use Microdata Sample (PUMS) is a large dataset available to the public derived
from the original SBO dataset of responses. According to the U.S. Census Bureau, measures
were taken in constructing the PUMS file to protect the confidentiality of the SBO data in order
for it to be used freely among the public. In the PUMS file, each record corresponds to a
business, but deliberate measures were taken to ensure the anonymity of each business. For
businesses operating in multiple states and/or industry sectors, one record exists for each state-sector
combination in which the firm conducts business. Identifiers to link the component records of a
business are not included. Additionally, businesses classified in the SBO as publicly owned or
not classifiable by gender, ethnicity, race, or veteran status are not included in the PUMS file
because many publicly owned firms are easily identifiable. Since the primary focus of my
research is on small businesses, exclusion of possibly larger public corporations does not
significantly impact the integrity of the data.
Finally, the U.S. Census Bureau infused noise into the PUMS data for disclosure avoidance
and confidentiality protection. Values are perturbed prior to tabulation by applying a random
noise multiplier to the magnitude data, such as the sales and receipts for all firms. This
noise perturbs each data point by no more than a few percentage points.

Data Manipulation and Cleaning
Subsetting to Small Businesses and One Owner Data
According to the Small Business Administration, small businesses are generally
considered to have fewer than 500 employees and under $7 million in receipts, depending on the
industry. Subsetting the original dataset to these thresholds decreased the total
observation count from 2,165,680 to 2,025,530.
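The subsetting rule can be sketched in a few lines of Python. This is only an illustration, not the SAS step used for the report: the three records are made up, the field names simply mirror the PUMS variable names that appear later in this report, and the sketch assumes that receipts are recorded in thousands of dollars (so that $7 million corresponds to 7000).

```python
# Hypothetical thresholds taken from the SBA rule of thumb quoted above.
# Assumption: receipts are stored in thousands of dollars.
MAX_EMPLOYEES = 500
MAX_RECEIPTS_THOUSANDS = 7000


def is_small_business(record):
    """Return True when a record falls under both size thresholds."""
    return (record["EMPLOYMENT_NOISY"] < MAX_EMPLOYEES
            and record["RECEIPTS_NOISY"] < MAX_RECEIPTS_THOUSANDS)


# Made-up records for illustration only.
firms = [
    {"EMPLOYMENT_NOISY": 12, "RECEIPTS_NOISY": 850},
    {"EMPLOYMENT_NOISY": 640, "RECEIPTS_NOISY": 5200},  # too many employees
    {"EMPLOYMENT_NOISY": 45, "RECEIPTS_NOISY": 9100},   # receipts too high
]

small_firms = [f for f in firms if is_small_business(f)]
print(len(small_firms), "of", len(firms), "records kept")
```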
Furthermore, according to the 2007 SBO PUMS Guide, a response of 0 for most of the
qualitative questions indicated that the data for these variables were missing. As a result, each 0
was converted to a missing value to more accurately reflect the nature of this data and to correctly set up
regression procedures later on.
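A minimal sketch of this recoding, written in Python rather than SAS, is shown below. The helper name and the choice of `None` as the missing marker are assumptions made for the illustration; the field names are taken from the variable lists later in this report, and the values are invented.

```python
def recode_zero_as_missing(record, qualitative_fields):
    """Return a copy of the record in which a 0 response becomes None (missing)."""
    cleaned = dict(record)
    for field in qualitative_fields:
        if cleaned.get(field) in (0, "0"):
            cleaned[field] = None
    return cleaned


# Invented record: EDUC1 was left unanswered and is coded 0 in the raw file.
raw = {"EDUC1": "0", "HOMEBASED": "1", "RECEIPTS_NOISY": 0}
clean = recode_zero_as_missing(raw, ["EDUC1", "HOMEBASED"])
print(clean)
```

Only the fields listed as qualitative are recoded, so a genuine numeric zero in a measure such as receipts is left alone.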
Another major issue for data analysis was the inclusion of data for up to three
additional business owners. As a result, there exist three additional sets of demographic
variables. However, if a business has only one owner, these three additional sets of variables
all have missing values. Because PROC REG and other regression
procedures exclude observations that have even one missing value, keeping the sixty extra
variables for additional owners would exclude a significant number of
observations. Moreover, I noticed that virtually every business had responses to all variables
describing the first owner, and that the first owner almost always owned at least as much
of the business as his or her other one to three business partners. This justified my decision to
remove all variables affiliated with the second, third, and fourth owners.
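The effect of complete-case exclusion can be illustrated with a toy example in Python. The three records are invented, and `SEX2` and `EDUC2` are used here as stand-ins for second-owner variables; the point is only that single-owner businesses drop out as soon as any second-owner field is required.

```python
def complete_cases(records, fields):
    """Keep only records with no missing value (None) in any listed field."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


# Invented records: two single-owner businesses and one with two owners.
records = [
    {"SEX1": "M", "EDUC1": 4, "SEX2": None, "EDUC2": None},
    {"SEX1": "F", "EDUC1": 6, "SEX2": "M", "EDUC2": 5},
    {"SEX1": "F", "EDUC1": 3, "SEX2": None, "EDUC2": None},
]

first_owner_only = complete_cases(records, ["SEX1", "EDUC1"])
all_owner_fields = complete_cases(records, ["SEX1", "EDUC1", "SEX2", "EDUC2"])
print(len(first_owner_only), len(all_owner_fields))
```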
One future recommendation would be to keep these deleted variables and find other
ways to analyze the overall dataset despite the additional missing values.
Observing businesses with the intent to analyze the relationship between multiple business
owners (while also factoring in age, experience, and education) could be incredibly valuable for
the study of organizational behavior and business.
Additional deleted variables included those that had over a 90% missing rate, since they
severely diminished the number of total observations used in regression. For the purposes of my
research questions, having more observations to interpret is preferable, but including these
variables in another analysis of businesses that have generally ceased activity would also be
worthwhile. These variables include:
CEASEOTHER Ceased operations for another reason
SOLDBUS Sold this business
STARTANOTHER Started another business
NOPERSCRED Lack of personal loans/credit
NOBUSCRED Lack of business loans/credit
LOWSALES Inadequate cash flow or low sales
ONETIME Operated for one-time event
DECEASED Owner died
RETIRE Owner retired


Lastly, I consolidated all language variables and the race and ethnicity variables in order
to improve overall adjusted fit and reduce noise. The dataset contains many
language variables (English, Arabic, Chinese, French, German, Greek, Hindi,
Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Tagalog, and Vietnamese); after
analyzing their distribution, I found that the
most frequent languages spoken during business transactions are, unsurprisingly, English and Spanish.
The other languages collectively do not constitute more than 3% of the total number of
observations, so I created a new language variable with five levels: Only English, Only
Spanish, Only English and Spanish, Only English and Other, and Only Other Language. This
consolidated sixteen binary variables into one variable with five levels. Additionally,
the race variable had at least twenty levels due to the combinations of different races, which
encouraged consolidation. Race and ethnicity were combined to create a new variable,
which followed this algorithm:

1. If ethnicity is Hispanic then Race/Ethnicity is Hispanic
2. If ethnicity is not Hispanic and the owner is White AND another minority, then the owner
is considered part of the other minority
3. If ethnicity is not Hispanic and the owner is only White or another minority, then the race
stays as listed
4. If ethnicity is not Hispanic and the owner is at least two types of race (both of which are
not White), then the owner is considered Mixed.

As a result, the final levels for the new race/ethnicity variable are: W for White, B for Black, A
for Asian, H for Hispanic, I for American Indian, P for NHOPI (Native Hawaiian and Other Pacific
Islander), and Mixed for any combination of two or more non-White, non-Hispanic races, such as Black/Asian.
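The four rules above can be sketched as a small Python function. This is a hedged illustration rather than the SAS code used for the report: the function name is hypothetical, and it assumes each owner's races arrive as a set of the single-letter codes listed above. It also assumes that an owner who is White plus two or more other races is treated as Mixed, a case the rules do not spell out.

```python
def consolidate_race_ethnicity(is_hispanic, races):
    """Collapse ethnicity and a set of race codes into one level.

    races is assumed to be a set of single-letter codes such as {"W", "B"}.
    """
    if is_hispanic:
        return "H"                      # Rule 1: Hispanic overrides race
    if not races:
        return None                     # no race reported: leave as missing
    non_white = set(races) - {"W"}
    if not non_white:
        return "W"                      # Rule 3: White only
    if len(non_white) == 1:
        return next(iter(non_white))    # Rules 2 and 3: the single minority race
    return "Mixed"                      # Rule 4: two or more non-White races


print(consolidate_race_ethnicity(False, {"W", "A"}))
print(consolidate_race_ethnicity(False, {"B", "A"}))
```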

Tabulation Weights

In most surveys, it will be the case that some groups are over-represented in the raw data
and others under-represented. In order to address this, weights are assigned to each observation
to compensate for the over/under-representation of data. While the exact method of determining
these weights for the PUMS is unknown, the values of the tabulation weight range from 1.0 to
35.0. In this sense, a single observation with a weight of 35.0 would functionally be the same as
thirty-five individual observations with the same parameters except for a weight of 1.0. Analysis
of the data therefore requires the weights to be properly factored into averages, percentages, and
regressions through the proper SAS procedures.
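The equivalence described above can be checked with a short Python sketch. The receipts and weights are invented; the example only shows that a weighted mean gives the same answer as physically repeating the heavily weighted observation thirty-five times.

```python
def weighted_mean(values, weights):
    """Mean of values in which each value counts in proportion to its weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Invented example: one record with weight 35.0 and one with weight 1.0.
receipts = [100.0, 400.0]
weights = [35.0, 1.0]
weighted = weighted_mean(receipts, weights)

# The same data with the first record repeated 35 times at weight 1.0.
expanded = [100.0] * 35 + [400.0]
unweighted_expanded = sum(expanded) / len(expanded)

print(weighted, unweighted_expanded)
```

Ignoring the weights would instead give a simple average of 250.0, which overstates the influence of the lightly weighted record.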
Fitting a Regression
Intuitively, I first wanted to fit a regression for all variables against receipts to observe
the overall coefficient of determination and the comparative significance of each individual
variable in determining receipts. This would immediately confirm or deny several of my possible
research questions about the variables and would then be a starting point for other further
research ideas. I gave further consideration to other possible response variables besides receipts,
such as employment, payroll, and certain categorical variables such as whether or not the
business was still operating. Ultimately, I decided to primarily focus on receipts as the response
variable, since general business theory holds that an increase in either payroll or employment
usually comes after an increase in overall revenues.
To begin, I had to identify the proper statistical procedure to regress receipts on several
quantitative and multi-level categorical variables. I also wanted to use a
procedure that would automatically create dummy variables and evaluate each relationship while
controlling for other variables. Ultimately, I selected the generalized linear model procedure
because it satisfied these criteria. Its limitations include the fact
that responses must be independent of one another (which is extremely difficult to establish given game
theory and the general economics of competition), and the fact that the relationship between the
predictors and the response is still assumed to be
linear.
Regression One: All Variables Against Receipts
Data Summary:
Number of Observations 874,182

Included Variables: RECEIPTS_NOISY EMPLOYMENT_NOISY PAYROLL_NOISY PCT1
FIPST SECTOR N07_EMPLOYER SEX1
FOUNDED1 PURCHASED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1
HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED
SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR
SCBANKLOAN SCFAMLOAN SCGRANT SCDONTKNOW SCAMOUNT HOMEBASED
FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECCREDIT ECGOVTLOAN
ECGOVTGUAR ECBANKLOAN ECFAMLOAN ECVENTURE ECPROFITS ECGRANT
ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL
INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME DAYLABOR
TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE
HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS
LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV CEASENR HUSBWIFE
FAMILYBUS NUMOWNERS race1noblanks OPERATING LANGUAGE
Further analysis shows that these are the top five variables based on both the magnitude
of the t-value and the estimate, as these parameters indicate statistical and practical
significance, respectively. Since the vast majority of variables are significant in the model at the 0.05
significance level, the magnitude of the estimate is the deciding factor in establishing
importance.

Estimated Regression Coefficients
Parameter                                                     Estimate  Standard Error  t Value  Pr > |t|
SECTOR 11 Agriculture, Fishing                                273.5380       19.767836    13.84    <.0001
SECTOR 21 Mining, Oil Extraction                              268.7993       21.180031    12.69    <.0001
SECTOR 22 Trade, Transportation, Utilities                    239.4820       20.632043    11.61    <.0001
SECTOR 23 Construction                                        335.1240       19.601069    17.10    <.0001
SECTOR 31 Manufacturing                                       352.3094       20.018820    17.60    <.0001
SECTOR 42 Wholesale Trade                                     638.2235       20.447289    31.21    <.0001
SECTOR 44 Retail Trade                                        392.6039       19.464185    20.17    <.0001
SECTOR 48 Warehousing / Distribution                          270.2850       19.592456    13.80    <.0001
SECTOR 51 Information and Media                               227.2462       19.599947    11.59    <.0001
SECTOR 52 Finance and Insurance                               189.1639       19.496633     9.70    <.0001
SECTOR 53 Real Estate and Rental                              201.2833       19.350880    10.40    <.0001
SECTOR 54 Professional, Scientific, and Technical Services    207.9890       19.428692    10.71    <.0001
SECTOR 55 Mgmt. of Companies and Enterprises                -1827.4092       68.385800   -26.72    <.0001
SECTOR 56 Admin. and Support and Waste Management             205.8534       19.478828    10.57    <.0001
SECTOR 61 Education                                           220.9200       19.407797    11.38    <.0001
SECTOR 62 Healthcare                                          194.4817       19.586984     9.93    <.0001
SECTOR 71 Arts and Entertainment                              224.3419       19.398649    11.56    <.0001
SECTOR 72 Accommodation and Food                              131.5430       19.560860     6.72    <.0001
SECTOR 81 Other Services                                      228.5184       19.431691    11.76    <.0001

Sector has the largest level estimates in magnitude, while several other variables have estimates of
similar size:
Parameter           Estimate  Standard Error  t Value  Pr > |t|
N07_EMPLOYER E      193.2805        3.474976    55.62    <.0001
N07_EMPLOYER N        0.0000        0.000000        .         .
HEALTHINS 1         174.8802        5.455380    32.06    <.0001
HEALTHINS 2           0.0000        0.000000        .         .
HOLIDAYS 1          168.2109        5.420379    31.03    <.0001
HOLIDAYS 2            0.0000        0.000000        .         .

Fit Statistics:
R-square 0.5771
Adjusted R-square 0.5770
Root MSE 446.73

Interpretation: Overall, the regression explains a moderate share of the variation in receipts
(R-square of 0.5771). Certain variables have far higher t-values than
others. The most significant explanatory variables in this regression are Sector,
Provide1 (whether owner 1 provided goods or services), HomeBased, and Contractors. Many
variables were also insignificant, however, which led me to consider further techniques, such as
stepwise regression, that would automatically select a better model with fewer variables and a
lower MSE.

Model Selection Techniques
To give an overview of available procedures:
Forward Analysis
In this approach, one adds variables to the model one at a time. At each step, each
variable that is not already in the model is tested for inclusion in the model. The most significant
of these variables is added to the model, so long as the p-value of its F-statistic is below some
pre-set level. It is customary to set this level above the conventional .05, at say .10 or .15. In
practice, the user begins with a model including the variable that is most significant in the initial
analysis, and continues adding variables until none of the remaining variables are "significant" when
added to the model. The following hypotheses are tested at each iteration of the process.

Null Hypothesis: $\beta_k = 0$

Alternative Hypothesis: $\beta_k \neq 0$ (such that the coefficient of the additional variable is statistically
nonzero)

Therefore, the decision statistic is the partial F-statistic. SAS stops the forward, backward, or
stepwise selection if the p-value of the partial F-statistic for a new model is above a
maximum value, and continues model selection so long as that p-value is below the
threshold.
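For reference, the partial F-statistic that compares a smaller (reduced) model with a larger (full) model is commonly written as below. The symbols are chosen here for illustration and are not taken from the SAS output: q is the number of coefficients added when moving from the reduced to the full model (1 for a single numeric variable, more for a multi-level categorical variable), n is the number of observations, and p_full is the number of estimated coefficients in the full model, including the intercept.

```latex
F = \frac{\left(\mathrm{SSE}_{\text{reduced}} - \mathrm{SSE}_{\text{full}}\right) / q}
         {\mathrm{SSE}_{\text{full}} / \left(n - p_{\text{full}}\right)}
```

Under the null hypothesis that the added coefficients are all zero, this statistic follows an F distribution with q and n - p_full degrees of freedom, and its p-value is what is compared against the entry or removal threshold.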

Backwards Analysis
This method is less popular because it begins with a model in which all candidate
variables have been included. However, because of its structure, a high coefficient of
determination is always maintained. The problem is that the models selected by this procedure
may include variables that are not really necessary. At each step, the variable that is the least
significant is removed. This process continues until no nonsignificant variables remain. The user
sets the significance level at which variables can be removed from the model, which is
usually 0.05.
Stepwise Regression
As a fusion of forward and backward selection, stepwise regression considers
four options at each stage: add a term, delete a term, swap a term in the model for
one not in the model, or stop. This algorithm is the one most often used in practice. Despite its
widespread use, it has little theoretical basis. A more theoretically grounded tool like Akaike's
Information Criterion (AIC) can also be used as a metric to assess models.
Fit Statistics
Adjusted Coefficient of Determination / Adjusted R-Square

The adjusted R-squared is a modified version of R-squared that has been adjusted for the
number of predictors in the model. The adjusted R-squared increases only if the new term
improves the model more than would be expected by chance. It decreases when a predictor
improves the model by less than expected by chance. The adjusted R-squared can be negative,
but it usually is not. It is never higher than the R-squared.
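As a reference point, the standard definition of the adjusted R-squared can be written as follows, where n is the number of observations and p is the number of predictors excluding the intercept. These symbols are chosen for illustration and do not appear in the report's output.

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

With a sample as large as the one used in this report, the correction factor (n - 1) / (n - p - 1) is very close to 1, which is consistent with the R-square and adjusted R-square values reported for each regression differing only in the fourth decimal place.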
Akaike Information Criterion
The Akaike Information Criterion (AIC) is a way of selecting a model from a set of
models. The chosen model is the one that minimizes the estimated Kullback-Leibler divergence between
the model and the truth. It is based on information theory, but a heuristic way to think about it is
as a criterion that seeks a model that has a good fit to the truth but few parameters. It is defined
as:
$\mathrm{AIC} = -2 \ln(\text{likelihood}) + 2K$
where likelihood is the probability of the data given a model and K is the number of free
parameters in the model. For a least-squares regression with normally distributed errors, this
reduces, up to an additive constant, to $\mathrm{AIC} = n \ln(\mathrm{SSE}/n) + 2K$, where n is the
number of observations and SSE is the residual sum of squares. AIC scores are often shown as ΔAIC
scores, or the difference between each model and the
best model (smallest AIC), so the best model has a ΔAIC of zero. Used in
stepwise regression, AIC can be used instead of the p-value as the main criterion for model
selection. Each iterative model's AIC should be calculated and compared to the previous one, and the
new model should only be preferred if its AIC is smaller than the AIC of the prior model. This
process continues until the best model is selected.
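Bayesian Information Criterion
$\mathrm{BIC} = n \ln(\mathrm{SSE}_p) - n \ln(n) + p \ln(n)$
The BIC acts essentially the same as AIC but imposes a more severe penalty on each additional
parameter once n is 8 or larger, because ln(n) then exceeds the AIC's constant of 2.
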
Schwarz Bayesian Criterion
$\mathrm{SBC} = n \ln(\mathrm{SSE}/n) + k \ln(n)$
This is essentially like the AIC equation but penalizes each parameter by ln(n), a term that grows
with sample size, rather than by a constant of 2. By default, PROC GLMSELECT uses stepwise
selection based on the Schwarz Bayesian Criterion. [1]
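To make the difference between the two penalties concrete, the least-squares forms of the criteria, n ln(SSE/n) + 2k for AIC and n ln(SSE/n) + k ln(n) for SBC, can be sketched in Python. The two candidate models and their SSE values are invented for illustration; they are not results from the SBO data.

```python
import math


def aic(sse, n, k):
    """Least-squares AIC, up to an additive constant: n*ln(SSE/n) + 2k."""
    return n * math.log(sse / n) + 2 * k


def sbc(sse, n, k):
    """Schwarz Bayesian Criterion: n*ln(SSE/n) + k*ln(n)."""
    return n * math.log(sse / n) + k * math.log(n)


# Invented example: model B fits slightly better but uses three more parameters.
n = 100
model_a = {"sse": 250.0, "k": 5}
model_b = {"sse": 245.0, "k": 8}

for name, m in (("A", model_a), ("B", model_b)):
    print(name, round(aic(m["sse"], n, m["k"]), 2), round(sbc(m["sse"], n, m["k"]), 2))

# The gap between SBC and AIC is k * (ln(n) - 2), which is positive once n >= 8.
penalty_gap = sbc(250.0, n, 5) - aic(250.0, n, 5)
```

In this toy comparison both criteria prefer the smaller model, and the SBC separates the two models more strongly because each extra parameter costs ln(100) rather than 2.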

Mallows' Cp
$C_p = \frac{(1 - R_p^2)(n - T)}{1 - R_T^2} - [n - 2(p + 1)]$
where $R_p^2$ and $R_T^2$ are the coefficients of determination of the reduced and full models.
The AIC has been shown to be equivalent to Mallows' Cp, which is used to assess the fit of
a regression model that has been estimated using ordinary least squares. Cp measures the bias
in the reduced regression model relative to the full model having all T candidate predictors. If Cp
is roughly equivalent to p, then the reduced model predicts as well as the full model. If Cp < p
then the reduced model is estimated to predict better than the full model. In practice, the selected
model should have the smallest Cp.
Mean Squared Error (MSE)

The Mean Squared Error in regression refers to the residual sum of squares divided by
the residual degrees of freedom. Minimizing MSE is important to ensure that the maximum
amount of variation in the response can be explained by the independent variables, thus
supporting the robustness of a model. It is one of the most important and fundamental criteria
that can be used to evaluate models.
General Criteria:
General diagnostics should be calculated for each model to help determine which model
is best. These model diagnostics include the mean squared error (MSE), the adjusted coefficient
of determination (adjusted R-squared), and Mallows' Cp. A good linear model will have a small MSE
and Cp and an adjusted R-squared close to 1. With these criteria in mind, in addition to stepwise
regression with tools such as AIC, BIC, and SBC, I can develop a more robust model than the original.
However, it is also important to note that use of these criteria and selection procedures will not
definitively yield the best model due to the sheer number of potential models and inherent
limitations of these tools.
Regression Two: PROC GLMSELECT with the Schwarz Bayesian Criterion
Using PROC GLMSELECT, I used the aforementioned steps to select a model using
stepwise selection with the Schwarz Bayesian Criterion.
Data Summary:
Number of Observations 874,182

Included Variables: EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 SECTOR
N07_EMPLOYER SEX1 VET1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1

[1] http://www2.sas.com/proceedings/sugi31/207-31.pdf

FNCTNABV1 HOURS1 PRMINC1 EDUC1 BORNUS1 ESTABLISHED SCSAVINGS
SCEQUITY SCCREDIT SCGOVTGUAR SCDONTKNOW SCAMOUNT HOMEBASED
FRANCHISE ECSAVINGS ECASSETS ECCREDIT ECBANKLOAN ECVENTURE
ECPROFITS ECGRANT ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL
STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME
TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE
HOLIDAYS BENENABV WEBSITE ECOMMPCT LT40HOURS SEASONAL
OCCASIONALLY ACTIVITYNABV OPERATING HUSBWIFE NUMOWNERS REGION

Fit Statistics:
R-square 0.5768
Adjusted R-square 0.5768
Root MSE 1516.49340
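
The selection step behind this regression can be illustrated with a simplified forward selection that adds, at each step, the variable that most lowers the SBC. This is a sketch under stated assumptions rather than the PROC GLMSELECT algorithm: it never removes a variable once added, it handles only numeric predictors, and it ignores survey weights. All names and data are hypothetical.

```python
import math


def ols_sse(columns, y):
    """Residual sum of squares from regressing y on the columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Augmented normal equations [X'X | X'y].
    aug = [
        [sum(row[a] * row[b] for row in design) for b in range(k)]
        + [sum(row[a] * yi for row, yi in zip(design, y))]
        for a in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting (assumes X'X is nonsingular).
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(k):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [x - factor * z for x, z in zip(aug[r], aug[c])]
    beta = [aug[r][k] / aug[r][r] for r in range(k)]
    return sum(
        (yi - sum(b * x for b, x in zip(beta, row))) ** 2
        for row, yi in zip(design, y)
    )


def sbc(sse, n, k):
    """Schwarz Bayesian Criterion in an assumed least-squares form (up to a constant)."""
    return n * math.log(sse / n) + k * math.log(n)


def forward_select(candidates, y):
    """Add the variable that lowers the SBC most; stop when no addition helps."""
    n = len(y)
    selected = []
    best = sbc(ols_sse([], y), n, 1)
    while True:
        trials = []
        for name in candidates:
            if name not in selected:
                cols = [candidates[v] for v in selected + [name]]
                trials.append((sbc(ols_sse(cols, y), n, len(cols) + 1), name))
        if not trials:
            break
        score, name = min(trials)
        if score >= best:
            break
        selected.append(name)
        best = score
    return selected, best


# Hypothetical data: y depends on x1 only, and x2 is an unrelated pattern.
x1 = [float(i) for i in range(20)]
x2 = [float((7 * i) % 5) for i in range(20)]
y = [3.0 + 2.0 * a + (0.5 if i % 2 == 0 else -0.5) for i, a in enumerate(x1)]
selected, score = forward_select({"x1": x1, "x2": x2}, y)
```

Swapping the `sbc` function for an AIC penalty of 2 per parameter gives the analogous AIC-based search.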

Regression Three: PROC GLMSELECT with Akaike's Information Criterion
Using PROC GLMSELECT, I used the aforementioned steps to select a model using the AIC.
Data Summary:
Number of Observations 874,182

Included Variables: EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 FIPST SECTOR
N07_EMPLOYER SEX1 FOUNDED1 PURCHASED1 ACQYR1 PROVIDE1 MANAGE1
FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1
DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT
SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCGRANT
SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS
ECASSETS ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECFAMLOAN
ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS
ECNOEXPAND FEDERAL STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE
FULLTIME PARTTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS
HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE
ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY
ACTIVITYNABV OPERATING CEASENR HUSBWIFE FAMILYBUS NUMOWNERS
race1noblanks LANGUAGE
Fit Statistics:
R-square 0.5771
Adjusted R-square 0.5770
Root MSE 1516.10130
Since the AIC model has the highest adjusted R-squared of the two stepwise-selected models, a
slightly lower root MSE than the SBC model, and far fewer variables than the full model, it
should be considered the most robust. It is noted that the first linear regression of all
variables has a considerably lower MSE than the stepwise-selected models, which reflects its
inclusion of far more variables. However, given the prevalence of missing data and nonresponse
in these data sets, future data collection may benefit from relying on fewer variables in the
generalized linear model. Therefore, the AIC model ranks as the most effective.
The most significant variables are also those noted in Regression One: sector, health
insurance, holidays, and employer status for tabulation. Sector is arguably the most
intuitively significant, while both health insurance and holidays usually indicate a business
performing well enough to provide additional benefits for employees. Surprisingly, both
payroll and employment had far smaller magnitudes in their estimates, which suggests limited
practical significance, although the size of a coefficient on a numeric variable also depends
on the units of that variable.
There were several limitations to the regression. As stated earlier, the main drawbacks of
using the generalized linear model are that it can over-fit the data and that it assumes the
relationships between the explanatory and response variables are linear, which may not be the
case. Ideally, each variable would be checked for nonlinearity by plotting the residuals
against the predicted values. Furthermore, the variance inflation factor should be calculated
for each variable to assess multicollinearity, which remains to be done because there is no
built-in functionality for this purpose for survey data in SAS. Due to the large number of
variables, testing for multicollinearity is especially important. From a cursory analysis, most
of the t-ratios for the individual coefficients are statistically significant, which could
indicate that multicollinearity is not severe. Another possible option was to create a
correlation matrix, which proved too cumbersome because the matrix would have dimensions
greater than 150 x 150. Finally, further work could investigate the addition of extra terms,
such as interactions between the original terms, that better reflect the lack of complete
independence between explanatory variables.
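To make the multicollinearity check concrete, the sketch below computes a Pearson correlation and the variance inflation factor for the simplest case of two predictors, where VIF = 1 / (1 - r^2). In general the r^2 term is the R-squared from regressing one predictor on all the others. This is an unweighted illustration in Python with hypothetical values, not a survey-weighted SAS computation.

```python
import math


def pearson(x, y):
    """Unweighted Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical stand-ins for two numeric predictors such as payroll and employment.
payroll = [1.0, 2.0, 3.0, 4.0, 5.0]
employment = [2.0, 1.0, 4.0, 3.0, 5.0]
r_example = pearson(payroll, employment)
vif_example = vif_two_predictors(payroll, employment)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity, though the threshold is a judgment call.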
Proper model selection requires conscientious consideration of the tradeoff between the
competing objectives of adherence to the data and model simplicity. Good models conform to the
data with a strong goodness of fit, but are also easily generalizable in their interpretation.
Finally, good models should neither under-fit (leaving out key variables in an attempt to be
generalizable) nor over-fit the data (including extraneous or unrealistic variable effects in
an attempt to achieve the best goodness of fit), because in each scenario the conclusion loses
value.








QQ Plot of Residuals versus a Normal Distribution

This quantile-quantile plot of the residuals versus a normal distribution shows that the
residuals appear to be approximately normally distributed through the inner quartiles, but have
long tails on both sides, particularly the left side. Since this Q-Q plot indicates a
substantial departure from normality, conclusions drawn from this model should be treated with
caution.
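For reference, the points of a normal Q-Q plot can be computed by pairing each ordered, standardized residual with the standard normal quantile at the same plotting position. The sketch below uses the position (i - 0.5) / n, which is one of several conventions and is an assumption here, and the residual values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical normal quantile, standardized residual) pairs."""
    ordered = sorted(residuals)
    n = len(ordered)
    center, scale = mean(ordered), stdev(ordered)
    standard_normal = NormalDist()
    points = []
    for i, value in enumerate(ordered, start=1):
        position = (i - 0.5) / n  # assumed plotting position
        points.append((standard_normal.inv_cdf(position), (value - center) / scale))
    return points


# Hypothetical residuals with one extreme negative value (a long left tail).
example_points = qq_points([-40.0, -3.0, -1.0, 0.5, 2.0])
```

Points that fall far below the 45-degree line at the low end, as the first pair does here, correspond to the long left tail described above.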
Plot of Fitted Values versus Residuals

This plot of fitted values versus residuals shows a biased and heteroscedastic spread, with
the residuals steadily decreasing in variation at higher values of the response variable.
Because of the extreme density of points, attributable to the large overall variation and the
sheer number of data points, more specific patterns cannot be analyzed at this point.
The plot also points to the issue of the model drastically over-estimating predicted values,
given the sheer magnitude of the negative residuals, which calls for further analysis of the
observed businesses that deviate the most from the model. Given that over half of the observed
businesses have receipts less than $16,000 according to the five-number summary, running a
regression on the subset of the PUMS that includes only businesses with receipts above the
median may produce a better-fitting, less biased, and more homoscedastic model.
The five-number summary of the residuals clearly shows greater negative skew on the whole, but
a greater density of positive values over smaller intervals, thus confirming the residuals
versus fitted plot. Since numerical variables typically have more leverage over the fit and the
spread of the residuals, and receipts is extremely right-skewed, an analysis of both payroll
and employment is warranted to see whether similar effects are present. The following two
tables describe the five-number summaries for payroll and employment.

These results confirm that both employment and payroll are heavily skewed right, which
would explain the potential for the model to drastically overestimate certain values. Since the
actual magnitude (from a dollar or labor force standpoint) causes this extreme skew, creating a
new binary variable for each of these variables such that 0 indicates employment or payroll of 0
and 1 indicates employment or payroll greater than 0 could result in better fit. Additionally,
taking the logarithm of RECEIPTS_NOISY may induce better fit as well due to the skew.
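The two transformations proposed above can be sketched per observation as follows. The field names ifpayroll and ifemployment match the variables used later in Regression Five. The name LOG_RECEIPTS and the choice to leave the logarithm missing when receipts are zero are assumptions for illustration, since the handling of zero receipts is not described here.

```python
import math


def add_transformed_fields(record):
    """Return a copy of one business record with indicator and log fields added."""
    out = dict(record)
    # 1 if the business reports any payroll or employment, 0 otherwise.
    out["ifpayroll"] = 1 if record["PAYROLL_NOISY"] > 0 else 0
    out["ifemployment"] = 1 if record["EMPLOYMENT_NOISY"] > 0 else 0
    # Assumption: log of zero receipts is undefined, so it is left missing (None).
    receipts = record["RECEIPTS_NOISY"]
    out["LOG_RECEIPTS"] = math.log(receipts) if receipts > 0 else None
    return out


# Hypothetical records, not actual PUMS observations.
with_staff = add_transformed_fields(
    {"PAYROLL_NOISY": 0.0, "EMPLOYMENT_NOISY": 3.0, "RECEIPTS_NOISY": math.e}
)
no_receipts = add_transformed_fields(
    {"PAYROLL_NOISY": 0.0, "EMPLOYMENT_NOISY": 0.0, "RECEIPTS_NOISY": 0.0}
)
```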
To establish a baseline for the following regressions, I first ran the PROC GLMSELECT
procedure with the AIC, with the only change being the logarithm transformation of receipt
values.
Regression Four: PROC GLMSELECT with AIC: Receipts with Logarithm Transform
Data Summary:
Number of Observations - 784,208
Five Number Summary of RECEIPTS_NOISY
Min          Q1          Median      Q3          Max
0            1.72393     15.23411    81.77834    6900
Five Number Summary of Residuals
Min          Q1          Median      Q3          Max
-93534.00    -222.63     -36.71      101.67      8271.42
Five Number Summary of EMPLOYMENT_NOISY
Min          Q1          Median      Q3          Max
0            0           0           0           4890.00
Five Number Summary of PAYROLL_NOISY
Min          Q1          Median      Q3          Max
0            0           0           0           280000.00

Included Variables: PAYROLL_NOISY EMPLOYMENT_NOISY PCT1 FIPST SECTOR
N07_EMPLOYER SEX1 VET1 FOUNDED1 PURCHASED1 INHERITED1 RECEIVED1
ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1
SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS
SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN
SCVENTURE SCGRANT SCOTHER SCDONTKNOW SCAMOUNT HOMEBASED
FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY ECCREDIT
ECGOVTLOAN ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER
ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL OTHERBUS INDIVIDUALS
EXPORTS FULLTIME PARTTIME LEASED CONTRACTORS HEALTHINS RETIREMENT
PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH
LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV
OPERATING CEASENR HUSBWIFE NUMOWNERS race1noblanks LANGUAGE
Fit Statistics:
R-Square 0.6689
Adjusted R-Square 0.6688
Root MSE 3.07507


While the fit has improved by most standards, the overall Q-Q plot and the fitted versus
residual values plot still indicate major issues with skew and bias. As stated earlier, creating
binary variables out of the existing numeric payroll and employment variables could control
some of the drastic skewness observed in both plots, which leads to the next regression.

Regression Five: PROC GLMSELECT with AIC: Modified Binary Payroll and Employment with Other Character Variables against the Logarithm of Receipts
Using PROC GLMSELECT, I used the aforementioned steps to select a model using the AIC
with the modified variables of binary payroll and employment.
Variables Used: PCT1 FIPST SECTOR N07_EMPLOYER SEX1 VET1 FOUNDED1
PURCHASED1 INHERITED1 RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1
FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1
ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN
SCBANKLOAN SCFAMLOAN SCVENTURE SCGRANT SCOTHER SCAMOUNT
HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY
ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECVENTURE ECPROFITS
ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL
STATELOCAL OTHERBUS INDIVIDUALS EXPORTS FULLTIME DAYLABOR
TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE
HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS
LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR
HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks LANGUAGE ifpayroll
ifemployment

Fit Statistics:
R-square 0.6550
Adjusted R-square 0.6549
Root MSE 3.13900

Although the fit statistics seem to indicate a worse fit than the previous model, given the
lower coefficients of determination and the higher root MSE, the Q-Q plot clearly indicates
better overall fit and far less skew, especially left skew.

The correction of skew also allows me to observe the bulk of the fitted versus residuals plot
more closely, which is starting to show very peculiar patterns that require further
investigation. One possible reason for these patterns is that virtually all of the regressed
variables are categorical, with the exception of the ownership percentage of the first business
owner.
Although the previous model had better fit statistics and the current model may be
susceptible to over-fitting the sample, this model iteration ultimately has better results
according to the Q-Q plot and the fitted versus residuals plot, owing to reduced skewness.

Analysis of Language
Research Question 1a: Does the language spoken in transactions produce a difference in correlated receipts?
The United States is often seen as the nation most conducive to immigrant
entrepreneurship, especially in an increasingly globalized society. To start, I investigated
the relationship between different languages and overall business receipts. Running a general
linear model procedure for only the language variable against receipts results in the following:
Estimates of Language Variable Levels Against Receipts

One interesting phenomenon is that a business that conducts transactions only in English
and Spanish is associated with higher gross receipts than a business that conducts
transactions in any other language or language combination. Most notably, the estimate for a
business that conducts transactions only in English and Spanish is nearly 40% greater than the
estimate for a business that conducts transactions only in English.
It's possible that Hispanics are employed in the agriculture, manufacturing, and
construction industries more so than in others. The following graph, from 2008 research by the
Center for Construction Research and Training, depicts Hispanic employees as a percentage of
each industry.

These data point to a possible underlying non-uniform distribution of Hispanics across
business industries. It's important to understand the difference between businesses that conduct
transactions in a certain language and businesses that operate internally using the language.
For example, a business with a predominantly Hispanic workforce may not necessarily conduct
business transactions in Spanish. Therefore, an additional research question could be the
relationship between the language variable and the new race/ethnicity variable. The percentage
of Hispanics by industry in this instance simply points to industries to investigate more
closely, such as construction and agriculture.
Using this information, I analyzed the breakdown of language by NAICS sector code,
paying special attention to the industries that had the greatest percentage of Only English and
Spanish businesses.


Industries with the Greatest Proportion of Only Spanish and English Businesses:
Sector 55: 13.63%
The Management of Companies and Enterprises sector comprises (1) establishments that hold
the securities of (or other equity interests in) companies and enterprises for the purpose of
owning a controlling interest or influencing management decisions or (2) establishments (except
government establishments) that administer, oversee, and manage establishments of the company
or enterprise and that normally undertake the strategic or organizational planning and decision
making role of the company or enterprise.

Sector 62: 10.74%
The Health Care and Social Assistance sector comprises establishments providing health care
and social assistance for individuals.

Industries with the Greatest Proportion of Only English Businesses:

Sector 21: 95.71%

The Mining sector comprises establishments that extract naturally occurring mineral solids, such
as coal and ores; liquid minerals, such as crude petroleum; and gases, such as natural gas.

Sector 11: 92.90%

The Agriculture, Forestry, Fishing and Hunting sector comprises establishments primarily
engaged in growing crops, raising animals, harvesting timber, and harvesting fish and other
animals from a farm, ranch, or their natural habitats.

Given these interpretations, looking at the mean receipts for each sector could then
explain why businesses that conducted business transactions in English and Spanish have
statistically higher receipts on average. The following table shows each sector and its mean
receipts:

Statistics
Variable          Mean          Std Error of Mean
RECEIPTS_NOISY    174.631446    0.353243

Domain Analysis: SECTOR
SECTOR                                               Mean          Std Error of Mean
11 Agriculture, Fishing                              96.067138     2.182337
21 Mining, Quarrying, Oil Extraction                 233.433129    6.192045
22 Trade, Transportation, Utilities                  113.609696    6.397033
23 Construction                                      218.497973    1.149555
31 Manufacturing                                     527.070573    4.920525
42 Wholesale Trade                                   628.044453    5.295762
44 Retail Trade                                      267.633023    1.575551
48 Transportation and Warehousing                    144.352095    1.225643
51 Information                                       156.241621    2.441339
52 Finance and Insurance                             170.082058    1.558410
53 Real Estate and Rental                            120.396544    0.699686
54 Professional, Scientific, Technical Services      139.530715    0.707921
55 Mgmt. of Companies and Enterprises                490.728375    19.475524
56 Admin. and Support and Waste Management           100.129900    0.835629
61 Education                                         45.306452     0.868810
62 Healthcare                                        175.218046    1.177138
71 Arts and Etnmt.                                   58.393615     0.740265
72 Accommodation and Food Services                   356.426518    2.657923
81 Other Services                                    69.516196     0.417748
99 Unclassifiable                                    102.268452    8.707951
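
The domain means in the table are survey-weighted means of receipts within each sector. A minimal sketch of that point estimate is given below. It does not reproduce the design-based standard errors that PROC SURVEYMEANS reports, the weight field name TABWGT is an assumption, and the rows are hypothetical.

```python
from collections import defaultdict


def weighted_means_by_domain(rows, domain_key, value_key, weight_key):
    """Weighted mean of value_key within each level of domain_key."""
    totals = defaultdict(lambda: [0.0, 0.0])  # domain -> [weighted sum, weight sum]
    for row in rows:
        weight = row[weight_key]
        totals[row[domain_key]][0] += weight * row[value_key]
        totals[row[domain_key]][1] += weight
    return {domain: num / den for domain, (num, den) in totals.items()}


# Hypothetical records; TABWGT stands in for the tabulation weight (assumed name).
sample_rows = [
    {"SECTOR": "23", "RECEIPTS_NOISY": 100.0, "TABWGT": 2.0},
    {"SECTOR": "23", "RECEIPTS_NOISY": 400.0, "TABWGT": 1.0},
    {"SECTOR": "62", "RECEIPTS_NOISY": 150.0, "TABWGT": 4.0},
]
sector_means = weighted_means_by_domain(sample_rows, "SECTOR", "RECEIPTS_NOISY", "TABWGT")
```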

According to this PROC SURVEYMEANS output, which gives the mean gross receipts of the average
business in each industry alongside the mean gross receipts of all businesses across industries
(174.63), both sectors 55 and 62 earn above average, although sector 62 (175.22) does so only
marginally. While this preliminarily explains why the overall estimate for Only English and
Spanish businesses is higher than the estimates for other language levels, the following
question should be investigated:




Research Question 1b: Among Only English and Spanish businesses, which are the most popular industries for business?

Table of LANGUAGE by SECTOR
LANGUAGE                    SECTOR    Frequency    Weighted Frequency    Percent
Only English and Spanish    11        390          4076                  0.4269
                            21        263          1907                  0.1998
                            22        123          533.41200             0.0559
                            23        7472         101173                10.5960
                            31        3480         17790                 1.8632
                            42        4346         27406                 2.8702
                            44        13181        113815                11.9199
                            48        3979         44151                 4.6239
                            51        1460         9222                  0.9658
                            52        5665         46651                 4.8858
                            53        6760         85715                 8.9770
                            54        11573        117115                12.2656
                            55        1159         1581                  0.1656
                            56        5933         74071                 7.7575
                            61        1325         16665                 1.7453
                            62        11269        127184                13.3201
                            71        1917         25298                 2.6495
                            72        4017         35816                 3.7511
                            81        7699         104522                10.9467
                            99        22           135.88500             0.0142
Total                                 92033        954827                100.000

The five highest concentrations of Only English and Spanish businesses are in
sectors 62 (13.32% and receipts of 175.22), 54 (12.27% and receipts of 139.53), 44 (11.92% and
receipts of 267.63), 81 (10.95% and receipts of 69.52), and 23 (10.60% and receipts of 218.50).
Although Only English and Spanish businesses made up a large share of the businesses in sector
55, that sector actually had one of the smallest frequencies, with just 1159 such businesses in
total. Revising my initial conclusion, I argue that the higher estimate is most likely derived
from consistently above average performance in sectors where Only English and Spanish
businesses are prevalent.
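The percentages quoted above are each sector's weighted frequency divided by the weighted total for Only English and Spanish businesses. The sketch below recomputes them from the values in the LANGUAGE by SECTOR table and ranks the sectors. It is written in Python for illustration and is not the PROC SURVEYFREQ code.

```python
# Weighted frequencies copied from the LANGUAGE by SECTOR table.
weighted_frequency = {
    "11": 4076, "21": 1907, "22": 533.412, "23": 101173, "31": 17790,
    "42": 27406, "44": 113815, "48": 44151, "51": 9222, "52": 46651,
    "53": 85715, "54": 117115, "55": 1581, "56": 74071, "61": 16665,
    "62": 127184, "71": 25298, "72": 35816, "81": 104522, "99": 135.885,
}
reported_total = 954827  # total weighted frequency as reported in the table

# Share of each sector among Only English and Spanish businesses, in percent.
percent = {
    sector: 100.0 * value / reported_total
    for sector, value in weighted_frequency.items()
}

# Sectors ordered from the largest share to the smallest.
ranked = sorted(percent, key=percent.get, reverse=True)
top_five = ranked[:5]
```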
Visualizing the Percentage of Hispanic/Spanish-Speaking Businesses in the United States
As noted earlier, there is a distinction between businesses run by Hispanic owners
and businesses that conduct transactions in English and Spanish. I approached this issue in a
different fashion, by plotting the percentage of each (by state) on a map of the United
States. I pre-defined set intervals after looking at the spread of percentages across states.
The following maps generally depict the same pattern of higher concentrations of
Hispanic/Spanish-speaking subjects in the Southwest and Florida. I obtained 2007 data
on the number of Hispanics living in each state from the Pew Research Hispanic Trend Project.



Analysis of Capital Sources in California versus the United States
California's $2 trillion economy would be the ninth biggest in the world if it were a
country. The state represents 13% of the U.S. economy. However, California has been ranked as
one of the worst states in which to do business in recent years, according to business
executives and publications.[2] The state has been under duress from the dramatic fall in home
prices and reduced state tax revenues. Moreover, California consistently has one of the highest
costs of living and operation. Interestingly, California also ranks among the best for
technology and innovation. Another plus is the $36 billion in venture capital invested in
California companies over the past three years, which is four times the total of any other
state.[3] California is also home to Silicon Valley. According to a 2006 study by the American
Electronics Association, Silicon Valley and the Bay Area as a whole ranked first in the
number of high-tech jobs in the United States. While the PUMS does not have location data
more granular than the state level, the existence of Silicon Valley could point to interesting
statistical characteristics of California that no other state may share.
Because California is a state of extremes, I found it interesting to investigate sources of
startup and expansion capital and their relationship to revenues in California, especially
compared to the same relationship in the United States as a whole.
Dataset Overview
The observed dataset is a subset of the cleaned PUMS, obtained by keeping only
observations with a FIPS state code of 06 (California). This dataset has 182,932
observations, and most categorical variables (except location) appear to have a missing
percentage of 30-50%, which is slightly higher than the typical missing pattern observed in the
U.S. dataset. In this section, both startup and expansion capital will be analyzed.
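The subsetting and the missing-percentage check described above can be sketched as follows. FIPST is the state variable named in the regression output. Storing the code as the string "06" and treating None or an empty string as missing are assumptions for illustration, and the records are hypothetical.

```python
def subset_by_state(rows, fips_code="06", key="FIPST"):
    """Keep only the records whose state FIPS code matches fips_code."""
    return [row for row in rows if row.get(key) == fips_code]


def missing_share(rows, key):
    """Fraction of records where the variable is missing (None or empty string)."""
    missing = sum(1 for row in rows if row.get(key) in (None, ""))
    return missing / len(rows)


# Hypothetical records, not actual PUMS observations.
records = [
    {"FIPST": "06", "SCSAVINGS": "1"},
    {"FIPST": "06", "SCSAVINGS": ""},
    {"FIPST": "48", "SCSAVINGS": "2"},
    {"FIPST": "06", "SCSAVINGS": None},
]
california = subset_by_state(records)
california_missing = missing_share(california, "SCSAVINGS")
```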
Startup capital refers to the initial cost of investment to fully bring a product or service to
market. It can be used for everything from business operation expenses to research and
development to payroll. It is typically used to fund businesses still in their infancy, and can be
repaid once the business reaches a level of maturity to earn revenues on its own.
Companies that seek expansion capital, on the other hand, will often do so in order to
finance a transformational event in their business. These companies are likely to be more mature
(in terms of operating time) than venture capital funded companies, able to generate revenue and
operating profits but unable to generate sufficient cash to fund major opportunities, acquisitions
or other investments. Because of this lack of scale these companies generally can find few
alternative conduits to secure capital for growth, so access to growth equity can be critical to
pursue necessary facility expansion, sales and marketing initiatives, equipment purchases, and
new product development.


[2] http://www.cnbc.com/id/100843287
[3] http://www.forbes.com/pictures/mli45kikd/41-california/

Glossary of Capital Sources
Startup Capital (SC)
SCSAVINGS: Personal Savings
SCASSETS: Other Personal Assets
SCEQUITY: Home Equity
SCCREDIT: Credit Cards
SCGOVTLOAN: Government Loan
SCGOVTGUAR: Government Guaranteed Loan (the United States government and the Small
Business Administration provide loans to certain businesses depending on size and capital
purposes)
SCBANKLOAN: Loan from Bank
SCFAMLOAN: Loan from family and friends
SCVENTURE: Venture Capitalist
SCGRANT: Grant
SCOTHER: Other
Expansion Capital (EC)
ECSAVINGS: Personal Savings
ECASSETS: Other Personal Assets
ECEQUITY: Home Equity
ECCREDIT: Credit Cards
ECGOVTLOAN: Government Loan
ECGOVTGUAR: Government Guaranteed Loan (the United States government and the Small
Business Administration provide loans to certain businesses depending on size and capital
purposes)
ECBANKLOAN: Loan from Bank
ECFAMLOAN: Loan from family and friends
ECVENTURE: Venture Capitalist
ECPROFITS: Business Profits
ECGRANT: Grant
ECOTHER: Other
ECNOEXPAND: Did not expand
ECNOACCESS: No Access to Expansion Capital
Research Question 2a: What sources of capital have the most positive relationship with receipts?
Regressing both startup capital and expansion capital variables against receipts in both
the California and United States datasets reveals interesting statistics on the percentage of
businesses using each type of capital, as well as the practical significance of each capital
source as represented by its estimate. The following data tables derive content from the full
list of SAS outputs contained in Appendix C.
Summarized Table for Startup Capital
                  California          U.S.                Estimates           True Estimates (added to Intercept)
Variable          Yes       No        Yes       No        CA        USA       CA        USA       Difference
SCSAVINGS*        60.97%    39.03%    58.01%    41.99%    11.70     -15.78    113.71    154.40    -40.69
SCASSETS          6.23%     93.77%    7.05%     92.95%    66.98     48.85     168.98    219.03    -50.04
SCEQUITY*         6.12%     93.88%    5.07%     94.93%    71.58     49.37     173.59    219.55    -45.96
SCCREDIT          11.22%    88.78%    10.31%    89.69%    -53.88    -90.21    48.13     79.96     -31.83
SCGOVTLOAN        0.40%     99.60%    0.56%     99.44%    155.74    91.93     257.74    262.10    -4.36
SCGOVTGUAR        0.45%     99.55%    0.59%     99.41%    128.86    195.60    230.87    365.78    -134.92
SCBANKLOAN*       4.47%     95.53%    9.18%     90.82%    247.36    305.75    349.37    475.93    -126.56
SCFAMLOAN         2.23%     97.77%    2.28%     97.72%    36.52     180.69    138.53    350.86    -212.34
SCVENTURE         0.30%     99.70%    0.25%     99.75%    68.87     364.61    170.88    534.78    -363.91
SCGRANT           0.18%     99.82%    0.20%     99.80%    -78.63    -103.56   23.37     66.62     -43.24
SCOTHER           1.74%     98.26%    1.70%     98.30%    203.10    170.52    305.10    340.69    -35.59
SCDONTKNOW        4.47%     95.53%    4.58%     95.42%    150.63    183.79    252.64    353.96    -101.32
SCNONENEEDED      23.59%    76.41%    24.35%    75.65%    -53.49    111.12    48.52     281.30    -232.78
SCNOTREPORTED     5.33%     94.67%    5.32%     94.68%    0.00      0.00      102.01    170.18    -68.17
INTERCEPT                                                 102.01    170.18
In terms of usage frequency, differences greater than 1 percentage point between the United
States and California are marked with an asterisk (*). Regardless, it's important to note that
most of these differences (even those less than 1 percentage point) are statistically
significant due to the large sample size. Businesses in the United States as a whole are more
than twice as likely to use bank loans as a source of startup capital when compared to
businesses in California, while Californian business owners are more likely to use home equity
and their own savings to start ventures. Overall, the top three categories with the highest
estimate magnitudes for the United States are venture capital, bank loans, and government
guaranteed loans. For California, these categories are bank loans, other sources of capital
(unspecified), and government loans.
Consistent with its low ranking, California as a whole appears to perform worse than the
United States according to the differences in estimates. Most interestingly, California performs
comparatively the worst in the venture capital category, despite being the state with
the greatest amount of venture capital invested in business venture formation. Other
comparatively poor categories are loans from family members and businesses that do not need
startup capital. These results warrant further analysis of the spread, rather than the
average, of certain categories, as it is possible for Californian businesses to have greater
extremes than American businesses in general.
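The "True Estimates" columns of the startup capital table add each coefficient to the corresponding intercept, and the Difference column subtracts the United States value from the California value. The short sketch below reproduces three rows from the reported coefficients. Small discrepancies of 0.01 against the table come from rounding in the published values.

```python
# Intercepts and coefficients copied from the startup capital table.
INTERCEPT = {"CA": 102.01, "USA": 170.18}
COEFFICIENT = {
    "SCSAVINGS": {"CA": 11.70, "USA": -15.78},
    "SCBANKLOAN": {"CA": 247.36, "USA": 305.75},
    "SCVENTURE": {"CA": 68.87, "USA": 364.61},
}


def true_estimates(source):
    """Intercept plus coefficient for each region, and the CA minus USA difference."""
    ca = INTERCEPT["CA"] + COEFFICIENT[source]["CA"]
    usa = INTERCEPT["USA"] + COEFFICIENT[source]["USA"]
    return {"CA": ca, "USA": usa, "Difference": ca - usa}


savings = true_estimates("SCSAVINGS")
venture = true_estimates("SCVENTURE")
bank_loan = true_estimates("SCBANKLOAN")
```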
Summarized Table for Expansion Capital
                  California          U.S.                Estimates           True Estimates (added to Intercept)
Variable          Yes       No        Yes       No        CA        USA       CA        USA       Difference
ECSAVINGS*        32.24%    67.76%    29.00%    71.00%    -33.74    -112.84   76.38     118.78    -42.39
ECASSETS          3.90%     96.10%    4.00%     96.00%    21.89     5.40      132.03    237.03    -105.00
ECEQUITY*         5.84%     94.16%    4.33%     95.67%    135.96    90.04     246.09    321.67    -75.57
ECCREDIT*         13.68%    86.32%    12.09%    87.91%    -9.45     -88.74    100.67    142.87    -42.19
ECGOVTLOAN        0.33%     99.67%    0.39%     99.61%    419.09    99.43     529.23    331.06    198.17
ECGOVTGUAR        0.26%     99.74%    0.29%     99.71%    63.47     167.61    173.61    399.24    -225.63
ECBANKLOAN*       4.60%     95.40%    7.54%     92.46%    461.09    446.34    571.22    677.97    -106.74
ECFAMLOAN         1.06%     98.94%    0.93%     99.07%    69.17     113.62    179.31    345.25    -165.93
ECVENTURE         0.18%     99.82%    0.13%     99.87%    514.95    521.87    625.09    753.49    -128.40
ECPROFITS         8.95%     91.05%    9.23%     90.77%    124.95    162.90    235.09    394.53    -159.43
ECGRANT           0.20%     99.80%    0.19%     99.81%    -98.08    -131.52   12.05     100.10    -88.04
ECOTHER           0.83%     99.17%    0.76%     99.24%    127.44    132.62    237.57    364.24    -126.67
ECDONTKNOW        6.76%     93.24%    6.57%     93.43%    -9.43     -43.75    100.70    187.86    -87.16
ECNOACCESS        1.93%     98.07%    1.80%     98.20%    -68.32    -161.82   41.81     69.80     -27.98
ECNOEXPAND*       45.80%    54.20%    48.29%    51.71%    -26.99    -108.06   83.14     123.56    -40.41
ECNOTREPORTED     7.38%     92.62%    7.61%     92.39%    0         0         110.13    231.63    -121.49
INTERCEPT         N/A       N/A       N/A       N/A       110.137   231.627

Differences in usage greater than 1 percentage point between the United States and California
are marked with an asterisk (*). Businesses in the United States as a whole are more likely to
use bank loans for expansion capital, or not to expand at all. Businesses in California are more
likely to use their own savings, home equity, or credit card debt to fund expansion.
Overall, the top three categories with the highest estimate magnitudes for the United
States are venture capital, bank loans, and government guaranteed loans (identical to the top
three for startup capital). For California, these categories are venture capital, bank
loans, and government loans. Once again, businesses in the United States tend to benefit more
from virtually all sources of expansion capital than businesses in California, with the
exception of government loans. These interesting characteristics further warrant an analysis of
spread, rather than just the average, for certain categories.
Research Question 2b: How does the spread of receipts for certain capital sources compare between businesses in California and the general United States?
SCVENTURE and ECVENTURE Analysis
Even though California performed worse on average according to its estimates, I initially
hypothesized that California had a larger overall spread, with a maximum and upper quartile
likely exceeding the maximum and upper quartile of receipts in the United States.
According to the following five-number summaries, California's quantiles for startup venture
capital exceed those of the United States except for the maximum value, which indicates that
Californian businesses using startup venture capital generally perform better than such
businesses in the United States as a whole. The previous estimates from the regression are
influenced by the heavier right skew of the United States data.
Venture Capital - Five Number Summaries
Quantile    California - SC    US - SC     California - EC    US - EC
Min         0                  0           0                  0
Q1          10.57              7.71        9.34               12.32
Median      97.28              52.54       91.73              92.12
Q3          774.90             454.80      608.45             725.24
Max         6600.00            6900.00     6900.00            6800.00
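
A five-number summary of this kind can be computed as sketched below. The sketch is unweighted and uses the "inclusive" quartile definition from the Python standard library, which is an assumption and may not match the quantile definition used by SAS. The receipts values are hypothetical.

```python
from statistics import mean, quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Evenly spaced values make the quartiles easy to verify by hand.
simple_summary = five_number_summary([0, 1, 2, 3, 4, 5, 6, 7, 8])

# Hypothetical right-skewed receipts: the mean sits far above the median.
skewed_receipts = [0, 5, 10, 20, 50, 100, 200, 1000, 6900]
skewed_summary = five_number_summary(skewed_receipts)
skewed_mean = mean(skewed_receipts)
```

The gap between the mean and the median in the skewed example is the same effect that pulls the regression estimates away from what a typical business earns.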

ECGOVTGUAR Analysis
Since ECGOVTGUAR was the only source of expansion capital for which California had
a greater estimate than the United States, I also wanted to conduct a spread analysis.
According to the PROC SURVEYFREQ procedure, there are 440 businesses in California that
used a government guaranteed loan for expansion capital.
Expansion Capital from Gov't Guar. Loan
Quantile    California - EC    US - EC
Min         0.00               0.00
Q1          83.57              22.29
Median      301.42             174.09
Q3          993.67             621.90
Max         6900.00            6900.00

The spread confirms the general interpretation from the estimate such that Californian
businesses as a whole tend to benefit more from government guaranteed loans.

Conclusion
My analysis of the Public Use Microdata Sample of the 2007 Survey of Business Owners
involved cleaning and manipulating raw data; selecting a generalized linear model of all
variables against receipts using Akaike's Information Criterion; analyzing business
transaction language, industry of business, and ethnicity of the owner in the context of state
location; and investigating the differences in capital sources between businesses in
California and the United States in general.
From the data cleaning, I was able to incorporate my knowledge of the underlying survey
methodology to consolidate and remove variables and prepare the data for proper analysis in
this context. From the model fitting, I was able to learn and apply different model selection
techniques and selection criteria (AIC, SBC/BIC, Mallows' Cp, MSE, and adjusted R-square) to
ultimately choose the best-fitting model, which attained a moderately strong adjusted
coefficient of determination after several modifications that involved a logarithm
transformation to reduce skewness and improve overall fit.
The language analysis revealed that businesses that conducted transactions in only
English and Spanish had higher estimates than businesses that only used English. Investigating
these businesses within the context of sector ultimately showed that businesses that only used
English and Spanish were represented at a higher percentage in certain sectors (such as sector
23, Construction) that earned more on average than other sectors where Only English businesses
had a higher percentage. Finally, the analysis of sources of capital revealed that California
overwhelmingly performed worse than the United States based on averages and regression
estimates, but closer analysis of the spread of receipts given startup venture capital in
California shows that estimates can be misleading, and that heavy right skew undermines
conclusions based on the average alone.
Given more time and access to the actual dataset (without noise and other confidentiality-preserving
measures), I would be able to develop a more powerful and accurate model, along
with other analyses. Other worthwhile extensions would be creating a
correlation matrix to check for collinearity between variables, examining the relationship between
receipts and more demographic information such as age, gender, and education level, and
conducting statistical analysis with different response variables, such as employment and payroll.
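
As an illustration of the proposed collinearity check, the following is a minimal Python sketch of a Pearson correlation matrix. It was not part of this study: the column labels echo the payroll, employment, and receipts variables, but the numbers are hypothetical and the helper names are assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict holding the correlation for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values (not PUMS records).
columns = {
    "payroll": [10.0, 20.0, 30.0, 40.0, 50.0],
    "employment": [1.0, 2.0, 3.0, 4.0, 5.0],
    "receipts": [120.0, 90.0, 200.0, 150.0, 310.0],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Pairs of predictors with a correlation close to 1 or -1, such as the payroll and employment columns in this toy data, are the ones that would signal collinearity.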
Lessons Learned
I've come to believe that doing independent research in the context of statistics is
incredibly important. Through this study, I've been able to apply everything that I've learned in
all of my statistics courses, from learning how to handle a very large data set in SAS to making
the proper assumptions and conclusions from my analyses. I argue that this is the highest form of
learning, as it is completely experiential, based on existing data, and set entirely in real-world
scenarios. I've also been fortunate enough to study the fusion of my two academic fields,
business and statistics. The flexibility of independent research has allowed me to learn about
existing literature in the vast field of business statistics and entrepreneurship, as well as to consider
other possible research ideas such as social network analysis and survival analysis.

Appendix A: Full Output of Regression with log(Receipts) and Modified
Payroll and Employment

The GLMSELECT Procedure
Selected Model

The selected model is the model at the last step (Step 81).
Effects: Intercept PAYROLL_NOISY EMPLOYMENT_NOISY PCT1 FIPST SECTOR
N07_EMPLOYER SEX1 VET1 FOUNDED1 PURCHASED1 INHERITED1
RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1
HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1
ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT
SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCVENTURE
SCGRANT SCOTHER SCDONTKNOW SCAMOUNT HOMEBASED
FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY ECCREDIT
ECGOVTLOAN ECBANKLOAN ECVENTURE ECPROFITS ECGRANT
ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL
OTHERBUS INDIVIDUALS EXPORTS FULLTIME PARTTIME LEASED
CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS
BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS
LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING
CEASENR HUSBWIFE NUMOWNERS race1noblanks LANGUAGE

Analysis of Variance
Source               DF    Sum of Squares    Mean Square    F Value
Model               207          14974589          72341    7650.22
Error            784001           7413568        9.45607
Corrected Total  784208          22388157

Root MSE 3.07507
Dependent Mean 4.23116
R-Square 0.6689
Adj R-Sq 0.6688
AIC 2546267
AICC 2546268
SBC 1764464
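
As a rough cross-check, the fit statistics listed here can be recomputed from the Analysis of Variance table. The Python sketch below assumes the criterion formulas that PROC GLMSELECT is understood to use for least-squares fits (AIC = n ln(SSE/n) + 2p + n + 2, AICC = n ln(SSE/n) + n(n + p)/(n - p - 2), SBC = n ln(SSE/n) + p ln(n)); the function and variable names are illustrative, and the inputs are the rounded values printed in the table.

```python
import math

# Values read from the Analysis of Variance table in this appendix.
n = 784_208 + 1      # observations = Corrected Total DF + 1
p = 207 + 1          # parameters = Model DF + intercept
sse = 7_413_568.0    # Error sum of squares
sst = 22_388_157.0   # Corrected Total sum of squares


def fit_statistics(n, p, sse, sst):
    """Recompute the summary statistics from n, p, SSE and SST."""
    r_square = 1.0 - sse / sst
    log_term = n * math.log(sse / n)
    return {
        "root_mse": math.sqrt(sse / (n - p)),
        "r_square": r_square,
        "adj_r_square": 1.0 - (n - 1) / (n - p) * (1.0 - r_square),
        # Assumed GLMSELECT-style definitions of the information criteria.
        "aic": log_term + 2 * p + n + 2,
        "aicc": log_term + n * (n + p) / (n - p - 2),
        "sbc": log_term + p * math.log(n),
    }


stats = fit_statistics(n, p, sse, sst)
for name, value in stats.items():
    print(f"{name:>12}: {value:.5f}")
```

Under these assumptions the recomputed values agree with the listing up to rounding of the printed sums of squares, which also shows that the large gap between AIC and SBC comes mainly from the extra n + 2 term in the AIC definition rather than from the penalty terms.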


Parameter Estimates
Parameter DF Estimate Standard Error t Value
Intercept 1 1.889667 0.094597 19.98
PAYROLL_NOISY 1 0.001098 0.000006900 159.06
EMPLOYMENT_NOISY 1 0.013737 0.000194 70.72
PCT1 1 0.000336 0.000107 3.13
FIPST 01 1 0.069510 0.012035 5.78
FIPST 04 1 0.098799 0.010865 9.09
FIPST 05 1 0.007242 0.013682 0.53
FIPST 06 1 0.190992 0.008333 22.92
FIPST 08 1 0.042436 0.010223 4.15
FIPST 09 1 0.176561 0.011957 14.77
FIPST 12 1 0.042020 0.008756 4.80
FIPST 13 1 0.066957 0.009863 6.79
FIPST 15 1 0.128502 0.017731 7.25
FIPST 16 1 -0.020266 0.015023 -1.35
FIPST 17 1 0.053238 0.009259 5.75
FIPST 18 1 0.012449 0.010725 1.16
FIPST 19 1 -0.066745 0.012767 -5.23
FIPST 20 1 -0.015773 0.013107 -1.20
FIPST 21 1 -0.007839 0.012161 -0.64
FIPST 22 1 0.102305 0.012320 8.30
FIPST 23 1 0.000535 0.015441 0.03
FIPST 24 1 0.120146 0.010882 11.04
FIPST 25 1 0.144996 0.010390 13.96
FIPST 26 1 0.006431 0.009752 0.66
FIPST 27 1 0.006867 0.010502 0.65
FIPST 28 1 0.052754 0.014819 3.56
FIPST 29 1 -0.014865 0.010791 -1.38
FIPST 30 1 -0.065514 0.016294 -4.02
FIPST 31 1 -0.083451 0.015061 -5.54
FIPST 32 1 0.171132 0.014256 12.00
FIPST 33 1 0.122197 0.015510 7.88
FIPST 34 1 0.184363 0.009862 18.69
FIPST 35 1 0.046100 0.015776 2.92
FIPST 36 1 0.132543 0.008847 14.98
FIPST 37 1 0.057997 0.009811 5.91
FIPST 39 1 0.021252 0.009527 2.23
FIPST 40 1 -0.000955 0.012247 -0.08
FIPST 41 1 0.055542 0.011366 4.89
FIPST 42 1 0.069120 0.009322 7.41
FIPST 45 1 0.045966 0.011929 3.85
FIPST 47 1 0.070336 0.010845 6.49
FIPST 48 1 0.107862 0.008739 12.34
FIPST 49 1 0.083133 0.012981 6.40
FIPST 51 1 0.079476 0.010169 7.82
FIPST 53 1 0.103092 0.010198 10.11
FIPST 54 1 -0.057228 0.017523 -3.27
FIPST 55 0 0 . .
SECTOR 11 1 1.034824 0.086277 11.99
SECTOR 21 1 1.021291 0.086922 11.75
SECTOR 22 1 0.990292 0.096556 10.26
SECTOR 23 1 1.275192 0.085487 14.92
SECTOR 31 1 1.101583 0.085674 12.86
SECTOR 42 1 1.555542 0.085640 18.16
SECTOR 44 1 1.279568 0.085497 14.97
SECTOR 48 1 1.200337 0.085620 14.02
SECTOR 51 1 0.863523 0.085909 10.05
SECTOR 52 1 0.940771 0.085589 10.99
SECTOR 53 1 0.995807 0.085505 11.65
SECTOR 54 1 0.908610 0.085467 10.63
SECTOR 55 1 -0.452943 0.100108 -4.52
SECTOR 56 1 0.809018 0.085534 9.46
SECTOR 61 1 0.734224 0.085848 8.55
SECTOR 62 1 0.966935 0.085523 11.31
SECTOR 71 1 0.742947 0.085621 8.68
SECTOR 72 1 1.069889 0.085657 12.49
SECTOR 81 1 0.856258 0.085507 10.01
SECTOR 99 0 0 . .
N07_EMPLOYER E 1 1.137851 0.003439 330.84
N07_EMPLOYER N 0 0 . .
SEX1 F 1 -0.152693 0.002767 -55.18
SEX1 M 0 0 . .
VET1 1 1 0.994431 0.478293 2.08
VET1 2 0 0 . .
FOUNDED1 1 1 -0.079178 0.028948 -2.74
FOUNDED1 2 0 0 . .
PURCHASED1 1 1 -0.087677 0.028811 -3.04
PURCHASED1 2 0 0 . .
INHERITED1 1 1 -0.091757 0.028593 -3.21
INHERITED1 2 0 0 . .
RECEIVED1 1 1 -0.060479 0.028701 -2.11
RECEIVED1 2 0 0 . .
ACQYR1 1 1 0.123718 0.010722 11.54
ACQYR1 2 1 0.114924 0.010468 10.98
ACQYR1 3 1 0.122968 0.009717 12.66
ACQYR1 4 1 0.112593 0.009256 12.16
ACQYR1 5 1 0.084345 0.011292 7.47
ACQYR1 6 1 0.055247 0.011273 4.90
ACQYR1 7 1 -0.069739 0.011162 -6.25
ACQYR1 8 0 0 . .
PROVIDE1 1 1 -0.225547 0.003055 -73.82
PROVIDE1 2 0 0 . .
MANAGE1 1 1 -0.065064 0.002819 -23.08
MANAGE1 2 0 0 . .
FINANCIAL1 1 1 0.104690 0.002771 37.78
FINANCIAL1 2 0 0 . .
FNCTNABV1 1 1 -0.117711 0.006090 -19.33
FNCTNABV1 2 0 0 . .
HOURS1 1 1 -0.135239 0.008345 -16.21
HOURS1 2 1 -0.336065 0.004656 -72.19
HOURS1 3 1 -0.181169 0.004282 -42.31
HOURS1 4 1 -0.150164 0.003995 -37.59
HOURS1 5 1 -0.082393 0.003423 -24.07
HOURS1 6 0 0 . .
PRMINC1 1 1 0.242443 0.002948 82.23
PRMINC1 2 0 0 . .
SELFEMP1 1 1 0.047307 0.002353 20.10
SELFEMP1 2 0 0 . .
EDUC1 1 1 -0.166470 0.005955 -27.95
EDUC1 2 1 -0.108420 0.003966 -27.34
EDUC1 3 1 -0.181064 0.005205 -34.78
EDUC1 4 1 -0.124506 0.003872 -32.15
EDUC1 5 1 -0.146356 0.005298 -27.62
EDUC1 6 1 -0.073524 0.003406 -21.58
EDUC1 7 0 0 . .
AGE1 1 1 -0.184340 0.009761 -18.88
AGE1 2 1 0.007893 0.005418 1.46
AGE1 3 1 0.071539 0.004581 15.62
AGE1 4 1 0.071199 0.004198 16.96
AGE1 5 1 0.036784 0.003984 9.23
AGE1 6 0 0 . .
BORNUS1 1 1 -0.037608 0.004052 -9.28
BORNUS1 2 0 0 . .
DISVET1 1 1 -1.093928 0.478399 -2.29
DISVET1 2 1 -1.034112 0.478302 -2.16
DISVET1 3 0 0 . .
ESTABLISHED 1 1 0.133473 0.009513 14.03
ESTABLISHED 2 1 0.135956 0.009681 14.04
ESTABLISHED 3 1 0.132354 0.008972 14.75
ESTABLISHED 4 1 0.111280 0.008763 12.70
ESTABLISHED 5 1 0.095458 0.009609 9.93
ESTABLISHED 6 1 0.094962 0.009231 10.29
ESTABLISHED 7 1 0.071583 0.010832 6.61
ESTABLISHED 8 1 0.051642 0.010923 4.73
ESTABLISHED 9 1 -0.016931 0.010863 -1.56
ESTABLISHED A 0 0 . .
SCSAVINGS 1 1 -0.017371 0.003576 -4.86
SCSAVINGS 2 0 0 . .
SCASSETS 1 1 -0.056578 0.004313 -13.12
SCASSETS 2 0 0 . .
SCEQUITY 1 1 -0.039681 0.005002 -7.93
SCEQUITY 2 0 0 . .
SCCREDIT 1 1 -0.051551 0.003860 -13.36
SCCREDIT 2 0 0 . .
SCGOVTLOAN 1 1 -0.028818 0.013311 -2.16
SCGOVTLOAN 2 0 0 . .
SCGOVTGUAR 1 1 0.022142 0.012371 1.79
SCGOVTGUAR 2 0 0 . .
SCBANKLOAN 1 1 0.055183 0.004020 13.73
SCBANKLOAN 2 0 0 . .
SCFAMLOAN 1 1 -0.018966 0.006370 -2.98
SCFAMLOAN 2 0 0 . .
SCVENTURE 1 1 -0.096289 0.022065 -4.36
SCVENTURE 2 0 0 . .
SCGRANT 1 1 -0.145460 0.028804 -5.05
SCGRANT 2 0 0 . .
SCOTHER 1 1 -0.035163 0.008224 -4.28
SCOTHER 2 0 0 . .
SCDONTKNOW 1 1 -0.052654 0.008604 -6.12
SCDONTKNOW 2 0 0 . .
SCAMOUNT 1 1 -0.002310 0.004671 -0.49
SCAMOUNT 2 1 0.062655 0.005589 11.21
SCAMOUNT 3 1 0.087967 0.005551 15.85
SCAMOUNT 4 1 0.131828 0.006121 21.54
SCAMOUNT 5 1 0.182133 0.006361 28.63
SCAMOUNT 6 1 0.247689 0.006671 37.13
SCAMOUNT 7 1 0.385266 0.007672 50.22
SCAMOUNT 8 1 0.587252 0.011676 50.29
SCAMOUNT 9 1 0.186741 0.006260 29.83
SCAMOUNT A 0 0 . .
HOMEBASED 1 1 -0.193209 0.002631 -73.42
HOMEBASED 2 0 0 . .
FRANCHISE 1 1 0.095547 0.008042 11.88
FRANCHISE 2 0 0 . .
FRANCHISER50 1 1 -0.022814 0.012799 -1.78
FRANCHISER50 2 0 0 . .
ECSAVINGS 1 1 -0.071333 0.003629 -19.66
ECSAVINGS 2 0 0 . .
ECASSETS 1 1 -0.050133 0.005692 -8.81
ECASSETS 2 0 0 . .
ECEQUITY 1 1 0.032494 0.005389 6.03
ECEQUITY 2 0 0 . .
ECCREDIT 1 1 -0.024643 0.003773 -6.53
ECCREDIT 2 0 0 . .
ECGOVTLOAN 1 1 0.038285 0.015650 2.45
ECGOVTLOAN 2 0 0 . .
ECBANKLOAN 1 1 0.103370 0.004219 24.50
ECBANKLOAN 2 0 0 . .
ECVENTURE 1 1 -0.246374 0.031677 -7.78
ECVENTURE 2 0 0 . .
ECPROFITS 1 1 0.031553 0.003818 8.27
ECPROFITS 2 0 0 . .
ECGRANT 1 1 -0.158893 0.028705 -5.54
ECGRANT 2 0 0 . .
ECOTHER 1 1 -0.032423 0.012595 -2.57
ECOTHER 2 0 0 . .
ECDONTKNOW 1 1 -0.088126 0.007204 -12.23
ECDONTKNOW 2 0 0 . .
ECNOACCESS 1 1 -0.095309 0.010197 -9.35
ECNOACCESS 2 0 0 . .
ECNOEXPAND 1 1 -0.043675 0.003966 -11.01
ECNOEXPAND 2 0 0 . .
FEDERAL 1 1 0.041492 0.007992 5.19
FEDERAL 2 0 0 . .
OTHERBUS 1 1 0.045852 0.003214 14.27
OTHERBUS 2 0 0 . .
INDIVIDUALS 1 1 -0.149726 0.003468 -43.17
INDIVIDUALS 2 0 0 . .
EXPORTS 1 1 0.073348 0.007377 9.94
EXPORTS 2 1 0.138000 0.010933 12.62
EXPORTS 3 1 0.177615 0.013409 13.25
EXPORTS 4 1 0.214486 0.016534 12.97
EXPORTS 5 1 0.166936 0.016124 10.35
EXPORTS 6 1 0.178288 0.016120 11.06
EXPORTS 7 1 0.293766 0.017115 17.16
EXPORTS 8 1 0.320932 0.021770 14.74
EXPORTS 9 0 0 . .
FULLTIME 1 1 0.179631 0.003769 47.67
FULLTIME 2 0 0 . .
PARTTIME 1 1 -0.020924 0.003068 -6.82
PARTTIME 2 0 0 . .
LEASED 1 1 0.334245 0.011579 28.87
LEASED 2 0 0 . .
CONTRACTORS 1 1 0.246384 0.002523 97.67
CONTRACTORS 2 0 0 . .
HEALTHINS 1 1 0.078051 0.004257 18.33
HEALTHINS 2 0 0 . .
RETIREMENT 1 1 0.170154 0.004144 41.06
RETIREMENT 2 0 0 . .
PROFITSHARE 1 1 -0.020966 0.006971 -3.01
PROFITSHARE 2 0 0 . .
HOLIDAYS 1 1 0.120158 0.004618 26.02
HOLIDAYS 2 0 0 . .
BENENABV 1 1 -0.074976 0.005111 -14.67
BENENABV 2 0 0 . .
WEBSITE 1 1 0.019704 0.002904 6.78
WEBSITE 2 0 0 . .
ECOMMPCT 1 1 -0.033776 0.009894 -3.41
ECOMMPCT 2 1 -0.076199 0.010547 -7.22
ECOMMPCT 3 1 -0.065350 0.013496 -4.84
ECOMMPCT 4 1 -0.107320 0.012424 -8.64
ECOMMPCT 5 1 -0.053331 0.012200 -4.37
ECOMMPCT 6 1 -0.060335 0.010540 -5.72
ECOMMPCT 7 1 -0.093930 0.013852 -6.78
ECOMMPCT 8 1 -0.158731 0.015935 -9.96
ECOMMPCT 9 0 0 . .
ONLINEPURCH 1 1 0.003940 0.002547 1.55
ONLINEPURCH 2 0 0 . .
LT40HOURS 1 1 -0.060922 0.004991 -12.21
LT40HOURS 2 0 0 . .
LT12MONTHS 1 1 -0.050281 0.004365 -11.52
LT12MONTHS 2 0 0 . .
SEASONAL 1 1 -0.038413 0.005919 -6.49
SEASONAL 2 0 0 . .
OCCASIONALLY 1 1 -0.102215 0.005775 -17.70
OCCASIONALLY 2 0 0 . .
ACTIVITYNABV 1 1 0.189169 0.005265 35.93
ACTIVITYNABV 2 0 0 . .
OPERATING 1 1 0.247984 0.003370 73.58
OPERATING 2 0 0 . .
CEASENR 1 1 0.062078 0.022437 2.77
CEASENR 2 0 0 . .
HUSBWIFE 1 1 -0.064987 0.004525 -14.36
HUSBWIFE 2 1 -0.068549 0.004097 -16.73
HUSBWIFE 3 1 -0.126576 0.006033 -20.98
HUSBWIFE 4 0 0 . .
NUMOWNERS 1 1 0.300708 0.014590 20.61
NUMOWNERS 2 1 0.419767 0.015134 27.74
NUMOWNERS 3 1 0.486225 0.016239 29.94
NUMOWNERS 4 1 0.489487 0.017344 28.22
NUMOWNERS 5 1 0.505033 0.017709 28.52
NUMOWNERS 6 1 0.444110 0.023362 19.01
NUMOWNERS 7 1 0.051559 0.046352 1.11
NUMOWNERS 8 0 0 . .
race1noblanks A 1 -0.020036 0.005832 -3.44
race1noblanks B 1 -0.185359 0.006773 -27.37
race1noblanks H 1 -0.067546 0.005681 -11.89
race1noblanks I 1 -0.072423 0.014317 -5.06
race1noblanks Mixed 1 0.004210 0.029596 0.14
race1noblanks P 1 -0.131819 0.039328 -3.35
race1noblanks S 1 0.002974 0.026112 0.11
race1noblanks W 0 0 . .
LANGUAGE Only English 1 0.216239 0.017187 12.58
LANGUAGE Only English and Other 1 0.143442 0.017956 7.99
LANGUAGE Only English and Spanish 1 0.219765 0.017202 12.78
LANGUAGE Only Other Language 1 0.007221 0.023920 0.30
LANGUAGE Only Spanish 0 0 . .
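
Because the response in this appendix is log(Receipts), a coefficient is easier to read after converting it to an approximate percentage change in receipts. The Python sketch below assumes a natural-log response and compares each level with its reference level (the level with a zero estimate), holding the other effects fixed. The three estimates are copied from the table above; the function name is illustrative.

```python
import math


def percent_change(coefficient):
    """Approximate percent change in the response implied by a coefficient
    on the natural-log scale, relative to the reference level."""
    return (math.exp(coefficient) - 1.0) * 100.0


# Estimates copied from the Parameter Estimates table in this appendix.
estimates = {
    "SEX1 F": -0.152693,
    "N07_EMPLOYER E": 1.137851,
    "HOMEBASED 1": -0.193209,
}

for label, coefficient in estimates.items():
    print(f"{label}: {percent_change(coefficient):+.1f}%")
```

Under these assumptions, the SEX1 F estimate corresponds to receipts roughly 14 percent lower than the reference level M, other effects held fixed.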



Appendix B: Regression of All Available Variables

The SAS System

The SURVEYREG Procedure

Regression Analysis for Dependent Variable RECEIPTS_NOISY
Data Summary
Number of Observations 874182
Sum of Weights 10068531
Weighted Mean of RECEIPTS_NOISY 251.05405
Weighted Sum of RECEIPTS_NOISY 2527745467

Fit Statistics
R-square 0.5771
Root MSE 446.73
Denominator DF 874181
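
The figures in the data summary can be checked against one another. The short Python sketch below assumes only that the weighted mean is the weighted sum divided by the sum of weights, and notes that the denominator degrees of freedom equal the number of observations minus one, which is consistent with a design that declares no strata or clusters. The variable names are illustrative.

```python
# Values copied from the Data Summary and Fit Statistics above.
number_of_observations = 874_182
sum_of_weights = 10_068_531
weighted_sum_receipts = 2_527_745_467

# Weighted mean = weighted sum / sum of weights.
weighted_mean = weighted_sum_receipts / sum_of_weights

# Observations minus one, matching the reported denominator DF.
denominator_df = number_of_observations - 1

print(f"Weighted mean of RECEIPTS_NOISY: {weighted_mean:.5f}")
print(f"Denominator DF: {denominator_df}")
```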

Class Level Information
Class Variable Levels Values
FIPST 43 01 04 05 06 08 09 12 13 15 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35 36 37 39 40 41 42 45 47 48 49 51
53 54 55
SECTOR 20 11 21 22 23 31 42 44 48 51 52 53 54 55 56 61 62 71 72 81 99
N07_EMPLOYER 2 E N
SEX1 2 F M
VET1 2 1 2
FOUNDED1 2 1 2
PURCHASED1 2 1 2
INHERITED1 2 1 2
RECEIVED1 2 1 2
ACQUIRENR1 1 2
ACQYR1 8 1 2 3 4 5 6 7 8
PROVIDE1 2 1 2
MANAGE1 2 1 2
FINANCIAL1 2 1 2
FNCTNABV1 2 1 2
FNCTNR1 1 2
HOURS1 6 1 2 3 4 5 6
PRMINC1 2 1 2
SELFEMP1 2 1 2
EDUC1 7 1 2 3 4 5 6 7
AGE1 6 1 2 3 4 5 6
BORNUS1 2 1 2
DISVET1 3 1 2 3
ESTABLISHED 10 1 2 3 4 5 6 7 8 9 A
SCSAVINGS 2 1 2
SCASSETS 2 1 2
SCEQUITY 2 1 2
SCCREDIT 2 1 2
SCGOVTLOAN 2 1 2
SCGOVTGUAR 2 1 2
SCBANKLOAN 2 1 2
SCFAMLOAN 2 1 2
SCVENTURE 2 1 2
SCGRANT 2 1 2
SCOTHER 2 1 2
SCDONTKNOW 2 1 2
SCNONENEEDED 2 1 2
SCNOTREPORTED 1 2
SCAMOUNT 10 1 2 3 4 5 6 7 8 9 A
HOMEBASED 2 1 2
FRANCHISE 2 1 2
FRANCHISER50 2 1 2
ECSAVINGS 2 1 2
ECASSETS 2 1 2
ECEQUITY 2 1 2
ECCREDIT 2 1 2
ECGOVTLOAN 2 1 2
ECGOVTGUAR 2 1 2
ECBANKLOAN 2 1 2
ECFAMLOAN 2 1 2
ECVENTURE 2 1 2
ECPROFITS 2 1 2
ECGRANT 2 1 2
ECOTHER 2 1 2
ECDONTKNOW 2 1 2
ECNOACCESS 2 1 2
ECNOEXPAND 2 1 2
ECNOTREPORTED 1 2
FEDERAL 2 1 2
STATELOCAL 2 1 2
OTHERBUS 2 1 2
INDIVIDUALS 2 1 2
CUSTNR 1 2
EXPORTS 9 1 2 3 4 5 6 7 8 9
OPSOUTSIDE 2 1 2
OUTSOURCE 2 1 2
FULLTIME 2 1 2
PARTTIME 2 1 2
DAYLABOR 2 1 2
TEMPSTAFF 2 1 2
LEASED 2 1 2
CONTRACTORS 2 1 2
EMPNR 1 2
HEALTHINS 2 1 2
RETIREMENT 2 1 2
PROFITSHARE 2 1 2
HOLIDAYS 2 1 2
BENENABV 2 1 2
BENENR 1 2
WEBSITE 2 1 2
ECOMMERCE 2 1 2
ECOMMPCT 9 1 2 3 4 5 6 7 8 9
ONLINEPURCH 2 1 2
LT40HOURS 2 1 2
LT12MONTHS 2 1 2
SEASONAL 2 1 2
OCCASIONALLY 2 1 2
ACTIVITYNABV 2 1 2
ACTIVITYNR 1 2
OPERATING 2 1 2
CEASENR 2 1 2
CEASENA 2 1 2
HUSBWIFE 4 1 2 3 4
FAMILYBUS 2 1 2
NUMOWNERS 8 1 2 3 4 5 6 7 8
race1noblanks 8 A B H I Mixed P S W
LANGUAGE 5 Only English Only English and Other Only English and
Spanish Only Other Language Only Spanish
region 9 East Sout Mid-Atlan Midwest Mountain Northeast Pacific
South Atl West Nort West Sout

Tests of Model Effects
Effect Num DF F Value Pr > F
Model 215 1538.56 <.0001
Intercept 1 29.25 <.0001
EMPLOYMENT_NOISY 1 309.80 <.0001
PAYROLL_NOISY 1 265.90 <.0001
PCT1 1 19.81 <.0001
FIPST 34 9.88 <.0001
SECTOR 19 1069.57 <.0001
N07_EMPLOYER 1 3093.66 <.0001
SEX1 1 246.76 <.0001
VET1 1 2.04 0.1535
FOUNDED1 1 0.50 0.4798
PURCHASED1 1 0.28 0.5989
INHERITED1 1 0.01 0.9314
RECEIVED1 1 0.12 0.7329
ACQUIRENR1 0 . .
ACQYR1 7 19.91 <.0001
PROVIDE1 1 1530.50 <.0001
MANAGE1 1 129.53 <.0001
FINANCIAL1 1 103.32 <.0001
FNCTNABV1 1 563.30 <.0001
FNCTNR1 0 . .
HOURS1 5 32.80 <.0001
PRMINC1 1 592.68 <.0001
SELFEMP1 1 6.46 0.0110
EDUC1 6 64.21 <.0001
AGE1 5 7.11 <.0001
BORNUS1 1 91.23 <.0001
DISVET1 2 3.52 0.0297
ESTABLISHED 9 18.27 <.0001
SCSAVINGS 1 26.69 <.0001
SCASSETS 1 5.12 0.0237
SCEQUITY 1 25.51 <.0001
SCCREDIT 1 26.19 <.0001
SCGOVTLOAN 1 7.28 0.0070
SCGOVTGUAR 1 11.62 0.0007
SCBANKLOAN 1 1.90 0.1681
SCFAMLOAN 1 6.95 0.0084
SCVENTURE 1 0.71 0.3984
SCGRANT 1 4.05 0.0441
SCOTHER 1 0.01 0.9415
SCDONTKNOW 1 34.80 <.0001
SCNONENEEDED 0 . .
SCNOTREPORTED 0 . .
SCAMOUNT 8 238.89 <.0001
HOMEBASED 1 1212.08 <.0001
FRANCHISE 1 9.31 0.0023
FRANCHISER50 1 5.69 0.0170
ECSAVINGS 1 102.66 <.0001
ECASSETS 1 13.38 0.0003
ECEQUITY 1 0.04 0.8475
ECCREDIT 1 477.03 <.0001
ECGOVTLOAN 1 3.27 0.0707
ECGOVTGUAR 1 4.01 0.0452
ECBANKLOAN 1 513.45 <.0001
ECFAMLOAN 1 1.30 0.2542
ECVENTURE 1 19.82 <.0001
ECPROFITS 1 119.75 <.0001
ECGRANT 1 9.33 0.0023
ECOTHER 1 5.79 0.0161
ECDONTKNOW 1 31.56 <.0001
ECNOACCESS 1 54.77 <.0001
ECNOEXPAND 1 59.10 <.0001
ECNOTREPORTED 0 . .
FEDERAL 1 15.44 <.0001
STATELOCAL 1 68.88 <.0001
OTHERBUS 1 1.57 0.2104
INDIVIDUALS 1 394.79 <.0001
CUSTNR 0 . .
EXPORTS 8 61.77 <.0001
OPSOUTSIDE 1 20.45 <.0001
OUTSOURCE 1 0.45 0.5014
FULLTIME 1 397.22 <.0001
PARTTIME 1 209.27 <.0001
DAYLABOR 1 8.02 0.0046
TEMPSTAFF 1 251.19 <.0001
LEASED 1 70.43 <.0001
CONTRACTORS 1 1279.91 <.0001
EMPNR 0 . .
HEALTHINS 1 1027.62 <.0001
RETIREMENT 1 658.70 <.0001
PROFITSHARE 1 153.11 <.0001
HOLIDAYS 1 963.05 <.0001
BENENABV 1 650.68 <.0001
BENENR 0 . .
WEBSITE 1 251.96 <.0001
ECOMMERCE 0 . .
ECOMMPCT 7 14.86 <.0001
ONLINEPURCH 1 5.27 0.0217
LT40HOURS 1 301.67 <.0001
LT12MONTHS 1 39.13 <.0001
SEASONAL 1 101.90 <.0001
OCCASIONALLY 1 1056.52 <.0001
ACTIVITYNABV 1 720.66 <.0001
ACTIVITYNR 0 . .
OPERATING 0 . .
CEASENR 1 10.48 0.0012
CEASENA 0 . .
HUSBWIFE 3 121.35 <.0001
FAMILYBUS 1 3.80 0.0512
NUMOWNERS 7 63.77 <.0001
race1noblanks 7 7.65 <.0001
LANGUAGE 4 10.80 <.0001

Note: The denominator degrees of freedom for the F tests is 874181.



Appendix C: Full Tables of Capital Regression
Regressing Startup Capital Variables Against Receipts in California

Parameter Estimate Standard Error t Value Pr > |t|

Intercept 102.007317 21.726311 4.70 <.0001
SCSAVINGS 1 11.703091 21.591132 0.54 0.5878
SCSAVINGS 2 0.000000 0.000000 . .
SCASSETS 1 66.975345 30.587008 2.19 0.0286
SCASSETS 2 0.000000 0.000000 . .
SCEQUITY 1 71.580606 25.060914 2.86 0.0043
SCEQUITY 2 0.000000 0.000000 . .
SCCREDIT 1 -53.875284 14.852958 -3.63 0.0003
SCCREDIT 2 0.000000 0.000000 . .
SCGOVTLOAN 1 155.736475 138.276227 1.13 0.2601
SCGOVTLOAN 2 0.000000 0.000000 . .
SCGOVTGUAR 1 128.860018 116.790788 1.10 0.2699
SCGOVTGUAR 2 0.000000 0.000000 . .
SCBANKLOAN 1 247.364500 46.176948 5.36 <.0001
SCBANKLOAN 2 0.000000 0.000000 . .
SCFAMLOAN 1 36.520896 43.002700 0.85 0.3958
SCFAMLOAN 2 0.000000 0.000000 . .
SCVENTURE 1 68.869045 144.952490 0.48 0.6347
SCVENTURE 2 0.000000 0.000000 . .
SCGRANT 1 -78.634338 22.323468 -3.52 0.0004
SCGRANT 2 0.000000 0.000000 . .
SCOTHER 1 203.096510 82.066815 2.47 0.0133
SCOTHER 2 0.000000 0.000000 . .
SCNOTREPORTED 2 0.000000 0.000000 . .
SCNONENEEDED 1 -53.489854 22.523561 -2.37 0.0176
SCNONENEEDED 2 0.000000 0.000000 . .
SCDONTKNOW 1 150.634193 57.878529 2.60 0.0093
SCDONTKNOW 2 0.000000 0.000000 . .
Regressing Startup Capital Variables Against Receipts in the United States

Parameter Estimate Standard Error t Value Pr > |t|

Intercept 170.17753 8.2478523 20.63 <.0001
SCSAVINGS 1 -15.78058 8.1811433 -1.93 0.0537
SCSAVINGS 2 0.00000 0.0000000 . .
SCASSETS 1 48.84989 9.4472158 5.17 <.0001
SCASSETS 2 0.00000 0.0000000 . .
SCEQUITY 1 49.36879 9.5195614 5.19 <.0001
SCEQUITY 2 0.00000 0.0000000 . .
SCCREDIT 1 -90.21357 5.5115225 -16.37 <.0001
SCCREDIT 2 0.00000 0.0000000 . .
SCGOVTLOAN 1 91.92590 36.2985579 2.53 0.0113
SCGOVTLOAN 2 0.00000 0.0000000 . .
SCGOVTGUAR 1 195.60483 35.4273235 5.52 <.0001
SCGOVTGUAR 2 0.00000 0.0000000 . .
SCBANKLOAN 1 305.75374 12.0092423 25.46 <.0001
SCBANKLOAN 2 0.00000 0.0000000 . .
SCFAMLOAN 1 180.68722 20.0165018 9.03 <.0001
SCFAMLOAN 2 0.00000 0.0000000 . .
SCVENTURE 1 364.60586 67.9164595 5.37 <.0001
SCVENTURE 2 0.00000 0.0000000 . .
SCGRANT 1 -103.56030 42.2222199 -2.45 0.0142
SCGRANT 2 0.00000 0.0000000 . .
SCOTHER 1 170.51641 22.3585041 7.63 <.0001
SCOTHER 2 0.00000 0.0000000 . .
SCNOTREPORTED 2 0.00000 0.0000000 . .
SCNONENEEDED 1 -111.12210 8.4799463 -13.10 <.0001
SCNONENEEDED 2 0.00000 0.0000000 . .
SCDONTKNOW 1 183.78634 18.7281721 9.81 <.0001
SCDONTKNOW 2 0.00000 0.0000000 . .
Regressing Expansion Capital Variables Against Receipts in California

Parameter Estimate Standard Error t Value Pr > |t|

Intercept 110.137009 34.177799 3.22 0.0013
ECSAVINGS 1 -33.747432 32.519405 -1.04 0.2994
ECSAVINGS 2 0.000000 0.000000 . .
ECASSETS 1 21.898471 39.924959 0.55 0.5834
ECASSETS 2 0.000000 0.000000 . .
ECEQUITY 1 135.962566 43.448548 3.13 0.0018
ECEQUITY 2 0.000000 0.000000 . .
ECCREDIT 1 -9.457578 25.125694 -0.38 0.7066
ECCREDIT 2 0.000000 0.000000 . .
ECGOVTLOAN 1 419.097302 218.040548 1.92 0.0546
ECGOVTLOAN 2 0.000000 0.000000 . .
ECGOVTGUAR 1 63.473356 199.360658 0.32 0.7502
ECGOVTGUAR 2 0.000000 0.000000 . .
ECBANKLOAN 1 461.091917 74.804907 6.16 <.0001
ECBANKLOAN 2 0.000000 0.000000 . .
ECFAMLOAN 1 69.175737 56.384978 1.23 0.2199
ECFAMLOAN 2 0.000000 0.000000 . .
ECVENTURE 1 514.953434 505.667496 1.02 0.3085
ECVENTURE 2 0.000000 0.000000 . .
ECPROFITS 1 124.956905 41.547571 3.01 0.0026
ECPROFITS 2 0.000000 0.000000 . .
ECGRANT 1 -98.081367 30.526976 -3.21 0.0013
ECGRANT 2 0.000000 0.000000 . .
ECOTHER 1 127.441581 125.654170 1.01 0.3105
ECOTHER 2 0.000000 0.000000 . .
ECDONTKNOW 1 -9.436620 39.857040 -0.24 0.8128
ECDONTKNOW 2 0.000000 0.000000 . .
ECNOACCESS 1 -68.322663 36.561903 -1.87 0.0617
ECNOACCESS 2 0.000000 0.000000 . .
ECNOEXPAND 1 -26.991879 34.551469 -0.78 0.4347
ECNOEXPAND 2 0.000000 0.000000 . .
ECNOTREPORTED 2 0.000000 0.000000 . .
Regressing Expansion Capital Variables Against Receipts in the United States

Parameter Estimate Standard Error t Value Pr > |t|

Intercept 231.62717 10.454859 22.15 <.0001
ECSAVINGS 1 -112.84097 10.041882 -11.24 <.0001
ECSAVINGS 2 0.00000 0.000000 . .
ECASSETS 1 5.40888 11.418978 0.47 0.6357
ECASSETS 2 0.00000 0.000000 . .
ECEQUITY 1 90.04280 12.687683 7.10 <.0001
ECEQUITY 2 0.00000 0.000000 . .
ECCREDIT 1 -88.74814 7.067847 -12.56 <.0001
ECCREDIT 2 0.00000 0.000000 . .
ECGOVTLOAN 1 99.43388 56.715749 1.75 0.0796
ECGOVTLOAN 2 0.00000 0.000000 . .
ECGOVTGUAR 1 167.61845 68.795704 2.44 0.0148
ECGOVTGUAR 2 0.00000 0.000000 . .
ECBANKLOAN 1 446.34946 16.833708 26.52 <.0001
ECBANKLOAN 2 0.00000 0.000000 . .
ECFAMLOAN 1 113.62459 31.340437 3.63 0.0003
ECFAMLOAN 2 0.00000 0.000000 . .
ECVENTURE 1 521.87107 123.773572 4.22 <.0001
ECVENTURE 2 0.00000 0.000000 . .
ECPROFITS 1 162.90603 13.425908 12.13 <.0001
ECPROFITS 2 0.00000 0.000000 . .
ECGRANT 1 -131.52301 99.407582 -1.32 0.1858
ECGRANT 2 0.00000 0.000000 . .
ECOTHER 1 132.62127 35.661671 3.72 0.0002
ECOTHER 2 0.00000 0.000000 . .
ECDONTKNOW 1 -43.75725 13.986171 -3.13 0.0018
ECDONTKNOW 2 0.00000 0.000000 . .
ECNOACCESS 1 -161.82463 12.476000 -12.97 <.0001
ECNOACCESS 2 0.00000 0.000000 . .
ECNOEXPAND 1 -108.06462 10.661485 -10.14 <.0001
ECNOEXPAND 2 0.00000 0.000000 . .
ECNOTREPORTED 2 0.00000 0.000000 . .
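
The t Value and Pr > |t| columns in these tables follow from the Estimate and Standard Error columns. The Python sketch below recomputes them for three rows. It assumes a normal approximation to the t distribution, which is reasonable here only because the samples are very large; the function name is illustrative, and the estimates and standard errors are copied from the tables in this appendix.

```python
import math


def t_and_p(estimate, std_error):
    """t statistic and two-sided p-value under a normal approximation."""
    t_value = estimate / std_error
    # erfc(|t| / sqrt(2)) equals the two-sided tail area of a standard normal.
    p_value = math.erfc(abs(t_value) / math.sqrt(2.0))
    return t_value, p_value


# (estimate, standard error) pairs copied from the tables in this appendix.
rows = {
    "SCSAVINGS 1 (California)": (11.703091, 21.591132),
    "SCASSETS 1 (California)": (66.975345, 30.587008),
    "ECGOVTGUAR 1 (United States)": (167.61845, 68.795704),
}

for label, (estimate, std_error) in rows.items():
    t_value, p_value = t_and_p(estimate, std_error)
    print(f"{label}: t = {t_value:.2f}, p = {p_value:.4f}")
```

The same estimate can be significant nationally and not in California simply because the California standard errors are several times larger, so the two columns are best read together.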
