Supervisor: Professor Amber Tomas
STAT 4993 Independent Study
5/9/2013
Table of Contents
Introduction
Overview of the Survey of Business Owners Public Microdata Sample
  Primary Methodology
  Missing Data
  Nonresponse
  Inherent Differences between SBO and PUMS Information
Data Manipulation and Cleaning
  Subsetting to Small Businesses and One Owner Data
  Tabulation Weights
Fitting a Regression
  Regression One: All Variables Against Receipts
  Model Selection Techniques
  Regression Two: PROC GLMSELECT with the Schwarz Bayesian Criterion
  Regression Three: PROC GLMSELECT with Akaike's Information Criterion
  Regression Four: PROC GLMSELECT with AIC: Receipts with Logarithm Transform
  Regression Five: PROC GLMSELECT with AIC: Modified Binary Payroll and Employment with Other Character Variables against the Logarithm of Receipts
Analysis of Language
  Research Question 1a: Does the language spoken in transactions produce a difference in correlated receipts?
  Research Question 1b: Out of only existing Only English and Spanish businesses, which are the most popular industries for business?
  Visualizing the Percentage of Hispanic/Spanish-Speaking Businesses in the United States
Analysis of Capital Sources in California versus the United States
  Dataset Overview
  Research Question 2a: What sources of capital have the most positive relationship with receipts?
  Research Question 2b: How does the spread of receipts for certain capital sources compare between businesses in California and the general United States?
Conclusion
Lessons Learned
Appendix A: Full Output of Regression with log(Receipts) and Modified Payroll and Employment
Appendix B: Regression of All Available Variables
Appendix C: Full Tables of Capital Regression
Introduction
In selecting my topic for independent study, I wanted to combine the skills I had learned in statistical computing software with my main passion and primary fields of study: business and entrepreneurship. As a result, I originally intended to build a predictive model for the success of new or small business ventures. According to the U.S. Small Business Administration (SBA), small businesses represent 99.7 percent of all employer firms. Since 1995, small businesses have generated 64 percent of new jobs and paid 44 percent of the total United States private payroll, according to the SBA. However, I quickly realized that there was a dearth of accurate, thorough, and easily accessible data with a large enough sample size to satisfy the normality assumption across all sectors and states. Moreover, attempting to answer this question would require significant longitudinal data at the individual business level. Ultimately, I relied on the U.S. Census Bureau's Survey of Business Owners (SBO) Public Use Microdata Sample (PUMS), which I use to examine entrepreneurial activity and the relationships between business characteristics such as access to capital, firm size, employer-paid benefits, minority ownership, and firm age. In this report, I detail how I cleaned the 2007 SBO PUMS data, developed a regression model, and carried out more in-depth analyses of the relationships between specific variables.
Overview of the Survey of Business Owners Public Microdata Sample
Primary Methodology
The 2007 Survey of Business Owners (SBO) questionnaire, Form SBO-1, was mailed to a random sample of 2.3 million businesses selected from a list of 27 million firms operating during 2007 with receipts of $1,000 or more. The list of all firms (the sampling universe) was derived from both official business tax returns and data collected on other economic census reports.
The Census Bureau obtained electronic files from the Internal Revenue Service (IRS) for all companies reporting any business activity on 2007 IRS tax forms such as Forms 1040 and 1065. The SBO is part of the Economic Census program, which the Census Bureau is required by law to conduct every 5 years, for years ending in "2" and "7." The Census Bureau combines and crosschecks data from the SBO with data from other economic surveys, economic censuses, and administrative records. The published data include number of firms (both firms with paid employees and firms with no paid employees), sales and receipts, number of paid employees, and annual payroll; they are presented by kind of business, geographic area, and size of firm (employment and receipts). The results also contain summary statistics on the composition of businesses in the United States by gender, ethnicity, race, and veteran status. Additional demographic and economic characteristics of business owners and their businesses are included, such as: owner's age, education level, hours worked, and primary function in the business; family- and home-based businesses; types of customers and workers; sources of financing for start-up, expansion, or capital improvements; outsourcing; use of Internet and e-commerce; and employer-paid benefits. The IRS provided certain identification, classification, and measurement data for businesses filing those forms. For most firms with paid employees, the Census Bureau also collected employment, payroll, receipts, and kind of business for each plant, store, or physical location during the 2007 Economic Census. For the 2007 SBO, firms could either report electronically by using Census Taker, the Census Bureau's secure online interactive application, or return their completed form by mail. Three report form re-mails to employer firms and two report form re-mails to nonemployer firms were sent at one-month intervals to all delinquent respondents. The returned forms underwent extensive review and computer processing. All reports were geographically coded, data-keyed, and edited. This wealth of data is a resource for many parties, from government officials to leaders of industry organizations. For example, the data allow agencies such as the Small Business Administration to identify and address the needs of small businesses in the United States. In the private sector, consultants and researchers can analyze long-term economic and demographic shifts, as well as differences in ownership and performance among geographic areas.
Survey Overview: Form SBO-1, given to every sampled business, primarily asked for basic information about the business in general while focusing on the demographics and level of ownership of each listed owner. There are only 9 numeric variables (tabulation weight, total revenues with injected noise, payroll with injected noise, employment with injected noise, and general ownership percentages for up to four owners), while the other hundreds are character variables (age, education, startup capital type, race, ethnicity, etc.). These character variables usually take the form of a binary yes/no answer, except for variables with multiple levels (education, age, and race).
General Owner and Business Characteristics:
  Sector
  Employer status
  Random group (for variance estimation)
  Tabulation weight
  Measures of size (noise-infused for disclosure avoidance): employment, payroll, receipts
  Individual owner information (for up to four owners): percentage ownership, gender, ethnicity, race, veteran status

Additional Owner Characteristics:
  How the owner initially acquired the business
  When the owner acquired the business
  Owner's primary function in the business
  Owner's average number of hours per week spent working in the business
  Whether the business provided the owner's primary source of personal income
  Whether the owner previously owned a business or had been self-employed
  Owner's educational background
  Owner's age
  Whether the owner was born in the United States
  If the owner was a veteran, whether the owner was disabled as the result of injury incurred during active military service

Additional Business Characteristics:
  Year business was established
  Source(s) of start-up or acquisition capital
  Amount of start-up or acquisition capital
  Home-based business
  Operated as a franchise
  Owned by a franchise
  Source(s) of capital used to expand business
  Types of customers
  Percent of total sales exported
  Operations established outside the United States
  Outsourced any business function outside the United States
  Language(s) used in transactions
  Types of workers employed
  Employer-paid benefits offered
  Whether the company had a website
  Whether the company had e-commerce sales
  E-commerce as a percentage of total sales
  Whether the company made online purchases
  Business activity (e.g., seasonal or part-time)
  Whether the business currently operates
  Reasons for ceasing operations
  Joint ownership by husband and wife
  Family-owned business
  Number of owners

Missing Data
For the numeric variables of the cleaned data set (receipts, payroll, employment, and percent of ownership), there were no missing values. However, for most character variables, the percentage of missing data ranged from 20% to 40%. There were also several variables in which over
90% of the observations had missing values, such as whether the owner is retired, whether the owner is deceased, or whether the business had low or inadequate sales.
Nonresponse
Approximately 62 percent of the 2.3 million businesses in the SBO sample responded to the survey, compared to 75 percent for the 2002 survey. For the 2007 survey, 72 percent of the companies in the SBO sample returned a questionnaire, but 10 percent of the returns did not contain enough information to be considered a response for the estimates by race, gender, ethnicity, or veteran status. Many of these respondents were sole proprietors who answered "No" to Item 8, "In 2007, did any individual own 10% or more of the rights, claims, interests, or stock in this business?" Another identified issue was the duality between ethnicity (Hispanic vs. non-Hispanic) and race (White, Black, Asian, American Indian). Every Hispanic business owner also had to identify at least one race, which may suggest mixed race when an owner is solely Hispanic in heritage. This led me to recode these variables, as described under Data Manipulation and Cleaning. According to the U.S. Census Bureau, about 4 percent of the 2007 nonrespondents were selected for and responded to the 2002 SBO. For these firms, data from the 2002 survey were used in place of the missing 2007 responses. For the remaining nonrespondents, gender, ethnicity, race, and veteran status were imputed from donor respondents in the same sampling frame with similar characteristics (state, industry, employment status, size). Because the assignment of businesses to sampling frames relies heavily on administrative data, and there is a high level of agreement between sampling frame assignment and tabulated race or ethnicity for responding firms, the donor imputations are considered to be reliable. Estimates of sampling variability are adjusted to account for nonresponse.
Estimates with high error (relative standard error for sales or receipts of 50 percent or more) are suppressed. Overall, imputed data accounted for approximately 47 percent of the firm count estimates by gender, ethnicity, race, and veteran status and approximately 20 percent of the estimates of sales.
Inherent Differences between SBO and PUMS Information
The Public Use Microdata Sample (PUMS) is a large, publicly available dataset derived from the original SBO responses. According to the U.S. Census Bureau, measures were taken in constructing the PUMS file to protect the confidentiality of the SBO data so that it could be used freely by the public. In the PUMS file, each record corresponds to a business, but deliberate measures were taken to ensure the anonymity of each business. For businesses operating in multiple states and/or industry sectors, one record exists for each state and sector combination in which the firm conducts business. Identifiers to link the component records of a business are not included. Additionally, businesses classified in the SBO as publicly owned or not classifiable by gender, ethnicity, race, or veteran status are not included in the PUMS file because many publicly owned firms are easily identifiable. Since the primary focus of my research is on small businesses, the exclusion of possibly larger public corporations does not significantly affect the integrity of the data. Finally, the U.S. Census Bureau infused noise into the PUMS data for disclosure avoidance and confidentiality protection. Values are perturbed prior to tabulation by applying a random noise multiplier to the magnitude data, such as the sales and receipts for all firms. This perturbation changed data points by no more than a few percentage points.
Data Manipulation and Cleaning
Subsetting to Small Businesses and One Owner Data
According to the Small Business Administration, small businesses are generally considered to have fewer than 500 employees and under $7 million in receipts, depending on the industry. After subsetting the original dataset according to these parameters, the total observation count decreased from 2,165,680 to 2,025,530. Furthermore, according to the 2007 SBO PUMS Guide, a response of 0 for most of the qualitative questions indicated that the data for these variables were missing. As a result, each 0 was converted to a missing value to more accurately reflect the nature of the data and to correctly set up the regression procedures later on. Another major issue for data analysis was the inclusion of data for up to three additional business owners. As a result, there exist three additional sets of demographic variables. However, if a business has only one owner, these three additional sets of variables all have missing values. Given that PROC REG and other regression procedures exclude observations that have even one missing value, keeping the sixty extra variables for additional owners would exclude a significant number of observations. Moreover, I noticed that virtually every business had responses to all variables describing the first owner, and that the first owner almost always owned as much of the business as (if not more than) his or her one to three business partners. This justified my decision to remove all variables affiliated with the second, third, and fourth owners. One recommendation for future work would be to keep these deleted variables and find other ways to analyze the overall dataset despite the additional missing values.
Observing businesses with the intent to analyze the relationship between multiple business owners (while also factoring in age, experience, and education) could be incredibly valuable for the study of organizational behavior and business. Additional deleted variables included those with a missing rate over 90%, since they severely diminished the number of observations used in the regression. For the purposes of my research questions, having more observations to interpret is preferable, but including these variables in a separate analysis of businesses that have ceased activity would also be worthwhile. These variables include:
  CEASEOTHER: Ceased operations for another reason
  SOLDBUS: Sold this business
  STARTANOTHER: Started another business
  NOPERSCRED: Lack of personal loans/credit
  NOBUSCRED: Lack of business loans/credit
  LOWSALES: Inadequate cash flow or low sales
  ONETIME: Operated for one-time event
  DECEASED: Owner died
  RETIRE: Owner retired
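The two cleaning steps described above (recoding 0 as missing for qualitative variables, then dropping variables whose missing rate exceeds 90%) can be illustrated with a minimal sketch. The analysis itself was done in SAS; this Python version is only an illustration, and the three toy records, the function names, and the choice to represent a missing value as None are all assumptions made for the example.

```python
def zeros_to_missing(rows, qualitative_cols):
    """Return copies of the rows with 0 recoded as None in the qualitative columns."""
    return [
        {k: (None if k in qualitative_cols and v == 0 else v) for k, v in row.items()}
        for row in rows
    ]


def missing_rate(rows, col):
    """Share of rows in which the column is missing (None)."""
    return sum(row[col] is None for row in rows) / len(rows)


def drop_sparse_columns(rows, threshold=0.90):
    """Keep only the columns whose missing rate does not exceed the threshold."""
    keep = [c for c in rows[0] if missing_rate(rows, c) <= threshold]
    return [{c: row[c] for c in keep} for row in rows]


# Hypothetical toy records; real PUMS records have many more fields.
raw = [
    {"RECEIPTS_NOISY": 120, "HEALTHINS": 1, "RETIRE": 0},
    {"RECEIPTS_NOISY": 0, "HEALTHINS": 0, "RETIRE": 0},
    {"RECEIPTS_NOISY": 45, "HEALTHINS": 2, "RETIRE": 0},
]

# Numeric columns such as RECEIPTS_NOISY are left alone: a 0 there is a real value.
cleaned = drop_sparse_columns(zeros_to_missing(raw, {"HEALTHINS", "RETIRE"}))
```

In this toy input RETIRE is missing for every record, so it is dropped, while HEALTHINS is kept with one missing value.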
Lastly, I consolidated all language variables and the race and ethnicity variables in order to improve overall adjusted fit and reduce noise. After analyzing the distribution of the many language variables in the dataset (English, Arabic, Chinese, French, German, Greek, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Tagalog, and Vietnamese), I found that the languages most frequently spoken during business transactions are, unsurprisingly, English and Spanish. The other languages collectively account for no more than 3% of the total number of observations, so I created a new language variable with five levels: Only English, Only Spanish, Only English and Spanish, Only English and Other, and Only Other Language. This consolidated sixteen binary variables into one variable with five levels. Additionally, the race variable had at least twenty levels due to combinations of different races, which encouraged consolidation. Race and ethnicity were combined to create a new variable, which followed this algorithm:
1. If ethnicity is Hispanic, then Race/Ethnicity is Hispanic.
2. If ethnicity is not Hispanic and the owner is White AND another minority, then the owner is considered part of the other minority.
3. If ethnicity is not Hispanic and the owner is only White or only another minority, then the race stays as listed.
4. If ethnicity is not Hispanic and the owner is at least two types of race (both of which are not White), then the owner is considered Mixed.
As a result, the final levels for the new race/ethnicity variable are: W for White, B for Black, A for Asian, H for Hispanic, I for American Indian, P for NHOPI (Native Hawaiian or Other Pacific Islander), and Mixed (for any combination of two non-White, non-Hispanic races, such as Black/Asian).
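The four consolidation rules can be sketched as a single function. This is a minimal illustration rather than the SAS code actually used: the input encoding (a Hispanic flag plus a set of race codes) is hypothetical, and treating White plus two or more minority races as Mixed is an assumption, since the rules above do not address that case.

```python
def consolidate_race_ethnicity(is_hispanic, races):
    """Collapse ethnicity and race into one level: W, B, A, H, I, P, or Mixed.

    `races` is a collection of single-letter race codes (hypothetical encoding).
    """
    if is_hispanic:                      # Rule 1: Hispanic ethnicity takes priority
        return "H"
    races = set(races)
    minorities = races - {"W"}
    if len(minorities) >= 2:             # Rule 4: two or more non-White races
        return "Mixed"                   # (assumed to apply even if White is also listed)
    if len(minorities) == 1:             # Rules 2 and 3: a single minority race,
        return next(iter(minorities))    # with or without White
    return "W" if "W" in races else None  # Rule 3: White only; None if no race reported


examples = [
    consolidate_race_ethnicity(True, {"W"}),        # Hispanic owner who also listed White
    consolidate_race_ethnicity(False, {"W", "B"}),  # White and Black
    consolidate_race_ethnicity(False, {"A"}),       # Asian only
    consolidate_race_ethnicity(False, {"B", "A"}),  # Black and Asian
]
```

Checking the Hispanic flag first is what resolves the duality between ethnicity and race: a Hispanic owner is never classified by the additional race he or she was required to report.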
Tabulation Weights
In most surveys, some groups are over-represented in the raw data and others under-represented. To address this, a weight is assigned to each observation to compensate for the over- or under-representation. While the exact method of determining these weights for the PUMS is unknown, the values of the tabulation weight range from 1.0 to 35.0. In this sense, a single observation with a weight of 35.0 is functionally the same as thirty-five individual observations with the same values and a weight of 1.0 each. Analysis of the data therefore requires the weights to be properly factored into averages, percentages, and regressions through the appropriate SAS procedures.
Fitting a Regression
As a first step, I wanted to fit a regression of receipts on all variables to observe the overall coefficient of determination and the comparative significance of each individual variable in determining receipts. This would immediately confirm or deny several of my possible research questions about the variables and would then be a starting point for further research ideas. I also considered other possible response variables besides receipts, such as employment, payroll, and certain categorical variables such as whether or not the business was still operating. Ultimately, I decided to focus primarily on receipts as the response variable, since in general business theory an increase in either payroll or employment usually comes after an increase in overall revenues. To begin, I had to identify the proper statistical procedure to regress receipts on several quantitative and multi-level categorical variables. I also wanted to use a procedure that would automatically create dummy variables and evaluate correlations while controlling for other variables. Ultimately, I selected the generalized linear model procedure because it satisfied these criteria. Its limitations include the requirement that responses be independent of one another (which is extremely difficult to prove given game theory and the general economics of competition) and the assumption that the effects of the predictors are linear.
Regression One: All Variables Against Receipts
Data Summary: Number of Observations 874,182
Included Variables: RECEIPTS_NOISY EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 FOUNDED1 PURCHASED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCGRANT SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECFAMLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV CEASENR HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks OPERATING LANGUAGE
Further analysis shows that these are the top five variables based on both the magnitude of the t-value and the estimate, as these parameters indicate both statistical and practical significance. Since the vast majority of variables are significant in the model at the 0.05 significance level, the magnitude of the estimate is the deciding factor in establishing importance.
Estimated Regression Coefficients

Parameter                                                     Estimate     Standard Error   t Value   Pr > |t|
SECTOR 11 Agriculture, Fishing                                273.5380     19.767836         13.84    <.0001
SECTOR 21 Mining, Oil Extraction                              268.7993     21.180031         12.69    <.0001
SECTOR 22 Trade, Transportation, Utilities                    239.4820     20.632043         11.61    <.0001
SECTOR 23 Construction                                        335.1240     19.601069         17.10    <.0001
SECTOR 31 Manufacturing                                       352.3094     20.018820         17.60    <.0001
SECTOR 42 Wholesale Trade                                     638.2235     20.447289         31.21    <.0001
SECTOR 44 Retail Trade                                        392.6039     19.464185         20.17    <.0001
SECTOR 48 Warehousing / Distribution                          270.2850     19.592456         13.80    <.0001
SECTOR 51 Information and Media                               227.2462     19.599947         11.59    <.0001
SECTOR 52 Finance and Insurance                               189.1639     19.496633          9.70    <.0001
SECTOR 53 Real Estate and Rental                              201.2833     19.350880         10.40    <.0001
SECTOR 54 Professional, Scientific, and Technical Services    207.9890     19.428692         10.71    <.0001
SECTOR 55 Mgmt. of Companies and Enterprises                -1827.4092     68.385800        -26.72    <.0001
SECTOR 56 Admin. and Support and Waste Management             205.8534     19.478828         10.57    <.0001
SECTOR 61 Education                                           220.9200     19.407797         11.38    <.0001
SECTOR 62 Healthcare                                          194.4817     19.586984          9.93    <.0001
SECTOR 71 Arts and Entertainment                              224.3419     19.398649         11.56    <.0001
SECTOR 72 Accommodation and Food                              131.5430     19.560860          6.72    <.0001
SECTOR 81 Other Services                                      228.5184     19.431691         11.76    <.0001
Sector has the highest magnitude in its level estimates, while several other variables have similar estimate levels:

Parameter          Estimate    Standard Error   t Value   Pr > |t|
N07_EMPLOYER E     193.2805    3.474976         55.62     <.0001
N07_EMPLOYER N       0.0000    0.000000         .         .
HEALTHINS 1        174.8802    5.455380         32.06     <.0001
HEALTHINS 2          0.0000    0.000000         .         .
Fit Statistics: R-square 0.5771 Adjusted R-square 0.5770 Root MSE 446.73
Interpretation: Overall, the regression fits moderately well: with an R-square of 0.5771, the explanatory variables account for roughly 58% of the variation in receipts. Certain variables have far higher t-values than others. The most significant explanatory variables in this regression are: Sector, Provide1 (if owner 1 provided goods or services), HomeBased, and Contractors. A number of variables were also insignificant, however, which led me to consider further testing that would automatically select a better model with fewer variables and a lower MSE, such as stepwise regression.
Model Selection Techniques
To give an overview of available procedures:
Forward Analysis
In this approach, variables are added to the model one at a time. At each step, each variable that is not already in the model is tested for inclusion. The most significant of these variables is added to the model, so long as the p-value of its F-statistic is below some pre-set level. It is customary to set this level above the conventional .05, at say .10 or .15. In practice, the user begins with a model including the variable that is most significant in the initial analysis and continues adding variables until none of the remaining variables is "significant" when added to the model. The following hypotheses are tested at each iteration of the process.
Null Hypothesis: $\beta_k = 0$
Alternative Hypothesis: $\beta_k \neq 0$ (that is, the coefficient of the additional variable is statistically nonzero)
The decision statistic is therefore the partial F-statistic. SAS stops the forward, backward, or stepwise regression if the p-value of the partial F-statistic for a new model is above a maximum value, and continues model selection so long as that p-value is below the threshold.
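Forward selection driven by the partial F-statistic can be sketched in a few lines. This is an illustration only, not what SAS does internally: it uses a fixed "F-to-enter" cutoff (4.0, an assumed value) in place of a p-value threshold because the Python standard library has no F distribution, it ignores tabulation weights, and the data and names are hypothetical.

```python
def sse(columns, y):
    """Residual sum of squares from least squares of y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2 for r in range(n))


def forward_select(candidates, y, f_to_enter=4.0):
    """Add the candidate with the largest partial F at each step until none reaches the cutoff."""
    n = len(y)
    selected = []
    current_sse = sse([], y)
    while len(selected) < len(candidates):
        best = None  # (partial F, name, SSE of the enlarged model)
        for name, col in candidates.items():
            if name in selected:
                continue
            trial_sse = sse([candidates[s] for s in selected] + [col], y)
            df_resid = n - (len(selected) + 2)  # intercept + selected terms + new term
            f_stat = (current_sse - trial_sse) / (trial_sse / df_resid)
            if best is None or f_stat > best[0]:
                best = (f_stat, name, trial_sse)
        if best[0] < f_to_enter:
            break
        selected.append(best[1])
        current_sse = best[2]
    return selected


# Hypothetical data: y depends on x1 plus small noise; x2 is an unrelated alternating series.
x1 = [float(i) for i in range(1, 11)]
x2 = [1.0, -1.0] * 5
noise = [0.3, -0.1, 0.2, -0.4, 0.1, 0.0, -0.2, 0.3, -0.3, 0.1]
y = [5.0 + 2.0 * a + e for a, e in zip(x1, noise)]
chosen = forward_select({"x1": x1, "x2": x2}, y)
```

Here x1 is entered first because it produces by far the largest partial F-statistic; whether x2 follows depends on whether its partial F, given x1, reaches the cutoff.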
Backwards Analysis
This method is less popular because it begins with a model in which all candidate variables have been included. However, because of its structure, a high coefficient of determination is always maintained. The problem is that the models selected by this procedure may include variables that are not really necessary. At each step, the variable that is the least significant is removed. This process continues until no nonsignificant variables remain. The user sets the significance level at which variables can be removed from the model, which is usually 0.05.
Stepwise Regression
As a fusion of forward and backward selection, stepwise regression considers four options at each stage: add a term, delete a term, swap a term in the model for one not in the model, or stop. This algorithm is the one most often used in practice. Despite its widespread use, it has little theoretical basis. A more theoretically robust tool such as Akaike's Information Criterion (AIC) can also be used as a metric to assess models.
Limitations
Fit Statistics
Adjusted Coefficient of Determination / Adjusted R-Square
The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it usually is not. It is never higher than the R-squared.
Akaike Information Criterion
The Akaike Information Criterion (AIC) is a way of selecting a model from a set of models. The chosen model is the one that minimizes the estimated Kullback-Leibler divergence between the model and the truth. It is based on information theory, but a heuristic way to think about it is as a criterion that seeks a model that has a good fit to the truth but few parameters. It is defined as $\mathrm{AIC} = -2\ln(L) + 2K$, where the likelihood $L$ is the probability of the data given a model and $K$ is the number of free parameters in the model. For a least-squares regression on $n$ observations this reduces, up to an additive constant, to $\mathrm{AIC} = n\ln(\mathrm{SSE}/n) + 2K$. AIC scores are often shown as $\Delta$AIC scores, the difference between the best model (smallest AIC) and each model (so the best model has a $\Delta$AIC of zero). In stepwise regression, AIC can be used instead of the p-value as the main criterion for model selection. The AIC of each candidate model is calculated and compared to that of the previous model, and the candidate is preferred only if its AIC is smaller. This process continues until the best model is selected.
Bayesian Information Criterion
$\mathrm{BIC} = n\log(\mathrm{SSE}_p) - n\log(n) + p\log(n)$
The BIC acts essentially the same as the AIC but imposes a more severe penalty on each parameter once $n \geq 8$, since $\log(n)$ then exceeds 2.
Schwarz Bayesian Criterion
$\mathrm{SBC} = n\ln(\mathrm{SSE}/n) + k\ln(n)$
This is essentially like the AIC equation, but it penalizes each parameter by $\ln(n)$, a term based on sample size, rather than by a constant of 2. By default, PROC GLMSELECT uses stepwise selection based on the Schwarz Bayesian Criterion.¹
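To make the comparison between these criteria concrete, the following minimal sketch computes the least-squares forms of AIC and SBC, n ln(SSE/n) + 2k and n ln(SSE/n) + k ln(n), together with the usual adjusted R-squared formula. The function names and the numbers passed to them are hypothetical and are not taken from the regressions in this report.

```python
import math


def aic(sse, n, k):
    """Least-squares AIC (up to an additive constant): n*ln(SSE/n) + 2k."""
    return n * math.log(sse / n) + 2 * k


def sbc(sse, n, k):
    """Schwarz Bayesian Criterion: n*ln(SSE/n) + k*ln(n)."""
    return n * math.log(sse / n) + k * math.log(n)


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors plus an intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical values: the same SSE and parameter count at two small sample sizes.
gap_n7 = sbc(100.0, 7, 3) - aic(100.0, 7, 3)   # k * (ln 7 - 2), negative
gap_n8 = sbc(100.0, 8, 3) - aic(100.0, 8, 3)   # k * (ln 8 - 2), positive
```

The sign change between n = 7 and n = 8 is where ln(n) overtakes 2; with hundreds of thousands of observations, as in the regressions here, the SBC penalty per parameter is several times larger than that of the AIC, so SBC tends to select smaller models.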
Mallows' Cp
$C_p = \frac{(1 - R_p^2)(n - T)}{1 - R_T^2} - [n - 2(p + 1)]$
The AIC has been shown to be equivalent to Mallows' Cp, which is used to assess the fit of a regression model that has been estimated using ordinary least squares. Cp measures the bias in the reduced regression model relative to the full model having all T candidate predictors. If Cp is roughly equal to p, then the reduced model predicts as well as the full model. If Cp < p, then the reduced model is estimated to predict better than the full model. In practice, the selected model should have the smallest Cp.
Mean Squared Error (MSE)
The Mean Squared Error in regression refers to the residual sum of squares divided by the number of degrees of freedom. Minimizing MSE is important to ensure that as much of the variation in the response as possible is explained by the independent variables, which supports the robustness of a model. It is one of the most important and fundamental criteria that can be used to evaluate models.
General Criteria: General diagnostics should be calculated for each model to help determine which model is best. These model diagnostics include the mean squared error (MSE), the adjusted coefficient of determination (adjusted R-squared), and Mallows' Cp. A good linear model will have a small MSE and Cp and a high adjusted R-squared, close to 1. With these criteria in mind, in addition to stepwise regression with tools such as AIC, BIC, and SBC, I can develop a more robust model than the original. However, it is also important to note that these criteria and selection procedures will not definitively yield the best model, due to the sheer number of potential models and the inherent limitations of these tools.
Regression Two: PROC GLMSELECT with the Schwarz Bayesian Criterion
Using PROC GLMSELECT, I followed the steps described above to select a model using stepwise regression with the default Schwarz Bayesian Criterion.
Data Summary: Number of Observations 874,182
Fit Statistics: R-square 0.5768 Adjusted R-square 0.5768 Root MSE 1516.49340
Regression Three: PROC GLMSELECT with Akaike's Information Criterion
Using PROC GLMSELECT, I followed the steps described above to select a model using the AIC.
Data Summary: Number of Observations 874,182
Included Variables: EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 FOUNDED1 PURCHASED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCGRANT SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECFAMLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks LANGUAGE
Fit Statistics: R-square 0.5771 Adjusted R-square 0.5770 Root MSE 1516.10130
Because the AIC model is on par with the highest overall adjusted R-squared, has the fewest model variables, and has a relatively low MSE compared to the second stepwise model, it should be considered the most robust. Note that the first linear regression of all variables has a considerably lower MSE than the stepwise-selected models, despite the inclusion of far more variables. However, given the prevalence of missing data and nonresponse in these data sets, future data collection may benefit from relying on fewer variables in the generalized linear model. Therefore, the AIC model ranks as the most effective.
The most significant variables are also those mentioned in Regression One: sector, health insurance, holidays, and employer status for tabulation. Sector is arguably the most intuitively significant, while both health insurance and holidays usually indicate a business performing well enough to provide extended luxuries for employees. Surprisingly, both payroll and employment had far smaller estimate magnitudes, which indicates insignificance on a practical level. The regression had several limitations. As stated earlier, one of the main drawbacks of the generalized linear model is that it can over-fit the data; it also assumes that the relationships between the explanatory variables and the response variable are linear, which may not be the case. Ideally, nonlinearity could be checked for each variable by plotting the residuals against the predicted values. Furthermore, multicollinearity should be assessed by calculating the variance inflation factor for each variable, which remains to be done because SAS has no built-in functionality for this purpose with survey data. Given the large number of variables, testing for multicollinearity is especially important. From a cursory analysis, most of the t-ratios for the individual coefficients are statistically significant, which could indicate that multicollinearity is not severe. Another option was to create a correlation matrix, which proved too cumbersome because the matrix would have dimensions greater than 150 x 150. Finally, further work could investigate the addition of extra terms, such as interactions between original terms, that better reflect the lack of complete independence between explanatory variables. Proper model selection requires conscientious consideration of the tradeoff between the competing objectives of adherence to the data and model simplicity.
Good models conform to the data with a strong goodness of fit, but are also easily generalizable in their interpretation. Finally, good models should neither under-fit (leaving out key variables in an attempt to be generalizable) nor over-fit the data (including extraneous or unrealistic variable effects in an attempt to achieve the best goodness of fit), because in either scenario the conclusions lose value.
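As a small illustration of the multicollinearity check discussed above, the variance inflation factor for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The sketch below covers only the simplest case of exactly two predictors, where R_j^2 is the squared Pearson correlation. The function names and the data are hypothetical, and the sketch ignores the survey weights that a full analysis of the PUMS would need.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x, z):
    """VIF shared by both predictors when the model has exactly two predictors."""
    r = pearson_r(x, z)
    return 1.0 / (1.0 - r * r)


# Hypothetical values, not PUMS data: the correlation is 0.6, so VIF = 1 / 0.64.
x = [1, 2, 3, 4]
z = [2, 1, 4, 3]
vif = vif_two_predictors(x, z)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity; with more than two predictors, each R_j^2 would have to come from a separate auxiliary regression.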
QQ Plot of Residuals versus a Normal Distribution
This quantile-quantile plot of the residuals versus a normal distribution shows that the residuals appear to be normally distributed through the inner quartiles, but heavily skewed with long tails on both sides, particularly the left side. Because this Q-Q plot indicates significant skew, this model cannot be used to draw strong conclusions.
Plot of Fitted Values versus Residuals
This plot of fitted values versus residuals shows a biased and heteroscedastic spread, with residuals steadily decreasing in variation at higher values of the response variable. Because the points are extremely dense, owing to the large amount of overall variation and the sheer number of data points, more specific phenomena cannot be analyzed at this point. The plot also suggests that the model drastically over-estimates some predicted values, given the sheer magnitude of the negative residuals, which calls for further analysis of the observed businesses that deviate the most from the model. Given that over half of the observed businesses have receipts of less than $16,000 according to the five-number summary, running a regression on the subset of the PUMS that includes only businesses earning more than the median may produce a better-fitting, less biased, and more homoscedastic model. The five-number summary for the residuals clearly shows greater negative skew on the whole, but a greater density of positive values over smaller intervals, thus confirming the residuals versus fitted plot. Since numerical variables typically have more leverage over the fit and spread of the residuals, and receipts are extremely right-skewed, an analysis of both payroll and employment is warranted to see if similar effects are in place. The following two tables describe the five-number summaries for both payroll and employment.
Five Number Summary of RECEIPTS_NOISY
Min    Q1         Median      Q3          Max
0      1.72393    15.23411    81.77834    6900

Five Number Summary of Residuals
Min          Q1         Median    Q3        Max
-93534.00    -222.63    -36.71    101.67    8271.42

Five Number Summary of EMPLOYMENT_NOISY
Min    Q1    Median    Q3    Max
0      0     0         0     4890.00

Five Number Summary of PAYROLL_NOISY
Min    Q1    Median    Q3    Max
0      0     0         0     280000.00

These results confirm that both employment and payroll are heavily skewed right, which would explain the potential for the model to drastically overestimate certain values. Since the actual magnitude (from a dollar or labor force standpoint) causes this extreme skew, creating a new binary variable for each of these variables, such that 0 indicates employment or payroll of 0 and 1 indicates employment or payroll greater than 0, could result in better fit. Additionally, taking the logarithm of RECEIPTS_NOISY may improve fit as well, given the skew. To establish a control for the following regressions, I first run PROC GLMSELECT with AIC, with the only change being the logarithm of the receipt values.
Regression Four: PROC GLMSELECT with AIC: Receipts with Logarithm Transform
Data Summary: Number of Observations - 784,208
While the fit has improved by most standards, the overall Q-Q plot and the fitted versus residual values plot still indicate major issues with skew and bias. As stated earlier, creating binary variables out of the existing numeric payroll and employment variables could control some of the drastic skewness observed in both plots, which leads to the next regression.
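The two transformations described above can be sketched as follows. This is a minimal Python illustration rather than the SAS data step actually used: the indicator names ifpayroll and ifemployment match the variables listed for Regression Five, while `add_model_variables` and `log_receipts` are hypothetical names. Treating zero receipts as missing is an assumption, made because the logarithm of zero is undefined, and the natural logarithm is assumed.

```python
import math


def add_model_variables(record):
    """Return a copy of one business record with the derived model variables."""
    out = dict(record)
    # Binary indicators: 1 if the business reports any payroll or employment.
    out["ifpayroll"] = 1 if record["PAYROLL_NOISY"] > 0 else 0
    out["ifemployment"] = 1 if record["EMPLOYMENT_NOISY"] > 0 else 0
    # Assumption: zero receipts are treated as missing, since log(0) is undefined.
    receipts = record["RECEIPTS_NOISY"]
    out["log_receipts"] = math.log(receipts) if receipts > 0 else None
    return out


# Hypothetical record: receipts at roughly the sample median, no payroll.
example = add_model_variables(
    {"RECEIPTS_NOISY": 15.0, "PAYROLL_NOISY": 0.0, "EMPLOYMENT_NOISY": 3.0}
)
```

Replacing the magnitudes with indicators discards information about business size, which is the price paid for removing the extreme right skew from the predictors.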
Regression Five: PROC GLMSELECT with AIC: Modified Binary Payroll and Employment with Other Character Variables against the Logarithm of Receipts Using PROC GLMSELECT, I used the aforementioned steps to select a model using the AIC with the modified variables of binary payroll and employment. Variables Used: PCT1 FIPST SECTOR N07_EMPLOYER SEX1 VET1 FOUNDED1 PURCHASED1 INHERITED1 RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCBANKLOAN SCFAMLOAN SCVENTURE SCGRANT SCOTHER SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL OTHERBUS INDIVIDUALS EXPORTS FULLTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks LANGUAGE ifpayroll ifemployment
Fit Statistics: R-square 0.6550 Adjusted R-square 0.6549 Root MSE 3.13900
Although the fit statistics seem to indicate a worse fit than the previous model, given lower coefficients of determination and a higher root MSE, the Q-Q plot clearly indicates better overall fit and far less skew, especially on the left.
The correction of skew also allows me to observe the bulk of the fitted versus residuals plot more closely, which is starting to show very peculiar patterns that require further investigation. One possible reason for this is that virtually all of the regressed variables are categorical, with the exception of the ownership percentage of the first business owner. Although the previous model had better fit values and the current model may be susceptible to overfitting the sample, this model iteration still ultimately has better results according to the Q-Q plot and the fitted versus residuals plot, due to reduced skewness.
Analysis of Language
Research Question 1a: Does the language spoken in transactions produce a difference in correlated receipts?
The United States is often seen as the nation most conducive to immigrant entrepreneurship, especially in an increasingly globalized society. To start, I investigated the relationship between different languages and overall business receipts. Running a general linear model procedure with only the language variable against receipts results in the following:
Estimates of Language Variable Levels Against Receipts
One interesting phenomenon is that a business that conducts transactions in only English and Spanish is associated with higher gross receipts than a business that conducts transactions in any other language or language combination. Most notably, the estimate for a business that uses only English and Spanish is nearly 40% greater than the estimate for a business that uses only English for transactions. It's possible that Hispanics are employed in the agriculture, manufacturing, and construction industries more so than in others. The following graph, based on 2008 research by the Center for Construction Research and Training, depicts Hispanic employees as a percentage of each industry.
These data point to a possible underlying non-uniform distribution of Hispanics across industries. It's important to understand the difference between businesses that conduct transactions in a certain language and businesses that operate internally using the language. For example, a business with a predominantly Hispanic workforce may not necessarily conduct business transactions in Spanish. Therefore, an additional research question could be the relationship between the language variable and the new race/ethnicity variable. The percentage of Hispanics by industry in this instance simply points to industries to investigate more closely, such as construction and agriculture. Using this information, I analyzed the breakdown of language by NAICS sector code, paying special attention to the industries that had the greatest percentage of Only English and Spanish businesses.
Industries with the Greatest Proportion of Only English and Spanish Businesses:
Sector 55: 13.63%
The Management of Companies and Enterprises sector comprises (1) establishments that hold the securities of (or other equity interests in) companies and enterprises for the purpose of owning a controlling interest or influencing management decisions or (2) establishments (except government establishments) that administer, oversee, and manage establishments of the company or enterprise and that normally undertake the strategic or organizational planning and decision making role of the company or enterprise.
Sector 62: 10.74%
The Health Care and Social Assistance sector comprises establishments providing health care and social assistance for individuals.
Industries with the Greatest Proportion of Only English Businesses:
Sector 21: 95.71%
The Mining sector comprises establishments that extract naturally occurring mineral solids, such as coal and ores; liquid minerals, such as crude petroleum; and gases, such as natural gas.
Sector 11: 92.90%
The Agriculture, Forestry, Fishing and Hunting sector comprises establishments primarily engaged in growing crops, raising animals, harvesting timber, and harvesting fish and other animals from a farm, ranch, or their natural habitats.
Given these interpretations, looking at the mean receipts for each sector could then explain why businesses that conduct transactions in English and Spanish have higher receipts on average. The following table shows each sector and its mean receipts:
Statistics
Variable          Mean          Std Error of Mean
RECEIPTS_NOISY    174.631446    0.353243

Domain Analysis: SECTOR
SECTOR                                             Mean          Std Error of Mean
11 Agriculture, Fishing                            96.067138     2.182337
21 Mining, Quarrying, Oil Extraction               233.433129    6.192045
22 Trade, Transportation, Utilities                113.609696    6.397033
23 Construction                                    218.497973    1.149555
31 Manufacturing                                   527.070573    4.920525
42 Wholesale Trade                                 628.044453    5.295762
44 Retail Trade                                    267.633023    1.575551
48 Transportation and Warehousing                  144.352095    1.225643
51 Information                                     156.241621    2.441339
52 Finance and Insurance                           170.082058    1.558410
53 Real Estate and Rental                          120.396544    0.699686
54 Professional, Scientific, Technical Services    139.530715    0.707921
55 Mgmt. of Companies and Enterprises              490.728375    19.475524
56 Admin. and Support and Waste Management         100.129900    0.835629
61 Education                                       45.306452     0.868810
62 Healthcare                                      175.218046    1.177138
71 Arts and Etnmt.                                 58.393615     0.740265
72 Accommodation and Food Services                 356.426518    2.657923
81 Other Services                                  69.516196     0.417748
99 Unclassifiable                                  102.268452    8.707951
According to this PROC SURVEYMEANS output, which gives the mean gross receipts of businesses in each industry alongside the mean gross receipts of all businesses across industries (174.63), sectors 55 and 62 both earn above average. While this preliminarily explains why the overall estimate for Only English and Spanish businesses is higher than the estimates for other language levels, the following question should be investigated:
Research Question 1b: Out of only existing Only English and Spanish businesses, which are the most popular industries for business?
Table of LANGUAGE by SECTOR
LANGUAGE: Only English and Spanish
SECTOR    Frequency    Weighted Frequency    Percent
11        390          4076                  0.4269
21        263          1907                  0.1998
22        123          533.41200             0.0559
23        7472         101173                10.5960
31        3480         17790                 1.8632
42        4346         27406                 2.8702
44        13181        113815                11.9199
48        3979         44151                 4.6239
51        1460         9222                  0.9658
52        5665         46651                 4.8858
53        6760         85715                 8.9770
54        11573        117115                12.2656
55        1159         1581                  0.1656
56        5933         74071                 7.7575
61        1325         16665                 1.7453
62        11269        127184                13.3201
71        1917         25298                 2.6495
72        4017         35816                 3.7511
81        7699         104522                10.9467
99        22           135.88500             0.0142
Total     92033        954827                100.000
The top five concentrations of Only English and Spanish businesses are in sectors 62 (13.32% and receipts of 175.22), 54 (12.27% and receipts of 139.53), 44 (11.92% and receipts of 267.63), 81 (10.95% and receipts of 69.52), and 23 (10.60% and receipts of 218.50). Although Only English and Spanish businesses made up a large share of the businesses within sector 55, that sector actually had one of the smallest frequencies, with just 1159 businesses in total. Revising my initial conclusion, I argue that the higher estimate is most likely derived from above-average performance in three of these five sectors (62, 44, and 23), where Only English and Spanish businesses are prevalent.
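The ranking discussed above can be reproduced from the published figures. The following Python sketch uses the weighted frequencies from the LANGUAGE by SECTOR table and the sector means from the PROC SURVEYMEANS output (rounded to two decimals); the variable names are hypothetical and the code is an illustration, not part of the original SAS workflow.

```python
# Weighted frequency of Only English and Spanish businesses by NAICS sector.
weighted_freq = {
    "11": 4076, "21": 1907, "22": 533.412, "23": 101173, "31": 17790,
    "42": 27406, "44": 113815, "48": 44151, "51": 9222, "52": 46651,
    "53": 85715, "54": 117115, "55": 1581, "56": 74071, "61": 16665,
    "62": 127184, "71": 25298, "72": 35816, "81": 104522, "99": 135.885,
}

# Mean receipts by sector, rounded to two decimals.
mean_receipts = {
    "11": 96.07, "21": 233.43, "22": 113.61, "23": 218.50, "31": 527.07,
    "42": 628.04, "44": 267.63, "48": 144.35, "51": 156.24, "52": 170.08,
    "53": 120.40, "54": 139.53, "55": 490.73, "56": 100.13, "61": 45.31,
    "62": 175.22, "71": 58.39, "72": 356.43, "81": 69.52, "99": 102.27,
}
OVERALL_MEAN = 174.63  # mean receipts across all sectors

# Share of each sector among Only English and Spanish businesses, in percent.
total = sum(weighted_freq.values())
shares = {sector: 100.0 * w / total for sector, w in weighted_freq.items()}

# Five sectors with the largest shares, and those among them above the mean.
top_five = sorted(shares, key=shares.get, reverse=True)[:5]
above_average = [s for s in top_five if mean_receipts[s] > OVERALL_MEAN]
```

Here `top_five` holds sectors 62, 54, 44, 81, and 23, and `above_average` holds 62, 44, and 23, so two of the five most common sectors (54 and 81) have below-average mean receipts.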
Visualizing the Percentage of Hispanic/Spanish-Speaking Businesses in the United States
As noted earlier, there exists a dichotomy between businesses run by Hispanic owners and businesses that conduct transactions in English and Spanish. I approached this issue in a different fashion, by plotting the percentage of both (for each state) on a map of the United States. I pre-defined set intervals after looking at the spread of percentages for each state. The following graphs generally depict the same patterns of higher concentrations of all Hispanic/Spanish-speaking subjects located in the Southwest and Florida. I obtained 2007 data on the number of Hispanics living in each state from the Pew Research Hispanic Trend Project.
Analysis of Capital Sources in California versus the United States
California's $2 trillion economy would be the ninth biggest in the world if it were a country. The state represents 13% of the U.S. economy. However, California has been ranked as one of the worst states in which to do business in recent years, according to business executives and publications. [2] The state has been under duress from the dramatic fall in home prices and the reduced tax revenues for the state. Moreover, California consistently has one of the highest costs of living and operation. Interestingly, California also ranks among the best for technology and innovation. Another plus is the $36 billion in venture capital money invested in California companies over the past three years, which is four times the total of any other state. [3] California is also home to Silicon Valley. According to a 2006 study by the American Electronics Association, Silicon Valley and the Bay Area as a whole ranked first in terms of the number of high-tech jobs in the United States. While the PUMS does not have location data more granular than the state level, the existence of Silicon Valley itself could point to interesting statistical characteristics of California that no other state may share. Because California is a state of extremes, I found it interesting to investigate sources of startup and expansion capital and their relationship to revenues in California, especially compared to the same relationship in the United States as a whole.
Dataset Overview
The observed dataset is simply a subset of the cleaned PUMS, created by keeping only observations whose FIPS code is 06 (California). This dataset has 182,932 observations, and most categorical variables (except location) seem to have a missing percentage of 30-50%, which is slightly higher than the typical missing pattern observed in the U.S. dataset. In this section, both startup and expansion capital will be analyzed.
Startup capital refers to the initial cost of investment to fully bring a product or service to market. It can be used for everything from business operation expenses to research and development to payroll. It is typically used to fund businesses still in their infancy, and can be repaid once the business reaches a level of maturity to earn revenues on its own. Companies that seek expansion capital, on the other hand, will often do so in order to finance a transformational event in their business. These companies are likely to be more mature (in terms of operating time) than venture capital funded companies, able to generate revenue and operating profits but unable to generate sufficient cash to fund major opportunities, acquisitions or other investments. Because of this lack of scale these companies generally can find few alternative conduits to secure capital for growth, so access to growth equity can be critical to pursue necessary facility expansion, sales and marketing initiatives, equipment purchases, and new product development.
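The California subset described in the Dataset Overview, along with the per-variable missing percentage, could be built along the following lines. This is a hedged Python sketch rather than the SAS code used for the study: FIPST is the state variable named in the regression variable lists, while the function names, the toy records, and the convention that missing values appear as None or an empty string are all assumptions made for illustration.

```python
def subset_by_state(records, fips_code="06"):
    """Keep only the records whose state FIPS code matches (06 = California)."""
    return [r for r in records if r["FIPST"] == fips_code]


def percent_missing(records, variable):
    """Percentage of records in which the variable is missing (None or empty)."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(variable) in (None, ""))
    return 100.0 * missing / len(records)


# Hypothetical toy records, not actual PUMS observations.
records = [
    {"FIPST": "06", "SCSAVINGS": "Yes"},
    {"FIPST": "06", "SCSAVINGS": ""},
    {"FIPST": "48", "SCSAVINGS": "No"},
    {"FIPST": "06", "SCSAVINGS": None},
]
california = subset_by_state(records)
missing_pct = percent_missing(california, "SCSAVINGS")
```

Comparing `percent_missing` on the state subset with the same figure for the full dataset is one way to check the 30-50% missing pattern noted above.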
Glossary of Capital Sources
Startup Capital (SC)
SCSAVINGS: Personal Savings
SCASSETS: Other Personal Assets
SCEQUITY: Home Equity
SCCREDIT: Credit Cards
SCGOVTLOAN: Government Loan
SCGOVTGUAR: Government Guaranteed Loan (the United States government and the Small Business Administration provide loans to certain businesses depending on size and capital purposes)
SCBANKLOAN: Loan from Bank
SCFAMLOAN: Loan from family and friends
SCVENTURE: Venture Capitalist
SCGRANT: Grant
SCOTHER: Other
Expansion Capital (EC)
ECSAVINGS: Personal Savings
ECASSETS: Other Personal Assets
ECEQUITY: Home Equity
ECCREDIT: Credit Cards
ECGOVTLOAN: Government Loan
ECGOVTGUAR: Government Guaranteed Loan (the United States government and the Small Business Administration provide loans to certain businesses depending on size and capital purposes)
ECBANKLOAN: Loan from Bank
ECFAMLOAN: Loan from family and friends
ECVENTURE: Venture Capitalist
ECPROFITS: Business Profits
ECGRANT: Grant
ECOTHER: Other
ECNOEXPAND: Did not expand
ECNOACCESS: No Access to Expansion Capital

Research Question 2a: What sources of capital have the most positive relationship with receipts?
Regressing both startup capital and expansion capital variables against receipts in both the California and United States datasets reveals interesting statistics on the percentage of usage of each type of capital, as well as the practical significance of each capital source as represented by its estimate. The following data tables derive their content from the full list of SAS outputs contained in Appendix C.

Summarized Table for Startup Capital
                 California          U.S.                Estimates            True Estimates (added to Intercept)
Source           Yes       No        Yes       No        CA        USA        CA        USA       Difference
SCSAVINGS*       60.97%    39.03%    58.01%    41.99%    11.70     -15.78     113.71    154.40    -40.69
SCASSETS         6.23%     93.77%    7.05%     92.95%    66.98     48.85      168.98    219.03    -50.04
SCEQUITY*        6.12%     93.88%    5.07%     94.93%    71.58     49.37      173.59    219.55    -45.96
SCCREDIT         11.22%    88.78%    10.31%    89.69%    -53.88    -90.21     48.13     79.96     -31.83
SCGOVTLOAN       0.40%     99.60%    0.56%     99.44%    155.74    91.93      257.74    262.10    -4.36
SCGOVTGUAR       0.45%     99.55%    0.59%     99.41%    128.86    195.60     230.87    365.78    -134.92
SCBANKLOAN*      4.47%     95.53%    9.18%     90.82%    247.36    305.75     349.37    475.93    -126.56
SCFAMLOAN        2.23%     97.77%    2.28%     97.72%    36.52     180.69     138.53    350.86    -212.34
SCVENTURE        0.30%     99.70%    0.25%     99.75%    68.87     364.61     170.88    534.78    -363.91
SCGRANT          0.18%     99.82%    0.20%     99.80%    -78.63    -103.56    23.37     66.62     -43.24
SCOTHER          1.74%     98.26%    1.70%     98.30%    203.10    170.52     305.10    340.69    -35.59
SCDONTKNOW       4.47%     95.53%    4.58%     95.42%    150.63    183.79     252.64    353.96    -101.32
SCNONENEEDED     23.59%    76.41%    24.35%    75.65%    -53.49    111.12     48.52     281.30    -232.78
SCNOTREPORTED    5.33%     94.67%    5.32%     94.68%    0.00      0.00       102.01    170.18    -68.17
INTERCEPT                                                102.01    170.18

In terms of usage frequency, differences greater than 1% between the United States and California are marked with an asterisk. Regardless, it's important to note that most of these differences (even those of less than 1%) are statistically significant due to the large sample size. Businesses in the United States as a whole are more than twice as likely to use bank loans as a source of startup capital when compared to businesses in California, while Californian business owners are more likely to use home equity and their own savings to start ventures. Overall, the top three categories with the highest estimate magnitudes for the United States are venture capital, bank loans, and government guaranteed loans. For California, these categories are bank loans, other sources of capital (unspecified), and government loans. Consistent with its low ranking, California as a whole seems to perform worse than the United States across the board, according to the difference in estimates. Most interestingly, California performs comparatively worst in the venture capital category, despite being the state with the greatest amount of venture capital invested in business venture formation. Other comparatively poor categories are loans from family members and businesses that do not need startup capital. These results warrant further analysis of the spread, rather than the average, of certain categories, as it is possible for Californian businesses to have greater extremes than American businesses in general.

Summarized Table for Expansion Capital
                 California          U.S.                Estimates            True Estimates (added to Intercept)
Source           Yes       No        Yes       No        CA        USA        CA        USA       Difference
ECSAVINGS*       32.24%    67.76%    29.00%    71.00%    -33.74    -112.84    76.38     118.78    -42.39
ECASSETS         3.90%     96.10%    4.00%     96.00%    21.89     5.40       132.03    237.03    -105.00
ECEQUITY*        5.84%     94.16%    4.33%     95.67%    135.96    90.04      246.09    321.67    -75.57
ECCREDIT*        13.68%    86.32%    12.09%    87.91%    -9.45     -88.74     100.67    142.87    -42.19
ECGOVTLOAN       0.33%     99.67%    0.39%     99.61%    419.09    99.43      529.23    331.06    198.17
ECGOVTGUAR       0.26%     99.74%    0.29%     99.71%    63.47     167.61     173.61    399.24    -225.63
ECBANKLOAN*      4.60%     95.40%    7.54%     92.46%    461.09    446.34     571.22    677.97    -106.74
ECFAMLOAN        1.06%     98.94%    0.93%     99.07%    69.17     113.62     179.31    345.25    -165.93
ECVENTURE        0.18%     99.82%    0.13%     99.87%    514.95    521.87     625.09    753.49    -128.40
ECPROFITS        8.95%     91.05%    9.23%     90.77%    124.95    162.90     235.09    394.53    -159.43
ECGRANT          0.20%     99.80%    0.19%     99.81%    -98.08    -131.52    12.05     100.10    -88.04
ECOTHER          0.83%     99.17%    0.76%     99.24%    127.44    132.62     237.57    364.24    -126.67
ECDONTKNOW       6.76%     93.24%    6.57%     93.43%    -9.43     -43.75     100.70    187.86    -87.16
ECNOACCESS       1.93%     98.07%    1.80%     98.20%    -68.32    -161.82    41.81     69.80     -27.98
ECNOEXPAND*      45.80%    54.20%    48.29%    51.71%    -26.99    -108.06    83.14     123.56    -40.41
ECNOTREPORTED    7.38%     92.62%    7.61%     92.39%    0         0          110.13    231.63    -121.49
INTERCEPT        N/A       N/A       N/A       N/A       110.137   231.627

As before, differences greater than 1% between the United States and California are marked with an asterisk. Businesses in the United States as a whole are more likely to use bank loans for expansion capital, or not to require it at all. Businesses in California are more likely to use their own savings, home equity, or credit card debt to fund expansion. Overall, the top three categories with the highest estimate magnitudes for the United States are venture capital, bank loans, and government guaranteed loans (identical to the top three for startup capital). For California, these categories are venture capital, bank loans, and government loans. Once again, businesses in the United States tend to benefit more from virtually all sources of expansion capital than businesses in California, with the exception of government loans. These interesting characteristics further warrant an analysis of spread, rather than just the average, for certain categories.
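The "True Estimates" columns in the summarized tables are obtained by adding each regression estimate to the intercept of its own dataset, and the "Difference" column subtracts the United States value from the California value. A short Python sketch of that arithmetic, using three startup capital rows from the table, is given below; the function and variable names are hypothetical. Results can differ from the published table in the last digit, presumably because the table was built from unrounded SAS output.

```python
CA_INTERCEPT = 102.01   # startup capital intercept, California
US_INTERCEPT = 170.18   # startup capital intercept, United States

# (California estimate, United States estimate) for selected startup sources.
startup_estimates = {
    "SCSAVINGS": (11.70, -15.78),
    "SCBANKLOAN": (247.36, 305.75),
    "SCVENTURE": (68.87, 364.61),
}


def true_estimates(ca_estimate, us_estimate):
    """Return (CA true estimate, US true estimate, CA minus US difference)."""
    ca_true = CA_INTERCEPT + ca_estimate
    us_true = US_INTERCEPT + us_estimate
    return ca_true, us_true, ca_true - us_true


summary = {
    source: tuple(round(v, 2) for v in true_estimates(ca, us))
    for source, (ca, us) in startup_estimates.items()
}
```

Because every source is coded as a separate yes/no variable, each true estimate describes a business that reports only that one source relative to the reference level, which is worth keeping in mind when comparing rows.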
Research Question 2b: How does the spread of receipts for certain capital sources compare between businesses in California and the general United States?
SCVENTURE and ECVENTURE Analysis
Even though California performed worse on average according to its estimates, I initially hypothesized that California had an overall larger spread, with a maximum and upper quartile likely exceeding the maximum and upper quartile of receipts in the United States. According to the following five-number summary, California's quantiles for startup venture capital exceed those of the United States except for the maximum value, which indicates that Californian businesses on the whole perform better than businesses in the United States in general. The previous estimates from the regression are influenced by the heavier right skew of the United States data.

Venture Capital - Five Number Summaries
Quantile    California - SC    US - SC    California - EC    US - EC
Min         0                  0          0                  0
Q1          10.57              7.71       9.34               12.32
Median      97.28              52.54      91.73              92.12
Q3          774.90             454.80     608.45             725.24
Max         6600.00            6900.00    6900.00            6800.00
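For readers who want to reproduce this kind of spread comparison outside SAS, a five-number summary can be sketched with Python's standard library. The function name and the sample values below are hypothetical, and the sketch is unweighted: the published quantiles come from SAS and may use survey weights and a different quantile definition, so results will not match exactly.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of at least two numbers."""
    data = sorted(values)
    # The "inclusive" method treats the data as a full population and
    # interpolates linearly between neighbouring observations.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])


# Hypothetical receipts for a handful of businesses, not PUMS values.
summary = five_number_summary([4, 0, 2, 1, 3])
```

Comparing the quartiles of two groups side by side, as in the table above, shows differences in the bulk of the distribution that a mean dominated by a few very large businesses can hide.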
ECGOVTGUAR Analysis
Since ECGOVTGUAR was the only source of expansion capital for which California had a greater estimate than the United States, I also wanted to conduct a spread analysis. According to PROC SURVEYFREQ, there are 440 businesses in California that used a government guaranteed loan for expansion capital.

Expansion Capital from Gov't Guar. Loan
Quantile    California - EC    US - EC
Min         0.00               0.00
Q1          83.57              22.29
Median      301.42             174.09
Q3          993.67             621.90
Max         6900.00            6900.00
The spread confirms the general interpretation of the estimate: Californian businesses as a whole tend to benefit more from government guaranteed loans.
Conclusion
My analysis of the Public Microdata Sample of the 2007 Survey of Business Owners involved cleaning and manipulating raw data; selecting a generalized linear model of all variables against receipts using Akaike's Information Criterion; an analysis of business transaction language, industry of business, and ethnicity of the owner in the context of state location; and an investigation of the differences in capital sources between businesses in California and the United States in general. From the data cleaning, I was able to successfully incorporate my knowledge of the background survey methodology to effectively consolidate and remove variables and prepare the data for proper analysis in this context. From the model fitting, I was able to learn and apply different model selection techniques and selection criteria (AIC, BIC, SBC, Mallows' Cp, MSE, and adjusted R-square) to ultimately choose the best-fitting model, which attained a moderately strong adjusted coefficient of determination after several model selection manipulations that involved a logarithm transformation to reduce skewness and improve overall fit. The language analysis revealed that businesses that conducted transactions in only English and Spanish had higher estimates than businesses that only used English. Investigating these businesses within the context of sector ultimately showed that businesses that only used English and Spanish were represented at a higher percentage in certain sectors (such as sector 23, Construction) that earned more on average than other sectors where Only English businesses had a higher percentage.
Finally, the analysis of sources of capital revealed that California overwhelmingly performed worse than businesses in the United States based on averages and regression estimates, but closer analysis of the spread of receipts given startup venture capital in California shows that estimates can be misleading, and that heavy right skew invalidates conclusions based on the average alone. Given more time and access to the actual dataset (without noise and other confidentiality-preserving measures), I would be able to develop a more powerful and accurate model, along with other analyses. Other important avenues to investigate include creating a correlation matrix to observe collinearity between variables; observing the relationship between receipts and more demographic information such as age, gender, and education level; and conducting statistical analysis with different response variables, such as employment and payroll.
Lessons Learned
I've come to believe that doing independent research in the context of statistics is incredibly important. Through this study, I've been able to apply everything that I've learned in all of my statistics courses, from learning how to handle a very large data set in SAS to making the proper assumptions and conclusions from my analyses. I argue that this is the highest form of learning, as it is completely experiential, based on existing data, and set entirely in real-world scenarios. I've also been fortunate enough to study the fusion of my two academic fields: business and statistics. The flexibility of independent research has allowed me to learn about existing literature in the vast field of business statistics and entrepreneurship, as well as to field other possible research ideas such as social network analysis and survival analysis.
Appendix A: Full Output of Regression with log(Receipts) and Modified Payroll and Employment
The GLMSELECT Procedure Selected Model
The selected model is the model at the last step (Step 81). Effects: Intercept PAYROLL_NOISY EMPLOYMENT_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 VET1 FOUNDED1 PURCHASED1 INHERITED1 RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCVENTURE SCGRANT SCOTHER SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY ECCREDIT ECGOVTLOAN ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL OTHERBUS INDIVIDUALS EXPORTS FULLTIME PARTTIME LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE NUMOWNERS race1noblanks LANGUAGE
Analysis of Variance
Source             DF        Sum of Squares   Mean Square   F Value
Model              207       14974589         72341         7650.22
Error              784001    7413568          9.45607
Corrected Total    784208    22388157
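As a quick arithmetic check on the Analysis of Variance table, the mean squares, the F value, and the unadjusted R-square implied by the sums of squares can be recomputed from the reported degrees of freedom. The sketch below is written in Python rather than SAS, and the function name `anova_summary` is an illustrative assumption, not part of the original analysis.

```python
def anova_summary(ss_model, df_model, ss_error, df_error):
    """Recompute mean squares, F, and R-square from ANOVA sums of squares."""
    ms_model = ss_model / df_model   # mean square for the model
    ms_error = ss_error / df_error   # mean square error
    ss_total = ss_model + ss_error   # corrected total sum of squares
    return {
        "ms_model": ms_model,
        "ms_error": ms_error,
        "f_value": ms_model / ms_error,
        "ss_total": ss_total,
        "r_square": ss_model / ss_total,
    }


# Values copied from the Analysis of Variance table.
summary = anova_summary(14974589, 207, 7413568, 784001)
for name, value in summary.items():
    print(f"{name}: {value:.5f}")
```

The recomputed F value agrees with the reported 7650.22 to two decimal places, and the ratio of the model sum of squares to the corrected total gives an unadjusted R-square of roughly 0.67.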
Parameter Estimates
Parameter                              DF   Estimate   Standard Error   t Value
race1noblanks S                        1    0.002974   0.026112         0.11
race1noblanks W                        0    0          .                .
LANGUAGE Only English                  1    0.216239   0.017187         12.58
LANGUAGE Only English and Other        1    0.143442   0.017956         7.99
LANGUAGE Only English and Spanish      1    0.219765   0.017202         12.78
LANGUAGE Only Other Language           1    0.007221   0.023920         0.30
LANGUAGE Only Spanish                  0    0          .                .
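Because the response in this model is log(Receipts), each LANGUAGE estimate is a difference on the log scale relative to the reference level (Only Spanish, whose estimate is fixed at 0). The Python sketch below shows one way to read the table: it recomputes each t value as estimate divided by standard error, and converts each estimate into an approximate percent difference in receipts. It assumes the transform was the natural logarithm, which the report does not state explicitly, and the helper names `t_value` and `percent_difference` are hypothetical.

```python
import math

# (estimate, standard error) pairs copied from the Parameter Estimates table.
language_estimates = {
    "Only English": (0.216239, 0.017187),
    "Only English and Other": (0.143442, 0.017956),
    "Only English and Spanish": (0.219765, 0.017202),
    "Only Other Language": (0.007221, 0.023920),
}


def t_value(estimate, std_error):
    """t statistic for testing that a coefficient equals zero."""
    return estimate / std_error


def percent_difference(estimate):
    """Percent difference from the reference level implied by a coefficient
    on a natural-log response (assumption: natural log was used)."""
    return (math.exp(estimate) - 1.0) * 100.0


for level, (est, se) in language_estimates.items():
    print(f"{level}: t = {t_value(est, se):.2f}, "
          f"difference vs. Only Spanish = {percent_difference(est):.1f}%")
```

Under that assumption, the Only English and Only English and Spanish estimates both correspond to receipts roughly 24 to 25 percent higher than the Only Spanish reference level, holding the other effects in the model fixed.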
Appendix B: Regression of All Available Variables
The SAS System
The SURVEYREG Procedure
Regression Analysis for Dependent Variable RECEIPTS_NOISY
Data Summary
Number of Observations             874182
Sum of Weights                     10068531
Weighted Mean of RECEIPTS_NOISY    251.05405
Weighted Sum of RECEIPTS_NOISY     2527745467
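The weighted mean in the Data Summary is the weighted sum of RECEIPTS_NOISY divided by the sum of the weights. The short Python sketch below illustrates that relationship: `weighted_mean` is a hypothetical helper for record-level data, and the final lines check the reported totals directly.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of value * weight divided by the sum of weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Totals copied from the Data Summary.
weighted_sum = 2527745467
sum_of_weights = 10068531
reported_mean = 251.05405

recomputed_mean = weighted_sum / sum_of_weights
print(f"recomputed weighted mean: {recomputed_mean:.5f}")
print(f"reported weighted mean:   {reported_mean:.5f}")
```

For record-level data, a call such as `weighted_mean([100, 300], [1, 3])` returns 250.0, because the second record counts three times as much as the first.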
Fit Statistics
R-square          0.5771
Root MSE          446.73
Denominator DF    874181
Class Level Information
Class Variable    Levels    Values
LANGUAGE          5         Only English, Only English and Other, Only English and Spanish, Only Other Language, Only Spanish
region            9         East Sout, Mid-Atlan, Midwest, Mountain, Northeast, Pacific, South Atl, West Nort, West Sout