Yang 1

Jane Yang
Matthew Eckel
Analysis of Political Data
30 April 2018

The Power of Ideas: Does Ideology Affect the Percentage of Women in State Legislatures?

In this paper, I theorize that people who hold more liberal ideologies tend to vote

more women into office. More specifically, I posit that U.S. states with a more liberal

constituency on average tend to have a higher percentage of women in their state legislatures.

This is because “[i]deas about women’s role and position in society can enhance or

constrain women’s ability to seek political power.”1 In other words, “despite the presence of

favorable political systems or an adequate supply of female candidates, […] ideologies and

arguments against women’s right[s] to participate in politics have created substantial barriers to

women’s political participation for many years.”2 Put simply, our ideologies dictate how we

perceive women’s rights, women’s interests, and women’s roles; these perceptions then play a

role in whether or not we vote for a woman. Of course, my theory can only apply under certain

conditions. For one, the theory is specified at the level of U.S. states, so it should not be extended to other legislatures.

Moreover, my theory is contingent on the assumption that the data are coming from strong

democracies, where constituents are voting in free and fair elections and have complete agency

over their voting decisions. In other words, voters are not coerced into voting for specific

candidates in ways that would compromise their own ideological stances.

To test this theory, I draw from the “Correlates of State Policy Project” (CSPP) dataset,

compiled by the Institute for Public Policy and Social Research at Michigan State University,

and compare ADA/COPE measures of citizen ideology to the percentage of women in each

1 Pamela Marie Paxton and Sheri Kunovich, “Women’s Political Representation: The Importance of Ideology,” Social Forces 82:1 (September 2003): 90. http://muse.jhu.edu/article/47842/pdf
2 Paxton and Kunovich, “Women’s Political Representation,” 90-91.

state’s legislature.3 The CSPP includes “more than nine-hundred variables, with observations

across the U.S. 50 states and time (1900 – 2016). These variables represent policy outputs or

political, social, or economic factors that may influence policy differences across the states.”4 I

then proceed as follows. First, I consider other possible theories about women’s political

representation. Then, I describe in detail the independent variable, dependent variable, units of

observation, and possible control variables. After that, I run a preliminary bivariate linear

regression, as well as a more comprehensive multivariate linear regression, both with

visualizations and interpretations. Then, I assess the validity of the model itself. Finally, I

conclude with an overall analysis of my theory given my tests, a consideration for other possible

threats to my tests (namely, endogeneity and a lack of clarity on the ideology variable), as well

as ideas for future tests or areas of improvement.

Competing Theories

There are many other theories on what factors affect women’s representation in politics.

Broadly, these theories can be categorized as “supply” and “demand” factors; the idea is that “the

‘supply’ of female candidates and ‘demand’ for female candidates” affect the number of women

in legislatures.5 The “supply” factor posits that “Political elites are pulled disproportionately

from the highly educated and from certain professions, such as law,” thus, “if women do not

have access to educational and professional opportunities, they will not have the human financial

capital necessary to run for office.”6 The demand factor suggests that “institutional differences in

political systems may manifest a different ‘demand’ for women, irrespective of the available

3 Correlates of State Policy Web site. Michigan State University, Institute for Public Policy and Social Research (IPPSR). http://ippsr.msu.edu/public-policy/correlates-state-policy. This paper will elaborate on the ADA/COPE measure of ideology and the percentage of women in each state’s legislature in subsequent sections.
4 Correlates of State Policy Web site.
5 Paxton and Kunovich, “Women’s Political Representation,” 89.
6 Paxton and Kunovich, “Women’s Political Representation,” 89.

supply,” such as political parties and electoral systems, which “can be crucial factors in allowing

women access in equal numbers.”7 My theory on ideology counts as a “demand” factor, as I posit

that people with more liberal ideologies have a higher “demand” for female representatives.

Other competing theories include those on government corruption and socioeconomic

factors. For instance, “Studies on corruption, such as those initiated by researchers at the World

Bank, find evidence of a relationship between the number of women in parliament and the level

of corruption;” however, the causality is unclear.8 That is, national parliaments with a lower level

of corruption are correlated with higher numbers of women, but the causal mechanism is

unknown.9 Meanwhile, socioeconomic factors such as “women’s share in professional

occupations,” “welfare state policies,” and “increases in government (non-military) expenditure”

all increase female political representation due to the fact that these phenomena empower

women, such that “the political interests of working women are changed enough to create an

ideological gender gap.”10 My test takes these theories into consideration by controlling for

education and income levels, thus accounting for socioeconomic factors which may empower

women and increase the “supply” of eligible female candidates, as opposed to the “demand.”

Independent Variable: Citizen Ideology Measure (1960 – 2013)

This variable is based on Berry, Ringquist, Fording and Hanson’s aggregation of ADA

(Americans for Democratic Action) and COPE (Committee on Political Education) scores to

measure ideology, with 0 being the most conservative and 100 being the most liberal.11 As we

7 Paxton and Kunovich, “Women’s Political Representation,” 90.
8 Lena Wängnerud, “Women in Parliaments: Descriptive and Substantive Representation,” Annual Review of Political Science 12 (2009): 58. https://www.annualreviews.org/doi/pdf/10.1146/annurev.polisci.11.053106.123839
9 Wängnerud, “Women in Parliaments,” 58.
10 Wängnerud, “Women in Parliaments,” 56-58.
11 William D. Berry, Evan J. Ringquist, Richard C. Fording, and Russell I. Hanson, “Measuring Citizen and Government Ideology in the American States, 1960-93,” American Journal of Political Science 42:1 (January 1998): 327 – 348. https://www.jstor.org/stable/pdf/2991759.pdf?refreqid=excelsior:6047b68e3bbde6c3420943cee0689d9c

can see in the following histogram, the distribution of the scores is fairly normal, with the lowest

score being 0.963, the highest score being 95.972, and the mean score being 47.838.

It is important to note here that the conceptual definition of this variable is somewhat

misleading. Firstly, the Citizen Ideology Score is computed based on the ideology of the

representative of each district, which is then used to compute an average for the state as a whole.

As such, the Citizen Ideology Score is technically based on an aggregation of the ideology of

the representatives—not each individual citizen. Moreover, whereas social scientists generally

agree that ideological factors differ from political-institutional factors, the Citizen Ideology

Measure variable conflates ideology with political preferences.12 Both ADA and COPE measure

ideology based on how representatives vote on specific political issues. For instance, ADA uses

the Liberal Quotient, which “combin[es] 20 key votes on a wide range of social and economic

issues, both domestic and international…[to provide] a basic overall picture of an elected

official’s political position.”13 Meanwhile, COPE measures ideology based on voting records

12 Paxton and Kunovich, “Women’s Political Representation,” 89-91.
13 “ADA Voting Records.” Americans for Democratic Action. https://adaction.org/ada-voting-records/. Emphasis added.

tracked by the American Federation of Labor and Congress of Industrial Organizations (AFL-

CIO): the more a representative votes to strengthen “Social Security and Medicare, freedom to

join a union, workplace safety,” the more “liberal” they are, and the higher their score.14 These

phenomena may compromise the validity of my tests, which I will expand upon later.

Dependent Variable: Percentage of state legislators who are women, by state (1975 – 2016)

This variable is straightforward. Quite simply, it measures the percentage of state

legislators who are women by state, in a given year, between the years of 1975 and 2016. As

seen in the following histogram, the distribution of these percentages is also fairly normal, with

the lowest percentage at 0.70%, the highest percentage at 42%, and the mean at 19.49%.

Units of Observation: state-year

The CSPP has over 700 variables “with observations across the U.S. 50 states and time

(1900 – 2016).”15 In other words, I am working with panel (time-series cross-sectional) data, where the unit of

14 “Legislative Scorecard.” American Federation of Labor and Congress of Industrial Organizations (AFL-CIO). https://aflcio.org/what-unions-do/social-economic-justice/advocacy/scorecard
15 Jordan, Marty P. and Matt Grossmann. 2016. The Correlates of State Policy Project v1.14. East Lansing, MI: Institute for Public Policy and Social Research (IPPSR). http://ippsr.msu.edu/sites/default/files/CorrelatesCodebook.pdf

observation is not the state alone but the state-year. This is important to note

because it raises the possibility of running into autocorrelation errors in my regression models.

Indeed, it would be naïve to assume that the citizen ideology score of a state in one year has no

effect on the citizen ideology score of that same state in the following year. The same goes for

percentage of female state legislators in a given year. Moreover, because the unit of observation

is state-year, not country-year, the results can only be applied in the context of U.S. state

legislatures, not for U.S. national legislatures or for legislatures of other countries.

Control Variables: Income per Capita, Education Level, Lagged IV, Lagged DV

As mentioned, while ideological factors constitute one possible explanation for

women’s representation in legislatures, other explanations include socioeconomics as a causal

factor. In my multivariate regression test, I include variables for the average income per capita

(total personal income in that state divided by total midyear population), percentage of

respondents with a high school diploma or higher, the citizen ideology score for a state the year

before (in other words, a one-year lagged citizen ideology variable), and a one-year lagged

percentage of women in legislature variable. The first two variables are meant to account for

socioeconomic factors. Indeed, we can assume that the higher the average income per capita in a

state, and the larger the percentage of residents with a high school diploma or higher, the better

the socioeconomic circumstances. The idea here is that better socioeconomic circumstances lead

to more female empowerment, which may increase the “supply” of eligible female candidates.16

The lag variables are then meant to mitigate possible autocorrelation errors due to the time-series

nature of both the IV and the DV. The distributions of the control variables are shown in the

following histograms.
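As a hedged illustration of the lagged controls described above, the following R sketch builds a one-year lag within each state on a small made-up state-year panel. The data frame `panel`, its values, and the helper `lag_one` are hypothetical; only the variable name `pctfemaleleg` and the idea of a one-year lag come from this paper. The sketch shifts values explicitly because base R's `stats::lag()`, applied to a plain numeric vector, changes only the time-series attributes and leaves the values in place. It also assumes each state has consecutive years with no gaps.

```r
# Hypothetical toy panel: 2 states x 4 years (values are made up).
panel <- data.frame(
  state = rep(c("AL", "AK"), each = 4),
  year  = rep(2001:2004, times = 2),
  pctfemaleleg = c(8, 9, 10, 11, 20, 21, 22, 23)
)

# Sort so that each state's years are consecutive and ascending.
panel <- panel[order(panel$state, panel$year), ]

# Shift a vector down by one position, padding the first value with NA.
lag_one <- function(x) c(NA, head(x, -1))

# Apply the shift separately within each state so that the first year of
# one state never inherits the last year of another state.
panel$lagpctfemaleleg <- ave(panel$pctfemaleleg, panel$state, FUN = lag_one)

print(panel)
```

The first observed year of every state has no previous year, so its lag is `NA` and that row drops out of any regression that uses the lagged variable.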

16 For more on this theory, see the section in this paper on “Competing Theories.”

Bivariate Hypothesis Test

H0: Citizen Ideology Score does not have an effect on the Percentage of Female Legislators

H1: Citizen Ideology Score has an effect on the Percentage of Female Legislators. As ideology

scores increase, percentage of female legislators either increases or decreases.



As we can see from our regression table, the beta coefficient for Citizen Ideology Score is

0.176. This means that as Citizen Ideology Score increases by 1 point, the Percentage of Female

Legislators increases by 0.176 percentage points. Meanwhile, because the beta coefficient of the constant is

10.370, we can assume that when the Citizen Ideology Score is 0, the estimated Percentage of

Female Legislators is 10.370%. Finally, because the p-value is less than 0.01 for both Citizen

Ideology Score and the constant, we can conclude that both estimates are statistically

significant. As such, we can reject the null hypothesis in favor of the alternative hypothesis.
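To make the interpretation of these coefficients concrete, the following R sketch plugs the reported intercept and slope into the fitted line. The function name `predict_pct` is hypothetical; the coefficients and the minimum, mean, and maximum ideology scores are the ones reported in this paper.

```r
# Coefficients as reported in the bivariate regression output.
intercept <- 10.37029
slope     <- 0.17627

# Predicted percentage of female legislators for a given ideology score.
predict_pct <- function(score) intercept + slope * score

# Minimum, mean, and maximum observed Citizen Ideology Scores.
scores <- c(min = 0.963, mean = 47.838, max = 95.972)
round(predict_pct(scores), 2)

# A 10-point increase in ideology shifts the prediction by 10 * slope
# percentage points, regardless of the starting score.
predict_pct(60) - predict_pct(50)
```

Because the model is linear, the predicted gap between any two states depends only on the difference in their ideology scores, not on where on the scale they sit.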

It is also important to consider how much the Percentage of Female Legislators is

explained by Citizen Ideology Score alone. In order to assess this, we look to the Adjusted R-

squared. At 0.105, the Adjusted R-squared tells us that 10.5% of the variation in the Percentage

of Female Legislators can be explained by the variation in the Citizen Ideology Score. This is a

fairly low percentage; thus, we can assume that Citizen Ideology Score has a positive effect on

the Percentage of Female Legislators, but also a weak one. The low Adjusted R-squared

indicates that many other factors affect the Percentage of Female Legislators; omitting them may

bias the estimated coefficient. In the following section, I conduct a multivariate hypothesis

test and include control variables in order to mitigate omitted variable bias.
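For readers who want to see where the Adjusted R-squared comes from, the following R sketch computes it by hand and compares it with the value that `summary()` reports. The data are simulated stand-ins, not the CSPP data, and the variable names are hypothetical.

```r
set.seed(1)
# Simulated stand-in data (not the CSPP data): a weak linear relationship.
n <- 200
x <- runif(n, 0, 100)
y <- 10 + 0.18 * x + rnorm(n, sd = 8)

fit <- lm(y ~ x)

# R-squared: share of the variance in y accounted for by the fitted values.
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
r2  <- 1 - rss / tss

# Adjusted R-squared penalizes for the number of predictors p.
p      <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2, from_summary = summary(fit)$adj.r.squared)
```

The adjustment matters most when comparing models with different numbers of predictors, as in the move from the bivariate to the multivariate model.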

Multivariate Hypothesis Test

H0: Controlling for average state Income-per-Capita, Percentage of Residents with a High School

Diploma or Higher, and Citizen Ideology Score and Percentage of Female Legislators the year

before, Citizen Ideology Score does not have an effect on the Percentage of Female Legislators.

H1: Accounting for the control variables, Citizen Ideology Score has an effect on Percentage of

Female Legislators. As Citizen Ideology Score increases, Percentage of Female Legislators

either increases or decreases.



From our regression table, we can see that, when controlling for average state Income-

per-Capita, Percentage of Residents with a High School Diploma or Higher, and lagged variables

for both the IV and the DV, the beta coefficient for Citizen Ideology Score decreases from 0.176

to 0.070. Although the inclusion of control variables has caused the effect of Citizen Ideology

Score itself to decrease, the effect of Citizen Ideology Score in estimating Percentage of Female

Legislators is still statistically significant at p < 0.01. Thus, we can reject the null hypothesis

again in favor of the alternative hypothesis.

At the same time, we consider the effects of the average state Income-per-Capita,

Percentage of Residents with a High School Diploma or Higher, and the lagged variables for

Citizen Ideology Score and Percentage of Female Legislators. Because the lagged variable for

Citizen Ideology Score is not statistically significant (p > 0.05), we cannot conclude that it has an effect

on the Percentage of Female Legislators.17 The beta coefficients for average state Income-per-

Capita, Percentage of Residents with a High School Diploma or Higher, and lagged Percentage

of Female Legislators respectively are 0.0003, 0.503, and 0.095, all at a p-value of less than 0.01.

Not only does each of these variables have a positive estimated effect on the Percentage of Female

Legislators, but they are all statistically significant. That is, as each of these variables increases,

so too does the Percentage of Female Legislators. The strength of this model is also much higher,

as the Adjusted R-squared increased from 0.105 to 0.476. This Adjusted R-squared suggests that

47.6% of the variation in the Percentage of Female Legislators is explained jointly by the

variation of Citizen Ideology Score, average state Income-per-Capita, Percentage of Residents with a High School

Diploma or Higher, and lagged variables for the IV and DV. This is a moderately high Adjusted

R-squared, which means that my multivariate model fits the data moderately well overall.

What is rather confusing is the beta coefficient for the constant, -32.995. Not only is this

beta coefficient statistically significant at a p-value of less than 0.01, but it is also negative. As

17 We can also tell from our visualization of the predicted values that Lagged Citizen Ideology is not statistically significant because the 95% confidence interval includes 0, which means that a beta coefficient of 0 (that is, no effect on the Percentage of Female Legislators) cannot be ruled out.

such, the interpretation would be: when Citizen Ideology Score, average state Income-per-

Capita, Percentage of Residents with a High School Diploma or Higher, and the lagged variables

for the IV and DV are all 0, the estimated Percentage of Female Legislators is -32.995%. This,

however, does not make logical sense, as the lowest percentage of female legislators possible is

0%. The most likely explanation is extrapolation: no state-year in the data has an income per capita

or a high school completion rate anywhere near 0, so the constant describes a hypothetical case far

outside the observed range rather than a real state. Centering the predictors would make it interpretable.
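One way to see why a constant can be negative even though the outcome cannot is to compare a raw model with one whose predictors are centered at their means. The following R sketch does this on simulated stand-in data, not the CSPP data; the variable names, ranges, and coefficients are assumptions chosen only so that no observation lies near zero on the predictors.

```r
set.seed(2)
# Simulated stand-in data (not the CSPP data). The predictor ranges loosely
# mimic a setting where no observation is anywhere near zero.
n         <- 300
hsdiploma <- runif(n, 70, 92)        # hypothetical: percent with diploma
income    <- runif(n, 20000, 60000)  # hypothetical: income per capita
pct       <- -30 + 0.5 * hsdiploma + 0.0003 * income + rnorm(n, sd = 3)

# Raw model: the intercept is the prediction at hsdiploma = 0, income = 0,
# a point far outside the observed data.
raw <- lm(pct ~ hsdiploma + income)

# Centered model: the intercept is the prediction at the sample means.
centered <- lm(pct ~ I(hsdiploma - mean(hsdiploma)) + I(income - mean(income)))

coef(raw)[1]       # negative: an extrapolation, not a real state
coef(centered)[1]  # equals mean(pct), a value inside the data
```

Centering changes only the constant; the slopes, their standard errors, and the fit of the model stay the same, so the substantive conclusions are unaffected.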

Diagnostics

When using linear regression models to analyze data, the model must meet several basic

assumptions in order to be considered a good “fit” for the data. However, these assumptions are

sometimes violated. Thus, before I conclude with my overall analysis, I check for some of these

assumptions—as well as other potential factors which may threaten the validity of my tests—by

looking at the following: linearity of relationship, normality of residuals, homoscedasticity vs.

heteroscedasticity of error variance, multicollinearity, and influential points.18

The first assumption of multivariate linear regression models is that there is a linear

relationship between our predictor variables and the dependent variable. To check for this, I

plotted Component-Residual plots for each individual predictor variable and an aggregate

Residuals vs. Fitted plot. In general, a major difference between the residual lines and the

component lines indicates that the predictor variables do not have a linear relationship with the

dependent variable. As we can see from both our aggregate Residuals vs. Fitted plot and the

individual Component-Residual plots, the component lines are quite closely matched to the

18 See Appendix 1 for the relevant tables and graphs for these tests.

residual lines, and the data are quite evenly distributed along the residual lines. Thus, we can

assume that a linear model fits our data well.

The second assumption is that the residuals are normally distributed. In order to check

this, I plotted a Normal Quantile-Quantile plot (Q-Q plot). This is essentially a scatterplot which

plots the standardized residuals against the quantiles expected under a normal distribution.19 If the

residuals are normal, then the points in the scatterplot should form a straight line.20 As we can

see from the Normal Q-Q plot, the line is quite straight and matches the fitted line very well.

Thus, we can assume that our residuals are normally distributed.
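Because reading a Q-Q plot is a visual judgment, a numeric check can complement it. The following R sketch, on simulated stand-in data rather than the CSPP data, extracts the coordinates that the Q-Q plot would draw and runs a Shapiro-Wilk test on the standardized residuals. The Shapiro-Wilk test is an addition for illustration and was not part of the analysis in this paper.

```r
set.seed(3)
# Simulated stand-in data (not the CSPP data).
n <- 200
x <- runif(n, 0, 100)
y <- 10 + 0.2 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

# Standardized residuals, as used on the vertical axis of the Q-Q plot.
std_res <- rstandard(fit)

# Coordinates of the Q-Q plot: theoretical normal quantiles (x) paired with
# the standardized residuals (y), without drawing anything.
qq <- qqnorm(std_res, plot.it = FALSE)

# For normal residuals the two sets of values are almost perfectly correlated.
cor(qq$x, qq$y)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a large p-value is consistent with (but does not prove) normality.
shapiro.test(std_res)
```

With large samples, formal normality tests can reject for deviations too small to matter, so the plot and the test are best read together.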

A third assumption is homoscedasticity of residual error terms: we assume that the

variance of the residuals is the same across the range of fitted values.21 If the

residuals are not spread equally across that range, then the errors are not homoscedastic.22 To

test this, I created a Scale-Location plot, which is a scatterplot of the fitted values against the

square root of the absolute standardized residuals. If there is a distinct pattern in the data, then it is not homoscedastic.23 As

a general rule, a straight line with randomly distributed points is consistent with the assumption

of homoscedasticity.24 In the case of my Scale-Location plot, the line is somewhat straight, but

also somewhat parabolic. That said, the points seem to be randomly scattered, and the line is

more horizontal than it is parabolic; thus, we assume that the error terms are homoscedastic.
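Since the Scale-Location plot here was ambiguous, a numeric check could support the visual reading. The following R sketch computes the studentized (Koenker) form of the Breusch-Pagan test by hand in base R on simulated stand-in data. This test was not part of the analysis in this paper, and the variable names are hypothetical.

```r
set.seed(4)
# Simulated stand-in data (not the CSPP data), homoscedastic by construction.
n  <- 300
x1 <- runif(n, 0, 100)
x2 <- rnorm(n, 50, 10)
y  <- 5 + 0.2 * x1 + 0.3 * x2 + rnorm(n, sd = 4)
fit <- lm(y ~ x1 + x2)

# Auxiliary regression: do the squared residuals depend on the predictors?
aux <- lm(residuals(fit)^2 ~ x1 + x2)

# Studentized (Koenker) Breusch-Pagan statistic: n * R^2 of the auxiliary
# regression, compared with a chi-squared distribution with one degree of
# freedom per predictor.
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p)
```

A small p-value would suggest that the residual variance changes with the predictors, in which case heteroscedasticity-robust standard errors would be a common remedy.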

The final assumption of multivariate linear regression is that there is no multicollinearity.

That is, the predictor variables in my model are not so highly correlated to each other that they

also predict each other to some degree. This is especially important to note because it is possible

19 “Understanding Q-Q Plots.” University of Virginia Library. http://data.library.virginia.edu/understanding-q-q-plots/
20 “Understanding Q-Q Plots.”
21 “Diagnostic Plots.” University of Virginia Library. http://data.library.virginia.edu/diagnostic-plots/
22 “Diagnostic Plots.”
23 “Diagnostic Plots.”
24 “Diagnostic Plots.”

that the average state Income-per-Capita is linearly related to the Percentage of Residents with a

High School Diploma or Higher; after all, the higher the income, the easier it is to pay for higher

education. Even more likely is a linear correlation between the lagged IV and the lagged DV. We

do not want these relationships to affect the variance of the overall model. Thus, to check for

multicollinearity, I generated variance inflation factors (VIFs) for each variable, which estimate

how much the variance of each coefficient is inflated because of collinearity. As a general rule

of thumb, any VIF greater than 10 suggests that there may be multicollinearity. In my model, the

highest VIF is for average state Income-per-Capita at 2.249. As such, it is safe for us to

assume that there is no problematic multicollinearity in our model.
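The VIF for a predictor can be computed directly from its definition: regress that predictor on all the others and take 1 / (1 - R-squared). The following R sketch does this in base R on simulated stand-in data, not the CSPP data. The helper `vif_one` and the simulated values are hypothetical, and the correlation between income and education is built in on purpose.

```r
set.seed(5)
# Simulated stand-in data (not the CSPP data); income and education are
# made to be correlated on purpose.
n         <- 300
hsdiploma <- rnorm(n, 80, 5)
income    <- 1000 * (hsdiploma - 40) + rnorm(n, sd = 4000)
ideology  <- runif(n, 0, 100)
predictors <- data.frame(ideology, income, hsdiploma)

# VIF for one predictor: regress it on all the others and compute 1/(1 - R^2).
vif_one <- function(name, data) {
  others <- setdiff(names(data), name)
  r2 <- summary(lm(reformulate(others, response = name), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(predictors), vif_one, data = predictors)
round(vifs, 3)
```

A VIF of 1 means a predictor is uncorrelated with the others; the correlated pair in this sketch shows values above 1, while the independent predictor stays close to 1.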

Finally, it is important to check for influential points; that is, observations which are

outliers and have high leverage, and thus are influential enough to distort the coefficients in our

linear model. To check this, I plotted a Residuals vs. Leverage chart and checked for any points

that fall in the top right or bottom right corner beyond the Cook’s distance line; any points that

fall in those regions are said to have “high leverage or potential for influencing [the] model.”25 In

my Residuals vs. Leverage plot, none of the observations fall in the top right or bottom right

corner outside of the Cook’s distance line. Thus, we can assume that there are no observations

which disproportionately influence the regression results.
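The visual check above can be paired with the Cook's distance values themselves. The following R sketch, on simulated stand-in data with one deliberately planted outlier, uses `cooks.distance()` from base R to list observations that exceed two common rule-of-thumb cutoffs. The cutoffs are conventions rather than formal tests, and the data are not from the CSPP.

```r
set.seed(6)
# Simulated stand-in data (not the CSPP data), plus one planted point with
# both an extreme x value and a large residual.
n <- 100
x <- c(runif(n, 0, 100), 300)
y <- c(10 + 0.2 * x[1:n] + rnorm(n, sd = 3), 0)
fit <- lm(y ~ x)

# Cook's distance combines each observation's leverage and residual.
d <- cooks.distance(fit)

# Two common rule-of-thumb cutoffs (conventions, not formal tests).
which(d > 1)
which(d > 4 / length(d))

# Index of the single most influential observation.
which.max(d)
```

If any observation were flagged in the real data, refitting the model without it and comparing the coefficients would show how much it drives the results.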

Overall, my model seems to pass the respective tests for checking linearity, normality of

residuals, homoscedasticity of error variance, and multicollinearity; thus, all of these

assumptions hold for my model. Similarly, my regression results do not seem to be distorted by

any influential points. That said, it is important to remember that other than the multicollinearity

25 “R Tutorial: How to use Diagnostic Plots for Regression Models.” http://analyticspro.org/2016/03/07/r-tutorial-how-to-use-diagnostic-plots-for-regression-models/. Cook’s distance (or Cook’s D) is a “measure that combines the information of leverage and residual of the observation,” and can be used to determine influential points. See also: “Robust Regression: R Data Analysis Examples.” Institute for Digital Research and Education, University of California – Los Angeles, https://stats.idre.ucla.edu/r/dae/robust-regression/

and influential points tests, all of the tests are based purely on visual interpretation and

determining patterns. That is, interpreting these graphs can be quite subjective—as in the case of

our homoscedasticity test.26 Ultimately, however, the diagnostics tests seem to show strong

evidence of a linear regression model fitting the data well.

Conclusion

What does all of this suggest? Is there a relationship between Citizen Ideology Score and

Percentage of Female Legislators? Does the average ideology of the people in a given state affect

the percentage of females in their state legislature? Even more broadly, does our ideology have

an effect on what genders we vote for? Are there any potential problems with the model and/or

potential threats to causal validity (outside of the diagnostics tests) that I could not eliminate?

Are there any areas that require further exploration? And finally, why does this matter?

According to my bivariate and multivariate linear regression models, Citizen Ideology

Score has a positive effect on the Percentage of Female Legislators to a statistically significant

degree (p < 0.01). Indeed, even when accounting for other statistically significant causal factors,

such as per capita income, education level, and the Percentage of Female Legislators one year

prior, the estimated effect of Citizen Ideology Score on Percentage of Female Legislators is still

positive and statistically significant, albeit a small effect. Thus, we can say that U.S. states with a

more liberal constituency on average tend to vote more women into their state legislatures, even

when accounting for other causal factors like socioeconomic status and the percentage of female

legislators in the previous year.

However, as mentioned briefly earlier in this paper, my analysis runs into two major

threats to causal validity: endogeneity and a lack of distinction between politics and ideology. Of

26 The influential points test is also based on visual interpretation; however, the Cook’s distance line makes it easier to have a more objective, universal interpretation.

course, since less than 50% of the variation in Percentage of Female Legislators can be explained

by the variation in the predictor variables in my multivariate regression model, there is a

possibility that my model still suffers from omitted variable bias. Even more important is the

circular logic of my model. That is, I argue that more liberal ideologies lead to more positive

perceptions of women, which then affect the percentage of female legislators, yet at the same

time, my ideology variable is measured based on the ideology of the representatives—and

obviously, female legislators will have a higher opinion of women, which will lead to a more

liberal ideology score for a state. Thus, this logic is circular. Of course, the only way to

really avoid this endogeneity problem would be to measure the ideology of each individual

citizen in a state and then average that out, instead of using the representatives as proxy

measures. This is a possible area of future study.

The other threat to the validity of my test is the fact that my ideology variable conflates

political preference with ideology. That is, the Citizen Ideology Score assumes that people who

are pro-Social Security, pro-Medicare, pro-gun control, pro-abortion (essentially anyone who

considers themselves a Liberal on the political spectrum) are thus liberal overall. In the context of

my theory, this would suggest that anyone who considers themselves a Conservative is thus less

inclined to vote women into office; and yet, there are plenty of women in politics who identify as

Conservatives, and many Conservatives who believe in female representation in politics. The

ultimate problem here is that there is no concrete and specific definition for ideology, so it is

incredibly difficult to measure. Thus, coming up with a more concrete and specific way to define

and operationalize ideology is another area of further exploration.

One final thing to consider would be differences across time periods. Although my model

includes lagged variables for the independent variable and the dependent variable one year prior,

it still does not account for the fact that these observations fluctuate over time, as opposed to

increasing or decreasing in a steady linear fashion over time. In other words, it does not account

for the possibility that some citizen ideology scores and some percentages of female legislators

might be higher in, say, 2006 than in 2010. Thus, we cannot say that: “as a U.S. state’s average

citizen ideology becomes more liberal, the percentage of women in its state legislature

increases.” However, it would certainly be interesting to consider how time might affect the

theory; that is, whether or not certain time periods experienced higher citizen ideology scores

and higher percentages of women in state legislature, and why.

All things considered, however, and especially given the results of my regression models

and diagnostics tests, we can say with some confidence that higher citizen ideology scores are associated with

higher percentages of female legislators, even when controlling for socioeconomic factors like

income and education level. In other words, U.S. states with a more liberal constituency are more

likely to vote more women into their state legislature. This is important to note because it

suggests that our ideals, our values, the ways we think, the ways we were raised even, affect the

level of female representation in politics. If we wish to see a greater level of female

representation in politics in the future, then we should certainly consider ideological factors, not

just socioeconomic ones.27

27 All of the relevant coding for this project can be seen in Appendix 2.

BIBLIOGRAPHY

“ADA Voting Records.” Americans for Democratic Action. https://adaction.org/ada-voting-records/

“Diagnostic Plots.” University of Virginia Library. http://data.library.virginia.edu/diagnostic-plots/

“Legislative Scorecard.” American Federation of Labor and Congress of Industrial Organizations (AFL-CIO). https://aflcio.org/what-unions-do/social-economic-justice/advocacy/scorecard

“R Tutorial: How to use Diagnostic Plots for Regression Models.” http://analyticspro.org/2016/03/07/r-tutorial-how-to-use-diagnostic-plots-for-regression-models/

“Robust Regression: R Data Analysis Examples.” Institute for Digital Research and Education, University of California – Los Angeles, https://stats.idre.ucla.edu/r/dae/robust-regression/

“Understanding Q-Q Plots.” University of Virginia Library. http://data.library.virginia.edu/understanding-q-q-plots/

Berry, William D., Evan J. Ringquist, Richard C. Fording, and Russell I. Hanson. “Measuring Citizen and Government Ideology in the American States, 1960-93,” American Journal of Political Science 42:1 (January 1998): 327 – 348. https://www.jstor.org/stable/pdf/2991759.pdf?refreqid=excelsior:6047b68e3bbde6c3420943cee0689d9c

Correlates of State Policy. Michigan State University, Institute for Public Policy and Social Research (IPPSR). http://ippsr.msu.edu/public-policy/correlates-state-policy

Jordan, Marty P. and Matt Grossmann. 2016. The Correlates of State Policy Project v1.14. East Lansing, MI: Institute for Public Policy and Social Research (IPPSR). http://ippsr.msu.edu/sites/default/files/CorrelatesCodebook.pdf

Paxton, Pamela Marie, and Sheri Kunovich. “Women’s Political Representation: The Importance of Ideology,” Social Forces 82:1 (September 2003): 87 – 113. http://muse.jhu.edu/article/47842/pdf

Wängnerud, Lena. “Women in Parliaments: Descriptive and Substantive Representation,” Annual Review of Political Science 12 (2009): 51 – 69. https://www.annualreviews.org/doi/pdf/10.1146/annurev.polisci.11.053106.123839

APPENDIX 1: DIAGNOSTIC TESTS

Linearity of Relationship
Residuals vs. Fitted Plot (Aggregate of Predictor Variables)

Component-Residual Plots (Linearity of Individual Predictor Variables)



Normality of Residuals (Normal Quantile-Quantile Plot)

Homoscedasticity vs. Heteroscedasticity



Multicollinearity
Variance Inflation Factors for Each Predictor Variable
   citizenideology         incomepcap          hsdiploma lagcitizenideology    lagpctfemaleleg
          1.065776           2.249091           1.486476           1.097590           1.726666

Influential Points

APPENDIX 2: CODING

Coding (with Outputs)


> # Loading ggplot2 (needed for the histograms below) and reading in the dataset
> library(ggplot2)
> cspdataset <- read.csv("C:/Users/janieyangbang/Desktop/College/Senior Year/~Senior Spring~ LIT AF/GOVT 201 Analysis of Political Data I/correlatesofstatepolicyprojectv1_14.csv")
>
> #Renaming and summarizing DV (percentage of female legislators)
> pctfemaleleg <- cspdataset$pctfemaleleg
> lagpctfemaleleg <- lag(pctfemaleleg, k = 1)
> summary(pctfemaleleg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.70   13.20   19.00   19.49   25.60   42.00    4121
> var(pctfemaleleg, na.rm = TRUE)
[1] 71.22083
> sd(pctfemaleleg, na.rm = TRUE)
[1] 8.439243
> IQR(pctfemaleleg, na.rm = TRUE)
[1] 12.4
> ggplot(data = cspdataset, aes(x = pctfemaleleg)) + geom_histogram(binwidth
= 3) + labs(title = "Distribution of Percentage of Female State Legislators",
x = "Percentage", y = "Number of Observations")
Warning message:
Removed 4121 rows containing non-finite values (stat_bin).
>
> #Renaming and summarizing IV (measure of citizen ideology)
> citizenideology <- cspdataset$citi6013
> lagcitizenideology <- lag(citizenideology, k = 1)
> summary(citizenideology)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.963  37.026  47.891  47.838  58.682  95.972    3318
> var(citizenideology, na.rm = TRUE)
[1] 274.4515
> sd(citizenideology, na.rm = TRUE)
[1] 16.56658
> IQR(citizenideology, na.rm = TRUE)
[1] 21.65585
> ggplot(data = cspdataset, aes(x = citizenideology)) +
geom_histogram(binwidth = 3) + labs(title = "Distribution of Citizen Ideology
Scores", x = "Score", y = "Number of Observations")
Warning message:
Removed 3318 rows containing non-finite values (stat_bin).
>
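
The lagged terms above come from a single lag() call on each whole column. Which lag() ran depends on the packages attached at the time, and the transcript does not show that. Base R's stats::lag() only shifts the time index of a series and leaves the values of a plain vector in place. A row-shifting lag applied to a stacked state-year file would carry one state's value into another state's first row. The sketch below builds a within-state lag in base R, assuming the data are stacked as one row per state-year. The toy panel and its column names (state, year, value) are hypothetical stand-ins, not CSPP variable names.

```r
# Toy state-year panel; column names are hypothetical stand-ins.
panel <- data.frame(
  state = rep(c("A", "B"), each = 3),
  year  = rep(2001:2003, times = 2),
  value = c(10, 11, 12, 20, 21, 22)
)

# Sort so that each state's rows are in year order.
panel <- panel[order(panel$state, panel$year), ]

# Shift values down by one row within each state; the first year gets NA.
lag_one <- function(v) c(NA, v[-length(v)])
panel$value_lag <- ave(panel$value, panel$state, FUN = lag_one)

print(panel)
```

Storing the lag as a column of the data frame also keeps it aligned with the other variables when lm() drops rows with missing values.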
> #Running a bivariate linear regression on the percentage of female
legislators by citizen ideology
> reg1 <- lm(pctfemaleleg ~ citizenideology, data = cspdataset)
> summary(reg1)

Call:
lm(formula = pctfemaleleg ~ citizenideology, data = cspdataset)

Residuals:
     Min       1Q   Median       3Q      Max
-19.9834  -5.9796  -0.1783   5.7894  23.5963

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     10.37029    0.63464   16.34   <2e-16 ***
citizenideology  0.17627    0.01227   14.37   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.967 on 1745 degrees of freedom
  (4271 observations deleted due to missingness)
Multiple R-squared: 0.1058, Adjusted R-squared: 0.1052
F-statistic: 206.4 on 1 and 1745 DF, p-value: < 2.2e-16

> stargazer(reg1, type="text", dep.var.labels=c("Percentage of Female
Legislators"), covariate.labels=c("Citizen Ideology Score"),
out="models.txt")

=======================================================
                           Dependent variable:
                     --------------------------------
                     Percentage of Female Legislators
-------------------------------------------------------
Citizen Ideology Score           0.176***
                                 (0.012)

Constant                        10.370***
                                 (0.635)

-------------------------------------------------------
Observations                      1,747
R2                                0.106
Adjusted R2                       0.105
Residual Std. Error         7.967 (df = 1745)
F Statistic             206.364*** (df = 1; 1745)
=======================================================
Note:                       *p<0.1; **p<0.05; ***p<0.01
> plot(citizenideology, pctfemaleleg, main = "Percentage of Female
Legislators by Citizen Ideology Score", xlab = "Citizen Ideology Score", ylab
= "Percentage of Female Legislators")
> abline(lm(pctfemaleleg ~ citizenideology))
>
> #Running a multivariate linear regression on the percentage of female
legislators by citizen ideology,
> #controlling for average income per capita, percent of residents with a
high school diploma or higher,
> #and lagged IV and DV
> reg2 <- lm(pctfemaleleg ~ citizenideology + incomepcap + hsdiploma +
lagcitizenideology + lagpctfemaleleg, data = cspdataset)
> summary(reg2)

Call:
lm(formula = pctfemaleleg ~ citizenideology + incomepcap + hsdiploma +
    lagcitizenideology + lagpctfemaleleg, data = cspdataset)

Residuals:
     Min       1Q   Median       3Q      Max
-14.8428  -4.0661  -0.6256   3.7721  19.9827

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)        -3.300e+01  2.238e+00 -14.743  < 2e-16 ***
citizenideology     7.004e-02  1.102e-02   6.355 2.84e-10 ***
incomepcap          2.689e-04  2.772e-05   9.699  < 2e-16 ***
hsdiploma           5.029e-01  3.037e-02  16.556  < 2e-16 ***
lagcitizenideology  1.486e-02  1.139e-02   1.304 0.192297
lagpctfemaleleg     9.506e-02  2.584e-02   3.679 0.000244 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.064 on 1358 degrees of freedom
  (4654 observations deleted due to missingness)
Multiple R-squared: 0.4781, Adjusted R-squared: 0.4762
F-statistic: 248.8 on 5 and 1358 DF, p-value: < 2.2e-16

>
> #Creating confidence intervals for multivariate beta coefficients,
visualizing predicted values
> ci_citizenideology <- coef(reg2)[2] + c(-1, 1) * se.coef(reg2)[2] * 1.96
> ci_citizenideology_dataframe <- data.frame(est = coef(reg2)[2],
+ lb = ci_citizenideology[1],
+ ub = ci_citizenideology[2],
+ model = "Citizen Ideology Score")
> ci_incomepcap <- coef(reg2)[3] + c(-1, 1) * se.coef(reg2)[3] * 1.96
> ci_incomepcap_dataframe <- data.frame(est = coef(reg2)[3],
+ lb = ci_incomepcap[1],
+ ub = ci_incomepcap[2],
+ model = "Income-per-Capita")
> ci_hsdiploma <- coef(reg2)[4] + c(-1, 1) * se.coef(reg2)[4] * 1.96
> ci_hsdiploma_dataframe <- data.frame(est = coef(reg2)[4],
+ lb = ci_hsdiploma[1],
+ ub = ci_hsdiploma[2],
+ model = "High School Diploma or
Higher")
> ci_lagcitizenideology <- coef(reg2)[5] + c(-1, 1) * se.coef(reg2)[5] * 1.96
> ci_lagcitizenideology_dataframe <- data.frame(est = coef(reg2)[5],
+ lb = ci_lagcitizenideology[1],
+ ub = ci_lagcitizenideology[2],
+ model = "Lag Citizen Ideology")
> ci_lagpctfemaleleg <- coef(reg2)[6] + c(-1, 1) * se.coef(reg2)[6] * 1.96
> ci_lagpctfemaleleg_dataframe <- data.frame(est = coef(reg2)[6],
+ lb = ci_lagpctfemaleleg[1],
+ ub = ci_lagpctfemaleleg[2],
+ model = "Lag Percent Female
Legislators")
> est <- rbind(ci_citizenideology_dataframe, ci_incomepcap_dataframe,
ci_hsdiploma_dataframe, ci_lagcitizenideology_dataframe,
ci_lagpctfemaleleg_dataframe)
> ggplot(est, aes(x = model, y = est)) +
+ geom_point() +
+ geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1) +
+ geom_hline(yintercept = 0, lty = 2, color = "red") +
+ labs(title = "Predicted Values for Multivariate Regression Beta
Coefficients (95% Confidence Interval)", x = "", y = "Predicted Values")
>
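
The intervals above are each built as the estimate plus or minus 1.96 standard errors, which is a normal approximation. The sketch below compares that construction with base R's confint(), which uses the t distribution with the model's residual degrees of freedom. The data and the variable names x, z, and y are simulated and hypothetical. The standard errors are taken from vcov() so that the example needs no add-on package.

```r
# Simulated stand-in data; variable names are hypothetical.
set.seed(2)
n <- 100
x <- rnorm(n)
z <- rnorm(n)
y <- 1 + 0.5 * x - 0.3 * z + rnorm(n)
fit <- lm(y ~ x + z)

# Normal-approximation interval: estimate +/- 1.96 * standard error.
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))   # standard errors from the coefficient covariance matrix
ci_normal <- cbind(lb = est - 1.96 * se, ub = est + 1.96 * se)

# t-based interval from base R; uses qt(0.975, df) in place of 1.96.
ci_t <- confint(fit, level = 0.95)

print(ci_normal)
print(ci_t)
```

With more than a thousand residual degrees of freedom, as in reg2, the two constructions should agree to several decimal places. Building all rows in one matrix also avoids repeating the same block once per coefficient.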
> #Creating a multivariate regression table
> stargazer(reg2, type="text", dep.var.labels=c("Percentage of Female
Legislators"), covariate.labels=c("Citizen Ideology Score", "Income per
Capita", "Percentage of Residents with a High School Diploma or Higher",
"Lagged Citizen Ideology Score", "Lagged Percentage of Female Legislators"),
out="models.txt")

>
> #Visualizing the distribution of each predictor variable in multivariate
regression
> ggplot(data = cspdataset, aes(x = incomepcap)) + geom_histogram(binwidth =
300) + labs(title = "Distribution of Average Income per Capita", x = "Average
Income per Capita", y = "Number of Observations")
Warning message:
Removed 1878 rows containing non-finite values (stat_bin).
> ggplot(data = cspdataset, aes(x = hsdiploma)) + geom_histogram(binwidth =
1) + labs(title = "Distribution of Percentage of Residents with a High School
Diploma or Higher", x = "Percentage of Residents with a High School Diploma
or Higher", y = "Number of Observations")
Warning message:
Removed 4406 rows containing non-finite values (stat_bin).
> ggplot(data = cspdataset, aes(x = lagcitizenideology)) +
geom_histogram(binwidth = 1) + labs(title = "Distribution of Lagged Citizen
Ideology", x = "Lagged Citizen Ideology", y = "Number of Observations")
Warning message:
Removed 3318 rows containing non-finite values (stat_bin).
> ggplot(data = cspdataset, aes(x = lagpctfemaleleg)) +
geom_histogram(binwidth = 1) + labs(title = "Distribution of Lagged
Percentage of Female Legislators", x = "Lagged Percentage of Female
Legislators", y = "Number of Observations")
Warning message:
Removed 4121 rows containing non-finite values (stat_bin).
>
> #Diagnostics tests: Component-Residual, Multicollinearity, Linearity,
Normality, Homoscedasticity, Influential Points
> crPlots(reg2)
> vif(reg2)

> plot(reg2)
> gvlmareg2 <- gvlma(reg2)
> summary(gvlmareg2)
> #Not exactly sure what the results of this gvlma mean; it seems that most
of the assumptions are NOT met,
> #which is the opposite of my results from the other tests. Since I don't
want to get ahead of myself,
> #I'm going to stick with my crPlots, vif, and plot diagnostic tests.
However, I do want to note that this
> #inconsistency is both interesting and confusing...
>
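
One possible reason for the mismatch noted above is that formal tests and diagnostic plots answer different questions. With well over a thousand observations, a test can reject an assumption on a departure that looks minor in a plot. The sketch below runs two individual checks in base R so each assumption can be read on its own: a Shapiro-Wilk test for normality of residuals, and a Breusch-Pagan-style auxiliary regression for constant variance. This is not the gvlma procedure. The data are simulated with non-constant error variance on purpose, and the names x, z, and y are hypothetical.

```r
# Simulated stand-in data; variable names are hypothetical.
set.seed(3)
n <- 300
x <- runif(n, 1, 10)
z <- rnorm(n)
# Error spread grows with x, so this toy model is heteroscedastic by design.
y <- 2 + 0.5 * x + 0.3 * z + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x + z)

# Normality of residuals: Shapiro-Wilk test (null = residuals are normal).
sw <- shapiro.test(residuals(fit))

# Constant variance: regress the squared residuals on the predictors.
# n * R^2 from that regression is compared with a chi-squared distribution
# whose degrees of freedom equal the number of predictors (2 here).
aux     <- lm(residuals(fit)^2 ~ x + z)
lm_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(lm_stat, df = 2, lower.tail = FALSE)

cat("Shapiro-Wilk p-value:", sw$p.value, "\n")
cat("Breusch-Pagan-style p-value:", bp_p, "\n")
```

A small p-value on the second check points to heteroscedasticity, which is the case here by construction.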
> #Running a multivariate linear regression on the percentage of female
legislators by citizen ideology,
> #controlling for average income per capita and percent of respondents with
a high school diploma or higher,
> #with interactions with female population
> pctfemaleleg_by_citizenideology_incomeppcap_hsdiploma_withinteraction <-
+ lm(pctfemaleleg ~ citizenideology + incomepcap + hsdiploma + popfemale +
incomepcap*popfemale + hsdiploma*popfemale, data = cspdataset)
> summary(pctfemaleleg_by_citizenideology_incomeppcap_hsdiploma_withinteraction)

Call:
lm(formula = pctfemaleleg ~ citizenideology + incomepcap + hsdiploma +
popfemale + incomepcap * popfemale + hsdiploma * popfemale,
data = cspdataset)

Residuals:
     Min       1Q   Median       3Q      Max
-14.3383  -4.3692   0.1633   4.2516  17.8847

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)          -2.569e+01  4.958e+00  -5.182 3.01e-07 ***
citizenideology       1.477e-01  1.934e-02   7.637 8.98e-14 ***
incomepcap           -1.580e-04  7.663e-05  -2.062   0.0396 *
hsdiploma             5.527e-01  7.109e-02   7.774 3.39e-14 ***
popfemale             1.074e-07  1.681e-06   0.064   0.9491
incomepcap:popfemale  5.552e-12  1.848e-11   0.300   0.7640
hsdiploma:popfemale  -2.473e-09  2.507e-08  -0.099   0.9215
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.211 on 591 degrees of freedom
  (5420 observations deleted due to missingness)
Multiple R-squared: 0.2842, Adjusted R-squared: 0.2769
F-statistic: 39.11 on 6 and 591 DF, p-value: < 2.2e-16

> #Interactions are NOT statistically significant! Don't need to include in
paper
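
The conclusion above rests on the individual t-tests for each interaction term. A joint F-test, which asks whether all of the interaction coefficients are zero together, is a common complement. The sketch below runs one with base R's anova() on simulated data that contain no true interaction. The variable names x, w, m, and y are hypothetical.

```r
# Simulated stand-in data with no true interaction; names are hypothetical.
set.seed(4)
n <- 250
x <- rnorm(n)
w <- rnorm(n)
m <- rnorm(n)
y <- 1 + 0.4 * x + 0.2 * w + 0.1 * m + rnorm(n)

# Nested models: the full model adds the two interaction terms.
reduced <- lm(y ~ x + w + m)
full    <- lm(y ~ x + w + m + x:m + w:m)

# F-test of H0: both interaction coefficients are zero.
joint <- anova(reduced, full)
print(joint)
p_joint <- joint[["Pr(>F)"]][2]
```

Both models need to be fit on the same rows for this comparison to be valid. With real data that have missing values, that means subsetting to the complete cases of the full model first.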
