
Newsletter

Researchers Corner

Volume 4 Issue 8 August 2012

Correlation Research: 1. Relation between Qualitative/Categorical Attributes


The human brain has evolved to be highly sensitive to regular patterns, which leads to sweeping conclusions from even limited evidence. In other words, human memory is strongly associative and tends to relate many things in life, including obviously unrelated things like beauty and brains. Hunches about relationships are rampant in day-to-day life, as when office-goers feel that it always rains around 5 pm, just when they have to leave the office. This kind of relationship, association or dependence is called correlation when we are also able to measure the degree of dependence. Correlation studies are plentiful in social research. Correlation, regression and causal analyses are part of the inferential statistical analysis of data.

Recall from the March 2011 issue the four basic types of data: nominal and ordinal data are attributes (qualitative, categorical variables), while interval and ratio data are quantitative variables. Research itself, based on these criteria, is often divided into qualitative research (for example, observational research, ex post facto research, contingency table research, case study, etc.) and quantitative research. As far as qualitative research is concerned, we look for association between attributes (also called correlation of attributes or relational research). Such research does not determine the direction and degree/strength of the relation. Unlike experimental research, it does not involve manipulating causal variables for causal inference, finding the combined effect of factors or controlling extraneous variables. However, like ex post facto research, it helps to identify relationships that can be further subjected to cause-effect tests in subsequent experimentation. Contingency table research, on the other hand, examines whether or not observations distribute systematically in a table. Since there is a chance of misjudgment in examining the table, one resorts to tests like the chi-square test to establish the significance of the relationship. In this part let us examine three popular techniques of association of attributes.

1. Cross Tabulation
2. Yule's Coefficient of Association
3. Chi-square test

Table of Observed Frequencies

          Like   Dislike   Total
Men        55      60       115
Women      23      33        56
Total      78      93       171

1. Cross Tabulation is useful for finding relationships in nominal data, but it is not a powerful measure or test. In this method, we classify each variable into two or more categories/attributes. Beginning with a two-way table, we see whether there is an interrelationship between the variables.

Take a simple example of attributes like gender versus liking or disliking of a particular product, elicited from 171 persons and presented in a 2X2 table. Even a layman can interpret this data as 55 out of 115 (47.8%) men, as against 23 out of 56 (41.1%) women, liking the product, and may even conclude that this product is more liked by men than by women. That is not wrong, but a true researcher would ask whether such an association between gender and like/dislike is statistically valid or significant. For this purpose, apart from cross tabulation, the two most frequently applied techniques are Yule's Coefficient of Association and the Chi-square Test. It may be noted that ordinal data allow for rank order correlation also. For example, if like or dislike is presented on a five-point scale, it becomes ordinal data, and we shall see this kind of correlation in a future issue.

2. Yule's Coefficient of Association assesses the strength of association between two attributes given in a 2X2 table. Note that tables larger than 2X2 have to be reduced to 2X2 by combining some classes. Taking the above example, this technique requires us to calculate two probabilities, $p_1$ and $p_2$, where $p_1$ is the probability of attribute B (Like) within the universe (total) of attribute A (men), and $p_2$ is the probability of attribute B (Like) outside the universe of attribute A (that is, among women). In our example, $p_1 = 55/115 = 0.4783$ and $p_2 = 23/56 = 0.4107$. Then $p_1 > p_2$ implies association between the two attributes. The strength of association is given by the difference $p_1 - p_2$. In the above example, since $p_1 > p_2$ there is association, but the strength [$p_1 - p_2 = 0.0675$] of the association is negligible.
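The two proportions and their difference can be checked with a few lines of Python. This is only a minimal sketch built on the counts from the 2X2 table; the variable and function names are illustrative assumptions and do not come from the article.

```python
# Observed counts from the 2X2 table (gender versus like/dislike).
observed = {
    "Men": {"Like": 55, "Dislike": 60},
    "Women": {"Like": 23, "Dislike": 33},
}


def proportion_liking(row):
    """Share of a row's respondents who liked the product."""
    return row["Like"] / (row["Like"] + row["Dislike"])


p_men = proportion_liking(observed["Men"])      # 55 / 115
p_women = proportion_liking(observed["Women"])  # 23 / 56
difference = p_men - p_women                    # strength of association

print(f"Men: {p_men:.4f}, Women: {p_women:.4f}, difference: {difference:.4f}")
```

A difference this close to zero is what the text describes as a negligible association.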

3. Chi-square test is an important non-parametric test for the significance of association (as well as for (i) testing hypotheses regarding goodness of fit and (ii) homogeneity or significance of population variance). In the Nov 2011 issue, we noted that non-parametric tests assume nominal or ordinal data, are much less cumbersome computationally, and are most useful when dealing with qualitative variables and with data that can be classified in order (or as ranks). As a non-parametric test, chi-square is used when responses are classified into two mutually exclusive classes like favor/not favor, like/dislike, etc. (the chi-square test can also be applied to tables larger than 2X2). It is used to decide whether the observed frequencies of occurrence of a qualitative attribute differ significantly from the expected (or theoretical) frequencies. In other words, the chi-square statistic is a measure of the discrepancy between observed and expected frequencies: the larger the value, the greater the discrepancy.

$\chi^2 = \sum (O_{ij} - E_{ij})^2 / E_{ij}$, where $O_{ij}$ is the observed frequency of the cell in the $i$th row and $j$th column and $E_{ij}$ is the expected frequency of the cell in the $i$th row and $j$th column. The formula for calculating the expected frequency is

Expected frequency of any cell = (total for the row of that cell x total for the column of that cell) / Grand total

Degrees of freedom, df = (c-1)(r-1), where c is the number of columns and r is the number of rows. If the calculated value of $\chi^2$ is equal to or more than that tabulated for the given df, the association is significant.

In the above example, the observed frequencies are already given, and the expected frequencies required to work out chi-square are in the following table.

Table of Expected Frequencies

          Like   Dislike   Total
Men       52.5    62.5      115
Women     25.5    30.5       56
Total      78      93       171

Expected frequency for cell 1 = (total for the row x total for the column) / Grand total = (78 X 115) / 171 = 52.5
$\chi^2$ (for cell 1) = $(O_{ij} - E_{ij})^2 / E_{ij}$ = $(55 - 52.5)^2 / 52.5$ = 0.119

The sum of all chi-square values as per the formula $\chi^2 = \sum (O_{ij} - E_{ij})^2 / E_{ij}$, after computing for each of the cells, is 0.119 + 0.245 + 0.100 + 0.205 = 0.669. The tabulated value of $\chi^2$ at df = 1 and a confidence level of 0.95 (that is, a significance level of 0.05) is 3.84. This can be had from the standard table in books, and from the same table you may note that for this $\chi^2$ value to be accepted we would need to water down the significance level to 0.50. Hence there is no significant difference between the observed and expected data in this example. In other words, our naked-eye observation from the cross tabulation, that this product is more liked by men than by women, is not significant enough to accept as per the Chi-square test and Yule's association (see the October 2011 issue for more about confidence level and significance level).
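The chi-square calculation for this example can be sketched in plain Python as below. The function name `chi_square` and the constant name are illustrative assumptions, not part of the article. Because the sketch keeps the expected frequencies unrounded, it arrives at roughly 0.69 rather than the 0.669 obtained from expected values rounded to one decimal place; the conclusion is the same.

```python
# Observed frequencies: rows are Men, Women; columns are Like, Dislike.
observed = [[55, 60], [23, 33]]


def chi_square(table):
    """Return (chi-square statistic, expected frequencies, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected frequency of a cell = row total x column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, expected, df


# Tabulated chi-square value for df = 1 at the 0.05 significance level.
CRITICAL_VALUE_DF1_05 = 3.84

statistic, expected, df = chi_square(observed)
significant = statistic >= CRITICAL_VALUE_DF1_05
print(f"chi-square = {statistic:.3f}, df = {df}, significant: {significant}")
```

The same function works for tables larger than 2X2, but the critical value must then be looked up for the corresponding degrees of freedom.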

As an overused technique, the chi-square test has other limitations, such as the assumptions that observations are random, items in the sample are independent, the relation is linear, no cell contains a frequency of less than five, and the overall number of items is reasonably large. Suppose we had an example where the chi-square statistic is significant; even then, the chi-square value tells us nothing about the nature of the association (whether it is positive or negative, how strong or weak it is, etc.) except that there is an association. However, there are measures of the strength of association between categorical variables (and to know whether the association is positive or negative).

1. Coefficient of Contingency (or Phi Coefficient, $\phi$) can be worked out from $\chi^2$ and interpreted similarly to the Pearson correlation coefficient: $\phi = \sqrt{\chi^2 / (\chi^2 + N)}$, where $N$ is the total number of observations.

2. Correlation of attributes is given by $r = \sqrt{\chi^2 / (N(k-1))}$ for a kXk table, and for a 2X2 table it is $r = \sqrt{\chi^2 / N}$.

3. Comparing proportions: We have already seen under Yule's coefficient that in our example the proportion of men liking the product is 55/115 = 0.478 and the proportion of women liking the product is 23/56 = 0.411. So the difference in proportions is 0.478 - 0.411 = 0.067, suggesting a very low or negligible association.

4. Odds Ratio refers to the ratio of the odds from the two rows of a 2X2 table where the attributes are binary, like high and low (opposites/extremes). In the following table of income level and education level, graduates and postgraduates can be treated as low and high levels of education.

                    Income level
Education level     Low     High
Graduate             a       b
Postgraduate         c       d

For graduates, the estimated probability of low income is a/(a + b) and the estimated probability of high income is b/(a + b). Therefore, for graduates, the odds favouring low income are given by odds = probability of success / probability of failure = [a/(a + b)] / [b/(a + b)] = a/b = # successes / # failures, where success represents low income. Similarly, for postgraduates, the odds favouring low income can be worked out to be c/d. The ratio of the odds from the two rows is the odds ratio = [a/b] / [c/d] = ad / bc. Suppose the odds ratio = 2; this indicates that the odds in favour of having a low income job if a person is a graduate are twice the odds in favour of having a low income job if a person is a postgraduate.

M S Sridhar
sridhar@informindia.co.in
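The four measures of strength listed above can be tried out on the gender versus like/dislike example as follows. This is a hedged sketch only: the variable names are assumptions, and applying the odds ratio to the gender table (rather than to the income and education table, for which the article gives no counts) is an illustration added here, not part of the original discussion.

```python
import math

# Values from the worked example in the article.
chi_sq = 0.669   # chi-square statistic as computed in the article
n = 171          # total number of respondents
k = 2            # the table is 2X2
a, b = 55, 60    # men: like, dislike
c, d = 23, 33    # women: like, dislike

# 1. Coefficient of contingency: sqrt(chi^2 / (chi^2 + N)).
contingency = math.sqrt(chi_sq / (chi_sq + n))

# 2. Correlation of attributes: sqrt(chi^2 / (N (k - 1))).
attribute_r = math.sqrt(chi_sq / (n * (k - 1)))

# 3. Difference in proportions of men and women liking the product.
proportion_difference = a / (a + b) - c / (c + d)

# 4. Odds ratio: (a / b) / (c / d) = ad / bc.
odds_ratio = (a * d) / (b * c)

print(f"contingency coefficient = {contingency:.3f}")
print(f"correlation of attributes = {attribute_r:.3f}")
print(f"difference in proportions = {proportion_difference:.3f}")
print(f"odds ratio = {odds_ratio:.3f}")
```

All four values point the same way for this table: the first three are close to zero and the odds ratio is only moderately above 1, in line with the weak association found earlier.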
