
This Is The Architecture Of Our Practice

An overview of how Rock Creek Analytics designs and conducts opinion research for the Internet, from sampling to analysis.
Rock Creek Analytics' pioneering tools use Internet text to analyze public opinion. We provide thorough profiles of people and issues. We find, follow, and assess developing trends. Our work is quick, accurate, and thorough.

We use Internet content because everyone (people, institutions, even issues) leaves a record there: what we say and what is said about us. Even more important, opinion in Net content is there because people want it there: people go to blogs, Facebook, and Twitter to express their opinions and leave them for others to read. We download that content and analyze it for the characteristic and distinctive words and phrases that mark off opinion about a person, an issue, or any other factor in developing policy. We use a variety of statistical perspectives to create profiles that characterize what is being said online by and about someone (or something), or what is being said about something new: an emerging trend.

Here is what we do:

Create profiles from as many perspectives as you need. What makes Glenn Beck and his message jump out for his audience? We tell you the how and the why of his rise to prominence, not just the how much.

Identify and assess emerging trends. Where did the nativist outcry during the financial crisis come from? What made the argument effective, and how did it trail off?

Measure the prominence (newsworthiness, notoriety) of an agitator or a cause. When and by how much did someone become noticed, and how quickly did they fade into the crowd?

We do this by benchmarking: comparing text and opinion about a person with a context, comparing one political position with another, finding celebrity against a shifting background: Al Gore and the global warming debate; Sarah Palin and the 2008 Presidential campaign. Our benchmarking uses a series of statistical tools, most importantly evaluating significance. We find the differences that matter, and put them together. What follows is how we do it.

Search and sampling


Most opinion research uses random sampling. Ours does not. In random sampling, each item has an equal chance of being selected and each selection is made independently; randomness is modeled by the normal distribution. Even in a non-random environment, randomness is the basis of the standard polling process.

Internet content is not randomly ordered. The Net material we use is not amenable to random sampling and is not described by the mathematical models of randomness, such as the normal distribution. Rather than randomness, we base our analysis on the makeup of Internet architecture: power laws, and scale-free and small-world distributions. Power laws, scale freedom, and small-world structure therefore apply to samples of Net content, including those we use in our work. Random sampling is unlikely to produce workable and representative material in a non-random environment. This difficulty obtains for any grouped text, whether downloaded, as contemplated above, or gathered offline.

Consider analyzing opinion leadership during the efforts to stem the financial crisis during September and October 2008. Ultimately the Net content of interest was in the topics of finance and politics. Under the assumption of self-similarity, those topics were the source from which Net content was taken. The units of sampling in this case were documents/texts, from 100 to 2,000 words in length, in two groups of about 5,000 Internet files total, taken from online websites of newspapers (the New York Times), topical websites, political and other weblogs, and assorted newsgroups.
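To make the power-law point concrete, here is a minimal sketch, our illustration rather than Rock Creek's production code, of checking a text sample for the Zipf-like frequency curve that scale-free assumptions predict; the corpus filename is a hypothetical placeholder.

```python
# Minimal sketch (not production code): word frequencies in web text
# tend to follow a Zipf-like power law, frequency ~ rank**(-s),
# rather than a normal distribution.
from collections import Counter
import math

def zipf_slope(text: str) -> float:
    """Estimate the Zipf exponent by a least-squares fit of
    log(frequency) against log(rank)."""
    counts = Counter(text.lower().split())
    freqs = sorted(counts.values(), reverse=True)
    points = [(math.log(r), math.log(f)) for r, f in enumerate(freqs, start=1)]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var  # close to -1 for Zipf-like text

sample = open("financial_crisis_sample.txt").read()  # hypothetical corpus file
print(f"fitted log-log slope: {zipf_slope(sample):.2f}")
```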
The Economist/YouGov Internet Presidential poll. Comparing Krosnick, J. A. (2006) with Blumenthal, M., No Such Thing As A Perfect Sample (2009) shows what appears to be a typical debate, focusing on randomness as an unquestioned assumption rather than considering whether it applies to Net content (including what is said by poll subjects found online). See also The Structure of the Web; Breslau, L., Cao, P., Fan, L., Phillips, G., and Shenker, S., Web Caching and Zipf-like Distributions: Evidence and Implications; Menczer, F., Lexical and Semantic Clustering by Web Links.

"[C]ohesive collections of Web pages (for instance, pages on a site or pages about a topic) mirror the structure of the Web at large." Dill, S., et al., Self-Similarity In the Web; Menczer, F., Lexical and Semantic Clustering by Web Links; Gibson, D., and Kleinberg, J., Inferring Web Communities from Link Topology.

"The inherent non-randomness of corpus data renders statistical estimates unreliable, since the random variation on which they are based will be smaller than the true variation of the observed frequencies." Evert, S., How Random Is a Corpus? The Library Metaphor. See Chakrabarti, S., et al., The Structure of Broad Topics on the Web; Pennock, D. M., Flake, G. W., Lawrence, S., Glover, E. J., and Giles, C. L., Winners Don't Take All: Characterizing the Competition for Links on the Web.

The concurrent and cross-sectional analysis conducted for profiles is similar. Different Net upload-store-download technologies present different issues for search, retrieval, and analysis. Of the four used here, only web pages are usually undated; by-lines are used in online newspaper articles, in contrast with the pseudonyms found elsewhere; authorship may be collective or individual; and so on. See below. Originally the choices were made to see whether opinion leadership existed in one technology (e.g., re-purposed newspapers) or another. In some cases, we use social media for time-sensitive searching, weighted to reflect recency and time-sensitive topicality. Blogs present separate issues. Because a blog post with a comment string may be a kind of conversation (therefore a single functional text), but may also be broken up into several different chunks of code, we may sample it as a set of strings while analyzing the string as a single unit.

Blogs and Web 2.0 social media are sometimes used when changing opinion is being analyzed.

The unit of analysis we use is a text. A text may refer to a single item (a web page) or to an aggregate (10,000 Net files using the words financial and crisis which were uploaded or posted September 1, 2008 through September 28, 2008). The term text as used here refers to each of two different functions:

The formal definition required for sampling and retrieval: one or more sentences demarcated by typographical conventions (white space, binding) or technical definitions and use (<body> text </body> in HTML).

The functional definition required for analysis: a semantic unit of language in use, containing one or more sentences, containing chains of repeated and related words, and both familiar and novel information.

Texts are also, as the formal definition implies, collections of words with more or less well-formed boundaries. Words are the units of measurement at this level of granularity, and thereby serve two critical functions: they are units of analysis for frequency, dispersion, and collocation, and they are semantic anchors for contextual and topical analysis in words, phrases, sentences, and texts. The dialectic between the formal and semantic/topical perspectives on words is the keystone of our work.

Text files are obtained from the Internet by using several kinds of search engine, each with a different kind of ranking algorithm: Google (as an example of PageRank), backlinks (Yahoo, among others), HITS/authority, and unique visitors/popularity. Results are retrieved (with date limitations, as needed) and downloaded by using returns from each search engine separately (ranked by weight, then recursed, with results retrieved from a new search) and by aggregating results, as sketched below.
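Here is a minimal sketch of the aggregation step under one simple rule, weighted reciprocal rank; this is our illustrative choice, not Rock Creek's production configuration, and the engine names, weights, and URLs are hypothetical placeholders.

```python
# Minimal sketch: combine per-engine rankings by weighted reciprocal rank.
# Engine names, weights, and result lists are hypothetical placeholders.
from collections import defaultdict

def aggregate(ranked_lists: dict[str, list[str]],
              weights: dict[str, float]) -> list[str]:
    """A URL ranked r-th by engine e contributes weights[e] / r."""
    scores: dict[str, float] = defaultdict(float)
    for engine, urls in ranked_lists.items():
        for rank, url in enumerate(urls, start=1):
            scores[url] += weights[engine] / rank
    return sorted(scores, key=scores.get, reverse=True)

results = {
    "pagerank_engine": ["nytimes.com/a", "blog.example/b", "forum.example/c"],
    "backlink_engine": ["blog.example/b", "nytimes.com/a"],
    "popularity_engine": ["forum.example/c", "blog.example/b"],
}
weights = {"pagerank_engine": 1.0, "backlink_engine": 0.8,
           "popularity_engine": 0.5}
print(aggregate(results, weights))
```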

Net content is embedded in different kinds of code. Typically we analyze content in HTML files (and, when needed, blog comments stored in database languages). (Newsgroup code, UUE, etc., presents separate issues out of scope here.) The working convention for retrieval is, therefore, file = HTML document = web page = text. Depending on context, however, a blog thread (post and comments) may count as a single text, or the post and each comment may count as separate individual texts. (This is a working distinction; the details are out of scope.)
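As a minimal sketch of that working convention, the following reduces one downloaded HTML file to its body text; the third-party BeautifulSoup library is our choice for illustration, and the filename is a placeholder.

```python
# Minimal sketch: file = HTML document = text.
# Uses BeautifulSoup (pip install beautifulsoup4); the filename
# is a hypothetical placeholder.
from bs4 import BeautifulSoup

def html_to_text(path: str) -> str:
    """Return the visible body text of one web page as a single string."""
    with open(path, encoding="utf-8", errors="replace") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content code
        tag.decompose()
    body = soup.body or soup
    return " ".join(body.get_text(separator=" ").split())

text = html_to_text("downloaded_page.html")
```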

The formal extensional definition of words (character strings bounded by white space, grammatical marking, and so on) is omitted here.

The search engines used and the weighting algorithms vary from case to case. The collection process is designed around self-similarity, small-world, and scale-free assumptions.

Analysis: Frequencies
Trend recognition is one of the most common forms of frequency analysis, and so we will focus on it here. Discovering a trend, either retrospectively or more or less concurrently, is a before-and-after analysis of content. Working with blocks of text files and the statistics of word frequency change, we compare sequential blocks of comparable topical material. The before-and-after analysis used for trends begins with compiling word frequencies in the before material, which also functions as a benchmark. In this case the words are those in the September and October text, and frequencies are enumerated for each.

The compiled lists give what are sometimes called the observed absolute frequencies for the listed words.
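As a minimal sketch of the compilation step, assuming naive whitespace tokenization (real tokenizers handle punctuation and markup), the frequency lists are a counting pass; the toy text sets are placeholders.

```python
# Minimal sketch: compile observed absolute frequencies for a set of texts.
# Tokenization here is naive whitespace splitting, an assumption for brevity.
from collections import Counter

def absolute_frequencies(texts: list[str]) -> Counter:
    counts: Counter = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

# hypothetical downloaded text sets
september_texts = ["the bailout failed in the house",
                   "the bailout debate continued"]
october_texts = ["minorities blamed for the financial crisis"]

before = absolute_frequencies(september_texts)  # pre-9/26 benchmark
after = absolute_frequencies(october_texts)     # post-9/26 comparison
print(before.most_common(5))
```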

We use several metrics. One set is drawn from changes in network graphs and graph-based results.

Google (as an example of PageRank), backlinks, HITS/authority, and unique visitors/popularity. For a comparable approach, see Gabrilovich, Dumais, and Horvitz, Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty, and, to similar effect, Kleinberg, J., Temporal Dynamics of On-Line Information Streams.

To analyze incipient trends and numbers of words, a second set of metrics uses measurements of word association, the significance of changes in word frequency, and changes in the dispersion of the most important words to measure effects. Some trend metrics use an interval scale: those used for word frequencies, for example. To an extent we can measure the trend and its effect using interval data and derivatives. However, we are constrained by the need to use ordinal results (web page ranking systems) and nonparametric dispersion analysis.

To compare before and after opinion on political issues, we looked at the two one-month periods before and after September 26th, the date the first bailout legislation failed. We took each to mark an appropriate sampling unit for Net opinion on the political dimensions of the financial crisis. Assuming self-similarity, we used financial and crisis to define a set of texts dealing with that topic and also representative of the larger domains of opinion on the issue on the Net. These definitions also served as search engine queries (in the first instance, as discussed above). After weighted ranking, we also introduced date limitations (of convenience, further simplified for this discussion) and downloaded files in two sets of about 250,000 words each, for the two time periods.

The lists of word frequencies, here for the pre- and post-September 26 text collections, are then compared. This contrast is the next step in showing whether a trend, a discrete and identifiable chain of opinion, emerged from one dated set of texts compared to its predecessor. What turns up when raw counts are compared?

For example: relative frequencies are critical for identifying the under-the-radar onset of trends; popularity and some Google functions, for following trends; and word and link dispersion, for quantifying effects.

There is also the dynamic case, not used here, with feedback, such that results (from one or another level) about opinion from earlier periods are introduced (at one or another level) for another period. This ranges from media feedback to explicitly and overtly gaming a popularity-based search engine such as Technorati.

There was very little difference between the periods at this level in this case: terms like bailout and financial, which had led the substantive discussion, decrease, but only very slightly.

Observed absolute frequencies of critical terms before and after 9/26/08:

               pre-9/26          post-9/26
  the          14,171 (1st)      13,300 (1st)
  bailout         899 (32nd)        582 (59th)
  government      661 (45th)        437 (79th)
  financial       601 (50th)        553 (63rd)

(Function words are much more common than the content words we analyze. The function word the has been added to the table for comparison; its use declines as well, also only slightly. This suggests that a decline in observed frequencies, standing alone, is unlikely to be informative.)

However, going beyond this case, as a general matter, if the sample sets are different sizes, comparing raw word frequencies between sets would not, standing alone, even be valid, much less useful, at least not until the comparison is checked. One method for benchmarking comparisons normalizes the different instances: each word's frequency is expressed as a ratio of its raw count to the word count for the entire text. This can be expressed as [word X] per thousand words, or as a percentage, as sketched below.

Differences after normalizing continue to be slight in this case. Financial and bailout dominated the content words of the substantive debate, but show, for example, only about a 0.15 percentage point decrease in the use of bailout for the post-September 26 period. As the tables suggest, comparing word frequencies in almost any pair of texts may not show much difference, even when normalized. The critical question is whether the differences in use (i.e., frequencies) for important terms matter. Our analysis relies on a statistical test for comparing frequencies.

The results of relative frequency testing serve several related inquiries. First, are the two texts being compared (non-trivially) distinct from one another? Is there word use in the sets of text that shows meaningful differences in opinion for September and October 2008? Second, how are the texts distinct? What word use distinguishes one from the other? In this case, did patterns of word use, ultimately trends in opinion, emerge and develop? Third, if there are conspicuous differences, do the sharp-edged differences in word patterns suggest more or less topically and thematically related sets of words?
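A minimal sketch of the normalization arithmetic; the counts and totals below are the approximate figures from the tables, so the outputs differ slightly from the study's exact values.

```python
# Minimal sketch: normalize raw counts so that differently sized
# text sets can be compared. Counts and totals are approximate.
def per_thousand(count: int, total_words: int) -> float:
    return 1000.0 * count / total_words

def as_percent(count: int, total_words: int) -> float:
    return 100.0 * count / total_words

# bailout: 899 occurrences pre-9/26, 582 post-9/26, in sets of
# roughly 250,000 words each
print(as_percent(899, 250_000))  # ~0.36 (the table reports .35 on exact totals)
print(as_percent(582, 250_000))  # ~0.23 (the table reports .20 on exact totals)
```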

(Numbers in parentheses show frequency rank.) (Data from a Rock Creek study; the pre- and post-9/26 text sets were about 500,000 words each.) "The more frequent occurrence of a word in one text collection does not by itself show that the observed word is actually more frequent because the observed frequencies are dependent on the sizes of the texts that are being compared." Gries, S. Th., Useful Statistics for Corpus Linguistics.

Normalized frequencies: important terms as percentages of observed total word counts:

               pre-9/26    post-9/26
  the          5.6         4.6
  bailout       .35         .20
  government    .26         .15
  financial     .24         .19

The process begins with the frequency lists just described. For each word in the two frequency lists we derive the significance statistic, obtaining a value with which to distinguish the September and October texts and to analyze the distinction. More than two dozen tests are now being discussed in the scientific disciplines concerned with evaluating the significance of frequency differences in word use when paired texts are compared. The most commonly used are the log likelihood test and the chi-squared test.
See Dunning, T., Accurate Methods for the Statistics of Surprise and Coincidence (cited more than 1,300 times); Rayson, P., and Garside, R., Comparing Corpora Using Frequency Profiling. By contrast, the commonly used chi-squared test derives probabilities for the frequencies by comparing them with random ordering; our samples are not random. The test depends on the assumption, not applicable for Net content or words in general, that words are independent and identically distributed.

The log likelihood test we use does not assume randomness or normally distributed data in making comparisons. This makes the test better suited to the Net's non-random word and content distribution.
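For reference, the two-corpus form of the statistic (following the Rayson and Garside formulation; the notation is ours): with observed counts $O_1$ and $O_2$ of a word in text sets of $N_1$ and $N_2$ total words, the expected counts and log likelihood score are

$$E_i = \frac{N_i\,(O_1 + O_2)}{N_1 + N_2}, \qquad G^2 = 2\sum_{i=1}^{2} O_i \ln\frac{O_i}{E_i}.$$

Larger $G^2$ means a difference in relative frequency that is less likely to be an artifact of the two sample sizes.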

We also use log likelihood analysis to derive the sets of words that can be used to distinguish one set of texts from another, or to characterize one of them. These sharply defined words are sometimes referred to as "keywords": those that occur uncommonly more (or less) often in one text, or set of texts, than in another. Keyness measures relative distinctiveness: how far a term departs from its comparative benchmark. These keywords are the words that characterize individual texts (Romeo and Juliet) or groups of texts (post- as opposed to pre-September 26 discussions of the financial crisis), as sketched below.
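A minimal sketch of keyness ranking on this definition; the tokenization and toy corpora are illustrative, not the production pipeline.

```python
# Minimal sketch: rank keywords by the two-corpus log likelihood (G2)
# statistic. Tokenization and corpora are illustrative placeholders.
from collections import Counter
import math

def g2(o1: int, o2: int, n1: int, n2: int) -> float:
    """Log likelihood score for one word observed o1/o2 times in
    corpora of n1/n2 total words (Dunning/Rayson-Garside form)."""
    total = o1 + o2
    e1, e2 = n1 * total / (n1 + n2), n2 * total / (n1 + n2)
    score = 0.0
    for o, e in ((o1, e1), (o2, e2)):
        if o > 0:
            score += o * math.log(o / e)
    return 2.0 * score

def keywords(before: Counter, after: Counter, top: int = 10):
    n1, n2 = sum(before.values()), sum(after.values())
    vocab = set(before) | set(after)
    scored = [(w, g2(before[w], after[w], n1, n2)) for w in vocab]
    return sorted(scored, key=lambda ws: ws[1], reverse=True)[:top]

before = Counter("the bailout failed and the bailout stalled".split())
after = Counter("minorities and illegal aliens blamed for the crisis".split())
print(keywords(before, after))
```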

Scott, M., and Tribble, C., TEXTUAL PATTERNS. Note that key/keyness are derivative terms: the log likelihood test (discussed above) measures the salience (more or less the same as statistical significance) of relative frequency differences between texts. That is, the metric looks not at the arithmetical difference in word use, but at how much that difference matters.

When we applied keyness to the September-October Net discussion, some words stood out. While rare in raw numbers, these words made the later October discussion distinctive: minorities, hate, and alien became visible in this relative frequency analysis. These emerging keywords are evidence of a change in the terms of the debate.

The log likelihood test also uses a logarithmically based ratio scale that facilitates comparisons of individual word (and some other) usage across sets of texts. This in turn allows cross-sectional and longitudinal comparisons.

New terms in the crisis debate (left column: percentage of total words and frequency rank; right column: keyness, the departure from expected frequencies; negative values mark declines):

               Percentage       Keyness
  bailout      .20 (64th)       -117
  government   .15 (85th)        -82
  financial    .19 (68th)        -14
  minorities   .02 (525th)        34
  hate         .01 (284th)        21
  alien         -- (3141st)       10

Keywords are the hallmarks of frequency change: they bring out the contrast between profile and background, or between the blocks of opinion recorded in Net text. If people talk differently about Toyota than they do about General Motors, what stands out in the comparison by the log likelihood metric?

Keywords are also markers for shifts in word use as the Net discussion moves forward. How are keywords situated within text? Where do they fall? This kind of location is measured with analysis of dispersion: the even or uneven distribution of an item through the text being studied.

Dispersion, placing keywords in the Internet text environment, is the basis for the audience-effect side of our opinion research. We can place keywords in the different audience segments represented by different groups of Net text, and then compare the text groups for the distribution of keywords. If minority and alien are found significantly more often during October on conservative blogs than elsewhere, this is evidence of an echo chamber effect for that issue. Dispersion, then, shows where messages have taken hold: where and by how much the message has had an effect. That is, we are developing ways to measure where, how, and how much a message has affected different blocs of Internet opinion. Where are the key terms found most often in the audience, and in which parts of the audience? In this case, the keyword with visibly uneven dispersion is minorities, and the effect is concentrated in the posts of three bloggers, two visibly conservative-leaning. In this case, Malkin seems ultimately to have been preaching to the converted.
Ordinarily dispersion for ratio-scaled data is measured by standard deviation or variance. However, where the data may be non-parametric, those metrics are not available; pair-wise data is also not available, and sample sizes are large. For continuing surveys of the problem see Gries, S. Th., Dispersions and Adjusted Frequencies in Corpora, and Dispersions and Adjusted Frequencies in Corpora: Further Explorations. Nativist rhetoric was found on conservative blogs, at least 40,000 of them (by different sampling than that used above). However, this was less than 4% of blogs discussing the financial crisis, and only traces of the message could be found in mainstream media websites. Please note that these results are crude, coming from the application of a form of head-counting using the results of the frequency analysis.
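One simple nonparametric measure in the family Gries surveys, the deviation of proportions (DP), can be sketched as follows; the counts are hypothetical, and this is our reading of the cited papers rather than Rock Creek's exact metric.

```python
# Minimal sketch of Gries's DP (deviation of proportions), a
# nonparametric dispersion measure: 0 = perfectly even spread,
# larger values = concentrated in a few corpus parts.
def dp(occurrences: list[int], part_sizes: list[int]) -> float:
    total_occ = sum(occurrences)
    total_size = sum(part_sizes)
    return 0.5 * sum(
        abs(occ / total_occ - size / total_size)
        for occ, size in zip(occurrences, part_sizes)
    )

# hypothetical example: "minorities" concentrated in 3 of 5 blog groups
print(dp([40, 35, 20, 3, 2], [10_000] * 5))    # uneven spread -> larger DP
print(dp([20, 20, 20, 20, 20], [10_000] * 5))  # even spread   -> 0.0
```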

Collocation
Collocation is the degree to which words occur together unusually often, by some measure of significance. "Collocation" is a formal term for the intuition that some words tend to occur near each other: "night" and "day", "kick" and "bucket", "global" and "warming". If one or more collocations (a set of key phrases and other patterns) can be found in a text, then we can build up to quantitatively derived core features of the text. Moreover, when collocates can be found and aggregated for the distinctively frequent vocabulary (keywords) in a text, the process marks off the message of the text, whether this is the intended message or the message picked up by the Internet audience reading it.
Evert, S., Corpora and collocations

It follows that study of collocations and their uses extends and applies statistics in virtually every language discipline, from machine translation to literary analysis to email forensics.

As a rule of thumb, the higher the statistical score for a word pair's collocation, the more the association tells us about the pair's role in the text. This measurement and analysis can be done by hand (as by inspecting a text for every instance of a word in order to identify which words recur near it) or with statistics. Collocates add color to literal meaning; repeated and prominent usage may enhance the coloring of surrounding words. Cause is an example: when used as a verb, its usual collocates are negative. Collocation can compound the effect of a distinctive and vivid vocabulary. For example, at the end of 2008 we analyzed the impact of an online essay Michelle Malkin wrote in late September of that year, arguing that illegal immigrants were to blame for the banking collapse. There were interlocking word patterns in that essay that, as received and passed on in Net discussion, could be captured and measured with statistical collocation analysis. These words also interlocked: illegal collocated markedly with both alien and Hispanic, and so on, as shown in the figure below.

This is a considerable over-simplification: well more than 25 measures for collocation have been proposed, and seldom, if ever, will all point in the same direction. Evert, S., Corpora and Collocations; Oakes, M., STATISTICS FOR CORPUS LINGUISTICS, 188-195. In practice, we use one or more of the mutual information, t-test, and log likelihood tests for a project. Cause collocates with, among other things: damage, problems, pain, disease, distress, trouble, blood, concern, degradation, harm, pollution, suffering, anxiety, death, fear, stress, surprise, symptoms.
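Of those measures, pointwise mutual information is the simplest to sketch; the window size and toy text are illustrative assumptions, and the pair probability is left unnormalized for window size, which is adequate for ranking in a sketch.

```python
# Minimal sketch: score collocations by pointwise mutual information
# within a +/- 4-word window. Window size and text are illustrative.
from collections import Counter
import math

def pmi_collocations(tokens: list[str], window: int = 4, min_count: int = 2):
    word_counts = Counter(tokens)
    pair_counts: Counter = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            pair_counts[tuple(sorted((w, v)))] += 1
    n = len(tokens)
    scores = {}
    for (w, v), c in pair_counts.items():
        if c >= min_count and w != v:
            scores[(w, v)] = math.log(
                (c / n) / ((word_counts[w] / n) * (word_counts[v] / n))
            )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = "illegal alien mortgage racket illegal alien loans blamed".split()
print(pmi_collocations(tokens))
```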

Illegal immigration and the mortgage mess


The full recursive collocation analysis is beyond scope here. Details on request.

Figure 2: how the core words in Malkin's essay interlocked


From the Rock Creek Analytics collocation analysis of the Malkin essay. The links and nodes are not drawn to absolute scale, but node and link sizes are scaled relative to each other, and these are among the most common words in the text. The graphic was created using Voisine network visualization software.


Each of these keywords collocated significantly often with each of the others. The width of the lines represents the result of applying the collocation metrics, a kind of tensile strength. Moreover, two of these keywords were key: used unusually often by Malkin in comparison with other September Net opinion. The result was a tightly bound bundle of blame. The word pattern, by itself or in noteworthy part, was picked up by about 40,000 weblogs in October 2008.
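A graph like Figure 2 can be roughed out from collocation scores with the third-party networkx library (our stand-in for illustration, not the Voisine software used for the actual figure); the scores below are invented for the sketch.

```python
# Minimal sketch: build a collocation network where edge width
# encodes collocation strength. Scores are illustrative, not the
# figures from the Malkin analysis.
import matplotlib.pyplot as plt
import networkx as nx

scores = {
    ("illegal", "alien"): 9.1,
    ("illegal", "hispanic"): 6.4,
    ("alien", "mortgage"): 4.2,
    ("mortgage", "racket"): 3.7,
}

g = nx.Graph()
for (w, v), s in scores.items():
    g.add_edge(w, v, weight=s)

# edge width proportional to collocation strength
widths = [g[u][v]["weight"] for u, v in g.edges()]
nx.draw(g, with_labels=True, width=widths)
plt.show()
```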

See above for her keywords.

However, given the dispersion review noted above, this number may not reflect a significant impact on the overall discussion.

One of the discussion threads picking up and echoing her phrasing also used the negative coloration of cause described above: "Giving home loans to minorities caused financial crisis." Repeating phrasing and other distinctive vocabulary in this way reflects influence: we tend to quote or reword the phrasing of ideas we agree with, and by tracking phraseology in this way we can track influence.
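As a minimal sketch of that kind of phrase tracking, shared word n-grams between a source essay and an echoing post serve as a crude proxy (a toy stand-in for the meme-tracking work cited below); the texts are illustrative.

```python
# Minimal sketch: find word n-grams shared between a source essay
# and a downstream post, a crude proxy for echoed phrasing.
def ngrams(text: str, n: int = 4) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}

source = "the massive illegal alien mortgage racket exposed"
echo = "she described the massive illegal alien mortgage racket in detail"
print(ngrams(source) & ngrams(echo))
# -> shared 4-grams such as ('massive', 'illegal', 'alien', 'mortgage')
```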

There are many other rhetorical devices in Net text that also served to convey Malkin's message (and that can be measured, but were not analyzed in this case). They include synonyms, homonyms, and other rhetorical figures, like part-for-whole synecdoche (alien). For an example of tracking phraseology in this way, see Leskovec, J., Backstrom, L., and Kleinberg, J., Meme-tracking and the Dynamics of the News Cycle. (However, to be clear, the conception of a meme used in that article is very different from the one we use, as shown, for example, in Figure 2.)

Concordancing
A concordance is a list of occurrences of a word (or sometimes a brief phrase), along with immediate context, from a corpus or text collection. The process produces a list of occurrences of the search term, with each occurrence centered in the GUI window of specialized concordance software. Each instance of the search term displays the words that come before and after it in the text, to the left and right of the term as shown in the software window.

Although not by itself a complex tool, concordancing serves several functions: when counting from the display, it can be used to discover latent word patterns; it can analyze a text using the context of collocated terms; and it can be used to identify and supply words for relative frequency analysis, as sketched below.

Here is an example of a concordance list, centered on the word illegal, in the Malkin essay discussed above, using illegal as a search term.
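A minimal sketch of the underlying KWIC (key word in context) display; the context width and text are illustrative, and this is not the AntConc implementation.

```python
# Minimal sketch of a KWIC (key word in context) concordance,
# the display produced by tools such as AntConc. Context width
# and text are illustrative.
def concordance(text: str, term: str, width: int = 4) -> list[str]:
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        if w.lower().strip(".,") == term.lower():
            left = " ".join(words[max(0, i - width) : i])
            right = " ".join(words[i + 1 : i + 1 + width])
            lines.append(f"{left:>40} | {w} | {right}")
    return lines

text = ("the massive illegal alien mortgage racket grew as "
        "illegal immigration dominated the debate")
print("\n".join(concordance(text, "illegal")))
```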

See S. Hunston, CORPORA IN APPLIED LINGUISTICS, 38-66

Figure 3: screenshot of concordance software, showing illegal as used in context in Malkin's essay. Material taken from the Malkin essay, using AntConc software.

Beyond investigation and research, a concordance serves as a check on the results of other functions. Do the collocations appear to be significant when examined in context? What do keyword results show when their usage is examined in context? Illegal here shows as a linchpin of nativist rhetoric.

Putting the data together


This technical review has shown how we extract opinion from Internet text and put it to work. Here is a brief summary of the example.

Which are the words that matter, that make a message stand out, that make a text distinctive? Frequency and relative frequency analysis.

How do words and phrases compare with competing messages? Keywords: minorities and alien as core conservative rhetoric.

How do critical words, especially keywords, hold together? Collocation: illegal, Hispanics, and alien.

Where and how much do messages have an impact? Dispersion: nativism had conservative resonance, but slight if any effect elsewhere.

How were critical words used in context? Concordancing: here, for example, illegal in context: the massive illegal alien mortgage racket.

Conclusion
The Internet is nothing more than a vast collection of computer files, billions of them. Many are machine-readable text that can be displayed in English. These text files are documents describing, referring to, and corresponding to people, institutions, and issues. Many of them reflect and express opinion. Neglecting the analysis of Net text means missing out on a critical resource for opinion research. These are the most critical and the most valuable tools available at Rock Creek Analytics.

They work. Contact Donald Weightman, principal (cell 202 997-3290) dweightman@rock-creek-analytics.com or info@rock-creek-analytics.com

