
What is a Value Added Analysis Teacher Rating?

The Value Added Model of Teacher Effectiveness

With the push to hold teachers accountable for student achievement and high scores on high-stakes testing, there has also been a push to evaluate teachers from a more statistical viewpoint. One way of doing this is value-added analysis. While value-added analysis can be very helpful in identifying teachers who are effective at teaching the information measured on a specific test, it's not always a good measure of how a teacher rates in other areas of teaching.

What Is Value-Added Analysis?

Value-added analysis provides an estimate of how well a teacher is doing at increasing student performance on standardized tests. It gives an idea of how well a student will do on future tests by looking at past test scores (his baseline) and comparing them to current scores, measuring student academic growth:

Academic Growth = current/recent test performance - baseline (past test performance)

Academic growth in this equation, be it positive, negative or neutral, is the value. (A short code sketch after the benefits list below makes this calculation concrete.) This value is used as an indicator of how effective the current teacher is in teaching the student. It is, in essence, her value as a teacher.

What Are the Benefits of Value-Added Analysis?

Having data makes it easier to rate a teacher based on measurable, quantifiable information rather than anecdotal evidence, report cards or observations and reviews. While those measures may still play some role in measuring teacher success, a value-added score also looks at teacher effectiveness on a case-by-case basis. Some of the other benefits of this type of rating include:

- It looks at individual student success, rather than the class as a whole. In measuring each child's progress from his own baseline, it's easier to see a trend. If most of the scores are improving, then it's likely the teacher is effectively getting information across to students.
- It reduces the discrepancies found in other measurement systems due to socio-economic class, gender and English-as-a-second-language status.
- It allows administrators to look at specific areas of need, both in terms of professional development for teachers and intervention programs for students.
- It combines the elements of achievement and progress, whereas some other ratings of teacher effectiveness are based solely on achievement.
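To make the growth calculation concrete, here is a minimal Python sketch of the core idea: compute each student's growth from a baseline score and average it by teacher. The records and names are invented for illustration; real value-added systems use far more elaborate statistical models.

```python
# Minimal sketch of the growth idea behind value-added analysis.
# All data here is made up; real models are considerably more complex.

from collections import defaultdict

# (student, teacher, baseline score, current score) -- hypothetical records
records = [
    ("ana",   "ms_lee",  410, 445),
    ("ben",   "ms_lee",  520, 530),
    ("carla", "mr_diaz", 480, 470),
    ("dev",   "mr_diaz", 390, 430),
]

growth_by_teacher = defaultdict(list)
for student, teacher, baseline, current in records:
    # Academic growth = current performance - baseline (past performance)
    growth_by_teacher[teacher].append(current - baseline)

for teacher, growths in growth_by_teacher.items():
    avg = sum(growths) / len(growths)
    print(f"{teacher}: average growth {avg:+.1f} points")
```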

What Are the Drawbacks?

There are some problems with using value-added analysis as the only way of measuring teacher effectiveness. A teacher has no control over what happened educationally before a student stepped into her classroom and is only able to propel a child forward from where he began. If he began below standard and she taught him so that he barely reaches the standard, the student has still made progress that may not show up in testing. Other drawbacks include:

- The system can only be used in school districts that give standardized tests on an annual basis.
- The test being used for comparison needs to evaluate the goals targeted by the school's curriculum.
- Standardized tests aren't always the most effective assessment of student skill and learning.
- Using standardized tests as a measurement tool doesn't take into account learning and growth in areas outside the scope of the test, not only academically, but socially and behaviorally as well.

Teacher ratings can vary wildly from year to year, based on the student population and their prior test scores. Also, in his report on this issue for the Annenberg Institute for School Reform, Professor Sean P. Corcoran pointed out that "because value-added is statistically estimated, it is subject to uncertainty, or a 'margin of error.'"

Is There One Way to Calculate a Teacher's Score?

There is no one way to calculate a teacher's score using value-added analysis, a fact to which critics, including the National Education Association and local teachers unions, point in their concerns about using this model as a teacher rating system. A report from the National Academies of Sciences points out that such scores have yet to be scientifically validated. In its report, Using Student Progress To Evaluate Teachers: A Primer on Value-Added Models, the Educational Testing Service identifies three different systems used to calculate teacher scores and notes these are only the most widely used. These models are the:

1. Educational Value-Added Assessment System (EVAAS), which was developed in 1993 for use in Tennessee and is now used by many other districts across the nation. It uses a complicated system that looks not only at student achievement and progress, but also at district averages and the teacher effect from year to year.
2. Dallas Value-Added Accountability System, which is in use in the large Dallas school district and differs from the EVAAS in that it doesn't look at scores across many grades, but only from year to year, examining not gains but a connection between the scores.
3. Rate of Expected Academic Change, which proposes to use test data to measure a student's progress toward a proficiency goal instead of comparing student-to-student achievement, as many standardized tests do.

What's the Bottom Line?

Using a value-added model for measuring teacher effectiveness works well in providing an overall snapshot of how the teacher is doing in terms of teaching to the test. It can assist administrators in making decisions about classroom and professional development funding. However, until there is a scientifically validated and standardized model, using VAM for incentive pay or contract renewal may cause districts to lose good teachers to bad decisions.

New York City Teacher Ratings: How Its Value-Added Model Compares To Other Districts
Posted: 03/2/2012 12:48 pm | Updated: 03/2/2012 12:59 pm | Sarah Butrymowicz and Sarah Garland

New York City schools erupted in controversy last week when the school district released its "value-added" teacher scores to the public after a yearlong battle with the local teachers union. The city cautioned that the scores had large margins of error, and many education leaders around the country believe that publishing teachers' names alongside their ratings is a bad idea. Still, a growing number of states are now using evaluation systems based on students' standardized test scores in decisions about teacher tenure, dismissal and compensation. So how does the city's formula stack up to methods used elsewhere? The Hechinger Report has spent the past 14 months reporting on teacher-effectiveness reforms around the country, and has examined value-added models in several states. New York City's formula, which was designed by researchers at the University of Wisconsin-Madison, has elements that make it more accurate than other models in some respects, but it also has elements that experts say may increase errors, a major concern for teachers whose job security is tied to their value-added ratings.

"There's a lot of debate about what the best model is," said Douglas Harris, an expert on value-added modeling at the University of Wisconsin-Madison who was not involved in the design of New York's statistical formula. The city used the formula from 2007 to 2010 before discontinuing it, in part because New York State announced plans to incorporate a different formula into its teacher-evaluation system.

Value-added models use complex mathematics to predict how well a student can be expected to perform on an end-of-the-year test based on several characteristics, such as the student's attendance and past performance on tests. Teachers with students who take standardized math and English tests (usually fewer than half of the total number of teachers in a district) are held accountable for getting students to reach this mark. If a teacher's students, on average, fall short of their predicted test scores, the teacher is generally labeled ineffective, whereas if they do as well as or better than anticipated, the teacher is deemed effective or highly effective.

A number of states and districts across the country already tie student performance on standardized tests to teacher evaluations; others have plans to do so. Many education reformers, including those in the Obama administration, commend the practice. States were awarded points in the federal Race to the Top grant competition for creating policies that tie student academic growth to teacher evaluations. In Florida, by 2014, all districts must use value-added ratings for at least half of a teacher's total evaluation score. Ohio districts will start doing so in 2013. This year in Tennessee, student test-score data will count for 35 percent of each teacher's evaluation. Value-added ratings make up 20 to 25 percent of New York's new teacher-evaluation framework. And politicians in Nebraska and Louisiana are pushing for these measures to be included in new teacher-evaluation systems.

The new evaluations, which will generally use test scores as one of multiple measures, including classroom observations, are increasingly being used in decisions about compensation, retention and tenure. Advocacy groups like The New Teacher Project, now known as TNTP, and the National Council on Teacher Quality have cheered the inclusion of value-added scores in teacher-evaluation systems. In the past, most teachers were rated based on infrequent, "drive-by" principal observations that resulted in satisfactory ratings for up to 99 percent of teachers.

But skeptics, including teachers unions and researchers, say that value-added models have reliability problems. Depending on which variables are included in a value-added model, the ratings for teachers can vary dramatically, critics say. As an example, researchers at the University of Colorado examined the formula that an economist hired by the Los Angeles Times created to rate teachers there (the economist's work was funded in part by the Hechinger Institute on Education and the Media). The University of Colorado researchers found that more than a third of L.A. Unified teachers would have had different scores if a slightly different formula had been used. A 2010 study by Mathematica Policy Research found that the error rate for value-added scores based on three years of data was 25 percent. In other words, a three-year model would rate one out of every four teachers incorrectly. The error rate jumped to 35 percent with only one year of data.
The report cautioned against using value-added models for personnel decisions, a position that other experts have echoed. In New York City, some of the teachers whose scores were published last week received ratings based on multiple years of data, according to a 23-page technical report describing the city's statistical formula. But other New York City teachers (a spokesperson for the city education department was unable to say exactly how many) were rated based on only one year of data.
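The predict-then-compare logic this passage describes can be sketched compactly. The following Python snippet is a toy illustration, not any district's actual formula: it predicts each student's score from prior score and attendance using ordinary least squares, then treats a teacher's mean prediction miss as a crude value-added estimate. All data and names are hypothetical.

```python
import numpy as np

# Hypothetical data: one row per student.
prior_score = np.array([420.0, 510.0, 480.0, 395.0, 455.0, 530.0])
days_present = np.array([170.0, 178.0, 150.0, 165.0, 172.0, 180.0])
current_score = np.array([440.0, 515.0, 465.0, 430.0, 470.0, 545.0])
teacher = np.array(["A", "A", "B", "B", "A", "B"])

# Predict current score from prior score and attendance (ordinary least squares).
X = np.column_stack([np.ones_like(prior_score), prior_score, days_present])
coef, *_ = np.linalg.lstsq(X, current_score, rcond=None)
predicted = X @ coef

# A teacher's crude "value added" = average amount by which that teacher's
# students beat (or fell short of) their predicted scores.
for t in np.unique(teacher):
    residuals = current_score[teacher == t] - predicted[teacher == t]
    print(f"Teacher {t}: mean residual {residuals.mean():+.1f} points")
```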

Washington, D.C. also uses just one year of student test scores in its statistical model. But the system that Bill Sanders, a researcher known as the "grandfather" of value-added measurement, designed for Tennessee uses five years of data in creating a score for each teacher. To ensure that elementary teachers aren't judged based on just one or two years of test-score data, the Tennessee model takes into account a student's performance in later years, Sanders says. For example, third-grade teachers are rated based in part on how their students do in subsequent grades.

"When any one student takes a math test, on any one day, there is a huge uncertainty around that score," Sanders told The Hechinger Report in an interview last year. "It could be the kid got lucky this year, and guessed two or three right questions. Or the kid this morning could not have been feeling well. Consequently that score on any one day is not necessarily a good reflection of a kid's attainment level."

Another question that educators and researchers have debated is whether the statistical models should account for student characteristics that are linked to achievement, for example, poverty, English ability and special-education status. In places like Florida and Washington, D.C., value-added models have accounted for such factors, in part because of the limitations of using fewer years of test-score data. New York City's model does as well. Variables include race, gender, socioeconomic status, and even whole-class characteristics like the size of the class and how many students are new to the city.

Many researchers argue that adjusting for student demographic characteristics is unnecessary because the growth scores are calculated by comparing students against themselves. Sanders and others say that including student characteristics could bias the scores by making it easier for teachers of disadvantaged students to be rated more highly. A black student, for example, might be expected to do worse than a white student in such a model, an assumption that Sanders says lowers expectations for the black student, along with the teacher who has that student in class. In New York, high-rated teachers are evenly spread across both low-performing and high-performing schools, which experts say is partly a result of the formula's adjustments for student demographics. Teachers with demographically similar students (whether they are low-income, minority, or have special needs) are ranked relative to one another, not the entire teaching force.

Other researchers have argued that factors like student poverty should be taken into account, however, because concentrated poverty, for example, is linked to lower student performance, suggesting that a student's peers may affect how that student does in school and on tests. That is, a teacher who has a large number of disadvantaged students in class may have a more difficult job getting a higher rating than teachers with fewer disadvantaged students. In an attempt to settle the question, Mathematica, the research group, is currently examining the effects of whole-class characteristics on teacher value-added ratings in a study of 30 districts across the country.

Although it gets much less attention, one of the biggest problems with value-added modeling, according to many experts, is that the ratings cover only a fraction of teachers: those whose students take standardized tests in math and English, typically in grades three through eight.
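The peer-group ranking described above, in which teachers are compared only against teachers of demographically similar classrooms rather than the entire teaching force, can be illustrated with a small hypothetical example. The scores, group labels, and percentile method below are invented for the sketch.

```python
import numpy as np

# Hypothetical teacher VAM scores and a label describing each
# teacher's classroom demographics (here, a coarse poverty bucket).
scores = np.array([0.35, 0.10, -0.20, 0.05, -0.30, 0.25, -0.05, 0.15])
group = np.array(["high-poverty"] * 4 + ["low-poverty"] * 4)

def pct_rank(x):
    """Percentile rank (0-100) of each value within its own array."""
    return np.argsort(np.argsort(x)) / (len(x) - 1) * 100

overall = pct_rank(scores)                 # ranked against all teachers
within = np.empty_like(overall)            # ranked only against peers
for g in np.unique(group):
    mask = group == g
    within[mask] = pct_rank(scores[mask])

for i in range(len(scores)):
    print(f"teacher {i}: overall {overall[i]:5.1f}th pct, "
          f"within {group[i]}: {within[i]:5.1f}th pct")
# Ranking only against demographically similar peers can move a teacher
# up or down relative to an overall ranking, which is the adjustment at
# issue in the New York debate described above.
```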
As new teacher-evaluation systems go into effect in more districts and states in the next two years, many, including New York City, will be grappling with how to rate everyone else. Rhode Island is using teacher-created goals on classroom work and tests. Colorado is planning to use off-the-shelf assessments and school-generated methods to gauge how teachers in subjects like physical education and music are performing. In Tennessee, teachers without value-added ratings are graded in part on how the teachers who do receive ratings in their school perform. And Florida is creating more tests, one for every subject and grade level down through kindergarten. Harris calls Florida "an example of what not to do."

Given the problems with value-added modeling, no matter which formula is used, he suggests that the best uses of the ratings might not be to make decisions about hiring, firing and tenure. Instead, they can be used to give low-rated teachers more training or principal observations, rather than pink slips.

This story also appeared on GothamSchools on March 1, 2012.

May 2010 | Volume 67 | Number 8 | The Key to Changing the Teaching Profession | Pages 81-82

Using Value-Added Measures to Evaluate Teachers


Jane L. David

Even the staunchest advocates of performance-based pay don't think it's fair to judge teachers' effectiveness solely on the basis of end-of-year test scores, without regard to where the teachers' students started at the beginning of the year. Can value-added measures, which show students' growth from one year to the next, solve this problem?

What's the Idea?


The claim for value-added measures is that they capture how much students learn during the school year, thereby putting teachers on a more level playing field as they aim for tenure or additional pay.

What's the Reality?


End-of-year test scores do not show how much students learned that year in that class, so measures that take into account where students started are surely an improvement. However, such measures of growth are only a starting point. Making judgments about individual teachers requires sophisticated analyses to sort out how much growth is probably caused by the teacher and how much is caused by other factors. For example, students who are frequently absent tend to have lower scores regardless of the quality of their teacher, so it is vital to take into account how many days students are present. Thus, to be fair and to provide trustworthy estimates of teacher effectiveness, value-added measures require complicated formulas that take into account as many influences on student achievement as possible.
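One way to see why such adjustments matter is a toy comparison, with all numbers invented: the same students are scored by a model that ignores absences and by one that controls for them. The two-teacher setup, the column values, and the ordinary-least-squares approach are illustrative assumptions, not any real evaluation formula.

```python
import numpy as np

# Hypothetical records: prior score, days absent, current score, teacher.
prior = np.array([450.0, 470.0, 460.0, 455.0, 465.0, 448.0])
absences = np.array([2.0, 4.0, 25.0, 30.0, 3.0, 28.0])
current = np.array([480.0, 495.0, 455.0, 450.0, 492.0, 452.0])
teacher = np.array(["X", "X", "Y", "Y", "X", "Y"])

def teacher_effects(features):
    """Mean residual per teacher from an OLS fit on the given features."""
    X = np.column_stack([np.ones(len(current))] + features)
    coef, *_ = np.linalg.lstsq(X, current, rcond=None)
    resid = current - X @ coef
    return {t: resid[teacher == t].mean() for t in np.unique(teacher)}

# Model 1 ignores absences; Model 2 controls for them.
m1 = teacher_effects([prior])
m2 = teacher_effects([prior, absences])
for t in sorted(m1):
    print(f"teacher {t}: unadjusted {m1[t]:+.1f}, absence-adjusted {m2[t]:+.1f}")
# Teacher Y's students are frequently absent; a model that ignores that
# makes teacher Y look worse than the absence-adjusted model suggests.
```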

What's the Research?


A growing number of researchers are studying whether value-added measures can do a good job of measuring the contribution of teachers to test score growth. Here I summarize a handful of analyses that shed light on two questions.

How Fair Are Value-Added Measures?


The trustworthiness of a value-added measure depends on how it is defined and calculated. Koretz (2008) argues that measuring the value added by the teacher requires knowing not only how much students have learned in a given year, but also the rates at which those particular students learn. Students reading well above grade level, for example, would be expected to learn faster than struggling readers. Value-added measures should take these differences into account.

Rothstein (2008) worries that test score gains are biased because students are not randomly assigned to teachers. For example, comparing teachers whose classrooms are treated as dumping grounds for troubled students with teachers whose classrooms contain the best-behaved students will favor the latter.

RAND researchers examined whether giving students different tests would lead to different conclusions about teacher effectiveness (Lockwood et al., 2006). They calculated value-added ratings of middle school teachers in a large school district on the basis of their students' end-of-year scores from one year to the next on two different math subtests. They found large differences in teachers' apparent effectiveness depending on which subtest was used. The researchers concluded that if judgments about teacher effectiveness vary simply on the basis of the test selected, administrators should use caution in interpreting the meaning of results from value-added measures.

Researchers have also identified other threats to the trustworthiness of value-added measures. Goldhaber and Hansen (2008) looked at the stability of such measures over time: Do value-added analyses identify the same teachers as effective every year? Using a large data set from North Carolina, they found that estimates of teacher effectiveness were not the same across years in reading or math. Other researchers (for example, Koretz, 2008) question whether it is even possible to compare gains from one year to the next using tests that do not measure the same content.
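A simple way to probe the stability questions raised by Lockwood et al. (2006) and Goldhaber and Hansen (2008) is to correlate the rankings a model produces under two conditions, such as two subtests or two adjacent years. A hypothetical sketch follows; the scores are invented, and a rank correlation well below 1.0 would echo the instability these studies report.

```python
from scipy.stats import spearmanr

# Hypothetical value-added estimates for the same ten teachers,
# computed from two different math subtests (or two different years).
vam_subtest_a = [0.31, -0.10, 0.05, 0.22, -0.25, 0.12, -0.02, 0.40, -0.18, 0.08]
vam_subtest_b = [0.05, -0.22, 0.18, 0.10, -0.05, 0.30, -0.15, 0.12, 0.02, -0.08]

# A rank correlation near 1.0 would mean the two measures agree on who is
# effective; values well below 1.0 mean the rating depends on the test chosen.
rho, p_value = spearmanr(vam_subtest_a, vam_subtest_b)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.2f})")
```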

Are Value-Added Measures More Accurate Than Traditional Evaluations?


Traditional methods for evaluating teacher effectiveness have their own problems, for example, infrequent or poor classroom observations or administrator bias. In fact, the persistently subjective nature of these more traditional evaluations is what fuels the current enthusiasm among policymakers for basing teacher evaluation on "objective" test scores. Do value-added measures do a better job of judging teacher effectiveness than traditional teacher evaluations do? Researchers have looked at this question by comparing results from the two approaches.

When Jacob and Lefgren (2008) looked at 201 teachers in 2nd through 6th grade, they found a strong relationship between principals' evaluations and value-added ratings (based on student math and reading scores) of the same teachers. The researchers then asked which method did a better job of predicting how the teachers' future classes would score. They found that either method was fairly accurate in predicting which teachers would be in the top and bottom 20 percent the following year in terms of their students' test scores. Although value-added measures did a slightly better job of predicting future test scores, adding principal ratings increased the accuracy of these predictions. Studies of teacher evaluation systems in Cincinnati, Ohio, and Washoe County, Nevada, also found that value-added measures and well-done evaluations based on principal observations produced similar results (Milanowski, Kimball, & White, 2004).

What's One to Do?


From the federal government to foundations, the pressure is on to use student test score gains to evaluate teachers. Yet doing so in a credible and fair way is a complex and expensive undertaking, with no guarantee that the intended improvements in teaching and learning will result. What's more, it is not clear that value-added measures yield better information than more traditional teacher evaluation practices do. The complexity and uncertainty of measuring student achievement growth, and of deciding how much responsibility for gains to attribute to the teacher, argue against using such measures for high-stakes decisions about individuals.

To protect teachers from erroneous and harmful judgments, a consensus is emerging that we need multiple measures that tap evidence of good teaching practices as well as a variety of student outcomes, including but not limited to standardized test score gains. According to a recent study (Coggshall, Ott, & Lasagna, 2010), most teachers support such a multiple-measures approach. Investing in expensive data analysis systems may not be as important as investing in ways of measuring teacher effectiveness that can identify the specific supports teachers need to improve their practice.

The Nation: Teachers Matter. Now What?


by DANA GOLDSTEIN

[Photo: Spencer Platt/Getty Images. City officials, state lawmakers and union officials protest proposed cuts of the New York Department of Education on Oct. 4, 2011 in New York City. New research suggests that teachers do matter in improving learning.]

January 17, 2012

Dana Goldstein is a writer for The Nation.

Last month, economists at Harvard and Columbia released the largest-ever study of teachers' "value-added" ratings, a controversial mathematical technique that measures a teacher's effectiveness by looking at the change in his students' standardized test scores from one year to the next, while controlling for student demographic traits like poverty and race. Raj Chetty, John Friedman, and Jonah Rockoff analyzed the test scores and family tax returns of 2.5 million Americans over a twenty-year period, from 1989 to 2009. The team concluded that students who have teachers with high value-added ratings are more likely to attend college and earn higher incomes, and are less likely to become pregnant teens. In a rare instance of edu-wonk consensus, both friends and skeptics of standardized tests are praising the study as reliable and groundbreaking.

Indeed, these findings raise several interesting questions about how to evaluate and pay teachers, one of the most controversial topics in American urban politics. In his annual State of the City speech last Wednesday, New York Mayor Mike Bloomberg cited the new research as he promised annual bonuses of up to $20,000 for teachers rated "highly effective," based partially on value-added measures and partially on principals' judgments. In a move that befuddled many casual observers of the education debate, the New York City teachers' union, the United Federation of Teachers, immediately opposed the proposal.

If we now know teacher effectiveness has a real, measurable impact on both student academic achievement and life outcomes like teen pregnancy, why aren't teachers' unions supporting plans to pay teachers with high value-added ratings more money? Pundits like Nick Kristof and the Daily News editorial page have jumped in to claim that the new research justifies merit pay plans like Bloomberg's, and the one instituted by former chancellor Michelle Rhee in Washington, DC.

The policy implications of the Chetty, Friedman, and Rockoff paper are, however, far from clear. As the researchers note in their conclusion, their study was conducted in a low-stakes setting, one in which student test scores were used neither to evaluate nor pay teachers. In a little-noticed footnote (#64) on page 50, the economists write: "even in the low-stakes regime we study, some teachers in the upper tail of the VA [value-added] distribution have test score impacts consistent with test manipulation ... If such behavior becomes more prevalent when VA is actually used to evaluate teachers, the predictive content of VA as a measure of true teacher quality could be compromised." [Emphasis added.]

The importance of this caveat cannot be overstated. As I've written in the past, there is evidence of increased teaching-to-the-test, curriculum-narrowing and outright cheating nationwide since the implementation of No Child Left Behind, which put an unprecedented focus on the test scores of disadvantaged children.

Despite these concerns about testing, the United Federation of Teachers has agreed in principle to a new evaluation system that depends in part on value-added; a similar system, after all, is already in place for determining whether teachers earn tenure. Negotiations between the union and the city are stalled not because, in the words of the Daily News, the union has "placed protecting the jobs of incompetents over the future financial well-being of children," but because the union would like teachers who receive an "unsatisfactory" rating under the new system to have the right to file an appeal to a neutral arbitrator. Currently, the city Department of Education determines whether to hear appeals of teacher evaluations, and it rejects 99.5 percent of the appeals filed.

Given the widespread, non-ideological worries about the reliability of standardized test scores when they are used in high-stakes ways, it makes good sense for reform-minded teachers' unions to embrace value-added as one measure of teacher effectiveness, while simultaneously pushing for teachers' rights to a fair-minded appeals process. What's more, just because we know that teachers with high value-added ratings are better for children, it doesn't necessarily follow that we should pay such teachers more for good evaluation scores alone. Why not use value-added to help identify the most effective teachers, but then require these professionals to mentor their peers in order to earn higher pay? That's the sort of teacher "career ladder" that has been so successful in high-performing nations like South Korea and Finland, and that would guarantee that excellent teachers aren't just reaching twenty-five students per year but are truly sharing their expertise in a way that transforms entire schools and districts.

N.Y. Suspends Federal Grants For 10 School Districts


by LARRY ABRAMSON

January 5, 2012

New York state announced this week that it has suspended millions of dollars in federal grants for 10 school districts because they failed to reach agreements over new evaluations for teachers and principals. New York City schools could lose $60 million.
Copyright 2012 National Public Radio.

ROBERT SIEGEL, HOST: Now, to the issue of education funding. In New York state, the commissioner of education has suspended $100 million in federal grants. The money was supposed to go to struggling schools in 10 districts, but the districts were required to come up with new teacher evaluation systems, and they missed their deadline. As NPR's Larry Abramson reports, the cuts could stall the effort to turn around New York state's lowest-performing schools.

LARRY ABRAMSON, BYLINE: Yonkers Public Schools, just north of New York City, is home to 26,000 students and two chronically failing schools. The Department of Education in Washington, D.C., gave New York state around $4 million to help Yonkers turn those schools around. But because of a dispute, Yonkers Superintendent Bernard Pierorazio says he may have to pull the plug on those projects.

BERNARD PIERORAZIO: We have 19 staff members attached to this grant, you know, as well as contracts with national consulting groups that are coming in and working with our staff.

ABRAMSON: The reason for the holdup? Yonkers and nine other districts have not come to a final agreement with the unions on new evaluation systems for teachers and principals. Those evaluation systems were mandated by state law over a year ago, and they have to link teacher assessment to students' performance. But negotiations are complicated. Pat Puleo, of the Yonkers Federation of Teachers, says it's hard to figure out how to evaluate teachers in subjects that are not tested.

PAT PULEO: Guidance counselors, art teachers, music teachers - there are not state tests that the children take, so we have to come up with a rainbow of evaluation systems for everyone.

ABRAMSON: Puleo says the state knew these talks would take time. But then, over the holidays, New York Education Commissioner John King announced districts must finish negotiations or risk losing millions of dollars. King says one of the reasons these schools are failing is the lack of effective evaluations.

JOHN KING: The evaluations are really about a vehicle for improving student achievement and obviously, that's particularly urgent in these schools who've performed so poorly for so long.

ABRAMSON: Some districts are further along than others. They all can appeal the commissioner's decisions to suspend those grants. But in the meantime, they have to figure out whether to lay off people hired with that money, and then possibly hire them back. The dispute also threatens millions in federal aid under Race to the Top, another big federal effort that depends heavily on new evaluation systems for teachers and school leaders. Larry Abramson, NPR News.

Civic Report
No. 70 September 2012

Transforming Tenure: Using Value-Added Modeling to Identify Ineffective Teachers


Marcus A. Winters, Senior Fellow, Manhattan Institute for Policy Research

Executive Summary


Public school teachers in the United States are famously difficult to dismiss. The reason is simple: after three years on the job, most receive tenure after a brief and subjective evaluation process (typically, a classroom visit or two by an administrator or another teacher) in which few receive negative ratings. Once tenured, teachers are armored against efforts to remove them, and most do not face any serious reevaluation to ensure that their skills stay up to standard. With this traditional approach, tenured teachers sometimes lose their positions for insubordination, criminal conduct, gross neglect, or other reasons, but almost never for simply being bad at the job.

This state of affairs protects teachers (both good and bad) quite well but is clearly harmful to students. The effects of a poor teacher, research has shown, haunt pupils for years afterward. Being assigned to such a teacher reduces the amount that a student learns in school and is associated with lower earnings in adulthood (in part because having an inadequate teacher makes a child more likely to have an early pregnancy and less likely to go to college). An education system that protects bad teachers does a grave disservice to the children in its care.


In recent years, some school districts have experimented with changes in tenure rules. They seek the power to remove ineffective teachers and, in some jurisdictions, to reevaluate teachers throughout their careers. A keystone of this reform movement is the replacement of subjective evaluation with quantifiable measures of each teacher's effectiveness. The quantitative method is known as value-added modeling (VAM), a statistical analysis of student scores that seeks to identify how much an individual teacher contributes to a pupil's progress over the years. The use of VAM in teacher evaluations is growing, but the method remains extremely controversial. Critics often claim that it does not and cannot measure actual teacher quality. This paper addresses that claim.

Part I analyzes data from Florida public schools to show that a VAM score in a teacher's third year is a good predictor of that teacher's success in his or her fifth year. Having established that VAM is a useful predictive tool, Part II of the paper addresses the most effective ways that VAM can be used in tenure reform. VAM is not a perfect measure of teacher quality because, like any statistical test, it is subject to random measurement errors. So it should not be regarded as the magic-bullet solution to the problem of evaluating teacher performance. However, the method is reliable enough to be part of a sensible policy of tenure reform, one that replaces automatic tenure with rigorous evaluation of new candidates and periodic reexamination of those who have already received tenure.

About the Author


Marcus A. Winters is a senior fellow at the Manhattan Institute and an assistant professor at the University of Colorado, Colorado Springs. He conducts research and writes extensively on education policy, including topics such as school choice, high school graduation rates, accountability, and special education. Winters has performed several studies on a variety of education policy issues, including high-stakes testing, performance pay for teachers, and the effects of vouchers on the public school system. His research has been published in the journals Educational Evaluation and Policy Analysis, Education Finance and Policy, Economics of Education Review, Teachers College Record, and Education Next. His op-ed articles have appeared in numerous newspapers and magazines, including The Wall Street Journal, The Washington Post, USA Today, the New York Post, the New York Daily News, the Weekly Standard, and National Affairs. He is often quoted in the media on education issues. Winters received a B.A. in political science from Ohio University in 2002, and a Ph.D. in economics from the University of Arkansas in 2008.

Introduction
Tenure and the Problem of Teacher Quality

Bad teachers substantially harm a child's prospects. Studies have found that an ineffective teacher can cost pupils as much as a grade level's worth of learning during a single school year.[1] Further, bad teachers (those who do not make any measurable contribution to their students' advancement) make students more likely to have an early pregnancy, reduce the chances that they will go to college, and have a negative impact years later on their pupils' earnings as adults.[2] A wide body of research has shown that even as teacher quality is a school's most important driver of achievement, teacher quality varies a great deal from classroom to classroom in public schools.[3]

Since 2009, a few school districts around the nation have been experimenting with changes to business as usual, seeking ways to improve the quality of their teachers. Though these districts remain a small minority, the reform effort has gathered steam, especially in the past year. One of its most controversial suggestions is the redefinition, or even elimination, of tenure for public school teachers.

For years, teachers unions and their supporters have described tenure as a necessary bulwark against arbitrary or discriminatory termination, which was a common practice before the advent of modern employment law and labor standards. But the current tenure system protects bad teachers as well as good ones. (Very few tenured teachers are ever forced to leave the classroom.) We know that teachers vary in quality and that removing less competent teachers has the potential to improve students' education. Therefore, we can be sure that pupils are ill-served by a system that ensures that bad teachers cannot be fired.

As its defenders like to point out, tenure ensures only that teachers receive due process before they are terminated. However, in most school systems, the required due process is so burdensome, and has so small a chance of success, that in practice poor performance is rarely a firing offense. To be rid of a teacher for poor performance, in most public school systems, an administrator must carefully document several proofs of incompetence over a sustained period of time, in the form of botched lesson plans, improper classroom development, and observed poor practice. These proof points are inherently subjective, and each is contestable in the hearing process. Meanwhile, measurements of actual outcomes (how much students have learned in the teacher's classroom) are rarely considered. This is why poor classroom performance is so rarely cited as a reason for dismissal. For instance, competence was mentioned in only eight of the 45 cases in which tenured teachers were terminated in New York City in 2008 and 2009. And six of those eight included other charges such as insubordination or misconduct.[4]

One might argue that worthy teachers with good records have earned some protection against the effects of a personal crisis or a rough year in the classroom. Tenure, though, is not reserved for proven educators. On the contrary, public school teachers are offered lifetime tenure very early in their careers (usually after three years), and the offer seldom has much to do with their performance. As of 2011, according to a review of tenure laws by the National Council on Teacher Quality (2011), only eight states require that the performance of a teacher's students be central to deciding whether to award a teacher tenure. That actually represents considerable progress, since in 2009 the NCTQ found that not a single state awarded tenure primarily based on effectiveness. Moreover, in most American public schools, that early-career tenure decision is often the only systematic examination of a teacher's worth. Tenured teachers are rarely reexamined to ensure that their skills are maintained.

Why are measurements of effectiveness given so little weight in tenure processes? The simple answer is that, until recently, such measures did not exist. Tenure rules were written when performance was evaluated entirely on the basis of a classroom visit or two by an experienced observer. School systems simply lacked any objective measure of the teacher's contribution to student learning. Today, better measuring tools exist, but the rules remain as written. When tenure is decided, nearly all the teachers in a typical school system receive a satisfactory or higher rating.[5]

School systems need a better approach to tenure. Job protection, if it is to be offered at all, should be restricted to the best teachers. And policies should permit reevaluations, lest once-worthy teachers be protected long after their performance has faltered. Most important, tenure should be related to meaningful and objective measurements of teaching effectiveness.

On this last point, modern statistical tools present a promising avenue for reform. These measures, used in tandem with traditional subjective measures of teacher quality, could help administrators make better-informed decisions about which teachers should receive tenure and which should be denied it. Statistical evaluations can also be used to identify experienced teachers who are performing poorly, with an objectivity that reduces the risk of a teacher being persecuted by an administrator.

To those dissatisfied with the status quo, one technique in particular seems to offer a good basis for reform, and it has been implemented in many recent attempts to change tenure rules in order to improve teacher quality. It is the method known as value-added modeling (VAM). VAM uses a complex statistical procedure to determine each teacher's independent contribution to improvement in his or her students' test scores. Many school systems across the nation have recently used, or are currently considering using, VAM assessments when making employment decisions. For instance, under new laws passed in Colorado in 2010, Tennessee in 2011, and just recently in New Jersey, teachers in those states will lose their tenure if they receive below-satisfactory performance ratings in two consecutive years. Those ratings are based, in part, on VAM.

Some worry that because VAM is an imperfect measure of classroom effectiveness, it will incorrectly deny tenure protections to some effective teachers, or even cause good teachers to lose their jobs. If so, VAM's negative impact might cancel out its benefits and result in no net improvement in the quality of a school district's teaching staff. After all, research shows that VAM is an imprecise measure of a teacher's true performance.[6]

For this report, I test the premise that a teacher's VAM score can help predict his or her future performance. I use data from Florida to replicate recent analyses by two scholars, Dan Goldhaber and Michael Hansen (2010), who used data from North Carolina. Consistent with their research, my results show that pre-tenure VAM scores are significantly related to student test-score performance in the teacher's classroom in later years. These results indicate that VAM often contains meaningful information about a teacher's future effectiveness, which can usefully inform employment decisions.

Obviously, the potential effects of any VAM-based tenure-reform policy would depend upon its design. Accordingly, the second part of this report looks at the number and type of teachers who would have been removed from the classroom (deselected) rather than tenured under different sorts of VAM-based policies, had those policies been in place in Florida when the data were collected. These comparisons show that the effects of such policies on teacher quality will depend on the standard that a teacher must meet to receive a satisfactory rating and on whether a teacher can lose tenure after it has been granted. These design issues, though important, should not obscure the fundamental point: VAM-based tenure policies hold considerable promise for removing consistently ineffective teachers and thus improving teacher quality throughout the public school system.
Before considering the method and results from this report, it is worth emphasizing that though the analysis here focuses only on the influence of VAM on teacher tenure decisions, real-world policies will quite sensibly use VAM as only one measure of effectiveness when rating teachers. Therefore, this report has put VAM-based tenure policies to a hard test: by evaluating the effect of using VAM alone to identify and remove ineffective teachers, it has placed more reliance on VAM than a real district would. That the VAM approach passes this test is a striking indication of its usefulness.

It is important to recall that this analysis was created to test the ability of VAM to identify low-performing teachers under the structure of the current system. That is, the analysis assumes that teachers and school systems will not respond to the new rules by changing their other behaviors. This is unlikely to be the case in any real-world application of tenure reform. Instead, teachers could reasonably be expected to respond to a reformed tenure system in several ways. The reformed system might, for example, attract a different sort of candidate. Further, teachers could respond to the new possibilities (not receiving tenure, or being removed from the classroom) in ways that are good for students (by increasing their effort level), or that have unpredictable effects (changing their teaching style), or that could have negative effects (emphasizing only testable material in the classroom). Additional theoretical and empirical research is needed to map the real-world effects of incorporating VAM-based measures of teacher quality into employment decisions. However, understanding the ability of VAM to predict future performance and the type of teacher identified as ineffective by a VAM-based system is an essential first step.

Balancing the Needs of Teachers and Pupils

Though VAM is a powerful technique, it is undoubtedly an imperfect measure of a teacher's effectiveness. VAM is limited partly because it considers student performance only as measured by standardized tests, which are themselves imperfect measures of student achievement and account for only part of what school systems ask teachers to do. But even as a measure of the teacher's contribution to student test scores, VAM has potentially serious limitations. Critics of VAM analysis rightly point out that, as a statistical tool, VAM must contend with measurement error: the inevitable fact that measurements of the same thing, taken at different times, will vary, and some of this variation will be essentially random. VAM-based measures of teacher performance can be quite imprecise. When VAM is used to inform tenure decisions, it is likely that some average and even above-average teachers could be removed from the classroom because of a low VAM score caused by random variation in measurement over the years, rather than their own failures. The influence of measurement error can be mitigated by statistical adjustments and by incorporating multiple years of student performance when evaluating any particular teacher. But measurement error cannot be eliminated.

From the perspective of teachers (and their unions), the collateral damage of even a single teacher losing tenure from an inaccurately low VAM score is unacceptable. However, the issue is not as cut-and-dried from the perspective of the student. A tenure-reform policy based on VAM will be an improvement for students if it removes enough low-performing teachers to improve overall teacher quality in a school district. If student achievement is our most pressing concern, we need to consider the possible consequences of VAM-based policies on whole districts, even as we acknowledge the potential for error in individual cases. No evaluation system creates a perfect measure of an employee's productivity. VAM, then, should not be judged against a nonexistent ideal but rather evaluated for its potential to improve on the current system's ability to predict future performance. In the analyses that follow, this was my goal: to assess whether a tenure policy based on VAM would tend to improve a school district's overall teacher quality.

Part I: VAM Is a Reliable Predictor of Future Performance


Following Goldhaber and Hansen's work from North Carolina, my primary analysis uses a simple value-added model to estimate a teacher's contribution to student test scores during the first two years in the classroom. I then evaluate the relationship between this measure and the achievement of students in the teacher's classroom during his or her fifth year. If the previous VAM measure of teacher quality is a significant predictor of the teacher's later achievement, we can conclude that VAM provides reliable information about a teacher's future performance.

The analyses use detailed data about Florida students' performance on the state's annual high-stakes math and reading exams, the Florida Comprehensive Assessment Test (FCAT), in the spring semesters from 2002 through 2009.[7] Though individuals are not identified by name, the data set permits the analyst to follow the performance of each student over time. It also includes identifying variables for each teacher and a variable used to match students to teachers in classrooms.

My analyses include only students in the fourth and fifth grades. In later grades, students change teachers for each subject, making the assessment of teacher impact far more difficult. Further, testing in Florida begins in the third grade, and the analysis requires a baseline achievement score for the year before the study period. Therefore, grades before fourth are not available for this method.

I used student reading scores to create a simple value-added model by grade and year (a later check showed that results would be similar had I used math scores). The model accounted for the impact on test scores of such observed student characteristics as race/ethnicity, gender, and socioeconomic status (as measured by whether the children were eligible for free or reduced-price lunches). After controlling for these and other variables, I was able to arrive at the estimated contribution of individual teachers to their students' test scores.[8]

With a measure of teacher impact in place for each student, I could then look at the data at the teacher level to develop a rolling measure of each teacher's quality over the years. As we have mentioned, most school systems offer tenure after three years in the classroom. Therefore, I calculated each teacher's average VAM score during his or her first three years in the classroom. Finally, I took the measure of each teacher's average value-added score during his or her first three years back to the student-year data set. I used the VAM from those first three years to help predict each teacher's students' achievement in the teacher's fifth year (the 2007-08 school year). What I was looking for was a significant and meaningful relationship between pre-tenure VAM score and the performance of students in the teacher's future classroom years later.

Relationship Between Pre-Tenure VAM and Later Student Performance

The results of the analysis are reported in Table 1. The first column reports the results from a regression analysis (a statistical method for showing the relationship among several variables) in which I mapped the relationship between student achievement and a teacher's having a master's degree. (The master's is often used as a proxy for skill and commitment in current evaluation systems.) Consistent with previous research, I find no relationship between a teacher having a master's degree and student outcomes. The second column reports the results of a regression analyzing the relationship between the teacher's average VAM score during the first three years in the classroom and the performance of that teacher's students during his or her fifth year in the classroom. The result shows a statistically significant and substantial relationship between the teacher's pre-tenure average VAM score and achievement in that teacher's classroom several years later. The third column shows that a control for whether the teacher has a master's degree has no meaningful influence on the finding.

The results reported in Table 1 demonstrate that the value-added assessment of the teacher's effectiveness prior to the tenure decision is a significant predictor of the teacher's later effectiveness. Thus, VAM measures early in a teacher's career appear to be good predictors of how well a teacher will perform in the future. As mentioned, this result is consistent with the previous findings of Goldhaber and Hansen, who used data from North Carolina; it is important to note that data from another state's school system, based on a different standardized test, show the same relationship between early-career VAM scores and later student success.
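The pipeline Part I describes can be summarized in a compressed, hypothetical sketch. This is not the report's actual code or data: simulated teacher effects stand in for the FCAT-based residual estimates, and a single regression stands in for the full statistical model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 200

# Hypothetical "true" teacher quality, noisy yearly VAM estimates for
# years 1-3, and a year-5 classroom outcome driven by the same quality.
true_quality = rng.normal(0.0, 0.15, n_teachers)
vam_years_1_to_3 = true_quality[:, None] + rng.normal(0.0, 0.10, (n_teachers, 3))
year5_class_score = true_quality + rng.normal(0.0, 0.10, n_teachers)

# Pre-tenure measure: average VAM over the first three years.
pretenure_vam = vam_years_1_to_3.mean(axis=1)

# Regress year-5 classroom performance on the pre-tenure average VAM.
X = np.column_stack([np.ones(n_teachers), pretenure_vam])
(intercept, slope), *_ = np.linalg.lstsq(X, year5_class_score, rcond=None)
corr = np.corrcoef(pretenure_vam, year5_class_score)[0, 1]
print(f"slope = {slope:.2f}, correlation = {corr:.2f}")
# A clearly positive slope/correlation is the analogue of the report's
# finding that pre-tenure VAM predicts later classroom performance.
```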

Part II: Comparing the Effects of Different VAM-Based Policies

Accepting that VAM can help predict future success for teachers, I turn to the next practical question for school districts: How should VAM be incorporated into tenure policy?

Policymakers must first consider the level of performance that a teacher has to meet to avoid an ineffective rating. This bar must not be set too low, or the VAM will have little impact on quality. For instance, a VAM-based policy that removes a large school district's single worst teacher might have a substantial effect for the few students who would have been assigned to that teacher's classroom but would have an infinitesimal effect on overall teacher quality throughout the school system.

A second issue to consider is whether a teacher who receives tenure under a reformed system would keep it going forward (as is currently the case) or whether teachers could be continually reviewed. If tenure continues to be decided in teachers' third year on the job and they experience no further significant reviews, the impact of any quality-improvement effort will be limited to teachers at the start of their careers. This means that the policy might affect too few teachers and do nothing about older teachers whose effectiveness is fading.

Finally, policymakers must consider how to use multiple years of VAM scores to assign tenure or identify teachers for removal. The measurement error inherent in VAM analysis, along with other administrative issues, should lead school systems to use multiyear measures when making employment decisions. Policymakers could respond to this need by comparing teacher performance using the average VAM score over a multiyear period or, as districts in Colorado and Tennessee have already done, by removing teachers after they receive consecutive poor ratings.

Table 2 reports the number of students in Florida who were attached to teachers who would have been fired according to different versions of a VAM-based policy: first, one that removes a teacher who has received a poor rating based on the previous three years' performance; second, a policy that removes teachers only after they have demonstrated below-standard performance during their first three years in the classroom; and third, a policy that removes teachers who perform below a particular standard during consecutive years.

The table shows that different versions of a tenure-reform policy would benefit different numbers of students. As would be expected, policies that simply raise the VAM score considered acceptable will affect a greater number of teachers, and thus students. Similarly, the most impactful policy is one that affects all teachers, regardless of whether they have previously been granted tenure. The table also shows that the most conservative policythat is, the policy that leads to the fewest teacher removalsremoves teachers based on consecutive bad ratings rather than their average rating relative to other teachers during a multiyear period. That result occurs because under a system based on consecutive poor ratings, teachers who earned a single low ratingperhaps because of random errorhave the opportunity to correct the result by meeting the standard the next year. On the other hand, a policy that removes all teachers whose average score is below a particular percentile will always remove that percentage of teachers. By definition, a policy that removes teachers whose average VAM is below the fifth percentile of all average VAM scores during that period will remove 5 percent of the teachers, while a policy that removes teachers if they consecutively score below the fifth percentile will keep a teacher who scores in the third percentile during one year and the seventh percentile the next. The effect of a tenure-reform policy on overall teacher quality in the school system depends both on the number and quality of teachers denied tenure under such a policy. Figures 1 through 9 compare the distribution of the 200809 VAM scores of teachers who would have been deselected

Figures 1 through 9 compare the distribution of the 2008-09 VAM scores of teachers who would have been deselected at the end of the 2007-08 school year, according to these different systems, with the scores of teachers who would have avoided removal. Though each figure represents a different policy, all show that teachers who would have been fired in 2008-09 were less effective than teachers who would have survived review. However, the figures also illustrate that some teachers who were observed to be performing at or above the mean in 2008-09 would have been fired under any version of tenure reform. The risk of firing teachers whose later performance is above average increases as the standard for failure is set higher. For example, a policy that removes teachers performing below the 25th percentile sets a higher standard than a policy that removes those scoring below the fifth percentile, but it is also more likely to remove teachers whose later effectiveness would prove to be well above average.

The figures also enable us to compare the later performance of teachers who would have been deselected under different policy designs. As in Table 2, we consider the quality of teachers deselected under a policy that: a) removes any teacher whose average VAM score over a three-year period was below the Xth percentile; b) removes only entering fourth-year teachers whose average VAM score over their first three years was below the Xth percentile among all teachers; or c) removes any teacher with a VAM score below the Xth percentile among all teachers in consecutive years. The figures illustrate that the most conservative policy design (that is, the one least likely to remove teachers who later perform well in the classroom) removes those teachers who score below the Xth percentile in consecutive years. As Table 2 illustrates, this is also the design that removes the smallest number of teachers. A policy that removes any teacher whose average VAM score over a three-year period is below the Xth percentile, by contrast, will tend to remove more teachers who would later demonstrate themselves to be effective, though even this policy will tend to remove more ineffective teachers than effective ones.
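The pattern just described can be reproduced in a toy simulation. The data below are entirely synthetic; the normal model, the noise level, and the sample size are assumptions made purely for illustration:

    import random

    random.seed(0)

    # Toy model: each teacher has a fixed true effectiveness, and each
    # observed annual VAM score is truth plus noise. We remove teachers
    # whose three-year average falls below a percentile bar, then count
    # how many removed teachers score above average in a later year.
    N = 10_000
    true_quality = [random.gauss(0, 1) for _ in range(N)]

    def observe(quality):
        return quality + random.gauss(0, 1)  # one noisy annual VAM score

    avg3 = [sum(observe(q) for _ in range(3)) / 3 for q in true_quality]
    later = [observe(q) for q in true_quality]
    sorted_avg = sorted(avg3)

    for pct in (5, 25):
        bar = sorted_avg[N * pct // 100]
        removed = [l for a, l in zip(avg3, later) if a < bar]
        above_average = sum(l > 0 for l in removed)
        print(f"{pct}th-percentile bar: {len(removed)} removed, "
              f"{above_average} of them above average later")

Raising the bar from the 5th to the 25th percentile removes five times as many teachers and, with them, many more teachers whose later scores turn out to be above average, which is the trade-off the figures document.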

Conclusion
Like previous research based on North Carolina data, my analysis of Florida data found that pre-tenure VAM scores often provide information about a teacher's future quality. Thus, VAM analysis can help replace automatic tenure with employment decisions based on reliable evaluations. It can be part of tenure reform and thus can contribute to improving public education in the United States.

But which tenure-reform policies would make best use of this technique? I addressed this question by pinpointing the teachers in the Florida data who would have been removed from the classroom under several different types of policies and performance standards. I found that any VAM-based policy would have removed teachers who, on average, performed worse than their peers later in their careers. However, different versions of VAM-based policies proved to have different consequences. Specifically, certain versions increased the risk that effective teachers (as measured by VAM) would be removed. For example, a policy could target teachers for removal after two or more consecutive periods of poor performance. Alternatively, a policy could simply score teachers on the average of their performance ratings over a given number of years. I found that the latter policy was more likely than the former to result in the removal of effective teachers (teachers who, despite a bad patch in the records, would prove to be effective later). Another way to increase this risk of false positives, I found, was to set the performance bar high. Such policies, applied to the Florida data, would also have resulted in the removal of teachers who would later demonstrate effective performance.

These results tell tenure reformers that they should consider the number and type of teachers likely to be denied tenure or removed from the classroom under their proposed policies. This will help them design policies that balance the interests of students in need of great teachers against the legitimate interests of teachers concerned that they will be inappropriately removed from the classroom because of a randomly low VAM score.

The need for well-designed policies should not obscure the finding that public schools can indeed use VAM to help identify teachers for tenure or removal. Rather, these results underscore the importance of blending VAM with sound policies. This report does not argue that VAM should be used in isolation to evaluate teachers for tenure or to make any other employment decision. VAM, as we have seen, is subject to random measurement error, and so must be combined with other methods of teacher evaluation (one hypothetical illustration follows).
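As one hypothetical illustration of what blending VAM with other methods might look like (the components and weights below are invented for this sketch, not recommendations from the report):

    # Hypothetical composite evaluation in which VAM is only one input.
    # Component names and weights are illustrative assumptions.
    def composite_rating(vam_pct, observation_pct, survey_pct,
                         w_vam=0.4, w_obs=0.4, w_survey=0.2):
        # Weighted blend of percentile-scaled components (0-100).
        return w_vam * vam_pct + w_obs * observation_pct + w_survey * survey_pct

    # A teacher with one noisy, low VAM year but strong classroom
    # observations is not automatically flagged as ineffective.
    print(composite_rating(vam_pct=8, observation_pct=70, survey_pct=60))
    # about 43.2

Under a design like this, a randomly low VAM score moves the composite down but cannot, by itself, push an otherwise strong teacher to the bottom of the distribution.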

The lesson of this report and of other research is that VAM can be a useful piece of a comprehensive evaluation system. Claims that it is unreliable should be rejected. VAM, when combined with other evaluation methods and well-designed policies, can and should be part of a reformed system that improves teacher quality and thus gives America's public school pupils a better start in life.

Appendix Endnotes

1.
2. E.g., Hanushek (1992) finds that students assigned to a teacher whose students have results in the 75th percentile (i.e., whose scores are better than three-quarters of their fellow pupils) will test a year and a half ahead of where they started when the school year is over. Students with teachers in the 25th percentile, on the other hand, end up with scores that are only a half-year better than their starting point.
3. Chetty, Friedman, and Rockoff (2011).
4. See Hanushek and Rivkin (2010).
5. E-mail correspondence with the Department of Education.
6. Weisberg, Sexton, Mulhern, and Keeling (2009).
7. See, e.g., McCaffrey, Sass, Lockwood, and Mihaly (2009).
8. The analyses use a rich student-level panel data set acquired from the Florida K-20 data warehouse. Consistent with previous research, I adjust the teacher effects using the empirical Bayes estimator.
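Endnote 8 mentions an empirical Bayes adjustment of the estimated teacher effects. For readers unfamiliar with the technique, the standard shrinkage estimator takes roughly the following form (a textbook sketch; the report does not spell out its exact specification):

    \hat{\theta}_j^{EB} = \lambda_j \, \hat{\theta}_j,
    \qquad
    \lambda_j = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_\varepsilon^2 / n_j}

Here \hat{\theta}_j is the raw estimated effect for teacher j, \sigma_\theta^2 is the variance of true teacher effects, \sigma_\varepsilon^2 is the student-level error variance, and n_j is the number of students contributing to the estimate. Effects estimated from fewer students are shrunk more strongly toward the mean, which damps exactly the kind of random measurement error discussed throughout the report.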

Great Schools for America
www.greatschoolsforamerica.org

Value Added teacher evaluation: all you ever wanted to know

This week the New York Times published teacher rankings of 18,000 New York City teachers.

The ratings, known as teacher data reports, covered three school years ending in 2010. They are intended to show how much value individual teachers add by measuring how much their students' test scores exceeded or fell short of expectations based on demographics and prior performance. Such value-added assessments are increasingly being used in teacher evaluation systems, but they are an imprecise science. For example, the margin of error is so wide that the average confidence interval around each rating spanned 35 percentiles in math and 53 in English, the city said. Some teachers were judged on as few as 10 students. (A back-of-the-envelope sketch of why such small samples produce huge margins of error appears after the roundup items below.)

Evaluators deserve a failing grade on their value-added system, but it seems only teachers must be held accountable for the work they do. How do these geniuses rank teachers? And remember, this is a ranking system, not a scoring system: in a ranking, someone has to be at the bottom and someone at the top, and everyone else falls in between. Ranking does not give teachers an A, B, C, D, or failing grade based on desired criteria. The late Gerald Bracey explained the difference in "Some Common Errors in Interpreting Test Scores." Of course, that article only discusses the evaluation of students, schools, districts, and states. At the time it was written, Bracey could not have conceived of government officials ranking teachers with a rating system as draconian as value-added evaluation.

Accumulated here are articles explaining the ridiculous value-added system that has, for some inexplicable reason, gained legitimacy. Also included are responses from teachers and parents.

NYC Teacher Evaluations Released

Ratings are out for some 12,700 fourth- to eighth-grade New York City public schoolteachers. Called teacher data reports, they were released to the public for the first time ever Friday afternoon. The data is old, from 2007-2010, and about 30% of the teachers listed no longer work for NYC schools.

The teacher and the consultant

Value-added measures to judge a teacher's worth: what's that all about? If we would only listen to teachers.

Value-added measures sound fair, but they are not. In this video, Prof. Daniel Willingham describes six problems (some conceptual, some statistical) with evaluating teachers by comparing student achievement in the fall and in the spring.

Pearson wants to control the world's curriculum and testing. (Gag factor: ipecac)
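As promised above, here is a back-of-the-envelope sketch of why ratings built on a handful of students carry such wide margins of error. The normal approximation and the student-level standard deviation of 1.0 are illustrative assumptions, not parameters from the city's model:

    import math

    # The standard error of a class-average residual shrinks only with
    # the square root of the number of students, so small classes yield
    # very wide 95% confidence intervals around a teacher's estimate.
    student_sd = 1.0  # assumed student-level SD, in test-score units

    for n in (10, 30, 100):
        se = student_sd / math.sqrt(n)
        half_width = 1.96 * se  # 95% confidence interval half-width
        print(f"n = {n:3d} students: estimate +/- {half_width:.2f} score units")

A rating based on 10 students comes with an interval roughly three times as wide as one based on 100, which is why small-sample ratings bounce around so much from year to year.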

http://www.greatschoolsforamerica.org/gsa-wp/all-you-never-wanted-to-know-aboutvalue-added-teacher-evaluation/
