
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 38, NO. 1, JANUARY/FEBRUARY 2012

How We Refactor, and How We Know It


Emerson Murphy-Hill, Chris Parnin, and Andrew P. Black
Abstract: Refactoring is widely practiced by developers, and considerable research and development effort has been invested in
refactoring tools. However, little has been reported about the adoption of refactoring tools, and many assumptions about refactoring
practice have little empirical support. In this paper, we examine refactoring tool usage and evaluate some of the assumptions made by
other researchers. To measure tool usage, we randomly sampled code changes from four Eclipse and eight Mylyn developers and
ascertained, for each refactoring, if it was performed manually or with tool support. We found that refactoring tools are seldom used:
11 percent by Eclipse developers and 9 percent by Mylyn developers. To understand refactoring practice at large, we drew from a
variety of data sets spanning more than 39,000 developers, 240,000 tool-assisted refactorings, 2,500 developer hours, and 12,000
version control commits. Using these data, we cast doubt on several previously stated assumptions about how programmers refactor,
while validating others. Finally, we interviewed the Eclipse and Mylyn developers to help us understand why they did not use
refactoring tools and to gather ideas for future research.
Index Terms: Refactoring, refactoring tools, floss refactoring, root-canal refactoring.

1 INTRODUCTION

Refactoring is the process of changing the structure of software without changing its behavior. While the practice of restructuring software has existed ever since software has been structured, the term was introduced by Opdyke and Johnson [12]. Later, Fowler popularized refactoring when he cataloged 72 different refactorings, ranging from localized changes such as INLINE TEMP to more global changes such as TEASE APART INHERITANCE [5].
Especially in the last decade, the body of research about
refactoring has been growing rapidly. Examples of such
research include studies of the effect of refactoring on errors
[20] and the relationship between refactoring and software
quality [17]. Such research builds upon a foundation of
previous work about how programmers refactor, such as
what kinds of refactorings programmers perform and how
frequently they perform them.
Unfortunately, this foundation is, in some cases, based
on limited evidence or on no evidence at all. For example,
consider Murphy-Hill and Black's Refactoring Cues tool that
allows programmers to refactor several program elements
at once [8]. If the assumption that programmers frequently
want to refactor several program elements at once holds,
this tool would be very useful. However, prior to the tool
being introduced, no foundational research existed to
support this assumption. As we show in this paper, this case is not isolated; other research also rests on unsupported or weakly supported foundations.

- E. Murphy-Hill is with the Department of Computer Science, North Carolina State University, 890 Oval Drive, Campus Box 8206, Raleigh, NC 27695-8206. E-mail: emerson@csc.ncsu.edu.
- C. Parnin is with the College of Computing, Georgia Institute of Technology, 324808 Georgia Tech Station, Atlanta, GA 30332. E-mail: parnin@gatech.edu.
- A.P. Black is with the Department of Computer Science, Portland State University, PO Box 751, Portland, OR 97207-0751. E-mail: black@cs.pdx.edu.

Manuscript received 2 Mar. 2010; revised 23 July 2010; accepted 21 Dec. 2010; published online 1 Apr. 2011.
Recommended for acceptance by J.M. Atlee and P. Inverardi.
For information on obtaining reprints of this article, please send e-mail to tse@computer.org, and reference IEEECS Log Number TSESI-2010-03-0056.
Digital Object Identifier no. 10.1109/TSE.2011.41.
Without strong foundations for refactoring research, we
can have only limited confidence that research built on
these foundations will be valid in the larger context of real-world software development. As a step toward strengthening these foundations, this paper revisits some of the
assumptions and conclusions drawn in previous research.
Our experimental method takes data from eight different
sources (described in Section 2) and applies several
different analysis strategies to them. The contributions of
our work lie in both the experimental method and in the
conclusions that we are able to draw:

- The RENAME refactoring tool is used much more frequently by ordinary programmers than by the developers of refactoring tools (Section 3.1).
- About 40 percent of refactorings performed using a tool occur in batches (Section 3.2).
- About 90 percent of configuration defaults in refactoring tools are not changed when programmers use the tools (Section 3.3).
- Messages written by programmers in version-control commit logs do not reliably indicate the presence of refactoring in the commit (Section 3.4).
- Programmers frequently floss refactor, that is, they interleave refactoring with other types of programming activity (Section 3.5).
- About half of refactorings are not high level, so refactoring detection tools that look exclusively for high-level refactorings will not detect them (Section 3.6).
- Refactorings are performed frequently (Section 3.7).
- Close to 90 percent of refactorings are performed manually, without the help of tools (Section 3.8).
- The kind of refactoring performed with a tool differs from the kind performed manually (Section 3.9).

In Section 4, we discuss the interaction between these conclusions and the assumptions and conclusions of other researchers.

This paper is an extension of work reported at ICSE 2009 [11], which provided evidence for each of the conclusions
above. The primary weakness of the ICSE work was that
several of our conclusions were based on data from a single
development team. This paper includes analysis of four
additional data sets, with the consequence that every
conclusion drawn here is based on data from at least two
development teams.

2 THE DATA THAT WE ANALYZED

The work described in this paper is based on eight sets of data. The first set we will call Users; it was originally collected in the latter half of 2005 by Murphy et al. [7], who used the Mylyn Monitor tool to capture and analyze fine-grained usage data from 41 volunteer programmers in the wild using the Eclipse development environment (http://eclipse.org). These data capture an average of 66 hours of development time per programmer; about 95 percent of the programmers wrote in Java. The data include information on which Eclipse commands were executed and at what time. Murphy et al. originally used these data to characterize the way programmers used Eclipse, including a coarse-grained analysis of which refactoring tools were used most often. Murphy-Hill and Black have also used these data as a source of evidence for the claim that refactoring tools are underused [10].
The second set of data we will call Everyone; it is publicly
available from the Eclipse Usage Collector [18], and
includes data from every user of the Eclipse Ganymede
release who consented to an automated request to send data
back to the Eclipse Foundation. These data aggregate
activity from over 13,000 Java developers between April
2008 and January 2009, but also include non-Java developers. The data include information on the number of
programmers who have used each Eclipse command
(including the refactoring commands) and how many times
each command was executed. We know of no other researchers who have used these data to investigate programmer behavior.
The third set of data we will call Toolsmiths; it includes
refactoring histories from four developers who primarily
maintain Eclipse's refactoring tools. These data include
detailed histories of which refactorings were executed,
when they were performed, and with what configuration
parameters. These data include all the information necessary to recreate the usage of a refactoring tool, assuming
that the original source code is also available. These data
were collected between December 2005 and August 2007,
although the date ranges are different for each developer.
This data set is not publicly available. The only author that
we know of using similar data is Robbes [16]; he reports on
refactoring tool usage by himself and one other developer.
The fourth set of data we will call Eclipse CVS because it is
the version history of the Eclipse and JUnit (http://junit.org)
code bases as extracted from their Concurrent Versioning
System (CVS) repositories. Commonly, CVS data must be
preprocessed before analysis. This is because CVS does not
record which file revisions were committed in a single
transaction. The standard approach for recovering transactions is to find revisions committed by the same developer


with the same commit message within a small time window [22]; we use a 60-second time window. Henceforth, we use the word revision to refer to a particular version of a file, and the word commit to refer to one of these synthesized commit transactions. We excluded from our sample 1) commits to CVS branches, which would have complicated our analysis, and 2) commits that did not include a change to a Java file.
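The grouping heuristic is simple enough to sketch. Below is a minimal Python sketch of the sliding-window transaction recovery described above; the Revision fields and the Java-file filter are illustrative names of ours, not the scripts actually used in this study.

from dataclasses import dataclass
from typing import List

WINDOW_SECONDS = 60  # the time window used in this paper

@dataclass
class Revision:
    """One CVS file revision (field names are illustrative)."""
    author: str
    message: str
    timestamp: float  # seconds since the epoch
    path: str

def group_into_commits(revisions: List[Revision]) -> List[List[Revision]]:
    """Recover commit transactions: revisions by the same developer with
    the same commit message, each within WINDOW_SECONDS of the previous
    revision in the group."""
    commits: List[List[Revision]] = []
    # Sorting by (author, message, time) clusters candidate transactions,
    # so one linear scan can apply the sliding time window.
    for rev in sorted(revisions,
                      key=lambda r: (r.author, r.message, r.timestamp)):
        last = commits[-1][-1] if commits else None
        if (last is not None
                and rev.author == last.author
                and rev.message == last.message
                and rev.timestamp - last.timestamp <= WINDOW_SECONDS):
            commits[-1].append(rev)
        else:
            commits.append([rev])
    return commits

def touches_java(commit: List[Revision]) -> bool:
    """The paper's second filter: keep only commits that change a Java
    file (branch commits are assumed to be filtered out upstream)."""
    return any(rev.path.endswith(".java") for rev in commit)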
In our experiments, we focus on a subset of the commits
in Eclipse CVS. Specifically, we randomly sampled from
about 3,400 source file commits (Section 3.4) that correspond to the same time period, the same projects, and the
same developers represented in Toolsmiths. Using these
data, two of the authors (Murphy-Hill and Parnin) inferred
which refactorings were performed by comparing adjacent
commits manually. While many authors have mined software repositories automatically for refactorings (for example, Weißgerber and Diehl [20]), we know of no other researchers who have compared refactoring tool logs with code histories.
The fifth set of data we call Mylyn; it includes the
refactoring histories from eight developers who primarily
maintain the Mylyn project, a task-focused interface for
Eclipse (www.eclipse.org/mylyn). The data format is the
same as for Toolsmiths, although we obtained it through
different means; the developers checked their refactoring
histories into CVS while working on the project, so those
histories are publicly available from Mylyn's open-source
code repository. The refactoring history spans the period
from February 2006 to August 2009, though different
developers worked on the project at different periods of time.
The sixth data set we call Mylyn CVS; it is the
version control history corresponding to the Mylyn refactoring history, in the same way that Eclipse CVS corresponds to
Toolsmiths. We analyzed, filtered, randomly drew from, and
inspected the data in the same way as with Eclipse CVS.
The seventh data set is called UDC Events, which is a
subset of the Everyone data containing more detail: Instead
of aggregating counts of Eclipse command uses, UDC
Events contains timestamps for each command usage. These
data are much like the Users data, but include 275,903
developers spanning several weeks in June and July 2009.
The final data set is called Developer Responses. When we
completed our analysis of the first six data sets, we sent a
survey to three developers in the Toolsmiths data set and
four developers in the Mylyn data set. The survey included
several questions about those developers' refactoring habits
and refactoring tool use habits.1 In total, we received two
responses from Toolsmith developers and three responses
from Mylyn developers. We use this qualitative Developer
Responses data to augment the quantitative data in the other
seven data sets.

3 FINDINGS ON REFACTORING BEHAVIOR

In this section, we analyze these eight sets of data and discuss our findings.

1. The survey template appears in Appendix B.

TABLE 1
Refactoring Tool Usage in Eclipse

Some tool logging began in the middle of the Toolsmiths and Mylyn data collection (shown in light gray) and after the Users data collection (denoted
with a *). The - symbol denotes a percentage corresponding to a fraction for which the denominator is zero.

3.1 Toolsmiths and Users Differ


We hypothesize that the refactoring behavior of the
programmers who develop the Eclipse refactoring tools
differs from that of the programmers who use them. Toleman
and Welsh assume a variant of this hypothesis (that the designers of software tools erroneously consider themselves typical tool users) and argue that the usability of software tools should be objectively evaluated [19]. To test our hypothesis, we compared the refactoring tool usage in the Toolsmiths data set against the tool usage in the Users and
Everyone data sets. For this comparison, we will omit the
Mylyn data set because Users and Everyone are more likely to
represent the behavior of people who do not develop
refactoring tools.
In Table 1, the Uses columns indicate the number of
times each refactoring tool was invoked in that data set. The
Use percent column presents the same measure as a
percentage of the total number of refactorings. (The
Batched columns are discussed in Section 3.2.) Notice
that while the rank order of each tool is similar across
Toolsmiths, Users, and Everyone (RENAME, for example, always ranks first), the proportion of uses of the individual
refactoring tools varies widely between Toolsmiths and
Users/Everyone. In Toolsmiths, RENAME accounts for about
29 percent of all refactorings, whereas in Users it accounts
for about 62 percent, and in Everyone for about 72 percent.
We suspect that this difference is not because Users and
Everyone perform more RENAMES than Toolsmiths, but
because Toolsmiths are more frequent users of the other
refactoring tools.
This analysis is limited in two ways. First, each data set
was gathered over a different period of time, and the tools
themselves may have changed between those periods.
Second, the Users data include both Java and non-Java
RENAME and MOVE refactorings, but the Toolsmiths, Mylyn,
and Everyone data report on just Java refactorings. This may inflate the RENAME and MOVE percentages in Users.
3.2 Programmers Repeat Refactorings
We hypothesize that when programmers perform a refactoring, they typically perform several more refactorings of the same kind within a short time period. For instance, a programmer may perform several EXTRACT LOCAL VARIABLES in preparation for a single EXTRACT METHOD or may
RENAME several related instance variables at once. Based on
personal experience and anecdotes from programmers, we
suspect that programmers often refactor multiple pieces of
code because several related program elements need to be
refactored in order to perform a composite refactoring. In
previous research, Murphy-Hill and Black built a refactoring
tool that supported refactoring multiple program elements at
once, on the assumption that this is common [8].
To determine how often programmers do in fact repeat
refactorings, we used the Toolsmiths, Mylyn, and Users data
to measure the temporal proximity of multiple invocations
of a refactoring tool. We say that refactorings of the same
kind that execute within 60 seconds of each other form a
batch. From our personal experience, we think that
60 seconds is usually long enough to allow the programmer
to complete an Eclipse wizard-based refactoring, yet short
enough to exclude refactorings that are not part of the same
conceptual group. Additionally, a few refactoring tools,
such as PULL UP in Eclipse, can refactor multiple program
elements, so a single application of such a tool is an explicit
batch of related refactorings; we measured the median
batch size for these tools.
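To make the batching definition concrete, the sketch below groups tool invocations into batches; the (kind, timestamp) event format is an assumption of ours about the logs, and the threshold is a parameter so that the sensitivity analysis of Fig. 1 can be reproduced by sweeping it.

from typing import Dict, List, Tuple

Event = Tuple[str, float]  # (refactoring kind, invocation time in seconds)

def find_batches(events: List[Event],
                 threshold: float = 60.0) -> List[List[Event]]:
    """A batch is a maximal run of same-kind invocations in which each
    invocation occurs within `threshold` seconds of the previous
    invocation of that kind."""
    open_batches: Dict[str, List[Event]] = {}  # most recent batch per kind
    batches: List[List[Event]] = []
    for kind, t in sorted(events, key=lambda e: e[1]):
        batch = open_batches.get(kind)
        if batch is not None and t - batch[-1][1] <= threshold:
            batch.append((kind, t))  # extend the open batch for this kind
        else:
            batch = [(kind, t)]      # start a new batch
            batches.append(batch)
            open_batches[kind] = batch
    return batches

def percent_batched(events: List[Event], threshold: float = 60.0) -> float:
    """Percentage of invocations that fall in a batch of size >= 2 (the
    quantity plotted against the threshold in Fig. 1)."""
    batches = find_batches(events, threshold)
    in_batch = sum(len(b) for b in batches if len(b) > 1)
    return 100.0 * in_batch / len(events) if events else 0.0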
In Table 1, each Batched column indicates the number
of refactorings that appeared as part of a batch, while each
Batched percent column indicates the percentage of
refactorings that appeared as part of a batch. Overall, we
can see that certain refactorings, such as RENAME, INLINE,
and ENCAPSULATE FIELD, are more likely to appear as part
of a batch, while others, such as EXTRACT METHOD and
PULL UP, are less likely to appear in a batch. In total, we see
that 30 percent of Toolsmiths refactorings, 39 percent of
Mylyn refactorings, and 47 percent of Users refactorings
appear as part of a batch.2
2. We suspect that the difference in percentages arises partially because
the Toolsmiths and Mylyn data sets count the number of completed
refactorings while Users counts the number of initiated refactorings. We
have observed that programmers occasionally initiate a refactoring tool on
some code, cancel the refactoring, and then reinitiate the same refactoring
shortly thereafter [9].


TABLE 2
The Median Number of Explicitly Batched Elements
Used for Several Refactoring Tools, Where n Is the
Number of Total Uses of That Refactoring Tool

The median batch size for explicitly batched refactorings in tools that can refactor multiple program elements varied between tools in both Toolsmiths and Mylyn (Table 2).
Overall, the table indicates that, with the exception of
MOVE, most refactorings are performed in batches.
The main limitation of this analysis is that, while we
wished to measure how often several related refactorings
are performed in sequence, we instead used a 60-second
heuristic. It may be that some related refactorings occur
outside our 60-second window, and that some unrelated
refactorings occur inside the window. To show how
sensitive these results are to the batch threshold, Fig. 1
displays the total percentage of batched refactorings for
several different batch thresholds. Other metrics for detecting batches, such as burstiness, should be investigated in
the future.

3.3 Programmers Often Don't Configure Refactoring Tools
Refactoring tools are typically of two kinds: They either force the programmer to provide configuration information, such as whether a newly created method should be public or private (an example is shown in Fig. 2), or they quickly perform a refactoring without allowing any configuration. Configurable refactoring tools are more common in some environments, such as Netbeans (http://netbeans.org), whereas nonconfigurable tools are more common in others, such as X-develop (http://www.omnicore.com/en/xdevelop.htm). Which interface is preferable depends on how often programmers configure refactoring tools.
Fig. 1. Percentage of refactorings that appear in batches as a function of batch threshold, in seconds. Sixty seconds, the batch size used in Table 1, is drawn as a vertical line.

Fig. 2. A configuration dialog box in Eclipse.

We hypothesize that programmers often don't configure refactoring tools. We suspect this because tweaking code manually after the refactoring may be easier than configuring the tool. In the past, we have found some limited
evidence that programmers perform only a small amount of
configuration of refactoring tools. When we conducted a
small survey in September 2007 at a Portland Java Users
Group meeting, eight programmers estimated that, on
average, they supply configuration information only 25 percent of the time.
To validate this hypothesis, we counted how often
programmers used various configuration options in the
Toolsmiths and Mylyn data when performing the five
refactorings most frequently performed by Toolsmiths.
We skipped refactorings that did not have configuration
options. The results of the analysis are shown in Table 3.
Configuration Option refers to a configuration parameter
that the user can change. Default Value refers to the
default value that the tool assigns to that option. Change
Frequency refers to how often a user used a configuration
option other than the default. The data suggest that
refactoring tools are configured infrequently: The overall
mean change frequency for these options is about 10 percent
in Toolsmiths and 12 percent in Mylyn. Although different
configuration options are changed from defaults with
varying frequencies, almost all configuration options that
we inspected were below the average configuration percentage predicted by the Portland Java Users Group survey.
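The counting itself is straightforward. A minimal sketch, assuming each logged refactoring can be reduced to an option-to-value mapping (our simplification, not the refactoring history's actual schema):

from collections import Counter
from typing import Dict, Iterable

def change_frequencies(refactorings: Iterable[Dict[str, str]],
                       defaults: Dict[str, str]) -> Dict[str, float]:
    """For each configuration option, the fraction of uses in which the
    recorded value differed from the tool's default value."""
    used: Counter = Counter()
    changed: Counter = Counter()
    for config in refactorings:
        for option, value in config.items():
            if option not in defaults:
                continue  # skip options for which no default is known
            used[option] += 1
            if value != defaults[option]:
                changed[option] += 1
    return {option: changed[option] / used[option] for option in used}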
This analysis has several limitations. First, we could not
count how often certain configuration options were
changed, such as how often parameters are reordered when
EXTRACT METHOD is performed. Second, we examined
only the five most-common refactorings; configuration may
be more frequent for less popular refactorings. Third, we
measured how often a single configuration option is
changed, but not how often any configuration option is
changed for a refactoring. Fourth, we were not able to
distinguish between the case where a developer purposefully used a nondefault configuration option and the case where she blindly used the nondefault option left over from the last time that she used the tool.

TABLE 3
Refactoring Tool Configuration in Eclipse

3.4 Commit Messages Don't Predict Refactoring


Several researchers have used messages attached to
commits into a version control as indicators of refactoring
activity [6], [14], [15], [17]. For example, if a programmer
commits code to CVS and attaches the commit message
refactored class Foo, we might predict that the committed
code contains more refactoring activity than if a programmer commits with a message that does not contain the word
refactor. However, we hypothesize that this assumption
is false, perhaps because refactoring can be an unconscious
activity [2, p. 47] and perhaps because the programmer may
consider the refactoring subordinate to some other activity,
such as adding a feature [10].
In his thesis, Ratzinger describes the most sophisticated
strategy for finding refactoring messages of which we are
aware [14]: searching for the occurrence of 13 keywords, such as "move" and "rename", and excluding "needs refactoring". Using two different project histories, the
author randomly drew 100 file modifications from each
project and classified each as either a refactoring or as some
other change. He found that his keyword technique
accurately classified modifications 95.5 percent of the time.
Based on this technique, Ratzinger et al. concluded that an
increase in refactoring activity tends to be followed by a
decrease in software defects [15].
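Our replication of this classifier can be sketched as follows. The text above names only "move" and "rename" among Ratzinger's 13 keywords, plus the exclusion phrase "needs refactoring", so the keyword list below is an illustrative placeholder rather than the full list.

import re

# Illustrative subset only: Ratzinger's full list contains 13 keywords.
KEYWORDS = ["refactor", "move", "rename"]
EXCLUSIONS = ["needs refactoring"]

def is_labeled_refactoring(message: str) -> bool:
    """Classify a commit message as refactoring-related if any keyword
    matches after the excluded phrases have been removed."""
    text = message.lower()
    for phrase in EXCLUSIONS:
        text = text.replace(phrase, " ")
    return any(re.search(r"\b" + re.escape(kw), text) for kw in KEYWORDS)

For example, "refactored class Foo" is classified as a refactoring commit, while "this class needs refactoring" is not.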
We replicated Ratzingers experiment for the Eclipse
code base. Using the Eclipse CVS data, we grouped

individual file revisions into global commits, as previously


discussed in Section 2. We also manually removed commits whose messages referred to changes to a refactoring tool (for example, "105654 [refactoring] Convert Local Variable to Field has problems with arrays"), because such changes are false positives that occur only because the project's code is itself implementing refactoring tools. Next, using Ratzinger's 13 keywords, we automatically classified the log
messages for the remaining 2,788 commits. Ten percent of
these commits matched the keywords, which compares
with Ratzinger's reported 11 and 13 percent for two other
projects [14]. Next, a third party randomly drew 20 commits
from the set that matched the keywords (which we will call
Labeled) and 20 from the set that did not match
(Unlabeled). Without knowing whether a commit was
in the Labeled or Unlabeled group, two of the authors
(Murphy-Hill and Parnin) manually compared each committed version of the code against the previous version,
inferring how many and which refactorings were performed and whether at least one nonrefactoring change was
made. Together, Murphy-Hill and Parnin compared these
40 commits over the span of about 6 hours, comparing the
code using a single computer and Eclipse's standard
compare tool.
The results are shown in Table 4, under the Eclipse CVS
heading. In the left column, the kind of Change is listed.
Pure Whitespace means that the developer changed only
whitespace or comments; No Refactoring means that the
developer did not refactor but did change program
behavior; Some Refactoring means that the developer both

TABLE 4
Refactoring between Commits in Eclipse CVS and Mylyn CVS

Plain numbers count commits in the given category; tuples contain the number of refactorings in each commit.


Fig. 3. Refactorings over 40 sessions for Mylyn (at left) and 40 sessions for Eclipse (at right). Refactorings shown include only those with tool support
in Eclipse.

refactored and changed program behavior, and Pure Refactoring means the programmer refactored but did not
change program behavior. The center column counts the
number of Labeled commits with each kind of change, and
the right column counts the number of Unlabeled commits.
The parenthesized lists record the number of refactorings
found in each commit. For instance, the table shows that, in
five commits when a programmer mentioned a refactoring
keyword in the CVS commit message, the programmer
made both functional and refactoring changes. The five
commits contained 1, 4, 11, 15, and 17 refactorings.
These results suggest that classifying CVS commits by
commit message does not provide a complete picture of
refactoring activity. While all six pure-refactoring commits
were identified by commit messages that contained one of
the refactoring keywords, commits labeled with a refactoring
keyword contained far fewer refactorings (63, or 36 percent
of the total) than those not so labeled (112, or 64 percent).
Fig. 3 shows the variety of refactorings in Labeled (darker bars) commits and Unlabeled (lighter bars) commits. We will explain the (H), (M), and (L) tags in Section 3.6.
We replicated this experiment once more for the Mylyn
CVS data set. As with Eclipse CVS, 10 percent of the
commits were classified as refactorings using Ratzinger's
method. Under the Mylyn CVS heading, Table 4 shows the
results of this experiment. These results confirm that
classifying CVS commits by commit message does not
provide a complete picture of refactoring, because commits
labeled with a refactoring keyword contained fewer
refactorings (52, or 46 percent) than those not so labeled
(60, or 54 percent).
There are several limitations to this analysis. First, while
we tried to replicate Ratzinger's experiment [14] as closely as
was practicable, the original experiment was not completely
specified, so we cannot say with certainty that the observed
differences were not due to methodology. Likewise, observed differences may be due to differences in the projects
studied. Indeed, after we completed this analysis, a personal
communication with Ratzinger revealed that the original

experiment included and excluded keywords specific to the projects being analyzed. Second, because the process of
gathering and inspecting subsequent code revisions is labor
intensive, our sample size (40 commits in total) is smaller
than would otherwise be desirable. Third, the classification
of a code change as a refactoring is somewhat subjective. For
example, if a developer removes code known to her to never
be executed, she may legitimately classify that activity as a
refactoring, although to an outside observer it may appear to
be the removal of a feature. We tried to be conservative,
classifying changes as refactorings only when we were
confident that they preserved behavior. Moreover, because
the comparison was blind, any bias introduced in classification would have applied equally to both Labeled and
Unlabeled commit sets.

3.5 Floss Refactoring Is Common


In previous work, Murphy-Hill and Black distinguished
two tactics that programmers use when refactoring: floss
refactoring and root-canal refactoring [10]. During floss
refactoring, the programmer uses refactoring as a means to
reach a specific end, such as adding a feature or fixing a
bug. Thus, during floss refactoring the programmer intersperses other kinds of program changes with refactorings to
keep the code healthy. Root-canal refactoring, in contrast, is
used for correcting deteriorated code and involves a
protracted process consisting exclusively of refactoring.
A survey of the literature suggested that floss refactoring
is the recommended tactic, but provided only limited
evidence that it is the more common tactic [10].
Why does this matter? Case studies in the literature, for
example those reported by Pizka [13] and by Bourquin and
Keller [1], describe root-canal refactoring. However, inferences drawn from these studies will be generally applicable
only if most refactorings are indeed root canals.
We can estimate which refactoring tactic is used more
frequently from the Eclipse CVS and Mylyn CVS data. We
first define behavioral indicators of floss and root-canal
refactoring during programming sessions, which (in contrast to the intentional definitions given above) we can hope
to recognize in the data. For convenience, we define a
programming session as the period of time between
consecutive commits to CVS by a single programmer. In a
particular session, if a programmer both refactors and
makes a semantic change, then we say that the programmer
is floss refactoring. If a programmer refactors during a
session but does not change the semantics of the program,
then we say that the programmer is root-canal refactoring.
Note that a true root-canal refactoring must also last an
extended period of time or take place over several sessions.
The above behavioral definitions relax this requirement,
and so will tend to overestimate the number of root canals.
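These behavioral definitions amount to a two-way test per session; a minimal sketch, with our own function and label names:

def classify_session(refactored: bool, semantic_change: bool) -> str:
    """Behavioral classification of one session (the period between two
    consecutive commits by one programmer). 'root canal' is only an
    upper bound here, for the reason given above."""
    if not refactored:
        return "no refactoring"
    return "floss" if semantic_change else "root canal"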
The results suggest that floss refactoring is more
common than root-canal refactoring. Returning to Table 4,
we can see that Some Refactoring, indicative of floss
refactoring, accounted for 28 percent (11/40) of commits in
Eclipse CVS and 50 percent (20/40) in Mylyn CVS.
Comparatively, Pure Refactoring, indicative of root-canal
refactoring, accounts for 15 percent (6/40) of commits in
both Eclipse CVS and Mylyn CVS. Normalizing for the fact
that only 10 percent (4/40) of all commits were labeled with refactoring keywords in Eclipse CVS, commits indicating floss refactoring would account for 30 percent of all
commits while commits indicating root-canal would account for only 3 percent of all commits.3 Normalizing in
Mylyn CVS, floss commits account for 54 percent while root-canal commits account for 11 percent. Looking at the Eclipse
CVS data another way, 98 percent of individual refactorings
would occur as part of a Some Refactoring (floss) commit,
while only 2 percent would occur as part of a Pure
Refactoring (root canal) commit, again after normalizing
for labeled commits. For the Mylyn CVS data set, 86 percent
of individual refactorings would occur as part of a Some
Refactoring (floss) commit, while 14 percent would occur as
part of a Pure Refactoring (root canal) commit.
We also notice that for Eclipse CVS in Table 4, the Some
Refactoring (floss) row tends to show more refactorings
per commit than the Pure Refactoring (root canal) row.
However, this trend was not confirmed in the Mylyn CVS
data; the number of refactorings in floss commits does not
appear to be significantly different from the number in root-canal commits.
Pure refactoring with tools is relatively infrequent in the
Users data set, suggesting that very little root-canal
refactoring occurred in Users as well. We counted the
number of refactorings performed using a tool during
sessions in that data. In no more than 10 out of 2,671 sessions
did programmers use a refactoring tool without also
manually editing their program. In other words, in less
than 0.4 percent of commits did we observe the possibility
of root-canal refactoring using only refactoring tools.
Our analysis of Table 4 is subject to the same limitations
described in Section 3.4. The analysis of the Users data set
(but not the analysis of Table 4) is also limited in that we
consider only those refactorings performed using tools.
Some refactorings may have been performed by hand; these
would appear in the data as edits, thus possibly inflating
the count of floss refactoring and reducing the count of root-canal refactoring.

3.6 Many Refactorings Are Medium and Low Level


Refactorings operate at a wide range of levels, from as low
level as single expressions to as high level as whole
inheritance hierarchies. Past research has often drawn
conclusions based on observations of high-level refactorings. For example, several researchers have used automatic
refactoring-detection tools to find refactorings in version
histories, but these tools can generally detect only those
refactorings that modify packages, classes, and member
signatures [3], [4], [20], [21]. The tools generally do not
detect submethod level refactorings, such as EXTRACT
LOCAL VARIABLE and INTRODUCE ASSERTION. We hypothesize that, in practice, programmers also perform many
lower-level refactorings. We suspect this because lower-level refactorings will not change the program's interface,
and thus programmers may feel more free to perform them.
To investigate this hypothesis, we divided the refactorings that we observed in our manual inspection of Eclipse
CVS commits into three levels: High, Medium, and Low.
3. Our normalization procedure is described in Appendix A.

TABLE 5
Refactoring Level Percentages in Four Data Sets

We classified refactoring tool uses in the Mylyn and Toolsmiths data in the same way. High-level refactorings are


those that change the signatures of classes, methods, or
fields; refactorings at this level include RENAME CLASS,
MOVE STATIC FIELD, and ADD PARAMETER. Medium-level
refactorings are those that change the signatures of classes,
methods, and fields and also significantly change blocks of
code; this level includes EXTRACT METHOD, INLINE
CONSTANT, and CONVERT ANONYMOUS TYPE TO NESTED
TYPE. Because medium-level refactorings affect both signatures and code, more sophisticated automated analysis is needed, and they may not be properly identified by automated refactoring detectors. Low-level refactorings are
those that make changes only to blocks of code; low-level
refactorings include EXTRACT LOCAL VARIABLE, RENAME
LOCAL VARIABLE, and ADD ASSERTION. Refactorings with
tool support that were found in the Eclipse CVS and Mylyn
CVS data set are labeled as high (H), medium (M), and low
(L) in Fig. 3.4
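The level assignment can be expressed as a lookup table. The sketch below covers only the example refactorings named above (a fragment of the full classification), with GENERALIZE DECLARED TYPE omitted as footnote 4 explains:

from collections import Counter
from typing import Dict, Iterable

# A fragment of the classification: only the kinds named in the text.
LEVEL: Dict[str, str] = {
    "RENAME CLASS": "high",
    "MOVE STATIC FIELD": "high",
    "ADD PARAMETER": "high",
    "EXTRACT METHOD": "medium",
    "INLINE CONSTANT": "medium",
    "CONVERT ANONYMOUS TYPE TO NESTED TYPE": "medium",
    "EXTRACT LOCAL VARIABLE": "low",
    "RENAME LOCAL VARIABLE": "low",
    "ADD ASSERTION": "low",
    # GENERALIZE DECLARED TYPE is excluded: it can be high or low.
}

def level_percentages(observed: Iterable[str]) -> Dict[str, float]:
    """Percentage of observed refactorings at each level (cf. Table 5)."""
    counts = Counter(LEVEL[kind] for kind in observed if kind in LEVEL)
    total = sum(counts.values())
    return {level: 100.0 * n / total for level, n in counts.items()}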
The results of this analysis are displayed in Table 5. For
each level of refactoring, we show what percentage of
refactorings from Eclipse CVS (normalized), Toolsmiths,
Mylyn CVS (normalized), and Mylyn make up that level.
We see that many low and medium-level refactorings do
indeed take place; as a consequence, tools that detect only
high-level refactorings will miss 24 to 60 percent of
refactorings.

3.7 Refactorings Are Frequent


While the concept of refactoring is now popular, it is not
entirely clear how commonly refactoring is practiced. In Xing
and Stroulia's automated analysis of the Eclipse code base,
the authors conclude that indeed refactoring is a frequent
practice [21]. The authors make this claim largely based on
observing a large number of structural changes, 70 percent
of which are considered to be refactoring. However, this
figure is based on manually excluding 75 percent of
semantic changes, resulting in refactorings accounting for
16 percent of all changes. Further, their automated
approach suffers from several limitations, such as the
failure to detect low-level refactorings, imprecision when
distinguishing signature changes from semantic changes,
and the coarse granularity available from the inspection of
CVS revisions.
To validate the hypothesis that refactoring is a frequent
practice, we characterize the occurrence of refactoring
activity in the Users, Toolsmiths, and Mylyn data. Note that these
data sets contain records of only those refactorings that
were performed with tools.
4. Note that one refactoring, GENERALIZE DECLARED TYPE, can be either high (if the type is declared in a signature) or low (if the type is declared in the body of a method). This refactoring was excluded from the analysis reflected in the data in Table 5.

In order for refactoring activity to be defined as frequent, we seek to apply criteria that require refactoring to be habitual and to occur at regular intervals. For example, if refactoring occurs just before a software release, but not at
other times, then we would not want to claim that
refactoring is frequent. First, we examined the Toolsmiths
and Mylyn data to ascertain how refactoring activity was
spread throughout the development cycle. Second, we
examined the Users data to determine how often refactoring
occurred within a programming session and whether there
was significant variation across the population.
In the Toolsmiths data, we found that refactoring
occurred throughout the Eclipse development cycle. In
2006, an average of 30 refactorings took place each week;
in 2007, there were 46 refactorings per week. Only 2 weeks
in 2006 did not have any refactoring activity, and one of
these was a winter holiday week. In 2007, refactoring
occurred every week. In the Mylyn data, we find a similar
trend for the first two years, but a drop in the frequency and
eventually the magnitude of refactoring in the last two
years. Specifically, refactoring activity did not occur for 1 week
in both 2006 and 2007, 7 weeks in 2008, and 8 (out of 34)
weeks in 2009. The respective averages were 31, 28, 36, and
6 tool refactorings per week. However, the average number
of commits per week also declined in recent years (62, 65,
41, and 22); because there was decreased development
activity, we would expect lower refactoring activity.
In the Users data set, we found refactoring activity
distributed throughout the programming sessions. Overall,
41 percent of programming sessions contained refactoring
activity. More interestingly, if we assume that the number
of edits (changes to a program made with an editor)
approximates how much work was done during a session,
then significantly more work was done in sessions with
refactoring than without refactoring. We found that, on
average, sessions without refactoring activity contained an
order of magnitude fewer edits than sessions with refactoring. Looking at it a different way, sessions that contained
refactoring also contained, on average, 71 percent of the
total edits made by the programmer. This was consistent
across the population: 22 of 31 programmers had an
average greater than 72 percent, whereas the remaining 9
ranged from 0 to 63 percent. This analysis of the Users data
suggests that when programmers must make large changes
to a code base, refactoring is a common way to prepare for
those changes.
Inspecting refactorings performed using a tool does not
have the limitations of automated analysis; it is independent of the granularity of commits and semantic changes,
and captures all levels of refactoring activity. However, it
has its own limitation: the exclusion of manual refactoring.
Including manual refactorings can only increase the
observed frequency of refactoring. Indeed, this is likely:
As we will see in Section 3.8, many refactorings are in fact
performed manually.

3.8 Refactoring Tools Are Underused


A programmer may perform a refactoring manually or may
choose to use an automated refactoring tool if one is
available for the refactoring that she needs to perform.
Ideally, a programmer will always use a refactoring tool if
one is available because automated refactorings are theoretically faster and less error-prone than manual refactorings.
However, in one survey of 16 students, only two reported having used refactoring tools, and even then only 20 and 60 percent of the time [10]. In another survey of 112 agile
enthusiasts, we found that the developers reported refactoring with a tool a median of 68 percent of the time [10].
Both of these estimates of usage are surprisingly low, but
they are still only estimates. We hypothesize that programmers often do not use refactoring tools. We suspect this is
because existing tools may not have a sufficiently usable
user interface.
To validate this hypothesis, we correlated the refactorings that we observed by manually inspecting Eclipse CVS
commits with the refactoring tool usages in the Toolsmiths
data set. Similarly, we performed the same correlation for
the Mylyn CVS commits and tool usages in the Mylyn data
set. A refactoring found by manual inspection can be
correlated with the application of a refactoring tool by
looking for tool applications between commits. For example, the Toolsmiths data provide sufficient detail (time, new
variable name, and location) to correlate an EXTRACT
LOCAL VARIABLE performed with a tool with an EXTRACT
LOCAL VARIABLE observed by manually inspecting adjacent commits in Eclipse CVS.
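A sketch of this linking step, with dataclass fields standing in for the details (time, new name, location) that the histories record; the multi-day look-back discussed at the end of this section appears as a slack parameter:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ObservedRefactoring:
    kind: str               # e.g., "EXTRACT LOCAL VARIABLE"
    details: str            # e.g., the new variable name and its location
    prev_commit_time: float
    commit_time: float

@dataclass
class ToolUse:
    kind: str
    details: str
    time: float

def link_to_tool_use(observed: ObservedRefactoring,
                     tool_log: List[ToolUse],
                     slack: float = 3 * 24 * 3600) -> Optional[ToolUse]:
    """Link a refactoring found by inspecting adjacent commits to a
    logged tool invocation of the same kind with matching details,
    performed between the two commits (window extended backward by
    `slack` seconds)."""
    for use in tool_log:
        if (use.kind == observed.kind
                and use.details == observed.details
                and observed.prev_commit_time - slack
                    <= use.time <= observed.commit_time):
            return use
    return None  # unlinked: counted as a manual refactoring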
After analyzing the Toolsmiths data, we were unable to
link 89 percent of 145 observed refactorings that had tool
support to any use of a refactoring tool (also 89 percent when
normalized). After analyzing the Mylyn data, we were
unable to link 78 percent of 72 observed refactorings that
had tool support to any use of a refactoring tool (91 percent
when normalized). This suggests that the developers
associated with the Toolsmiths and Mylyn data primarily
refactor manually. An unexpected finding was that 31 refactorings that were performed with tools were not visible by
comparing revisions in CVS for the Toolsmiths data; the same
phenomenon was observed 13 times with the Mylyn data. It
appears that most of these refactorings occurred in methods
or expressions that were later removed or in newly created
code that had not yet been committed to CVS.
Overall, the results support the hypothesis that programmers refactor manually in lieu of using tools.
Measured tool usage was even lower than the median
estimate from the professional agile developer survey. This
suggests that either programmers unconsciously or consciously overestimate their tool usage (perhaps refactoring
is often an unconscious activity or perhaps expert programmers are embarrassed to admit that they do not use
refactoring tools) or that expert programmers prefer to
refactor manually.
To observe if tools were underused by a larger
population, we analyzed the UDC Events data, which
include timestamped Eclipse commands from developers.
In the UDC Events data, of the 275,903 participants, only
39,729 participants had used refactoring tools during the
period covered by the data. We would expect that not all of
the participants would have used refactoring tools because
this data set included non-Java developers, and many
participants were not active users of Eclipse. From the
39,729 tool-using participants, we examined how many
times they used a refactoring tool each week. We also
counted weeks where developers had development activity,
but no refactoring tool usage. The distribution is presented
in Table 6. Nearly 80 percent of the weekly development
sessions did not have any refactoring tool usage, even among those who had used refactoring tools at some point. When developers did use refactoring tools, the usage within a week mostly remained in the single digits. The results suggest that tool usage may be as low among a wider population as among the Eclipse and Mylyn developers we have studied.

TABLE 6
Distribution of Number of Tool Refactorings Per Week from 39,729 Tool-Using Developers Collected from 395,814 Weeks with Development Activity
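A sketch of the per-developer weekly counting behind Table 6, assuming the UDC stream can be reduced to (developer, timestamp, is-refactoring-command) triples; weeks with development activity but no tool refactorings are kept as zero counts:

from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple

WeekKey = Tuple[str, int, int]  # (developer id, ISO year, ISO week)

def refactorings_per_active_week(
        events: Iterable[Tuple[str, float, bool]]) -> Dict[WeekKey, int]:
    """Map each (developer, week) with any activity to its number of
    refactoring-tool invocations, including zero-count weeks."""
    counts: Dict[WeekKey, int] = {}
    for dev, t, is_refactoring in events:
        iso = datetime.fromtimestamp(t, tz=timezone.utc).isocalendar()
        key: WeekKey = (dev, iso[0], iso[1])
        counts.setdefault(key, 0)  # record the week as active
        if is_refactoring:
            counts[key] += 1
    return counts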
This analysis suffers from several limitations. First, it is
possible that some tool usage data from Toolsmiths may be
missing. If programmers used multiple computers during
development, some of which were not included in the data
set, this would result in underreporting of tool usage. Given
a single commit, we could be more certain that we have a
record of all refactoring tool uses over the code in that
commit if we have a record of at least one refactoring tool
use applied to that code since the previous commit. If we
apply our analysis only to those commits, then 73 percent of
refactorings (also 73 percent when normalized) cannot be
linked with a tool usage. Likewise, data from Mylyn may be
missing because developers may not have checked their
refactoring histories into CVS. In fact, one developer in
Developer Responses explicitly confirmed that sometimes he
does not commit refactoring history. Applying our analysis
to only those commits for which we have at least one
refactoring tool use, then 41 percent of refactorings
(45 percent when normalized) cannot be linked with a tool
usage. Second, refactorings that occurred at an earlier time
might not be committed until much later; this would inflate
the count of refactorings found in CVS that we could not
correlate to the use of a tool, and thus cause us to
underestimate tool usage. We tried to limit this possibility
by looking back several days before a commit to find uses of
refactoring tools, but may not have been completely
successful. Finally, in our analysis of UDC Events, we
cannot discount the possibility that this population refactors
less frequently in general because we have no estimate of
their manual refactoring.

3.9 Different Refactorings Are Performed with and without Tools
Some refactorings are more prone to being performed by
hand than others. We have recently identified a surprising
discrepancy between how programmers want to refactor and
how they actually refactor using tools [10]. Programmers
typically want to perform EXTRACT METHOD more often
than RENAME, but programmers actually perform RENAME
with tools more often than they perform EXTRACT METHOD
with tools. (This can also be seen in all four groups of
programmers in Table 1.) Comparing these results, we
inferred that the EXTRACT METHOD tool is underused: The
refactoring is instead being performed manually. However,
it is unclear why other refactoring tools are underused.
Moreover, there may be some refactorings that must be
performed manually because no tool yet exists. We suspect the reason that some kinds of refactoring (especially RENAME) are more often performed with tools is that these tools have simpler user interfaces.
To explore this suspicion, we examined how the kinds of
refactorings differed between refactorings performed by
hand and refactorings performed using a tool. We once again
correlated the refactorings that we found by manually
inspecting Eclipse CVS commits with the refactoring tool
usage in the Toolsmiths data. We repeated this process for the
Mylyn CVS commits and Mylyn data. Finally, when inspecting the Eclipse CVS and Mylyn CVS commits, we identified
several refactorings that currently have no tool support.
The results are shown in Fig. 3. Tool indicates how many
refactorings were performed with a tool; Manual indicates
how many were performed without. The figure shows that
manual refactorings were performed much more often for
certain kinds of refactoring. For example, EXTRACT METHOD is performed nine times manually but just once with a
tool in Eclipse CVS; REMOVE PARAMETER is never
performed with a tool in the Mylyn CVS commits. Moreover, no refactorings were performed more often with a tool
than manually in both Eclipse CVS and Mylyn CVS together.
We can also see from the figure that many kinds of
refactorings were performed exclusively by hand, despite
having tool support.
Most refactorings that programmers performed had tool
support. However, 30 refactorings from the Eclipse CVS
commits and 36 refactorings from the Mylyn CVS commits
did not have tool support. One of the most popular of these
was MODIFY ENTITY PROPERTY, performed eight times in
the Eclipse CVS commits and four times in the Mylyn CVS
commits, which would allow developers to safely modify
properties such as static or final. A frequent but
unsupported refactoring, REMOVE DECLARED EXCEPTION,
occurred 12 times in the Mylyn CVS commits; it was
commonly used to remove unnecessary exceptions from
method signatures. Finally, we observed three instances of a
REPLACE ARRAY WITH LIST refactoring in the Mylyn CVS
commits. The same limitations apply as in Section 3.8.

4 DISCUSSION

How do the results presented in Section 3 affect future refactoring research and tools?

4.1 Tool-Usage Behavior


Several of our findings illuminate the behavior of programmers using refactoring tools. For example, our finding
about how toolsmiths differ from ordinary programmers in
terms of refactoring tool usage (Section 3.1) suggests that
most kinds of refactorings will not be used as frequently as
the toolsmiths hoped when compared to the RENAME
refactoring. For the toolsmith, this means that improving
the underused tools or their documentation (especially the
tool for EXTRACT LOCAL VARIABLE) may increase tool use.
Other findings provide insight into the typical workflow
involved in refactoring. Consider that refactoring tools are
often used repeatedly (Section 3.2), and that programmers
often do not configure refactoring tools (Section 3.3). For the
toolsmith, this means that configurationless refactoring
tools, which have recently seen increasing support in Eclipse


and other environments, will suit the majority of, but not all,
refactoring situations. In addition, our findings about the
batching of refactorings provide evidence that tools that
force the programmer to repeatedly select, initiate, and
configure can waste programmers' time. This was in fact one
of the motivations for Murphy-Hill and Black's refactoring
cues, a tool that allows the programmer to select several
program elements for refactoring at one time [8].
Questions still remain for researchers to answer. Why is
the RENAME refactoring tool so much more popular than
other refactoring tools? Why do some refactorings tend to
be batched while others do not? Moreover, our experiments
should be repeated in other projects and for other
refactorings to validate our findings.

4.2 Detecting Refactoring


In our experiments, we investigated the assumptions
underlying several commonly used refactoring-detection
techniques. It appears that some techniques may need
refinement to address some of the concerns that we have
uncovered. Our finding that commit messages in version
histories are unreliable indicators of refactoring activity
(Section 3.4) is at variance with an earlier finding by
Ratzinger [14]. It also casts doubt on the reliability of
previous research that relies on this technique [6], [15], [17].
Thus, further replication of this experiment in other
contexts is needed to establish more conclusive results.
Our finding that many refactorings are medium or low-level suggests that refactoring-detection techniques used by
Weißgerber and Diehl [20], Dig et al. [4], Counsell et al. [3],
and, to a lesser extent, Xing and Stroulia [21] will not detect
a significant proportion of refactorings. The effect that this
has on the conclusions drawn by these authors depends on
the scope of those conclusions. For example, Xing and
Stroulia's conclusion that refactorings are frequent can be strengthened by taking low-level refactorings into consideration. In contrast, Dig et al.'s tool was intended to help
automatically upgrade library clients, and thus has no need
to find low-level refactorings. In general, researchers who
wish to detect refactorings automatically should be aware of
what level of refactorings their tool can detect.
Researchers can make refactoring detection techniques
more comprehensive. For example, we observed that a
common reason for Ratzinger's keyword matching to
misclassify changes as refactorings was that a bug-report
title had been included in the commit message, and this title
contained refactoring keywords. By excluding bug-report
titles from the keyword search, accuracy could be increased.
In general, future research can complement existing
refactoring detection tools with refactoring logs from tools
to increase recall of low-level refactorings.
4.3 Refactoring Practice
Several of our findings add to existing evidence about
refactoring practice across a large population of programmers. Unfortunately, the findings also suggest that refactoring tools need further improvements before programmers
will use them frequently. First, our finding that programmers refactor frequently (Section 3.7) confirms the same
finding by Weißgerber and Diehl [20] and Xing and Stroulia
[21]. For toolsmiths, this highlights the potential of

refactoring tools, telling them that increased tool support for refactoring may be beneficial to programmers.
Second, our finding that floss refactoring is a more
frequently practiced refactoring tactic than root-canal refactoring (Section 3.5) confirms that floss refactoring, in
addition to being recommended by experts [5], is also
popular among programmers. This has implications for
toolsmiths, researchers, and educators. For toolsmiths, this
means that refactoring tools should support flossing by
allowing the programmer to switch quickly between
refactoring and other development activities, which is not
always possible with existing refactoring tools, such as those
that force the programmer's attention away from the task at
hand with modal dialog boxes [10]. For researchers, studies
should focus on floss refactoring for the greatest generality.
For educators, it means that when they teach refactoring to
students, they should teach it throughout the course rather
than as one unit during which students are taught to refactor
their programs intensively. Students should understand that
refactoring can be practiced both as a way of incrementally
improving the whole program design and also as a way to
simplify the process of adding a feature or to make a
fragment of code easier to understand.
Last, our findings that many refactorings are performed
without the help of tools (Section 3.8) and that the kinds of
refactorings performed with tools differ from the kinds
performed manually (Section 3.9) confirm the results of our
survey on programmers underuse of refactoring tools [10].
Toolsmiths need to explore alternative interfaces and
identify common refactoring workflows, such as reminding
users to EXTRACT LOCAL VARIABLE before an EXTRACT
METHOD or finding an easy way to combine these refactorings: The goal should be to encourage and support
programmers in taking full advantage of refactoring tools.
For researchers, more information is needed about exactly
why programmers do not use refactoring tools as much as
they could.

4.4 Developer Responses on Manual Refactoring


Our discussion with Eclipse and Mylyn developers (Developer Responses data) about their refactoring behavior generated several insights that may guide future research. We do
not claim any generality for these insights, but believe that
they do provide useful seeds for future investigation.
We provided developers with several examples of a
refactoring that they themselves performed with a tool in
one case but without a tool in another. We then asked them
to explain this behavior. From their responses, we identified
three factors (awareness, opportunity, and trust) and two issues with tool workflow (touch points and disrupted flow) that may limit tool usage.
Awareness is whether a developer realizes that there is a
refactoring tool in her programming environment that can
change her code on her behalf. As an example, one
developer was not familiar with how the tools for the
INLINE refactoring worked despite being an experienced
developer. Similarly, one toolsmith described awareness
problems occurring in the following scenario:

- I already know exactly how I want the code to look like.
- Because of that, my hands start doing copy-paste and the simple editing without my active control.
- After a few seconds, I realize that this would have been easier to do with a refactoring [tool].
- But since I already started performing it manually, I just finish it and continue.
Awareness is partially a problem of education, exposure,
and encouragement; future research can consider how to
improve sharing and exposing developers to different IDE
features. But, as the toolsmith's scenario suggests, awareness can also be a problem when a developer knows what
refactoring tools are available; future research may be able
to help developers realize that the change that they are about
to make can be automated with a refactoring tool.
Opportunity is similar to awareness but differs in that the
developer knows about a feature yet does not recognize an
opportunity to use it. One developer described
this situation:
My problem isn't so much with how these tools do their
job, but more with [opportunity]. Are these tools... available
within the editor when pressing Ctrl+1 [an Eclipse
automated suggestion tool]? Perhaps more should be
recommended based on where I'm at in the code and then
perhaps I'd use them more. As it is I have to remember the
functionality exists, locate it in a popup menu (hands
leaving keyboard bad) and figure out the proper name
for what it is I want to achieve.

Researchers and toolsmiths can consider how to improve
refactoring opportunities by introducing mechanisms for
filtering applicable refactorings or conveying bad smells
local to the source code currently being worked on.
The final factor is trust: Does a developer have faith in an
automated refactoring tool? Developers must often act on
faith because they are not informed of the limitations of a
tool or of cases in which an automated refactoring may fail.
Several developers mentioned they would avoid using a
refactoring tool because of worries about introducing errors
or unintended side effects. Perhaps future research can
improve trust by generating inspection checklists that cover
situations where the tool may fail.
Developers also described several limitations with the
way that refactoring tools fit into their development
workflow. The first limitation involves touch points: the set
of places in the code that may be affected by a potential
change. One developer described how using an automated
refactoring tool would curtail what would otherwise be a
potentially richer and more thorough experience:
More often than not, when starting a manual refactoring,
upon saving that first change, all dependent code throughout
the code base will light up (compile errors). At this point, you
can survey what the refactoring will involve and potentially
discover portions of the code you didn't realize the
refactoring would affect. Tool support tries to achieve this by
presenting users with a preview of the changes, but this
information is always being presented in an unfamiliar way
(i.e., presented in a wizard dialog instead of the
PackageExplorer for example). The previews are not only
unfamiliar UI, but usually more difficult to use to explore
the affected code than it would be in the native package
explorer and editor.

Current preview dialogs do not provide the same level of


participation and interaction available by visiting error

16

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING,

locations one by one in the editor. Researchers and


toolsmiths might well consider how to better engage a
developer with a thorough review or more intelligent
summarization of a proposed change.
A second limitation involves disrupted flow, an interruption to the focus and working memory of the developer.
Besides perceived slowness, developers mentioned disrupted flow as a general concern when using refactoring
tools. Consider why refactoring tools may be disruptive:
Development often requires deep concentration and maintenance of mental thoughts and plans, which are facilitated
by the availability of textual cues and fluid movement
through the source code. A developer may feel that using a
refactoring tool would disrupt her concentration: By
initiating a refactoring that requires configuration, a
developer traps herself within a modal window, temporarily isolating herself from the source code until the
refactoring is completed. The modal window blocks source
text previously available on the screen and limits mobility
to other parts of the code that the developer may want to
inspect. Toolsmiths should consider how to better integrate
refactoring tools within the programming environment in a
way that limits disruption.
Finally, we believe that further examination of why
developers use refactoring tools in some cases but not in
others will help identify why refactoring tools break down
and what can be done to improve these tools.

4.5 Limitations of This Study


In addition to the limitations noted in each section of
Section 3, some characteristics of our data limit the validity
of all of our analyses. First, all the data report on refactoring
Java programs in the Eclipse environment. While this is a
widely used language and environment, the results
presented in this paper may not hold for other languages
and environments. Second, Users, Toolsmiths, and Mylyn
may not represent programmers in general. Third, the Users
and Everyone data sets may overlap the Toolsmith or Mylyn
data sets: Both the Users and Everyone data sets were
gathered from volunteers, and some of those volunteers
may have been Toolsmiths or Mylyn developers. Finally, our
inspection of the Eclipse CVS and Mylyn CVS data excluded
commits to branches; readers of our previous work [11]
have expressed concern that developers may have refactored more in branch commits than in regular commits.
However, our interviews with both Toolsmith and Mylyn
developers confirmed that refactoring in branches was
discouraged because of the difficulty of later merging
refactored code.
4.6 Experimental Data and Tools
Our publicly available data, the SQL queries used for
correlating and summarizing that data, and the tools we
used for batching refactorings and grouping CVS revisions
can be found at http://multiview.cs.pdx.edu/refactoring/experiments.
Our normalization procedure and survey sent
to developers can be found in the appendices.
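For readers unfamiliar with the grouping step, the idea (following sliding-time-window preprocessing in the style of Zimmermann and Weißgerber [22]) is to batch individual CVS file revisions into logical commits when they share an author and log message and lie close together in time. The following Java sketch is our own minimal illustration, with an invented Revision type and an arbitrarily chosen 200-second window; it is not the exact tool we distribute:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: group CVS file revisions into logical commits. Revisions
// by the same author with the same log message that occur within a fixed
// time window of the previous revision are treated as one transaction.
public class RevisionGrouper {
    record Revision(String author, String comment, long timestampSeconds) {}

    static final long WINDOW_SECONDS = 200; // illustrative choice

    static List<List<Revision>> group(List<Revision> revisionsSortedByTime) {
        List<List<Revision>> transactions = new ArrayList<>();
        for (Revision r : revisionsSortedByTime) {
            if (!transactions.isEmpty()) {
                List<Revision> current = transactions.get(transactions.size() - 1);
                Revision prev = current.get(current.size() - 1);
                if (prev.author().equals(r.author())
                        && prev.comment().equals(r.comment())
                        && r.timestampSeconds() - prev.timestampSeconds() <= WINDOW_SECONDS) {
                    current.add(r); // same logical change; extend transaction
                    continue;
                }
            }
            List<Revision> fresh = new ArrayList<>();
            fresh.add(r);
            transactions.add(fresh); // start a new transaction
        }
        return transactions;
    }
}
```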

CONCLUSIONS

Research about refactoring, like research in all areas, relies on
a solid foundation of data. In this paper, we have examined
the foundations of refactoring from several perspectives. In
some cases, the foundations appear solid; for example,
programmers do indeed refactor frequently. In other cases,
the foundations appear weak; for example, when committing
code to version control, developers' messages appear not to
reliably indicate refactoring activity. Our results can help the
research community build better refactoring tools and
techniques in the future through enhanced knowledge of
how we refactor, and how we know it.

ERRATA

In the ICSE paper on which this work is based [11], we
made two minor mistakes that have been corrected in this
paper. We highlight the corrections here.
In the ICSE paper, we said that Low-, Medium-, and High-level
refactorings made in Eclipse CVS accounted for 18, 22,
and 60 percent of refactorings, respectively. We did not
correctly apply our normalization procedures when we
reported these numbers. In this paper, in Table 5, we
include the corrected numbers: 21, 11, and 68 percent.
In the ICSE paper, we reported finding 21 PUSH DOWN
refactorings. We misclassified five of these refactorings;
they should have been classified as MOVE METHOD
refactorings. The correct classification is shown in this
paper in Fig. 3.

APPENDIX A
NORMALIZATION PROCEDURE
In Section 3.5, we discussed a normalization procedure for
some reported data. To explain the procedure, we'll give the
intuitive explanation and an example calculation below.
We wish to estimate how many pure-refactoring commits
were made to CVS. Recall that previously we sampled
20 Labeled projects and 20 Unlabeled projects, and we
know that six Labeled commits were pure refactoring and
zero Unlabeled commits were pure refactoring. Naively, we
might simply do the addition (6 + 0) and divide by the total
number of sampled commits to get the estimate: 6/40 = 15%.
This is a good estimate for our sample, but a bad estimate for
the population as a whole, because our 20-20 sample was
drawn from two unequal strata. Specifically, in this naive
estimate, we are giving too much weight to the six
pure-refactoring commits, because Labeled commits account
for only about 10 percent of total commits. So what do we do?
Instead of the naive approach, we normalize our estimate
for the relative proportions of Labeled (10 percent) to
Unlabeled commits (90 percent). The following calculation
gives the normalized result:
- 6 is the number of Labeled pure-refactoring commits.
- 0 is the number of Unlabeled pure-refactoring commits.
- 290 is the number of Labeled commits.
- 2,498 is the number of Unlabeled commits.

6/20 × 290/(290 + 2,498) + 0/20 × (1 − 290/(290 + 2,498)) = 0.0312051649928264.

And thus we estimate that about 3 percent of
commits contained pure refactorings.
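In general, this is each stratum's sample proportion weighted by that stratum's share of the population. As a sanity check, here is a small Java sketch of the same calculation (the class and method names are ours, for illustration only; this is not part of the analysis tooling we distribute):

```java
// Sketch of the stratified (normalized) estimate described above.
public class NormalizedEstimate {

    // Each stratum's sample proportion, weighted by stratum size.
    static double estimate(double hitsL, double sampleL, double totalL,
                           double hitsU, double sampleU, double totalU) {
        double weightL = totalL / (totalL + totalU); // share of Labeled commits
        return (hitsL / sampleL) * weightL + (hitsU / sampleU) * (1 - weightL);
    }

    public static void main(String[] args) {
        // 6 of 20 sampled Labeled commits and 0 of 20 sampled Unlabeled
        // commits were pure refactoring; the strata hold 290 and 2,498 commits.
        System.out.println(estimate(6, 20, 290, 0, 20, 2498));
        // Prints roughly 0.0312, i.e., about 3 percent.
    }
}
```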

APPENDIX B
SURVEY
The following e-mail template describes a survey sent to
people who developed code for the Mylyn and Toolsmiths
data sets. In this template, XXX was instantiated with the
developer's first name. YYY was instantiated with either
Mylyn or Eclipse, depending on which project the
developer worked on. ZZZ.refactoringname was instantiated with a transaction number and a refactoring name,
DD/MM/YYYY indicates when that refactoring was
checked in to CVS, and some comment was instantiated
with the comment that the developer made when checking
in to CVS.
Subject: Your thoughts on our refactoring study
Dear XXX,
My colleagues Chris Parnin, Andrew Black, and I are
completing a study about refactoring and refactoring tool
use. We investigated two case studies of refactoring, one of
which was of the YYY project, of which you were a
committer. In short, our analysis compared your refactoring
tool history (produced when you used the Eclipse refactoring
tools) with what we inferred as refactoring in the code when
you committed to CVS. From this, we made estimates about
how often people refactor, what kinds of refactoring tools
they use, and when they use or do not use refactoring tools.
We are hoping you will answer a few questions about
your thoughts on issues related to your refactoring. We
hope this will provide some insights into how we can
interpret our results. You are one of less than 10 people that
we are inviting to participate, so your comments are
extremely valuable to us.
Below, you will find several interview questions; we
anticipate that they will take around 15 minutes to complete
in total. Unless you indicate otherwise, we would like to
reserve the right to summarize or repeat your answers
verbatim in our forthcoming paper. As for privacy, in the
paper we do not personally identify developers by name or
by CVS username (although we do say which projects we
analyzed). You can respond to this questionnaire simply by
replying to this email.
Because we are on a tight publishing deadline, if you
choose to participate, we would appreciate your response
by February 25th.
Sincerely,
Emerson Murphy-Hill, The University of British Columbia
Chris Parnin, Georgia Institute of Technology
Andrew P. Black, Portland State University

We published a first version of our analysis in a


research paper called How We Refactor, and How
We Know It in 2009 at the International Conference
on Software Engineering. Did you happen to read it?
One of our main findings was that, by comparing
refactoring tool histories and the refactorings apparent in CVS, developers appear to use refactoring
tools for about 10 percent refactorings for which a
refactoring tool is available. Speaking for yourself,

17

why do you think you would not use a refactoring


tool when one was available?
- In the attached PDFs, we have included a few snippets of code where we inferred that you performed a refactoring for which Eclipse has tool support. However, for some of the examples, we did not have a record of you using a refactoring tool and thus concluded that you refactored without one. If you are able to remember the change that you made, could you recall or infer why you did or didn't use the tool for that refactoring?
  (For each file, we use green and red annotation highlights to show what was added and removed, and gray highlights to draw your attention to specific parts of code.)
  File: ZZZ.refactoringname-tool.diff.pdf
  Change date: DD/MM/YYYY
  CVS comment: some comment
- Another finding was that developers sometimes repeatedly used the same refactoring tool in quick succession (e.g., used Inline, then used it again in the next few seconds). Can you think of any reasons why you might do this?
- Off the top of your head, please try to name the three refactorings that you perform most often, and three that you perform most often using refactoring tools.
- Do you plan long-term refactoring campaigns, where you engage in extended refactoring for a period of time? If so, what is the motivation? How long do these usually take?
- Are there pitfalls during these campaigns? How would you want refactoring tools to help at those times?
- In our analysis, we looked at only refactorings in commits to the main line. Do you think you refactored differently when you committed to branches?
- How do you think the fact that your team developed tools for Eclipse affected how you used refactoring tools? Do you think you used the refactoring tools more/less/about the same as the average Eclipse Java developer?
- Is there some particular Eclipse refactoring tool (or part of a tool) that doesn't fit with the way that you refactor? If so, please tell us which tool, and what the problem is. Do you desire additional tool support when you refactor? (Below, we've included a list of Eclipse refactoring tools to help jog your memory.)
  - Rename
  - Extract Local Variable
  - Inline
  - Extract Method
  - Move
  - Change Method Signature
  - Convert Local To Field
  - Introduce Parameter
  - Extract Constant
  - Convert Anonymous To Nested
  - Move Member Type to New File
  - Pull Up
  - Encapsulate Field
  - Extract Interface
  - Generalize Declared Type
  - Push Down
  - Infer Generic Type Arguments
  - Use Supertype Where Possible
  - Introduce Factory
  - Extract Superclass
  - Extract Class
  - Introduce Parameter Object
  - Introduce Indirection
- Would you like a copy of the paper when it is complete?

ACKNOWLEDGMENTS
The authors thank Barry Anderson, Christian Bird, Tim
Chevalier, Danny Dig, Thomas Fritz, Markus Keller, Ciaran
Llachlan Leavitt, Ralph London, Gail Murphy, Suresh
Singh, and the Portland Java Users Group for their
assistance, as well as the US National Science Foundation
for partially funding this research under CCF-0520346.
Thanks to the anonymous reviewers and the participants of
the Software Engineering seminar at the University of
Illinois Urbana-Champaign for their excellent suggestions.

REFERENCES
[1] F. Bourquin and R.K. Keller, "High-Impact Refactoring Based on Architecture Violations," Proc. 11th European Conf. Software Maintenance and Reeng., pp. 149-158, 2007.
[2] S. Counsell, Y. Hassoun, R. Johnson, K. Mannock, and E. Mendes, "Trends in Java Code Changes: The Key to Identification of Refactorings?" Proc. Second Int'l Conf. Principles and Practice of Programming in Java, pp. 45-48, 2003.
[3] S. Counsell, Y. Hassoun, G. Loizou, and R. Najjar, "Common Refactorings, a Dependency Graph and Some Code Smells: An Empirical Study of Java OSS," Proc. ACM/IEEE Int'l Symp. Empirical Software Eng., pp. 288-296, 2006.
[4] D. Dig, C. Comertoglu, D. Marinov, and R. Johnson, "Automated Detection of Refactorings in Evolving Components," Proc. 20th European Conf. Object-Oriented Programming, pp. 404-428, 2006.
[5] M. Fowler, Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., 1999.
[6] A. Hindle, D.M. German, and R. Holt, "What Do Large Commits Tell Us? A Taxonomical Study of Large Commits," Proc. Int'l Workshop Mining Software Repositories, pp. 99-108, 2008.
[7] G.C. Murphy, M. Kersten, and L. Findlater, "How Are Java Software Developers Using the Eclipse IDE?" IEEE Software, vol. 23, no. 4, pp. 76-83, July/Aug. 2006.
[8] E. Murphy-Hill and A.P. Black, "High Velocity Refactorings in Eclipse," Proc. OOPSLA Workshop Eclipse Technology Exchange, 2007.
[9] E. Murphy-Hill and A.P. Black, "Breaking the Barriers to Successful Refactoring: Observations and Tools for Extract Method," Proc. 30th Int'l Conf. Software Eng., pp. 421-430, 2008.
[10] E. Murphy-Hill and A.P. Black, "Refactoring Tools: Fitness for Purpose," IEEE Software, vol. 25, no. 5, pp. 38-44, Sept./Oct. 2008.
[11] E. Murphy-Hill, C. Parnin, and A.P. Black, "How We Refactor, and How We Know It," Proc. 31st Int'l Conf. Software Eng., 2009.
[12] W.F. Opdyke and R.E. Johnson, "Refactoring: An Aid in Designing Application Frameworks and Evolving Object-Oriented Systems," Proc. Symp. Object-Oriented Programming Emphasizing Practical Applications, Sept. 1990.
[13] M. Pizka, "Straightening Spaghetti-Code with Refactoring?" Software Eng. Research and Practice, H.R. Arabnia and H. Reza, eds., pp. 846-852, CSREA Press, 2004.
[14] J. Ratzinger, "sPACE: Software Project Assessment in the Course of Evolution," PhD thesis, Vienna Univ. of Technology, Austria, 2007.
[15] J. Ratzinger, T. Sigmund, and H.C. Gall, "On the Relation of Refactorings and Software Defect Prediction," Proc. Int'l Working Conf. Mining Software Repositories, pp. 35-38, 2008.
[16] R. Robbes, "Mining a Change-Based Software Repository," Proc. Fourth Int'l Workshop Mining Software Repositories, pp. 15-23, 2007.
[17] K. Stroggylos and D. Spinellis, "Refactoring: Does It Improve Software Quality?" Proc. Fifth Int'l Workshop Software Quality, pp. 10-16, 2007.
[18] The Eclipse Foundation, "Usage Data Collector Results," http://www.eclipse.org/org/usagedata/reports/data/commands.csv, Feb. 2009.
[19] M.A. Toleman and J. Welsh, "Systematic Evaluation of Design Choices for Software Development Tools," Software - Concepts and Tools, vol. 19, no. 3, pp. 109-121, 1998.
[20] P. Weißgerber and S. Diehl, "Are Refactorings Less Error-Prone than Other Changes?" Proc. Int'l Workshop Mining Software Repositories, pp. 112-118, 2006.
[21] Z. Xing and E. Stroulia, "Refactoring Practice: How It Is and How It Should Be Supported - An Eclipse Case Study," Proc. 22nd IEEE Int'l Conf. Software Maintenance, pp. 458-468, 2006.
[22] T. Zimmermann and P. Weißgerber, "Preprocessing CVS Data for Fine-Grained Analysis," Proc. Int'l Workshop Mining Software Repositories, pp. 2-6, 2004.

Emerson Murphy-Hill received the PhD degree in computer
science from Portland State University. He is an assistant
professor at North Carolina State University. His research
interests include human-computer interaction and software
tools. http://www.csc.ncsu.edu/faculty/emerson.

Chris Parnin is working toward the PhD degree at the
Georgia Institute of Technology. His research interests
include psychology of programming and empirical software
engineering. http://cc.gatech.edu/~vector.

Andrew P. Black received the DPhil degree in computation
from the University of Oxford. He is a professor at Portland
State University. His research interests include the design of
programming languages and programming environments.
In addition to his academic posts, he has also worked as an
engineer at Digital Equipment Corp.

For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
