I. INTRODUCTION
Modern software development processes often involve
multiple developers and development teams, sometimes
located on different continents and in different time zones. Communication and coordination in such projects require
support from various kinds of software repositories,
including mail archives, bug trackers and version control
systems. The abundance of data available in such repositories
has triggered an extensive research effort on mining software
repositories [1], [2], [3], [4], [5], [6], i.e., automatic
analysis of the development process based on such data.
In this paper we advocate applying process mining
to the analysis of data from multiple software repositories.
Process mining [7], [8] aims at extracting information
from event logs produced by an information system, in
order to capture the business process supported by the
information system. It has already been demonstrated to
be a valuable technique for analyzing business processes
in various domains [8], [9]. Process mining makes a clear
separation of the event log preparation [10] from the
event log analysis [7]. Therefore, once data from multiple
software repositories has been translated to an event log
suited for process mining, a wide range of process mining
techniques becomes applicable. This contrasts sharply
with many existing repository mining applications [4], [5]
that tightly couple the preprocessor to the analyzer, so that
the analyzers cannot be reused or combined.
Figure 1. [UML class diagram. Classes, in order: EventLog, Process, ProcessInstance, Event, Activity, linked with multiplicities 1..*, 0..*, 1..* and 1..*. Event attributes: activity : Activity, description : String, timestamp : Date, person : Originator, ...]
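The event log structure of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration only: the class and attribute names follow the figure, while `case_id`, `name`, `trace()` and the sample bug report are hypothetical additions. It is not the data model actually used by FRASR or ProM.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Event:
    activity: str            # e.g. "Ticket-created"
    timestamp: datetime
    person: str              # the originator of the event
    description: str = ""


@dataclass
class ProcessInstance:
    case_id: str             # hypothetical identifier, e.g. a bug number
    events: List[Event] = field(default_factory=list)

    def trace(self) -> List[str]:
        """Return the activities of this case ordered by timestamp."""
        ordered = sorted(self.events, key=lambda e: e.timestamp)
        return [e.activity for e in ordered]


@dataclass
class Process:
    name: str
    instances: List[ProcessInstance] = field(default_factory=list)


@dataclass
class EventLog:
    processes: List[Process] = field(default_factory=list)


# Hypothetical bug report whose events arrive out of order.
bug = ProcessInstance("bug-1", [
    Event("Ticket-resolved(fixed)", datetime(2010, 3, 2, 9, 0), "dev-a"),
    Event("Ticket-created", datetime(2010, 3, 1, 9, 0), "dev-b"),
])
log = EventLog([Process("bug life cycle", [bug])])
```

Any preprocessor that emits this structure can, in principle, feed any analyzer that consumes it, which is the decoupling argued for above.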
Figure 2. Architecture of FRASR. [Class diagram. Elements shown: FRASR, Basic, Developer, DeveloperAlias, ExportType, Project, Datasource, Detailed, Type-specific, Filter, EventBinding, Category-specific, Case.]
[Figure: workflow diagram. Labels shown: FRASR, ProM, Answer, Define data sources, Define case mapping, Attach event bindings, Calculate developer matching, Export.]
most popular SourceForge project of all time. To analyse aMSN we have considered seven bug repositories
(bugs, feature requests, patches, plugins, skins, support
requests and translations), three mail archives (commits, devel and lang) and one Subversion repository located at https://amsn.svn.sourceforge.net/svnroot/amsn/. We have focused on the period from
February 26, 2002 until July 9, 2010. In total, the
repositories contained 3137 bug reports, 34947 mail messages and 12062 revisions. The aMSN project also has a
discussion forum. However, the data of this forum cannot
be used in the current implementation of FRASR.
The data from the software repositories has been exported using the developer case and the data-source-specific binding for each data source. The developer matching
has been calculated automatically using the simple heuristics mentioned in Section II-C. Furthermore, we assume
that the time stamps are synchronous, i.e., when time
stamps from two repositories are equal, the points in real
time at which they were recorded do not differ significantly.
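To illustrate the developer case, the following sketch merges events from several repositories into one time-ordered trace per developer. It is an assumption-laden illustration rather than FRASR's export code: the function `build_developer_cases`, the tuple layout of the events and the alias table are all hypothetical, and converting time stamps to UTC is just one simple way of making them comparable under the synchronicity assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def build_developer_cases(events, alias_to_developer):
    """Group (alias, activity, timestamp) events into one trace per developer.

    Timestamps must be timezone-aware; they are normalised to UTC so that
    events from different repositories can be ordered against each other.
    """
    cases = defaultdict(list)
    for alias, activity, timestamp in events:
        # Unknown aliases are treated as developers in their own right.
        developer = alias_to_developer.get(alias, alias)
        cases[developer].append((timestamp.astimezone(timezone.utc), activity))
    for trace in cases.values():
        trace.sort()
    return dict(cases)


# Hypothetical events from a version control system, a mail archive
# and a bug tracker.
cest = timezone(timedelta(hours=2))
events = [
    ("alice", "VCS: A", datetime(2010, 1, 5, 12, 0, tzinfo=timezone.utc)),
    ("alice@example.org", "Mail reply", datetime(2010, 1, 5, 11, 30, tzinfo=cest)),
    ("bob", "Ticket-created", datetime(2010, 1, 6, 8, 0, tzinfo=timezone.utc)),
]
cases = build_developer_cases(events, {"alice@example.org": "alice"})
```

In this example the mail reply, sent at 11:30 in a UTC+2 zone, is placed before the commit recorded at 12:00 UTC.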
3) Results: Using the exported log in combination with
the ProM Dotted Chart visualization and a spreadsheet
application, the developers were assigned to one of the
available roles. Figure 4 presents a part of a Dotted
Chart visualization, used in the classification. Green dots
correspond to mail events such as Mail thread created and
Mail reply, black to Ticket-created, red to other bug tracker
events, blue to addition of files in the version control
system, and finally white to other events of the version
control system (modifications, deletions and renames). The
size of a dot represents the number of events occurring in
the same week, and a mixture of colors corresponds to events
of different kinds occurring in the same week.
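The weekly aggregation behind such a chart can be sketched as follows. This is not the ProM Dotted Chart implementation; it only illustrates the bucketing described above, and the function `weekly_dots`, the event tuples and the use of ISO calendar weeks are assumptions.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_dots(events):
    """Bucket (developer, kind, day) events per developer and ISO week.

    Each bucket is a Counter of event kinds: its total gives the dot size,
    its keys give the colors that are mixed in that dot.
    """
    dots = defaultdict(Counter)
    for developer, kind, day in events:
        iso = day.isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        dots[(developer, week)][kind] += 1
    return dots


# Hypothetical events of a single developer.
events = [
    ("dev-a", "Mail reply", date(2010, 3, 1)),
    ("dev-a", "VCS: M", date(2010, 3, 3)),
    ("dev-a", "VCS: M", date(2010, 3, 5)),
    ("dev-a", "Ticket-created", date(2010, 3, 8)),
]
dots = weekly_dots(events)
first_week = dots[("dev-a", (2010, 9))]
dot_size = sum(first_week.values())
```

Here the first three events fall into the same week and form one dot of size 3 with two mixed colors, while the fourth event forms a separate dot in the following week.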
By inspecting Figure 4 we clearly see that the developer
in the first line is represented by a sequence of overlapping
white, red and blue dots, starting at the very beginning
of the project. According to the classification rules above
this developer (Alvaro J. Iradier Muro/airadier²) will be
classified as the project leader. Furthermore, we observe
that some developers are represented by long sequences of
overlapping dots, as is, for instance, the case for Alaoui
Youness/kakaroto and Boris Faure/billiob. These developers are core members of the project. Shorter sequences represent active developers, such as Arieh Schneier/lio lion
and Tom Jenkins/bluetit. Finally, disconnected dots are
characteristic of the sporadic activity of peripheral developers. This is, for instance, the case for Harry Vennik/thaven.
Visual inspection of Figure 4 provides qualitative
results. Additional quantitative results have been obtained
by exporting the relation between the developers and
activities, expressed by a so-called ProM originator-by-activity
matrix, to the spreadsheet application and
performing simple counting. In this way, out of 1725
developers we have identified 1443 bug reporters, 3 bug
2 The developer names are derived from the information of the associated developer aliases. This includes, for example, a username and a
name associated with an email address.
Figure 4. Dotted Chart visualization of the top-15 developers of aMSN, sorted by number of events. Color legend: Green: Mail thread created and
Mail reply. Black: Ticket-created. Red: Ticket-closed, Ticket-commented, Ticket-reopened. Blue: VCS: A (file added). White: VCS: M (file modified),
VCS: D (file deleted) and VCS: R (file renamed)
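The counting performed on the originator-by-activity matrix can be sketched as follows. The sketch does not reproduce the ProM export or the spreadsheet formulas; the helper names are hypothetical, and the selection rule used in the example (originators whose only activity is Ticket-created) is an assumption made for illustration, not the classification rule of this paper.

```python
from collections import Counter, defaultdict


def originator_by_activity(events):
    """Count, for every originator, how often each activity occurs."""
    matrix = defaultdict(Counter)
    for originator, activity in events:
        matrix[originator][activity] += 1
    return matrix


def originators_with_only(matrix, activities):
    """Originators all of whose events belong to the given activities."""
    allowed = set(activities)
    return sorted(o for o, row in matrix.items() if set(row) <= allowed)


# Hypothetical (originator, activity) pairs taken from an event log.
events = [
    ("dev-a", "Ticket-created"),
    ("dev-a", "Ticket-created"),
    ("dev-b", "Ticket-created"),
    ("dev-b", "VCS: M"),
    ("dev-c", "Mail reply"),
]
matrix = originator_by_activity(events)
only_reporting = originators_with_only(matrix, ["Ticket-created"])
```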
Figure 6. Bug life cycle Fuzzy Graph, extracted from the GCC Bugzilla repository.
from Figure 5.
The thickness of an arrow in Figure 6 represents the
percentage of process instances (bug reports) making the
transition to that state. Figure 6 shows that most of the bug
reports are either immediately resolved (Ticket-created,
Ticket-new, Ticket-resolved(fixed)) or successfully resolved after one or more assignments (Ticket-created,
Ticket-new, Ticket-assigned, Ticket-resolved(fixed)).
Several other paths, however, are present in the graph,
including an unexpected recreation of tickets, indicated
by arrows entering Ticket-created, as opposed to Ticket-reopened.
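The percentages that determine arrow thickness can be approximated by counting, for each pair of consecutive activities, the fraction of bug reports in which that transition occurs. The sketch below is a simplification under stated assumptions: it is not the Fuzzy Miner algorithm of ProM, the function `transition_shares` is hypothetical, and the three traces are invented for illustration.

```python
from collections import Counter


def transition_shares(traces):
    """Fraction of traces in which each directly-follows transition occurs."""
    counts = Counter()
    for trace in traces:
        # A transition is counted at most once per process instance.
        counts.update(set(zip(trace, trace[1:])))
    total = len(traces)
    return {transition: n / total for transition, n in counts.items()}


# Hypothetical bug report traces.
traces = [
    ["Ticket-created", "Ticket-new", "Ticket-resolved(fixed)"],
    ["Ticket-created", "Ticket-new", "Ticket-assigned", "Ticket-resolved(fixed)"],
    ["Ticket-created", "Ticket-new", "Ticket-resolved(fixed)"],
]
shares = transition_shares(traces)
```

In this example every bug report moves from Ticket-created to Ticket-new, two out of three are resolved immediately afterwards, and one out of three is assigned first.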
4) Conclusions: Summarizing the results above, we observe that the official bug report life cycle as presented in
Figure 5 provides a simplified view on the way bug reports
has attracted significant attention from the research community: CVSgrab [1] supports gathering and visualizing
data from CVS repositories. Using CVSgrab, a user can
answer questions like "What is / was the development
process?" and "What are the main contributors and their
responsibilities?". Other tools, like the eROSE [2] and
ProjectWatcher [6] plugins for Eclipse, assist developers in finding related artefacts. Hipikat [5] is similar
to ProjectWatcher, but uses data from multiple software
repositories (e.g., Subversion, Bugzilla). Using Hipikat, a
developer new to a project can quickly become familiar
with the project group memory. However, as these tools
are geared towards a single developer, they are not
well suited for analyzing the development process of the
project. Similarly, Alitheia Core [4] can calculate metric
values based on multiple software repositories, while
softChange is a fact enhancer and a visualizer supporting data import from mail archives, Bugzilla repositories
and CVS repositories. All these tools gear the analysis
towards specific visualizations. Unlike them, our approach
separates the preprocessing step carried out by FRASR
from the analysis step, and hence makes a multiplicity of
mining and analysis techniques readily available. In the
case studies above, we have used ProM to carry out the
analysis, but commercial process mining tools such as
Futura Reflect (http://www.futuratech.nl) could
have been used for this purpose as well.
Matching identities from various sources is a nontrivial
task, as the identities can be usernames, e-mail addresses,
real names, etc. In [15], Robles and Gonzalez-Barahona
present an approach for matching developer aliases from
various software repositories. In this approach, identities
are constructed by using information from the sources (like
an e-mail address or a username). For example, the name
and surname can be extracted from an e-mail address like
name.surname@example.com; a technique also applied in
FRASR. They also use GPG keys (which contain a list of
e-mail addresses a developer may use for encryption and
authentication purposes) and other information related to
the developers. Bird et al. propose a technique more specific to matching e-mail addresses [14]. In this approach
they use the Levenshtein edit distance between (parts of)
e-mail addresses to determine the similarity.
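Both heuristics can be illustrated with a short sketch. It is neither the implementation of [14] or [15] nor that of FRASR: the helper names, the set of separator characters and the normalisation of the distance to a similarity score are assumptions.

```python
import re


def levenshtein(a, b):
    """Levenshtein edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_parts(email):
    """Split the local part of an e-mail address into candidate name parts."""
    local = email.split("@", 1)[0].lower()
    return [part for part in re.split(r"[._\-+]", local) if part]


def similarity(a, b):
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


parts = name_parts("name.surname@example.com")
score = similarity("name.surname", "name_surname")
```

Two aliases could then be treated as candidates for the same developer when their score exceeds a chosen threshold; the value of such a threshold is not specified here and would have to be tuned per project.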
V. RELATED WORK
VI. CONCLUSIONS
In this paper we advocate applying process mining techniques to mining software repositories. We have identified
the challenges that should be addressed to enable this
application, discussed how they can be addressed and
presented FRASR, our prototype implementation. Unlike
existing approaches to repository mining, the proposed
approach makes a clear separation between the preprocessing step and the analysis step, fostering reuse of analysis
techniques.
Case studies coming from different domains have
shown that process mining in combination with FRASR
yields valuable insights into software development processes. The flexibility provided by FRASR allows to
Figure 5. [State diagram of the bug life cycle. States: UNCONFIRMED, NEW, ASSIGNED, RESOLVED, REOPEN, VERIFIED, CLOSED. Possible resolutions: FIXED, DUPLICATE, WONTFIX, WORKSFORME, INVALID. Transition labels: Bug is reopened, was never confirmed; Bug confirmed or receives enough votes; Developer takes possession; Ownership is changed; Development is finished with bug; Issue is resolved; QA not satisfied with solution; QA verifies solution worked; Bug is reopened; Bug is closed.]
[5] D. Čubranić, G. C. Murphy, J. Singer, and K. S. Booth, "Hipikat: A project memory for software development," IEEE Trans. Softw. Eng., vol. 31, no. 6, pp. 446-465, 2005.