
A CLUSTERING TECHNIQUE FOR SUMMARIZING MULTIVARIATE DATA
By Geoffrey H. Ball and David J. Hall
Stanford Research Institute, Menlo Park, California

Scientific measurements frequently involve large numbers of variables whose complex
interactions are not easily found. A practical computing method termed ISODATA, which
finds the cluster structure of such data, is described. The resulting description of the data
provides a fit to the data of a set of cluster centers that tends to minimize the sum of the
squared distances of each data point from its closest cluster center. An application to the
grouping or clustering of the answers of 209 people to an 80-question sociological survey
illustrates the utility of the method.

Behavioral Science, Volume 12, 1967
COMPUTERS IN BEHAVIORAL SCIENCE

The application of ISODATA to sociological data was aided by contracts from the Behavioral Sciences Division of the Air Force Office of Scientific Research, Washington, D.C. The development of the technique was aided by the Graphical Data Transducer Branch, Data Division, Communications Department, USAEL, Fort Monmouth, New Jersey, and by internal funding from Stanford Research Institute. The questionnaires described in this paper were designed and administered by the Systems Analysis Laboratory of the Management Sciences Division of Stanford Research Institute, under the direction of Dr. Howard Vollmer. We would like to thank him for providing the data and for his assistance in interpreting the results we obtained. (His interpretation appears in Vollmer, 1964.) Miss Wanda Fedurek and Mrs. Klara Miracle rendered valuable assistance in plotting and tabulating data, for which we thank them.

The work reported here grew out of efforts to develop adaptive preprocessing for pattern-recognition machines (Ball and Hall, 1964). It became apparent, however, that the techniques developed were not limited to this task. In fact, they seem applicable to a recurring basic problem in many disciplines: that of analyzing large numbers of samples of multivariate data.

This technique, termed ISODATA, groups patterns in a way determined primarily by the data itself. (ISODATA represents Iterative Self-organizing Data Analysis Technique (A).) This technique automatically clusters all of the data into distinct and independent groups. An average response pattern is used to represent a group of patterns, and the iterative process creates new average response patterns to improve the accuracy of trial or existing average response patterns. The process also combines average response patterns that are so similar that their being separate fails to provide a significant amount of information about the structure of the response patterns. Computation time for this technique grows linearly with the number of response patterns, the number of dimensions, and the number of clusters. An ALGOL program implementing this procedure on a Burroughs B-5500 digital computer has been applied to two actual research problems as well as to artificial problems (Ball, Brain, Burch, and Hall, 1964). (Run times were typically 15 to 20 minutes.) In outline form, the essential steps of the technique are:

(1) Select a "typical" set of response patterns to be used as initial "cluster points."
(2) Group together patterns that lie closest in Euclidean distance to the same cluster point. (The patterns and these cluster points are both vectors.)
(3) Split into two points the average response pattern of each group found in Step (2) if the group's associated "within-group variability" exceeds a threshold, θ_E, which is set by the operator. The group is split by forming two new cluster points from the average response pattern. The new cluster points are identical to the average response pattern, except for the response to the single item having highest variability, which is given the values +1 and -1, respectively, for the two new cluster points. This results in the "birth" of new cluster points.
(4) Regroup the patterns using the new cluster points, and then again find the average response pattern within each group.
(5) Compute distances between all pairs of average response patterns.
(6) Combine groups whose average response patterns are closer together than a threshold value θ_c. This corresponds to the "death" of certain cluster points through the combining of two clusters.
(7) Iterate the procedure.

The thresholds θ_c and θ_E are set by the operator. By changing them he can obtain any number of clusters between one and the number required for every response pattern to be in a cluster by itself.

We now describe the analysis that the technique makes possible of the responses by 209 Air Force scientists and engineers to an 80-item sociological questionnaire.

The procedure automatically sorted the 209 response patterns given by the Air Force scientists into seven groups. (The collective response of each subject to all 80 items will be called a response pattern.) These seven groups are constructed so that the average response pattern calculated from the patterns within a group is reasonably typical of that group. In other words, any response pattern in a group more closely corresponded to the average response pattern of its own group than to the average response pattern of any other group.

These average response patterns were used to provide a descriptive word "profile" of each group, which helps the researcher to get a "feel" for the structure of the group and its relationship to the other groups. For example, the word profile for Group 7 describes the people in this group as being between 30 and 50 years of age (95 percent are in this category); highly educated (50 percent Ph.D., 50 percent M.S.); civilians, primarily performing basic research (80 percent); planning to work for the employing organization in ten years; publishing professional papers frequently (75 percent have five or more in the last five years); quite satisfied with their jobs (55 percent highly satisfied and only 15 percent dissatisfied).

This can be compared to the word profile of Group 6, where 20 percent are under 30 years of age; have a moderate amount of education (33 percent M.S., 67 percent B.S.); are civilians; primarily performing applied or basic-applied research (60 percent); planning to be working in the same organization in 10 years; publishing infrequently (45 percent have 0-2 publications in the last five years); only moderately satisfied with their jobs (30 percent highly satisfied and 35 percent dissatisfied).

The average response pattern in each group is a real-valued vector. Using these vectors, we were able to compute a distance between all pairs of average response patterns and to plot the average response pattern for each of the seven groups.

We also examined the change in the average response pattern as we moved from one group to another group. These changes were descriptive of the relationships between the groups.

It is often difficult to comprehend and manipulate large tables of numerical data and to extract the important relationships from large quantities of numerical data. In this specific instance, the average response pattern, calculated from all 209 responses, is not an adequate representation of the data; there is still too much variation around the average for most of the responses. By clustering the response patterns into a number of separate groups, we obtained seven "typical" average response patterns that were much more meaningful because the variety of response within each group had been reduced. These average response patterns can be accurately related to a word profile that is much better matched to human comprehension and evaluation. We found our visual plot also helped to illustrate the relationships, although it is not essential and may not be possible for larger numbers of clusters.
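
As an illustration only, the seven steps outlined earlier can be sketched in Python. This is not the ALGOL program described in the text. All function and parameter names (`assign`, `split`, `merge`, `isodata_sketch`, `theta_e`, `theta_c`, `iterations`) are hypothetical, and several details the outline leaves open are filled in by assumption:

- "Within-group variability" is taken to be the standard deviation of the most variable item.
- Two combined groups are replaced by the size-weighted average of their average response patterns.
- Empty groups are dropped.
- The procedure runs for a fixed number of iterations.
- The +1 and -1 split values assume items coded as +1 or -1.

```python
import math

def mean(points):
    """Average response pattern of a non-empty group."""
    return [sum(col) / len(points) for col in zip(*points)]

def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(patterns, centers):
    """Steps (2) and (4): group each pattern with its closest cluster point."""
    groups = [[] for _ in centers]
    for p in patterns:
        k = min(range(len(centers)), key=lambda i: sqdist(p, centers[i]))
        groups[k].append(p)
    return [g for g in groups if g]  # assumption: empty groups are dropped

def split(groups, theta_e):
    """Step (3): split a group on its most variable item.
    Assumption: variability is measured as a per-item standard deviation."""
    centers = []
    for g in groups:
        m = mean(g)
        sd = [math.sqrt(sum((p[j] - m[j]) ** 2 for p in g) / len(g))
              for j in range(len(m))]
        j = max(range(len(m)), key=sd.__getitem__)
        if sd[j] > theta_e:
            hi, lo = list(m), list(m)
            hi[j], lo[j] = 1.0, -1.0  # the "birth" of two new cluster points
            centers += [hi, lo]
        else:
            centers.append(m)
    return centers

def merge(centers, sizes, theta_c):
    """Steps (5)-(6): repeatedly combine the closest pair of average patterns
    while their Euclidean distance is below theta_c.
    Assumption: the combined point is the size-weighted average."""
    centers, sizes = [list(c) for c in centers], list(sizes)
    merged = True
    while merged and len(centers) > 1:
        merged = False
        d, i, j = min((math.sqrt(sqdist(centers[i], centers[j])), i, j)
                      for i in range(len(centers))
                      for j in range(i + 1, len(centers)))
        if d < theta_c:
            n = sizes[i] + sizes[j]
            centers[i] = [(sizes[i] * a + sizes[j] * b) / n
                          for a, b in zip(centers[i], centers[j])]
            sizes[i] = n
            del centers[j], sizes[j]  # the "death" of a cluster point
            merged = True
    return centers

def isodata_sketch(patterns, initial, theta_e, theta_c, iterations=5):
    centers = [list(c) for c in initial]           # step (1)
    for _ in range(iterations):                    # step (7)
        groups = assign(patterns, centers)         # step (2)
        centers = split(groups, theta_e)           # step (3)
        groups = assign(patterns, centers)         # step (4)
        centers = merge([mean(g) for g in groups],
                        [len(g) for g in groups], theta_c)  # steps (5)-(6)
    groups = assign(patterns, centers)
    return [mean(g) for g in groups], groups

def sum_sq(groups):
    """Sum of squared distances of each pattern from its group average."""
    return sum(sqdist(p, mean(g)) for g in groups for p in g)

if __name__ == "__main__":
    # Made-up toy data: ten 4-item patterns coded +1 / -1.
    patterns = ([[1, 1, 1, 1]] * 4 + [[1, 1, 1, -1]]
                + [[-1, -1, -1, -1]] * 4 + [[-1, -1, -1, 1]])
    centers, groups = isodata_sketch(patterns, [patterns[0]],
                                     theta_e=0.9, theta_c=1.0)
    print(len(groups), centers, sum_sq(groups))
```

On this made-up data, a single initial cluster point is split once and the two resulting groups are stable. Raising `theta_c` far enough makes the two groups combine again, which is the operator control over the number of clusters that the text describes.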


From our readings in the literature it is evident that other people have considered similar clustering techniques. We have made a comparison of ISODATA with these and other techniques (Ball, 1965). We are continuing to modify and refine ISODATA and have developed a program to cluster real-valued as well as binary-valued data (Ball and Hall, 1965). We are now applying it to the design of pattern recognition preprocessing for use with meteorological data. We are also using it in other applications including analysis of responses to psychological tests and the examination of physiological measurements.

REFERENCES

Ball, G. H. Data analysis in the social sciences: what about the details. Proc. Fall Joint Computer Conference, 1965, 27, Pt. I, 533-559.

Ball, G. H., Brain, A. E., Burch, G. H., & Hall, D. J. Graphical data processing research study and experimental investigation. Technical Report 16, Contract DA 36-039 AMC-03247(E), SRI Project 4565. Menlo Park, California: Stanford Research Institute, July, 1964.

Ball, G. H., & Hall, D. J. Some fundamental concepts and synthesis procedures for pattern recognition preprocessors. Paper presented at the International Conference on Microwaves, Circuit Theory, and Information Theory, Tokyo, September, 1964.

Ball, G. H., & Hall, D. J. ISODATA, a novel technique for data analysis and pattern classification. Technical Report, Menlo Park, California: Stanford Research Institute, May 1965.

Vollmer, Howard M. Applications of the behavioral sciences to research management; an initial study in the office of aerospace research. Technical Report, Menlo Park, California: Stanford Research Institute, November 1964.

(Manuscript received September 13, 1966)

The pure desire for social good does not indeed operate in human
affairs unalloyed by egotistic motives, but on the other hand what
we call egotistic motives do not act without direction from an
involuntary reference to social good.
T. H. GREEN. Principles of political obligation, 1880
