Molecular Mechanics with an Array Processor
Peter H. Berens and Kent R. Wilson

Department of Chemistry, University of California, San Diego, La Jolla, California 92093
Received 6 July 1982; accepted 14 December 1982
Computer simulation of the mechanics of molecular systems is a popular and powerful method for understanding chemical processes. The complexity of modeled chemical systems has advanced from hard spheres and rare gases to liquid solutions and biomolecules. Such simulations are computationally intensive and thus are limited by the speed of available computers. This article describes the use of specialized hardware, a high-speed floating-point array processor (AP), to dramatically speed up molecular mechanics, in other words molecular dynamics, Monte Carlo, and energy-minimization calculations. Although the array processor is a cost-effective solution for computationally intensive problems in terms of hardware (full-time AP usage is equivalent to 2-8 h/day of Cray-1 time), its full speed comes at the expense of programming in a relatively difficult parallel assembly language. Since the architecture of the machine is dramatically different from that of conventional computers, and utilizing its speed necessitates using this architecture at the assembly language level, the proper design and implementation of algorithms is critical. The molecular mechanics software design discussed here, consisting of 12,000 lines of C and 7000 lines of AP assembly language code, is quite general and has been used to study systems ranging from rare gases to biomolecules. This implementation yields effective speeds approximately 35 times faster than a dedicated DEC VAX 11/780 computer with floating-point accelerator and optimized VMS FORTRAN, thus allowing simulations which would require a year of dedicated VAX time to be run in one and a half weeks on the AP. The flexibility of the UNIX operating system, whose source code is accessible and can be modified to optimize performance, combined with the modern features of the C language, has made this implementation much easier by providing a convenient and powerful environment in which to embed the hand-coded AP assembly language modules. Applications to date range from the molecular-dynamic calculation of infrared, Raman, and electronic spectra in gas and liquid solutions to the calculation of thermodynamic quantities for water and the simulation of the molecular dynamics of solution reactions and polypeptides.
Journal of Computational Chemistry, Vol. 4, No. 3, 313-332 (1983)
© 1983 by John Wiley & Sons, Inc. CCC 0192-8651/83/030313-20$03.00

I. INTRODUCTION

In recent years molecular mechanics, the computer simulation of molecular systems using molecular dynamics, Monte Carlo, and energy minimization, has emerged as a powerful tool for investigating and understanding chemical properties and processes. While straightforward in principle, these techniques can be unusually demanding computationally and thus the size and complexity of systems which can feasibly be simulated are determined by the speed and availability of computer hardware. In this article, we discuss a particular solution to the computational needs of molecular mechanics, the use of specialized hardware, a high-speed array processor, in our case a Floating Point Systems, Inc. AP-120B, whose use we pioneered after purchasing serial number 2. In other papers,1,2 we have also discussed an alternative solution, the division of the problem among an array of different processors operating in parallel. Yet a third solution3 is to use a vector-processing machine such as a Cray-1. We will focus here primarily on molecular dynamics within the array-processor program package for molecular mechanics we have developed called "Newton," whose conception is described in earlier papers.1,2 In Section II, we examine the architecture of the AP-120B, as this is necessary to understand how to use the machine efficiently. In Section III, we lay out the specifics of the structure of the program package we have developed for molecular dynamics. Finally, in Section IV we present the results, analyze potential array-processor improvements, and point out the advantages and importance of the environment in terms of language and operating system.

Classical molecular dynamics4 is the simulation of systems by the numerical integration of, e.g., Newton's second law, to obtain particle trajectories, i.e., coordinates and momenta as functions of time, and then the computation of physical properties as averages over these trajectories. In the Monte Carlo technique, particles are moved artificially as the result of random number generation rather than dynamically. The Monte Carlo technique is less general than molecular dynamics in that it is appropriate for the calculation of properties which depend upon averages over coordinates, but not for properties which are explicitly time dependent. These fields are reviewed by Alder,5 Wood and Erpenbeck,6 Valleau and Whittington,7 McDonald,8 Binder,9 and Wood,10 and are the subject of a workshop.11 Energy minimization12-15 is the search for the minimum energy of the system as a function of particle coordinates.

Early liquid-state molecular-dynamic simulation16 involved hard sphere atoms. Later, soft potentials, e.g., Lennard-Jones, were introduced.17,18 Computer simulation became a tool for testing approximate theories for atomic liquids.19 More recently, with increasingly powerful software and faster hardware, attention has focused on more complex molecular systems, ranging from water10 to proteins20 and nucleic acids.

Array processors can bring much greater computer power to bear on chemical problems than was previously available.1,2,21,22 However, their use carries the serious disadvantage that, at least up to the present, their maximum efficiency has only been gained by assembly language programming, as an efficient compiler for such unusual architecture is not yet available. Thus the development of efficient and general purpose software for array processors is very important as, in the long run, software costs will usually dominate hardware costs even in a university environment. We thus hope that the techniques discussed here which we have used in the development of Newton, our tool for molecular mechanics on an array processor, will prove useful to others using array processors for molecular mechanics as well as for other classes of problems.

Table I. Comparison of processor cost/performance for demodulation problem.

Machine       Max. theoretical  Time per        Achieved   Approx. cost        Installation costs     Software development
              megaflops         iteration (ms)  megaflops  (dollars per flop)  (millions of dollars)  time (man-months)
Cray-1        80-140            2.1             38.4       0.21                8                      0.5
Star-100      25-50             4.9             16.8       0.48                8                      2.0
Illiac IV     40-80             9.0             9.1        1.10                10                     3.0
AP-120B       6-12              13.9            5.9        0.03                0.15                   6.0
CDC 7600      5-15              25.0            3.3        0.91                3                      1.1
IBM 370-168   2-4               94.0            0.87       2.30                2                      1.0
CDC 6600      1-3               130.0           0.63       1.59                1                      1.0
VAX-11/780    0.5               311.0           0.26       0.77                0.2                    0.5
PDP-11/70     0.2               870.0           0.09       1.67                0.15                   1.2

II. ARRAY PROCESSOR

There are three ways to build faster computer hardware. The first is to increase the speed of the logical elements themselves; the second is to horizontally spread out the calculation among many parallel elements all working at once; and the third is to vertically spread out the steps of the calculation in time along a pipeline, so that many successive different calculations can trickle down the pipeline together. Arrays of processors1,2 are an example of highly parallel architecture, and supercomputers such as the Cray-1 and Cyber 205 make extensive use of pipelines (as well as parallelism).

The Floating Point Systems AP-120B (AP) array processor is a special purpose computer, which combines all three approaches (logic speed, parallelism, and pipelining) but pushes none of these to its extreme limit. It is designed for very fast processing of floating-point arithmetic.23 In comparison, logical instructions are much slower and more cumbersome. It is not an array of processors as many infer from the name, but a single parallel and pipelined processor whose architecture is easily exploited for array-type operations. The maximum floating-point rate is 12 million operations per second (12 megaflops). This speed puts the AP in the same computational class as many larger main-frame computers, as indicated in Table I. Because it is a peripheral processor, the AP requires a host computer to initiate data and program transfers to and from it. The nature of the
host interface as well as the host operating system are thus important factors, as will be discussed below.

Why Use an Array Processor?

The major reason that the use of an array processor is attractive is its cost effectiveness. This is shown in Table I from the work of Bucy and Senne24 and is described in detail by Karplus and Cohen.25 Columns 1 and 2 show computers that have been commonly used in scientific numerical applications and their theoretical speed, as measured in millions of floating-point operations per second (megaflops). Columns 3 and 4 show the performance in terms of time and achieved megaflops for a sample application. Columns 5 and 6 show the motivating economic reason for using the AP-120B. It has the lowest cost per achieved megaflop and one of the lowest overall installation costs. However, the last column shows where the real price is paid. The AP-120B has a substantially higher program-development time, in addition to a costly learning curve as users must become familiar with its unusual and challenging parallel assembly language.

Programming Techniques

Three methods of programming the array processor are available. The level highest in abstraction, but slowest in speed, is the FORTRAN compiler. FORTRAN is a language familiar to most computational chemists, but unfortunately no existing FORTRAN compiler26-28 for the AP-120B generates efficient or compact code. Compiled code is thus lengthy (one can quickly consume more program-source memory than is available for the AP-120B) and therefore necessarily slow, since the machine is synchronous. Most applications would thus require the storage of instructions in main data memory, which is costly not only due to the loss of main-data memory (two locations per instruction stored) but also in the overhead of the overlaying process, both in terms of the bookkeeping required and the actual transfer of the instructions to and from main data memory as the overlaying is done. The improvement in program speed and performance of FORTRAN in the AP-120B over a VAX 11/780 can be as little as a factor of 2 or 3 (ref. 21). This degrades even further if many data transfers to and from the host or much loading of program source memory from main data memory is required. However, White29 has reported up to a 50-fold increase in speed when using the Toast FORTRAN development system28 when compared to execution time on a DEC PDP 11/40 minicomputer, although it should be noted that the PDP 11/40 is significantly slower than the newer generation of superminis, such as the VAX 11/780.

The second level of usage is to vectorize the algorithm into a form in which the problem can be solved by a series of calls to canned vector routines written in assembly language by Floating Point Systems (FPS). This can be an improvement over the FORTRAN compiler, but since the code for the individual vector operations cannot be overlapped, full advantage cannot be taken of the speed of the AP. Two modes of operation are possible, one in which the canned routines are called on an individual basis from the host C or FORTRAN program, and the other where these calls are linked together in a rudimentary higher-level language, known as the vector-function chainer. The vector-function program is then compiled into AP assembly code and executes the set of operations as a group. It is vitally important to use the vector-function chainer to string these operations together so that the overhead involved in loading and starting the AP is only paid once. However, the vector-function chainer generates even worse code than the FORTRAN compiler, so any loops coded in the vector-function program can also degrade performance. A typical increase in speed over that of a VAX 11/780 is a factor of 3-20 using this approach, as it is very dependent on how well the particular canned functions encompass the problem being solved. Molecular mechanics calculations tend to be at the low end of this range.

The third level of usage, needed to realize the full potential of the AP, is to hand-code commonly used algorithms in AP-120B assembly language. This presents a problem to most computational chemists in that it may require skills unfamiliar to them. In order to investigate the reason for the necessity and inherent difficulty of assembly language coding, a knowledge of the architecture of the AP is helpful.

Architecture

The array processor is a parallel and pipelined machine.30 A parallel computer is one that can perform more than one operation in a single instruction. The array processor consists of several independent memories and arithmetic units, as shown in Figure 1, all of which can be addressed or initiated within a single instruction of the AP-120B. There are three types of memory. Program source memory, which is 64 bits wide, is used to store assembly language instructions. Data is not stored intermixed with instructions, as in most computers, but resides primarily in main-data memory (38 bits wide), which is available in two speeds known as "fast" and "slow" memory. Data can also be stored in table memory which can consist of both read only memory (ROM) and random access memory (RAM). In our configuration, we have 64K words of main data memory, 2K words of table ROM, 4K words of writable table memory, and 1.5K words of program source memory. The actual hardware of the machine can support up to 4K of program source, 4K of writable table memory, and 512K of main data memory (however, only one individual 64K bank can be accessed at any one time). In addition, there are two banks, DPX and DPY, of 32 (38 bit) registers called data pads (only eight of which can be accessed at a time) for the storage of intermediate results. Addresses (pointers) and integer parameters are stored in sixteen (16 bit) integer registers, known as S pads. In addition to the memories, there are two separate floating-point arithmetic processors, an adder and a multiplier. The floating adder is capable of a variety of operations such as addition, subtraction, logical and, as well as logical or. There is also an integer arithmetic unit which can be initiated at every instruction and whose result is available for use in the instruction in which it was initiated.

[Figure 1: block diagram grouping the AP-120B into memories (program source, main data, table memory), registers (including the S pad), and functional units (floating adder, floating multiplier, integer operations).]
Figure 1. AP-120B multiple functional units illustrating the parallel nature of the array processor (AP). It consists of multiple memory and arithmetic units which can all be accessed or initiated in a single instruction cycle.

Figure 2 shows the pipeline30 nature of the AP-120B. The floating multiplier of the AP is a three-stage pipeline, and thus a multiply operation started in one instruction will take a minimum of three instructions to complete. A new multiply, however, can be initiated at every instruction and each initiation causes the preceding operations to be pushed through the stages of the pipeline. If the preceding operations are not pushed, the results stay in the pipeline until pushed out later by initiating other multiplies. The floating adder is a two-stage pipeline with similar properties to the multiplier. The memory units are also pipelined in the sense that a memory fetch initiated in one instruction does not complete and cannot be used until three (or two, in the case of table memory) instructions later.

Program branching can only be performed, with a few exceptions, on the output of either the floating adder or the integer arithmetic unit. For example, to branch if a particular floating number is negative, the number must first be generated in the adder, which requires two instructions. On the third instruction, the result is available from the adder; and on the fourth, and not before, the result may be tested and branched upon. This is the principal reason that calculations which require a significant amount of logic do not perform as well on the AP-120B. Branching based on integer operations is not as slow, because the result is available in the instruction in which the operation is initiated and may thus be tested in the following instruction. In either case, building logic efficiently into an AP-120B assembly language program can be quite cumbersome.

Another important feature of the architecture of the AP-120B is that each separate memory or register unit can have one read or write (in the case of registers like DPX or DPY both a read and a write) initiated per instruction. Thus, the placement of data among the various memories is extremely important in that the more spread-out the data, the more data can be accessed in an individual instruction. For this reason it is important to have writable table memory in addition to main data memory. The difficulty encountered with the use of writable table memory is that changing data stored in that memory is a multiple-instruction process and very slow in comparison with writing into main data memory. Whether it is important to have the fast rather than the slow main data memory is highly dependent on the algorithm being coded. For molecular mechanics code, which is usually computation rather than memory bound, the slow memory seems adequate.

As can be seen in Figure 3, all these units interconnect through a complicated data-pad bus structure. It is contention for the data-pad bus, which can contain only one value per instruction, that is the most common difficulty in AP assembly language programming. Thus efficient programming of the AP-120B involves exploiting the parallel nature of the components of the AP and the inner connectivity of the data paths to give the maximum throughput. This means that the manner in which a particular part of an algorithm is coded in the early part of the code has long-reaching effects in the later stages of coding the algorithm. In many instances, it is necessary to
code an algorithm in more than one way and then examine which scheme can be made the most efficient. This can require a high degree of patience and persistence.

[Figure 2: pipeline stage diagram for the floating adder, floating multiplier, table memory read, and main data read, marking the cycle in which each result becomes available.]
Figure 2. AP-120B functional unit pipelines. The various memory and arithmetic unit pipelines are shown. The floating multiplier and main data memory consist of three-stage pipelines. The floating adder and auxiliary memory (table memory) are two stage. The arithmetic pipelines, unlike the memory pipelines, require the program to push preceding operations through the pipe each instruction cycle for the data to advance.

The AP-120B interfaces to the host computer (in our case a DEC VAX 11/750) through two (or possibly three) channels, as shown in Figure 3. The virtual front panel is used by the host to control the AP-120B. It does not exist in reality but is a set of registers (in our case on the VAX UNIBUS) which can be examined and set by the host computer. The UNIX device driver for the AP uses these registers to start and stop the AP, to examine and deposit into memories and registers, and to initiate direct memory accesses (DMAs) over the UNIBUS. Any data manipulation done using the front panel is necessarily very slow. Thus data and program instructions are normally transferred between the host and AP via the DMA process, which can occur at approximately a megabyte per second. Owing to the difference in word lengths between the usual host floating-point representation (32 bits) and the AP representation (38 bits), floating-point numbers are converted "on-the-fly" during the DMA process by the AP's interface hardware. To preserve the full precision of the AP's representation requires using the much slower front panel or DMAing out the data in successive passes. Preserving full precision is important if numbers are to be later reloaded back into the AP and the calculation continued, as the result will not be the same as would be obtained by doing the same uninterrupted calculation in the AP if the numbers are rounded to 32 bits. The AP has a much larger dynamic range (10^-153 to 10^+153) than the usual host representation (10^-36 to 10^+36), and the only way in which these larger floating-point numbers can be extracted from the AP-120B preserving the full 38 bits of precision is to use the front panel. However, if the floating-point numbers are within range of the host's floating-point representation, precision can be preserved by DMAing out the data in multiple passes.31 First, the actual data is DMAed out, undergoing convergent rounding to 32 bits. Next the data is DMAed back to the AP and subtracted from its original value in the AP. The result of this subtraction, the remainder, is then DMAed to the host. These two arrays (the rounded data and remainder) can then be used to reconstruct the original value in the AP by DMAing them into the AP separately and adding them together in the AP. This requires the allocation of scratch areas in the AP's memory.
Figure 3. AP-120B host interface and internal connections. The various data paths connecting the individual functional units internal to the AP-120B are shown in the right portion of the figure. On the left are shown the data paths that interface the AP to the host computer.
The DMA channel is also a bottleneck when large volumes of data are to be transferred to and from the host computer. The AP can use approximately 80% of the bandwidth of the UNIBUS. A typical transfer rate is approximately 0.8 MB/s (megabytes per second). Particularly operating-system dependent is the severity of overhead paid for initiating each transfer. On VAXs, the poor hardware design of the UNIBUS adapter also lowers the bandwidth. However, the throughput can be maximized by using the asynchronous capability of the DMA channel. A third, optional DMA channel, known as an IOP, operates in a manner similar to the standard DMA channel except that it has less general format-conversion capabilities and lacks a sophisticated handshaking protocol, which may make it inadequate for some applications.

Another important architectural feature of the AP-120B is that all main-data memory locations are 38 bits and designed to accommodate a single floating-point number. While it is quite possible (and easy) to store a single 16-bit integer in each 38-bit location, in many instances it would be more efficient to pack more than one integer in each floating-point word. This would provide savings not only in the amount of memory required but also in execution time by cutting down the number of memory accesses required to retrieve the data. Although the 38-bit word, as shown in Figure 4, consists of three different fields, a 16-bit low mantissa, a 12-bit high mantissa, and a 10-bit exponent, not all of these bits are accessible to an AP-120B assembly language program. All 16 bits of the low mantissa and all 10 bits of the exponent can be placed into the S-pad registers, but only 8 of the 12 bits of the high mantissa (the so-called table lookup bits) can be accessed by the AP-120B program. Furthermore, there is no way to write these integers back into the packed word from an AP program and these packed integers cannot be loaded by the DMA channel into the AP-120B, but must be loaded using the front panel which can examine and deposit all the bits of the low mantissa, high mantissa, and exponent, separately.

[Figure 4 layout: EXPONENT (10 bits), HIGH MANTISSA (12 bits), LOW MANTISSA (16 bits); beneath the fields, in order, the instructions LDSPE, LDSPT, LDSPI.]
Figure 4. AP-120B floating-point representation. A main data memory word is divided into three parts: a 16-bit low mantissa field, a 12-bit high mantissa and a 10-bit exponent. Below are shown the various AP instructions that can be used to read and treat these fields as three variable length integers. Note that some of the bits are inaccessible in this manner.

III. NEWTON

Our tool for molecular mechanics, Newton, consists of a variety of hardware processors and a generalized software package which will be described in this section. The software portion of Newton consists of 12,000 lines of C code and 7000 lines of AP code. It is designed to be a modular and generalized approach, applicable to systems from rare gases and diatomics to polypeptides, proteins, and nucleic acids in solution. While Newton can be used for Monte Carlo and energy minimization, its major use to date has been for molecular-dynamics based calculations, and it is these that we will describe in more detail. The array processor, as shown in Figure 3, is connected to a VAX 11/750 as the host processor. The second DMA port can be connected to a PDP 11/34 which serves as the host to a three-dimensional graphical-display system, an Evans and Sutherland Picture System. All four of these processors can be activated simultaneously to allow the computation of molecular dynamics (and derived properties such as spectra and thermodynamics) and the real-time display of the atomic motions or derived properties on the picture system as they are calculated.

We know of four other applications of aspects of molecular mechanics which have been made using array processors. Pottle et al.32 describe a package for energy minimization of proteins. Although the problems of force evaluation are similar, the approaches chosen are different. Newton is not confined to any one set of potential surfaces, such as the electrostatic plus Lennard-Jones potential surfaces used by Pottle, but is capable of implementing essentially arbitrary potential surfaces, as will be seen below. However, we have also paid the price for this generality. While their inner loop code operates using 80% of the theoretical maximum floating-point speed of the array processor, we average somewhat less than 50%. This
is primarily due to the extensive logic and indexing necessary to handle the general case. Much of this code is overlapped with the actual floating-point computation (since the logic consists of integer operations). With a conventional computer, the integer operations would have to be done in serial with the floating-point operations. This group has also implemented molecular dynamics calculations for water33 and Monte Carlo calculations for liquid ammonia34 and for liquid methane.35 In addition, Andersen31 and Swope et al.36 have implemented molecular-dynamics code for water and atomic solutes in water, and Berne and co-workers37 have carried out Monte Carlo calculations, both on an AP-120B. Dammkoehler and co-workers38 are implementing molecular-mechanics code on multiple CSP39 array processors. Also of related interest are the crystallographic work of Furey et al.40 and the simulations of plasma dynamics by the UCLA group,41 both on FPS AP-120Bs.

Newton contains seven separate host programs which operate cooperatively as shown in Figure 5. The first of these is APCOM which is an interactive command interpreter. It allows the user to type in various commands, validates the commands and, where appropriate, instructs the next link in the chain, APRUN, to perform the operation requested. APRUN is the program responsible for running the AP, and its only function is starting and stopping the AP and emptying the data-collection buffers while the AP is running. It uses two other programs, APLOAD to load the AP initially with constants and parameters, and APTABLES to calculate the force lookup tables for the intermolecular interactions whenever a new system is simulated. DRAW is an optional process that can be used to display the dynamics on the picture system as they are being calculated on the AP. MOVIE can be used to display the previously calculated dynamics of a system using coordinates saved on a file by APRUN. It also allows color movies to be made from these files. DODATA is a program which allows the computation in background mode of a series of runs to collect data over an ensemble of initial starting configurations and momenta. It is interruptible to allow users to carry out program testing and development, and systems tasks such as disk backup, without interfering with the runs.

[Figure 5: block diagram of the Newton host programs on the VAX 11/750, the AP, and the PDP 11/34 with the E&S Picture System.]
Figure 5. Overview of the molecular mechanics software in Newton. The modules on the left are C program packages that execute cooperatively on the VAX host computer. Their intercommunication and interface to the AP are shown. On the right is an optional interface from the AP to another host computer, a DEC PDP 11/34, to which an Evans and Sutherland Picture System is attached. The modules on the right are C program packages that run on the PDP 11/34.

All of the actual code involved in molecular dynamics calculation has been written in AP-120B assembly language and split into independent modules, as shown in Figure 6. Other AP modules not described here provide for other types of molecular mechanics and for the calculation of phenomena derived from molecular dynamics, such as infrared,42,43 Raman,44-46 and electronic47,48 spectra and thermodynamic quantities.49 The modules are then linked together using a simple vector-function chainer program that loops over the routines to perform the number of integration steps requested. Data is buffered inside the AP-120B and DMAed out asynchronously when the buffers are filled, while the AP is running. The
buffering mechanism will be examined later in more detail.

The remainder of this section concentrates on our approach to molecular mechanics. The implementation of molecular dynamics consists of two basic steps, force evaluation followed by numerical integration. It is these two functions which are performed by the AP modules shown in Figure 6, and the methodology behind their operation and design will be discussed. We examine the types of lists and indices needed for a general purpose molecular mechanics package. We also examine various types of boundary conditions which are important in many applications, especially those with long-range forces.

[Figure 6 diagram: loop of modules with brief descriptions. Intermolecular calculator: nonbonded forces (variants include truncated-octahedral periodic); two-body force calculator: bond stretching; three-body force calculator: bond angle bending and cross terms; four-body force calculator: torsional forces; integrator: modified Verlet; data collection: write to disk.]

Figure 6. Newton AP modules for molecular dynamics. The various hand-coded AP-120B assembly language modules are shown in the loop in which they are executed. The entire loop is executed in the AP itself with no intervention required from the host. Brief descriptions of the modules are given on the right.

How to Describe an Atom

In addition to initial coordinates and momenta, other information is needed to create the wide variety of lists necessary to look up masses, positions, and forces among atoms. These can all be derived from a basic set of information:

    struct {
        char flags;
        char type;
        int parent;
        int param;
    } parts;

where the above is the C language50 template for a structure describing the atom parts. The first item in the structure is the atom flags. These flags can be used to enable or disable particular features. For example, one flag bit can be set to fix the atom in space so that it cannot move (although it will still exert forces on neighboring atoms). Another is used to specify that the atom is the start of a molecule. If this flag is set then the param part contains the total number of atoms that follow which are in the same molecule. Another flag bit can be set to indicate that the atom is part of a ring structure. In this case, the param element contains the other parent of this atom necessary to complete the ring structure. Another flag is used to specify if the atom is to be drawn in the picture display or not. Currently, these four flags are the only ones used, although eight such flags are possible for future uses.

Another part of the structure is the type of the atom, a number between 0 and 255. For example, for water the atom types are hydrogen and oxygen. In organic molecules or ionic solutions, it is often necessary to distinguish between different types of multivalent species (such as carbon) or different ionic states, and thus each different chemical state of a particular element will have a different type number.

The parent element specifies to which atom the present atom is connected. Using this tree-linking mechanism, the entire bond structure of any non-ring molecule can be determined. With the addition of the ring flag and the extra link provided by the param word, any reasonable chemical structure can be handled.

The atom flag, type, and parent are stored as the exponent, high mantissa, and low mantissa, respectively, of an AP-120B writable table memory location. The atom param word occupies the low mantissa of a main-data memory word whose other fields are currently unused.

Solely from this simple information, all the other lists needed by the AP-120B modules described below can be generated. The atom parts, atomic coordinates, and momenta are the only pieces of information kept in Newton fill files which are used to start or reload a Newton run.
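The atom description above lends itself to a few bit masks and a walk over the parent links. The following is a minimal sketch in C, not the Newton source: the bit positions of the four flags and the names `ATOM_*`, `struct atom_parts`, `struct bond`, and `build_bond_list` are assumptions made for illustration, and the packing into AP-120B table-memory fields is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit assignments: the article names four flags
 * but does not give their bit positions. */
enum {
    ATOM_FIXED     = 1 << 0, /* atom is held fixed in space */
    ATOM_MOL_START = 1 << 1, /* first atom of a molecule; param = atoms that follow */
    ATOM_RING      = 1 << 2, /* atom closes a ring; param = its second parent */
    ATOM_DRAW      = 1 << 3  /* atom is shown in the picture display */
};

struct atom_parts {
    unsigned char flags;
    unsigned char type;   /* chemical type, 0..255 */
    int parent;           /* atom this one is bonded to (tree link) */
    int param;            /* meaning depends on flags, see above */
};

struct bond { int a, b; };

/* Derive the bond list from the parent links and ring closures.
 * Writes at most max_bonds entries to out and returns the number
 * of bonds found. Assumes a ring closure is never flagged on the
 * starting atom of a molecule, since param would be ambiguous. */
size_t build_bond_list(const struct atom_parts *atoms, size_t n_atoms,
                       struct bond *out, size_t max_bonds)
{
    size_t n = 0;
    for (size_t i = 0; i < n_atoms; i++) {
        const struct atom_parts *p = &atoms[i];
        if (p->flags & ATOM_MOL_START)
            continue;                       /* root of the tree: no parent bond */

        assert(p->parent >= 0 && (size_t)p->parent < n_atoms);
        if (n < max_bonds) {
            out[n].a = p->parent;
            out[n].b = (int)i;
        }
        n++;

        if (p->flags & ATOM_RING) {         /* extra link that closes the ring */
            assert(p->param >= 0 && (size_t)p->param < n_atoms);
            if (n < max_bonds) {
                out[n].a = p->param;
                out[n].b = (int)i;
            }
            n++;
        }
    }
    return n;
}
```

For a three-membered ring stored as atoms 0, 1, 2, where atom 0 starts the molecule, atom 1 has parent 0, and atom 2 has parent 1 with the ring flag set and `param` equal to 0, the function reports the three bonds 0-1, 1-2, and 0-2.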
Boundary Conditions

Two of the AP modules (the intermolecular force evaluator and the integrator) depend on the type of boundary conditions being used. Currently, we have programs that allow the use of four types of boundary conditions: soft walls, hard walls, minimum-image cubic periodic, and minimum-image truncated-octahedral periodic.51 Other possible boundary schemes7 include spherical and periodic boundary conditions using Ewald sums.

Cubic soft walls are the simplest. No imaging is done, and thus particles feel only the forces of the other particles in the cube. When a particle approaches a wall, a soft spring force pushes the particle back into the cube. This type of boundary condition is useful for studying small clusters or droplets of particles. The disadvantage is that a high fraction of the particles can be on the surface of the droplet or cluster and, in many situations, these surface effects can be important. Consequently, a larger number of particles are needed to study bulk phenomena. In addition, collisions with a wall can give a molecule a large, artificial, angular momentum as it is shoved back toward the cluster.

Hard walls are specularly reflecting and cause problems with integration as the velocity of the particle normal to the wall reverses itself in one integration time step. Such a discontinuous change can cause integration algorithms to "blow up." This can be avoided by altering the algorithm so as to alter the past time history of the particle (as far back as necessitated by the algorithm) when it strikes the wall to become that of a particle having entered the box with the reversed normal velocity. The problems with surface effects still remain.

Periodic boundary conditions7 are commonly used to reduce surface effects for the simulation of bulk matter. The simplest is a cubic minimum image. In this scheme, the system of particles resides in a central cube which is surrounded by exact replicas of this central cube on all sides, edges, and corners. Particles interact only with the closest image of any other particle. In all cubic periodic boundary algorithms, when a particle leaves the central box it is replaced by one of its images entering from the opposite side.

A truncated octahedral boundary condition51 is similar except that the unit cell is a truncated octahedron which more closely resembles a sphere. This is important for minimum-image boundary conditions as the forces must be smoothly feathered to zero at the radius of the inscribed sphere of the unit cell to avoid abrupt changes in force and loss of energy conservation. For a cube, 48% of the volume of the cube lies outside the inscribed sphere in the corners of the cube, and particles in this volume do not contribute to the forces on the test atom. For the truncated octahedral geometry, only 4.5% of the unit-cell volume lies outside the inscribed sphere, and thus more of the dynamics calculation is effectively used. In addition, the excluded volume is more evenly distributed in angle than for the cube, and the isotropy of space is thus less distorted. There exists an easy way51 to code algorithms to implement the truncated octahedron. The number of possible space-filling solid tessellations is small. Out of the regular and Archimedean polyhedra there are only five which are space filling: the cube, triangular prism, hexagonal prism, rhombic dodecahedron, and truncated octahedron, and thus the natural alternative to the truncated octahedron would be the rhombic dodecahedron.

Intermolecular Force Evaluation

The first module of AP code is the intermolecular-force evaluator, used to compute nonbonded forces; i.e., those between atoms on different molecules, or those separated by so many bonds in a single molecule as to be considered independent. As pointed out above, it comes at present in four flavors: soft and hard walls, cubic periodic, and truncated-octahedral periodic-boundary conditions. We approximate all intermolecular forces as purely pairwise additive; thus

$$V = \sum_{i<j}^{\text{nonbonded}} V(r_{ij}) \tag{1}$$

in which $r_{ij}$ is the distance between the ith and jth atoms. The potential function $V(r_{ij})$ depends solely on the chemical type of the atoms involved. Currently, intermolecular forces are evaluated by looping over all the possible pairs of atoms in the system, simultaneously calculating the force for both members of the pair. This requires logic that allows the intermolecular force evaluator to skip over all the pairs of atoms whose forces are to be calculated by one of the other bonded force routines.

To skip over the appropriate bonded interactions the parent (and, if the ring flag is set, the param word) is used to determine bonding up to and including four-body interactions which causes the intermolecular force evaluator to skip over these interactions. While this involves extensive
integer arithmetic and logic, the overhead is inconsequential as it is overlapped completely with the code to do the minimum imaging for the periodic boundary conditions. Earlier experimental versions of the intermolecular module did not carry out this logic and instead calculated intermolecular forces for all pairs of atoms whether bonded or not. The bonded force evaluators then simply subtracted out these erroneous intermolecular forces when they calculated the intramolecular forces. This addition and subtraction caused disastrous results due to numerical round-off caused by adding the relatively small intramolecular forces to the large erroneous intermolecular forces that had not yet been subtracted out.

Andersen31 has pointed out that when using a smoothing function for potential energy, the correct force evaluation involves calculating both the force and potential energy, as can be seen from

$$V_s(r) = V(r)\,S(r) \tag{2}$$

$$-\frac{dV_s(r)}{dr} = -\frac{dV(r)}{dr}\,S(r) - V(r)\,\frac{dS(r)}{dr} \tag{3}$$

where $V_s(r)$ is the $V(r)$ potential smoothed to zero by the smoothing function $S(r)$.

Currently, all intermolecular forces are calculated by table look up. This involves allocating most of main-data memory to the force look-up tables. Linear interpolation of these grid points is used to give the actual forces. At least in principle, as has been pointed out by Andrea, Swope, and Andersen, this scheme has a flaw when used with the Verlet integration algorithm due to the infinite second derivative of the force at the boundary between the linear segments. Andersen uses a better scheme employing a polynomial to fit fixed length segments of the potential curve with each polynomial joining smoothly and continuously out to several derivatives at the end points of the adjoining segments. This involves less data storage to calculate the intermolecular forces as only the polynomial coefficients need be stored. In addition, since polynomials are used, the potential energy as well as the forces can be calculated with little extra effort from the same set of coefficients.

For large systems, most of the computational time is spent in intermolecular force evaluation, since when done on a pairwise basis for the entire system of N atoms, it becomes a calculation proportional to $N^2$, while the bonded intramolecular calculations only scale as N. A possible alternative is to use neighbor lists39,54 which are only updated after several time steps so that only the atoms which are close to the atom in question are scanned to calculate the intermolecular forces. However, neighbor lists create a considerable storage problem in that each atom must have a list of all the other atoms which are near it. For a relatively large system this list could easily exceed the amount of main data memory available. For small systems, the effort involved in updating and indexing the interactions from such a list may exceed the effort to do all the pairwise calculations.

Two-Body Module

The two-body module calculates all simple bonded forces in diatomics such as CO and N2. The force between two bonded atoms i and j with coordinates $\mathbf{r}_i$ and $\mathbf{r}_j$, respectively, is

$$\mathbf{F}_i = -\nabla_i V(r_{ij}) = -\mathbf{F}_j \tag{4}$$

in which $\mathbf{F}_i$ and $\mathbf{F}_j$ are the vector forces on the ith and jth atoms, respectively, which are separated by the distance $r_{ij} = |\mathbf{r}_i - \mathbf{r}_j|$. $\nabla_i$ is the gradient with respect to the Cartesian coordinates of atom i as expressed in eq. (A1) of the Appendix.

The two-body program uses a list as shown in Figure 7. The low mantissa and exponent of the first word in the list are used to index the two atoms involved. The atom number of the first atom (stored in the low mantissa) is subtracted from the atom number of the second before it is stored in the exponent field. This allows us to use the fields, such as the exponent, to index an atom number which in principle can be much larger in magnitude than the bit field could normally handle. Since atoms that are bonded to each other tend also to be close to one another, the atom numbers are very close in magnitude, and thus the smaller bit fields are big enough to allow this relative indexing of atoms. Code common to all two-body force evaluations is used to fetch the atomic coordinates and calculate the internuclear vector. The potential index, or switch parameter, is then used to pick among a variety of force calculators (such as harmonic, Morse, etc.) that will calculate, given the internuclear vector, the scalar force along the bond. The low mantissa of the second word in the list is the address of the force constants for the force evaluator to use. In this way, for example, one routine can be used to evaluate all harmonic forces.
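The relative indexing used in the two-body list can be illustrated with ordinary integer bit fields. The sketch below is an illustration under stated assumptions, not the AP-120B format itself: the 8-bit potential index and the 10-bit difference field follow the widths given in the caption of Figure 7, while the 16-bit width assumed for the low mantissa, the placement of the fields inside a 64-bit host word, and the names `pack_two_body` and `unpack_two_body` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout inside one 64-bit host word:
 *   bits  0..15  "low mantissa":  first atom number times three
 *   bits 16..23  "high mantissa": potential index (8 bits)
 *   bits 24..33  "exponent":      j - i as a signed 10-bit value */
#define LOW_BITS   16
#define HIGH_BITS   8
#define EXP_BITS   10
#define LOW_MASK   ((1u << LOW_BITS) - 1u)
#define HIGH_MASK  ((1u << HIGH_BITS) - 1u)
#define EXP_MASK   ((1u << EXP_BITS) - 1u)
#define EXP_SIGN   (1u << (EXP_BITS - 1))

/* Pack atoms i and j and a potential index into one list word.
 * Only the difference j - i has to fit in the small field. */
uint64_t pack_two_body(int i, int j, int potential)
{
    int i3 = 3 * i;       /* pre-multiplied for indexing x, y, z arrays */
    int diff = j - i;
    assert(i3 >= 0 && (unsigned)i3 <= LOW_MASK);
    assert(potential >= 0 && (unsigned)potential <= HIGH_MASK);
    assert(diff >= -(int)EXP_SIGN && diff < (int)EXP_SIGN);
    return (uint64_t)i3
         | ((uint64_t)potential << LOW_BITS)
         | ((uint64_t)((unsigned)diff & EXP_MASK) << (LOW_BITS + HIGH_BITS));
}

/* Recover both atom numbers and the potential index. */
void unpack_two_body(uint64_t w, int *i, int *j, int *potential)
{
    int i3 = (int)(w & LOW_MASK);
    int raw = (int)((w >> (LOW_BITS + HIGH_BITS)) & EXP_MASK);
    if ((unsigned)raw & EXP_SIGN)
        raw -= (int)(EXP_MASK + 1u);   /* sign-extend the 10-bit field */
    *i = i3 / 3;
    *j = *i + raw;
    *potential = (int)((w >> LOW_BITS) & HIGH_MASK);
}
```

With this layout the second atom number can be far larger than a 10-bit field could hold on its own, provided it lies within 511 above or 512 below the first atom number, which is the property the text relies on for bonded neighbors.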
The common main line code then decomposes the force along the space-fixed x, y, and z axes. This module is only used for diatomics. For more complex molecules, it is more efficient to have the three-body module also calculate the two-body forces.

[Figure 7 diagram: three panels labeled TWO BODY INTERACTION, THREE BODY INTERACTION, and FOUR BODY INTERACTION, each showing the word layout of a list entry with fields for the atom indices, the POTENTIAL INDEX, and the FORCE CONSTANT POINTER; one field is marked UNUSED.]

Figure 7. Format of interaction lists. The top panel shows the format of the two-body interaction list as stored in AP main data memory. The low mantissa of the first word holds the ith atom number times three (for faster indexing into the three-dimensional arrays). The high mantissa is an 8-bit integer specifying which force calculator to use (harmonic, Morse, etc.). The exponent field contains the difference between the jth and ith atom numbers. This is done to allow more dynamic range in the 10-bit field. The second word contains the address of the force constant. The switch parameter or potential index is used to specify which force evaluation routine is to be used for this interaction. The three-body list, in the middle panel, is similar except that the middle atom is used to index the other two. The four-body list, as shown in the bottom panel, is also similar except for a constant offset which is subtracted from the force constant pointer to allow more dynamic range.

Three-Body Module

Three-body interactions are those whose calculation depends on the coordinates of three particles, $\mathbf{r}_i$, $\mathbf{r}_j$, and $\mathbf{r}_k$. An example of this is a force due to bond angle bending which requires the three atoms involved to be specified in order to calculate the bond angle $\theta$. The three-body module uses a list similar to the two-body module, as shown in Figure 7. The low mantissa of the first word contains the index to the middle atom in the three-body interaction. The exponents of the first and second words contain the atom numbers of the other two atoms after subtracting the atom number of the middle atom. The high mantissa of the first word selects which three-body force evaluation routine is to be used. Currently we have two such evaluators, a complex one55 for water molecules, and a simpler, harmonic one

$$V(\delta r_a, \delta r_b, \delta\theta) = k_0(\delta r_a)^2 + k_1(\delta r_b)^2 + k_2(\delta\theta)^2 \tag{5}$$

which is written in terms of the bond vectors $\mathbf{r}_a$ and $\mathbf{r}_b$ where

$$\delta r_a = |\mathbf{r}_i - \mathbf{r}_j| - r_a^e = |\mathbf{r}_a| - r_a^e \tag{6}$$

$$\delta r_b = |\mathbf{r}_k - \mathbf{r}_j| - r_b^e = |\mathbf{r}_b| - r_b^e \tag{7}$$

and $\delta\theta = \theta - \theta^e$, in which $r_a^e$, $r_b^e$, and $\theta^e$ are the equilibrium bond distances and angles, respectively, of the potential V. The two-body part of the potential is shown in the above expression as it is also calculated in this module as explained above. The water55 force evaluator has various higher-order terms among $\delta r_a$, $\delta r_b$, and $\delta\theta$ in addition to the terms shown in eq. (5). The low mantissa of the second word contains the address of the force constants ($k_0$, $k_1$, and $k_2$) used by the force evaluators.

Common code is provided by the three-body module to calculate the two internuclear vectors and the bond angle before calling the appropriate three-body force evaluator. The evaluator returns scalar forces along the two internuclear vectors and a force associated with the bond angle, and the three-body module resolves these forces into Cartesian forces on the three atoms involved. The forces on the atoms are given by

$$\mathbf{F}_j = -\nabla_j V$$
where we have used eqs. (A9) and (A10) of the Appendix to evaluate the chain-rule gradients for the tensor-vector product appearing in eqs. (8)-(10). Similarly, using the results of the Appendix, the gradients involving the bond angle $\theta$ in eqs. (8)-(10) can be expanded in terms of gradients involving the bond vectors $\mathbf{r}_a$ and $\mathbf{r}_b$,

$$\nabla_i\theta = \nabla_a\theta\,\nabla_i\mathbf{r}_a = \nabla_a\theta \tag{11}$$

$$\nabla_j\theta = \nabla_a\theta\,\nabla_j\mathbf{r}_a + \nabla_b\theta\,\nabla_j\mathbf{r}_b = -\nabla_a\theta - \nabla_b\theta \tag{12}$$

$$\nabla_k\theta = \nabla_b\theta\,\nabla_k\mathbf{r}_b = \nabla_b\theta \tag{13}$$

Since the bond angle can be written in terms of the dot product of the two bond vectors

$$\cos\theta = \frac{\mathbf{r}_a\cdot\mathbf{r}_b}{r_a r_b} \tag{14}$$

the terms in eqs. (11)-(13) can be evaluated as

$$\nabla_a\theta = \frac{-1}{\sin\theta}\,\nabla_a\cos\theta \tag{15}$$

Using eqs. (14) and (15)

$$\nabla_a\cos\theta = \frac{\nabla_a(\mathbf{r}_a\cdot\mathbf{r}_b)}{r_a r_b} - \frac{(\mathbf{r}_a\cdot\mathbf{r}_b)\,\nabla_a(r_a r_b)}{(r_a r_b)^2} \tag{16}$$

where

$$\nabla_a(\mathbf{r}_a\cdot\mathbf{r}_b) = \mathbf{r}_b \tag{17}$$

and

$$\nabla_a(r_a r_b) = (r_b/r_a)\,\mathbf{r}_a \tag{18}$$

Substituting eqs. (17) and (18) into eq. (16) yields

$$\nabla_a\cos\theta = \frac{\mathbf{r}_b}{r_a r_b} - \frac{(\mathbf{r}_a\cdot\mathbf{r}_b)\,\mathbf{r}_a}{r_a^3 r_b} \tag{19}$$

Now, by using eq. (14) with eq. (19), we have

$$\nabla_a\cos\theta = \frac{\mathbf{r}_b}{r_a r_b} - \frac{\mathbf{r}_a\cos\theta}{r_a^2} \tag{20}$$

A similar derivation for the gradient with respect to the other bond vector $\mathbf{r}_b$ yields

$$\nabla_b\cos\theta = \frac{\mathbf{r}_a}{r_a r_b} - \frac{\mathbf{r}_b\cos\theta}{r_b^2} \tag{21}$$

Finally, substituting eqs. (20) and (21) into eq. (15) we have

$$\nabla_a\theta = \frac{-1}{\sin\theta}\left(\frac{\mathbf{r}_b}{r_a r_b} - \frac{\mathbf{r}_a\cos\theta}{r_a^2}\right) \tag{22}$$

$$\nabla_b\theta = \frac{-1}{\sin\theta}\left(\frac{\mathbf{r}_a}{r_a r_b} - \frac{\mathbf{r}_b\cos\theta}{r_b^2}\right) \tag{23}$$

Thus, using eqs. (8)-(13) and eqs. (22) and (23), the forces can be appropriately resolved onto the three atoms involved. These results are the same as those arrived at using the Eliaschevich and Wilson s-vector method to evaluate the elements of the B matrix used in normal-mode vibrational analysis to relate the internal and Cartesian coordinates through a Taylor's expansion.

Four-Body Module

A four-body interaction requires the knowledge of the positions of four particles to calculate the force. The two most common examples are torsional forces and out-of-plane bending forces.56,57 The four-body module for torsional forces uses a list as shown in Figure 7. The atom numbers of the two inner atoms are stored in the low-mantissa field of the first and second words, with the jth atom number multiplied by three for indexing convenience. The inner-atom index numbers are first subtracted from the closer outer-atom number and then stored in the exponent field of the two words. The high mantissa of the first word is used to store the potential index value to select a particular torsional force evaluator with the force constants being indexed by the value in the high-mantissa field of the second word. Since a complete memory address cannot be stored here, the value in the high mantissa of the second word is an offset to the base address of the torsional force constants. Currently, there are three types of torsional-force evaluators which handle single, double, and triple bonds.

Common code is used to calculate all the internuclear vectors and the torsional angle. A scalar force is returned which is solely dependent on the torsional angle. This force is then decomposed onto the atoms as follows. If we take a four-body interaction as shown in the top half of Figure 8, each of the four atoms i, j, k, and l has coordinates represented by the vectors $\mathbf{r}_i$, $\mathbf{r}_j$, $\mathbf{r}_k$, and $\mathbf{r}_l$. If we now define the bond vectors

$$\mathbf{r}_1 = \mathbf{r}_i - \mathbf{r}_j, \quad \mathbf{r}_2 = \mathbf{r}_k - \mathbf{r}_j, \quad \mathbf{r}_3 = \mathbf{r}_l - \mathbf{r}_k \tag{24}$$

and look down the j-k bond, then we have the representation shown in the bottom half of Figure 8 where $\mathbf{r}_a$ and $\mathbf{r}_b$ are the projection of $\mathbf{r}_1$ and $\mathbf{r}_3$ into a plane and the torsional angle $\phi$ is formed between them.

Figure 8. Calculation of torsional angle. The top half of the figure shows the atoms and vector conventions used in the calculation of the torsional angle. The bottom half of the figure shows the projection of the bonds in a plane when viewed down the center bond.

We can now write

$$\mathbf{r}_a = \mathbf{r}_1 \times \mathbf{r}_2, \quad r_a = |\mathbf{r}_a| \tag{25}$$

$$\mathbf{r}_b = \mathbf{r}_3 \times \mathbf{r}_2, \quad r_b = |\mathbf{r}_b| \tag{26}$$

and

$$\cos\phi = \frac{\mathbf{r}_a\cdot\mathbf{r}_b}{r_a r_b} \tag{27}$$

The force on each atom, using eqs. (A9) and (A10) from the Appendix, is given by

$$-\nabla_i V = -\nabla_1 V\,\nabla_i\mathbf{r}_1 = -\nabla_1 V \tag{28}$$

$$-\nabla_j V = -\nabla_1 V\,\nabla_j\mathbf{r}_1 - \nabla_2 V\,\nabla_j\mathbf{r}_2 = \nabla_1 V + \nabla_2 V \tag{29}$$

$$-\nabla_k V = -\nabla_2 V\,\nabla_k\mathbf{r}_2 - \nabla_3 V\,\nabla_k\mathbf{r}_3 = -\nabla_2 V + \nabla_3 V \tag{30}$$

$$-\nabla_l V = -\nabla_3 V\,\nabla_l\mathbf{r}_3 = -\nabla_3 V \tag{31}$$

Since the potentials used are solely functions of the torsional angle

$$V = V(\cos\phi) \tag{32}$$

each of the terms of eqs. (28)-(31) can be evaluated as

$$\nabla_\xi V = \frac{dV(\cos\phi)}{d\cos\phi}\,\nabla_\xi\cos\phi, \quad \xi = 1, 2, 3 \tag{33}$$

However, following the format of eqs. (A4)-(A7) of the Appendix, we can expand the gradients of $\cos\phi$ in terms of the projections $\mathbf{r}_a$ and $\mathbf{r}_b$ to give

$$\nabla_\xi\cos\phi = \nabla_a\cos\phi\,\nabla_\xi\mathbf{r}_a + \nabla_b\cos\phi\,\nabla_\xi\mathbf{r}_b, \quad \xi = 1, 2, 3 \tag{34}$$

From eqs. (20) and (21), we know the gradients of the torsional angle are given by

$$\nabla_a\cos\phi = \frac{\mathbf{r}_b}{r_a r_b} - \frac{\mathbf{r}_a\cos\phi}{r_a^2} \tag{35}$$

$$\nabla_b\cos\phi = \frac{\mathbf{r}_a}{r_a r_b} - \frac{\mathbf{r}_b\cos\phi}{r_b^2} \tag{36}$$

Using eqs. (35) and (36) with the vector tensor products of eq. (34), which are given by eqs. (A13)-(A15) of the Appendix, we have eqs. (37)-(39). Using eqs. (37)-(39) with eqs. (28)-(31), the forces on all four atoms can be resolved.

Integration Module

The second step in molecular dynamics is the numerical integration of Newton's second law for each atom i,

$$\frac{\mathbf{F}_i}{m_i} = \mathbf{a}_i = \frac{d^2\mathbf{r}_i}{dt^2} \tag{40}$$

where $m_i$ is the mass of the ith particle, $\mathbf{a}_i$ its acceleration, $\mathbf{r}_i$ its position, and t its time. As pointed out above, the integration module depends on the type of boundary condition in use, as this module also applies any position changes necessary to keep the atoms in the unit cell (in the case of periodic boundaries) or applies any restoring forces necessary (in the case of soft walls). The integration algorithm we use in each of these modules is the same, a version of the Verlet algorithm as discussed by Beeman58 with further modifications by Andersen.36 In our implementation, the vector difference in positions $\mathbf{d}_i(t+1)$ at time step $t+1$ is calculated from the previous difference $\mathbf{d}_i(t)$ and the force

$$\mathbf{d}_i(t+1) = \mathbf{d}_i(t) + h^2\,\frac{\mathbf{F}_i}{m_i} \tag{41}$$
in which t indicates the time step, and h is the size of the time step. Next, the new positions are calculated from the new difference in positions

$$\mathbf{r}_i(t+1) = \mathbf{r}_i(t) + \mathbf{d}_i(t+1) \tag{42}$$

The improvement made by Andersen is that now only the difference in position is stored rather than the velocity. Therefore, there is less round-off error as the numbers being added in each stage of the calculation are closer in magnitude. A crude forward difference velocity can easily be obtained by dividing the difference in position by the time step, or, if desired, more accurate velocities can be calculated.58

Beeman58 shows that higher-order integration techniques tend not to be as stable as the simple Verlet algorithm when larger time steps are used. He also shows that the Verlet algorithm conserves energy as well as other integrators tested in the large time-step limit. The advantage of using this simple technique with the array processor is that only a minimal amount of memory is set aside for the storage of positions and the past time history of the system, as compared with higher-order integration methods.

Data Collection Modules

The data collection routine used for a particular application is often very specialized and thus to analyze different properties, different AP modules must be written. However, MDOUT, the mechanism for buffering the data out to the host computer, is common to all routines once the property that is to be collected is calculated. In many cases, this property can be calculated in a vector function program.

Various data collection modules exist for use with Newton. An example is the module that calculates and collects dipole moments and polarizabilities for the system as a function of time in order to compute by linear response theory the infrared42,43 and Raman44-46 spectra. These modules operate in a very similar manner to the force evaluation modules in that for the most part they use a list to index the atoms involved in the data calculation and the relevant parameters (such as static and derivative dipole moments for molecules). MDOUT saves the data in two internal buffers in main data memory and then DMAs the data out under interrupt control using a double-buffering technique described in the next section. In general, the actual data is all that is saved from an individual Newton run. Normally, trajectories are not saved in that it is easier and faster to do the run over rather than saving and storing the details of the run.

IV. RESULTS AND ANALYSIS

As pointed out in Section II, the main advantage of an array processor is high computational speed at a relatively low price. In the way of benchmarks, we have obtained the following results presented in Table II. Column 1 shows the various computers on which the benchmarks are taken. Columns 2, 3, and 4 compare the speed of the AP-120B for three different molecular-mechanics packages. The one presented in column 2 is a direct C translation of our AP molecular dynamics package run on a VAX 11/780 and a VAX 11/750 without floating-point accelerators (FPAs). Column 3 compares our AP version against an optimized Fortran version due to Hagler59 and run on a VAX 11/780 with a floating-point accelerator. Column 4 compares some speeds obtained for Monte Carlo calculations from Chester et al.60 As can be seen from the table, our molecular-mechanics package is approximately 35 times faster than a VAX 11/780 with a floating-point accelerator and an optimized Fortran compiler. Thus a simulation that can be run in a week and a half on the AP-120B would take a year on a VAX even if the VAX were totally dedicated to that calculation.

Table II. Comparison of simulation times on various computers.

                   Molecular dynamics
                   In C,          In FORTRAN,
                   w/o FPA        w/FPA          Monte
    Computer       on VAX         on VAX         Carlo
    AP-120B        1              1              1
    VAX 11/780     80             35
    VAX 11/750     150
    PDP 11                                       300
    Prime 400                                    650
    IBM 370/168                                  1.5
    CDC 7600                                     0.43

Although the AP-120B has proven to be a very fast and economical machine for molecular mechanics and allows us to simulate systems which otherwise would not be feasible, it is far from ideal. One concern is the word length of the floating-point numbers. For example, for many quantum mechanical calculations, 32-bit floating-point representations are inadequate, whereas 64-bit
precision suffices. The question arises as to the adequacy of the 38 bits of the AP and if particular sections of algorithms can be painstakingly coded in the forced double precision possible on the AP to give adequate results. While most molecular mechanics is certainly adequately handled with 38 bits, one might do the integration module for molecular dynamics in double precision and leave the more computationally intensive force evaluation in single precision.

Another feature missing in the AP-120B is the ability to DMA data out of main data memory to the host processor preserving the full 38 bits of precision. This impacts the ability to do three things. The first is that we would like to be able to rapidly store intermediate results on disk so that the runs could be stopped and started at a later time without loss of precision in the data. A rapid method of writing such intermediate files would also facilitate periodic file dumps for restart capability if the host system were to crash. The second impacted area is the ability to rapidly load and retrieve integers packed into floating-point words. The amount of time spent in loading the AP through the virtual front panel with long packed lists, for example, which would be necessary if neighbor lists were used, could be extraordinarily burdensome. The third is that this limitation interferes with making the AP a rapidly sharable machine rather than an exclusive-use device as it is now. To make it truly sharable, all of the machine's memories and internal registers would have to be rapidly DMAed out to disk (swapped), and this is not possible with the present architecture.

In future versions of array processors, we would also like to see a separate integer memory that could be accessed faster. Lacking this, an instruction should be added that would allow the access of all the bits of an integer packed into a main data word. Since molecular dynamics is not usually memory bound and most code is parallel for each of the three coordinate axes, more than one floating-point adder and multiplier might be useful. In this way, calculations for each of the three axes could be carried out in parallel rather than serially. The serial approach, however, is often convenient, especially given the fact that the multiplier is a three-stage operation, but since the adder is only two stages it often disrupts any attempts to build the physical symmetry of the problem into the code. If such improvements were to be incorporated, speed enhancements by utilizing fast main data memory could more easily be realized.

Another obvious misfeature is the absence of integer multiplication which makes the addressing of multidimensional arrays difficult. Additional memory-address registers, so that more than one item could be fetched from different banks of main data memory in the same instruction, would replace the necessity of using the writable table memory for selective data storage, thus allowing all data to be accessible through the DMA channel, freeing table memory for seldom changed constants.

There is currently available from Floating Point Systems a 64-bit array processor, known as the FPS-164, which solves the problem of numerical accuracy, at least for most problems of chemical interest. It is, however, no faster (in fact somewhat slower) than the AP-120B and quite expensive. It also differs in that program memory and data memory occupy the same space, and thus the FORTRAN compiler approach, although no faster in speed, is more feasible due to the abundance of memory now available for its bulky code. This offers some advantages, for example, for quantum applications as large, previously developed program packages can be run in the AP using its FORTRAN compiler with only the inner loops being optimized as hand-coded AP routines. Studies of usual quantum packages indicate that only 500-1000 lines of code take up most of the execution time.61 In addition, Floating Point Systems now offers a newer and more highly optimized FORTRAN compiler for the FPS-164 that supposedly produces code that runs within 1.5 times the speed of hand-coded routines. These benchmarks have only been obtained on a small class of short routines (such as vector add and vector move) and it is not clear that they will hold up when a larger package of code is compiled.

Another improvement that would facilitate the use of array processors for computational chemists is a good higher-level language compiler. We would suggest the C language.50 For many reasons, C is more appropriate than FORTRAN for compiling into efficient AP-120B assembly language. For example, C allows the use of pointers to arrays as an alternative to subscripts. Incrementing and decrementing a pointer and stepping along memory can be much more efficiently handled in AP-120B assembly language than adding a subscript offset (which may need to be decremented if it does not start at 0, such as is the case with FORTRAN) to the array base address. In addition, C allows the declaration of variables as registers, thus allowing the programmer to warn the compiler
that a particular piece of data needs to be kept in data pad registers rather than written and reread from memory. We are optimistic that a good C compiler can be written that will produce AP-120B assembly language code good enough to make obsolete the desire to program in assembly language except for extremely critical loops which are executed too many times to tolerate any inefficiencies. The only such loop of this type in our molecular dynamics code is the intermolecular-force evaluation, which is less than 10% of the total AP code.
The actual generation of such a compiler is a difficult task. Since the AP is not of the von Neumann architecture, there is little expertise in this area of software design. Wilson62 has suggested a Monte Carlo method of code generation whereby, given certain rules and constraints, the AP itself would try to optimize its generated assembly code using a Monte Carlo technique, varying the code while preserving its logical outcome. In this case, code, not particles, would be randomly moved with the overall length of the code being minimized. His attempt at implementing such an optimizer also starts with C language source code.
All of the support code for Newton that runs on the host computer is written in C. This has allowed us to maintain a single set of source code files which are shared by many programmers and used to simulate systems dramatically different in nature. C is much more structured than FORTRAN and self-documenting in many cases, as it has superior readability. It discourages the use of "go to" statements, which have been described as a marvelous way to write impossible-to-understand programs. Although it is a higher-level language, it allows the programmer the freedom and degrees of manipulation of data found in most typical assembly languages. It is portable since there is no built-in I/O, and system dependencies are only in word lengths. Although C is very closely tied to the UNIX operating system, there are C compilers running on VAXs under VMS, on IBM 360s, and on various other machines. There is a portable C compiler that can be brought up on most machines with just a few months of work. The C compiler itself is written in C.
The advantages of the UNIX operating system63,64 should not be overlooked. The reason UNIX and C are so related is that all of the UNIX utilities and over 90% of the actual UNIX kernel are written in C. Thus, the UNIX operating system itself is portable. It is becoming a standard operating system for a wide variety of computers, so that we can (and have) moved both the UNIX operating system and Newton from one type of processor to another. Its debugging, editing, and friendliness to the user are superior, enhancing programmer productivity with ease of making and debugging changes to Newton.
The development of Newton would have been very difficult without the UNIX operating system environment. The operating system kernel and device drivers for UNIX are written in C, and thus are easily changed. We have hand-tailored the AP-120B driver to meet our needs. It allows the host computer and the AP-120B to operate asynchronously, coordinating efforts via interrupts. As the AP-120B fills up its data buffers, it sends an interrupt to the host which causes the host to empty the buffer from AP-120B memory to disk, while the AP-120B continues the calculation, filling up a second buffer. This means that there is no lost computation time by the AP-120B waiting for the host to empty the buffer and restart the calculation. Furthermore, this procedure allows the host program to be inactivated (and even swapped) without having to loop just to check on the state of the AP. This change has increased our data throughput by over a factor of 10 and allows the use of the AP-120B on a timesharing system with minimal impact to other users.
UNIX, a third generation operating system, is easy to learn. Our research group consists entirely of chemists, most of whom have little previous training in computer science and most of whom have had little difficulty in picking up the necessary skills to use the operating system on a sophisticated level. Since most have little or no experience in traditional computer languages such as FORTRAN, it is interesting to note that they can begin writing complicated C programs in much less time than it would have taken to gain the equivalent abilities in FORTRAN.

V. CONCLUSION

As computational chemists search for more computer power, others will surely turn to array processors as we have, as they provide, at the moment, by far the most computational power per hardware dollar, particularly since the cost is low enough that they can be dedicated full time to a particular task or class of tasks. While running on a supercomputer such as Cray-1 will result in more computation per hour of processor use, it is unlikely to result in as much computation per year. The reason is that the equivalent to 24 h/day of
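dedicated AP-120B time is, for example,3,24,25,60 2-8 h/day of Cray-1 time, a usage rate which few, if any, research groups are able to afford over the long run. Even if a group's budget were large enough to annually purchase this much supercomputer time, for the same cost several array processors could be purchased each year.
While array processor use is very appealing and the reward can be high, we believe our effort in bringing up a general purpose program package for molecular mechanics has also uncovered many of the pitfalls. That we can run in 10 days problems which would require a year of dedicated VAX 11/780 time allows us to handle problems in solution reaction and biomolecular dynamics which would not otherwise be feasible. However, the price we have paid is substantial. While molecular mechanics is straightforward in nature, it has taken over six man-years to develop efficient AP code to carry out the task.
An important feature of our code is its modularity. Since reprogramming is expensive, we have attempted to isolate the individual aspects of the calculation into individual AP modules. The generality of the program package allows us to simulate a wide variety of systems using essentially the same code. Past work includes the calculation from molecular dynamics and linear response theory of infrared,42,43 Raman,44-46 and electronic47,48 spectra in the gas phase and in liquid solution. In addition, we have computed the dynamics and rotational and vibrational spectra of alkanes (such as methane, ethane, cyclohexane, and their solutions), water (in both the gas and liquid phase as well as various N-mers of water molecules), and ions and microcrystals dissolved in water. We have computed the transient Raman and electronic absorption spectra during the course of a chemical reaction by computing the dynamics for the photodissociation of iodine in a solution of liquid [...] Other applications involve the computation from molecular dynamics of thermodynamic quantities and their quantum correction through spectral analysis of atomic-velocity time histories.49 Newton also incorporates a general set of protein potentials for biomolecules and is currently being applied to the molecular mechanics of polypeptides and membranes in collaboration with A. Hagler.

We thank the National Science Foundation, Chemistry, the Office of Naval Research, Chemistry, and the National Institutes of Health, Division of Research Resources, for the support which has made this work possible, and Don Mackay for his substantial work in obtaining many of the benchmarks presented here.

APPENDIX: CHAIN RULE FOR GRADIENTS

In the calculation and decomposition of forces, it is often convenient to switch from Cartesian to internal coordinate systems. This presents difficulties as the gradients must also undergo this transformation. In this appendix, we present formulas for converting gradients in one frame of reference to another.
The gradient with respect to the Cartesian coordinates of particle $i$, $\mathbf{r}_i = x_i\mathbf{i} + y_i\mathbf{j} + z_i\mathbf{k}$, is the vector operator given by
$$\nabla_i = \mathbf{i}\,\frac{\partial}{\partial x_i} + \mathbf{j}\,\frac{\partial}{\partial y_i} + \mathbf{k}\,\frac{\partial}{\partial z_i} \tag{A1}$$
The force on the $i$th particle $\mathbf{F}_i$ can be expressed using the operator in eq. (A1) as
$$\mathbf{F}_i = -\nabla_i V \tag{A2}$$
where $V$ is the potential energy. Commonly, however, the potential $V$ is more easily expressed as a function of some internal coordinate $\mathbf{r}_a$, where the internal coordinate is a function of the Cartesian coordinates, $\mathbf{r}_a = \mathbf{r}_a(\mathbf{r}_i)$. We would therefore like to convert the gradient in eq. (A2) into a gradient with respect to the internal coordinate $\mathbf{r}_a$. Using the chain rule, the following terms result.
$$\begin{aligned}\nabla_i V(\mathbf{r}_a) ={}& \mathbf{i}\left(\frac{\partial V}{\partial x_a}\frac{\partial x_a}{\partial x_i} + \frac{\partial V}{\partial y_a}\frac{\partial y_a}{\partial x_i} + \frac{\partial V}{\partial z_a}\frac{\partial z_a}{\partial x_i}\right) \\ &+ \mathbf{j}\left(\frac{\partial V}{\partial x_a}\frac{\partial x_a}{\partial y_i} + \frac{\partial V}{\partial y_a}\frac{\partial y_a}{\partial y_i} + \frac{\partial V}{\partial z_a}\frac{\partial z_a}{\partial y_i}\right) \\ &+ \mathbf{k}\left(\frac{\partial V}{\partial x_a}\frac{\partial x_a}{\partial z_i} + \frac{\partial V}{\partial y_a}\frac{\partial y_a}{\partial z_i} + \frac{\partial V}{\partial z_a}\frac{\partial z_a}{\partial z_i}\right)\end{aligned} \tag{A3}$$
Equation (A3) can be written in a more compact form as
$$\nabla_i V(\mathbf{r}_a) = \nabla_a V(\mathbf{r}_a)\,\nabla_i \mathbf{r}_a \tag{A4}$$
where $\nabla_a V(\mathbf{r}_a)$ is a vector with components
$$\nabla_a V(\mathbf{r}_a) = \left(\frac{\partial V}{\partial x_a},\ \frac{\partial V}{\partial y_a},\ \frac{\partial V}{\partial z_a}\right) \tag{A5}$$
and $\nabla_i \mathbf{r}_a$ is a tensor65 with components
$$\nabla_i \mathbf{r}_a = \begin{pmatrix} \dfrac{\partial x_a}{\partial x_i} & \dfrac{\partial x_a}{\partial y_i} & \dfrac{\partial x_a}{\partial z_i} \\[2ex] \dfrac{\partial y_a}{\partial x_i} & \dfrac{\partial y_a}{\partial y_i} & \dfrac{\partial y_a}{\partial z_i} \\[2ex] \dfrac{\partial z_a}{\partial x_i} & \dfrac{\partial z_a}{\partial y_i} & \dfrac{\partial z_a}{\partial z_i} \end{pmatrix} \tag{A6}$$
Thus by eqs. (A5) and (A6), the expression in eq. (A4) is actually the vector matrix product66
$$\nabla_i V(\mathbf{r}_a) = \left(\frac{\partial V}{\partial x_a}\ \ \frac{\partial V}{\partial y_a}\ \ \frac{\partial V}{\partial z_a}\right) \begin{pmatrix} \dfrac{\partial x_a}{\partial x_i} & \dfrac{\partial x_a}{\partial y_i} & \dfrac{\partial x_a}{\partial z_i} \\[2ex] \dfrac{\partial y_a}{\partial x_i} & \dfrac{\partial y_a}{\partial y_i} & \dfrac{\partial y_a}{\partial z_i} \\[2ex] \dfrac{\partial z_a}{\partial x_i} & \dfrac{\partial z_a}{\partial y_i} & \dfrac{\partial z_a}{\partial z_i} \end{pmatrix} \tag{A7}$$
which when expanded gives eq. (A3).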
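As a numerical illustration of eq. (A4), the following minimal Python sketch compares the vector matrix product with a central-difference gradient. It is only a sketch under stated assumptions: the quadratic potential, the sample vectors, and every function name are hypothetical choices made for this example and are not taken from Newton. The internal coordinate is the cross product $\mathbf{r}_a = \mathbf{r}_1 \times \mathbf{r}_2$ that the appendix uses for the four-body case.

```python
# Numerical check of the chain rule, eq. (A4): grad_1 V = grad_a V . (grad_1 r_a).
# Assumptions for illustration only (not from the paper): the quadratic
# potential V, the sample vectors, and all function names below.

def cross(u, v):
    """Cross product u x v of two 3-component lists."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def internal_coordinate(r1, r2):
    """Internal coordinate r_a = r1 x r2."""
    return cross(r1, r2)

def potential(ra):
    """Hypothetical potential V(r_a) = |r_a|^2 / 2."""
    return 0.5 * sum(c * c for c in ra)

def grad_a_potential(ra):
    """Analytic gradient of the hypothetical V with respect to r_a."""
    return list(ra)

def jacobian_ra_wrt_r1(r2):
    """Tensor grad_1 r_a: rows index (x_a, y_a, z_a), columns (x_1, y_1, z_1)."""
    x2, y2, z2 = r2
    return [[0.0, z2, -y2],
            [-z2, 0.0, x2],
            [y2, -x2, 0.0]]

def chain_rule_gradient(r1, r2):
    """Row vector grad_a V multiplied by the matrix grad_1 r_a."""
    g = grad_a_potential(internal_coordinate(r1, r2))
    jac = jacobian_ra_wrt_r1(r2)
    return [sum(g[b] * jac[b][c] for b in range(3)) for c in range(3)]

def finite_difference_gradient(r1, r2, h=1e-6):
    """Central-difference gradient of V(r_a(r1)) with respect to r1."""
    grad = []
    for c in range(3):
        plus = list(r1)
        minus = list(r1)
        plus[c] += h
        minus[c] -= h
        dv = (potential(internal_coordinate(plus, r2))
              - potential(internal_coordinate(minus, r2)))
        grad.append(dv / (2.0 * h))
    return grad

r1 = [0.3, -1.2, 0.7]   # arbitrary sample bond vectors
r2 = [1.1, 0.4, -0.5]

analytic = chain_rule_gradient(r1, r2)
numeric = finite_difference_gradient(r1, r2)
# For this coordinate the matrix product collapses to a cross product.
compact = cross(r2, grad_a_potential(internal_coordinate(r1, r2)))
force = [-c for c in analytic]   # eq. (A2): F = -grad V

print("chain rule        :", analytic)
print("finite difference :", numeric)
print("r2 x grad_a V     :", compact)
```

The three printed vectors should agree to within the finite-difference error. The last one shows that, for a cross-product coordinate, the matrix product reduces to a single cross product, which avoids forming the 3 x 3 tensor explicitly.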
We will now apply this technique to the examples presented in the text. In the case of the three-body module, where the internal coordinate vectors $\mathbf{r}_a$ and $\mathbf{r}_b$ as in eqs. (6) and (7) of the text, respectively, are given by the difference in position of the $i$th particle with respect to the $j$th particle, viz.,
$$\mathbf{r}_a = \mathbf{r}_i - \mathbf{r}_j \tag{A8}$$
it can be easily seen by evaluating the partial derivatives in eq. (A6) using eq. (A8) that the relevant tensors are given by
$$\nabla_i \mathbf{r}_a = \mathbf{I} \tag{A9}$$
$$\nabla_j \mathbf{r}_a = -\mathbf{I} \tag{A10}$$
where $\mathbf{I}$ is the unit tensor.
However, in the four-body force decomposition the internal coordinates $\mathbf{r}_a$ and $\mathbf{r}_b$ are vector cross products of the bond vectors $\mathbf{r}_1$, $\mathbf{r}_2$, and $\mathbf{r}_3$, viz.,
$$\mathbf{r}_a = \mathbf{r}_1 \times \mathbf{r}_2 = [\,y_1 z_2 - z_1 y_2,\ z_1 x_2 - x_1 z_2,\ x_1 y_2 - y_1 x_2\,] \tag{A11}$$
For this case, we must evaluate the partial derivatives in eq. (A6) using eq. (A11) except with respect to the Cartesian vector $\mathbf{r}_1$ instead of $\mathbf{r}_i$. This gives for the gradient expressed in eq. (34) of the text the following result.
$$\nabla_1 \cos\phi = \left(\frac{\partial \cos\phi}{\partial x_a}\ \ \frac{\partial \cos\phi}{\partial y_a}\ \ \frac{\partial \cos\phi}{\partial z_a}\right) \begin{pmatrix} 0 & z_2 & -y_2 \\ -z_2 & 0 & x_2 \\ y_2 & -x_2 & 0 \end{pmatrix} \tag{A12}$$
When eq. (A12) is expanded it becomes obvious that it is equivalent to
$$\nabla_1 \cos\phi = \mathbf{r}_2 \times \nabla_a \cos\phi \tag{A13}$$
Similarly, the following results can be obtained for the other gradients.
$$\nabla_2 \cos\phi = \nabla_a \cos\phi \times \mathbf{r}_1 + \nabla_b \cos\phi \times \mathbf{r}_3 \tag{A14}$$
$$\nabla_3 \cos\phi = \mathbf{r}_2 \times \nabla_b \cos\phi \tag{A15}$$

References

1. K. R. Wilson, in Computer Networking and Chemistry, P. Lykos, Ed., American Chemical Society, Washington, DC, 1975, p. 17.
2. K. R. Wilson, in Minicomputers and Large Scale Computations, P. Lykos, Ed., American Chemical Society, Washington, DC, 1977, p. 147.
3. D. Fincham and B. J. Ralston, Comput. Phys. Commun., 23, 127 (1981).
4. D. Fincham, Comput. Phys. Commun., 21, 247 (1980).
5. B. J. Alder, Annu. Rev. Phys. Chem., 24, 325 (1973).
6. W. W. Wood and J. J. Erpenbeck, Annu. Rev. Phys. Chem., 27, 319 (1976).
7. J. P. Valleau and S. G. Whittington, in Statistical Mechanics Part A: Equilibrium Techniques, B. J. Berne, Ed., Plenum, New York, 1977.
8. I. R. McDonald, in Microscopic Structure and Dynamics of Liquids, J. Dupuy and A. J. Dianoux, Eds., Plenum, New York, 1977.
9. K. Binder, Monte Carlo Methods in Statistical Physics, Springer-Verlag, Berlin, 1979.
10. D. W. Wood, in Water, A Comprehensive Treatise, Vol. 6, Recent Advances, F. Franks, Ed., Plenum, New York, 1979, p. 279.
11. D. Ceperely and J. Tully, Proceedings of the Workshop on Stochastic Molecular Dynamics, National Resource for Computation in Chemistry, Berkeley, 1979.
12. M. Levitt and A. Warshel, Nature, 253, 694 (1975).
13. J. A. McCammon, B. R. Gelin, and M. Karplus, Nature, 267, 585 (1977).
14. A. Warshel, in Semi-Empirical Methods of Electronic Structure Theory, Part A, G. Segal, Ed., Plenum, New York, 1977, p. 133.
15. A. T. Hagler, F. Naider, P. S. Stern, and R. Sharon, J. Am. Chem. Soc., 101, 6842 (1979).
16. B. J. Alder and T. E. Wainwright, J. Chem. Phys., 31, 459 (1969).
17. A. Rahman, Phys. Rev. A, 136, 405 (1964).
18. L. Verlet, Phys. Rev., 159, 98 (1967).
19. P. Schofield, Comput. Phys. Commun., 5, 17 (1973).
20. J. A. McCammon and M. Karplus, Annu. Rev. Phys. Chem., 31, 29 (1980).
21. N. S. Ostlund, in Report of a Minicomputer Workshop, National Resource for Computation in Chemistry, Berkeley, 1978.
22. N. S. Ostlund, Attached Scientific Processors for Chemical Computations: A Report to the Chemistry Community, National Resource for Computation in Chemistry, Lawrence Berkeley Laboratory (LBL-10409), Berkeley, 1980, also available from the National Technical Information Service.
23. W. R. Wittmayer, Comput. Des., 93 (March 1978).
24. R. S. Bucy and K. D. Senne, Comp. Math. Applications, 6, 317 (1980).
25. W. J. Karplus and D. Cohen, Computer, 11 (September 1981).
26. D. Bergmark, in Proceedings of the 1978 ARRAY Conference, Floating Point Systems, Inc., Portland, OR, 1978.
27. Array Processor Fortran, Floating Point Systems, Inc., Portland, OR, 1978.
28. Toast Fortran Development System, System Software Factors, Reading-Berkshire, England, 1981.
29. D. White, private communication.
30. A. E. Charlesworth, Computer, 18 (September 1981).
31. H. C. Andersen, private communication.
32. C. Pottle, M. S. Pottle, R. W. Tuttle, R. J. Kinch, and H. A. Scheraga, J. Comput. Chem., 1, 46 (1980).
33. D. C. Rapaport and H. A. Scheraga, Chem. Phys. Lett., 78, 491 (1981).
34. R. H. Kincaid and H. A. Scheraga, J. Phys. Chem., 86, 833 (1982).
35. R. H. Kincaid and H. A. Scheraga, J. Phys. Chem., 85, 838 (1982).
36. W. C. Swope, H. C. Andersen, P. H. Berens, and K. R. Wilson, J. Chem. Phys., 76, 637 (1982).
37. B. J. Berne, private communication.
38. R. Dammkoehler, private communication.
39. P. Alexander, Ind. Res. Dev., 111 (May 1980).
40. W. Furey, Jr., B. C. Wang, and M. Sax, J. Appl. Crystallogr., 15, 160 (1982).
41. J. M. Dawson, R. W. Huff, and C. Wu, AFIPS Natl. Comput. Conf. Proc., 47, 395 (1978).
42. P. H. Berens and K. R. Wilson, in Picosecond Phenomena II, R. Hochstrasser, W. Kaiser, and C. V. Shank, Eds., Springer-Verlag, Berlin, 1980, p. 246.
43. P. H. Berens and K. R. Wilson, J. Chem. Phys., 74, 4872 (1981).
44. P. H. Berens, S. R. White, and K. R. Wilson, J. Chem. Phys., 75, 515 (1981).
45. P. Bado, P. H. Berens, and K. R. Wilson, in Picosecond Lasers and Applications, L. S. Goldberg, Ed., Proc. Soc. Photo-Optic. Eng., Bellingham, WA, 1982, p. 230.
46. P. H. Berens, J. P. Bergsma, and K. R. Wilson, J. Chem. Phys., to be submitted.
47. P. Bado, P. H. Berens, J. P. Bergsma, S. B. Wilson, K. R. Wilson, and E. J. Heller, in Picosecond Phenomena III, K. Eisenthal, R. Hochstrasser, W. Kaiser, and A. Laubereau, Eds., Springer-Verlag, Berlin, 1982, p. 260.
48. P. H. Berens, J. P. Bergsma, and K. R. Wilson, to be submitted.
49. P. H. Berens, D. H. J. Mackay, G. M. White, and K. R. Wilson, J. Chem. Phys., submitted.
50. B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, Englewood Cliffs, NJ, 1978.
51. D. J. Adams, in The Problem of Long-Range Forces in the Computer Simulation of Condensed Media, D. Ceperely, Ed., National Resource for Computation in Chemistry, Berkeley, 1980, p. 13.
52. H. M. Cundy and A. P. Rollet, Mathematical Models, Oxford U. P., London, 1961.
53. T. Andrea, W. C. Swope, and H. C. Andersen, to be submitted.
54. B. Quentrec and C. Brot, J. Comput. Phys., 13, 430 (1973).
55. R. O. Watts, Chem. Phys., 26, 367 (1977).
56. E. B. Wilson, Jr., J. C. Decius, and P. C. Cross, Molecular Vibrations, McGraw-Hill, New York, 1955.
57. S. Califano, Vibrational States, Wiley, London, 1976.
58. D. Beeman, J. Comput. Phys., 20, 130 (1976).
59. A. T. Hagler, private communication.
60. G. Chester, R. Gann, R. Gallagher, and A. Grimison, in Computer Modeling of Matter, P. Lykos, Ed., American Chemical Society, Washington, DC, 1980, p. 111.
61. A. Komornicki, private communication.
62. D. Jacobs, J. Prins, and K. Wilson, in Proceedings of the 1982 ARRAY Conference, Floating Point Systems, Inc., Portland, OR, 1982.
63. D. M. Ritchie and K. Thompson, Bell Syst. Tech. J., 57, 1905 (1978).
64. R. Thomas and J. Yates, A User Guide to the UNIX System, Osborne, Berkeley, 1982.
65. P. I. Richards, Manual of Mathematical Physics, Pergamon, London, 1959, p. 299.
66. R. E. Williamson, R. H. Crowell, and H. F. Trotter, Calculus of Vector Functions, Prentice-Hall, Englewood Cliffs, NJ, 1968, p. 167.

Journal of Computational Chemistry, Vol. 4, No. 3,313-332 (1983)
