
Trident: A Scalable Architecture for Scalar, Vector, and Matrix Operations

Mostafa I. Soliman and Stanislav G. Sedukhin


Graduate School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu City, Fukushima 965-8580, Japan
d8031102@u-aizu.ac.jp; sedukhin@u-aizu.ac.jp

Abstract
Within a few years it will be possible to integrate a billion transistors on a single chip. At this integration level, we propose using a high-level ISA to express parallelism to hardware instead of spending a huge transistor budget on extracting it dynamically. Since the fundamental data structures for a wide variety of applications are scalar, vector, and matrix, our proposed Trident processor extends a classical vector ISA with matrix operations. The Trident processor consists of a set of parallel vector pipelines (PVPs) combined with a fast in-order scalar core. The PVPs can access both vector and matrix register files to perform vector, matrix, and matrix-vector operations. One key point of our design is the exploitation of up to three levels of data parallelism. Another key point is the ring register files for storing vector and matrix data. The ring structure of the register files reduces the number and size of the address decoders, the number of ports, the area overhead caused by the address bus, and the number of registers attached to bit lines, as well as providing local communication between PVPs. Scaling the Trident processor does not require more fetch, decode, or issue bandwidth; it requires only replicating the PVPs and increasing the register file size. Scientific, engineering, multimedia, and many other applications, which are based on a mixture of scalar, vector, and matrix operations, can be sped up on the Trident processor.

Keywords: Parallel processing, data parallelism, vector/matrix processing, ring register file, scalable hardware.

Copyright 2002, Australian Computer Society, Inc. This paper appeared at the Seventh Asia-Pacific Computer Systems Architecture Conference (ACSAC'2002), Melbourne, Australia. Conferences in Research and Practice in Information Technology, Vol. 6. Feipei Lai and John Morris, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

Introduction

Rapid improvements in semiconductor technology fuel processor performance growth: each increase in integration density allows for higher clock rates and offers new opportunities for microarchitectural innovation (Vajapeyam and Valero 2001, Hammond, Nayfeh, and Olukotun 1997). Within a few years it will be possible to integrate a billion transistors on a single chip (Brinkman 1997, Burger and Goodman 1997, Patt, Patel, Evers, Friendly, and Stark 1997). At this integration level, it is necessary to find new processor architectures that use this huge transistor budget efficiently and meet the requirements of future applications.

Traditionally, additional transistors have been used to improve processor performance by exploiting one or more forms of parallelism to perform more work per clock cycle. Instruction-level parallelism (ILP), thread-level parallelism (TLP), and data parallelism (DP) are the three major forms of parallelism. These forms are not mutually exclusive and can be combined in one computer. We briefly discuss superscalar processors as ILP processors and single-chip multiprocessors as an example of TLP before introducing our processor design, which exploits a significant amount of DP.

Superscalar processors are capable of executing more than one instruction in parallel by exploiting ILP (Jouppi and Wall 1989, Smith and Sohi 1995). A significant portion of the die of most commercial superscalar processors is used by logic that searches for independent operations to execute in parallel; in contrast, the die area used to actually execute operations is relatively small (Lee and DeVries 1997). Superscalar performance is improved not only by trying to fetch and decode more instructions per cycle, but also by using wider out-of-order issue. However, the cost of issuing multiple instructions per cycle grows at least quadratically with issue width, and the required circuitry may soon limit the clock frequencies of superscalar processors (Palacharla, Jouppi, and Smith 1997). Moreover, research on improving superscalar performance suggests that going wider than 4-issue may not be the most effective way of exploiting ILP and using chip resources (Lee and DeVries 1997).

Implementing more than one processor on the same chip offers performance advantages over wide-issue superscalar processors (Hammond, Hubbert, Siu, Prabhu, Chen, and Olukotun 2000, Krishnan and Torrellas 1999). These single-chip multiprocessors offer high performance on single applications by exploiting loop-level parallelism, and provide high throughput and low interactive response time on multiprogramming workloads (Nayfeh, Hammond, and Olukotun 1996). Although a tightly integrated single-chip multiprocessor has low interprocessor communication delays (for a relatively small number of processors), programs must still lay data out carefully in memory to avoid conflicts between processors, minimize data communication between processors, and express synchronization at any point where processors may actively share data.

To simplify parallel programming, a single-chip multiprocessor can support thread-level speculation and memory renaming. However, thread-level speculation requires additional complex hardware to divide a program into independent threads and to track all inter-thread dependences. Rapid technological advances have had a direct impact not only on processor architecture but also on application domains. For example, in response to the increasing importance of multimedia applications such as audio, video, 2-D image processing, 3-D graphics, speech, and handwriting recognition, major processor vendors have announced extensions to their general-purpose processors in an effort to improve multimedia performance (Diefendorff and Dubey 1997, Hughes, Kaul, Adve, Jain, Park, and Srinivasan 2001). Intel extended the IA-32 with MMX (MMX Technology 1996), SSE (Raman, Pentkovski, and Keshava 2000), and recently SSE2 (Hinton, Sager, Upton, Boggs, Carmean, Kyker, and Roussel 2001); Sun enhanced SPARC with VIS (Kohn, Maturana, Tremblay, Prabhu, and Zyner 1995); Hewlett-Packard added MAX to its PA-RISC architecture (Lee 1996); Silicon Graphics extended the MIPS architecture with MDMX; Compaq added MVI to Alpha; and Motorola extended the PowerPC with AltiVec (Diefendorff, Dubey, Hochsprung, and Scales 2000). Although multimedia extensions are a good step toward incorporating vector architecture into a processor, they have several disadvantages: the vector instruction sets are limited, with fixed vector length and stride; one instruction keeps a datapath busy for only a few cycles; wider datapaths can be used only after changing either the instruction set architecture (ISA) or the issue width; multiple instructions are needed to load and align vector data; and so on (Stoodley and Lee 1999). To illustrate these disadvantages, Table 1 shows the Pentium III SSE instructions needed to perform a dot product of two eight-element vectors. Although this code performs better than the scalar equivalent, because loops are eliminated and a single instruction operates on multiple data, it still takes many instructions; a true vector ISA would need far fewer instructions to load the vectors and would keep the datapaths busy for many cycles per instruction.

Code                   | Latency | Throughput
movaps xmm0, [Vec1]    | 4       | 1/2 cycles
mulps  xmm0, [Vec2]    | 5       | 1/2 cycles
movaps xmm1, [Vec1+16] | 4       | 1/2 cycles
mulps  xmm1, [Vec2+16] | 5       | 1/2 cycles
addps  xmm0, xmm1      | 4       | 1/2 cycles
movaps xmm1, xmm0      | 1       | 1/1 cycles
shufps xmm1, xmm1, 4Eh | 2       | 1/2 cycles
addps  xmm0, xmm1      | 4       | 1/2 cycles
movaps xmm1, xmm0      | 1       | 1/1 cycles
shufps xmm1, xmm1, 11h | 2       | 1/2 cycles
addps  xmm0, xmm1      | 4       | 1/2 cycles

Table 1: Dot product on the Pentium III SSE
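For readers more comfortable with intrinsics than assembly, the following C sketch is roughly equivalent to the Table 1 sequence. It is our illustration, not code from the paper; the function name sse_dot8 and the choice of aligned loads are assumptions.

#include <xmmintrin.h>   /* SSE intrinsics (Pentium III and later) */

/* Dot product of two 8-element single-precision vectors, mirroring Table 1.
   Both inputs are assumed to be 16-byte aligned, as movaps requires. */
float sse_dot8(const float *vec1, const float *vec2)
{
    __m128 x0 = _mm_load_ps(vec1);                  /* movaps xmm0, [Vec1]     */
    x0 = _mm_mul_ps(x0, _mm_load_ps(vec2));         /* mulps  xmm0, [Vec2]     */
    __m128 x1 = _mm_load_ps(vec1 + 4);              /* movaps xmm1, [Vec1+16]  */
    x1 = _mm_mul_ps(x1, _mm_load_ps(vec2 + 4));     /* mulps  xmm1, [Vec2+16]  */
    x0 = _mm_add_ps(x0, x1);                        /* addps  xmm0, xmm1       */
    x1 = _mm_shuffle_ps(x0, x0, 0x4E);              /* shufps xmm1, xmm1, 4Eh  */
    x0 = _mm_add_ps(x0, x1);                        /* addps  xmm0, xmm1       */
    x1 = _mm_shuffle_ps(x0, x0, 0x11);              /* shufps xmm1, xmm1, 11h  */
    x0 = _mm_add_ps(x0, x1);                        /* addps  xmm0, xmm1       */
    return _mm_cvtss_f32(x0);                       /* low lane now holds the sum */
}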
Instr. Set | Example                      | Scalar Code                                                                               | Scalar operations
Scalar     | Addition                     | z=x+y;                                                                                    | 1
Vector     | Addition                     | for(i=0;i<n;i++) z[i]=x[i]+y[i];                                                          | O(n)
Vector     | Dot product                  | s=0; for(i=0;i<n;i++) s+=x[i]*y[i];                                                       | O(n)
Matrix     | Addition                     | for(i=0;i<n;i++) for(j=0;j<n;j++) z[i][j]=x[i][j]+y[i][j];                                | O(n²)
Matrix     | Matrix-vector multiplication | for(i=0;i<n;i++){s=0; for(j=0;j<n;j++) s+=x[i][j]*y[j]; z[i]=s;}                          | O(n²)
Matrix     | Matrix-matrix multiplication | for(i=0;i<n;i++) for(j=0;j<n;j++){s=0; for(k=0;k<n;k++) s+=x[i][k]*y[k][j]; z[i][j]=s;}   | O(n³)

Table 2: Trident processor instruction sets

We propose an approach which differs from the existing ones to harness the available transistor budget efficiently and meet the requirements of future applications. We use a high-level ISA to express parallelism to hardware instead of extracting parallelism dynamically in hardware or statically with compilers. Since the fundamental data structures for a wide variety of multimedia, scientific, and engineering applications are scalar, vector, and matrix, our Trident processor extends a classical vector ISA with matrix operations. This results in simpler hardware, which does not need to detect parallelism, and in three levels of instructions (scalar, vector, and matrix) for programming and expressing parallelism on Trident. The Trident ISA is high level because it extends a classical vector ISA, which is compact, expressive, and scalable (Patterson and Hennessy 1996, Asanovic 1998). Trident instructions can describe up to three levels (dimensions) of DP, as shown in Table 2. By using vector instructions, one-dimensional (1-D) DP can be exploited because each instruction specifies O(n) scalar operations on O(n) scalar data. This means that a single vector instruction is equivalent to an entire 1-D loop, with each iteration computing one element of a vector, updating the indices, and branching back to the beginning. By using matrix instructions, either 2-D DP (O(n²) scalar operations) or 3-D DP (O(n³) scalar operations) can be exploited on O(n²) scalar data. Since code with two or three nested loops may be converted into a single matrix instruction, matrix instructions give developers a further ability to reduce the number of instructions required to execute particular tasks.
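As a rough worked example (our own arithmetic, not from the paper), for n = 64 the three levels of Table 2 correspond to 1 scalar operation per scalar add, n = 64 operations per vector add, n² = 4096 operations per matrix add, and n³ = 262144 multiply-add operations per matrix-matrix multiply, each expressed by a single instruction at the respective level.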

As a direct consequence of executing far fewer instructions, the instruction fetch bandwidth, the pressure on the fetch unit, the number of executed branches, and their negative impact are all greatly reduced. The Trident processor consists mainly of datapath circuitry and register files, because all the valuable information about vector and matrix parallelism can be expressed by a single instruction. A single instruction can package up to O(n³) homogeneous scalar operations. This saves hardware in the decode and issue stages (the opcode is decoded only once) and eliminates hardware dependency checking between these operations. The Trident processor can be scaled easily due to the natural scalability of vector and matrix processing: replicating the parallel vector pipelines (PVPs) and increasing the size of the register files scales the Trident processor to process longer vectors or larger matrices without increasing the complexity of the fetch and decode units.

The remainder of this paper is organized as follows. In Section 2, we describe the ring register files and illustrate their advantages. The architecture of the Trident processor is described in Section 3. Finally, Section 4 presents our conclusions and future directions of work.

Figure 1. Monolithic, partitioned, and distributed register files

Ring Register Files

In vector and matrix processing, the same operations are executed on 1-D and 2-D arrays of elements, such as Ci = Ai vop Bi (0 ≤ i < n) and Ci,j = Ai,j mop Bi,j (0 ≤ i, j < n), respectively, where n is the length of the vectors or the size of the matrices. All the operations are carried out with the issue of a single vector or matrix instruction. That is, when a vector instruction is issued, the operation (vop) is first applied to the first element of the vector(s); in the next computational step, the same operation is applied to the second element, and so on, so that the same operation is carried out in sequence over the vector data. When a single matrix instruction is issued, the operation (mop) is applied to the first row or column (n elements) in parallel, then to the next, and so on. By the nature of vector and matrix processing, sequential access is therefore more appropriate for these elements than random access.

Of course, traditional register files (RFs), such as monolithic, partitioned, and distributed (see Figure 1), can be used to store vector and matrix elements. The monolithic RF is the most straightforward configuration for implementing any RF. However, a monolithic RF occupies the largest area because each register cell must have multiple read and write ports. When a monolithic RF is used to store vector or matrix data, all the vector or matrix elements are put together in the same RF, and any register containing a vector or matrix element can be accessed randomly, even multiple times, in the same clock cycle. To reduce the physical size of the monolithic RF, registers can be partitioned into banks where each register has only one read port and one write port. When the partitioned RF is used to store vector or matrix data, each bank contains vector or matrix elements and each register within a bank can be accessed randomly. Finally, in the distributed RF, each vector pipeline (VP) or set of VPs has direct access to only one register set (a small monolithic RF); accessing another set requires register copy operations, which in turn require extra ports. Lee (1992) has discussed the tradeoffs among monolithic, partitioned, and distributed RFs.

As we can see, these RFs access vector and matrix elements randomly, which requires calculating the address of each element instead of calculating the address of the vector or matrix (the whole set of elements) once. For example, to add two vectors, Ci = Ai + Bi (0 ≤ i < n), the addresses of the elements Ci, Ai, and Bi would have to be calculated in each computational step instead of calculating the addresses of the vectors C, A, and B only once. To make matters worse, when operating p VPs in parallel, the vector or matrix RFs send two data elements to each VP and receive one (2p read and p write operations). Moreover, additional read and write operations are needed for loading and storing vector or matrix data. When each register is accessed randomly, 3p+2 addresses would have to be calculated and decoded each cycle. Besides, the address wires would occupy a large area. NEC's vector pipelined processor uses auto-index incrementors for reading and writing vector registers (Okamoto, Hagihara, Ohkubo, Yamada, and Enomoto 1991): once the start address of a vector is supplied, the auto-index incrementors calculate the required address every cycle by incrementing the address pointer by one. Although this efficiently reduces the area overhead caused by the address bus, it does not reduce the bit-line capacitance, because the same number of registers is connected to each bit line. To resolve these problems, we propose ring RFs to store vector and matrix data. Ring RFs reduce the area overhead caused by the address bus as well as the bit-line capacitance. Furthermore, the ring RFs are smaller than the traditional ones (see Figure 1) because the required number of ports is smaller.

Figure 2. Internal connections of a ring vector register
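The following C sketch models the behaviour of one ring VR (a software illustration under our own naming; the real structure is a hardware register ring, not an array). It shows why no per-element address needs to be decoded: the only access points are Reg0 for reads and Reg(n-1) for writes, and a cyclic shift moves the data one position per cycle.

#define VR_LEN 8                       /* n, the vector length held by one ring VR */

typedef struct { double reg[VR_LEN]; } ring_vr;

double ring_vr_read(const ring_vr *v)         { return v->reg[0]; }           /* read port: Reg0      */
void   ring_vr_write(ring_vr *v, double data) { v->reg[VR_LEN - 1] = data; }  /* write port: Reg(n-1) */

/* One clock cycle with cyclic_shift asserted: every element moves to its neighbour.
   With the read_enable switch closed, the element leaving Reg0 recirculates into
   Reg(n-1); with it open, Reg(n-1) is instead filled by ring_vr_write. */
void ring_vr_shift(ring_vr *v, int read_enable_closed)
{
    double head = v->reg[0];
    for (int i = 0; i < VR_LEN - 1; i++) v->reg[i] = v->reg[i + 1];
    if (read_enable_closed) v->reg[VR_LEN - 1] = head;
}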

Figure 2 shows the internal connections of a ring vector register (VR), which consists of n registers of b bits each. The cyclic_shift signal should be high while the corresponding ring VR is being read or written. When the read_enable switch is closed, vector elements can be read sequentially (one element every cycle) through Reg0, as shown in Figure 2b. However, if the read_enable switch is open, vector elements can be written sequentially through Reg(n-1), as shown in Figure 2c. It is also possible to read and write a ring VR concurrently, as shown in Figure 2c. As we can see, the ring VR uses local connections between its registers instead of global read and write buses. Multiple ring VRs, considered collectively as a vector RF, give the appearance of an RF with multiple read and write ports: a ring vector RF with N vector registers allows N accesses to occur in the same cycle. The ring vector RF has a set of demultiplexors that supply Read_data from the ring VRs to the input buses of the PVPs, and a set of multiplexors that connect the output buses of the PVPs to the Write_data of the ring VRs, as shown in Figure 3. Although only one register per ring VR is available during each clock cycle, this suffices due to the sequential nature of vector processing.

Extending the idea of the ring VR, a matrix register (MR) looks like a set of ring VRs which are cyclically shifted in parallel, as shown in Figure 4. The ring MR can cyclically shift its stored data only in 1-D, so by using a set of PVPs, rows or columns of matrices can be processed each cycle. While a 1-D cyclic shift of MRs is sufficient for many element-wise matrix operations, such as addition and subtraction, others, such as multiplication, inversion, and decomposition, require full connection with the PVPs. This means that all registers in a ring MR should be connected with the PVPs to implement a 2-D cyclic shift of a ring MR.

Figure 3. Connection of the vector RF with PVP buses

Figure 4. Internal connections of a matrix register

There are two possible schemes to provide full connection with the PVPs (all-to-all broadcast): one is based on multiple one-to-all broadcasts and the other on cyclic data shifts. Both need the same number of logical steps, O(p), to perform an all-to-all broadcast on p PVPs. Figure 5 shows the implementation of matrix-vector multiplication, which requires all-to-all broadcast (see Figure 5a). The first scheme (see Figure 5b) is not scalable because it requires global connection and synchronization with all PVPs. On the other hand, the communication scheme based on cyclic data shifts (see Figure 5c) can be scaled up easily because it requires only local communication between elements. Because full connection with the PVPs is important for many matrix operations, we designed a set of ring vector registers with multiple ports for implementing the second scheme (cyclic shift). These multiple-port vector registers (MPVRs) can be loaded in parallel from an MR, from memory, or from the output buses of the PVPs. To simplify our architectural design, each MPVR connects with only one MR rather than with all MRs (as an example, Figure 6 shows a four-element MPVR). These MPVRs can be used efficiently in matrix-vector multiplication, matrix-matrix multiplication, and many other matrix operations, as we will show later.

(a) Matrix-vector multiplication
(b) Implementing (a) by p one-to-all broadcasts
(c) Implementing (a) by p rotations of ring VRs
Figure 5. Matrix-vector multiplication
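A minimal C sketch of the cyclic-shift scheme of Figure 5c follows (our illustration; the function and parameter names are ours): p PVPs each hold one element of an MPVR, and one neighbour-to-neighbour rotation per step lets every PVP see every element after p steps, using only local links.

/* All-to-all broadcast by cyclic shifts among p PVPs.  consume(vp, step, value) stands
   for whatever each PVP does with the element it currently holds (e.g. a multiply-add). */
void rotate_broadcast(double mpvr[], int p,
                      void (*consume)(int vp, int step, double value))
{
    for (int step = 0; step < p; step++) {
        for (int vp = 0; vp < p; vp++)       /* the PVPs work in parallel; serialized here */
            consume(vp, step, mpvr[vp]);
        double last = mpvr[p - 1];           /* one cyclic shift over local links only */
        for (int vp = p - 1; vp > 0; vp--)
            mpvr[vp] = mpvr[vp - 1];
        mpvr[0] = last;
    }
}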

Figure 6. 2-D cyclical shift of a matrix register

The Architecture of the Trident Processor

Hiding the execution details of high-level instructions, such as vector and matrix instructions, is the idea behind the Trident processor. As shown in Table 2, a single vector instruction can be decomposed into several scalar instructions, and a large number of scalar instructions are contained in a single matrix instruction. Not only is the number of instructions reduced, but also the number of internal operations, because most address computations, loop counter increments, and branch computations are embedded in these powerful instructions. Therefore, the Trident processor can efficiently execute programs that require a mixture of scalar, vector, and matrix computations, because its ISA supports these data structures. Figure 7 shows an overall block diagram of the Trident processor. The fetch and decode stages are common to scalar, vector, and matrix instructions. As in a typical scalar processor, instructions are fetched from the instruction cache and stored in the fetch buffer awaiting decoding. The dispatch unit splits the incoming decoded instruction stream into three different streams, each going to a different processing unit: the address unit, the scalar unit, and the vector/matrix unit. All these units communicate via queues (Hsu 1994, Roger and Mateo 1999). The address unit performs scalar, vector, and matrix memory accesses as well as all address computations. The address unit is able to slip ahead of the scalar and vector/matrix processing units and load data that will be needed. The data fetched by the address unit is stored in the queues and stays there until the processing units retrieve it. Scalar memory accesses go first through the first-level cache, which holds only scalar data; vector and matrix accesses go to the second-level cache.

The scalar unit includes traditional scalar functional units and a scalar RF. The primary job of the scalar unit is to perform fundamental scalar instructions (addition, subtraction, multiplication, division, and others) and to service the other execution units. Vector processing uses the PVPs to execute vector operations. Each VP is capable of executing fundamental vector operations (addition, subtraction, multiplication, division, and others) on vector data stored in ring VRs, as shown in Figure 8a (we show a four-element VR as an example). It is also possible to chain some of the scalar and/or vector registers to perform more complex vector operations, such as chaining VRs, a multiplier, and an adder to calculate a dot product (see Figure 8b). Although the Trident processor issues only one vector instruction per cycle, succeeding vector instructions can be executed in parallel on the PVPs.
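The chained dot product of Figure 8b can be pictured with the small C model below (an illustration under our own naming, not the actual pipeline control): each cycle one pair of elements leaves the ring VRs, the multiplier result feeds straight into the adder, and only the final sum is written to a scalar register.

/* Chained multiply-add over two ring VRs of length n; the result lands in r0. */
double chained_dot_product(const double *vr0, const double *vr1, int n)
{
    double r0 = 0.0;                                /* scalar destination register   */
    for (int cycle = 0; cycle < n; cycle++) {
        double product = vr0[cycle] * vr1[cycle];   /* multiplier stage              */
        r0 += product;                              /* adder stage, chained directly */
    }
    return r0;
}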

Figure 7. Trident processor block diagram

By using a set of PVPs, element-wise (independent) matrix computations, such as addition, subtraction, multiplication, and division, can be performed on matrices stored in ring MRs; Figure 9 shows a 4×4 MR to clarify the idea. The PVPs are able to communicate with the scalar, vector, and matrix RFs to perform scalar-vector, scalar-matrix, vector-vector, matrix-vector, and matrix-matrix operations. Not only the element-wise matrix operations but also the dependent ones, such as matrix-vector and matrix-matrix multiplication, can be done on the PVPs.
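As a rough illustration of Figure 9 (our own sketch and naming, not the hardware), an element-wise matrix operation on p PVPs simply streams rows out of the ring MRs, p elements per step:

/* Element-wise matrix addition on p PVPs; assumes p divides n for simplicity. */
void matrix_add_on_pvps(int n, int p, double z[n][n],
                        const double x[n][n], const double y[n][n])
{
    for (int row = 0; row < n; row++)                 /* one row of the ring MRs per pass   */
        for (int start = 0; start < n; start += p)    /* p elements are handled per step    */
            for (int vp = 0; vp < p; vp++) {          /* PVPs in parallel; serialized here  */
                int col = start + vp;
                z[row][col] = x[row][col] + y[row][col];
            }
}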

Sequential multiplication of an n×n matrix by a vector takes O(n²) computational steps, each involving a multiplication and an accumulation. A systolic implementation of matrix-vector multiplication is feasible on p PVPs in O(n²/p) computational steps, 1 ≤ p ≤ n. We propose implementing not only matrix-vector multiplication but also vector-matrix multiplication on p = n PVPs, as shown in Figure 10. Both can be used for alternate processing of matrix rows and columns. This is because many matrix problems, such as the 2-D DFT, singular value decomposition, and eigenvalue problems, require working on a matrix and its transpose simultaneously; this can be realized efficiently without explicitly transposing the matrix.
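The schedule of Figure 10a can be written down directly as the C sketch below (our illustration; the indexing follows the Figure 10a table): with p = n PVPs, VP i accumulates c[i], and at step t it consumes the rotated element b[(i+t) mod n] together with a[i][(i+t) mod n], so only one vector element has to reach each pipeline per step.

/* Matrix-vector product c = A*b on p = n PVPs in n steps (Figure 10a). */
void matvec_on_pvps(int n, double c[n], const double a[n][n], const double b[n])
{
    for (int i = 0; i < n; i++) c[i] = 0.0;          /* s_i = 0 in step 1 of Figure 10a   */
    for (int t = 0; t < n; t++)                      /* computational steps               */
        for (int i = 0; i < n; i++) {                /* PVPs in parallel; serialized here */
            int j = (i + t) % n;                     /* element delivered by the rotation */
            c[i] += a[i][j] * b[j];
        }
}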

As an extension of matrix-vector multiplication, matrix-matrix multiplication can be implemented systolically on p PVPs in O(n³/p) computational steps, 1 ≤ p ≤ n². We limit ourselves to p = n PVPs because then the time complexity of parallel n×n matrix-matrix multiplication is not affected drastically by loading the source matrices into MRs. Figure 11 shows how n PVPs, a set of MRs, and an MPVR can be used for n×n matrix-matrix multiplication. Many other matrix operations, such as addition, subtraction, and decomposition, can also be implemented efficiently using PVPs and ring RFs (Sedukhin 1990).
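Extending the same idea, the Figure 11 schedule amounts to repeating the rotated matrix-vector step for every column of the second operand. The C sketch below is our illustration of that schedule (names are ours); with p = n PVPs it takes n·n multiply-add steps, i.e. O(n³/p).

/* Matrix-matrix product C = A*B on p = n PVPs, following the Figure 11 schedule. */
void matmul_on_pvps(int n, double c[n][n], const double a[n][n], const double b[n][n])
{
    for (int j = 0; j < n; j++)                       /* one result column per n steps     */
        for (int t = 0; t < n; t++)                   /* multiply-add steps                */
            for (int i = 0; i < n; i++) {             /* PVPs in parallel; serialized here */
                int k = (i + t) % n;                  /* rotated index, as in Figure 11    */
                if (t == 0) c[i][j] = 0.0;            /* s_i = 0                           */
                c[i][j] += a[i][k] * b[k][j];
            }
}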

(a) Vector pipeline

(b) r0 = Dot product (VR0, VR1)

Step | Accumulator
1    | s=0; s=s+a0b0
2    | s=s+a1b1
3    | s=s+a2b2
4    | c=s+a3b3

Figure 8. Vector pipeline

Note that sequentially loading a source n×n matrix from memory into an MR, which requires O(n²) steps, drastically affects the total time of a matrix-vector computation, because the computational time of the parallel implementation on p = n PVPs is only O(n). If the same matrix in an MR is to be used many times, however, the overhead of loading the source matrix becomes less and less significant as the number of multiplications increases. Thus, loading a source matrix once and using it many times is the efficient way to use matrix-vector multiplication. Iterative methods for solving systems of linear algebraic equations, such as Jacobi and Gauss-Seidel iterations, and 3-D graphics transformations are good examples where the same matrix is multiplied by multiple vectors.
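A back-of-the-envelope count (ours, ignoring constant factors) makes the amortization explicit: loading the matrix costs about n² steps, and each of the k matrix-vector products then costs about n steps on p = n PVPs, so

\[
T(k) \approx n^2 + k\,n, \qquad \frac{T(k)}{k} \approx \frac{n^2}{k} + n \;\to\; n \quad \text{as } k \to \infty,
\]

i.e. once k is on the order of n or more, the one-time load no longer dominates.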

Figure 9. Element-wise matrix operations on PVPs and ring matrix registers

(a) c = Ab
Step | VP 0              | VP 1              | VP 2              | VP 3
1    | s0=0; s0=s0+a00b0 | s1=0; s1=s1+a11b1 | s2=0; s2=s2+a22b2 | s3=0; s3=s3+a33b3
2    | s0=s0+a01b1       | s1=s1+a12b2       | s2=s2+a23b3       | s3=s3+a30b0
3    | s0=s0+a02b2       | s1=s1+a13b3       | s2=s2+a20b0       | s3=s3+a31b1
4    | c0=s0+a03b3       | c1=s1+a10b0       | c2=s2+a21b1       | c3=s3+a32b2

(b) c = bA
Step | VP 0              | VP 1              | VP 2              | VP 3
1    | c0=0; c1=c0+a00b0 | c1=0; c2=c1+a11b1 | c2=0; c3=c2+a22b2 | c3=0; c0=c3+a33b3
2    | c2=c1+a01b0       | c3=c2+a12b1       | c0=c3+a23b2       | c1=c0+a30b3
3    | c3=c2+a02b0       | c0=c3+a13b1       | c1=c0+a20b2       | c2=c1+a31b3
4    | c0=c3+a03b0       | c1=c0+a10b1       | c2=c1+a21b2       | c3=c2+a32b3


Figure 10. Matrix-vector and vector-matrix multiplication

Step | VP 0                | VP 1                | VP 2                | VP 3
1    | s0=0; s0=s0+a00b00  | s1=0; s1=s1+a11b10  | s2=0; s2=s2+a22b20  | s3=0; s3=s3+a33b30
2    | s0=s0+a01b10        | s1=s1+a12b20        | s2=s2+a23b30        | s3=s3+a30b00
3    | s0=s0+a02b20        | s1=s1+a13b30        | s2=s2+a20b00        | s3=s3+a31b10
4    | c00=s0+a03b30       | c10=s1+a10b00       | c20=s2+a21b10       | c30=s3+a32b20
5    | s0=0; s0=s0+a00b01  | s1=0; s1=s1+a11b11  | s2=0; s2=s2+a22b21  | s3=0; s3=s3+a33b31
6    | s0=s0+a01b11        | s1=s1+a12b21        | s2=s2+a23b31        | s3=s3+a30b01
7    | s0=s0+a02b21        | s1=s1+a13b31        | s2=s2+a20b01        | s3=s3+a31b11
8    | c01=s0+a03b31       | c11=s1+a10b01       | c21=s2+a21b11       | c31=s3+a32b21
9    | s0=0; s0=s0+a00b02  | s1=0; s1=s1+a11b12  | s2=0; s2=s2+a22b22  | s3=0; s3=s3+a33b32
10   | s0=s0+a01b12        | s1=s1+a12b22        | s2=s2+a23b32        | s3=s3+a30b02
11   | s0=s0+a02b22        | s1=s1+a13b32        | s2=s2+a20b02        | s3=s3+a31b12
12   | c02=s0+a03b32       | c12=s1+a10b02       | c22=s2+a21b12       | c32=s3+a32b22
13   | s0=0; s0=s0+a00b03  | s1=0; s1=s1+a11b13  | s2=0; s2=s2+a22b23  | s3=0; s3=s3+a33b33
14   | s0=s0+a01b13        | s1=s1+a12b23        | s2=s2+a23b33        | s3=s3+a30b03
15   | s0=s0+a02b23        | s1=s1+a13b33        | s2=s2+a20b03        | s3=s3+a31b13
16   | c03=s0+a03b33       | c13=s1+a10b03       | c23=s2+a21b13       | c33=s3+a32b23

Figure 11. Matrix-matrix multiplication using parallel vector pipelines and ring registers

Conclusion and Future Work

In this paper we have proposed an architecture for the Trident processor, which has three instruction sets: scalar, vector, and matrix. A set of PVPs and ring RFs has been added to a scalar core to deal efficiently with the vector and matrix operations found in a broad range of scientific, engineering, and multimedia applications. The Trident processor ISA is compact, expressive, and scalable. This significantly reduces the instruction fetch bandwidth, the pressure on the fetch unit, the negative impact of branches, the dispatch of instructions, the address calculations, and the loop control. Hiding the execution details of vector and matrix instructions by exploiting up to 3-D DP is one key point of our Trident processor. Another key point is using ring RFs to store vector and matrix data. The ring nature of the register files reduces the number of required read/write ports, the area overhead caused by the address bus, and the number of registers attached to bit lines, as well as providing local communication between PVPs.

We expect that the implementation of the Trident processor will not be complex, because the regularity and locality of the vector and matrix RFs, as well as the elimination of complicated interconnection networks (such as crossbars) between PVPs, make it very suitable for VLSI implementation. Furthermore, the control logic for issuing new vector and matrix instructions is relatively simple because parallelism does not need to be detected by hardware; this can allow more datapath capabilities in the available die area. The architecture of the Trident processor is scalable: its scalability does not require more fetch, decode, or issue bandwidth, but rather requires replicating VPs and increasing the size of the RFs. We expect that many applications will benefit from the Trident processor, such as neural networks, multimedia, DSP, and BLAS, to name just a few. In the near future, we will simulate the Trident processor and evaluate its performance on some multimedia and numerical applications. We will also compare the performance of the Trident processor with wide-issue superscalar processors for real applications that are mostly based on a mixture of scalar, vector, and matrix operations.

Acknowledgement

We wish to thank Prof. John Morris for refereeing and reviewing this paper.

References

ASANOVIC, K. (1998): Vector microprocessors. Ph.D. thesis, University of California, Berkeley.
BRINKMAN, W. (1997): The transistor: 50 glorious years and where we are going. Proc. IEEE International Solid-State Circuits Conference, 22-27.
BURGER, D. and GOODMAN, J. (1997): Billion-transistor architectures. IEEE Computer, 30(9):46-48.
CORBAL, J., VALERO, M. and ESPASA, R. (1999): Exploiting a new level of DLP in multimedia applications. Proc. 32nd Annual International Symposium on Microarchitecture (MICRO-32).
DIEFENDORFF, K. and DUBEY, P. (1997): How multimedia workloads will change processor design. IEEE Computer, 30(9):43-45.
DIEFENDORFF, K., DUBEY, P., HOCHSPRUNG, R. and SCALES, H. (2000): AltiVec extension to PowerPC accelerates media processing. IEEE Micro, 20(2):85-95.
HAMMOND, L., HUBBERT, B., SIU, M., PRABHU, M., CHEN, M. and OLUKOTUN, K. (2000): The Stanford Hydra CMP. IEEE Micro, 20(2):71-84.
HAMMOND, L., NAYFEH, B. and OLUKOTUN, K. (1997): A single-chip multiprocessor. IEEE Computer, 30(9):70-85.
HINTON, G., SAGER, D., UPTON, M., BOGGS, D., CARMEAN, D., KYKER, A. and ROUSSEL, P. (2001): The microarchitecture of the Pentium 4 processor. Intel Technology Journal, 1st quarter.
HSU, P. (1994): Designing the TFP microprocessor. IEEE Micro, 14(2):23-33.
HUGHES, C., KAUL, P., ADVE, S., JAIN, R., PARK, C. and SRINIVASAN, J. (2001): Variability in the execution of multimedia applications and implications for architecture. Proc. 28th Annual International Symposium on Computer Architecture.
JOUPPI, N. and WALL, D. (1989): Available instruction-level parallelism for superscalar and superpipelined machines. Proc. 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Massachusetts, 272-282.
KOHN, L., MATURANA, G., TREMBLAY, M., PRABHU, A. and ZYNER, G. (1995): The visual instruction set (VIS) in UltraSPARC. Proc. COMPCON'95: Technologies for the Information Superhighway.
KRISHNAN, V. and TORRELLAS, J. (1999): A chip-multiprocessor architecture with speculative multithreading. IEEE Transactions on Computers, 48(9):866-880.
LEE, C. (1992): Code optimizers and register organizations for vector architectures. Ph.D. thesis, University of California, Berkeley.
LEE, C. and DEVRIES, D. (1997): Initial results on the performance and cost of vector microprocessors. Proc. 30th Annual International Symposium on Microarchitecture (MICRO-30).
LEE, R. (1996): Subword parallelism with MAX-2. IEEE Micro, 16(4):51-59.
MMX Technology (1996): Intel Architecture MMX Technology Programmer's Reference Manual. Intel Corporation.
NAYFEH, B., HAMMOND, L. and OLUKOTUN, K. (1996): Evaluation of design alternatives for a multiprocessor microprocessor. Proc. 23rd Annual International Symposium on Computer Architecture.
OKAMOTO, F., HAGIHARA, Y., OHKUBO, C., YAMADA, H. and ENOMOTO, T. (1991): A 200-MFLOPS 100-MHz 64-b BiCMOS vector-pipelined processor (VPP) VLSI. IEEE Journal of Solid-State Circuits, 26:1885-1892.
PALACHARLA, S., JOUPPI, N. and SMITH, J. (1997): Complexity-effective superscalar processors. Proc. 24th Annual International Symposium on Computer Architecture.
PATT, Y., PATEL, S., EVERS, M., FRIENDLY, D. and STARK, J. (1997): One billion transistors, one uniprocessor, one chip. IEEE Computer, 30(9):51-57.
PATTERSON, D. and HENNESSY, J. (1996): Computer Architecture: A Quantitative Approach. 2nd edition, Morgan Kaufmann Publishers.
RAMAN, S., PENTKOVSKI, V. and KESHAVA, J. (2000): Implementing streaming SIMD extensions on the Pentium III processor. IEEE Micro, 20(4):47-57.
ROGER, E. and MATEO, V. (1999): A simulation study of decoupled vector architectures. The Journal of Supercomputing, October.
SEDUKHIN, S. (1990): Organization of systolic computations on a ring of computers. Proc. 5th International Workshop on Parallel Processing (PARCELLA'90), 273-278.
SMITH, J. and SOHI, G. (1995): The microarchitecture of superscalar processors. Proceedings of the IEEE, 83:1609-1624.
STOODLEY, M. and LEE, C. (1999): Vector microprocessors for desktop computing. Proc. 26th Annual International Symposium on Computer Architecture.
VAJAPEYAM, S. and VALERO, M. (2001): Early 21st century processors. IEEE Computer, 34(4):47-50.
