RISC-Based Architecture For Computer Hardware Introduction

Antonio H. Zavala, Jorge Avante R., Quetzalcóatl Duarte R., J. David Valencia P.
Unidad Profesional Interdisciplinaria en Ingeniería y Tecnologías Avanzadas, Instituto Politécnico Nacional, Distrito Federal, México
anhernandezz@ipn.mx; javanter0700@ipn.mx; eduarter0400@ipn.mx; jvalenciap0700@ipn.mx
978-1-61284-840-2/11/$26.00 ©2011 IEEE

Abstract: Nowadays, computers are indispensable tools for most everyday activities, ranging from consumer electronics to industrial process automation. The complexity of new applications leads computer engineers to use embedded systems to develop high-performance technological solutions that achieve high-speed processing while exploiting hardware resources efficiently. To develop embedded systems, it is necessary to understand the basic operation of a computer system, which is mainly composed of memory, a peripheral controller and a microprocessor. This work presents the design and implementation of an 8-bit RISC soft-core processor intended for introductory computer architecture courses, where it is considered an effective aid to understanding how a computer works. The final circuit can be used as a soft core in FPGA embedded designs, mainly for control and automation applications.
Keywords: Digital Logic, Microprocessor, Computer Architecture, Programmable Logic

I. INTRODUCTION
The traditional approach to developing a digital system was to use a set of interconnected digital integrated circuits such as counters, buffers, logic gates and memories. That task required extensive analysis and testing, and the design had to be adapted to the inherent limitations of the hardware (speed, response time, power consumption, etc.), which left little headroom for development. In addition, every design change implied a complete new analysis, and sometimes the prototyping hardware would not allow any expansion without a considerable (and often expensive) upgrade. At the present time, technological advances have brought new options such as programmable logic, in the form of Complex Programmable Logic Devices (CPLD) or Field Programmable Gate Arrays (FPGA), together with more sophisticated simulation and design verification environments. These enable engineers to reach new levels of complexity and robustness while greatly reducing the time between development and implementation. Those advances also let engineers focus on the needs of the application rather than on fitting the system onto existing hardware. In this way, one can develop a system that can be optimized for manufacturing on a single chip, with the capacity to add or remove modules as future requirements dictate. Modern processor designs sometimes reduce the implementation effort by acquiring some of these elements as Intellectual Property (IP) cores, while the remaining components are built with implementation techniques such as the VHDL or Verilog languages and a proprietary synthesizer that depends on the hardware vendor.
Different approaches exist for teaching computer architecture effectively. One approach uses processor simulators that allow interaction with each module that composes a computer, as in [1]-[4]. Another option is to use FPGA devices and the VHDL language to construct a simple computer [5]-[6], with the disadvantage that describing the hardware by programming obscures the dataflow. A third approach builds a computer from MSI digital components such as TTL devices [7]; it is a good exercise for understanding the interconnections between components, but wiring those connections by hand can be excessively time-consuming. Finally, a simple computer can be constructed with a GUI that allows gate-level design by placing components on a schematic sheet. This option is easy to use and makes the dataflow between components easy to follow, and it is even more instructive when all of the computer components are created from logic gates.
In the present work, the design of an 8-bit Reduced Instruction Set Computer (RISC) processor is presented; it was developed with simplicity and implementation efficiency in mind. It has a complete instruction set, program and data memories, general purpose registers and a simple Arithmetic Logic Unit (ALU) for basic operations. It follows a multi-cycle execution scheme and is implemented on a Xilinx Spartan-3E FPGA.
The following section presents the main characteristics of the processor. The third section introduces the instruction set and the instruction format used to program the processor. The fourth section presents the blocks that constitute the processor and their integration. Section five presents timing and implementation results, followed by conclusions in the sixth section.

II. CPU ARCHITECTURE
The processor is based on the Harvard architecture, in which the size of the instructions is not related to the size of the data. The instruction width can therefore be chosen so that every instruction occupies a single position of program memory, which yields greater speed and shorter programs. In addition, instruction fetches can overlap with data accesses, which further increases the speed of each operation.
Figure 1. The Harvard computer architecture composed of: program and data memories and the CPU interconnected by buses.
The processor presented includes a RISC instruction set and uses a Single Instruction Single Data (SISD) execution order. Its main characteristics are:
- Eight 8-bit general purpose registers
- 256 locations of 16-bit wide ROM program memory
- 256 locations of 8-bit wide RAM data memory
- ALU with basic arithmetic and logical operations

III. INSTRUCTION SET
For a complete CPU design, it was necessary to create a specific RISC instruction set and its own assembly code with a suitable instruction format. The instruction set had to meet two core objectives, simplicity and robustness, i.e. using the simplest instructions that make the processor capable of executing complex operations and routines in the fewest steps. To achieve these objectives, the required instructions were classified by purpose in three groups:
- Operations (Arithmetic and Logical)
- Program Control (Jumps)
- Data Manipulation (Load and Storage)
The arithmetic operations selected were addition and subtraction, because other arithmetic operations can be built from these two. These operations can be executed between two registers, or between one register and an immediate value. The logical operations selected were AND, OR, NOT, shift right and shift left. The first two are executed between two registers, while the others are applied to a single register, called the source register.
The branch instructions include an immediate jump, a jump to a value contained in a register, and conditional branches, one for each status flag. The execution of a conditional branch depends on the state of the corresponding flag in the status register.
The load and store instructions must reach different data sources through three addressing modes: immediate, direct to data memory and indirect to data memory. Table I describes the complete instruction set, showing the mnemonic, description, syntax and micro-operation of every instruction.
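As an illustration of the remark that other arithmetic operations can be built from addition and subtraction, the following minimal Python sketch multiplies two 8-bit values by repeated addition. The helper names (`add8`, `sub8`, `mul8`) are hypothetical and the code models only the arithmetic, not the actual processor program; results wrap around at 8 bits, as they would in an 8-bit register.

```python
MASK = 0xFF  # registers are 8 bits wide


def add8(a, b):
    """8-bit addition with wrap-around, like ADD/ADDI on an 8-bit register."""
    return (a + b) & MASK


def sub8(a, b):
    """8-bit subtraction with wrap-around, like SUB/SUBI on an 8-bit register."""
    return (a - b) & MASK


def mul8(a, b):
    """Multiply by repeated addition, using b as a down-counter."""
    result = 0
    while b != 0:  # plays the role of a zero-flag test on the counter
        result = add8(result, a)
        b = sub8(b, 1)
    return result


print(mul8(6, 7))    # 42
print(mul8(20, 13))  # 260 wraps around to 4 in 8 bits
```

On the processor itself the loop would be expressed with ADD, SUBI and a conditional branch; the exact instruction sequence is not given in the text, so it is not reproduced here.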
The development of the instruction format was restricted by several factors: the CPU architecture, the number of instructions, and the operands involved, the last being the most important. Each instruction has its own OPCODE (Operation Code), which is the machine-language counterpart of the assembly mnemonic. The length in bits of the OPCODE is set by the number of bits required to represent the total number of instructions. As the complete set contains 20 instructions, 5 bits are enough to represent them (5 bits distinguish up to 32 codes).
Operands are the most important factor because they make it necessary to reclassify the instructions as follows [8]:
- Type J (Jumps)
- Type I (Immediate data)
- Type R (Register operations)
Type J instructions use only the OPCODE and an 8-bit constant value (k). Type I instructions use the OPCODE, a destination (Rd) or source (Rs) register, and a constant value (k). Type R instructions require either a single destination register or both a destination and a source register.
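The three instruction types can be sketched as a small encoder. The field widths (5-bit OPCODE, 3-bit register fields, 8-bit constant, 16-bit word) are the ones stated in the paper; the bit positions of the fields and the opcode numbers used in the example are assumptions made for illustration only, since the actual layout is given in Fig. 2 and is not reproduced in the text.

```python
from math import ceil, log2

NUM_INSTRUCTIONS = 20
OPCODE_BITS = ceil(log2(NUM_INSTRUCTIONS))  # 5 bits cover up to 32 opcodes
REG_BITS = 3      # 8 general purpose registers
CONST_BITS = 8    # 8-bit constant k
WORD_BITS = 16    # standardized instruction length

# ASSUMED layout (not taken from the paper): OPCODE in the top bits,
# first register next, then either k or the second register.
OPCODE_SHIFT = WORD_BITS - OPCODE_BITS  # 11
REG1_SHIFT = OPCODE_SHIFT - REG_BITS    # 8
REG2_SHIFT = REG1_SHIFT - REG_BITS      # 5


def _field(value, bits, name):
    """Check that a value fits in the given number of bits."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return value


def encode_j(opcode, k):
    """Type J: OPCODE and constant k."""
    return (_field(opcode, OPCODE_BITS, "opcode") << OPCODE_SHIFT) | _field(k, CONST_BITS, "k")


def encode_i(opcode, reg, k):
    """Type I: OPCODE, one register (Rd or Rs) and constant k."""
    return (encode_j(opcode, k)
            | (_field(reg, REG_BITS, "reg") << REG1_SHIFT))


def encode_r(opcode, rd, rs=0):
    """Type R: OPCODE, destination register and optional source register."""
    return ((_field(opcode, OPCODE_BITS, "opcode") << OPCODE_SHIFT)
            | (_field(rd, REG_BITS, "rd") << REG1_SHIFT)
            | (_field(rs, REG_BITS, "rs") << REG2_SHIFT))


# Hypothetical opcode number 0b01110 used only as an example value.
print(f"{encode_i(0b01110, 3, 0x2A):04X}")  # 732A
```

With 5 + 3 + 8 = 16 bits, a Type I instruction fills the word exactly, which is consistent with the 16-bit instruction length chosen by the authors.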
TABLE I. INSTRUCTION SET

Mnemonic  Description                  Syntax         Micro-operation
ADD       Addition                     ADD Rd, Rs     Rd ← Rd + Rs
SUB       Subtraction                  SUB Rd, Rs     Rd ← Rd - Rs
ADDI      Immediate addition           ADDI Rd, k     Rd ← Rd + k
SUBI      Immediate subtraction        SUBI Rd, k     Rd ← Rd - k
AND       Logic AND                    AND Rd, Rs     Rd ← Rd AND Rs
OR        Logic OR                     OR Rd, Rs      Rd ← Rd OR Rs
NOT       Logic NOT                    NOT Rd         Rd ← NOT (Rd)
SHL       Shift register left          SHL Rd         Rd(n+1) ← Rd(n), Rd(0) ← 0
SHR       Shift register right         SHR Rd         Rd(n) ← Rd(n+1), Rd(7) ← 0
JMP       Immediate jump               JMP k          PC ← PC + k
JMR       Jump to register             JMR Rd         PC ← Rd
BRC       Branch if carry              BRC k          if (C = 1) then PC ← PC + k
BRZ       Branch if zero               BRZ k          if (Z = 1) then PC ← PC + k
BRH       Branch if half-carry         BRH k          if (H = 1) then PC ← PC + k
LDI       Load immediate               LDI Rd, k      Rd ← k
LDD       Direct load from memory      LDD Rd, [A]    Rd ← [A]
LDX       Indirect load from memory    LDX Rd, [Rs]   Rd ← [Rs]
STD       Direct storage to memory     STD [A], Rd    [A] ← Rs
STX       Indirect storage to memory   STX [Rd], Rs   [Rd] ← Rs
LDP       Load from program counter    LDP Rd         Rd ← PC

Rd: Destination register. Rs: Source register. k: Constant. PC: Program counter. A: Address.
Figure 2. Instruction format per instruction type.
Three bits are used to specify a register (source or destination), since the CPU has 8 general purpose registers. Eight bits are used to specify a constant value. The instruction length was standardized to 16 bits. Fig. 2 shows the instruction format for each type.

IV. HARDWARE IMPLEMENTATION
A. Program Counter (PC)
The program counter produces the address used to read instructions from the program memory. It has to be capable of loading an arbitrary address if the program requires it (e.g. loops or branches), and it should be able to wait while the other functional parts complete their tasks (e.g. while the ALU computes the sum of 2 registers). An 8-bit parallel-load counter was used [9]-[11].
B. ROM Program Memory
The program memory, as its name indicates, stores the instructions to be executed. It has to be non-volatile and fast. Internal ROM was chosen as program memory because it was the fastest option and eliminated the need for external storage. This memory was built with 256x1-bit ROM blocks, with a total of 16 blocks to store 256 16-bit instructions.
C. Instruction Registers and Decoder
The instruction registers store the instruction read from the program memory and hold it as an input for the decoder, which separates the operation code and operands in order to execute the command. Two 8-bit registers (D-type flip-flops) were used, one for the high byte and one for the low byte of the 16-bit instruction. The instruction decoder takes the raw data stored in the instruction registers and separates it into its parts (OPCODE, Rd, Rs and A/k), so that these values can be sent to the corresponding component, such as the ALU or the General Purpose Register block. This is achieved simply by using a set of buffers inside a block to route the signals to separate buses.
D. General Purpose Registers (GPR)
The General Purpose Registers (GPRs) hold operands and results during program execution. The ALU and the memories must be able to read and write those registers, so a set of eight 8-bit registers was used, along with multiplexers and a decoder that control which register is read or written. Two registers can be read at a time, but only one register can be written at a time (see Fig. 3).
E. Arithmetic-Logic Unit (ALU)
The Arithmetic-Logic Unit has 8 operations. Each of them was designed and converted into a schematic symbol, and a multiplexer then selects among them by means of a 3-bit selector. The ALU also has 3 flags, carry (C), half carry (H) and zero (Z), which indicate whether there is a carry, a half carry or a zero result after any ALU operation (see Fig. 4).
Figure 3. General purpose registers block.
Figure 4. Complete ALU block.
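A behavioral sketch of such an ALU, written in Python, may help to relate the operations of Table I to the three flags. Several points are assumptions rather than facts from the paper: the function name `alu`, the use of mnemonic strings instead of the 3-bit selector codes (which the text does not list), the borrow convention for C and H on subtraction, and the choice to leave C and H cleared for logic and shift operations. Only the seven operations named in the text are modeled; the eighth is not identified there.

```python
MASK = 0xFF  # 8-bit data path


def alu(op, a, b=0):
    """Return (result, flags) for one ALU operation on 8-bit operands.

    flags is a dict with the carry (C), half-carry (H) and zero (Z) bits.
    """
    carry = 0
    half = 0
    if op == "ADD":
        full = a + b
        carry = int(full > MASK)                      # carry out of bit 7
        half = int((a & 0x0F) + (b & 0x0F) > 0x0F)    # carry out of bit 3
    elif op == "SUB":
        full = a - b
        carry = int(a < b)                            # ASSUMED: C acts as borrow
        half = int((a & 0x0F) < (b & 0x0F))           # ASSUMED: H acts as half-borrow
    elif op == "AND":
        full = a & b
    elif op == "OR":
        full = a | b
    elif op == "NOT":
        full = ~a
    elif op == "SHL":
        full = a << 1                                 # bit 0 is filled with 0
    elif op == "SHR":
        full = a >> 1                                 # bit 7 is filled with 0
    else:
        raise ValueError(f"unknown operation: {op}")
    result = full & MASK
    return result, {"C": carry, "H": half, "Z": int(result == 0)}


print(alu("ADD", 0xFF, 0x01))  # (0, {'C': 1, 'H': 1, 'Z': 1})
```

In the hardware, all operation blocks compute in parallel and the multiplexer picks one output; the `if` chain above is only a software stand-in for that selection.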
F. RAM Data Memory
The RAM memory is a data storage block, where the stack is handled and other data are kept as variables. It is built from memory blocks of 32 locations of 8 bits each. The address input is divided in 2 parts. The first part has 3 bits, which select through a decoder the memory block to read or write. The second part has 5 bits, which select the memory location, between 0 and 31, within the block selected by the first part. In Fig. 8 appears the schematic diagram of the RAM memory.
G. Control Unit
The control unit operates as a state machine. In general, it follows the state diagram shown in Fig. 5. States 0, 1 and 3 are the same for every instruction, whereas the behavior of the second state depends on the instruction read by the decoder. State 0 represents a reset stage, state 1 represents the fetch/writeback stage, and during state 3 the program counter is incremented. For most instructions the control unit passes from state 1 to state 2 and then to state 3 before returning to state 1. The exceptions are the jump instructions and the branch instructions whose corresponding flag is set: they do not need to increment the program counter because they load a value into it, so they return to state 1 directly after state 2. Table II shows the states for each instruction.
The Control Unit was designed by simplifying the logic expressions that result from Table II. Finally, it was implemented using logic components such as logic gates and a counter that controls the transitions between states. The resulting circuit is presented in Fig. 6.
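The state sequence described for the control unit can be sketched as a small transition function. This is a behavioral illustration in Python, not the gate-level circuit of Fig. 6; the function names are hypothetical, and treating a branch whose flag is clear like an ordinary instruction (so that it passes through state 3) is a reading of the text rather than something it states explicitly.

```python
RESET, FETCH, EXECUTE, INC_PC = 0, 1, 2, 3

UNCONDITIONAL_JUMPS = {"JMP", "JMR"}
BRANCH_FLAG = {"BRC": "C", "BRZ": "Z", "BRH": "H"}  # flag tested by each branch


def next_state(state, mnemonic, flags):
    """Return the state that follows `state` for the given instruction."""
    if state == RESET:
        return FETCH
    if state == FETCH:
        return EXECUTE
    if state == EXECUTE:
        if mnemonic in UNCONDITIONAL_JUMPS:
            return FETCH                      # PC was loaded, no increment needed
        if mnemonic in BRANCH_FLAG and flags[BRANCH_FLAG[mnemonic]]:
            return FETCH                      # branch taken, PC was loaded
        return INC_PC
    if state == INC_PC:
        return FETCH
    raise ValueError(f"unknown state: {state}")


def states_for(mnemonic, flags):
    """List the states visited by one instruction, starting at fetch."""
    visited = [FETCH]
    state = next_state(FETCH, mnemonic, flags)
    while state != FETCH:
        visited.append(state)
        state = next_state(state, mnemonic, flags)
    return visited


no_flags = {"C": 0, "H": 0, "Z": 0}
print(states_for("ADD", no_flags))  # [1, 2, 3]
print(states_for("JMP", no_flags))  # [1, 2]
```

The length of the returned list is the number of clock cycles spent on the instruction: two states for a jump or taken branch, three for everything else.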
Figure 5. Control Unit States Diagram.
TABLE II. TRUTH TABLE FOR CONTROL UNIT STATE MACHINE

Instructions                         State
Reset                                0
Fetch                                1
JMP, BRC, BRZ, BRH                   2
Inc. PC                              3
LDI                                  2
LDD                                  2
STD                                  2
ADDI, SUBI                           2
ADD, SUB, AND, OR, SHL, SHR, NOT     2
LDX                                  2
STX                                  2
LDP                                  2
JMR                                  2

Control signals (table columns): GPR_SOURCE1, GPR_SOURCE0, RAM_SOURCE, ALU_SOURCE, RAM_WRITE, GPR_WRITE, PC_SOURCE, IR_LOAD, PC_CLOCK, IDR_LOAD, PC_LOAD

Figure 6. Control Unit implemented with logic components.

V. RESULTS
A. Development Tools
For the complete CPU implementation, a Xilinx Spartan-3 XC3S1000-5FG320 FPGA was used. The design was entered with the Xilinx ISE WebPACK tools in schematic mode, which allows direct gate-level design. Simulation was performed with the included Xilinx ISE Simulator, designed to work with the files generated by Xilinx ISE. To simplify the conversion of CPU test programs, a dedicated translator was written in MATLAB. It reads a Microsoft Excel file (*.xls) that contains the program in assembly language and translates it to machine code, formatted for loading into the program memory.
B. Test Application
The program used to evaluate the CPU must use all types of instructions to manipulate the stack, execute subroutines and access RAM. To meet these requirements, a program that calculates elements of the Fibonacci series was chosen. The translated test application was loaded into the program memory, and a clock signal was applied as g_clk. Key steps in the execution of the program are shown in Fig. 7 and Fig. 8: the jump to the subroutine and the storage of a computed value at its corresponding RAM address. In both figures, the first row corresponds to g_clk, the fifth row to the value of the PC, the sixth row to the OPCODE of the current instruction, and the thirteenth and fourteenth rows to RAM data and RAM address. Fig. 7 shows the jump to the subroutine, which is executed when the PC changes from h11 to h14, while Fig. 8 shows that after the return from the subroutine, the fifth element of the Fibonacci series, i.e. the number 3, is written to address 0x04.
C. Resource Consumption
After synthesis with the Xilinx XST tool and the simulation analysis described above, numerical results were obtained for the hardware resource consumption and timing performance of the processor. Table III shows the number of cycles and the time taken by the processor to execute an instruction, considering a period of 200 ns for the global clock signal (GCLK). Hardware resource consumption for the complete processor implemented in a Xilinx Spartan-3 XC3S1000-5FG320 FPGA is shown in Table IV. IOs have the highest utilization percentage, whereas the number of slice flip-flops is minimal, because the processor is largely combinational and executes each instruction in few clock cycles.
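The execution times reported in Table III follow directly from the cycle counts and the 200 ns clock period. The short Python sketch below reproduces that arithmetic; the constant and function names are illustrative only.

```python
CLOCK_PERIOD_NS = 200  # period of the global clock signal (GCLK)

# Cycle counts per instruction class, as reported in Table III.
CYCLES = {
    "Branch instructions": 2,
    "Rest of the instructions": 3,
}


def execution_time_ns(cycles, period_ns=CLOCK_PERIOD_NS):
    """Execution time of one instruction for a multi-cycle machine."""
    return cycles * period_ns


for name, cycles in CYCLES.items():
    print(f"{name}: {cycles} cycles, {execution_time_ns(cycles)} ns")
```

The same function can be used to estimate the effect of a different clock period, for example `execution_time_ns(3, period_ns=100)`.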
TABLE IV. HARDWARE RESOURCES CONSUMED

Logic Utilization             Used    Available   Utilization
Number of Slices              205     7680        2%
Number of Slice Flip Flops    98      15360       0%
Number of 4 input LUTs        410     15360       2%
Number of bonded IOBs         107     221         48%
Number of GCLKs               1       8           12%
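The utilization column of Table IV can be cross-checked from the used and available counts. The Python sketch below assumes that the tool reports the percentage truncated to an integer, which is an inference from the figures in the table (for example, 98 of 15360 flip-flops is reported as 0%) rather than something stated in the paper.

```python
# (used, available, reported utilization in percent), copied from Table IV.
TABLE_IV = {
    "Number of Slices": (205, 7680, 2),
    "Number of Slice Flip Flops": (98, 15360, 0),
    "Number of 4 input LUTs": (410, 15360, 2),
    "Number of bonded IOBs": (107, 221, 48),
    "Number of GCLKs": (1, 8, 12),
}


def utilization_percent(used, available):
    """Utilization as an integer percentage, truncated (assumed convention)."""
    return (100 * used) // available


for name, (used, available, reported) in TABLE_IV.items():
    computed = utilization_percent(used, available)
    status = "matches" if computed == reported else "differs from"
    print(f"{name}: {computed}% {status} the reported {reported}%")
```

All five rows agree with the reported values under this truncation assumption.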
VI. CONCLUSIONS
From this work one can observe the flexibility of the development platform, which leaves ample room for modifying and improving an existing design. The simplicity of a design based on functional blocks lets students understand how each part of a modern computer works and how the parts relate to each other, and lets them turn theoretical concepts into functional devices. The resulting circuit can be used as a soft-core processor in FPGA designs that need a Central Processing Unit (CPU) to handle peripherals or other devices.

VII. FUTURE WORK
To obtain a more sophisticated architecture, it is necessary to add advanced techniques such as pipelining, an interrupt handler and input/output controllers, in order to obtain a competitive general purpose 8-bit RISC processor.

ACKNOWLEDGEMENTS
The authors would like to thank SIP-IPN project 20100847 for the financial support to accomplish this project.

REFERENCES
[1] P. Verplaetse, J. Campenhout, "ESCAPE: Environment for the Simulation of Computer Architecture for the Purpose of Education", IEEE TCCA Newsletter, February, pp. 57-59, 1999.
[2] M. Jaumain, M. Osee, A. Richard, A. Vander Biest, P. Mathys, "Educational simulation of the RiSC processor", ICEE International Conference on Engineering Education, 2007.
[3] G. Wainer, S. Daicz, L. De Simoni, D. Wassermann, "Using the Alfa-1 simulated processor for educational purposes", Journal on Educational Resources in Computing, vol. 1, Issue 4, pp. 111-151, 2001.
[4] J. Djordjevic, A. Milenkovic, N. Grbanovic, "An Integrated Environment for Teaching Computer Architecture", IEEE Micro, vol. 20, Issue 3, pp. 66-74, 2000.
[5] D. Mandalidis, P. Kenterlis, J. Ellinas, "A computer architecture educational system based on a 32-bit RISC processor", International Review on Computers and Software, pp. 114-119, 2008.
[6] M. Becvar, A. Pluhacek, J. Danecek, "DOP: a CPU core for teaching basics of computer architecture", Workshop on Computer Architecture Education, Article No. 4, 2003.
[7] H. ElAarag, "A complete design of a RISC processor for pedagogical purposes", Journal of Computing Sciences in Colleges, vol. 25, Issue 2, pp. 205-213, 2009.
[8] Patterson, Hennessy, Computer Organization and Design: The Hardware/Software Interface, 2nd Edition, Morgan Kaufmann, 1998.
[9] Tocci, Widmer, Moss, Sistemas Digitales: Principios y Aplicaciones, 10ª Edición, Pearson, 2007.
[10] W. Stallings, Organización y Arquitectura de Computadoras, 7ª Edición, Pearson, 2006.
[11] M. Mano, Arquitectura de Computadoras, 3ª Edición, Pearson, 1994.
Figure 7. Simulation of test application I (Subroutine jump)
Figure 8. Simulation of test application II (Storage in RAM)
TABLE III. TIMING RESULTS

Instruction                  Number of cycles to be executed   Time to be executed
Branch instructions          2                                 400 ns
Rest of the instructions     3                                 600 ns
