
Microprocessors

A microprocessor is a computer processor on a microchip. It's sometimes called a logic chip. It is the "engine" that goes into motion when you turn your computer on. A microprocessor is designed to perform arithmetic and logic operations that make use of small number-holding areas called registers. Typical microprocessor operations include adding, subtracting, comparing two numbers, and moving numbers from one area to another. These operations are the result of a set of instructions that are part of the microprocessor design. When the computer is turned on, the microprocessor is designed to get the first instruction from the basic input/output system (BIOS) that comes with the computer as part of its memory. After that, either the BIOS, or the operating system that the BIOS loads into computer memory, or an application program is "driving" the microprocessor, giving it instructions to perform.

Intel 80486DX2 microprocessor in a ceramic PGA package



The introduction of the microprocessor in the 1970s significantly affected the design
and implementation of CPUs. Since the introduction of the first commercially available microprocessor (the Intel 4004) in 1971 and the first widely used microprocessor (the Intel 8080) in 1974,
this class of CPUs has almost completely overtaken all other central processing unit
implementation methods. Mainframe and minicomputer manufacturers of the time
launched proprietary IC development programs to upgrade their older computer
architectures, and eventually produced instruction set compatible microprocessors that
were backward-compatible with their older hardware and software. Combined with
the advent and eventual vast success of the now ubiquitous personal computer, the
term "CPU" is now applied almost exclusively to microprocessors.

Previous generations of CPUs were implemented as discrete components and numerous small integrated circuits (ICs) on one or more circuit boards.
Microprocessors, on the other hand, are CPUs manufactured on a very small number
of ICs; usually just one. The overall smaller CPU size as a result of being
implemented on a single die means faster switching time because of physical factors
like decreased gate parasitic capacitance. This has allowed synchronous
microprocessors to have clock rates ranging from tens of megahertz to several
gigahertz. Additionally, as the ability to construct exceedingly small transistors on an
IC has increased, the complexity and number of transistors in a single CPU has
increased dramatically. This widely observed trend is described by Moore's law,
which has proven to be a fairly accurate predictor of the growth of CPU (and other
IC) complexity to date.

While the complexity, size, construction, and general form of CPUs have changed
drastically over the past sixty years, it is notable that the basic design and function has
not changed much at all. Almost all common CPUs today can be very accurately
described as von Neumann stored-program machines.

As the aforementioned Moore's law continues to hold true, concerns have arisen about
the limits of integrated circuit transistor technology. Extreme miniaturization of
electronic gates is causing the effects of phenomena like electromigration and
subthreshold leakage to become much more significant. These newer concerns are
among the many factors causing researchers to investigate new methods of computing
such as the quantum computer, as well as to expand the usage of parallelism and other
methods that extend the usefulness of the classical von Neumann model.

80C186 Family Execution Unit


The Execution Unit executes all instructions, provides data and addresses to the Bus
Interface Unit and manipulates the general registers and the Processor Status Word.
The 16-bit ALU within the Execution Unit maintains the μp status and control flags
and manipulates the general registers and instruction operands. All registers and data
paths in the Execution Unit are 16 bits wide for fast internal transfers.
The Execution Unit does not connect directly to the system bus. It obtains instructions from a queue maintained by the Bus Interface Unit. When an instruction requires access to
memory or a peripheral device, the Execution Unit requests the Bus Interface Unit to
read and write data. Addresses manipulated by the Execution Unit are 16 bits wide.
The Bus Interface Unit, however, performs an address calculation that allows the
Execution Unit to access the full megabyte of memory space.
To execute an instruction, the Execution Unit must first fetch the object code byte
from the instruction queue and then execute the instruction. If the queue is empty
when the Execution Unit is ready to fetch an instruction byte, the Execution Unit
waits for the Bus Interface Unit to fetch the instruction byte.
80C186 Family Bus Interface Unit
The 80C186 family Bus Interface Units are functionally identical. They are
implemented differently to match the structure and performance characteristics of
their respective system buses. The Bus Interface Unit executes all external bus cycles.
This unit consists of the segment registers, the Instruction Pointer, the instruction code
queue and several miscellaneous registers. The Bus Interface Unit transfers data to
and from the Execution Unit on the ALU data bus.
Registers
Registers are used for a variety of purposes, such as holding the address of instructions and data, storing the result of an operation, signaling the result of a logic operation, or indicating the status of the program or the μp itself. Some registers may be accessible to programmers, while others are reserved for use by the μp itself. Registers store binary values such as 1 or 0 as electrical voltages of, say, 5 volts or 0 volts. They consist of several integrated transistors which are configured as flip-flop circuits, each of which can be switched into a 1 or 0 state. They remain in that state until changed under control of the μp or until power is removed from the processor. Each register has a specific name and is addressable; some, however, are dedicated to specific tasks while the majority are general purpose. The width of a register depends on the type of μp, e.g., a 16-, 32- or 64-bit microprocessor. In order to provide backward compatibility, registers may be sub-divided. For example, the Pentium processor is a 32-bit CPU, and its registers are 32 bits wide. Some of these are sub-divided and named as 8- and 16-bit registers in order to run 8- and 16-bit applications designed for earlier x86 microprocessors.
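This sub-division can be pictured with a few bit operations. The following is a minimal C sketch (the register values are made up for illustration) of how an 8- or 16-bit sub-register is just a slice of the wider register:

#include <stdint.h>
#include <stdio.h>

/* Illustrative model (values made up) of a 32-bit x86 register (EAX)
   and its 16-bit (AX) and 8-bit (AH, AL) sub-registers. */
int main(void) {
    uint32_t eax = 0x12345678;

    uint16_t ax = (uint16_t)(eax & 0xFFFF);      /* low 16 bits of EAX */
    uint8_t  al = (uint8_t)(ax & 0xFF);          /* low byte of AX     */
    uint8_t  ah = (uint8_t)((ax >> 8) & 0xFF);   /* high byte of AX    */

    printf("EAX=%08X AX=%04X AH=%02X AL=%02X\n",
           (unsigned)eax, ax, ah, al);

    /* Writing AL changes only the low byte of EAX. */
    al = 0xFF;
    eax = (eax & 0xFFFFFF00u) | al;
    printf("after AL write: EAX=%08X\n", (unsigned)eax);
    return 0;
}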
Instruction Register
When the Bus Interface Unit receives an instruction it transfers it to the Instruction Register for temporary storage. In Pentium processors the Bus Interface Unit transfers instructions to the L1 I-Cache; there is no instruction register as such.
Stack Pointer
A stack is a small area of reserved memory used to store the data in the μp's registers when: (1) system calls are made by a process to operating system routines; (2) hardware interrupts are generated by input/output (I/O) transactions on peripheral devices; (3) a process initiates an I/O transfer; or (4) a process rescheduling event occurs as a result of a hardware timer interrupt. This transfer of register contents is called a context switch. The stack pointer is the register which holds the address of the most recent stack entry. Hence, when a system call is made by a process (to, say, print a document) and its context is stored on the stack, the called system routine uses the stack pointer to reload the register contents when it is finished printing. Thus the process can continue where it left off.
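A minimal C sketch of this idea follows; the stack, register values and sizes are invented for illustration and are not taken from any real operating system:

#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: saving and restoring register contents on a
   downward-growing stack, addressed by a stack pointer (SP). */

#define STACK_WORDS 64

static uint16_t stack[STACK_WORDS];
static int sp = STACK_WORDS;            /* SP starts past the top; stack grows down */

static void     push(uint16_t value) { stack[--sp] = value; }
static uint16_t pop(void)            { return stack[sp++]; }

int main(void) {
    uint16_t ax = 0x1234, bx = 0xBEEF;   /* hypothetical register contents */

    /* "Context switch": save registers before servicing a call or interrupt. */
    push(ax);
    push(bx);

    ax = 0; bx = 0;                      /* registers reused by the service routine */

    /* Restore in reverse order when the routine returns. */
    bx = pop();
    ax = pop();
    printf("restored AX=%04X BX=%04X, SP=%d\n", ax, bx, sp);
    return 0;
}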
80C186 Family General Registers
The 80C186 family CPU has eight 16-bit general registers (see Figure 2). The general registers are
subdivided into two sets of four registers. These sets are the data registers (also called the H & L group
for high and low) and the pointer and index registers (also called the P & I group).
Figure 2. 80C186 Family General Registers
The data registers can be addressed by their upper or lower halves. Each data register
can be used interchangeably as a 16-bit register or two 8-bit registers. The pointer
registers are always accessed as 16-bit values. The μp can use data registers without
constraint in most arithmetic and logic operations. Arithmetic and logic operations can
also use the pointer and index registers. Some instructions use certain registers
implicitly (see Table 1), allowing compact encoding.
Table 1. Implicit Use of General Registers
Register Operations
AX Word Multiply, Word Divide, Word I/O
AL Byte Multiply, Byte Divide, Byte I/O, Translate, Decimal Arithmetic
AH Byte Multiply, Byte Divide
BX Translate
CX String Operations, Loops
CL Variable Shift and Rotate
DX Word Multiply, Word Divide, Indirect I/O
SP Stack Operations
SI String Operations
DI String Operations

The contents of the general-purpose registers are undefined following a processor reset.
Instruction Decoder
The Instruction Decoder is an arrangement of logic elements which act on the bits that constitute the instruction. Simple instructions with corresponding logic hard-wired into the execution unit are simply passed to the Execution Unit; complex instructions are decoded so that related microcode modules can be transferred from the μp's microcode ROM to the execution unit. The Instruction Decoder will also store referenced operands in appropriate registers so that data at the referenced memory locations can be fetched.
Accumulator
The accumulator may contain data to be used in a mathematical or logical operation,
or it may contain the result of an operation. General purpose registers are used to
support the accumulator by holding data to be loaded to/from the accumulator.
Computer Status (Flag) Register
The result of an ALU operation may have consequences for subsequent operations; for example, changing the path of execution. Individual bits in this register are set or reset in accordance with the result of mathematical or logical operations. Also called flags, the bits in the register each have a pre-assigned meaning, and the contents are monitored by the control unit to help control μp-related actions. Bits in FL reflect ALU status and μp (interrupt) status, e.g. Negative (N), Zero (Z), Carry (C), Parity Flag (PF), Overflow (V), Interrupt (IF), Trap (TF).
80C186 Family Flags
The 80C186 family has six status flags (see Figure 3) that the Execution Unit posts as
the result of arithmetic or logical operations. Program branch instructions allow a
program to alter its execution depending on conditions flagged by a prior operation.
Different instructions affect the status flags differently, generally reflecting the
following states:
1. If the Auxiliary Flag (AF) is set, there has been a carry out from the low nibble into the high nibble or a borrow from the high nibble into the low nibble of an 8-bit quantity (low-order byte of a 16-bit quantity). This flag is used by decimal arithmetic instructions.
2. If the Carry Flag (CF) is set, there has been a carry out of or a borrow into the high-order bit of the instruction result (8- or 16-bit). This flag is used by instructions that add or subtract multibyte numbers. Rotate instructions can also isolate a bit in memory or a register by placing it in the Carry Flag.
3. If the Overflow Flag (OF) is set, an arithmetic overflow has occurred. A significant digit has been lost because the size of the result exceeded the capacity of its destination location. An Interrupt On Overflow instruction is available that will generate an interrupt in this situation.
4. If the Sign Flag (SF) is set, the high-order bit of the result is a 1. Since negative binary numbers are represented in standard two's complement notation, SF indicates the sign of the result (0 = positive, 1 = negative).
5. If the Parity Flag (PF) is set, the result has even parity, an even number of 1 bits. This flag can be used to check for data transmission errors.
6. If the Zero Flag (ZF) is set, the result of the operation is zero.

Additional control flags (see Figure 3) can be set or cleared by programs to alter processor operations:
1. Setting the Direction Flag (DF) causes string operations to auto-decrement. Strings are processed from high address to low address (or "right to left"). Clearing DF causes string operations to auto-increment. Strings are processed from low address to high address (or "left to right").
2. Setting the Interrupt Enable Flag (IF) allows the μp to recognize maskable external or internal interrupt requests. Clearing IF disables these interrupts. The Interrupt Enable Flag has no effect on software interrupts or non-maskable interrupts.
3. Setting the Trap Flag (TF) bit puts the processor into single-step mode for debugging. In this mode, the CPU automatically generates an interrupt after each instruction. This allows a program to be inspected instruction by instruction during execution.

The status and control flags are contained in a 16-bit Processor Status Word (see
Figure 3). Reset initializes the Processor Status Word to 0F000H.

Figure 3. 80C186 Family Processor Status Word
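To make the individual flag definitions concrete, here is a small C sketch (operand values chosen arbitrarily) that derives the status flags above for an 8-bit addition; it mirrors the definitions, not the 80C186's internal circuitry:

#include <stdint.h>
#include <stdio.h>

/* Sketch: deriving the status flags described above for an 8-bit add. */
int main(void) {
    uint8_t a = 0x7C, b = 0x09;
    uint16_t wide = (uint16_t)a + b;                /* keep the carry-out bit  */
    uint8_t  r = (uint8_t)wide;

    int cf = (wide >> 8) & 1;                       /* carry out of bit 7      */
    int zf = (r == 0);                              /* result is zero          */
    int sf = (r >> 7) & 1;                          /* sign (high-order) bit   */
    int af = ((a ^ b ^ r) >> 4) & 1;                /* carry from low nibble   */
    int of = ((a ^ r) & (b ^ r) & 0x80) != 0;       /* signed overflow         */
    int pf = 1;                                     /* even parity of result   */
    for (uint8_t t = r; t; t >>= 1) pf ^= t & 1;

    printf("%02X + %02X = %02X  CF=%d ZF=%d SF=%d AF=%d OF=%d PF=%d\n",
           a, b, r, cf, zf, sf, af, of, pf);
    return 0;
}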


80C186 Family Segment Registers
The 80C186 family memory space is 1 Mbyte in size and divided into logical
segments of up to 64 Kbytes each. The CPU has direct access to four segments at a
time. The segment registers contain the base addresses (starting locations) of these
memory segments (see Figure 4). The CS register points to the current code segment,
which contains instructions to be fetched. The SS register points to the current stack
segment, which is used for all stack operations. The DS register points to the current
data segment, which generally contains program variables. The ES register points to
the current extra segment, which is typically used for data storage. The CS register
initializes to 0FFFFH, and the SS, DS and ES registers initialize to 0000H. Programs
can access and manipulate the segment registers with several instructions.

Figure 4. 80C186 Family Segment Register


Program or Instruction Counter
The Program Counter (PC) is the register that stores the address in primary memory (RAM or ROM) of the next instruction to be executed. In 32-bit systems, this is a 32-bit linear or virtual memory address that references a byte (the first of 4 required to store the 32-bit instruction) in the process's virtual memory address space. This value is translated to determine the real memory address in which the instruction is stored. When the referenced instruction is fetched, the address in the PC is incremented to the address of the next instruction to be executed. If the current address is 00B0 hex, then the next address will be 00B4 hex. Remember that each byte in RAM is individually addressable; however, each complete instruction is 32 bits or 4 bytes, so the address of the next instruction in the process will be 4 bytes on.
80C186 Family Instruction Pointer
The Bus Interface Unit updates the 16-bit Instruction Pointer (IP) register so it
contains the offset of the next instruction to be fetched. Programs do not have direct
access to the Instruction Pointer, but it can change, be saved or be restored as a result
of program execution. For example, if the Instruction Pointer is saved on the stack, it
is first automatically adjusted to point to the next instruction to be executed. Reset
initializes the Instruction Pointer to 0000H. The CS and IP values comprise a starting
execution address of 0FFFF0H.
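The segment-plus-offset calculation performed by the Bus Interface Unit can be sketched in a couple of lines of C; the first example value simply reproduces the reset case mentioned above (CS = 0FFFFH, IP = 0000H) and the second is a made-up data access:

#include <stdint.h>
#include <stdio.h>

/* Sketch of the 8086/80C186 address calculation: the 16-bit segment
   register is shifted left 4 bits and added to the 16-bit offset,
   giving a 20-bit physical address into the 1 Mbyte space. */
static uint32_t physical_address(uint16_t segment, uint16_t offset) {
    return (((uint32_t)segment << 4) + offset) & 0xFFFFF;   /* 20-bit result */
}

int main(void) {
    /* Reset values: CS = 0FFFFH, IP = 0000H -> starting address 0FFFF0H. */
    printf("reset vector: %05X\n", physical_address(0xFFFF, 0x0000));

    /* A hypothetical data access: DS = 1234H, offset 0010H -> 12350H. */
    printf("data access : %05X\n", physical_address(0x1234, 0x0010));
    return 0;
}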
Arithmetic and Logic Unit
The Arithmetic and Logic Unit (ALU) performs all arithmetic and logic operations in a microprocessor, e.g. addition, subtraction, logical AND, OR, EX-OR, etc. A typical ALU is connected to the accumulator, the general-purpose registers and other μp components that help transfer the result of its operations to RAM via the Bus Interface Unit and the system bus. The results may also be written into internal or external caches.
Control Unit
The control unit coordinates and manages μp activities, in particular the execution of instructions by the arithmetic and logic unit (ALU). In Pentium processors its role is complex, as microcode from decoded instructions is pipelined for execution by two ALUs.
The System Clock
The Intel 8088 had a clock speed of 4.77 MHz; that is, its internal logic gates were opened and closed under the control of a square-wave pulsed signal that had a frequency of 4.77 million cycles per second. Alternatively put, the logic gates opened and closed 4.77 million times per second. Thus, instructions and data were pumped through the integrated transistor logic circuits at a rate of 4.77 million bits per second. Later designs ran at higher speeds, e.g. the i286 at 8-20 MHz, the i386 at 16-33 MHz and the i486 at 25-50 MHz. Where does this clock signal come from? Each motherboard is fitted with a quartz oscillator in a metal package that generates a square-wave clock pulse of a certain frequency. In i8088 systems the crystal oscillator ran at 14.318 MHz and this was fed to the i8284 to generate the system clock frequency of 4.77 MHz in earlier systems, and up to 10 MHz in later designs. Later, the i286 PCs had a 12 MHz crystal which provided the i82284 multiplier/divider IC with the primary clock signal. This then divided/multiplied the basic 12 MHz to generate the system clock signal of 8-20 MHz. With the advent of the i486DX, the system clock signal, which ran at 25 or 33 MHz, was effectively multiplied by factors of 2 and 3 to deliver an internal μp clock speed of 50, 66, 75 or 100 MHz. This approach is used in Pentium 4 architectures, where the primary crystal source delivers a relatively slow 50 MHz clock signal that is then multiplied to the system clock speed of 100-133 MHz. The internal multiplier in the Pentium then multiplies this by a factor of 20 or more to obtain speeds of 2 GHz and above.
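The divide-and-multiply arithmetic above is simple, but a few lines of C make the relationships explicit; the bus frequency and multiplier in the second example are made-up, Pentium-era-style values:

#include <stdio.h>

/* Illustrative arithmetic only: deriving clock rates from a crystal
   frequency via dividers and multipliers, as described above. */
int main(void) {
    double crystal_mhz = 14.318;
    printf("8088 system clock : %.2f MHz\n", crystal_mhz / 3.0);    /* ~4.77 MHz */

    double bus_mhz = 133.0, multiplier = 15.0;   /* hypothetical Pentium-era values */
    printf("CPU core clock    : %.0f MHz\n", bus_mhz * multiplier); /* ~2 GHz */
    return 0;
}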
Instruction Cycle
An instruction cycle consists of the activities required to fetch and execute an instruction. The length of time taken to fetch and execute is measured in clock cycles. When the μp finishes the execution of an instruction, it transfers the content of the program counter (instruction pointer) register to the Bus Interface Unit (1 clock cycle). This is then gated onto the system address bus and the read signal is asserted on the control bus (1 clock cycle). This is a signal to the RAM controller that the value at this address is to be read from memory and loaded onto the data bus (4+ clock cycles). The instruction is read in from the data bus and decoded (2+ clock cycles). The fetch and decode activities constitute the first machine cycle of the instruction cycle. The second machine cycle begins when the instruction's operand is read from RAM and ends when the instruction is executed and the result written back to memory. This will take at least another 8+ clock cycles, depending on the complexity of the instruction. Thus an instruction cycle will take at least 16 clock cycles, a considerable length of time. However, Intel made advances by superpipelining instructions, that is, by interleaving the fetch, decode, operand read, execute and retire (i.e. write the result of the instruction to RAM) activities into two separate pipelines serving two ALUs. Hence, instructions are not executed sequentially, but concurrently and in parallel; more about pipelining later.
Microprocessor tasks
Microprocessors must perform the following activities:
1. Provide temporary storage for addresses and data
2. Perform arithmetic and logic operations
3. Control and schedule all operations.
Operation of the CPU is sequential (a C rendering of this loop is sketched below):
Repeat
• If (Reset) { Execute reset sequence }
• Fetch (read) the first/next instruction from memory into the IR
• Read operands, if required, from memory or I/O
• Execute the instruction within the microprocessor
• Write results, if required, to memory or I/O
• If (Interrupt) { Execute interrupt sequence }
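As promised, here is a minimal, self-contained C rendering of that loop; the three-word "program" and the reset/interrupt flags are placeholders, so only the control flow is meaningful:

#include <stdbool.h>
#include <stdio.h>

/* Minimal rendering of the sequential operation listed above.  The
   "instruction set" here is a placeholder: each instruction is just a
   number that gets printed. */

static unsigned memory[] = { 10, 20, 30 };       /* pretend program       */
static unsigned pc, ir;                          /* program counter, IR   */
static bool reset_pending = true, interrupt_pending = false;

int main(void) {
    for (int step = 0; step < 3; ++step) {       /* Repeat (bounded here) */
        if (reset_pending) { pc = 0; reset_pending = false; }
        ir = memory[pc++];                       /* fetch into IR          */
        /* read operands, if required, from memory or I/O (omitted)        */
        printf("executing instruction %u\n", ir);/* execute                */
        /* write results, if required, to memory or I/O (omitted)          */
        if (interrupt_pending) { /* execute interrupt sequence */ }
    }
    return 0;
}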

Code density

In early computers, program memory was expensive and limited, and minimizing the
size of a program in memory was important. Thus the code density -- the combined
size of the instructions needed for a particular task -- was an important characteristic
of an instruction set. Instruction sets with high code density employ powerful
instructions that can implicitly perform several functions at once. Typical complex
instruction-set computers (CISC) have instructions that combine one or two basic
operations (such as "add", "multiply", or "call subroutine") with implicit instructions
for accessing memory, incrementing registers upon use, or dereferencing locations
stored in memory or registers. Some software-implemented instruction sets have even
more complex and powerful instructions.

Reduced instruction-set computers (RISC), first widely implemented during a period of rapidly-growing memory subsystems, traded off simpler and faster instruction-set
implementations for lower code density (that is, more program memory space to
implement a given task). RISC instructions typically implemented only a single
implicit operation, such as an "add" of two registers or the "load" of a memory
location into a register.

Minimal instruction set computers (MISC) are a form of stack machine, where there are few separate instructions (16-64), so that multiple instructions can fit into a single machine word. Cores of this type often take little silicon to implement, so they can be easily realized in an FPGA or in a multi-core form. Code density is similar to RISC; the increased instruction density is offset by requiring more of the primitive instructions to do a task.

Instruction sets may be categorized by the number of operands in their most complex
instructions. (In the examples that follow, a, b, and c refer to memory addresses, and
reg1 and so on refer to machine registers.)

• 0-operand ("zero address machines") -- these are also called stack machines, and
all operations take place using the top one or two positions on the stack. Adding
two numbers here can be done with four instructions: push a, push b, add, pop c;
• 1-operand -- this model was common in early computers, and each instruction
performs its operation using a single operand and places its result in a single
accumulator register: load a, add b, store c;
• 2-operand -- most RISC machines fall into this category, though many CISC machines also fall here as well. For a RISC machine (requiring explicit memory loads), the instructions would be: load a,reg1, load b,reg2, add reg1,reg2, store reg2,c;
• 3-operand -- some CISC machines, and a few RISC machines fall into this
category. The above example here might be performed in a single instruction in a
machine with memory operands: add a,b,c, or more typically (most machines
permit a maximum of two memory operations even in three-operand instructions):
move a,reg1, add reg1,b,c. In three-operand RISC machines, all three operands
are typically registers, so explicit load/store instructions are needed. An instruction
set with 32 registers requires 15 bits to encode three register operands, so this
scheme is typically limited to instruction sets with 32-bit instructions or longer;
• more operands -- some CISC machines permit a variety of addressing modes that
allow more than 3 register-based operands for memory accesses.
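As noted above, here is a minimal C sketch of the 0-operand (stack machine) sequence push a, push b, add, pop c; the stack and helper functions are modelled directly in C purely for illustration:

#include <stdio.h>

/* Sketch of the 0-operand (stack machine) example from the list above:
   push a; push b; add; pop c. */

static int stack[16];
static int top = 0;                      /* number of items on the stack */

static void push(int v) { stack[top++] = v; }
static int  pop(void)   { return stack[--top]; }
static void add(void)   { int b = pop(), a = pop(); push(a + b); }

int main(void) {
    int a = 2, b = 3, c;

    push(a);        /* push a */
    push(b);        /* push b */
    add();          /* add: pops two operands, pushes the sum */
    c = pop();      /* pop c */

    printf("c = %d\n", c);   /* prints 5 */
    return 0;
}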

There has been research into executable compression as a mechanism for improving
code density. The mathematics of Kolmogorov complexity describes the challenges
and limits of this.

Complex instruction set computer



A complex instruction set computer (CISC) is a microprocessor instruction set architecture (ISA) in which each instruction can execute several low-level operations,
such as a load from memory, an arithmetic operation, and a memory store, all in a
single instruction. The term was retroactively coined in contrast to reduced instruction
set computer (RISC).

Before the first RISC processors were designed, many computer architects tried to
bridge the "semantic gap" - to design instruction sets to support high-level
programming languages by providing "high-level" instructions such as procedure call
and return, loop instructions such as "decrement and branch if non-zero" and complex
addressing modes to allow data structure and array accesses to be combined into
single instructions. The compact nature of such a CISC ISA results in smaller program
sizes and fewer calls to main memory, which at the time (the 1960s) resulted in a
tremendous savings on the cost of a computer.

While many designs achieved the aim of higher throughput at lower cost and also
allowed high-level language constructs to be expressed by fewer instructions, it was
observed that this was not always the case. For instance, badly designed, or low-end
versions of complex architectures (which used microcode to implement many
hardware functions) could lead to situations where it was possible to improve
performance by not using a complex instruction (such as a procedure call instruction),
but instead using a sequence of simpler instructions.

One reason for this was that such high-level instruction sets, often also highly encoded (for compact executable code), may be quite complicated to decode and execute efficiently within a limited transistor budget. These architectures therefore require a great deal of work on the part of the processor designer (or a slower microcode solution). At a time when transistors were a limited resource, this also left less room on the processor to optimize performance in other ways, which gave room for the ideas that led to the original RISC designs in the mid-1970s (the IBM 801, at IBM's Watson Research Center).

Examples of CISC processors are the System/360, VAX, PDP-11, Motorola 68000
family, and Intel x86 architecture based processors.

The terms RISC and CISC have become less meaningful with the continued evolution
of both CISC and RISC designs and implementations. The first highly pipelined
"CISC" implementations, such as 486s from Intel, AMD, Cyrix, and IBM, certainly
supported every instruction that their predecessors did, but achieved high efficiency
only on a fairly simple x86 subset (resembling a RISC instruction set, but without the
load-store limitations of RISC). Modern x86 processors also decode and split more
complex instructions into a series of smaller internal "micro-operations" which can
thereby be executed in a pipelined (parallel) fashion, thus achieving high performance
on a much larger subset of instructions.

Reduced instruction set computer



The reduced instruction set computer, or RISC, is a CPU design philosophy that favors a smaller and simpler set of instructions. The most
common RISC microprocessors are Alpha, ARC, ARM, AVR, MIPS, PA-RISC, PIC,
Power Architecture, and SPARC.

The idea was originally inspired by the discovery that many of the features that were
included in traditional CPU designs to facilitate coding were being ignored by the
programs that were running on them. Also these more complex features took several
processor cycles to be performed. Additionally, the performance gap between the
processor and main memory was increasing. This led to a number of techniques to
streamline processing within the CPU, while at the same time attempting to reduce the
total number of memory accesses.

Pre-RISC design philosophy



In the early days of the computer industry, compiler technology did not exist at all.
Programming was done in either machine code or assembly language. To make
programming easier, computer architects created more and more complex
instructions, which were direct representations of high level functions of high level
programming languages. The attitude at the time was that hardware design was easier
than compiler design, so the complexity went into the hardware.

Another force that encouraged complexity was the lack of large memory. Since memory was small, it was advantageous for the density of information held in computer programs to be very high. When every byte of memory was precious - for example, when one's entire system had only a few kilobytes of storage - the industry moved to such features as highly encoded instructions, instructions which could be variable sized, instructions which did multiple operations, and instructions which did both data movement and data calculation. At that time, such instruction packing issues were of higher priority than the ease of decoding such instructions.

Another reason to keep the density of information high was that memory was not only
small, but also quite slow, usually implemented using ferrite core memory technology.
By having dense information packing, one could decrease the frequency with which
one had to access this slow resource.

CPUs had few registers for two reasons:

• Bits in internal CPU registers are always more expensive than bits in external memory. The available level of silicon integration of the day meant large register sets would have been burdensome to the chip area or board areas available.
• Having a large number of registers would have required a large number of instruction bits (using precious RAM) to be used as register specifiers.

For the above reasons, CPU designers tried to make instructions that would do as
much work as possible. This led to one instruction that would do all of the work in a
single instruction: load up the two numbers to be added, add them, and then store the
result back directly to memory. Another version would read the two numbers from
memory, but store the result in a register. Another version would read one from
memory and the other from a register and store to memory again. And so on. This
processor design philosophy eventually became known as Complex Instruction Set
Computer (CISC) once the RISC philosophy came onto the scene.
The general goal at the time was to provide every possible addressing mode for every
instruction, a principle known as "orthogonality." This led to some complexity on the
CPU, but in theory each possible command could be tuned individually, making the
design faster than if the programmer used simpler commands.

The ultimate expression of this sort of design can be seen at two ends of the power
spectrum, the 6502 at one end, and the VAX at the other. The $25 single-chip 1 MHz
6502 had only a single general-purpose register, but its simplistic single-cycle
memory interface allowed byte-wide operations to perform almost on par with
significantly higher clocked designs, such as a 4 MHz Zilog Z80 using equally slow
memory chips (i.e. approx. 300ns). The VAX was a minicomputer whose initial
implementation required 3 racks of equipment for a single CPU, and was notable for
the amazing variety of memory access styles it supported, and the fact that every one
of them was available for every instruction.

RISC design philosophy


In the late 1970s researchers at IBM (and similar projects elsewhere) demonstrated
that the majority of these "orthogonal" addressing modes were ignored by most
programs. This was a side effect of the increasing use of compilers to generate the
programs, as opposed to writing them in assembly language. The compilers in use at
the time only had a limited ability to take advantage of the features provided by CISC
CPUs; this was largely a result of the difficulty of writing a compiler. The market was
clearly moving to even wider use of compilers, diluting the usefulness of these
orthogonal modes even more.

Another discovery was that, since these operations were rarely used, they in fact tended to be slower than a number of smaller operations doing the same thing. This seeming paradox was a side effect of the time spent designing the CPUs: designers simply did not have time to tune every possible instruction, and instead tuned only the ones used most often. One famous example of this was the VAX's INDEX instruction, which ran slower than a loop implementing the same code.

At about the same time CPUs started to run even faster than the memory they talked
to. Even in the late 1970s it was apparent that this disparity was going to continue to
grow for at least the next decade, by which time the CPU would be tens to hundreds
of times faster than the memory. It became apparent that more registers (and later
caches) would be needed to support these higher operating frequencies. These
additional registers and cache memories would require sizeable chip or board areas
that could be made available if the complexity of the CPU was reduced.

Yet another part of RISC design came from practical measurements on real-world
programs. Andrew Tanenbaum summed up many of these, demonstrating that most
processors were vastly overdesigned. For instance, he showed that 98% of all the
constants in a program would fit in 13 bits, yet almost every CPU design dedicated
some multiple of 8 bits to storing them, typically 8, 16 or 32, one entire word. Taking
this fact into account suggests that a machine should allow for constants to be stored
in unused bits of the instruction itself, decreasing the number of memory accesses.
Instead of loading up numbers from memory or registers, they would be "right there"
when the CPU needed them, and therefore much faster. However this required the
operation itself to be very small, otherwise there would not be enough room left over
in a 32-bit instruction to hold reasonably sized constants.
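The idea of carrying a constant "right there" in the instruction word can be illustrated with a few shifts and masks in C. The field layout below (6-bit opcode, two 5-bit register numbers, 16-bit immediate) is invented for the sketch; it resembles, but is not taken from, any particular ISA:

#include <stdint.h>
#include <stdio.h>

/* Packing a small constant into a fixed 32-bit instruction word. */

static uint32_t encode(unsigned opcode, unsigned rd, unsigned rs, int16_t imm) {
    return ((uint32_t)opcode << 26) | ((uint32_t)rd << 21) |
           ((uint32_t)rs << 16) | (uint16_t)imm;
}

int main(void) {
    uint32_t insn = encode(8 /* hypothetical ADDI opcode */, 3, 1, 100);

    unsigned opcode =  insn >> 26;                  /* decode the fields back out */
    unsigned rd     = (insn >> 21) & 0x1F;
    unsigned rs     = (insn >> 16) & 0x1F;
    int16_t  imm    = (int16_t)(insn & 0xFFFF);     /* the constant, no memory access */

    printf("opcode=%u rd=%u rs=%u imm=%d\n", opcode, rd, rs, imm);
    return 0;
}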

Since real-world programs spent most of their time executing very simple operations,
some researchers decided to focus on making those common operations as simple and
as fast as possible. Since the clock rate of the CPU is limited by the time it takes to
execute the slowest instruction, speeding up that instruction -- perhaps by reducing the
number of addressing modes it supports -- also speeds up the execution of every other
instruction. The goal of RISC was to make instructions so simple that each one could be executed in a single clock cycle.[1] The focus on "reduced instructions" led to the
resulting machine being called a "reduced instruction set computer" (RISC).

The main difference between RISC and CISC is that RISC architecture instructions
either (a) perform operations on the registers or (b) load and store the data to and from
them. Many CISC instructions, on the other hand, combine these steps. To clarify this
difference, many researchers use the term load-store to refer to RISC.

Over time the older design technique became known as Complex Instruction Set
Computer, or CISC, although this was largely to give it a different name for
comparison purposes.

Code was implemented as a series of these simple instructions, instead of a single complex instruction that had the same result. This had the side effect of leaving more
room in the instruction to carry data with it, meaning that there was less need to use
registers or memory. At the same time the memory interface was considerably
simpler, allowing it to be tuned.

However RISC also had its drawbacks. Since a series of instructions is needed to
complete even simple tasks, the total number of instructions read from memory is
larger, and therefore takes longer. At the time it was not clear whether or not there
would be a net gain in performance due to this limitation, and there was an almost
continual battle in the press and design world about the RISC concepts.

Bus
In computer architecture, a bus is a subsystem that transfers data or power between
computer components inside a computer or between computers and typically is
controlled by device driver software. Unlike a point-to-point connection, a bus can
logically connect several peripherals over the same set of wires. Each bus defines its
set of connectors to physically plug devices, cards or cables together.

Early computer buses were literally parallel electrical buses with multiple
connections, but the term is now used for any physical arrangement that provides the
same logical functionality as a parallel electrical bus. Modern computer buses can use
both parallel and bit-serial connections, and can be wired in either a multidrop
(electrical parallel) or daisy chain topology, or connected by switched hubs, as in the
case of USB.

First generation

Early computer buses were bundles of wire that attached memory and peripherals.
They were named after electrical buses, or busbars. Almost always, there was one bus
for memory, and another for peripherals, and these were accessed by separate
instructions, with completely different timings and protocols.

One of the first complications was the use of interrupts. Early computers performed
I/O by waiting in a loop for the peripheral to become ready. This was a waste of time
for programs that had other tasks to do. Also, if the program attempted to perform
those other tasks, it might take too long for the program to check again, resulting in
lost data. Engineers thus arranged for the peripherals to interrupt the CPU. The
interrupts had to be prioritized, because the CPU can only execute code for one
peripheral at a time, and some devices are more time-critical than others.

Second generation

"Second generation" bus systems like NuBus addressed some of these problems.
They typically separated the computer into two "worlds", the CPU and memory on
one side, and the various devices on the other, with a bus controller in between. This
allowed the CPU to increase in speed without affecting the bus. This also moved
much of the burden for moving the data out of the CPU and into the cards and
controller, so devices on the bus could talk to each other with no CPU intervention.
This led to much better "real world" performance, but also required the cards to be
much more complex. These buses also often addressed speed issues by being "bigger"
in terms of the size of the data path, moving from 8-bit parallel buses in the first
generation, to 16 or 32-bit in the second, as well as adding software setup (now
standardised as Plug-n-play) to supplant or replace the jumpers.

However these newer systems shared one quality with their earlier cousins, in that
everyone on the bus had to talk at the same speed. While the CPU was now isolated
and could increase speed without fear, CPUs and memory continued to increase in
speed much faster than the buses they talked to. The result was that the bus speeds
were now very much slower than what a modern system needed, and the machines
were left starved for data. A particularly common example of this problem was that
video cards quickly outran even the newer bus systems like PCI, and computers
began to include AGP just to drive the video card. By 2004 AGP was outgrown again
by high-end video cards and is being replaced with the new PCI Express bus.

Third generation

"Third generation" buses are now in the process of coming to market, including
HyperTransport and InfiniBand. They typically include features that allow them to
run at the very high speeds needed to support memory and video cards, while also
supporting lower speeds when talking to slower devices such as disk drives. They also
tend to be very flexible in terms of their physical connections, allowing them to be
used both as internal buses, as well as connecting different machines together. This
can lead to complex problems when trying to service different requests, so much of
the work on these systems concerns software design, as opposed to the hardware
itself. In general, these third generation buses tend to look more like a network than
the original concept of a bus, with a higher protocol overhead needed than early
systems, while also allowing multiple devices to use the bus at once.

CPU operation
The fundamental operation of most CPUs, regardless of the physical form they take,
is to execute a sequence of stored instructions called a program. Discussed here are
devices that conform to the common von Neumann architecture. The program is
represented by a series of numbers that are kept in some kind of computer memory.
There are four steps that nearly all von Neumann CPUs use in their operation: fetch,
decode, execute, and writeback.

Diagram showing how one MIPS32 instruction is decoded. (MIPS Technologies 2005)

The first step, fetch, involves retrieving an instruction (which is represented by a number or sequence of numbers) from program memory. The location in program
memory is determined by a program counter (PC), which stores a number that
identifies the current position in the program. In other words, the program counter
keeps track of the CPU's place in the current program. After an instruction is fetched,
the PC is incremented by the length of the instruction word in terms of memory
units.[3] Often the instruction to be fetched must be retrieved from relatively slow
memory, causing the CPU to stall while waiting for the instruction to be returned.
This issue is largely addressed in modern processors by caches and pipeline
architectures (see below).

The instruction that the CPU fetches from memory is used to determine what the CPU
is to do. In the decode step, the instruction is broken up into parts that have
significance to other portions of the CPU. The way in which the numerical instruction
value is interpreted is defined by the CPU's instruction set architecture (ISA).[4] Often,
one group of numbers in the instruction, called the opcode, indicates which operation
to perform. The remaining parts of the number usually provide information required
for that instruction, such as operands for an addition operation. Such operands may be
given as a constant value (called an immediate value), or as a place to locate a value: a
register or a memory address, as determined by some addressing mode. In older
designs the portions of the CPU responsible for instruction decoding were
unchangeable hardware devices. However, in more abstract and complicated CPUs
and ISAs, a microprogram is often used to assist in translating instructions into
various configuration signals for the CPU. This microprogram is sometimes
rewritable so that it can be modified to change the way the CPU decodes instructions
even after it has been manufactured.

Block diagram of a simple CPU

After the fetch and decode steps, the execute step is performed. During this step,
various portions of the CPU are connected so they can perform the desired operation.
If, for instance, an addition operation was requested, an arithmetic logic unit (ALU)
will be connected to a set of inputs and a set of outputs. The inputs provide the
numbers to be added, and the outputs will contain the final sum. The ALU contains
the circuitry to perform simple arithmetic and logical operations on the inputs (like
addition and bitwise operations). If the addition operation produces a result too large
for the CPU to handle, an arithmetic overflow flag in a flags register may also be set
(see the discussion of integer range below).

The final step, writeback, simply "writes back" the results of the execute step to some
form of memory. Very often the results are written to some internal CPU register for
quick access by subsequent instructions. In other cases results may be written to
slower, but cheaper and larger, main memory. Some types of instructions manipulate
the program counter rather than directly produce result data. These are generally
called "jumps" and facilitate behavior like loops, conditional program execution
(through the use of a conditional jump), and functions in programs.[5] Many
instructions will also change the state of digits in a "flags" register. These flags can be
used to influence how a program behaves, since they often indicate the outcome of
various operations. For example, one type of "compare" instruction considers two
values and sets a number in the flags register according to which one is greater. This
flag could then be used by a later jump instruction to determine program flow.

After the execution of the instruction and writeback of the resulting data, the entire
process repeats, with the next instruction cycle normally fetching the next-in-
sequence instruction because of the incremented value in the program counter. If the
completed instruction was a jump, the program counter will be modified to contain
the address of the instruction that was jumped to, and program execution continues
normally. In more complex CPUs than the one described here, multiple instructions
can be fetched, decoded, and executed simultaneously. This section describes what is
generally referred to as the "Classic RISC pipeline," which in fact is quite common
among the simple CPUs used in many electronic devices (often called
microcontrollers).[6]
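A self-contained C sketch can tie the four steps together. The 16-bit instruction format below (4-bit opcode followed by three 4-bit register fields) and the two-instruction program are invented for the example and do not correspond to a real ISA:

#include <stdint.h>
#include <stdio.h>

/* Sketch of the four steps described above (fetch, decode, execute,
   writeback) for a made-up 16-bit instruction format:
   bits 15-12 opcode, 11-8 destination register, 7-4 and 3-0 source registers. */

enum { OP_HALT = 0, OP_ADD = 1 };

int main(void) {
    uint16_t program[] = { 0x1312, 0x0000 };     /* ADD r3, r1, r2 ; HALT */
    uint16_t reg[16] = {0};
    unsigned pc = 0;

    reg[1] = 7; reg[2] = 5;

    for (;;) {
        uint16_t insn = program[pc];             /* 1. fetch          */
        pc += 1;                                 /* advance the PC    */

        unsigned opcode = (insn >> 12) & 0xF;    /* 2. decode         */
        unsigned rd     = (insn >> 8)  & 0xF;
        unsigned rs1    = (insn >> 4)  & 0xF;
        unsigned rs2    =  insn        & 0xF;

        if (opcode == OP_HALT)
            break;
        if (opcode == OP_ADD) {
            uint16_t result = reg[rs1] + reg[rs2];   /* 3. execute (ALU) */
            reg[rd] = result;                        /* 4. writeback     */
            printf("r%u = %u\n", rd, (unsigned)result);
        }
    }
    return 0;
}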

Design and implementation



Integer range

The way a CPU represents numbers is a design choice that affects the most basic
ways in which the device functions. Some early digital computers used an electrical
model of the common decimal (base ten) numeral system to represent numbers
internally. A few other computers have used more exotic numeral systems like ternary
(base three). Nearly all modern CPUs represent numbers in binary form, with each
digit being represented by some two-valued physical quantity such as a "high" or
"low" voltage.[7]

MOS 6502 microprocessor in a dual in-line package, an extremely popular 8-bit design.

Related to number representation is the size and precision of numbers that a CPU can
represent. In the case of a binary CPU, a bit refers to one significant place in the
numbers a CPU deals with. The number of bits (or numeral places) a CPU uses to
represent numbers is often called "word size", "bit width", "data path width", or
"integer precision" when dealing with strictly integer numbers (as opposed to floating
point). This number differs between architectures, and often within different parts of
the very same CPU. For example, an 8-bit CPU deals with a range of numbers that
can be represented by eight binary digits (each digit having two possible values), that
is, 2^8 or 256 discrete numbers. In effect, integer size sets a hardware limit on the range
of integers the software run by the CPU can utilize.[8]

Integer range can also affect the number of locations in memory the CPU can address
(locate). For example, if a binary CPU uses 32 bits to represent a memory address,
and each memory address represents one octet (8 bits), the maximum quantity of
memory that CPU can address is 2^32 octets, or 4 GiB. This is a very simple view of
CPU address space, and many designs use more complex addressing methods like
paging in order to locate more memory than their integer range would allow with a
flat address space.
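The address-space arithmetic in the example above is easy to reproduce; the following few lines of C simply restate it for a 32-bit address with one octet per address:

#include <stdint.h>
#include <stdio.h>

/* The arithmetic from the paragraph above: with n address bits and one
   octet per address, the addressable space is 2^n octets. */
int main(void) {
    unsigned address_bits = 32;
    uint64_t bytes = 1ULL << address_bits;
    printf("%u-bit addresses -> %llu bytes (= %llu GiB)\n",
           address_bits,
           (unsigned long long)bytes,
           (unsigned long long)(bytes >> 30));     /* 4294967296 bytes = 4 GiB */
    return 0;
}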

Higher levels of integer range require more structures to deal with the additional
digits, and therefore more complexity, size, power usage, and general expense. It is
not at all uncommon, therefore, to see 4- or 8-bit microcontrollers used in modern
applications, even though CPUs with much higher range (such as 16, 32, 64, even
128-bit) are available. The simpler microcontrollers are usually cheaper, use less
power, and therefore dissipate less heat, all of which can be major design
considerations for electronic devices. However, in higher-end applications, the
benefits afforded by the extra range (most often the additional address space) are
more significant and often affect design choices. To gain some of the advantages
afforded by both lower and higher bit lengths, many CPUs are designed with different
bit widths for different portions of the device. For example, the IBM System/370 used
a CPU that was primarily 32 bit, but it used 128-bit precision inside its floating point
units to facilitate greater accuracy and range in floating point numbers (Amdahl et al.
1964). Many later CPU designs use similar mixed bit width, especially when the
processor is meant for general-purpose usage where a reasonable balance of integer
and floating point capability is required.

Clock rate

Logic analyzer showing the timing and state of a synchronous digital system.

Most CPUs, and indeed most sequential logic devices, are synchronous in nature.[9]
That is, they are designed and operate on assumptions about a synchronization signal.
This signal, known as a clock signal, usually takes the form of a periodic square
wave. By calculating the maximum time that electrical signals can move in various
branches of a CPU's many circuits, the designers can select an appropriate period for
the clock signal.

This period must be longer than the amount of time it takes for a signal to move, or
propagate, in the worst-case scenario. In setting the clock period to a value well above
the worst-case propagation delay, it is possible to design the entire CPU and the way it
moves data around the "edges" of the rising and falling clock signal. This has the
advantage of simplifying the CPU significantly, both from a design perspective and a
component-count perspective. However, it also carries the disadvantage that the entire
CPU must wait on its slowest elements, even though some portions of it are much
faster. This limitation has largely been compensated for by various methods of
increasing CPU parallelism (see below).

Architectural improvements alone do not solve all of the drawbacks of globally synchronous CPUs, however. For example, a clock signal is subject to the delays of
any other electrical signal. Higher clock rates in increasingly complex CPUs make it
more difficult to keep the clock signal in phase (synchronized) throughout the entire
unit. This has led many modern CPUs to require multiple identical clock signals to be
provided in order to avoid delaying a single signal significantly enough to cause the
CPU to malfunction. Another major issue as clock rates increase dramatically is the
amount of heat that is dissipated by the CPU. The constantly changing clock causes
many components to switch regardless of whether they are being used at that time. In
general, a component that is switching uses more energy than an element in a static
state. Therefore, as clock rate increases, so does heat dissipation, causing the CPU to
require more effective cooling solutions.

One method of dealing with the switching of unneeded components is called clock
gating, which involves turning off the clock signal to unneeded components
(effectively disabling them). However, this is often regarded as difficult to implement
and therefore does not see common usage outside of very low-power designs.[10]
Another method of addressing some of the problems with a global clock signal is the
removal of the clock signal altogether. While removing the global clock signal makes
the design process considerably more complex in many ways, asynchronous (or
clockless) designs carry marked advantages in power consumption and heat
dissipation in comparison with similar synchronous designs. While somewhat
uncommon, entire CPUs have been built without utilizing a global clock signal. Two
notable examples of this are the ARM compliant AMULET and the MIPS R3000
compatible MiniMIPS. Rather than totally removing the clock signal, some CPU
designs allow certain portions of the device to be asynchronous, such as using
asynchronous ALUs in conjunction with superscalar pipelining to achieve some
arithmetic performance gains. While it is not altogether clear whether totally
asynchronous designs can perform at a comparable or better level than their
synchronous counterparts, it is evident that they do at least excel in simpler math
operations. This, combined with their excellent power consumption and heat
dissipation properties, makes them very suitable for embedded computers (Garside et
al. 1999).

Floating point unit



A floating point unit (FPU) is a part of a computer system specially designed to carry out operations on floating point numbers. Typical operations are addition,
subtraction, multiplication, division, and square root. Some systems (particularly
older, microcode-based architectures) can also perform various "transcendental"
functions such as exponential or trigonometric calculations, though in most modern
processors these are done with software library routines.

In most modern general purpose computer architectures, one or more FPUs are
integrated with the CPU; however many embedded processors, especially older
designs, do not have hardware support for floating point operations.

In the past, some systems implemented floating point via a coprocessor rather than as an integrated unit; in the microcomputer era, this was generally a single microchip, while in older systems it could be an entire circuit board or a cabinet.

Not all computer architectures have a hardware FPU. In the absence of an FPU, many
FPU functions can be emulated, which saves the added hardware cost of an FPU but
is significantly slower. Emulation can be implemented on any of several levels - in the
CPU as microcode, as an operating system function, or in user space code.

In most modern computer architectures, there is some division of floating point operations from integer operations. This division varies significantly by architecture; some, like the Intel x86, have dedicated floating point registers, while some take it as far as independent clocking schemes.

Floating point operations are often pipelined. In earlier superscalar architectures without general out-of-order execution, floating point operations were sometimes pipelined separately from integer operations. Today, many CPUs/architectures have
pipelined separately from integer operations. Today, many CPUs/architectures have
more than one FPU, such as the PowerPC 970, and processors based on the Netburst
and AMD64 architectures (such as the Pentium 4 and Athlon 64, respectively.)

In some cases, FPUs may be specialized, and divided between simpler floating point
operations (mainly addition and multiplication) and more complicated operations, like
division. In some cases, only the simple operations may be implemented in hardware,
while the more complex operations could be emulated.

In some current architectures, the FPU functionality is combined with units to perform SIMD computation; an example of this is the replacement of the x87 instruction set with the SSE instruction set in the x86-64 architecture used in newer Intel and AMD processors.

Wait state

A wait state is a delay experienced by a computer processor when accessing external memory or another device that is slow to respond.

As of 2005, computer microprocessors run at very high speeds, while memory technology does not seem to be able to catch up: typical PC processors like the Intel
Pentium 4 and the AMD Athlon run with a clock of several GHz, while the main
memory clock ranges from 233 to 533 MHz. Some second-level CPU caches run
slower than the processor core.

When the processor needs to access external memory, it starts by placing the address of the requested information on the address bus. It then must wait for the answer, which may come back tens if not hundreds of cycles later. Each of the cycles spent waiting is called a wait state.

Wait states are a pure waste of a processor's performance. Modern designs try to
eliminate or hide them using a variety of techniques: CPU caches, instruction
pipelines, instruction prefetch, branch prediction, simultaneous multithreading and
others. No single technique is 100% successful, but together they can significantly
reduce the problem.
