
Intel introduced a new concept, Hyper-Threading (HT) Technology, which brings simultaneous execution of multiple threads. HT technology divides the processor into two logical processors, so that threads can execute in parallel on a single physical processor. The logical processors share the underlying execution resources, while the architecture state is duplicated for each of them. The operating system therefore treats both logical processors as physical processors, enabling parallel execution.

Earlier, engineers tried to improve the performance of traditional microprocessors in other ways, such as pipelining and out-of-order execution. But as the transistor count on a processor grew, heat dissipation increased at a greater rate than performance, and there was also a physical limit to how many transistors could be placed on a processor. To improve performance, the main approaches were to:

1. increase the clock frequency,
2. exploit Instruction Level Parallelism, and
3. increase cache sizes.
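As a quick illustration of the OS-visible effect, a minimal Python sketch can query how many logical processors the operating system sees; on an HT-enabled machine this is typically twice the physical core count (whether it is actually doubled depends on the BIOS and OS configuration):

```python
import os

# os.cpu_count() reports the number of *logical* processors visible
# to the operating system. With Hyper-Threading enabled, this is
# usually 2x the number of physical cores.
logical = os.cpu_count()
print(f"Logical processors visible to the OS: {logical}")
```

The OS cannot distinguish these logical processors from physical ones without consulting the CPU topology, which is exactly the transparency the text describes.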

To increase the clock frequency, pipelining was used: if more instructions can be in flight at once, performance increases. But this came at the cost of cache misses and branch mispredictions. Instruction Level Parallelism (ILP) is a technique to execute multiple instructions at a time; the idea is to have multiple execution units that can execute several instructions in parallel. The problem with ILP is finding multiple independent instructions to execute, because instruction dependencies are hard to determine.

The processor has to access RAM for data reads and writes, and RAM access is much slower than the processor's execution speed, which increases latency. Caches were introduced to solve this problem. But a cache is only fast if it is small, and if data is not available in the cache the processor still has to access RAM, so cache misses again increase latency.

Another problem with increasing the number of transistors on a processor was that the die size grew rapidly. Looking at the statistics of Intel's first four processor generations, we can conclude that doubling the number of transistors does not double performance, because many clock cycles are wasted on branch mispredictions. So architects started looking for a solution that would deliver high performance and hide latencies while controlling transistor count and heat dissipation; HT technology is one such solution.

On the other hand, if we look at recent software trends, today's applications consist of multiple threads and can execute them in parallel, so multiprocessors were introduced to exploit this Thread Level Parallelism (TLP). Programmers know how to write applications that exploit a multiprocessing environment. Other techniques were also introduced to exploit TLP. One of them was Chip Level Multiprocessing, where two processors are placed on a single die, each with its own instructions, architecture state, and physical resources.
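Thread Level Parallelism on the software side can be sketched in a few lines of Python (the thread names and the work function here are illustrative only; note also that CPython's GIL limits true parallelism for CPU-bound threads, but the structure an OS scheduler sees is the same):

```python
import threading

results = {}

def work(name, n):
    # Each software thread computes independently; on an HT or
    # multiprocessor system the OS can schedule them in parallel.
    results[name] = sum(range(n))

t1 = threading.Thread(target=work, args=("t1", 1000))
t2 = threading.Thread(target=work, args=("t2", 2000))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # -> {'t1': 499500, 't2': 1999000}
```

An application written this way exposes two independent streams of work, which is precisely what TLP techniques such as multiprocessing, chip-level multiprocessing, and HT are designed to exploit.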

To execute multiple processes at the same time, time sharing was also used. This technique switches processes after a given time interval, but it wastes many CPU cycles, because each switch causes cache misses and branch mispredictions. In the end, Simultaneous Multithreading (SMT), which executes multiple threads without switching, is the practical way to increase processor performance relative to transistor count and heat dissipation. HT technology uses simultaneous multithreading.

HT Technology Architecture

Hyper-Threading technology divides a single physical processor into two logical processors; each logical processor has its own copy of the architecture state, and both share the common execution resources. To applications there appear to be two physical processors on which multiple threads can execute simultaneously. HT technology increases CPU performance without increasing cost: it takes less than 5% of the chip area, yet increases performance at a higher rate than it increases power consumption. Each logical processor has its own architecture state, registers, branch predictors, and interrupt controller, but shares all other resources such as the caches and RAM.

Implementation and Goals

HT technology was first implemented in the Intel Xeon microprocessor. There were several goals for this technology. First, to increase performance without increasing die size and cost; HT technology takes only about 5% additional chip area. Second, if one logical processor is idle or in a waiting state, the other should continue its work. Third, if only a single software thread is running at a time, the partitioned resources should be recombined to achieve maximum performance.

Pipelining

The pipeline in HT Technology works as follows. Instructions start from the Execution Trace Cache (TC). The TC is an L1 cache, and the front end of the pipeline transfers instructions to the other pipeline stages. Instructions are further divided into micro-operations (uops), and each uop is stored in the TC.
Each logical processor accesses the TC to load instructions. If both processors try to access the TC at the same time, access is granted to each in turn on alternating clock cycles; but if one processor is idle, the other logical processor can use the full bandwidth of the TC. On a TC miss, a request is sent to the Instruction Translation Lookaside Buffer (ITLB) to fetch the next instruction from the L2 cache. Each logical processor has its own ITLB and instruction pointer, and the ITLB serves requests on a first-come, first-served basis. Fetched instructions are placed into a streaming buffer until they are decoded. Complex instructions are handled by the Microcode ROM. IA-32 instructions are the most complex to decode because their length is not fixed; decoding is only needed on a TC miss. The decode logic reads instructions from the streaming buffer and decodes them. If both logical processors want to decode at the same time, the decode logic gives each a turn alternately; if only one is decoding at a time, it can use the full bandwidth.

Once fetched from the TC or Microcode ROM, a uop goes to the uop queue. The queue is divided into two equal parts so that both logical processors can continue their work. Here the execution phase starts. Execution occurs in the out-of-order execution engine and involves the following functions:

1. Allocation
2. Register Renaming
3. Instruction Scheduling
4. Execution
5. Retirement
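The alternate-clock-cycle arbitration described above (for TC access and for decode) can be sketched as a toy software model; this is a simulation of the sharing policy only, not of real hardware:

```python
from collections import deque

def arbitrate(queues):
    """Round-robin between two logical processors' request queues,
    granting one request per 'clock cycle'. If one queue is empty,
    the other gets the full bandwidth (a grant every cycle)."""
    order = []
    turn = 0
    while any(queues):
        if queues[turn]:
            order.append(queues[turn].popleft())
        elif queues[1 - turn]:
            # The other logical processor is idle: steal its cycle.
            order.append(queues[1 - turn].popleft())
        turn = 1 - turn
    return order

lp0 = deque(["a0", "a1", "a2"])   # requests from logical processor 0
lp1 = deque(["b0"])               # requests from logical processor 1
print(arbitrate([lp0, lp1]))      # -> ['a0', 'b0', 'a1', 'a2']
```

Once lp1 runs dry, lp0's remaining requests are served on every cycle, mirroring how an idle logical processor frees the full TC/decode bandwidth for the other.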

The allocator takes a uop from the uop queue and allocates the buffers needed for its execution. About half of the buffer entries are available to each logical processor. If both logical processors are working simultaneously, resources are assigned to them fairly; if one of them stalls, the allocator tries to allocate the free resources to the other logical processor.

The register renaming logic renames registers and places the new names in a Register Alias Table (RAT). There are two RATs, one for each logical processor. Renaming is done in parallel with allocation so that each instruction knows the new names of its allocated registers. After renaming and allocation, the uop is placed in one of two queues: the memory instruction queue for memory operations, and the general-purpose queue for all other operations.

The scheduler is the core of the out-of-order execution engine. Five schedulers are used to dispatch uops for execution. A scheduler decides that a uop is ready to execute based on the availability of its inputs and of execution resources; when a uop is ready, it is sent to an available execution unit. Uops access the renamed registers to read their inputs and to write their results back after execution. When execution completes, uops are sent to the reorder buffer, which decouples execution from the retirement stage. The retirement logic determines which uop is ready to retire; when a uop retires and its results are written to the L1 cache, selection lines alternately allow each logical processor to write its data to the cache.

Single-tasking and multi-tasking

The processor has two modes: Single Task (ST) and Multi Task (MT). ST further has two modes, ST0 and ST1. The IA-32 architecture provides a HALT instruction that only the OS or ring-0 processes can use; it changes the processor mode from MT to ST.
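Register renaming with a per-logical-processor RAT can be sketched as a toy model (real renaming tracks far more state, such as free lists and checkpoints; the class and register names here are illustrative):

```python
class RAT:
    """Toy Register Alias Table: maps architectural register names
    to physical registers. Each logical processor owns its own RAT,
    so renaming in one never disturbs the other."""
    def __init__(self):
        self.map = {}
        self.next_phys = 0

    def rename(self, arch_reg):
        # Allocate a fresh physical register for each new write,
        # removing false (write-after-write) dependencies.
        phys = f"p{self.next_phys}"
        self.next_phys += 1
        self.map[arch_reg] = phys
        return phys

rat0, rat1 = RAT(), RAT()   # one RAT per logical processor
print(rat0.rename("eax"))   # -> p0
print(rat0.rename("eax"))   # -> p1 (second write gets a new name)
print(rat1.rename("eax"))   # -> p0 (independent table)
```

Because each logical processor renames into its own table, two threads can both write `eax` without any conflict in the shared physical register file.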
If there is a single task to execute, only one logical processor is used (mode ST0 or ST1), and HALT is executed on the other logical processor, which goes into a lower-power mode.

Optimizations required in the OS

Two optimizations are required in the OS to use HT technology. The first is to use the HALT instruction: if the OS does not use it, the idle logical processor keeps spinning in its idle loop and keeps consuming the shared resources, which slows down processing on the busy logical processor.
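The contrast between a spinning idle loop and a halted wait can be illustrated at the software level with a blocking wait (this is only an analogy: HALT itself is a privileged hardware instruction, whereas here a blocked thread simply yields the CPU instead of burning shared resources in a spin loop):

```python
import threading

ready = threading.Event()
observed = []

def idle_then_work():
    # Instead of spinning (while not ready.is_set(): pass), block
    # until signaled -- analogous to HALT freeing the shared
    # execution resources for the other logical processor.
    ready.wait()
    observed.append("woke")

t = threading.Thread(target=idle_then_work)
t.start()
ready.set()   # the "wake" event ends the idle state
t.join()
print(observed)  # -> ['woke']
```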

The second optimization is to schedule software threads across the two logical processors in the same way as across physical processors. This allows applications to make use of both logical processors. HT Technology increases performance by up to 21% for typical single- or dual-processor workloads, and by 16% to 28% for web servers.
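The scheduling preference behind this optimization, namely spreading threads across physical cores before doubling up on the HT siblings of one core, can be sketched as a toy placement function (the topology dictionary below is an assumption for illustration; a real OS queries the CPU topology):

```python
def pick_cpus(topology, n_threads):
    """Pick one logical CPU per physical core first, so software
    threads land on separate physical cores before any two share
    one core's resources. topology: logical_cpu -> physical_core."""
    chosen, used_cores = [], set()
    # First pass: prefer logical CPUs on physical cores not yet used.
    for cpu, core in sorted(topology.items()):
        if core not in used_cores:
            chosen.append(cpu)
            used_cores.add(core)
        if len(chosen) == n_threads:
            return chosen
    # Fall back to the remaining HT siblings once cores are exhausted.
    for cpu in sorted(topology):
        if cpu not in chosen:
            chosen.append(cpu)
        if len(chosen) == n_threads:
            return chosen
    return chosen

# Two physical cores, each exposing two logical processors (HT pairs).
topo = {0: 0, 1: 0, 2: 1, 3: 1}
print(pick_cpus(topo, 2))  # -> [0, 2] (separate physical cores)
print(pick_cpus(topo, 3))  # -> [0, 2, 1] (third thread shares core 0)
```

An HT-unaware scheduler that picked logical CPUs 0 and 1 for two threads would leave physical core 1 idle while both threads compete for core 0's shared resources.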
