
Report: Intel Nehalem architecture - BeHardware

>> Processors
Written by Franck Delattre
Published on September 17, 2008
URL: http://www.behardware.com/art/lire/733/

Page 1 Introduction

After the big splash made by the Core 2 architecture, which easily dominated two generations of AMD processors and handed the leading position back to Intel, the manufacturer is now preparing to bring the Core i7 and Core i7 Extreme Edition, the first representatives of the Nehalem architecture, to market.

In vino veritas

Nehalem is henceforth a name synonymous with the Core i7 architecture and its various incarnations. Nehalem is a river in Oregon, and a project manager who was also a connoisseur of the region's wines chose the name. Wine making is decidedly a rich source of inspiration, as Asus recently showed by giving the label Pinot Noir to one of its motherboards... As is now our habit, we are going to try to understand Intel's goals in the design of this new architecture and see what the new range of processors will bring to the way we use our PCs. For once, we won't have to wait too long, as the first Core i7s are expected to be available at the end of the year.

In 2006, Intel left its mark on the world of PCs when it released a processor that combined high performance and low energy consumption and was relatively well suited to all platforms. From the moment it reached the market, the Core 2 eclipsed AMD's Athlon 64, which had in turn fairly easily imposed itself on the Pentium 4 and its Netburst architecture. AMD more or less missed its chance at a successful challenger with a weak Phenom, and Intel made matters worse with the 45nm Core 2 models that raised the bar another notch. This miracle processor, which firmly reinstated Intel as the leader, was not by any means a totally new creation. In fact, it is a direct descendant of the mobile processors initiated under the Pentium M name (Banias and Dothan) and continued up to the Core Duo (Yonah), the first native dual core on the market. Keep in mind that these models are the fruit of Intel's research centres in Haifa, Israel.

Once Mobile, always Mobile

Core 2 clearly has mobile origins, and although it gives excellent results on all types of platforms (desktop PCs, servers and of course portable machines), its design keeps the main strengths and weaknesses of this mobile heritage: the design specifications for a mobile processor are obviously different from those of a desktop or server processor. Indeed, from its introduction the Core 2 suffered from a certain technological out-datedness: an aging processor bus (compared with the K8's HyperTransport bus), an external memory controller (integrated directly into the K8 since 2004!) and a dual-core design which, although it went further than previous implementations, only preceded the first AMD native quad cores by 12 months. Intel had to push several innovations back in order to get the Core 2 released as quickly as possible with the least amount of modification. Of course, performance wasn't disappointing, but in terms of innovation the Core 2 didn't represent the same sort of technological leap as the Pentium 4 did when it was introduced in 2000 (even if that leap didn't guarantee its success). A fundamentally mobile architecture, Core 2 was able to adapt to the demands of other platforms. However, this wasn't without consequences, in particular for server use, where a quad-core assembly struggled with the old processor bus technology and the single memory bus shared by several processors. Also, 64-bit performance fell slightly (some processor capabilities were not activated in this mode).
However good it was in absolute terms, the Core 2 never dominated the server platform, in contrast to the Opteron Barcelona, which was designed especially for the job. This goes to show that no processor can be best at everything! This gives an idea of what is at stake with Nehalem, the goal being to cover what the Core 2 lacks. Yet how is it possible to come up with an architecture that meets the demands of all platforms? The key is modularity. Nehalem is above all a flexible and adaptable architecture. More Lego than processor, there will be so many versions of the architecture that, even according to Intel engineers, the nomenclature (the commercial naming, that is) will be a real headache. This means that many of the improvements to Nehalem that we discuss below will not equip all models.

Page 2 Quad and SMT

What's new?

Nehalem is a monolithic quad-core architecture, meaning that it doesn't result from the fusion of two dual cores. Nehalem also introduces the "uncore" notion to Intel's products, this term designating any part of the processor that isn't directly part of the instruction processing engine. Unlike the Core 2, which was based on a single clock distribution (the entire processor runs at a single clock frequency), Nehalem uses a complex clock distribution. Each core can thus run at its own frequency, as can the entire uncore part of the processor.

Some new and some old

Nehalem's cores benefit from SMT (Simultaneous Multi-Threading), a technology which appeared with the Pentium 4 under the name Hyper-Threading (the commercial name of SMT on Netburst) and that we also find on the first generations of Atom processors. SMT is a technique which aims to facilitate the handling of several threads by the same execution core. In the absence of SMT, a core successively processes pieces of the different threads it is in charge of at any given moment. The constant transition from one thread to another gives the illusion that they are being executed simultaneously, but in actuality a lot of time is devoted to these transitions: each time, the core must save the context in which a thread was executing (the state of the registers and stack) and load the context of a new thread. The concept of SMT is to give the core the possibility of holding not one but two contexts at the same time, which enables it to process two threads in a genuinely simultaneous manner. The core's resources (caches and execution units) are shared between the two threads either statically (for example, a buffer is separated into two identical parts) or dynamically (threads access the resource competitively depending on their specific needs).

Besides eliminating time spent on context transitions, SMT's performance gain comes from better use of a core's execution units. The instruction streams of the two threads are independent of each other, which notably benefits the out-of-order (OOO) execution engine, one of whose functional constraints is the interdependency of instructions. The execution pipeline can thus be filled more efficiently, and in the end the efficiency of the execution core increases. In a non-OOO execution engine that cannot re-order instructions (as is the case with the Atom), speed is directly related to the dependence of successive instructions on each other; on such a design, SMT can bring close to double the performance. And so as not to waste a thing, adding SMT to a core is cheap in silicon compared to the benefit in performance as soon as more than one thread is handled by the processor. As the operating system only handles threads, it interprets the presence of the two contexts as two distinct logical processors, in the same way as two cores. Nehalem's four cores thus appear in the Windows task manager as 8 logical processors. All of this sounds quite advantageous, but SMT technology isn't free of drawbacks. In the first place, since the concept of SMT resides in increasing the efficiency of an execution core, the possible gain is all the more significant if the starting efficiency of the architecture in question is low. Netburst struggles with its very long pipeline (20 and then 30 stages), which is difficult to fill in an optimal manner. For this reason the architecture benefits from SMT, and the gain can reach 40% in certain applications with the Pentium 4 Prescott. The effect is even more advantageous on the Atom, whose in-order execution engine is strongly penalized by instruction dependency.
Nehalem inherits its execution engine from the Core 2, and it's only legitimate to wonder what gain SMT adds to execution cores already reputed to be efficient. According to Intel, SMT allows Nehalem's 4-wide engine (in other words, one capable of handling up to four instructions simultaneously) to use its full width. This may be considered a somewhat "horizontal" optimization, compared to the "vertical" one obtained from the extra length of the Netburst pipeline. The other defect of SMT resides in the competing access of the threads to the caches, in particular to the L1. Nehalem's L1 caches are fortunately large enough to easily accommodate two threads, and at any rate are better equipped than those of the Pentium 4 in this domain. So in the end, will SMT be advantageous for Nehalem? Yes, because as we will see in this study, many of the improvements to Nehalem's core were made with the optimal functioning of SMT on the new architecture in mind, precisely to maximize the performance gains. Of course, the gain will necessarily vary depending on the application. SMT works wonders in server and database management environments, or at least this is what resulted from its use on the Xeon with its Netburst architecture. The benefit of SMT will of course be smaller on desktop PCs, notably for office or gaming use (and even more so on the mobile platform). For a time it was even planned that non-server versions of Nehalem would not be equipped with SMT, but Intel has reversed this decision and the Bloomfield (the high-end version for desktop PCs) will have it. Whatever the case, SMT remains optional and it will always be possible to decide whether its presence is desired.
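As a quick aside, the logical-processor count the operating system sees follows directly from the core count and the SMT width. A minimal Python sketch (the core count is an assumption matching a quad-core Nehalem):

```python
import os

# With SMT enabled, each physical core exposes two hardware thread
# contexts, so the OS sees twice as many logical processors as cores.
physical_cores = 4        # a quad-core Nehalem (assumption for the example)
smt_ways = 2              # 2-way SMT
logical = physical_cores * smt_ways
print(logical)            # -> 8, as shown in the Windows task manager

# On the machine actually running this, os.cpu_count() reports the
# logical count the scheduler works with:
print(os.cpu_count())
```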

Page 3 Memory controller, TLB

An in-depth review of the memory hierarchy

Nehalem's memory pipeline has seen numerous changes compared to the Core 2. Two innovations immediately stand out: the memory controller is integrated into the processor, and a third cache level is present.

IMC = Integrated Memory Controller

Nehalem's integrated memory controller is capable of supporting up to three DDR3-1333 memory channels, offering a maximum bandwidth of 32 GB/s (compared to the 21 GB/s that the discrete memory controllers of Core 2 platforms could provide). This should be enough to feed the eight threads that Nehalem can process thanks to SMT. The integration of the memory controller also means a drastic reduction in memory access latency. While the benefit of the integrated memory controller is real for desktop PCs, it's server platforms that will profit the most, notably in configurations with several sockets, where the available memory bandwidth increases proportionally to the number of processors in the system. The advantage over a solution with a shared memory bus is enormous, and Intel thus hopes to recapture some of the server market on which the Opteron (K8 and K10) shines thanks to this ability to scale bandwidth. It's probable, on the other hand, that versions of the processor reserved for ultra-mobile use will do without the integrated controller in favor of a reduced thermal envelope. Integrating the memory controller into the processor also means less manoeuverability in the support of DRAM technologies. Will Nehalem's modularity go as far as support for different types of memory controllers? This is planned, but we already know that server versions of the processor will no longer support FB-DIMM. Maybe Intel didn't find it wise to keep investing in this standard in favor of a more recent server memory technology? There are a lot of questions, and we will follow these developments closely.
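The bandwidth figures above can be checked with a little arithmetic: each DDR3 channel is 64 bits (8 bytes) wide, and DDR3-1333 performs 1333 mega-transfers per second. A quick sketch:

```python
# Per-channel DDR3 bandwidth: transfers/s x bus width in bytes.
def channel_bw_gbs(mega_transfers_per_s, width_bytes=8):
    return mega_transfers_per_s * width_bytes / 1000.0   # GB/s

per_channel = channel_bw_gbs(1333)       # ~10.7 GB/s per DDR3-1333 channel
nehalem_bw = 3 * per_channel             # triple channel on the IMC
core2_bw = 2 * per_channel               # dual channel on the chipset
print(round(nehalem_bw, 1), round(core2_bw, 1))   # 32.0 21.3
```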
A revised TLB hierarchy

TLBs (Translation Lookaside Buffers) are buffers that store the translations between the virtual addresses used by programs and their physical address equivalents. We heard a lot about them recently due to the famous bug found in the Phenom. The Core 2's TLB structure gives high performance thanks to the presence, in addition to a classic TLB of 288 entries, of a very small and fast micro-TLB devoted solely to reads. Intel had to go back to the drawing board, notably because of SMT: the micro-TLB was abandoned on Nehalem in favor of a classic TLB, which is more capable of holding the addresses of two threads. Nehalem, on the other hand, keeps two TLB levels: two first-level buffers for code and data (192 entries in total) and a unified second-level buffer offering no less than 512 entries. These two TLB levels can efficiently handle a much larger quantity of data than the Core 2's. Once again, it's the server market that is first and foremost the target of this new characteristic, in particular the management of large databases. In addition, Nehalem's TLBs innovate with the presence of a virtual processor ID that makes it possible to mark an entry as specific to processing inside a virtual machine. While on the Core 2 the transition to a virtual host (for example, one created by VMware) means flushing the TLBs, this trick enables Nehalem to accelerate the transitions between host and virtual machines. Here the benefit is mostly intended for professional use.
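To illustrate the idea (a toy model, not the actual hardware design), tagging each TLB entry with a virtual-processor ID lets translations from the host and from a guest coexist, so no flush is needed on a transition:

```python
# Toy TLB whose entries are keyed by (vpid, virtual page), so two
# address spaces can share the buffer without evicting each other.
class TaggedTLB:
    def __init__(self):
        self.entries = {}          # (vpid, virtual_page) -> physical_page

    def insert(self, vpid, vpage, ppage):
        self.entries[(vpid, vpage)] = ppage

    def lookup(self, vpid, vpage):
        return self.entries.get((vpid, vpage))   # None models a TLB miss

tlb = TaggedTLB()
tlb.insert(vpid=0, vpage=0x10, ppage=0x8810)   # host mapping
tlb.insert(vpid=1, vpage=0x10, ppage=0x2210)   # guest VM mapping
# Switching host <-> guest requires no flush; both translations survive:
assert tlb.lookup(0, 0x10) == 0x8810
assert tlb.lookup(1, 0x10) == 0x2210
```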

Page 4 Cache hierarchy

The new cache hierarchy

In a monolithic architecture, the coherence of the data manipulated by each of the cores is maintained via a shared cache. The Core 2 thus integrates a large L2 cache shared between two cores. The quad-core implementations of the Core 2 rely on the processor bus to maintain this coherence, which isn't optimal for performance.

Source: Chip Architect

It's therefore almost natural to find a large cache shared between the four cores on Nehalem. However, things are much more complicated with four cores instead of two. A cache cannot respond to intensive requests from four cores without significant latency unless its technical characteristics are beefed up, and that implies a complexity beyond that of a consumer processor. The economical solution thus consists of reducing the number of requests that reach the shared cache. To do this, Intel inserted a small 256 KB cache between the L1 of each core and the shared cache. These four 256 KB caches do not take up too much space on the chip, and their small size is a guarantee of speed. On the other hand, such a size does not translate into record hit rates, but this is not the goal. If each L2 offers a hit rate of only 50% (which is pessimistic), every other request will not reach the shared cache, and everything happens as if requests arrived from only two of the four cores. And with only two cores, we already know that this works fine.

L1

The first-level L1 caches of each Nehalem core have the same sizes as on the Core 2: 32 KB for data and 32 KB for instructions. Doing away with the micro-TLB, which we mentioned above, unfortunately translates into a slight increase in L1 access time: the data L1's latency goes to 4 cycles (versus 3 on the Core 2). For the L1 devoted to instructions, Intel chose to favor latency to the detriment of associativity. Managing cache ways takes time, all the more so the greater the number of ways. By reducing the associativity of the L1 instruction cache from 8 to 4 ways, Nehalem can keep a latency of 3 cycles as on the Core 2, despite the absence of a micro-TLB. Why this choice? Because an instruction cache is more sensitive to latency than a data cache. Access latency on the latter can be compensated for (at least partially) by the work of the OOO engine, which reorganizes instructions in order to mask latency (Nehalem's has been considerably improved), while each access to the instruction cache is directly affected by higher latency, in particular accesses carried out by the branch prediction mechanisms. The instruction L1 thus loses less by reducing its associativity than it would by increasing its latency. In the end, it's a compromise. Finally, Nehalem's L1 is capable of handling more parallel cache misses than the Core 2's. This goes hand in hand with the gain in bandwidth offered by the integrated memory controller: a cache miss means a memory access, and the average time between two memory requests diminishes. The increase proves particularly interesting for SMT, as two threads generate more cache misses than a single one.

Inclusive cache

Nehalem's cache hierarchy inevitably reminds us of the Phenom's. However, the resemblance stops at the number of levels, because the caches do not function in the same way in the two architectures.
This begins with the fact that Nehalem's shared L3 cache has an inclusive relationship with all the other cache levels, meaning that it contains a copy of the contents of the L1 and L2 caches. This characteristic distinguishes it from AMD's choice on the Phenom, whose L3 has a pseudo-exclusive relationship with the other cache levels (data cannot be found in two cache levels at the same time; "pseudo" because there are a few exceptions). An inclusive cache relationship generally translates into higher performance, but to the detriment of the total useful cache size (due to the redundancy of certain data in two successive levels). In a multi-core architecture, this inclusive relationship amplifies the defect: of the 8 MB of L3, more than 1 MB is occupied by copies of the L1 and L2 caches. However, it also has the advantage of disturbing the private L1 and L2 caches less. Why? Because in the case of an L3 cache miss, we are sure that the data is not in the private caches of any of the cores (otherwise it would be in L3, due to the inclusive relationship), which makes it possible to skip verification and immediately issue a read request to memory. Things are more complicated in the case of an L3 hit, because verification is then required to see if the data is also present in one of the private caches, which means checking the caches of each core. This step, necessary for cache coherence, is called cache snooping and can be a significant source of latency. To overcome this problem, Nehalem keeps for each L3 cache line a set of flags indicating which core's (or cores') private caches hold the data. While the gain in time is appreciable, the storage of these flags adds a little weight to the structure of the L3. The first latency tests showed an average of 40 processor cycles for the L3 cache of current Nehalem models (versus 4 cycles for L1 and roughly 10 cycles for L2).
Such a value can be partly explained by the fact that the L3 cache runs at a different frequency (and voltage) from the rest of the processor, as does the whole uncore part of Nehalem. Thus, on the 2.93 GHz model, the L3 runs at 2.66 GHz, which slightly distorts latency measurements expressed in 2.93 GHz processor cycles. The separate frequencies and voltages add flexibility to the processor's design, and notably avoid having to align the processor's overall frequency with its slower elements. In addition, this enables better control of the overall thermal dissipation of the socket which, as we will see later, gives Nehalem another special characteristic.
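The inclusive-cache figures quoted above are easy to verify: with four cores each holding 64 KB of L1 and 256 KB of L2, the capacity duplicated inside the 8 MB L3 works out as follows (a quick check in Python):

```python
KB = 1024
l1_per_core = (32 + 32) * KB        # 32 KB data + 32 KB instructions
l2_per_core = 256 * KB
l3_total = 8 * KB * KB              # 8 MB shared L3

# Capacity duplicated in L3 by the inclusive relationship:
duplicated_mb = 4 * (l1_per_core + l2_per_core) / (KB * KB)
print(duplicated_mb)                # 1.25 -> "more than 1 MB"

# Extra leakage from adding the four L2s, relative to the L3:
print(4 * 256 * KB / l3_total)      # 0.125 -> "almost 13%"
```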

In terms of flexibility, the size of Nehalem's L3 is easily adaptable to the capabilities of each processor version and to each evolution in manufacturing. The transition to the 32 nm process will probably be accompanied by a 12 MB L3 cache, as was the case for the Core 2.

Page 5 QPI Bus, the core

A new processor bus

One of the defects of the Core 2 resides in its use of a processor bus of older design. While mobile and desktop platforms have no problem with it, this isn't the case for servers, where the old FSB is a bottleneck in the interconnection between sockets. In this area, the Opteron and its HyperTransport bus have been without serious competition up until now. Nehalem abandons the FSB for a more modern interconnect called QPI (QuickPath Interconnect). This new point-to-point bidirectional bus shares numerous characteristics with HyperTransport, and the principle is fundamentally similar. Just like its rival, QPI offers great flexibility in its implementation, and systems will be able to integrate as many QPI links as bandwidth requires. The QPI bus is announced with transfer rates of 4.8 to 6.4 GT/s (giga-transfers per second) and a bus width of up to 20 bits. This gives a maximum speed of 6.4 x 20 / 8 = 16 GB/s, or 32 GB/s for a bidirectional link. The first implementations of QPI on Nehalem provide 25.6 GB/s per link, double what a classic FSB at 1600 MHz offers.
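The QPI arithmetic above can be reproduced directly. Note that the 25.6 GB/s quoted for the first implementations presumably corresponds to counting only the 16 payload bits of the 20-lane link (the remaining lanes carry protocol overhead) in both directions; this reading is our assumption:

```python
# QPI link bandwidth: giga-transfers/s x width in bits / 8 bits per byte.
def qpi_bw_gbs(gt_per_s, width_bits=20):
    one_way = gt_per_s * width_bits / 8            # GB/s, one direction
    return round(one_way, 3), round(2 * one_way, 3)

print(qpi_bw_gbs(6.4))   # (16.0, 32.0) -- the raw 20-bit figure
print(qpi_bw_gbs(4.8))   # (12.0, 24.0)

# The quoted 25.6 GB/s matches 16 payload bits, both directions
# (assumption about how the marketing figure is counted):
print(round(6.4 * 16 / 8 * 2, 3))   # 25.6
```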

The QPI links, in blue on the diagram, interconnect the processors with each other and connect each processor to an IOH (an input/output hub that serves, for example, as an interface with the PCI Express bus). In this example, each processor is capable of handling four QPI links. On a single-processor machine, a single QPI link between the processor and the IOH (in this case an X58) is of course all that is needed.

Improvements to the core

Compared to the Core 2, many of Nehalem's improvements were motivated by the support of SMT and, more generally, by the new memory hierarchy (the three cache levels and the increase in available memory bandwidth). The same goes for the processing cores, along the entire pipeline, whose stages were all more or less improved compared to what was found on the Core 2.

Branch prediction

Starting with branch prediction: as we saw in our look at the Core 2, it is one of the mechanisms with the most significant influence on the performance of the processing pipeline. Branch prediction's goal is to avoid breaks in the instruction stream, as these stall the pipeline and thus lower throughput. Nehalem inherits the mechanisms already found on the Core 2: a loop detector and the handling of direct and indirect branches. In addition, the new architecture integrates a second BTB (Branch Target Buffer) whose role is to store a history of destination addresses that were actually taken; while the first BTB is devoted to nearby addresses, the second handles addresses further away, of the kind found in heavier applications (yes, like database management). On top of this, Intel has added a new mechanism that relies on storing return addresses (and not destination addresses like the BTB) called the RSB (Return Stack Buffer). Note that each thread has its own RSB, in order to avoid any conflict in the management of this buffer when SMT is activated.

Fusion

The instruction decoding stage was also revised. You may recall that this phase consists of transforming x86 instructions into elementary micro-operations that are comprehensible to the rest of the processing pipeline. Nehalem keeps the four decoders already found on the Core 2 but improves certain mechanisms introduced by its predecessor. Macro-fusion was one of the innovations of the Core 2 architecture; it consists of detecting predefined pairs of x86 instructions, such as compare + jump (CMP + JCC), and transforming them into a single micro-operation. The technique both increases decoding capacity and reduces the number of micro-operations generated - all the more so when these instruction pairs occur frequently.
Nehalem adds new instruction pairs capable of macro-fusing and, above all, enables macro-fusion in 64-bit mode (which is unfortunately not the case on the Core 2, which thus does not benefit from its full potential when running a 64-bit operating system).
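To make the mechanism concrete, here is a toy sketch (in Python, not hardware) of what macro-fusion does to a decoded instruction stream: a fusible pair becomes a single micro-operation. The pair list is purely illustrative:

```python
# Illustrative set of fusible pairs (the real list is defined by the
# microarchitecture; CMP/TEST + conditional jump is the classic case).
FUSIBLE = {("cmp", "jcc"), ("test", "jcc")}

def macro_fuse(instrs):
    """Merge adjacent fusible pairs into one fused micro-op."""
    out, i = [], 0
    while i < len(instrs):
        pair = (instrs[i], instrs[i + 1]) if i + 1 < len(instrs) else None
        if pair in FUSIBLE:
            out.append(instrs[i] + "+" + instrs[i + 1])   # one micro-op
            i += 2
        else:
            out.append(instrs[i])
            i += 1
    return out

stream = ["mov", "cmp", "jcc", "add"]
print(macro_fuse(stream))   # ['mov', 'cmp+jcc', 'add'] -- 4 -> 3 uops
```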

Page 6 The core (continued)

Loop detector

Core 2 also introduced a mechanism that optimizes the decoding of loops, called the Loop Stream Detector, or LSD. The principle is based on detecting loops in the instruction stream. The code in question is then placed in a dedicated buffer and put on a special path that skips certain redundant phases of processing: it is useless, for example, to resort to branch prediction on every iteration of the loop. Nehalem uses the same concept, except that loop detection is carried out after decoding into micro-operations, with the goal of also skipping the decode phase of the loop on each iteration. The buffer therefore doesn't store x86 instructions but micro-operations that have already been decoded. It's interesting to note the similarity in concept with the trace cache of the Pentium 4, which functioned as a code cache containing instructions that had already been decoded. This is just another example of how technology is recycled.
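The micro-op variant of the LSD is, in essence, memoized decoding: decode each instruction of the loop once, then replay the cached micro-ops on every later iteration. A toy sketch (illustrative only, not the real mechanism):

```python
decoded_cache = {}

def decode(instr):
    """Stand-in for the expensive x86 -> micro-op decode step."""
    return [f"uop({instr})"]

def fetch_uops(instr):
    if instr not in decoded_cache:             # first loop iteration
        decoded_cache[instr] = decode(instr)   # decode and remember
    return decoded_cache[instr]                # later iterations: replay

loop_body = ["add eax, 1", "cmp eax, ebx", "jl loop"]
for _ in range(1000):                 # decode runs 3 times, not 3000
    for instr in loop_body:
        fetch_uops(instr)
print(len(decoded_cache))             # -> 3
```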

OOO engine

Nehalem's out-of-order (OOO) execution engine underwent several modifications, mostly intended to support SMT. Thus, the size of the instruction re-ordering buffer (ROB: Re-order Buffer) was increased to 128 entries (96 on the Core 2) and is split into two equal parts between the two threads. The micro-operations of the two threads are then sent to a buffer called the Reservation Station, which is responsible for dispatching them to the calculation units. This buffer now contains 36 entries (32 on the Core 2) and uses a dynamic sharing policy between the two threads: if one of the two threads is waiting for a memory operand, the other thread can benefit from more entries in the RS.
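The two sharing policies can be contrasted with a small sketch (a toy model; the entry counts are the ones quoted above, and the "demand" weighting is purely illustrative):

```python
# Static partitioning: each thread gets a fixed share, even when idle.
def static_share(total_entries, threads=2):
    return {t: total_entries // threads for t in range(threads)}

# Dynamic partitioning: threads compete, entries follow relative demand.
def dynamic_share(total_entries, demand):
    total_demand = sum(demand.values())
    return {t: round(total_entries * d / total_demand)
            for t, d in demand.items()}

# Nehalem's 128-entry ROB is split statically: 64 entries per thread.
print(static_share(128))                  # {0: 64, 1: 64}

# The 36-entry Reservation Station is shared dynamically: a thread
# stalled on memory (low demand) yields entries to the busy one.
print(dynamic_share(36, {0: 3, 1: 1}))    # {0: 27, 1: 9}
```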

The following step consists of the actual execution of instructions by the calculation units. This is not directly affected by SMT, whose influence at this level shows only in the rate at which instructions arrive. Nehalem's units are therefore identical to the Core 2's in every way.

SSE4.2

Nehalem introduces a new instruction set, or more precisely a complement to the SSE4.1 instructions of the 45 nm Core 2. Intel has made some effort to communicate on this new instruction set and now presents it in a more concrete form for users. SSE4.2 is thus broken down into STTNI (String and Text New Instructions), whose purpose is to accelerate the processing of character strings, and ATA (Application Targeted Accelerators), which groups together instructions specialized in the calculation of checksums (used, for example, in compression algorithms) and others aimed at data searches. Note that version 10 of the Intel C++ compiler already supports these new instructions.
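To give a sense of the kind of work the ATA side accelerates, here is a CRC checksum over a byte string. Note that SSE4.2's dedicated instruction computes CRC-32C (the Castagnoli polynomial), whereas Python's zlib exposes the classic CRC-32 - a different polynomial, but the same class of computation:

```python
import zlib

data = b"compress me " * 1000
checksum = zlib.crc32(data)        # classic CRC-32 (not SSE4.2's CRC-32C)
print(hex(checksum))

# The standard CRC-32 check value, a common self-test for implementations:
print(hex(zlib.crc32(b"123456789")))   # 0xcbf43926
```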

Page 7 TDP and Turbo Mode

A higher TDP

Is the Nehalem architecture economical in energy use? Not really, or rather not all models. The first Core i7 Bloomfields that will soon be available have an announced TDP of 130W for clock frequencies between 2.66 and 3.20 GHz. This is almost 35% more than the Core 2 Quad Yorkfield (45 nm) and its TDP of 95W at up to 3 GHz. However, we should keep in mind that the processor now integrates a memory controller, whose dissipation is included in these 130W; this isn't the case for a machine based on the Core 2, where the memory controller is integrated into the Northbridge. Previously, those watts were simply consumed elsewhere. On the other hand, the Core i7 concentrates this dissipation on a smaller surface and close to the cores, which are the hottest part of the system. The Core i7's caches are also a significant source of thermal dissipation. If we only consider dissipation related to leakage, the mere presence of the L2 caches (added, you may recall, for the sole purpose of relieving the L3 cache, without increasing the total cache size) increases the cache sub-system's dissipation by almost 13% (4 x 256 KB divided by 8 MB). The multiple clock and voltage domains fortunately enable better control of the processor's overall thermal dissipation - or at least more efficient control than on the Core 2 (which, moreover, does not really give off excessive heat even in its quad-core configuration) - and this great design flexibility will be the key to low-power Core i7 models (whatever their names turn out to be).

Turbo mode and overclocking

The Turbo mode of the Core i7 is not an architectural characteristic but a functionality that Intel has already implemented on certain versions of the Core 2 Mobile under the name IDA (Intel Dynamic Acceleration). This mechanism consists of dynamically and temporarily raising the clock speed of one or several cores when the others are idle. The concept is based on the fact that many applications consist of one or two threads and therefore do not use all the multi-threaded processing potential of a multi-core processor. When the case arises, Turbo mode comes into play (under control of the operating system) and increases the multiplier of the core or cores in question. Nehalem is thus an evolutionary step in the control of internal voltage and clock frequency. Up until now this control was handled by the operating system, and the processor offered external control of the multiplier (FID: Frequency Identifier) and voltage (VID: Voltage Identifier), which

forms the foundation of EIST (Enhanced Intel SpeedStep Technology). With the Core i7, these parameters are no longer external, and the processor alone can change them in order to control its thermal dissipation. Indeed, the processor is capable of estimating the power it consumes at each moment (voltage x current drawn) and of course of controlling it via the frequency and voltage parameters. A Core i7 model will therefore no longer be characterized by its maximum FID and VID but by its maximum TDP. Software (the BIOS and operating system) no longer controls the FID/VID combination as it did with previous generations, but rather Power States (or P-states). These power levels are defined based on the overall TDP of the processor at its maximum frequency outside of Turbo mode. As an example, the Core i7 has an announced TDP of 130 Watts at 2.93 GHz, i.e. 22 x 133 MHz. For each intermediate multiplier, the estimated TDP is roughly: TDP[coeff] = (coeff / max_coeff)^3 x TDP_core + TDP_uncore, the cubic term reflecting the fact that power scales with frequency times the square of the voltage, which itself tracks frequency. The uncore part of the processor (IMC and L3 cache) is not subject to P-states, so 20 of the 130 Watts are constant. At 14 x 133 = 1.86 GHz, for example, we obtain: TDP[14] = (14/22)^3 x 110 + 20 = 48 Watts. The multiplier varies between 12x and 22x, which gives us P-states between roughly 37 and 130 Watts. Turbo mode operates within the framework of this internal control of the processor's overall TDP: the inactivity of one or several cores lowers the overall TDP and thereby gives Turbo mode the opportunity to accelerate the cores that are in demand. With this new protection mechanism via TDP control, the obvious question is overclocking, as the processor's maximum TDP is quickly exceeded. It appears Intel will not set any limitation here and it will be possible to go beyond this value, but of course Turbo mode will then no longer be in effect.
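The P-state estimate can be checked numerically. Power scales roughly with frequency times voltage squared, and voltage tracks frequency, hence the cubic term below - our reconstruction from the 48 W and 37 W figures quoted in the text:

```python
# P-state TDP estimate for the 2.93 GHz (22 x 133 MHz) Core i7.
TDP_TOTAL, TDP_UNCORE, MAX_COEFF = 130, 20, 22
TDP_CORE = TDP_TOTAL - TDP_UNCORE       # 110 W for the cores at 22x

def tdp(coeff):
    # Core power falls with the cube of the frequency ratio; the
    # uncore (IMC + L3) is not subject to P-states and stays constant.
    return (coeff / MAX_COEFF) ** 3 * TDP_CORE + TDP_UNCORE

print(round(tdp(14)))   # -> 48 W at 14 x 133 MHz = 1.86 GHz
print(round(tdp(12)))   # -> 38 W at the 12x floor
print(round(tdp(22)))   # -> 130 W at full frequency
```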
Only the Core i7 Extreme Edition models will allow the TDP ceiling to be modified. Note, however, that these modifiable parameters on the Core i7 XE only concern Turbo mode: should the need arise, it will be possible to raise the maximum multiplier as well as the TDP, but this does not mean that the processor will run under these parameters all the time.

Page 8 The different versions

Different versions of the Nehalem architecture

Nehalem's flexibility enables several versions of the processor, each best fitting the demands of a given platform. It's possible to sketch a profile for each of these variations.

The desktop platform

The desktop PC market is big enough to accommodate several Nehalem models. At first we will see the high end versions (Bloomfield), which will soon be launched. Three models are already planned:
- the Core i7 920: 2.66 GHz, four cores, SMT, an 8 MB L3 cache, integrated DDR3-1066 memory controller, Socket LGA 1366, for an estimated price of around 260 Euros.
- the Core i7 940: characteristics identical to the 920 but clocked at 2.93 GHz, for a price of roughly 500 Euros.
- the Core i7 Extreme Edition 965: the XE version, clocked at 3.2 GHz, for a price of around 1000 Euros. Evidently the supplementary MHz come at a price!
Note that these models will finally have DDR3-1066 support and not 1333. Intel seems to have reserved the higher memory frequency for server models in order to widen the gap a little. Most motherboard manufacturers will release LGA 1366 models at the same time these first Core i7 versions become available. Based on Intel's X58 Tylersburg chipset and combined with an Intel ICH10, these mobos will have six memory slots and at least two PCI-Express 16x slots.

We will have to wait until the third quarter of 2009 for a more affordable model, the Lynnfield. Destined for the new Socket LGA 1160, it will have DDR3 memory support on only two channels and will be equipped with a PCI-Express controller, which will simplify the chipset design (the Ibex Peak will consist of a single chip). Next will come the Havendale in early 2010, a dual core that will directly integrate the graphics component which up until now was found in the chipset. Keep in mind that the gains of the dual core versions over the current Core 2 Duos may be smaller than those of the quad core Core i7s over the Core 2 Quads. Indeed, while Nehalem's cache hierarchy offers optimal performance in a quad core configuration, it is less well suited to a dual core, and certainly less so than the Core 2, which was designed to function exclusively with two cores. Of course, the other improvements of the Nehalem should largely compensate for this.

The server platform

Server versions of Nehalem, known under the codenames Gainestown (DP) and Beckton (MP), will be the best equipped. Planned for release at the same time as the Bloomfield, the Gainestown will share the same characteristics but have a supplementary QPI bus destined for inter-CPU connection. The Beckton will not arrive before 2009 and will be equipped with three or four QPI links.

The mobile platform

The desktop Nehalems Lynnfield and Havendale will have mobile equivalents planned for release at the same time, in the third quarter of 2009 and early 2010 respectively. The Clarksfield will be a quad core and the Auburndale a dual core with an integrated GPU. Intel also mentions a more aggressive Turbo mode on mobile processors, with frequency increases of more than 50%.

Page 9 Conclusion

In designing the Nehalem, Intel largely based itself on the Core 2 while making modifications in many areas. From simple touch-ups to in-depth changes, nothing really went unchanged. A closer look at the Nehalem reveals Intel's desire to free its new architecture from the constraints of its mobile lineage and align it more with server use. Intel has thus striven to renew its bond with server machines, a domain in which AMD is still very present, while retaining a certain modularity that enables it to offer the architecture in all sectors. So will you and I, as individual users, benefit from the transition to the new architecture? Certainly, because applications, even current ones, tend to process a growing volume of data, and what is better suited to processing such large volumes than a processor designed for server use?

With Nehalem, Intel redefines the concept of a multi-platform processor, and this time without any compromise. While AMD has put up a fight until now in the professional world, it may have some problems with Intel's latest creation. After having lost some ground in the home PC sector, AMD may now lose more market share to its giant rival.
