
Exploiting Parallelism in Multicore Processors through Dynamic Optimizations

Abhimanyu Khosla, M.Tech CSE

What is TLS?

TLS (Thread-Level Speculation) executes potentially dependent threads, called SPECULATIVE THREADS, in parallel.

Why Dynamic?

Static thread management analyses extensive profile information (probability-based data- and control-dependence profiling) and can extract coarse-grained parallelism (several thousand instructions). However, it is difficult to statically estimate the performance impact even when extensive information is available. Why?

Problems with Static Thread Management

Profiling information cannot accurately predict the costs of speculation, synchronization, and other overheads. The performance impact of speculative threads depends on the underlying hardware configuration. Speculative thread behavior is input-dependent.

Speculative threads also exhibit phase behavior.

Experimental Infrastructure for Dynamic Optimizations

- Speculative Thread Execution Model
- Architectural Support
- Compilation Infrastructure
- Simulation Infrastructure

Speculative Thread Execution Model

Allows the compiler to parallelize a sequential program without first proving independence among the extracted threads.
The underlying hardware keeps track of each memory access.

TLS empowers the compiler to parallelize programs that were previously non-parallelizable.

A static compiler is forced to give up parallelizing code whose memory addresses are unknown at compile time.
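As a hypothetical illustration (this loop is mine, not from the slides), consider a loop whose store addresses come through an index array: whether iterations conflict depends on runtime data, so a static compiler must conservatively keep the loop sequential, while TLS can run the iterations speculatively and squash only on an actual conflict.

```python
# Hypothetical example: cross-iteration dependences exist only if the
# index array `idx` contains repeated values, which is unknown at
# compile time. A static compiler must assume the worst and serialize;
# TLS can speculate that the iterations are independent.
def update(a, idx, vals):
    for i in range(len(idx)):
        # a[idx[i]] may or may not alias a[idx[j]] for some j < i
        a[idx[i]] += vals[i]
    return a

# With distinct indices the iterations are truly independent ...
print(update([0, 0, 0, 0], [0, 1, 2, 3], [1, 1, 1, 1]))  # [1, 1, 1, 1]
# ... with repeated indices they are not, and speculation would squash.
print(update([0, 0, 0, 0], [2, 2, 2, 2], [1, 1, 1, 1]))  # [0, 0, 4, 0]
```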

Architectural Support for Speculation (CMP)

Each core has a private first-level cache, and all cores share an L2 cache.

The STAMPede approach is used to support TLS.


Extends the cache coherence protocol with two new states:

- SpS (Speculatively Shared)
- SpE (Speculatively Exclusive)

together with transitions to and from these states.

If a cache line is speculatively loaded, it enters the SpS or SpE state.

All speculative threads are assigned a unique ID.

The thread ID of the sender piggybacks on every invalidation message. If an invalidation message from a logically earlier thread arrives for a cache line in the SpS or SpE state, the thread is squashed and re-executed.
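A minimal sketch of the squash test (names and structure are my own, not taken from the STAMPede protocol specification): an invalidation carrying the sender's thread ID squashes the receiver only if the sender is logically earlier and the line was speculatively accessed.

```python
# Speculative cache-line states (a simplified view of the extended protocol)
SPS = "SpS"   # Speculatively Shared
SPE = "SpE"   # Speculatively Exclusive

def on_invalidation(line_state, sender_tid, my_tid):
    """Return True if the receiving thread must be squashed.

    Thread IDs encode logical (program) order: a smaller ID means a
    logically earlier thread. The sender's ID piggybacks on the
    invalidation message.
    """
    speculative = line_state in (SPS, SPE)
    logically_earlier = sender_tid < my_tid
    return speculative and logically_earlier

# A logically earlier thread invalidating a speculatively loaded line
# violates the speculation, so the consumer is squashed and re-executed.
assert on_invalidation(SPS, sender_tid=3, my_tid=5) is True
# Invalidations from logically later threads are harmless.
assert on_invalidation(SPE, sender_tid=7, my_tid=5) is False
# Non-speculative lines never cause a squash.
assert on_invalidation("Shared", sender_tid=3, my_tid=5) is False
```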

Compilation Infrastructure

Built on the Open64 compiler.

Extended to extract speculative threads from loops. So that the runtime can decide where speculative threads should be spawned, the compiler creates an executable in which every candidate loop is parallelized.

Simulation Infrastructure

Based on a trace-driven, out-of-order superscalar processor simulator. The trace generation (TG) portion is based on the Pin instrumentation tool.

The architectural simulation (AS) portion is based on SimpleScalar.


TG instruments all instructions to extract the instruction address, registers used, opcode, etc.

AS reads the trace file and translates the compiler-generated code into Alpha-like code.
The pipeline model is based on SimpleScalar.

Wattch models processor power consumption.


Orion models interconnect power consumption; CACTI models the caches.

Performance Estimation

TLS cycles are broken down into six segments:
- Busy: cycles spent graduating non-TLS instructions.
- ExeStall: cycles stalled due to lack of ILP.
- iFetch: cycles stalled due to fetch penalty.
- dCache: cycles stalled due to data cache misses.
- Squash: cycles stalled due to speculation failures.
- Others: cycles spent on various TLS overheads.

Deriving the sequential execution time from the TLS cycle breakdown:

Pseq = T_TLS - Squash - Others
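Reading Squash and Others as the TLS-only overheads, the estimate can be sketched as follows (the function and field names are illustrative):

```python
def estimate_seq_time(breakdown):
    """Estimate sequential execution time from a TLS cycle breakdown.

    The Busy, ExeStall, iFetch and dCache cycles would also be paid by
    a sequential run; Squash and Others are TLS-specific, so the
    sequential time is approximated by subtracting them from the total.
    """
    total = sum(breakdown.values())
    return total - breakdown["Squash"] - breakdown["Others"]

tls = {"Busy": 500, "ExeStall": 120, "iFetch": 60,
       "dCache": 200, "Squash": 90, "Others": 30}
print(estimate_seq_time(tls))  # 880
```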

Runtime Support
- Performance profiling with hardware performance monitors.
- Decision making for TLS.

Performance Profile with Hardware Monitors

Hardware performance monitors are programmed to attribute execution cycles to the following categories; examining the instruction at the head of the ROB gives a clue to the cause of a stall.
- Busy: cycles spent graduating instructions.
- ExeStall: cycles stalled due to instruction execution delays.
- iFetch: cycles stalled due to instruction fetch penalty.
- dCache: cycles stalled due to data cache misses.
- UsefulInstruction: number of non-TLS instructions committed.
- ThreadCount: number of threads committed.
- Total: cycles elapsed since the beginning of the TLS invocation.
- dCacheServe: for each data cache miss, the number of cycles needed to serve the miss.

Counters are maintained per core. A counter is said to be aggregated if its value is accumulated across all cores.

Counters...

Total is incremented on every clock cycle. At a given cycle:
- If the ROB is empty, the iFetch counter is incremented.
- If the instruction at the head of the ROB is able to graduate, the Busy counter is incremented.
- If the instruction stalled at the head of the ROB is a memory operation, the dCacheServe counter is incremented.
- If the stalled instruction is a TLS management instruction, such as a thread creation/commit instruction or a synchronization instruction, no counter is incremented.
- Otherwise, the ExeStall counter is incremented.
When a non-TLS-management instruction commits, the UsefulInstruction counter is incremented.
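The per-cycle bookkeeping above can be sketched as follows (a simplified model; the names are illustrative):

```python
def update_counters(counters, rob_head):
    """Attribute one cycle to a counter based on the ROB head.

    `rob_head` is None when the ROB is empty; otherwise it is a dict
    describing the instruction at the head of the reorder buffer.
    """
    counters["Total"] += 1
    if rob_head is None:                    # ROB empty: fetch stall
        counters["iFetch"] += 1
    elif rob_head["can_graduate"]:          # head graduates this cycle
        counters["Busy"] += 1
    elif rob_head["is_memory_op"]:          # stalled serving a cache miss
        counters["dCacheServe"] += 1
    elif rob_head["is_tls_management"]:     # spawn/commit/sync: no counter
        pass
    else:                                   # generic execution delay
        counters["ExeStall"] += 1

c = {"Total": 0, "iFetch": 0, "Busy": 0, "dCacheServe": 0, "ExeStall": 0}
update_counters(c, None)                                  # fetch stall
update_counters(c, {"can_graduate": True, "is_memory_op": False,
                    "is_tls_management": False})          # graduation
print(c["Total"], c["iFetch"], c["Busy"])  # 2 1 1
```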

Aggregating The Counters

When a thread is spawned on a core, the counters on that core are reset.

When a thread commits (only the nonspeculative thread is allowed to commit), all the aggregated counters are forwarded to the next nonspeculative thread and ThreadCount is incremented. When a speculative thread becomes nonspeculative, it aggregates the forwarded counters with its own counters.
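The commit-time hand-off can be sketched like this (a simplification; the counter names are illustrative):

```python
def on_spawn(core_counters):
    """Reset a core's counters when a thread is spawned onto it."""
    for name in core_counters:
        core_counters[name] = 0

def on_commit(committing, forwarded):
    """Merge the committing thread's counters with the counters it was
    forwarded, bump ThreadCount, and hand the result to the next
    nonspeculative thread."""
    merged = {k: committing.get(k, 0) + forwarded.get(k, 0)
              for k in set(committing) | set(forwarded)}
    merged["ThreadCount"] = merged.get("ThreadCount", 0) + 1
    return merged

t1 = {"Busy": 100, "ThreadCount": 0}
t2 = {"Busy": 40}
forwarded = on_commit(t1, {})        # thread 1 commits
result = on_commit(t2, forwarded)    # thread 2 becomes nonspeculative, commits
print(result["Busy"], result["ThreadCount"])  # 140 2
```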

Counting Cycles for Data Cache Misses

Estimating the cache performance of sequential execution from parallel execution is a complex task. Consider the following scenarios:
- A data item used by a thread was actually brought into the cache by another speculative thread that later fails.
- A data item in the L1 cache is invalidated by a message from a speculative thread.
- A data item is needed by two threads running on two different cores, causing two cache misses.

Decision Making

To decide which loops to parallelize speculatively, a performance table is maintained for the candidate loops.
Each table entry contains two fields:

- a saturation counter, which is incremented if the TLS execution outperforms the predicted sequential execution and decremented otherwise;
- a performance profile summary, which holds the cumulative difference in execution time between the TLS execution and the estimated sequential execution.

After a candidate loop is executed in TLS mode, the main thread updates the table by adding the difference between the TLS execution time and the predicted sequential execution time to the performance summary.
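A sketch of such a performance-table entry, using a 2-bit saturating counter (the counter width and decision threshold are my assumptions, not stated in the slides):

```python
class LoopEntry:
    """Performance-table entry for one candidate loop."""

    def __init__(self, counter_max=3):
        self.sat_counter = 0       # saturating counter in [0, counter_max]
        self.counter_max = counter_max
        self.perf_summary = 0      # cumulative (T_tls - T_seq_est)

    def update(self, t_tls, t_seq_est):
        # Accumulate the difference between TLS and predicted sequential time.
        self.perf_summary += t_tls - t_seq_est
        if t_tls < t_seq_est:      # TLS outperformed the estimate
            self.sat_counter = min(self.sat_counter + 1, self.counter_max)
        else:
            self.sat_counter = max(self.sat_counter - 1, 0)

    def should_speculate(self):
        # Keep parallelizing speculatively while the counter leans favorable.
        return self.sat_counter >= 2

entry = LoopEntry()
entry.update(t_tls=800, t_seq_est=1000)   # TLS won by 200 cycles
entry.update(t_tls=850, t_seq_est=1000)   # TLS won again
print(entry.should_speculate(), entry.perf_summary)  # True -350
```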

Performance Evaluation

A dynamic performance tuning method is required to

- identify the loops that can take maximum advantage of TLS, and
- select the right level of the loop nest to be parallelized.

Dynamic Performance Tuning Policies

- Simple
- Quantitative
- Quantitative + Static
- Quantitative + StaticHint
