What is TLS (Thread-Level Speculation)?
Why Dynamic?
Static thread management analyzes extensive profile information (probability-based data and control dependence profiling). It can extract coarse-grained parallelism (threads of several thousand instructions). However, it is difficult to estimate the performance impact statically, even when extensive profile information is available. Why?
Profiling information cannot accurately predict the costs of speculation, synchronization, and other overheads. The performance impact of speculative threads depends on the underlying hardware configuration. The behavior of speculative threads is input-dependent.
Compilation Infrastructure.
Simulation Infrastructure.
TLS allows the compiler to parallelize a sequential program without first proving independence among the extracted threads.
The underlying hardware keeps track of each memory access.
Without TLS, the compiler is forced to give up parallelizing such code when the memory addresses are unknown at compile time.
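To make the point about unknown addresses concrete, here is a minimal sketch of the kind of check that has to happen at run time. It assumes a hypothetical loop of the form `a[write_idx[i]] = a[read_idx[i]] + 1` and, as a simplification, treats all iterations as if they ran concurrently, so every read of an address written by an earlier iteration counts as a violation. The function name and the index arrays are illustrative only and do not come from the slides.

```python
def find_violations(read_idx, write_idx):
    """Return (writer, reader) iteration pairs with a read-after-write dependence.

    Iteration i reads a[read_idx[i]] and writes a[write_idx[i]]. The index
    arrays are only known at run time, which is why a compiler cannot prove
    the iterations independent ahead of time.
    """
    violations = []
    last_writer = {}  # address -> most recent earlier iteration that writes it
    for i, (r, w) in enumerate(zip(read_idx, write_idx)):
        if r in last_writer:
            # A later iteration consumes a value produced by an earlier one.
            violations.append((last_writer[r], i))
        last_writer[w] = i
    return violations


# Independent iterations: no address is both written and later read.
independent = find_violations([0, 1, 2, 3], [4, 5, 6, 7])

# Dependent iterations: iteration 1 reads what 0 wrote, 3 reads what 2 wrote.
dependent = find_violations([0, 4, 2, 6], [4, 5, 6, 7])

print(independent)
print(dependent)
```

With the first input the loop could run speculatively without any squash; with the second, the speculative threads for iterations 1 and 3 would have to be re-executed.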
Each core has a private first-level (L1) cache, and all cores share an L2 cache.
Compilation Infrastructure
Simulation Infrastructure
The simulation infrastructure is based on a trace-driven, out-of-order superscalar processor simulator. The trace-generation portion (TG) is based on the PIN instrumentation tool.
AS, the simulation portion, reads the trace file and translates the compiler-generated code into Alpha-like code.
The pipeline model is based on SimpleScalar.
Performance Estimation
Cycles for TLS are broken down into 6 segments:
- Busy: cycles spent graduating non-TLS instructions.
- ExeStall: cycles stalled due to lack of ILP.
- iFetch: cycles stalled due to fetch penalty.
- dCache: cycles stalled due to data cache misses.
- Squash: cycles stalled due to speculation failures.
- Others: cycles spent on various TLS overheads.
Runtime Support
- Performance profiling with hardware performance monitors.
- Decision making for TLS.
Hardware performance monitors are programmed to attribute execution cycles to the following categories. Examining the instruction at the head of the reorder buffer (ROB) gives a clue to the cause of a stall.
- Busy: cycles spent graduating instructions.
- ExeStall: cycles stalled due to instruction execution delays.
- iFetch: cycles stalled due to instruction fetch penalty.
- dCache: cycles stalled due to data cache misses.
- Useful Instruction: number of non-TLS instructions committed.
- ThreadCount: number of threads committed.
- Total: cycles elapsed since the beginning of the TLS invocation.
- dCacheServe: for each data cache miss, the number of cycles needed to serve the miss.
Counters are maintained per core. A counter is called aggregated if its value is collected from all the cores.
Counters...
Total is incremented on every clock cycle. At a given cycle, if the ROB is empty, the iFetch counter is incremented. If the instruction at the head of the ROB is able to graduate, the Busy counter is incremented. If the instruction stalled at the head of the ROB is a memory operation, the dCacheServe counter is incremented. If the stalled instruction is a TLS management instruction, such as a thread creation/commit instruction or a synchronization instruction, no counter is incremented. Otherwise, the ExeStall counter is incremented. When a non-TLS-management instruction commits, the Useful Instruction counter is incremented.
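The per-cycle attribution rules above can be sketched as a small software model. This is only an illustration of the decision order as stated on the slide, not the hardware implementation; `RobHead`, `CycleCounters`, and `tick` are hypothetical names, and the state of the ROB head is reduced to three flags for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobHead:
    """Hypothetical summary of the instruction at the head of the ROB."""
    can_graduate: bool
    is_memory_op: bool = False
    is_tls_management: bool = False


@dataclass
class CycleCounters:
    """Per-core cycle counters named after the categories on the slide."""
    total: int = 0
    busy: int = 0
    exe_stall: int = 0
    ifetch: int = 0
    dcache_serve: int = 0


def tick(counters: CycleCounters, head: Optional[RobHead]) -> None:
    """Attribute one clock cycle to at most one stall/busy category."""
    counters.total += 1                 # Total advances on every cycle
    if head is None:                    # empty ROB: blame instruction fetch
        counters.ifetch += 1
    elif head.can_graduate:             # head can graduate this cycle
        counters.busy += 1
    elif head.is_memory_op:             # stalled memory operation
        counters.dcache_serve += 1
    elif head.is_tls_management:        # TLS overhead: no counter is charged
        pass
    else:                               # any other stall
        counters.exe_stall += 1


counters = CycleCounters()
cycle_trace = [
    None,                                                 # ROB empty
    RobHead(can_graduate=True),                           # graduating
    RobHead(can_graduate=False, is_memory_op=True),       # cache miss stall
    RobHead(can_graduate=False, is_tls_management=True),  # TLS overhead
    RobHead(can_graduate=False),                          # execution stall
]
for head in cycle_trace:
    tick(counters, head)
print(counters)
```

Because cycles stalled on TLS management instructions are charged to no category, the difference between Total and the sum of the other cycle counters gives a rough view of the TLS overhead in this model.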
Estimating the cache performance of sequential execution from parallel execution is a complex task. Consider the following scenarios:
- A data item used by a thread is actually brought into the cache by another speculative thread, which then fails.
- A data item in the L1 cache is invalidated by a message from a speculative thread.
- A data item is needed by two threads running on two different cores, causing two cache misses.
Decision Making
To decide which loops to parallelize speculatively, a performance table is maintained for the candidate loops.
Each entry in the table contains two fields:
- a saturating counter, which is incremented if the TLS execution outperforms the predicted sequential execution and decremented otherwise.
- a performance profile summary, which contains the cumulative difference in execution time between the TLS execution and the estimated sequential execution.
After a candidate loop is executed in TLS mode, the main thread updates the table by adding the difference between the TLS execution time and the predicted sequential execution time to the performance profile summary.
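A minimal sketch of such a performance table follows. Several details are assumptions made for illustration and are not given on the slides: the counter range (0 to 3, as in a 2-bit counter), the initial counter value, treating a tie as "TLS did not outperform", and the sign convention of the cumulative difference (TLS cycles minus predicted sequential cycles, so negative values favor TLS). `LoopEntry` and `update_entry` are hypothetical names.

```python
from dataclasses import dataclass

COUNTER_MIN = 0  # assumed lower bound of the saturating counter
COUNTER_MAX = 3  # assumed upper bound (2-bit counter)


@dataclass
class LoopEntry:
    """One performance-table entry for a candidate loop."""
    saturating_counter: int = COUNTER_MIN
    cumulative_diff: int = 0  # sum of (TLS cycles - predicted sequential cycles)


def update_entry(table, loop_id, tls_cycles, predicted_seq_cycles):
    """Update the entry for loop_id after one TLS invocation of the loop."""
    entry = table.setdefault(loop_id, LoopEntry())
    if tls_cycles < predicted_seq_cycles:
        # TLS outperformed the predicted sequential run: count up, saturating.
        entry.saturating_counter = min(entry.saturating_counter + 1, COUNTER_MAX)
    else:
        # Otherwise (including a tie, by assumption): count down, saturating.
        entry.saturating_counter = max(entry.saturating_counter - 1, COUNTER_MIN)
    entry.cumulative_diff += tls_cycles - predicted_seq_cycles
    return entry


table = {}
for _ in range(5):
    update_entry(table, "loop_A", tls_cycles=80, predicted_seq_cycles=100)
update_entry(table, "loop_A", tls_cycles=150, predicted_seq_cycles=100)
print(table["loop_A"])
```

The two fields capture different information: the saturating counter reacts only to the recent win/loss pattern, while the cumulative difference also reflects how large each win or loss was.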
Performance Evaluation
Simple
Quantitative
Quantitative + Static
Quantitative + StaticHint