A place to share knowledge and interact with other computer science intellectuals. Spread the Word.

Improving Region Selection in Dynamic Optimization Systems

Reviewed By: 
Jason Mars

This paper proposes two new trace/region selection algorithms. Many dynamic optimizers attempt to detect hot execution paths (i.e., traces) through the control flow graph at runtime. These methods must be lightweight so as not to incur too much overhead. At the time of publication, the de facto method was an algorithm called NET (next executing tail). The first new technique is LEI (last executed iteration), which selects only traces that are cyclic (the branch at the end of the trace jumps back to the beginning). The second is trace combination, whose effect on both NET and LEI is reported below.

Results: 
LEI increases the spanned cycle ratio (the percentage of traces that form a complete cycle) by 6%, reduces code expansion by 8%, and reduces region transitions by 20%. LEI's 90% set cover is 18% lower than NET's. Trace combination reduces both region transitions and the 90% set cover by 15% in the case of NET and 26% in the case of LEI.
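
To make the metrics above concrete, here is a minimal sketch of how they could be computed offline. It is not the paper's implementation. It assumes a trace can be modelled as a (head address, target of final branch) pair, and that the "90% set cover" means the smallest number of traces that together account for 90% of execution. The names `is_cyclic`, `spanned_cycle_ratio`, and `cover_set_size` are made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical trace model: (address of trace head, target of the final branch).
Trace = Tuple[int, int]


def is_cyclic(trace: Trace) -> bool:
    """A trace is cyclic when its final branch jumps back to its own head."""
    head, final_target = trace
    return final_target == head


def spanned_cycle_ratio(traces: List[Trace]) -> float:
    """Fraction of selected traces that form a complete cycle."""
    if not traces:
        return 0.0
    return sum(1 for t in traces if is_cyclic(t)) / len(traces)


def cover_set_size(exec_counts: Dict[str, int], percent: int = 90) -> int:
    """Smallest number of traces whose execution counts reach `percent` of the total.

    Assumption: this is what the review calls the "90% set cover".
    Integer arithmetic avoids floating-point surprises at the boundary.
    """
    total = sum(exec_counts.values())
    covered = 0
    size = 0
    # Greedily take the hottest traces first.
    for count in sorted(exec_counts.values(), reverse=True):
        if covered * 100 >= percent * total:
            break
        covered += count
        size += 1
    return size
```

A smaller cover set means the hot part of the program is captured by fewer traces, which is why a lower number is the better result here.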

ISCA 2008 Didn't Make Internal Deadline

I just got an email today indicating that the ISCA (2008) program committee didn't hit their deadline for the rebuttal period. This email was sent out:

from	ISCA08 Submissions
to	Jason Mars
cc	Matt Frank
date	Jan 15, 2008 10:39 AM
subject	[ISCA 08] Rebuttal period

Dear Jason Mars,

The rebuttal period for ISCA will begin on January 22 (rather than January
16) and close on January 25.

-Wen-mei Hwu

Hardware Atomicity for Reliable Software Speculation

Reviewed By: 
Jason Mars

We can take the dynamic hot paths of a program, optimize them assuming we won't exit early, and patch the new regions into our executing code. However, in the rare case that we need to take a cold path, exiting the hot path early requires undoing these optimizations. This compensation code adds complexity and overhead. In this work, the authors show a scheme for removing this complexity by using hardware structures to enable code region atomicity. They demonstrate their technique with Java dynamic optimization.

Results: 
They use complete hardware simulation to show an improvement in execution time of over 10% (and 20% with aggressive inlining) using their atomicity approach. They also show data on the effects of tweaking the atomicity microarchitecture, and on the coverage and abort rate (~1% on average) of their atomic regions.
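
The idea of replacing compensation code with atomic regions can be illustrated with a small software analogy. This is only a sketch: the paper does this in hardware, while the snippet below uses a deep copy as a stand-in for a hardware checkpoint. `RegionAbort`, `run_atomic`, `hot_path`, and `cold_path` are hypothetical names.

```python
import copy


class RegionAbort(Exception):
    """Raised when an assumption made by the optimized region does not hold."""


def run_atomic(state: dict, hot_path, cold_path) -> str:
    """Run hot_path atomically: either all of its updates commit or none do."""
    checkpoint = copy.deepcopy(state)  # software stand-in for a hardware checkpoint
    try:
        hot_path(state)
        return "committed"
    except RegionAbort:
        # Roll back every speculative update, then take the unoptimized path.
        state.clear()
        state.update(checkpoint)
        cold_path(state)
        return "aborted"


def hot_path(state: dict) -> None:
    # Optimized under the assumption that x is non-negative.
    state["y"] = state["x"] * 2
    if state["x"] < 0:
        raise RegionAbort()


def cold_path(state: dict) -> None:
    # Unoptimized code that handles every case.
    state["y"] = 0 if state["x"] < 0 else state["x"] * 2
```

Because the rollback is generic, the optimizer does not need to emit per-optimization compensation code for early exits. With the ~1% abort rate reported above, the rollback path would run only rarely.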

Dynamic Trace Selection Using Performance Monitoring Hardware Sampling

Reviewed By: 
Jason Mars

In this work, the authors use Itanium's performance monitoring capabilities to detect and form dynamic traces of hot code. The Branch Trace Buffer provided by the Itanium architecture is used to detect dynamic paths and their hotness. Cache performance information about these traces can then be collected. Phases are determined using two tables, the local and the global. The local table holds recent traces; the global table holds predictions of which traces are hot. As long as these tables are 60% similar, we are in a hot phase. A phase change is detected if fewer than 60% of sampled traces are optimized.

Results: 
The hot trace coverage extracted is very good across the SPEC2000 benchmarks. An interval size of 100 samples works much better than 1000 samples. The overhead incurred by the scheme is between 2% (coarser sample size) and 4% (finer sample size).
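
The two-table phase check described above can be sketched as follows. The review does not say how similarity is measured, so this assumes it is the fraction of traces in the local table that also appear in the global table. `table_similarity` and `in_hot_phase` are hypothetical names.

```python
from typing import Iterable, List


def table_similarity(local_table: List[str], global_table: Iterable[str]) -> float:
    """Fraction of recently sampled traces that the global table predicts as hot.

    Assumption: this overlap ratio is the similarity metric; the paper may
    define it differently.
    """
    if not local_table:
        return 0.0
    predicted_hot = set(global_table)
    matches = sum(1 for trace in local_table if trace in predicted_hot)
    return matches / len(local_table)


def in_hot_phase(local_table: List[str], global_table: Iterable[str],
                 threshold: float = 0.6) -> bool:
    """True while the tables stay at least `threshold` similar (60% per the review)."""
    return table_similarity(local_table, global_table) >= threshold
```

For example, if four of five recent traces are already in the global table, the similarity is 0.8 and the phase is considered stable. If only one of five matches, the check reports a phase change.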

An Event-Driven Multithreaded Dynamic Optimization Framework

Reviewed By: 
Jason Mars

This work proposes Trident, a hardware-based framework to support dynamic optimization at the native binary level. The framework uses an intricate hardware-based hot path profiler that uses taken-branch histories to extract hot path information. It also uses specialized trace management performance counters and hardware to manage the trace code cache. In addition, the authors propose hardware support for hot value profiling, which allows low-overhead value specialization optimizations to be implemented.

Results: 
Very nice numbers. They use SMTSIM with SPEC2000. From simply forming the hot traces and performing value specialization, they show an average speedup of over 20%.

Dynamic Compilation: The Benefits of Early Investing

Reviewed By: 
Jason Mars

This work states that round robin is the wrong way to schedule the compilation thread of Java VMs. It then demonstrates that setting a static processor utilization for the compilation thread performs better. Their experiments show that setting the utilization of the compilation thread to 100% gives the best results. In other words, whenever a hotspot is detected, recompile and optimize it immediately and run the compilation to completion.

Results: 
They show an average 18% speedup on a large number of benchmarks, with the biggest improvement over 60%. Performance degrades for only 4 out of 46 benchmark/input pairs, all of them short-running programs.

CellVM: A Homogeneous Virtual Machine Runtime System for a Heterogeneous Single-Chip Multiprocessor

Reviewed By: 
Jason Mars

This work presents a Java virtual machine (JVM) that is designed to present a homogeneous layer of abstraction on top of the heterogeneous Cell processor. Multithreaded Java programs run across both the PPE and the SPEs. The PPE is used for bytecode instructions that involve syscalls. The SPEs are used for all other bytecode instructions. Thread state is kept in main memory, and the local store of each SPE is used as a software-controlled cache.

Results: 
Scalable performance (at least to 8 threads); increases to 6x speedup when going from 1 to 8 SPE cores.
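
To illustrate what using a local store as a software-controlled cache means, here is a minimal sketch. The review does not describe CellVM's actual cache organization, so the direct-mapped layout, the `SoftwareCache` class, and the dictionary standing in for main memory are all assumptions made for illustration.

```python
class SoftwareCache:
    """Direct-mapped cache managed entirely in software (hypothetical layout)."""

    def __init__(self, main_memory: dict, num_lines: int = 4):
        self.main_memory = main_memory  # stands in for memory reached by DMA
        self.num_lines = num_lines
        self.lines = {}                 # slot -> (address, value)
        self.hits = 0
        self.misses = 0

    def load(self, address: int):
        slot = address % self.num_lines
        entry = self.lines.get(slot)
        if entry is not None and entry[0] == address:
            self.hits += 1              # value is already in the local store
            return entry[1]
        # Miss: fetch from main memory and replace whatever occupied the slot.
        self.misses += 1
        value = self.main_memory[address]
        self.lines[slot] = (address, value)
        return value
```

Unlike a hardware cache, every lookup, fill, and eviction here is ordinary code run by the core itself. That is the overhead a runtime on the SPEs has to pay to keep thread state in main memory.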

Welcome to Bit Sect

First and foremost, Bit Sect is a place to come and share knowledge. Everyone has to read papers and keep informed on the work that is out there, so why not write a 'bit' about the paper? Everyone has opinions about the latest conference, news updates about the state of our scene, the recent work that's getting published, and so on, so why not write a piece about it? Bit Sect is a place for you to read and write bits and pieces about cutting-edge research. Secondly, there is the Bit Sect think tank.
