1 Department of Applied Mathematics and Computer Science, Technical University of Denmark
2 Embedded Systems Engineering, Department of Applied Mathematics and Computer Science, Technical University of Denmark
3 Department of Informatics and Mathematical Modeling, Technical University of Denmark
4 Chalmers University of Technology
5 IBM Research Laboratory
The performance of many parallel applications relies not on instruction-level parallelism but on loop-level parallelism. Unfortunately, automatic parallelization of loops is a fragile process; many different obstacles affect or prevent it in practice. To address this predicament we developed an interactive compilation feedback system that guides programmers in iteratively modifying their application source code, helping them leverage the compiler's ability to generate loop-parallel code. We employ our system to modify two sequential benchmarks dealing with image processing and edge detection, producing scalable parallelized code that runs up to 8.3 times faster on an eight-core Intel Xeon 5570 system and up to 12.5 times faster on a quad-core IBM POWER6 system. Benchmark performance varies significantly between the systems, which suggests that semi-automatic parallelization should be combined with target-specific optimizations. Furthermore, comparing the first benchmark against manually parallelized, hand-optimized pthreads and OpenMP versions, we find that code generated using our approach typically outperforms the pthreads code (within 93-339%) and performs competitively against the OpenMP code (within 75-111%). The second benchmark outperforms manually parallelized and optimized OpenMP code (within 109-242%).
Proceedings of the International Conference on Parallel Processing, 2012, pp. 410-419
41st International Conference on Parallel Processing (ICPP 2012)