Memory Hierarchy for Multicore and Manycore Processors
Mohamed Zahran and Bushra Ahsan
  Design Issues
  Physical Memory
  Cache Hierarchy Organization
  Cache Hierarchy Sharing
  Cache Hierarchy Optimization
  Cache Coherence
  Support for Memory Consistency Models
  Cache Hierarchy in Light of New Technologies
  Concluding Remarks

FSB: A Flexible Set Balancing Strategy for Last Level Caches
Mohammad Hammoud, Sangyeun Cho, and Rami Melhem
  Introduction
  Motivation and Background
  Flexible Set Balancing (FSB)
  Quantitative Evaluation
  Related Work
  Conclusions and Future Work

The SPARC Processor Architecture
Simone Secchi, Antonino Tumeo, and Oreste Villa
  Introduction
  The SPARC Instruction Set Architecture
  Memory Access
  Synchronization
  The NIAGARA Processor Architecture
  Core Micro-Architecture
  Core Interconnection
  Memory Subsystem
  Niagara Evolutions

The Cilk and Cilk++ Programming Languages
Hans Vandierendonck
  Abstract
  Introduction
  The Cilk Language
  Implementation
  Analyzing Parallelism in Cilk Programs
  Hyperobjects
  Conclusion

Multithreading in the PLASMA Library
Jakub Kurzak, Piotr Luszczek, Asim YarKhan, Mathieu Faverge, Julien Langou, Henricus Bouwmeester, and Jack Dongarra
  Introduction
  Multithreading in PLASMA
  Dynamic Scheduling with QUARK
  Parallel Composition
  Task Aggregation
  Nested Parallelism

Efficient Aho-Corasick String Matching on Emerging Multicore Architectures
Antonino Tumeo, Oreste Villa, Simone Secchi, and Daniel Chavarria-Miranda
  Introduction
  Related Work
  Preliminaries
  Algorithm Design
  Experimental Results
  Conclusions

Sorting on a Graphics Processing Unit (GPU)
Shibdas Bandyopadhyay and Sartaj Sahni
  Graphics Processing Units
  Sorting Numbers on GPUs
  Sorting Records on GPUs

Scheduling DAG Structured Computations
Yinglong Xia and Viktor K. Prasanna
  Introduction
  Background
  Related Work
  Lock-Free Collaborative Scheduling
  Hierarchical Scheduling with Dynamic Thread Grouping
  Conclusion

Evaluating Multicore Processors and Accelerators for Dense Numerical Computations
Seunghwa Kang, Nitin Arora, Aashay Shringarpure, Richard W. Vuduc, and David A. Bader
  Introduction
  Interarchitectural Design Trade-Offs
  Descriptions and Qualitative Analysis of Computational Statistics Kernels
  Baseline Architecture-Specific Implementations for the Computational Statistics Kernels
  Experimental Results for the Computational Statistics Kernels
  Descriptions and Qualitative Analysis of Direct N-Body Kernels
  Direct N-Body Implementations
  Experimental Results and Discussion for the Direct N-Body Implementations
  Conclusions

Sorting on the Cell Broadband Engine
Shibdas Bandyopadhyay, Dolly Sharma, Reda A. Ammar, Sanguthevar Rajasekaran, and Sartaj Sahni
  The Cell Broadband Engine
  High-level Strategies for Sorting
  SPU Vector and Memory Operations
  Sorting Numbers
  Sorting Records

GPU Matrix Multiplication
Junjie Li, Sanjay Ranka, and Sartaj Sahni
  Introduction
  GPU Architecture
  Programming Model
  Occupancy
  Single Core Matrix Multiply
  Multicore Matrix Multiply
  GPU Matrix Multiply
  A Comparison

Backprojection Algorithms for Multicore and GPU Architectures
William Chapman, Sanjay Ranka, Sartaj Sahni, Mark Schmalz, Linda Moore, Uttam Majumder, and Bracy Elton
  Summary of Backprojection
  Partitioning Backprojection for Implementation on a GPU
  Single Core Backprojection
  GPU Backprojection
  Conclusion
  Acknowledgments

Index