◇ 数値線形代数に関する講演 ◇
日時: 7月24日(火)10:00~12:00
場所: 東京大学 理学部 7 号館 202 会議室
http://www.u-tokyo.ac.jp/ campusmap/cam01_06_06_j.html
講演者: Edgar Solomonik (Dept. of EECS, University of California, Berkeley, U. S. A.)
題目: 2.5D algorithms for dense linear algebra
概要:
Parallel algorithms can use redundant memory to avoid communication and achieve better parallel scalability. We introduce a new parallelization scheme for dense linear algebra computations that exploits any extra available memory adaptively. The scheme is motivated by the classical 2D and 3D matrix multiplication algorithms. The 2D algorithm assigns each processor a 2D matrix block, while the 3D algorithm reduces communication by assigning each processor a cubic block of computational work. 2.5D algorithms generalize 2D and 3D algorithms, using the largest blocks which can fit in memory in order to minimize communication. We start with 2.5D matrix multiplication then introduce 2.5D algorithms for LU, Cholesky, triangular solve, and Cholesky-QR. Communication cost analysis demonstrates that these 2.5D algorithms have optimal bandwidth and latency costs. The 2.5D matrix multiplication algorithm also achieves perfect strong scaling complexity. The theoretical improvement in communication cost is matched by practical topology-aware mapping advantages of the virtual decomposition. A performance study on a Blue Gene/P supercomputer demonstrates significant improvements in strong scalability and efficiency at 65,536 cores.
日時: 7月24日(火)10:00~12:00
場所: 東京大学 理学部 7 号館 202 会議室
http://www.u-tokyo.ac.jp/
講演者: Edgar Solomonik (Dept. of EECS, University of California, Berkeley, U. S. A.)
題目: 2.5D algorithms for dense linear algebra
概要:
Parallel algorithms can use redundant memory to avoid communication and achieve better parallel scalability. We introduce a new parallelization scheme for dense linear algebra computations that exploits any extra available memory adaptively. The scheme is motivated by the classical 2D and 3D matrix multiplication algorithms. The 2D algorithm assigns each processor a 2D matrix block, while the 3D algorithm reduces communication by assigning each processor a cubic block of computational work. 2.5D algorithms generalize 2D and 3D algorithms, using the largest blocks which can fit in memory in order to minimize communication. We start with 2.5D matrix multiplication then introduce 2.5D algorithms for LU, Cholesky, triangular solve, and Cholesky-QR. Communication cost analysis demonstrates that these 2.5D algorithms have optimal bandwidth and latency costs. The 2.5D matrix multiplication algorithm also achieves perfect strong scaling complexity. The theoretical improvement in communication cost is matched by practical topology-aware mapping advantages of the virtual decomposition. A performance study on a Blue Gene/P supercomputer demonstrates significant improvements in strong scalability and efficiency at 65,536 cores.
Categories: