Umeå University's logo

umu.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Distributed SBP Cholesky factorization algorithms with near-optimal scheduling
Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Compting Center North (HPC2N).
Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Compting Center North (HPC2N). (UMIT)ORCID iD: 0000-0002-4675-7434
Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Compting Center North (HPC2N). (eSSENCE, UMIT, Parallel and Scientific Computing)
2009 (English)In: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 36, no 2, p. 11:1-11:25Article in journal (Refereed) Published
Place, publisher, year, edition, pages
2009. Vol. 36, no 2, p. 11:1-11:25
Identifiers
URN: urn:nbn:se:umu:diva-24692DOI: 10.1145/1499096.1499100Scopus ID: 2-s2.0-65849486487OAI: oai:DiVA.org:umu-24692DiVA, id: diva2:227185
Available from: 2009-07-09 Created: 2009-07-09 Last updated: 2023-03-23Bibliographically approved
In thesis
1. Scheduling of parallel matrix computations and data layout conversion for HPC and Multi-Core Architectures
Open this publication in new window or tab >>Scheduling of parallel matrix computations and data layout conversion for HPC and Multi-Core Architectures
2011 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Dense linear algebra represents fundamental building blocks in many computational science and engineering applications. The dense linear algebra algorithms must be numerically stable, robust, and reliable in order to be usable as black-box solvers by expert as well as non-expert users. The algorithms also need to scale and run efficiently on massively parallel computers with multi-core nodes. Developing high-performance algorithms for dense matrix computations is a challenging task, especially since the widespread adoption of multi-core architectures. Cache reuse is an even more critical issue on multi-core processors than on uni-core processors due to their larger computational power and more complex memory hierarchies. Blocked matrix storage formats, in which blocks of the matrix are stored contiguously, and blocked algorithms, in which the algorithms exhibit large amounts of cache reuse, remain key techniques in the effort to approach the theoretical peak performance.

In Paper I, we present a packed and distributed Cholesky factorization algorithm based on a new blocked and packed matrix storage format. High performance node computations are obtained as a result of the blocked storage format, and the use of look-ahead leads to improved parallel efficiency. In Paper II and Paper III, we study the problem of in-place matrix transposition in general and in-place matrix storage format conversion in particular. We present and evaluate new high-performance parallel algorithms for in-place conversion between the standard column-major and row-major formats and the four standard blocked matrix storage formats. Another critical issue, besides cache reuse, is that of efficient scheduling of computational tasks. Many weakly scalable parallel algorithms are efficient only when the problem size per processor is relatively large. A current research trend focuses on developing parallel algorithms which are more strongly scalable and hence more efficient also for smaller problems.

In Paper IV, we present a framework for dynamic node-scheduling of two-sided matrix computations and demonstrate that by using priority-based scheduling one can obtain an efficient scheduling of a QR sweep. In Paper V and Paper VI, we present a blocked implementation of two-stage Hessenberg reduction targeting multi-core architectures. The main contributions of Paper V are in the blocking and scheduling of the second stage. Specifically, we show that the concept of look-ahead can be applied also to this two-sided factorization, and we propose an adaptive load-balancing technique that allow us to schedule the operations effectively.

Abstract [sv]

Matrisberäkningar är fundamentala byggblock imånga beräkningstunga teknisk-vetenskapliga applikationer. Algoritmerna måste vara numeriskt stabila och robusta för att användaren ska kunna förlita sig på de beräknade resultaten. Algoritmerna måste dessutom skala och kunna köras effektivt på massivt parallella datorer med noder bestående av flerkärniga processorer.

Det är utmanande att uveckla högpresterande algoritmer för täta matrisberäkningar, särskilt sedan introduktionen av flerkärniga processorer. Det är ännu viktigare att återanvända data i cache-minnena i en flerkärnig processor på grund av dess höga beräkningsprestanda. Två centrala tekniker i strävan efter algoritmer med optimal prestanda är blockade algoritmer och blockade matrislagringsformat. En blockad algoritm har ett minnesåtkomstmönster som passar minneshierarkin väl. Ett blockat matrislagringsformat placerar matrisens element i minnet så att elementen i specifika matrisblock lagras konsekutivt.

I Artikel I presenteras en algoritm för Cholesky-faktorisering av en matris kompakt lagrad i ett distribuerat minne. Det nya lagringsformatet är blockat och möjliggör därigenom hög prestanda. Artikel II och Artikel III beskriver hur en konventionellt lagrad matris kan konverteras till och från ett blockat lagringsformat med hjälp av en ytterst liten mängd extra lagringsutrymme. Lösningen bygger på en ny parallell algoritm för matristransponering av rektangulära matriser.

Vid skapandet av en skalbar parallell algoritm måste man även beakta hur de olika beräkningsuppgifterna schemaläggs på ett effektivt sätt. Många så kallade svagt skalbara algoritmer är effektiva endast för relativt stora problem. En nuvarande forskningstrend är att utveckla så kallade starkt skalbara algoritmer, vilka är mer effektiva även för mindre problem.

Artikel IV introducerar ett dynamiskt schemaläggningssystem för två-sidiga matrisberäkningar. Beräkningsuppgifterna fördelas statiskt på noderna och schemaläggs sedan dynamiskt inom varje nod. Artikeln visar även hur prioritetsbaserad schemaläggning tar en tidigare ineffektiv algoritm för ett så kallat QR-svep och gör den effektiv.

Artikel V och Artikel VI presenterar nya parallella blockade algoritmer, designade för flerkärniga processorer, för en två-stegs Hessenberg-reduktion. De centrala bidragen i Artikel V utgörs av en blockad algoritm för reduktionens andra steg samt en adaptiv lastbalanseringsmetod.

Place, publisher, year, edition, pages
Umeå: Institutionen för datavetenskap, Umeå universitet, 2011
Series
Report / UMINF, ISSN 0348-0542 ; 11.04
National Category
Information Systems
Research subject
computer and systems sciences
Identifiers
urn:nbn:se:umu:diva-41224 (URN)978-91-7459-214-6 (ISBN)
Public defence
2011-05-19, KBC-huset, KB3B1, Umeå Universitet, Umeå, 13:00 (English)
Opponent
Supervisors
Available from: 2011-04-28 Created: 2011-03-21 Last updated: 2021-03-18Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full textScopus

Authority records

Gustavson, FredKarlsson, LarsKågström, Bo

Search in DiVA

By author/editor
Gustavson, FredKarlsson, LarsKågström, Bo
By organisation
Department of Computing ScienceHigh Performance Compting Center North (HPC2N)
In the same journal
ACM Transactions on Mathematical Software

Search outside of DiVA

GoogleGoogle Scholar

doi
urn-nbn

Altmetric score

doi
urn-nbn
Total: 397 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf