umu.se Publications
1 - 7 of 7
  • 1.
    Elmroth, Erik
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Gustavson, F. G.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Applying recursion to serial and parallel QR factorization leads to better performance. 2000. In: IBM Journal of Research and Development, ISSN 0018-8646, E-ISSN 2151-8556, Vol. 44, no 4, p. 605-624. Article in journal (Refereed)
    Abstract [en]

    We present new recursive serial and parallel algorithms for QR factorization of an m by n matrix. They improve performance. The recursion leads to an automatic variable blocking, and it also replaces a Level 2 part in a standard block algorithm with Level 3 operations. However, there are significant additional costs for creating and performing the updates, which prohibit the efficient use of the recursion for large n. We present a quantitative analysis of these extra costs. This analysis leads us to introduce a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by about 20% for large square matrices and up to almost a factor of 3 for tall thin matrices. Uniprocessor performance results are presented for two IBM RS/6000(R) SP nodes: a 120-MHz IBM POWER2 node and one processor of a four-way 332-MHz IBM PowerPC(R) 604e SMP node. The hybrid recursive algorithm reaches more than 90% of the theoretical peak performance of the POWER2 node. Compared to standard block algorithms, the recursive approach also shows a significant advantage in the automatic tuning obtained from its automatic variable blocking. A successful parallel implementation on a four-way 332-MHz IBM PPC604e SMP node based on dynamic load balancing is presented. For two, three, and four processors it shows speedups of up to 1.97, 2.99, and 3.97.
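
    The column-halving recursion described in this abstract can be illustrated with a minimal sketch. Note the assumptions: this is a recursive block Gram-Schmidt variant, not the Householder-based algorithm of the paper, and it is not the hybrid algorithm. It assumes a matrix with full column rank, and the names `recursive_qr`, `matmul`, and `transpose` are hypothetical helpers chosen for illustration. It only shows how splitting the columns in half produces blocks of varying width without a tuning parameter.

```python
import math

def matmul(A, B):
    # Plain dense product of two list-of-lists matrices.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def recursive_qr(A):
    """Return (Q, R) with A = Q R, Q having orthonormal columns.

    Illustrative recursive block Gram-Schmidt; assumes full column rank.
    """
    n = len(A[0])
    if n == 1:
        # Base case: normalise a single column.
        norm = math.sqrt(sum(row[0] ** 2 for row in A))
        return [[row[0] / norm] for row in A], [[norm]]
    k = n // 2                                   # split the columns in half
    A1 = [row[:k] for row in A]
    A2 = [row[k:] for row in A]
    Q1, R11 = recursive_qr(A1)                   # factor the left half
    R12 = matmul(transpose(Q1), A2)              # matrix-matrix update of the right half
    Q1R12 = matmul(Q1, R12)
    A2u = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A2, Q1R12)]
    Q2, R22 = recursive_qr(A2u)                  # factor the updated right half
    Q = [r1 + r2 for r1, r2 in zip(Q1, Q2)]
    R = [r11 + r12 for r11, r12 in zip(R11, R12)]
    R += [[0.0] * k + r22 for r22 in R22]
    return Q, R
```

    In this sketch the updates of the right half are matrix-matrix products, which is the property the abstract attributes to the recursion; the extra cost of forming those updates is what the hybrid algorithm in the paper addresses.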

  • 2.
    Elmroth, Erik
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Gustavson, Fred G
    New serial and parallel recursive QR factorization algorithms for SMP systems. 1998. In: Applied parallel computing: large scale scientific and industrial problems: 4th international workshop, PARA '98, Umeå, Sweden, June 14-17, 1998: proceedings / [ed] Bo Kågström, Jack Dongarra, Erik Elmroth, Jerzy Wasniewski, Heidelberg/Berlin, Germany: Springer, 1998, Vol. 1541, p. 120-128. Conference paper (Other academic)
    Abstract [en]

    We present a new recursive algorithm for the QR factorization of an m by n matrix A. The recursion leads to an automatic variable blocking that allows us to replace a level 2 part in a standard block algorithm by level 3 operations. However, there are some additional costs for performing the updates, which prohibit the efficient use of the recursion for large n. This obstacle is overcome by using a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by 78% to 21% as m=n increases from 100 to 1000. A successful parallel implementation on a PowerPC 604 based IBM SMP node based on dynamic load balancing is presented. For 2, 3, 4 processors and m=n=2000 it shows speedups of 1.96, 2.99, and 3.92 compared to our uniprocessor algorithm.

  • 3.
    Gustavson, Fred G.
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. IBM T.J. Watson Research Center, New York, USA.
    Herrero, Jose R.
    Morancho, Enric
    A Square Block Format for Symmetric Band Matrices. 2014. In: Parallel Processing and Applied Mathematics: 10th International Conference, PPAM 2013, Warsaw, Poland, September 8–11, 2013, Revised Selected Papers, Part I / [ed] R. Wyrzykowski, J. Dongarra, K. Karczewski, J. Wasniewski, Springer Berlin/Heidelberg, 2014, p. 683-689. Conference paper (Refereed)
    Abstract [en]

    This contribution describes a Square Block (SB) format for storing a banded symmetric matrix. The format is obtained by rearranging the LAPACK Band Layout "in place" into an SB layout, in which submatrices are stored as a set of square blocks. The new format reduces storage space, provides higher locality of memory accesses, results in regular access patterns, and exposes parallelism.

  • 4.
    Gustavson, Fred G.
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Wasniewski, Jerzy
    Dongarra, Jack J.
    Herrero, Jose R.
    Langou, Julien
    Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms. 2013. In: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 39, no 2, p. 9-. Article in journal (Refereed)
    Abstract [en]

    Four routines called DPOTF3i, i = a, b, c, d, are presented. DPOTF3i are a novel type of level-3 BLAS for use by BPF (Blocked Packed Format) Cholesky factorization and LAPACK routine DPOTRF. Performance of the DPOTF3i routines is still increasing when the performance of Level-2 routine DPOTF2 of LAPACK starts decreasing. This is our main result and it implies, due to the use of larger block size nb, that DGEMM, DSYRK, and DTRSM performance also increases! The four DPOTF3i routines use simple register blocking. Different platforms have different numbers of registers. Thus, our four routines have different register blocking sizes. BPF is introduced. LAPACK routines for POTRF and PPTRF using BPF instead of full and packed format are shown to be trivial modifications of LAPACK POTRF source codes. We call these codes BPTRF. There are two variants of BPF: lower and upper. Upper BPF is "identical" to Square Block Packed Format (SBPF). "LAPACK" implementations on multicore processors use SBPF. Lower BPF is less efficient than upper BPF. Vector inplace transposition converts lower BPF to upper BPF very efficiently. Corroborating performance results for DPOTF3i versus DPOTF2 on a variety of common platforms are given for n ≈ nb as well as results for large n comparing DBPTRF versus DPOTRF.

  • 5.
    Gustavson, Fred G.
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Wasniewski, Jerzy
    Dongarra, Jack J.
    Langou, Julien
    Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution, and Inversion. 2010. In: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 37, no 2, p. 1-21, article id 18. Article in journal (Refereed)
    Abstract [en]

    We describe a new data format for storing triangular, symmetric, and Hermitian matrices called Rectangular Full Packed Format (RFPF). The standard two-dimensional arrays of Fortran and C (also known as full format) that are used to represent triangular and symmetric matrices waste nearly half of the storage space but provide high performance via the use of Level 3 BLAS. Standard packed format arrays fully utilize storage (array space) but provide low performance as there is no Level 3 packed BLAS. We combine the good features of packed and full storage using RFPF to obtain high performance via using Level 3 BLAS as RFPF is a standard full-format representation. Also, RFPF requires exactly the same minimal storage as the packed format. Each LAPACK full and/or packed triangular, symmetric, and Hermitian routine becomes a single new RFPF routine based on eight possible data layouts of RFPF. This new RFPF routine usually consists of two calls to the corresponding LAPACK full-format routine and two calls to Level 3 BLAS routines. This means no new software is required. As examples, we present LAPACK routines for Cholesky factorization, Cholesky solution, and Cholesky inverse computation in RFPF to illustrate this new work and to describe its performance on several commonly used computer platforms. Performance of LAPACK full routines using RFPF versus LAPACK full routines using the standard format for both serial and SMP parallel processing is about the same while using half the storage. Using vendor LAPACK full routines with RFPF is roughly 1 to 43 times faster for serial processing, and 1 to 97 times faster for SMP parallel processing, than using vendor and/or reference packed routines.
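
    The storage claim above can be made concrete with a small sketch of one of the eight layouts. It assumes the lower-triangular, even-order, non-transposed case as it is documented for LAPACK's RFP routines, where an n by n lower triangle is stored in an (n+1) by n/2 rectangle. The function name `pack_rfp_lower_even` is hypothetical, and the index mapping should be checked against the LAPACK documentation before being relied on.

```python
def pack_rfp_lower_even(A):
    """Pack the lower triangle of a square matrix of even order n into an
    (n+1) x (n/2) rectangle (one RFPF layout; illustrative sketch only)."""
    n = len(A)
    if n % 2 != 0:
        raise ValueError("this sketch handles even n only")
    k = n // 2
    rfp = [[None] * k for _ in range(n + 1)]
    for j in range(n):
        for i in range(j, n):                 # lower triangle: i >= j
            if j < k:
                # The first k columns keep their shape, shifted down one row.
                rfp[i + 1][j] = A[i][j]
            else:
                # The trailing k x k triangle is stored transposed in the
                # otherwise unused upper-left corner.
                rfp[j - k][i - k] = A[i][j]
    return rfp
```

    For n = 6 the rectangle has 7 x 3 = 21 entries, which equals n(n+1)/2, the packed-format minimum, whereas full format needs 36. Because the result is an ordinary two-dimensional array, its sub-blocks can be passed to full-format Level 3 BLAS routines.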

  • 6.
    Gustavson, Fred
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Karlsson, Lars
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Kågström, Bo
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Distributed SBP Cholesky factorization algorithms with near-optimal scheduling. 2009. In: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 36, no 2, p. 11:1-11:25. Article in journal (Refereed)
  • 7.
    Gustavson, Fred
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Karlsson, Lars
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Kågström, Bo
    Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion. 2012. In: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 38, no 3, p. 17:1-17:32. Article in journal (Refereed)
    Abstract [en]

    Techniques and algorithms for efficient in-place conversion to and from standard and blocked matrix storage formats are described. Such functionality is required by numerical libraries that use different data layouts internally. Parallel algorithms and a software package for in-place matrix storage format conversion based on in-place matrix transposition are presented and evaluated. A new algorithm for in-place transposition which efficiently determines the structure of the transposition permutation a priori is one of the key ingredients. It enables effective load balancing in a parallel environment.
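
    As background for the in-place transposition mentioned above, the following is a minimal sketch of the textbook cycle-following approach, not the algorithm of the paper. It assumes a row-major m by n matrix stored in a flat list and uses a visited list of flags to find cycle leaders, so it is only in place with respect to the matrix data. The name `transpose_in_place` is hypothetical.

```python
def transpose_in_place(a, m, n):
    """Transpose a row-major m x n matrix stored in the flat list `a`.

    Textbook cycle-following sketch: the element at index p moves to index
    (p * m) mod (m * n - 1); the first and last elements stay where they are.
    """
    size = m * n
    visited = [False] * size                  # auxiliary flags, one per element
    for start in range(1, size - 1):
        if visited[start]:
            continue
        p = start
        carried = a[start]                    # value currently being moved
        while True:
            dest = (p * m) % (size - 1)       # destination of the value from index p
            a[dest], carried = carried, a[dest]
            visited[dest] = True
            p = dest
            if p == start:                    # cycle closed
                break
```

    The cycles of this permutation are independent of each other, which is why knowing their structure in advance, as the abstract describes, helps with distributing work across processors.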
