umu.se Publications

1 - 12 of 12
  • 1.
    Elmroth, Erik
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Gustavson, F. G.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Applying recursion to serial and parallel QR factorization leads to better performance (2000). In: IBM Journal of Research and Development, ISSN 0018-8646, E-ISSN 2151-8556, Vol. 44, no. 4, pp. 605-624. Journal article (Refereed)
    Abstract [en]

    We present new recursive serial and parallel algorithms for QR factorization of an m by n matrix. They improve performance. The recursion leads to an automatic variable blocking, and it also replaces a Level 2 part in a standard block algorithm with Level 3 operations. However, there are significant additional costs for creating and performing the updates, which prohibit the efficient use of the recursion for large n. We present a quantitative analysis of these extra costs. This analysis leads us to introduce a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by about 20% for large square matrices and up to almost a factor of 3 for tall thin matrices. Uniprocessor performance results are presented for two IBM RS/6000(R) SP nodes: a 120-MHz IBM POWER2 node and one processor of a four-way 332-MHz IBM PowerPC(R) 604e SMP node. The hybrid recursive algorithm reaches more than 90% of the theoretical peak performance of the POWER2 node. Compared to standard block algorithms, the recursive approach also shows a significant advantage in the automatic tuning obtained from its automatic variable blocking. A successful parallel implementation on a four-way 332-MHz IBM PPC604e SMP node based on dynamic load balancing is presented. For two, three, and four processors it shows speedups of up to 1.97, 2.99, and 3.97, respectively.
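
    The recursive scheme can be sketched in a few lines. The following is a minimal NumPy illustration of the idea described above, not the authors' Fortran implementation: the column block is split in half, the left half is factored recursively, the right half receives a Level 3 style update, and the trailing block is factored recursively. Here np.linalg.qr serves as the base-case kernel, and the cutoff min_cols is an illustrative stand-in for the hybrid algorithm's crossover point.

        import numpy as np

        def recursive_qr(A, min_cols=8):
            """Return (Q, R) with A = Q @ R; recursion on column blocks (assumes m >= n)."""
            A = np.asarray(A, dtype=float)
            m, n = A.shape
            if n <= min_cols:
                # Base case: a standard factorization kernel.
                return np.linalg.qr(A, mode="complete")
            k = n // 2
            Q1, R1 = recursive_qr(A[:, :k], min_cols)   # factor the left column block
            T = Q1.T @ A[:, k:]                         # Level 3 style update of the right block
            R12, B = T[:k, :], T[k:, :]
            Q2, R22 = recursive_qr(B, min_cols)         # recur on the trailing block
            Q = Q1.copy()
            Q[:, k:] = Q1[:, k:] @ Q2                   # Q = Q1 * diag(I_k, Q2)
            R = np.zeros((m, n))
            R[:k, :k], R[:k, k:], R[k:, k:] = R1[:k, :k], R12, R22
            return Q, R

        A = np.random.rand(300, 200)
        Q, R = recursive_qr(A)
        assert np.allclose(Q @ R, A) and np.allclose(Q.T @ Q, np.eye(300))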

  • 2.
    Elmroth, Erik
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Gustavson, Fred G
    New serial and parallel recursive QR factorization algorithms for SMP systems (1998). In: Applied parallel computing: large scale scientific and industrial problems: 4th international workshop, PARA '98, Umeå, Sweden, June 14-17, 1998: proceedings / [ed] Bo Kågström, Jack Dongarra, Erik Elmroth, Jerzy Wasniewski, Heidelberg/Berlin, Germany: Springer, 1998, Vol. 1541, pp. 120-128. Conference paper (Other academic)
    Abstract [en]

    We present a new recursive algorithm for the QR factorization of an m by n matrix A. The recursion leads to an automatic variable blocking that allows us to replace a Level 2 part in a standard block algorithm by Level 3 operations. However, there are some additional costs for performing the updates, which prohibit the efficient use of the recursion for large n. This obstacle is overcome by using a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by 78% down to 21% as m=n increases from 100 to 1000. A successful parallel implementation on a PowerPC 604 based IBM SMP node using dynamic load balancing is presented. For 2, 3, and 4 processors and m=n=2000 it shows speedups of 1.96, 2.99, and 3.92 compared to our uniprocessor algorithm.
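
    The dynamic load balancing mentioned above can be pictured as a self-scheduling work queue; the sketch below illustrates only that general pattern, with an illustrative panel width nb and a stand-in kernel factor_panel, and makes no claim about the paper's actual scheduler.

        import queue
        import threading
        import numpy as np

        def factor_panel(panel):
            # Placeholder for a per-panel factorization kernel.
            return np.linalg.qr(panel, mode="r")

        def worker(tasks, results):
            # Self-scheduling: each thread pulls the next unclaimed panel,
            # so faster threads automatically take on more work.
            while True:
                try:
                    idx, panel = tasks.get_nowait()
                except queue.Empty:
                    return
                results[idx] = factor_panel(panel)

        A = np.random.rand(2000, 2000)
        nb = 200                                 # illustrative panel width
        tasks = queue.Queue()
        for idx, j in enumerate(range(0, A.shape[1], nb)):
            tasks.put((idx, A[:, j:j + nb]))
        results = [None] * tasks.qsize()
        threads = [threading.Thread(target=worker, args=(tasks, results))
                   for _ in range(4)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()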

  • 3.
    Gustavson, Fred G.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. IBM T.J. Watson Research Center, Ossining, USA.
    Algebra and geometry combined explains how the mind does math (2014). In: Parallel processing and applied mathematics: 10th International Conference, PPAM 2013, Warsaw, Poland, September 8-11, 2013, Revised selected papers, Part I / [ed] Roman Wyrzykowski; Jack Dongarra; Konrad Karczewski; Jerzy Waśniewski, Springer Berlin/Heidelberg, 2014, pp. 1-11. Book chapter (Refereed)
    Abstract [en]

    This paper updates my talk on Cache Blocking for Dense Linear Algebra Algorithms since 1985, given at PPAM 11; see [11]. We again apply Dimension Theory to matrices in the Fortran and C programming languages. New Data Structures (NDS) for matrices are given. We use the GCD algorithm to transpose an n by m matrix A in CMO order, standard layout, in-place. Algebra and Geometry are used to make this idea concrete and practical; this is the reason for the title of our paper: make a picture of any matrix by using the GCD algorithm to convert it into a direct sum of square submatrices. The picture is Geometry and the GCD algorithm is Algebra. Also, the in-place transposition of the GKK and TT algorithms will be compared. Finally, the use of negative integers is shown to give new results about subtraction and finding primitive roots, which also make a priori in-place transposition more efficient.
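
    The "picture" the abstract refers to can be made concrete: any m by n matrix tiles exactly into square g by g submatrices with g = gcd(m, n). A small sketch of that tiling follows (the paper's transposition algorithms themselves are not reproduced here).

        from math import gcd
        import numpy as np

        def gcd_square_tiles(A):
            """Tile an m-by-n matrix exactly into g-by-g squares, g = gcd(m, n)."""
            m, n = A.shape
            g = gcd(m, n)
            return [[A[i:i + g, j:j + g] for j in range(0, n, g)]
                    for i in range(0, m, g)]

        A = np.arange(6 * 4).reshape(6, 4)
        tiles = gcd_square_tiles(A)          # g = 2: a 3-by-2 grid of 2-by-2 squares
        assert all(t.shape == (2, 2) for row in tiles for t in row)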

  • 4.
    Gustavson, Fred G.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. IBM T.J. Watson Research Center, United States.
    Cache blocking (2012). In: Applied parallel and scientific computing: 10th international conference, PARA 2010, Reykjavík, Iceland, June 6-9, 2010, revised selected papers, Part I / [ed] Kristján Jónasson, Springer Berlin/Heidelberg, 2012, pp. 22-32. Book chapter (Refereed)
    Abstract [en]

    Over the past five years almost all computer manufacturers have dramatically changed their computer architectures to Multicore (MC) processors. We briefly describe Cache Blocking as it relates to computer architectures since about 1985 by covering the where, when, how and why of Cache Blocking as it relates to dense linear algebra. It will be seen that the arrangement in memory of the submatrices A_ij of A that are being processed is very important.
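
    As a textbook illustration of cache blocking (not code from the chapter), loop tiling for C = A*B reuses each nb by nb tile while it is resident in cache; nb is an illustrative parameter to be matched to the cache size.

        import numpy as np

        def blocked_matmul(A, B, nb=64):
            """C = A @ B computed tile by tile; nb should let three tiles fit in cache."""
            m, k = A.shape
            _, n = B.shape
            C = np.zeros((m, n))
            for ii in range(0, m, nb):
                for kk in range(0, k, nb):
                    for jj in range(0, n, nb):
                        # All three operands below are nb-by-nb (or edge) tiles.
                        C[ii:ii + nb, jj:jj + nb] += (A[ii:ii + nb, kk:kk + nb]
                                                      @ B[kk:kk + nb, jj:jj + nb])
            return C

        A, B = np.random.rand(300, 200), np.random.rand(200, 250)
        assert np.allclose(blocked_matmul(A, B), A @ B)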

  • 5.
    Gustavson, Fred G.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. IBM T.J. Watson Research Center, United States.
    Cache blocking for linear algebra algorithms (2012). In: Parallel processing and applied mathematics: 9th international conference, PPAM 2011, Torun, Poland, September 11-14, 2011. Revised selected papers, Part I / [ed] Roman Wyrzykowski; Jack Dongarra; Konrad Karczewski; Jerzy Waśniewski, Springer Berlin/Heidelberg, 2012, pp. 122-132. Book chapter (Refereed)
    Abstract [en]

    We briefly describe Cache Blocking for Dense Linear Algebra Algorithms on computer architectures since about 1985. Before that one had uniform memory architectures. The Cray I machine was the last holdout. We cover the where, when, what, how and why of Cache Blocking. Almost all computer manufacturers have recently (about seven years ago) dramatically changed their computer architectures to produce Multicore (MC) processors. It will be seen that the arrangement in memory of the submatrices A_ij of A is a critical factor for obtaining high performance. From a practical point of view, this work is very important as it will allow existing codes using LAPACK and ScaLAPACK to remain usable by new versions of LAPACK and ScaLAPACK.
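
    The remark about the arrangement in memory of the submatrices A_ij can be illustrated by converting a matrix to a square-block (tile) layout in which each A_ij is contiguous. A hedged NumPy sketch, with an assumed block size nb dividing both dimensions:

        import numpy as np

        def to_square_blocks(A, nb):
            """Return a (m/nb, n/nb, nb, nb) array; blocks[i, j] is tile A_ij, contiguous."""
            m, n = A.shape
            assert m % nb == 0 and n % nb == 0
            return A.reshape(m // nb, nb, n // nb, nb).swapaxes(1, 2).copy()

        def from_square_blocks(blocks):
            mb, nb_blocks, nb, _ = blocks.shape
            return blocks.swapaxes(1, 2).reshape(mb * nb, nb_blocks * nb)

        A = np.arange(8 * 6, dtype=float).reshape(8, 6)
        T = to_square_blocks(A, 2)
        assert (T[1, 2] == A[2:4, 4:6]).all()      # tile A_12 is one contiguous chunk
        assert (from_square_blocks(T) == A).all()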

  • 6.
    Gustavson, Fred G.
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. IBM T.J. Watson Research Center, New York, USA.
    Herrero, Jose R.
    Morancho, Enric
    A Square Block Format for Symmetric Band Matrices (2014). In: Parallel Processing and Applied Mathematics: 10th International Conference, PPAM 2013, Warsaw, Poland, September 8-11, 2013, Revised Selected Papers, Part I / [ed] Roman Wyrzykowski; Jack Dongarra; Konrad Karczewski; Jerzy Waśniewski, Springer Berlin/Heidelberg, 2014, pp. 683-689. Conference paper (Refereed)
    Abstract [en]

    This contribution describes a Square Block (SB) format for storing a banded symmetric matrix. It is obtained by rearranging the LAPACK band layout "in place" into an SB layout that stores submatrices as a set of square blocks. The new format reduces storage space, provides higher locality of memory accesses, results in regular access patterns, and exposes parallelism.
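
    A hedged sketch of the idea (the block size nb and the band test are illustrative, not the paper's exact in-place mapping): keep only the lower-triangular nb by nb tiles that intersect the band, each stored as a contiguous square block.

        import numpy as np

        def band_to_square_blocks(A, kd, nb):
            """Tiles of the lower triangle of symmetric A that meet the band |i-j| <= kd."""
            n = A.shape[0]
            assert n % nb == 0
            tiles = {}
            for bi in range(0, n, nb):
                for bj in range(0, bi + 1, nb):
                    # The entry of this tile closest to the diagonal is (bi, bj+nb-1).
                    if bi - (bj + nb - 1) <= kd:
                        tiles[(bi // nb, bj // nb)] = A[bi:bi + nb, bj:bj + nb].copy()
            return tiles

        n, kd, nb = 12, 3, 4
        A = np.random.rand(n, n); A = A + A.T
        tiles = band_to_square_blocks(A, kd, nb)
        assert (0, 0) in tiles and (2, 0) not in tiles   # far-from-diagonal tile dropped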

  • 7.
    Gustavson, Fred G.
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. IBM T. J. Watson Research Center, Yorktown Heights, USA.
    Walker, David W.
    School of Computer Science and Informatics, Cardiff University, Cardiff, UK.
    Algorithms for in-place matrix transposition (2014). In: Parallel processing and applied mathematics: 10th International Conference, PPAM 2013, Warsaw, Poland, September 8-11, 2013, Revised selected papers, Part II / [ed] Roman Wyrzykowski; Jack Dongarra; Konrad Karczewski; Jerzy Waśniewski, Springer Berlin/Heidelberg, 2014, pp. 105-117. Book chapter (Refereed)
    Abstract [en]

    This paper presents an implementation of an in-place swap-based algorithm for transposing rectangular matrices, and a proof of correctness is also sketched. The implementation is based on an algorithm described by Tretyakov and Tyrtyshnikov [4], but we have introduced a number of variations. In particular, we show how the original algorithm can be modified to require constant additional memory. We also identify opportunities for exploiting parallelism. 
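
    For orientation, the standard cycle-following scheme that such algorithms refine is sketched below: in row-major storage, the entry at flat index p moves to (p*m) mod (mn-1). This sketch spends O(mn) bits on a visited array, which is exactly the bookkeeping the paper's constant-additional-memory variant avoids.

        import numpy as np

        def transpose_inplace(a, m, n):
            """In-place transpose of an m-by-n matrix stored row-major in flat array a."""
            size = m * n
            if size <= 2:
                return a
            done = np.zeros(size, dtype=bool)      # bookkeeping the paper's variant avoids
            for start in range(1, size - 1):       # indices 0 and size-1 are fixed points
                if done[start]:
                    continue
                p, val = start, a[start]
                while True:
                    q = (p * m) % (size - 1)       # destination of the entry at p
                    val, a[q] = a[q], val          # drop the carried value, pick up the next
                    done[q] = True
                    p = q
                    if q == start:
                        break
            return a

        a = np.arange(12.0)                        # a 3-by-4 matrix, row-major
        transpose_inplace(a, 3, 4)
        assert (a.reshape(4, 3) == np.arange(12.0).reshape(3, 4).T).all()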

  • 8.
    Gustavson, Fred G.
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Wasniewski, Jerzy
    Dongarra, Jack J.
    Herrero, Jose R.
    Langou, Julien
    Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms (2013). In: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 39, no. 2, article 9. Journal article (Refereed)
    Abstract [en]

    Four routines called DPOTF3i, i = a, b, c, d, are presented. DPOTF3i are a novel type of Level-3 BLAS for use by BPF (Blocked Packed Format) Cholesky factorization and the LAPACK routine DPOTRF. Performance of the DPOTF3i routines is still increasing when the performance of the Level-2 LAPACK routine DPOTF2 starts decreasing. This is our main result, and it implies, due to the use of a larger block size nb, that DGEMM, DSYRK, and DTRSM performance also increases. The four DPOTF3i routines use simple register blocking. Different platforms have different numbers of registers; thus, our four routines have different register blocking sizes. BPF is introduced. LAPACK routines for POTRF and PPTRF using BPF instead of full and packed format are shown to be trivial modifications of the LAPACK POTRF source codes. We call these codes BPTRF. There are two variants of BPF: lower and upper. Upper BPF is "identical" to Square Block Packed Format (SBPF). "LAPACK" implementations on multicore processors use SBPF. Lower BPF is less efficient than upper BPF. Vector in-place transposition converts lower BPF to upper BPF very efficiently. Corroborating performance results for DPOTF3i versus DPOTF2 on a variety of common platforms are given for n ≈ nb, as well as results for large n comparing DBPTRF versus DPOTRF.
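
    To fix ideas, here is the surrounding blocked algorithm in which such a factorization kernel sits. The np.linalg calls below merely stand in for the DPOTF3i kernel and the DTRSM/DSYRK-style updates; this is a hedged NumPy sketch with an illustrative block size nb, not the paper's Fortran.

        import numpy as np

        def blocked_cholesky(A, nb=64):
            """Right-looking blocked Cholesky; returns lower-triangular L with A = L @ L.T."""
            L = np.array(A, dtype=float)
            n = L.shape[0]
            for k in range(0, n, nb):
                e = min(k + nb, n)
                # Diagonal-block factorization: the slot a DPOTF3-style kernel fills.
                L[k:e, k:e] = np.linalg.cholesky(L[k:e, k:e])
                if e < n:
                    # Panel solve (DTRSM-like): L21 = A21 * L11^{-T}.
                    L[e:, k:e] = np.linalg.solve(L[k:e, k:e], L[e:, k:e].T).T
                    # Trailing update (DSYRK-like): A22 -= L21 * L21^T.
                    L[e:, e:] -= L[e:, k:e] @ L[e:, k:e].T
            return np.tril(L)

        A = np.random.rand(200, 150)
        A = A @ A.T + 200 * np.eye(200)        # symmetric positive definite test matrix
        L = blocked_cholesky(A)
        assert np.allclose(L @ L.T, A)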

  • 9.
    Gustavson, Fred G.
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Wasniewski, Jerzy
    Dongarra, Jack J.
    Langou, Julien
    Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution, and Inversion (2010). In: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 37, no. 2, pp. 1-21, article id 18. Journal article (Refereed)
    Abstract [en]

    We describe a new data format for storing triangular, symmetric, and Hermitian matrices called Rectangular Full Packed Format (RFPF). The standard two-dimensional arrays of Fortran and C (also known as full format) that are used to represent triangular and symmetric matrices waste nearly half of the storage space but provide high performance via the use of Level 3 BLAS. Standard packed format arrays fully utilize storage (array space) but provide low performance, as there is no Level 3 packed BLAS. RFPF combines the good features of packed and full storage: it obtains high performance via Level 3 BLAS, since RFPF is a standard full-format representation, while requiring exactly the same minimal storage as the packed format. Each LAPACK full and/or packed triangular, symmetric, and Hermitian routine becomes a single new RFPF routine based on eight possible data layouts of RFPF. This new RFPF routine usually consists of two calls to the corresponding LAPACK full-format routine and two calls to Level 3 BLAS routines. This means no new software is required. As examples, we present LAPACK routines for Cholesky factorization, Cholesky solution, and Cholesky inverse computation in RFPF to illustrate this new work and to describe its performance on several commonly used computer platforms. Performance of LAPACK full routines using RFPF versus LAPACK full routines using the standard format is about the same for both serial and SMP parallel processing, while using half the storage. Performance gains range from about one to a factor of 43 for serial execution, and from one to a factor of 97 for SMP parallel execution, when using vendor LAPACK full routines with RFPF rather than vendor and/or reference packed routines.
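
    One of the eight layouts can be sketched directly (lower triangle, no transpose, even n, following the scheme the paper describes): the leading lower trapezoid and the transposed trailing triangle interlock into a dense (n+1) by n/2 rectangle holding exactly n(n+1)/2 entries. A hedged sketch:

        import numpy as np

        def pack_rfpf_lower(A):
            """Pack the lower triangle of A (even order n) into an (n+1)-by-(n/2) array."""
            n = A.shape[0]
            assert n % 2 == 0
            k = n // 2
            rfp = np.zeros((n + 1, k))
            for j in range(k):                     # leading trapezoid, shifted down one row
                rfp[j + 1:n + 1, j] = A[j:n, j]
            for i in range(k):                     # trailing triangle, stored transposed
                rfp[:i + 1, i] = A[k + i, k:k + i + 1]
            return rfp

        n = 6
        A = np.tril(np.arange(1, n * n + 1, dtype=float).reshape(n, n))
        rfp = pack_rfpf_lower(A)
        assert rfp.size == n * (n + 1) // 2        # same minimal storage as packed format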

  • 10.
    Gustavson, Fred G.
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. IBM Research, United States.
    Waśniewski, Jerzy
    Technical University of Denmark, Denmark.
    Herrero, José R.
    Universitat Politècnica de Catalunya, BarcelonaTech, Spain.
    New Level-3 BLAS kernels for Cholesky factorization (2012). In: Parallel processing and applied mathematics: 9th international conference, PPAM 2011, Torun, Poland, September 11-14, 2011. Revised selected papers, Part I / [ed] Roman Wyrzykowski; Jack Dongarra; Konrad Karczewski; Jerzy Waśniewski, Springer Berlin/Heidelberg, 2012, pp. 60-69. Book chapter (Refereed)
    Abstract [en]

    Some linear algebra libraries use Level-2 routines during the factorization part of any Level-3 block factorization algorithm. We discuss four Level-3 routines called DPOTF3, a new type of BLAS, for the factorization part of a block Cholesky factorization algorithm, for use by the LAPACK routine DPOTRF or by BPF (Blocked Packed Format) Cholesky factorization. The four DPOTF3 routines are Fortran routines. Our main result is that the performance of the DPOTF3 routines is still increasing when the performance of the Level-2 LAPACK routine DPOTF2 starts to decrease. This means that the performance of DGEMM, DSYRK, and DTRSM will increase due to their use of larger block sizes and to making fewer passes over the matrix elements. We present corroborating performance results for DPOTF3 versus DPOTF2 on a variety of common platforms. The four DPOTF3 routines are based on simple register blocking; different platforms have different numbers of registers, and so our four routines have different register blockings. Blocked Packed Format (BPF) is discussed. LAPACK routines for -POTRF and -PPTRF using BPF instead of full and packed format are shown to be trivial modifications of the LAPACK -POTRF source codes. Upper BPF is shown to be identical to square block packed format. Performance results for DBPTRF and DPOTRF for large n show that the DPOTF3 routines do increase performance for large n.
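
    The square-block variant of BPF can be pictured with a short sketch (an illustration with an assumed block size nb, not the paper's code): the lower triangle is kept block column by block column as contiguous nb by nb tiles, so packed storage still feeds full tiles to Level-3 kernels.

        import numpy as np

        def pack_sbpf_lower(A, nb):
            """Lower triangle of A as contiguous nb-by-nb tiles, by block column."""
            n = A.shape[0]
            assert n % nb == 0
            tiles = {}
            for bj in range(0, n, nb):
                for bi in range(bj, n, nb):        # only blocks on or below the diagonal
                    tiles[(bi // nb, bj // nb)] = np.ascontiguousarray(A[bi:bi + nb, bj:bj + nb])
            return tiles

        A = np.random.rand(8, 8)
        tiles = pack_sbpf_lower(A, 4)
        assert set(tiles) == {(0, 0), (1, 0), (1, 1)} and tiles[(1, 0)].flags["C_CONTIGUOUS"]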

  • 11.
    Gustavson, Fred
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Karlsson, Lars
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Kågström, Bo
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Distributed SBP Cholesky factorization algorithms with near-optimal scheduling (2009). In: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 36, no. 2, pp. 11:1-11:25. Journal article (Refereed)
  • 12.
    Gustavson, Fred
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Karlsson, Lars
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Kågström, Bo
    Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion (2012). In: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 38, no. 3, pp. 17:1-17:32. Journal article (Refereed)
    Abstract [en]

    Techniques and algorithms for efficient in-place conversion to and from standard and blocked matrix storage formats are described. Such functionality is required by numerical libraries that use different data layouts internally. Parallel algorithms and a software package for in-place matrix storage format conversion based on in-place matrix transposition are presented and evaluated. A new algorithm for in-place transposition which efficiently determines the structure of the transposition permutation a priori is one of the key ingredients. It enables effective load balancing in a parallel environment.
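
    The core observation can be shown in a few lines: converting a column-major matrix with m = p*nb rows into p contiguous nb by n panels permutes nb-element chunks exactly as an n by p matrix transposition does, so an in-place transposition routine can drive the conversion. The NumPy sketch below copies rather than working in place, and nb is an assumed block height dividing m.

        import numpy as np

        def colmajor_to_row_panels(flat, m, n, nb):
            """Column-major m-by-n data -> p = m//nb panels, each nb-by-n column-major."""
            p = m // nb
            chunks = flat.reshape(n, p, nb)              # storage order: (column, block, row)
            return chunks.transpose(1, 0, 2).reshape(-1) # transpose the (column, block) axes

        m, n, nb = 6, 5, 2
        A = np.arange(m * n, dtype=float).reshape(m, n)
        out = colmajor_to_row_panels(A.flatten(order="F"), m, n, nb)
        # Panel 1 holds rows 2:4 of A, stored column-major within the panel.
        assert (out[1 * nb * n:2 * nb * n] == A[2:4, :].flatten(order="F")).all()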
