umu.sePublikationer
Ändra sökning
Avgränsa sökresultatet
1 - 28 av 28
RefereraExporteraLänk till träfflistan
Permanent länk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf
Träffar per sida
  • 5
  • 10
  • 20
  • 50
  • 100
  • 250
Sortering
  • Standard (Relevans)
  • Författare A-Ö
  • Författare Ö-A
  • Titel A-Ö
  • Titel Ö-A
  • Publikationstyp A-Ö
  • Publikationstyp Ö-A
  • Äldst först
  • Nyast först
  • Skapad (Äldst först)
  • Skapad (Nyast först)
  • Senast uppdaterad (Äldst först)
  • Senast uppdaterad (Nyast först)
  • Disputationsdatum (tidigaste först)
  • Disputationsdatum (senaste först)
  • Standard (Relevans)
  • Författare A-Ö
  • Författare Ö-A
  • Titel A-Ö
  • Titel Ö-A
  • Publikationstyp A-Ö
  • Publikationstyp Ö-A
  • Äldst först
  • Nyast först
  • Skapad (Äldst först)
  • Skapad (Nyast först)
  • Senast uppdaterad (Äldst först)
  • Senast uppdaterad (Nyast först)
  • Disputationsdatum (tidigaste först)
  • Disputationsdatum (senaste först)
Markera
Maxantalet träffar du kan exportera från sökgränssnittet är 250. Vid större uttag använd dig av utsökningar.
  • 1.
    Adlerborn, Björn
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Distributed one-stage Hessenberg-triangular reduction with wavefront scheduling2018Ingår i: SIAM Journal on Scientific Computing, ISSN 1064-8275, E-ISSN 1095-7197, Vol. 40, nr 2, s. C157-C180Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    A novel parallel formulation of Hessenberg-triangular reduction of a regular matrix pair on distributed memory computers is presented. The formulation is based on a sequential cacheblocked algorithm by K degrees agstrom et al. [BIT, 48 (2008), pp. 563 584]. A static scheduling algorithm is proposed that addresses the problem of underutilized processes caused by two-sided updates of matrix pairs based on sequences of rotations. Experiments using up to 961 processes demonstrate that the new formulation is an improvement of the state of the art and also identify factors that limit its scalability.

  • 2.
    Adlerborn, Björn
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Distributed one-stage Hessenberg-triangular reduction with wavefront scheduling2016Rapport (Övrigt vetenskapligt)
    Abstract [en]

    A novel parallel formulation of Hessenberg-triangular reduction of a regular matrix pair on distributed memory computers is presented. The formulation is based on a sequential cache-blocked algorithm by Kågstrom, Kressner, E.S. Quintana-Ortí, and G. Quintana-Ortí (2008). A static scheduling algorithm is proposed that addresses the problem of underutilized processes caused by two-sided updates of matrix pairs based on sequences of rotations. Experiments using up to 961 processes demonstrate that the new algorithm is an improvement of the state of the art but also identifies factors that currently limit its scalability.

  • 3. Bosner, Nela
    et al.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Parallel and Heterogeneous $m$-Hessenberg-Triangular-Triangular Reduction2017Ingår i: SIAM Journal on Scientific Computing, ISSN 1064-8275, E-ISSN 1095-7197, Vol. 39, nr 1, s. C29-C47Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    The m-Hessenberg-triangular-triangular (mHTT) reduction is a simultaneous orthogonal reduction of three matrices to condensed form. It has applications, for example, in solving shifted linear systems arising in various control theory problems. A new heterogeneous CPU/GPU implementation of the mHTT reduction is presented and evaluated against an existing CPU implementation. The algorithm offloads the compute-intensive matrix-matrix multiplications to the GPU and keeps the inner loop, which is memory intensive and has a complicated control flow, on the CPU. Experiments demonstrate that the heterogeneous implementation can be superior to the existing CPU implementation on a system with 2 x 8 CPU cores and one GPU. Future development should focus on improving the scalability of the CPU computations.

  • 4.
    Bujanović, Zvonimir
    et al.
    Department of Mathematics, Faculty of Science, University of Zagreb, Zagreb, Croatia.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Kressner, Daniel
    Institute of Mathematics, EPFL, Lausanne, Switzerland.
    A Householder-Based Algorithm for Hessenberg-Triangular Reduction2018Ingår i: SIAM Journal on Matrix Analysis and Applications, ISSN 0895-4798, E-ISSN 1095-7162, Vol. 39, nr 3, s. 1270-1294Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    The QZ algorithm for computing eigenvalues and eigenvectors of a matrix pencil $A - \lambda B$ requires that the matrices first be reduced to Hessenberg-triangular (HT) form. The current method of choice for HT reduction relies entirely on Givens rotations regrouped and accumulated into small dense matrices which are subsequently applied using matrix multiplication routines. A nonvanishing fraction of the total flop-count must nevertheless still be performed as sequences of overlapping Givens rotations alternately applied from the left and from the right. The many data dependencies associated with this computational pattern leads to inefficient use of the processor and poor scalability. In this paper, we therefore introduce a fundamentally different approach that relies entirely on (large) Householder reflectors partially accumulated into block reflectors, by using (compact) WY representations. Even though the new algorithm requires more floating point operations than the state-of-the-art algorithm, extensive experiments on both real and synthetic data indicate that it is still competitive, even in a sequential setting. The new algorithm is conjectured to have better parallel scalability, an idea which is partially supported by early small-scale experiments using multithreaded BLAS. The design and evaluation of a parallel formulation is future work.

  • 5.
    Eljammaly, Mahmoud
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    A library for storing and manipulating dense tensors2016Rapport (Övrigt vetenskapligt)
    Abstract [en]

    Aiming to build a layered infrastructure for high-performance dense tensor applications, we present a library, called dten, for storing and manipulating dense tensors. The library focuses on storing dense tensors in canonical storage formats and converting between storage formats in parallel. In addition, it supports tensor matricization in different ways. The library is general-purpose and provides a high degree of flexibility.

  • 6.
    Eljammaly, Mahmoud
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    An auto-tuning framework for a NUMA-aware Hessenberg reduction algorithm2017Rapport (Övrigt vetenskapligt)
    Abstract [en]

    The performance of a recently developed Hessenberg reduction algorithm greatly depends on the values chosen for its tunable parameters. The search space is huge combined with other complications makes the problem hard to solve effectively with generic methods and tools. We describe a modular auto-tuning framework in which the underlying optimization algorithm is easy to substitute. The framework exposes sub-problems of standard auto-tuning type for which existing generic methods can be reused. The outputs of concurrently executing sub-tuners are assembled by the framework into a solution to the original problem.

  • 7.
    Eljammaly, Mahmoud
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    An auto-tuning framework for a NUMA-aware Hessenberg reduction algorithm2018Ingår i: ICPE '18 Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, ACM Digital Library, 2018, , s. 4s. 5-8Konferensbidrag (Refereegranskat)
  • 8.
    Eljammaly, Mahmoud
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Evaluation of the Tunability of a New NUMA-Aware Hessenberg Reduction Algorithm2016Rapport (Övrigt vetenskapligt)
  • 9.
    Eljammaly, Mahmoud
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    On the Tunability of a New Hessenberg Reduction Algorithm Using Parallel Cache Assignment2018Ingår i: Parallel Processing and Applied Mathematics. PPAM 2017: Part 1 / [ed] Wyrzykowski R., Dongarra J., Deelman E., Karczewski K., Springer, 2018, s. 579-589Konferensbidrag (Refereegranskat)
    Abstract [en]

    The reduction of a general dense square matrix to Hessenberg form is a well known first step in many standard eigenvalue solvers. Although parallel algorithms exist, the Hessenberg reduction is one of the bottlenecks in AED, a main part in state-of-the-art software for the distributed multishift QR algorithm. We propose a new NUMA-aware algorithm that fits the context of the QR algorithm and evaluate the sensitivity of its algorithmic parameters. The proposed algorithm is faster than LAPACK for all problem sizes and faster than ScaLAPACK for the relatively small problem sizes typical for AED.

  • 10.
    Gustavson, Fred
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Distributed SBP Cholesky factorization algorithms with near-optimal scheduling2009Ingår i: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 36, nr 2, s. 11:1-11:25Artikel i tidskrift (Refereegranskat)
  • 11.
    Gustavson, Fred
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion2012Ingår i: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 38, nr 3, s. 17:1-17:32Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Techniques and algorithms for efficient in-place conversion to and from standard and blocked matrix storage formats are described. Such functionality is required by numerical libraries that use different data layouts internally. Parallel algorithms and a software package for in-place matrix storage format conversion based on in-place matrix transposition are presented and evaluated. A new algorithm for in-place transposition which efficiently determines the structure of the transposition permutation a priori is one of the key ingredients. It enables effective load balancing in a parallel environment.

  • 12.
    Gustavson, Fred
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Three Algorithms for Cholesky Factorization on Distributed Memory Using Packed Storage2007Ingår i: Applied Parallel Computing - State of the Art in Scientific Computing: 8th International Workshop, PARA 2006, Springer , 2007, s. 550-559Konferensbidrag (Refereegranskat)
  • 13.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Blocked in-place transposition with application to storage format conversion2009Rapport (Övrigt vetenskapligt)
    Abstract [en]

    We develop a prototype library for in-place (dense) matrix storage format conversion between the canonical row and column-major formats and the four canonical block data layouts. Many of the fastest linear algebra routines operate on matrices in a block data layout. In-place storage format conversion enables support for input/output of large matrices in the canonical row and column-major formats. The library uses algorithms associated with in-place transposition as building blocks. We investigate previous work on the subject of (in-place) transposition and the most promising algorithms are implemented and evaluated. Our results indicate that the Three-Stage Algorithm which only requires a small constant amount of additional memory performs well and is easy to tune. Murray Dow’s V5 algorithm, which is a two-stage semi-in-place algorithm that requires a small amount of additional memory is sometimes a better choice. The write-allocate strategy of most cache-based computer architectures appears to be the cause of an observed performance problem for large matrices.

  • 14.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Scheduling of parallel matrix computations and data layout conversion for HPC and Multi-Core Architectures2011Doktorsavhandling, sammanläggning (Övrigt vetenskapligt)
    Abstract [sv]

    Matrisberäkningar är fundamentala byggblock imånga beräkningstunga teknisk-vetenskapliga applikationer. Algoritmerna måste vara numeriskt stabila och robusta för att användaren ska kunna förlita sig på de beräknade resultaten. Algoritmerna måste dessutom skala och kunna köras effektivt på massivt parallella datorer med noder bestående av flerkärniga processorer.

    Det är utmanande att uveckla högpresterande algoritmer för täta matrisberäkningar, särskilt sedan introduktionen av flerkärniga processorer. Det är ännu viktigare att återanvända data i cache-minnena i en flerkärnig processor på grund av dess höga beräkningsprestanda. Två centrala tekniker i strävan efter algoritmer med optimal prestanda är blockade algoritmer och blockade matrislagringsformat. En blockad algoritm har ett minnesåtkomstmönster som passar minneshierarkin väl. Ett blockat matrislagringsformat placerar matrisens element i minnet så att elementen i specifika matrisblock lagras konsekutivt.

    I Artikel I presenteras en algoritm för Cholesky-faktorisering av en matris kompakt lagrad i ett distribuerat minne. Det nya lagringsformatet är blockat och möjliggör därigenom hög prestanda. Artikel II och Artikel III beskriver hur en konventionellt lagrad matris kan konverteras till och från ett blockat lagringsformat med hjälp av en ytterst liten mängd extra lagringsutrymme. Lösningen bygger på en ny parallell algoritm för matristransponering av rektangulära matriser.

    Vid skapandet av en skalbar parallell algoritm måste man även beakta hur de olika beräkningsuppgifterna schemaläggs på ett effektivt sätt. Många så kallade svagt skalbara algoritmer är effektiva endast för relativt stora problem. En nuvarande forskningstrend är att utveckla så kallade starkt skalbara algoritmer, vilka är mer effektiva även för mindre problem.

    Artikel IV introducerar ett dynamiskt schemaläggningssystem för två-sidiga matrisberäkningar. Beräkningsuppgifterna fördelas statiskt på noderna och schemaläggs sedan dynamiskt inom varje nod. Artikeln visar även hur prioritetsbaserad schemaläggning tar en tidigare ineffektiv algoritm för ett så kallat QR-svep och gör den effektiv.

    Artikel V och Artikel VI presenterar nya parallella blockade algoritmer, designade för flerkärniga processorer, för en två-stegs Hessenberg-reduktion. De centrala bidragen i Artikel V utgörs av en blockad algoritm för reduktionens andra steg samt en adaptiv lastbalanseringsmetod.

  • 15.
    Karlsson, Lars
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Kagstrom, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Efficient Reduction from Block Hessenberg Form to Hessenberg Form Using Shared Memory2012Ingår i: Applied parallel and scientific computing: Part II, 2012, s. 258-268Konferensbidrag (Refereegranskat)
    Abstract [en]

    A new cache-efficient algorithm for reduction from block Hessenberg form to Hessenberg form is presented and evaluated. The algorithm targets parallel computers with shared memory. One level of look-ahead in combination with a dynamic load-balancing scheme significantly reduces the idle time and allows the use of coarse-grained tasks. The coarse tasks lead to high-performance computations on each processor/core. Speedups close to 13 over the sequential unblocked algorithm have been observed on a dual quad-core machine using one thread per core.

  • 16.
    Karlsson, Lars
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Kjelgaard Mikkelsen, Carl Christian
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Negative stride in the column-major format makes sense and has useful applications2017Rapport (Övrigt vetenskapligt)
    Abstract [en]

    Two lower triangular or two upper triangular matrices of the same size can be stored with minimal memory footprint. If both positive and negative strides are used, then both matrices can be accessed as if they were stored in regular column-major format.

  • 17.
    Karlsson, Lars
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Kjelgaard Mikkelsen, Carl Christian
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Improving Perfect Parallelism2014Ingår i: Parallel Processing and Applied Mathematics: 10th International Conference, PPAM 2013, Warsaw, Poland, September 8-11, 2013, Revised Selected Papers, Part I / [ed] Roman Wyrzykowski, Jack Dongarra, Konrad Karczewski, Jerzy Waśniewski, Springer Berlin/Heidelberg, 2014, Vol. 8384, s. 76-85Konferensbidrag (Refereegranskat)
    Abstract [en]

    We reconsider the familiar problem of executing a perfectly parallel workload consisting of N independent tasks on a parallel computer with P << N processors. We show that there are memory-bound problems for which the runtime can be reduced by the forced parallelization of individual tasks across a small number of cores. Specific examples include solving differential equations, performing sparse matrix-vector multiplications, and sorting integer keys.

  • 18.
    Karlsson, Lars
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Kressner, Daniel
    Lang, Bruno
    Optimally Packed Chains of Bulges in Multishift QR Algorithms2014Ingår i: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 40, nr 2, s. 12-Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    The QR algorithm is the method of choice for computing all eigenvalues of a dense nonsymmetric matrix A. After an initial reduction to Hessenberg form, a QR iteration can be viewed as chasing a small bulge from the top left to the bottom right corner along the subdiagonal of A. To increase data locality and create potential for parallelism, modern variants of the QR algorithm perform several iterations simultaneously, which amounts to chasing a chain of several bulges instead of a single bulge. To make effective use of level 3 BLAS, it is important to pack these bulges as tightly as possible within the chain. In this work, we show that the tightness of the packing in existing approaches is not optimal and can be increased. This directly translates into a reduced chain length by 33% compared to the state-of-the-art LAPACK implementation of the QR algorithm. To demonstrate the impact of our idea, we have modified the LAPACK implementation to make use of the optimal packing. Numerical experiments reveal a uniform reduction of the execution time, without affecting stability or robustness.

  • 19.
    Karlsson, Lars
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Kressner, Daniel
    Uschmajew, Andre
    Parallel algorithms for tensor completion in the CP format2016Ingår i: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 57, s. 222-234Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Low-rank tensor completion addresses the task of filling in missing entries in multidimensional data. It has proven its versatility in numerous applications, including context aware recommender systems and multivariate function learning. To handle large-scale datasets and applications that feature high dimensions, the development of distributed algorithms is central. In this work, we propose novel, highly scalable algorithms based on a combination of the canonical polyadic (CP) tensor format with block coordinate descent methods. Although similar algorithms have been proposed for the matrix case, the case of higher dimensions gives rise to a number of new challenges and requires a different paradigm for data distribution. The convergence of our algorithms is analyzed and numerical experiments illustrate their performance on distributed-memory architectures for tensors from a range of different applications.

  • 20.
    Karlsson, Lars
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    A framework for dynamic node-scheduling of two-sided blocked matrix computations2009Ingår i: Applied Parallel Computing - State of the Art in Parallel and Scientific Computing: 9th International Workshop, PARA 2008, 2009Konferensbidrag (Refereegranskat)
  • 21.
    Karlsson, Lars
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Efficient reduction from block Hessenberg form to Hessenberg form using shared memory2010Ingår i: PARA 2010: State of the Art in Scientific and Parallel Computing, Reykjavik, June 6-9, 2010, 2010Konferensbidrag (Refereegranskat)
  • 22.
    Karlsson, Lars
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures2011Ingår i: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 37, nr 12, s. 771-782Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    We consider parallel reduction of a real matrix to Hessenberg form using orthogonal transformations. Standard Hessenberg reduction algorithms reduce the columns of the matrix from left to right in either a blocked or unblocked fashion. However, the standard blocked variant performs 20% of the computations in terms of matrix vector multiplications. We show that a two-stage approach consisting of an intermediate reduction to block Hessenberg form speeds up the reduction by avoiding matrix vector multiplications. We describe and evaluate a new high-performance implementation of the two-stage approach that attains significant speedups over the one-stage approach. The key components are a dynamically scheduled implementation of Stage 1 and a blocked, adaptively load-balanced implementation of Stage 2. (C) 2011 Elsevier B.V. All rights reserved.

  • 23.
    Karlsson, Lars
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Kågström, Bo
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Wadbro, Eddie
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Högpresterande beräkningscentrum norr (HPC2N).
    Fine-Grained Bulge-Chasing Kernels for Strongly Scalable Parallel QR Algorithms2014Ingår i: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, nr 7, s. 271-288Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    The bulge-chasing kernel in the small-bulge multi-shift QR algorithm for the non-symmetric dense eigenvalue problem becomes a sequential bottleneck when the QR algorithm is run in parallel on a multicore platform with shared memory. The duration of each kernel invocation is short, but the critical path of the QR algorithm contains a long sequence of calls to the bulge-chasing kernel. We study the problem of parallelizing the bulge-chasing kernel itself across a handful of processor cores in order to reduce the execution time of the critical path. We propose and evaluate a sequence of four algorithms with varying degrees of complexity and verify that a pipelined algorithm with a slowly shifting block column distribution of the Hessenberg matrix is superior. The load-balancing problem is non-trivial and computational experiments show that the load-balancing scheme has a large impact on the overall performance. We propose two heuristics for the load-balancing problem and also an effective optimization method based on local search. Numerical experiments show that speed-ups are obtained for problems as small as 40-by-40 on two different multicore architectures.

  • 24.
    Karlsson, Lars
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Tisseur, Francoise
    Algorithms for Hessenberg-Triangular Reduction of Fiedler Linearization of Matrix Polynomials2015Ingår i: SIAM Journal on Scientific Computing, ISSN 1064-8275, E-ISSN 1095-7197, Vol. 37, nr 3, s. C384-C414Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Small- to medium-sized polynomial eigenvalue problems can be solved by linearizing the matrix polynomial and solving the resulting generalized eigenvalue problem using the QZ algorithm. The QZ algorithm, in turn, requires an initial reduction of a matrix pair to Hessenberg-triangular (HT) form. In this paper, we discuss the design and evaluation of high-performance parallel algorithms and software for HT reduction of a specific linearization of matrix polynomials of arbitrary degree. The proposed algorithm exploits the sparsity structure of the linearization to reduce the number of operations and improve the cache reuse compared to existing algorithms for unstructured inputs. Experiments on both a workstation and a high-performance computing system demonstrate that our structure-exploiting parallel implementation can outperform both the general LAPACK routine DGGHRD and the prototype implementation DGGHR3 of a general blocked algorithm.

  • 25.
    Kjelgaard Mikkelsen, Carl Christian
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Blocked Algorithms for Robust Solution of Triangular Linear Systems2018Ingår i: Parallel Processing and Applied Mathematics: 12th International Conference, PPAM 2017, Lublin, Poland, September 10-13, 2017, Revised Selected Papers, Part I / [ed] Roman Wyrzykowski, Jack Dongarra, Ewa Deelman, Konrad Karczewski, Springer, 2018, Vol. 1, s. 68-78Konferensbidrag (Refereegranskat)
    Abstract [en]

    We consider the problem of computing a scaling α such that the solution x of the scaled linear system Tx = αb can be computed without exceeding an overflow threshold Ω. Here T is a non-singular upper triangular matrix and b is a single vector, and Ω is less than the largest representable number. This problem is central to the computation of eigenvectors from Schur forms. We show how to protect individual arithmetic operations against overflow and we present a robust scalar algorithm for the complete problem. Our algorithm is very similar to xLATRS in LAPACK. We explain why it is impractical to parallelize these algorithms. We then derive a robust blocked algorithm which can be executed in parallel using a task-based run-time system such as StarPU. The parallel overhead is increased marginally compared with regular blocked backward substitution.

  • 26.
    Kjelgaard Mikkelsen, Carl Christian
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Schwarz, Angelika Beatrix
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Parallel robust solution of triangular linear systems2018Ingår i: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, artikel-id e5064Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Triangular linear systems are central to the solution of general linear systems and the computation of eigenvectors. In the absence of floating‐point exceptions, substitution runs to completion and solves a system which is a small perturbation of the original system. If the matrix is well‐conditioned, then the normwise relative error is small. However, there are well‐conditioned systems for which substitution fails due to overflow. The robust solvers xLATRS from LAPACK extend the set of linear systems which can be solved by dynamically scaling the solution and the right‐hand side to avoid overflow. These solvers are sequential and apply to systems with a single right‐hand side. This paper presents algorithms which are blocked and parallel. A new task‐based parallel robust solver (Kiya) is presented and compared against both DLATRS and the non‐robust solvers DTRSV and DTRSM. When there are many right‐hand sides, Kiya performs significantly better than the robust solver DLATRS and is not significantly slower than the non‐robust solver DTRSM.

  • 27.
    Kågström, Bo
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Kressner, Daniel
    Computing codimensions and generic canonical forms for generalized matrix products2011Ingår i: The Electronic Journal of Linear Algebra, ISSN 1537-9582, E-ISSN 1081-3810, Vol. 22, s. 277-309Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    A generalized matrix product can be formally written as Lambda(sp)(p) Lambda(sp-1)(p-1) ... Lambda(s2)(2) Lambda(s1)(1) where s(i) is an element of {- 1,+ 1} and ( A(1), ..., A(p)) is a tuple of ( possibly rectangular) matrices of suitable dimensions. The periodic eigenvalue problem related to such a product represents a nontrivial extension of generalized eigenvalue and singular value problems. While the classification of generalized matrix products under eigenvalue-preserving similarity transformations and the corresponding canonical forms have been known since the 1970's, finding generic canonical forms has remained an open problem. In this paper, we aim at such generic forms by computing the codimension of the orbit generated by all similarity transformations of a given generalized matrix product. This can be reduced to computing the so called cointeractions between two different blocks in the canonical form. A number of techniques are applied to keep the number of possibilities for different types of cointeractions limited. Nevertheless, the matter remains highly technical; we therefore also provide a computer program for finding the codimension of a canonical form, based on the formulas developed in this paper. A few examples illustrate how our results can be used to determine the generic canonical form of least codimension. Moreover, we describe an algorithm and provide software for extracting the generically regular part of a generalized matrix product.

  • 28.
    Schwarz, Angelika Beatrix
    et al.
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Karlsson, Lars
    Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
    Scalable eigenvector computation for the non-symmetric eigenvalue problem2019Ingår i: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 85, s. 131-140Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    We present two task-centric algorithms for computing selected eigenvectors of a non-symmetric matrix reduced to real Schur form. Our approach eliminates the sequential phases present in the current LAPACK/ScaLAPACK implementation. We demonstrate the scalability of our implementation on multicore, manycore and distributed memory systems.

1 - 28 av 28
RefereraExporteraLänk till träfflistan
Permanent länk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf