umu.se Publications
1 - 8 of 8
  • 1.
    Daldorff, Lars K. S.
    et al.
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden.
    Eliasson, Bengt
    Umeå University, Faculty of Science and Technology, Department of Physics.
    Parallelization of a Vlasov–Maxwell solver in four-dimensional phase space. 2009. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 35, no 2, p. 109-115. Article in journal (Refereed)
    Abstract [en]

    We present a parallelized algorithm for solving the time-dependent Vlasov–Maxwell system of equations in the four-dimensional phase space (two spatial and velocity dimensions). One Vlasov equation is solved for each particle species, from which charge and current densities are calculated for the Maxwell equations. The parallelization is divided into two different layers. For the first layer, each plasma species is given its own processor group. On the second layer, the distribution function is domain decomposed on its dedicated resources. By separating the communication and calculation steps, we have met the design criteria of good speedup and simplicity in the implementation.
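The two-layer layout described in the abstract can be illustrated with a minimal sketch. It assumes that the ranks divide evenly among the species and that each group splits a one-dimensional index range into contiguous blocks. The function name `two_layer_layout` and its parameters are hypothetical, and the paper's actual decomposition of the four-dimensional phase space is not reproduced here.

```python
def two_layer_layout(n_ranks, n_species, n_cells):
    """Map each rank to (species, start, stop) for a 1-D block decomposition.

    Hypothetical illustration only: layer 1 assigns one processor group per
    species, layer 2 splits the cell range among the ranks of that group.
    """
    if n_ranks % n_species != 0:
        raise ValueError("this sketch assumes ranks divide evenly among species")
    group_size = n_ranks // n_species
    base, extra = divmod(n_cells, group_size)
    layout = {}
    for rank in range(n_ranks):
        species = rank // group_size   # layer 1: processor group of the species
        local = rank % group_size      # position of the rank inside its group
        # Layer 2: contiguous blocks; the first `extra` ranks get one more cell.
        start = local * base + min(local, extra)
        stop = start + base + (1 if local < extra else 0)
        layout[rank] = (species, start, stop)
    return layout


if __name__ == "__main__":
    for rank, (species, start, stop) in two_layer_layout(4, 2, 10).items():
        print(f"rank {rank}: species {species}, cells [{start}, {stop})")
```

With this layout, each species owns a disjoint set of ranks, so the per-species solves do not need to exchange distribution-function data with other groups.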

  • 2.
    Jäger, Gerold
    et al.
    Computer Science Institute, University of Halle-Wittenberg, D-06120 Halle (Saale), Germany.
    Wagner, Clemens
    denkwerk, Vogelsanger Straße 66, D-50823 Köln, Germany.
    Efficient parallelizations of Hermite and Smith normal form algorithms. 2009. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 35, no 6, p. 345-357. Article in journal (Refereed)
    Abstract [en]

    Hermite and Smith normal forms are important forms of matrices used in linear algebra. These forms have many applications in group theory and number theory. As the entries of the matrix and of its corresponding transformation matrices can explode during the computation, it is a very difficult problem to compute the Hermite and Smith normal forms of large dense matrices. The main problems of the computation are the large execution times and the memory requirements, which might exceed the memory of one processor. To avoid these problems, we develop parallelizations of Hermite and Smith normal form algorithms. These are the first parallelizations of algorithms for computing the normal forms with corresponding transformation matrices, both over the rings Z and F[x]. We show that our parallel versions have good efficiency, i.e., doubling the number of processes nearly halves the execution time. Furthermore, they succeed in computing normal forms of large dense example matrices over the rings Q[x], F_3[x], and F_5[x].
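For readers unfamiliar with the Hermite normal form, a small sequential sketch over the integers may help. It is a textbook-style row reduction, not the parallel algorithm of the paper, and it does not track the transformation matrices. The function name `hermite_normal_form` is hypothetical, and the row-style convention (upper triangular, positive pivots, entries above a pivot reduced modulo the pivot) is one of several in use.

```python
def hermite_normal_form(a):
    """Row-style Hermite normal form of an integer matrix (list of lists).

    Uses only unimodular row operations: swaps, negation, and adding an
    integer multiple of one row to another.
    """
    h = [row[:] for row in a]
    rows, cols = len(h), len(h[0])
    pivot_row = 0
    for col in range(cols):
        if pivot_row == rows:
            break
        # Euclidean elimination: repeat until at most one nonzero entry
        # remains in this column at or below the pivot row.
        while True:
            nonzero = [r for r in range(pivot_row, rows) if h[r][col] != 0]
            if len(nonzero) <= 1:
                break
            p = min(nonzero, key=lambda r: abs(h[r][col]))
            for r in nonzero:
                if r != p:
                    q = h[r][col] // h[p][col]
                    h[r] = [x - q * y for x, y in zip(h[r], h[p])]
        if not nonzero:
            continue  # no pivot in this column
        p = nonzero[0]
        h[pivot_row], h[p] = h[p], h[pivot_row]
        if h[pivot_row][col] < 0:
            h[pivot_row] = [-x for x in h[pivot_row]]
        # Reduce the entries above the pivot into the range [0, pivot).
        for r in range(pivot_row):
            q = h[r][col] // h[pivot_row][col]
            h[r] = [x - q * y for x, y in zip(h[r], h[pivot_row])]
        pivot_row += 1
    return h


if __name__ == "__main__":
    print(hermite_normal_form([[2, 4], [3, 5]]))  # [[1, 1], [0, 2]]
```

Python integers have arbitrary precision, so the entry growth mentioned in the abstract shows up here as increasing time and memory use rather than overflow.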

  • 3.
    Karlsson, Lars
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Kressner, Daniel
    Uschmajew, Andre
    Parallel algorithms for tensor completion in the CP format. 2016. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 57, p. 222-234. Article in journal (Refereed)
    Abstract [en]

    Low-rank tensor completion addresses the task of filling in missing entries in multidimensional data. It has proven its versatility in numerous applications, including context aware recommender systems and multivariate function learning. To handle large-scale datasets and applications that feature high dimensions, the development of distributed algorithms is central. In this work, we propose novel, highly scalable algorithms based on a combination of the canonical polyadic (CP) tensor format with block coordinate descent methods. Although similar algorithms have been proposed for the matrix case, the case of higher dimensions gives rise to a number of new challenges and requires a different paradigm for data distribution. The convergence of our algorithms is analyzed and numerical experiments illustrate their performance on distributed-memory architectures for tensors from a range of different applications.
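As a rough illustration of the CP model behind tensor completion, the sketch below evaluates one entry of a tensor from its factor matrices and sums the squared error over the observed entries only. The names `cp_entry` and `completion_error` and the dictionary representation of the observations are assumptions made for this example. The block coordinate descent updates and the data distribution from the paper are not shown.

```python
def cp_entry(factors, index):
    """Entry of a CP-format tensor: sum over r of the product of factor rows.

    `factors[d][i][r]` is the entry of the d-th factor matrix for index i
    and rank component r (hypothetical layout chosen for this sketch).
    """
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        term = 1.0
        for factor, i in zip(factors, index):
            term *= factor[i][r]
        total += term
    return total


def completion_error(factors, observed):
    """Sum of squared residuals over the observed entries only."""
    return sum((cp_entry(factors, idx) - val) ** 2 for idx, val in observed.items())


if __name__ == "__main__":
    a = [[1.0, 0.0], [0.0, 1.0]]
    b = [[1.0, 2.0], [3.0, 4.0]]
    c = [[1.0, 1.0]]
    observed = {(0, 1, 0): 3.0, (1, 0, 0): 5.0}
    print(completion_error([a, b, c], observed))
```

A completion method would adjust the factor matrices to reduce this error and then use `cp_entry` to predict the missing entries.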

  • 4.
    Karlsson, Lars
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Kågström, Bo
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures. 2011. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 37, no 12, p. 771-782. Article in journal (Refereed)
    Abstract [en]

    We consider parallel reduction of a real matrix to Hessenberg form using orthogonal transformations. Standard Hessenberg reduction algorithms reduce the columns of the matrix from left to right in either a blocked or unblocked fashion. However, the standard blocked variant performs 20% of the computations in terms of matrix-vector multiplications. We show that a two-stage approach consisting of an intermediate reduction to block Hessenberg form speeds up the reduction by avoiding matrix-vector multiplications. We describe and evaluate a new high-performance implementation of the two-stage approach that attains significant speedups over the one-stage approach. The key components are a dynamically scheduled implementation of Stage 1 and a blocked, adaptively load-balanced implementation of Stage 2. © 2011 Elsevier B.V. All rights reserved.

  • 5.
    Karlsson, Lars
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Kågström, Bo
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Wadbro, Eddie
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Science and Technology, High Performance Computing Center North (HPC2N).
    Fine-Grained Bulge-Chasing Kernels for Strongly Scalable Parallel QR Algorithms. 2014. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, no 7, p. 271-288. Article in journal (Refereed)
    Abstract [en]

    The bulge-chasing kernel in the small-bulge multi-shift QR algorithm for the non-symmetric dense eigenvalue problem becomes a sequential bottleneck when the QR algorithm is run in parallel on a multicore platform with shared memory. The duration of each kernel invocation is short, but the critical path of the QR algorithm contains a long sequence of calls to the bulge-chasing kernel. We study the problem of parallelizing the bulge-chasing kernel itself across a handful of processor cores in order to reduce the execution time of the critical path. We propose and evaluate a sequence of four algorithms with varying degrees of complexity and verify that a pipelined algorithm with a slowly shifting block column distribution of the Hessenberg matrix is superior. The load-balancing problem is non-trivial and computational experiments show that the load-balancing scheme has a large impact on the overall performance. We propose two heuristics for the load-balancing problem and also an effective optimization method based on local search. Numerical experiments show that speed-ups are obtained for problems as small as 40-by-40 on two different multicore architectures.

  • 6.
    Khelghatdoust, Mansour
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. The University of Sydney.
    Gramoli, Vincent
    The University of Sydney.
    A scalable and low latency probe-based scheduler for data analytics frameworks. 2021. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 103, article id 102752. Article in journal (Refereed)
    Abstract [en]

    Today's data analytics frameworks divide jobs into many parallel tasks such that each task operates on a small partition of data in order to execute jobs with low latency. Such frameworks often rely on probe-based distributed schedulers to tackle the challenge of reducing the associated overhead. Unfortunately, the existing solutions do not perform efficiently under workload fluctuations and heterogeneous job durations. This is due to a problem called Head-of-Line blocking, i.e., short tasks are enqueued at workers behind longer tasks. To overcome this problem, we propose Peacock (Khelghatdoust and Gramoli) [25], a new fully distributed probe-based scheduling method. Unlike the existing methods, Peacock introduces a novel probe rotation technique. Workers form a ring overlay network and rotate probes using elastic queues of workers. It is augmented by a novel starvation-free probe reordering algorithm executed by workers. We evaluate Peacock against two existing state-of-the-art probe-based solutions through a trace-driven simulation of up to 20,000 workers and a distributed experiment of 100 workers in Apache Spark under Google, Cloudera, and Yahoo! traces. The performance results indicate that Peacock outperforms the state-of-the-art in all cluster sizes and loads. Our distributed experiments confirm our simulation results.
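Head-of-Line blocking, as defined in the abstract, can be made concrete with a toy calculation for a single worker. The sketch assumes a strict first-in-first-out queue and known task durations. The function `waiting_times` is hypothetical, and the sorted queue is only a point of comparison, not Peacock's probe rotation or reordering algorithm.

```python
def waiting_times(durations):
    """Time each task waits before it starts on a single FIFO worker."""
    waits, clock = [], 0
    for duration in durations:
        waits.append(clock)
        clock += duration
    return waits


if __name__ == "__main__":
    queue = [100, 1, 1]  # one long task enqueued ahead of two short ones
    print(waiting_times(queue))          # [0, 100, 101]
    print(waiting_times(sorted(queue)))  # [0, 1, 2]
```

In the first ordering both short tasks wait at least 100 time units behind the long task. Running the short tasks first cuts their wait to almost nothing while delaying the long task by only 2 units.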

  • 7.
    Schwarz, Angelika Beatrix
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Karlsson, Lars
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Scalable eigenvector computation for the non-symmetric eigenvalue problem. 2019. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 85, p. 131-140. Article in journal (Refereed)
    Abstract [en]

    We present two task-centric algorithms for computing selected eigenvectors of a non-symmetric matrix reduced to real Schur form. Our approach eliminates the sequential phases present in the current LAPACK/ScaLAPACK implementation. We demonstrate the scalability of our implementation on multicore, manycore and distributed memory systems.

  • 8.
    Schwarz, Angelika Beatrix
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Kjelgaard Mikkelsen, Carl Christian
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Karlsson, Lars
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Robust parallel eigenvector computation for the non-symmetric eigenvalue problem. 2020. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 100, article id 102707. Article in journal (Refereed)
    Abstract [en]

    A standard approach for computing eigenvectors of a non-symmetric matrix reduced to real Schur form relies on a variant of backward substitution. Backward substitution is prone to overflow. To avoid overflow, the LAPACK eigenvector routine DTREVC3 associates every eigenvector with a scaling factor and dynamically rescales an entire eigenvector during the backward substitution such that overflow cannot occur. When many eigenvectors are computed, DTREVC3 applies backward substitution successively for every eigenvector. This corresponds to level-2 BLAS operations and constitutes a bottleneck. This paper redesigns the backward substitution such that the entire computation is cast as tile operations (level-3 BLAS). By replacing LAPACK’s scaling factor with tile-local scaling factors, our solver decouples the tiles and sustains parallel scalability even when a lot of numerical scaling is necessary.
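The idea of pairing a solution vector with a scaling factor can be sketched for a plain upper-triangular system. The code below solves T x = scale * b and rescales the entire vector whenever a division or update could push an entry past a threshold, which mirrors the whole-vector rescaling the abstract attributes to DTREVC3. It is not LAPACK's actual logic and not the tile-local scheme of the paper. The function name, the threshold `big`, and the assumptions that T is nonsingular with off-diagonal entries of magnitude at most `big` are all made up for this illustration.

```python
def scaled_back_substitution(t, b, big=1e100):
    """Solve t @ x = scale * b for upper-triangular t, with 0 < scale <= 1.

    Hypothetical sketch: assumes a nonzero diagonal and off-diagonal
    entries of magnitude at most `big`.
    """
    n = len(b)
    x = [float(v) for v in b]
    scale = 1.0

    def rescale(s):
        # Shrink the whole vector and record the factor in `scale`.
        nonlocal x, scale
        x = [v * s for v in x]
        scale *= s

    for j in range(n - 1, -1, -1):
        # Guard the division x[j] / t[j][j] against exceeding `big`.
        tjj = abs(t[j][j])
        if abs(x[j]) > big * tjj:
            rescale(big * tjj / abs(x[j]))
        x[j] /= t[j][j]
        # Guard the update x[i] -= t[i][j] * x[j] with a coarse growth bound.
        xmax = max((abs(x[i]) for i in range(j)), default=0.0)
        tmax = max((abs(t[i][j]) for i in range(j)), default=0.0)
        bound = xmax + tmax * abs(x[j])
        if bound > big:
            rescale(big / bound)
        for i in range(j):
            x[i] -= t[i][j] * x[j]
    return x, scale


if __name__ == "__main__":
    t = [[2.0, 0.0], [0.0, 1e-200]]
    x, scale = scaled_back_substitution(t, [4.0, 1e150])
    print(x, scale)  # all entries finite; the true solution is x / scale
```

Because every rescaling touches the full vector, the work is inherently vector-oriented. That is the coupling the paper's tile-local scaling factors are designed to remove.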
