A novel parallel formulation of Hessenberg-triangular reduction of a regular matrix pair on distributed memory computers is presented. The formulation is based on a sequential cacheblocked algorithm by K degrees agstrom et al. [BIT, 48 (2008), pp. 563 584]. A static scheduling algorithm is proposed that addresses the problem of underutilized processes caused by two-sided updates of matrix pairs based on sequences of rotations. Experiments using up to 961 processes demonstrate that the new formulation is an improvement of the state of the art and also identify factors that limit its scalability.
In this paper we discuss the problem of computing eigenvectors for matrices in Schur form using parallel computing. We develop a new parallel algorithm and report on the performance of our MPI based implementation. We have also implemented a new parallel algorithm for scaling during the backsubstitution phase. We have increased the arithmetic intensity by interleaving the compution of several eigenvectors and by merging the backward substitution and the back-transformation of the eigenvector computation.
A novel parallel formulation of Hessenberg-triangular reduction of a regular matrix pair on distributed memory computers is presented. The formulation is based on a sequential cache-blocked algorithm by Kågstrom, Kressner, E.S. Quintana-Ortí, and G. Quintana-Ortí (2008). A static scheduling algorithm is proposed that addresses the problem of underutilized processes caused by two-sided updates of matrix pairs based on sequences of rotations. Experiments using up to 961 processes demonstrate that the new algorithm is an improvement of the state of the art but also identifies factors that currently limit its scalability.
Appearing frequently in applications, generalized eigenvalue problems represent one of the core problems in numerical linear algebra. The QZ algorithm of Moler and Stewart is the most widely used algorithm for addressing such problems. Despite its importance, little attention has been paid to the parallelization of the QZ algorithm. The purpose of this work is to fill this gap. We propose a parallelization of the QZ algorithm that incorporates all modern ingredients of dense eigensolvers, such as multishift and aggressive early deflation techniques. To deal with (possibly many) infinite eigenvalues, a new parallel deflation strategy is developed. Numerical experiments for several random and application examples demonstrate the effectiveness of our algorithm on two different distributed memory HPC systems.
The QZ algorithm reduces a regular matrix pair to generalized Schur form, which can be used to address the generalized eigenvalue problem. This paper summarizes recent work on improving the performance of the QZ algorithm on serial machines and work in progress on a novel parallel implementation. In both cases, the QZ iterations are based on chasing chains of tiny bulges. This allows to formulate the majority of the computation in terms of matrix-matrix multiplications, resulting in natural parallelism and better performance on modern computing systems with memory hierarchies. In addition, advanced deflation strategies are used, specifically the so called aggressive early deflation, leading to a considerable convergence acceleration and consequently to a reduction of floating point operations and computing time.
Given a general matrix pair (A,B) with real entries, we provide software routines for computing a generalized Schur decomposition (S, T). The real and complex conjugate pairs of eigenvalues appear as 1×1 and 2×2 blocks, respectively, along the diagonals of (S, T) and can be reordered in any order. Typically, this functionality is used to compute orthogonal bases for a pair of deflating subspaces corresponding to a selected set of eigenvalues. The routines are written in Fortran 90 and targets distributed memory machines.
We present parallel algorithms for triangular periodic Sylvester-type matrix equations, conceptually being the third step of a periodic Bartels-Stewart-like solution method for general periodic Sylvester-type matrix equations based on variants of the periodic Schur decomposition. The presented algorithms are designed and implemented in the framework of the recently developed HPG library SCASY and are based on explicit blocking, 2-dimensional block cyclic data distribution and a wavefront-like traversal of the right hand side matrices. High performance is obtained by rich usage of level 3 BLAS operations. It is also demonstrated how several important key concepts of SCASY regarding communications and the treatment of quasi-triangular coefficient matrices are generalized to the periodic case. Some experimental results from a distributed memory Linux cluster demonstrate are also presented.
A block (column) wrap-mapping approach for design of parallel block matrix factorization algorithms that are (trans)portable over and between shared memory multiprocessors (SMM) and distributed memory multicomputers (DMM) is presented. By reorganizing the matrix on the SMM architecture, the same ring-oriented algorithms can be used on both SMM and DMM systems with all machine dependencies comprised to a small set of communication routines. The algorithms are described on high level with focus on portability and scalability aspects. Implementation aspects of the LU , Cholesky, and QR factorizations and machine specific communication routines for some SMM and DMM systems are discussed. Timing results show that our portable algorithms have similar performance as machine specific implementations. 1 Introduction With the introduction of advanced parallel computer architectures a demand for efficient and portable algorithms has emerged. Several attempts to design algorithms and implementat.
A two-stage blocked algorithm for reduction of a regular matrix pair (A, B) to upper Hessenberg-triangular form is presented. In stage 1 (A, B) is reduced to block upper Hessenberg-triangular form using mainly level 3 (matrix-matrix) operations that permit data reuse in the higher levels of a memory hierarchy. In the second stage all but one of the r subdiagonals of the block Hessenberg A-part are set to zero using Givens rotations. The algorithm proceeds in a sequence of supersweeps, each reducing m columns. The updates with respect to row and column rotations are organized to reference consecutive columns of A and B. To further improve the data locality, all rotations produced in a supersweep are stored to enable a left-looking reference pattern, i.e., all updates are delayed until they are required for the continuation of the supersweep. Moreover, we present a blocked variant of the single diagonal double-shift QZ method for computing the generalized Schur form of(A, B) in upper Hessenberg-triangular form. The blocking for improved data locality is done similarly, now by restructuring the reference pattern of the updates associated with the bulge chasing in the QZ iteration. Timing results show that our new blocked variants outperform the current LAPACK routines, including drivers for the generalized eigenvalue problem, by a factor 2-5 for sufficiently large problems.
We construct the Hasse diagrams G2 and G3 for the closure ordering on the sets of congruence classes of 2 × 2 and 3 × 3 complex matrices. In other words, we construct two directed graphs whose vertices are 2 × 2 or, respectively, 3 × 3 canonical matrices under congruence, and there is a directed path from A to B if and only if A can be transformed by an arbitrarily small perturbation to a matrix that is congruent to B. A bundle of matrices under congruence is defined as a set of square matrices A for which the pencils A + λAT belong to the same bundle under strict equivalence. In support of this definition, we show that all matrices in a congruence bundle of 2 × 2 or 3 × 3 matrices have the same properties with respect to perturbations. We construct the Hasse diagrams G2 B and G3 B for the closure ordering on the sets of congruence bundles of 2 × 2 and, respectively, 3 × 3 matrices. We find the isometry groups of 2 × 2 and 3 × 3 congruence canonical matrices.
We investigate the changes under small perturbations of the canonical structure information for a system pencil (A B C D) − s (E 0 0 0), det(E) ≠ 0, associated with a (generalized) linear time-invariant state-space system. The equivalence class of the pencil is taken with respect to feedback-injection equivalence transformation. The results allow to track possible changes under small perturbations of important linear system characteristics.
We investigate the changes of the canonical structure information under small perturbations for a system pencil associated with a (generalized) linear time-invariant state-space system. The equivalence class of the pencil is taken with respect to feedback-injection equivalence transformations. The results allow us to track possible changes of important linear system characteristics under small perturbations.
Matlab functions to work with the canonical structures for congru-ence and *congruence of matrices, and for congruence of symmetricand skew-symmetric matrix pencils are presented. A user can providethe canonical structure objects or create (random) matrix examplesetups with a desired canonical information, and compute the codi-mensions of the corresponding orbits: if the structural information(the canonical form) of a matrix or a matrix pencil is known it isused for the codimension computations, otherwise they are computednumerically. Some auxiliary functions are provided too. All thesefunctions extend the Matrix Canonical Structure Toolbox.
We study how small perturbations of general matrix polynomials may change their elementary divisors and minimal indices by constructing the closure hierarchy (stratification) graphs of matrix polynomials' orbits and bundles. To solve this problem, we construct the stratification graphs for the first companion Fiedler linearization of matrix polynomials. Recall that the first companion Fiedler linearization as well as all the Fiedler linearizations is matrix pencils with particular block structures. Moreover, we show that the stratification graphs do not depend on the choice of Fiedler linearization which means that all the spaces of the matrix polynomial Fiedler linearizations have the same geometry (topology). This geometry coincides with the geometry of the space of matrix polynomials. The novel results are illustrated by examples using the software tool StratiGraph extended with associated new functionality.
We study how small perturbations of matrix polynomials may change their elementary divisors and minimal indices by constructing the closure hierarchy graphs (stratifications) of orbits and bundles of matrix polynomial Fiedler linearizations. We show that the stratifica-tion graphs do not depend on the choice of Fiedler linearization which means that all the spaces of the matrix polynomial Fiedler lineariza-tions have the same geometry (topology). The results are illustrated by examples using the software tool StratiGraph.
We prove Roth-type theorems for systems of matrix equations including an arbitrary mix of Sylvester and $\star$-Sylvester equations, in which the transpose or conjugate transpose of the unknown matrices also appear. In full generality, we derive consistency conditions by proving that such a system has a solution if and only if the associated set of $2 \times 2$ block matrix representations of the equations are block diagonalizable by (linked) equivalence transformations. Various applications leading to several particular cases have already been investigated in the literature, some recently and some long ago. Solvability of these cases follow immediately from our general consistency theory. We also show how to apply our main result to systems of Stein-type matrix equations.
We study how small perturbations of a skew-symmetric matrix pencil may change its canonical form under congruence. This problem is also known as the stratification problem of skew-symmetric matrix pencil orbits and bundles. In other words, we investigate when the closure of the congruence orbit (or bundle) of a skew-symmetric matrix pencil contains the congruence orbit (or bundle) of another skew-symmetric matrix pencil. The developed theory relies on our main theorem stating that a skew-symmetric matrix pencil A - lambda B can be approximated by pencils strictly equivalent to a skew-symmetric matrix pencil C - lambda D if and only if A - lambda B can be approximated by pencils congruent to C - lambda D.
We study how small perturbations of a skew-symmetric matrix pencil may change its canonical form under congruence. This problem is also known as the stratification problem of skew-symmetric matrix pencil orbits and bundles. In other words, we investigate when the closure of the congruence orbit (or bundle) of a skew-symmetric matrix pencil contains the congruence orbit (or bundle) of another skew-symmetric matrix pencil. This theory relies on our main theorem stating that a skew-symmetric matrix pencil A-λB can be approximated by pencils strictly equivalent to a skew-symmetric matrix pencil C-λD if and only if A-λB can be approximated by pencils congruent to C-λD.
The homogeneous system of matrix equations (X(T)A + AX, (XB)-B-T + BX) = (0, 0), where (A, B) is a pair of skew-symmetric matrices of the same size is considered: we establish the general solution and calculate the codimension of the orbit of (A, B) under congruence. These results will be useful in the development of the stratification theory for orbits of skew-symmetric matrix pencils.
The set of all solutions to the homogeneous system of matrix equations (X-T A + AX, X-T B + BX) = (0, 0), where (A, B) is a pair of symmetric matrices of the same size, is characterized. In addition, the codimension of the orbit of (A, B) under congruence is calculated. This paper is a natural continuation of the article [A. Dmytryshyn, B. Kagstrom, and V. V. Sergeichuk. Skew-symmetric matrix pencils: Codimension counts and the solution of a pair of matrix equations. Linear Algebra Appl., 438:3375-3396, 2013.], where the corresponding problems for skew-symmetric matrix pencils are solved. The new results will be useful in the development of the stratification theory for orbits of symmetric matrix pencils.
We derive versal deformations of the Kronecker canonical form by deriving the tangent space and orthogonal bases for the normal space to the orbits of strictly equivalent matrix pencils. These deformations reveal the local perturbation theory of matrix pencils related to the Kronecker canonical form. We also obtain a new singular value bound for the distance to the orbits of less generic pencils. The concepts, results, and their derivations are mainly expressed in the language of numerical linear algebra. We conclude with experiments and applications.
Computing the Jordan form of a matrix or the Kronecker structure of a pencil is a well-known ill-posed problem. We propose that knowledge of the closure relations, i.e., the stratification, of the orbits and bundles of the various forms may be applied in the staircase algorithm. Here we discuss and complete the mathematical theory of these relationships and show how they may be applied to the staircase algorithm. This paper is a continuation of our Part I paper on versal deformations, but it may also be read independently.
The High Performance Computing Center North (HPC2N) Super Cluster is a truly self-made high-performance Linux cluster with 240 AMD processors in 120 dual nodes, interconnected with a high-bandwidth, low-latency SCI network. This contribution describes the hardware selected for the system, the work needed to build it, important software issues and an extensive performance analysis. The performance is evaluated using a number of state-of-the-art benchmarks and software, including STREAM, Pallas MPI, the Atlas DGEMM, High-Performance Linpack and NAS Parallel benchmarks. Using these benchmarks we first determine the raw memory bandwidth and network characteristics; the practical peak performance of a single CPU, a single dual-node and the complete 240-processor system; and investigate the parallel performance for non-optimized dusty-deck Fortran applications. In summary, this $500 000 system is extremely cost-effective and shows the performance one would expect of a large-scale supercomputing system with distributed memory architecture. According to the TOP500 list of June 2002, this cluster was the 94th fastest computer in the world. It is now fully operational and stable as the main computing facility at HPC2N. The system’s utilization figures exceed 90%, i.e. all 240 processors are on average utilized over 90% of the time, 24 hours a day, seven days a week.
The performance of a recently developed Hessenberg reduction algorithm greatly depends on the values chosen for its tunable parameters. The search space is huge combined with other complications makes the problem hard to solve effectively with generic methods and tools. We describe a modular auto-tuning framework in which the underlying optimization algorithm is easy to substitute. The framework exposes sub-problems of standard auto-tuning type for which existing generic methods can be reused. The outputs of concurrently executing sub-tuners are assembled by the framework into a solution to the original problem.
The reduction of a general dense and square matrix to Hessenberg form is a well known first step in many standard eigenvalue solvers. Although parallel algorithms exist, the Hessenberg reduction is still one of the bottlenecks in state-of-the-art software for the distributed QR algorithm. We propose a new NUMA-aware algorithm that fits the context of the QR algorithm and evaluate the tunability of its algorithmic parameters. The proposed algorithm can be faster than LAPACK and ScaLAPACK for small problem sizes. In addition, evaluating the algorithmic parameters shows that there is potential for auto-tuning some of the parameters.
The reduction of a general dense square matrix to Hessenberg form is a well known first step in many standard eigenvalue solvers. Although parallel algorithms exist, the Hessenberg reduction is one of the bottlenecks in AED, a main part in state-of-the-art software for the distributed multishift QR algorithm. We propose a new NUMA-aware algorithm that fits the context of the QR algorithm and evaluate the sensitivity of its algorithmic parameters. The proposed algorithm is faster than LAPACK for all problem sizes and faster than ScaLAPACK for the relatively small problem sizes typical for AED.
Matrix computations are both fundamental and ubiquitous in computational science and its vast application areas. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms and library software that efficiently utilize and adapt to new architecture features. This article reviews and details some of the recent advances made by applying the paradigm of recursion to dense matrix computations on today's memory-tiered computer systems. Recursion allows for efficient utilization of a memory hierarchy and generalizes existing fixed blocking by introducing automatic variable blocking that has the potential of matching every level of a deep memory hierarchy. Novel recursive blocked algorithms offer new ways to compute factorizations such as Cholesky and QR and to solve matrix equations. In fact, the whole gamut of existing dense linear algebra factorization is beginning to be reexamined in view of the recursive paradigm. Use of recursion has led to using new hybrid data structures and optimized superscalar kernels. The results we survey include new algorithms and library software implementations for level 3 kernels, matrix factorizations, and the solution of general systems of linear equations and several common matrix equations. The software implementations we survey are robust and show impressive performance on today's high performance computing systems.
The canonical structures of controllability and observability pairs (A,B) and (A,C) associated with a state-space system are studied under small perturbations. We show how previous work for general matrix pencils can be applied to the stratification of orbits and bundles of matrix pairs. A stratification provides qualitative information about the closure relation between canonical structures.We also present how the new results are used in StratiGraph, which is a software tool for computing and visualizing closure hierarchies.
A prototype web computing environment for computations related to the design and analysis of control systems using the SLICOT software library is presented. The web interface can be accessed from a standard world wide web browser with no need for additional software installations on the local machine. The environment provides user-friendly access to SLICOT routines where run-time options are specified by mouse clicks on appropriate buttons. Input data can be entered directly into the web interface by the user or uploaded from a local computer in a standard text format or in Matlab binary format. Output data is presented in the web browser window and possible to download in a number of different formats, including Matlab binary. The environment is ideal for testing the SLICOT software before performing a software installation or for performing a limited number of computations. It is also highly recommended for education as it is easy to use, and basically self-explanatory, with the users' guide integrated in the user interface.
Computing the fine-canonical-structure elements of matrices and matrix pencils are ill-posed problems. Therefore, besides knowing the canonical structure of a matrix or a matrix pencil, it is equally important to know what are the nearby canonical structures that explain the behavior under small perturbations. Qualitative strata information is provided by our StratiGraph tool. Here, we present lower and upper bounds for the distance between Jordan and Kronecker structures in a closure hierarchy of an orbit or bundle stratification. This quantitative information is of importance in applications, e.g., distance to more degenerate systems (uncontrollability). Our upper bounds are based on staircase regularizing perturbations. The lower bounds are of EckartYoung type and are derived from a matrix representation of the tangent space of the orbit of a matrix or a matrix pencil. Computational results illustrate the use of the bounds.
StratiGraph, a Java-based tool for computation and presentation of closure hierarchies of Jordan and Kronecker structures is presented. The tool is based on recent theoretical results on stratifications of orbits and bundles of matrices and matrix pencils. A stratification reveals the complete hierarchy of nearby structures. information critical for explaining the qualitative behaviour of linear systems under perturbations. StratiGraph facilitates the application of these theories and visualizes the resulting hierarchy as a graph. Nodes in the graph represent orbits or bundles of matrices or matrix pencils. Edges represent covering relations in the closure hierarchy. Given a Jordan or Kronecker structure, a user can obtain the complete information of nearby structures simply by mouse clicks on nodes of interest. This contribution gives an overview of the StratiGraph tool, presents its main functionalities and other features, and illustrates its use by sample applications.
Copyright (C) 2001 John Wiley & Sons, Ltd.
Cover relations for orbits and bundles of controllability and observability pairs associated with linear time-invariant systems are derived. The cover relations are combinatorial rules acting on integer sequences, each representing a subset of the Jordan and singular Kronecker structures of the corresponding system pencil. By representing these integer sequences as coin piles, the derived stratification rules are expressed as minimal coin moves between and within these piles, which satisfy and preserve certain monotonicity properties. The stratification theory is illustrated with two examples from systems and control applications, a mechanical system consisting of a thin uniform platform supported at both ends by springs, and a linearized Boeing 747 model. For both examples, nearby uncontrollable systems are identified as subsets of the complete closure hierarchy for the associated system pencils.
The set (or family) of 2-by-3 matrix pencils A-lambda B comprises 18 structurally different Kronecker structures (canonical forms). The algebraic and geometric characteristics of the generic and the 17 nongeneric cases are examined in full detail. The complete closure hierarchy of the orbits of all different Kronecker structures is derived and presented in a closure graph that shows how the structures relate to each other in the la-dimensional space spanned by the set of 2-by-3 pencils. Necessary conditions on perturbations for transiting from the orbit of one Kronecker structure to another in the closure hierarchy are presented in a labeled closure graph. The node and are labels shows geometric characteristics of an orbit's Kronecker structure and the change of geometric characteristics when transiting to an adjacent node, respectively. Computable normwise bounds for the smallest perturbations (delta A, delta B) of a generic 2-by-3 pencil A lambda B such that (A+delta A)-lambda(B+delta B) has a specific nongeneric Kronecker structure are presented. First, explicit expressions for the perturbations that transfer A-lambda B to a specified nongeneric form are derived. In this context tractable and intractable perturbations are defined. Second, a modified GUPTRI that computes a specified Kronecker structure of a generic pencil is used. Perturbations devised to impose a certain nongeneric structure are computed in a way that guarantees one will find a Kronecker canonical form (KCF) on the closure of the orbit of the intended KCF. Both approaches are illustrated by computational experiments. Moreover, a study of the behaviour of the nongeneric structures under random perturbations in finite precision arithmetic (using the GUPTRI software) show for which sizes of perturbations the structures are invariant and also that structure transitions occur in accordance with the closure hierarchy. Finally, some of the results are extended to the general m-by-(m+1) case.
Recently, recursive blocked algorithms for solving triangular one-sided and two-sided Sylvester-type equations were introduced by Jonsson and Kagstrom. This elegant yet simple technique enables an automatic variable blocking that has the potential of matching the memory hierarchies of today's HPC systems. The main parts of the computations are performed as level 3 general matrix multiply and add (GEMM) operations. We extend and apply the recursive blocking technique to solving periodic Sylvester-type matrix equations. Successive recursive splittings are performed on 3-dimensional arrays, where the third dimension represents the periodicity of a matrix equation.
Library software implementing a parallel small-bulge multishift QR algorithm with Aggressive Early Deflation (AED) targeting distributed memory high-performance computing systems is presented. Starting from recent developments of the parallel multishift QR algorithm [Granat et al., SIAM J. Sci. Comput. 32(4), 2010], we describe a number of algorithmic and implementation improvements. These include communication avoiding algorithms via data redistribution and a refined strategy for balancing between multishift QR sweeps and AED. Guidelines concerning several important tunable algorithmic parameters are provided. As a result of these improvements, a computational bottleneck within AED has been removed in the parallel multishift QR algorithm. A performance model is established to explain the scalability behavior of the new parallel multishift QR algorithm. Numerous computational experiments confirm that our new implementation significantly outperforms previous parallel implementations of the QR algorithm.
We continue our presentation of parallel ScaLAPACK-style algorithms for solving Sylvester-type matrix equations. In Part II, we present SCASY (SCAlable SYlvester solvers), a state-of-the-art HPC software library for solving 44 sign and transpose variants of eight common standard and generalized Sylvester-type matrix equations.
A direct method for eigenvalue reordering in a product of a K-periodic matrix sequence in periodic or extended periodic real Schur form is presented and analyzed. Each reordering of two adjacent sequences of diagonal blocks is performed tentatively to guarantee backward stability and involves solving a K-periodic Sylvester equation (PSE) and constructing a K-periodic sequence of orthogonal transformation matrices. An error analysis of the direct reordering method is presented, and results from computational experiments confirm the stability and accuracy of the method for well-conditioned as well as ill-conditioned problems. These include matrix sequences with fixed and time-varying dimensions, and sequences of small and large periodicity.
We discuss parallel algorithms for solving eight common standard and generalized triangular Sylvester-type matrix equation. Our parallel algorithms are based on explicit blocking, 2D block-cyclic data distribution of the matrices and wavefront-like traversal of the right hand side matrices while solving small-sized matrix equations at different nodes and updating the rest of the right hand side using level 3 operations. We apply the triangular solvers in condition estimation, developing parallel sep(-1)-estimators. Some experimental results are presented.
Parallel ScaLAPACK-style algorithms for solving eight common standard and generalized Sylvester-type matrix equations and various sign and transposed variants are presented. All algorithms are blocked variants based on the Bartels--Stewart method and involve four major steps: reduction to triangular form, updating the right-hand side with respect to the reduction, computing the solution to the reduced triangular problem, and transforming the solution back to the original coordinate system. Novel parallel algorithms for solving reduced triangular matrix equations based on wavefront-like traversal of the right-hand side matrices are presented together with a generic scalability analysis. These algorithms are used in condition estimation and new robust parallel sep − 1-estimators are developed. Experimental results from three parallel platforms, including results from a mixed OpenMP/MPI platform, are presented and analyzed using several performance and accuracy metrics. The analysis includes results regarding general and triangular parallel solvers as well as parallel condition estimators.
We present the functionality, the contents, and the proper usage of the latest release of the SCASY software library
A novel variant of the parallel QR algorithm for solving dense nonsymmetric eigenvalue problems on hybrid distributed high performance computing systems is presented. For this purpose, we introduce the concept of multiwindow bulge chain chasing and parallelize aggressive early deflation. The multiwindow approach ensures that most computations when chasing chains of bulges are performed in level 3 BLAS operations, while the aim of aggressive early deflation is to speed up the convergence of the QR algorithm. Mixed MPI-OpenMP coding techniques are utilized for porting the codes to distributed memory platforms with multithreaded nodes, such as multicore processors. Numerous numerical experiments confirm the superior performance of our parallel QR algorithm in comparison with the existing ScaLAPACK code, leading to an implementation that is one to two orders of magnitude faster for sufficiently large problems, including a number of examples from applications.
Numerical algorithms for solving the continuous-time algebraic Riccati matrix equation on a distributed memory parallel computer are considered. In particular, it is shown that the Schur method, based on computing the stable invariant subspace of a Hamiltonian matrix, can be parallelized in an efficient and scalable way. Our implementation employs the state-of-the-art library ScaLAPACK as well as recently developed parallel methods for reordering the eigenvalues in a real Schur form. Some experimental results are presented, confirming the scalability of our implementation and comparing it with an existing implementation of the matrix sign iteration from the PLiCOC library.