# Applications/Hpl

## Application Details

- Description: HPL is a software package that solves a (random) dense linear system in double precision (64-bit) arithmetic on distributed-memory computers. It is a freely available implementation of the High Performance Computing Linpack Benchmark.
- Version: 11.3.1
- Modules: hpl/intel-2016/11.3.1
- Licence: Open source (See http://www.netlib.org/benchmark/hpl/copyright.html)

## Usage Examples

HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.

The algorithm used by HPL can be summarized by the following keywords:

- Two-dimensional block-cyclic data distribution
- Right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths
- Recursive panel factorization with pivot search and column broadcast combined
- Various virtual panel broadcast topologies
- Bandwidth-reducing swap-broadcast algorithm
- Backward substitution with look-ahead of depth 1
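To make the first keyword concrete, the following is a minimal sketch of a two-dimensional block-cyclic mapping. It is not taken from the HPL source: the function name `block_owner`, the block size `nb`, and the process grid dimensions `p` and `q` are illustrative, and the sketch assumes the distribution starts at process (0, 0).

```shell
#!/bin/bash
# Illustrative sketch only: map a global matrix entry (i, j) to the
# coordinates of the process that owns it under a 2D block-cyclic
# distribution with square nb x nb blocks on a p x q process grid.
# All names here are hypothetical, not part of HPL itself.
block_owner() {
    local i=$1 j=$2 nb=$3 p=$4 q=$5
    # Block row index is i / nb; blocks are dealt out cyclically over p rows.
    # Block column index is j / nb; blocks are dealt out cyclically over q columns.
    echo "$(( (i / nb) % p )) $(( (j / nb) % q ))"
}

# Print the owner of the first entry of each block of an 8 x 8 matrix,
# using 2 x 2 blocks on a 2 x 2 process grid.
for i in 0 2 4 6; do
    row=""
    for j in 0 2 4 6; do
        row="$row($(block_owner "$i" "$j" 2 2 2)) "
    done
    echo "$row"
done
```

Dealing blocks out cyclically in both dimensions keeps every process busy as the factorization shrinks the active part of the matrix, which is why this layout is used for dense LU.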

The HPL package provides a testing and timing program to quantify the accuracy of the obtained solution as well as the time it took to compute it. The best performance achievable by this software on your system depends on a large variety of factors. Nonetheless, with some restrictive assumptions on the interconnection network, the algorithm described here and its attached implementation are scalable in the sense that their parallel efficiency is maintained constant with respect to the per processor memory usage.
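As a rough illustration of how a timing is turned into a performance figure, the sketch below converts a problem size N and a wall-clock time into Gflops. It assumes the conventional operation count for an LU factorization plus solve, 2/3 N^3 + 3/2 N^2; the function name `hpl_gflops` is hypothetical, and the values passed to it are made up for the example rather than measured on any system.

```shell
#!/bin/bash
# Illustrative sketch only: estimate Gflops from problem size N and the
# elapsed time in seconds, assuming the conventional operation count
# 2/3 * N^3 + 3/2 * N^2 for solving a dense N x N system.
# hpl_gflops is a hypothetical helper, not part of the HPL package.
hpl_gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN {
        flops = (2.0 / 3.0) * n * n * n + (3.0 / 2.0) * n * n
        printf "%.2f\n", flops / t / 1e9
    }'
}

# Example call with made-up numbers: N = 3000 solved in 2 seconds.
hpl_gflops 3000 2
```

Comparing this figure with the theoretical peak of the node gives a quick sense of how efficiently a given run used the hardware.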

The HPL software package requires the availability on your system of an implementation of the Message Passing Interface MPI (1.1 compliant). An implementation of either the Basic Linear Algebra Subprograms BLAS or the Vector Signal Image Processing Library VSIPL is also needed. Machine-specific as well as generic implementations of MPI, the BLAS and VSIPL are available for a large variety of systems.

### Non-interactive job

These examples run under the SLURM scheduler.

#### Compute Node

In this example we test the compute nodes. Note that we specify only 1 task per node, as HPL will use all of the available cores.

    #!/bin/bash
    #SBATCH -J compute-single-node
    #SBATCH -N 1
    #SBATCH --ntasks-per-node 1
    #SBATCH -m cyclic:cyclic
    #SBATCH -D /home/user/benchmarks/HPL/single_node/compute
    #SBATCH -o %N.%j.%a.out
    #SBATCH -e %N.%j.%a.err
    #SBATCH -p compute
    #SBATCH --exclusive

    echo $SLURM_JOB_NODELIST

    module purge
    module add hpl/intel-2016/11.3.1

    export I_MPI_DEBUG=5
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_FALLBACK=no

    xhpl_intel64

Submit the script to SLURM:

    [username@login01 ~]$ sbatch HPLdemo.job
    Submitted batch job 889552

#### GPU Node

In this example we test the GPU nodes. Note that we specify only 1 task per node, as HPL will use all of the available cores.

    #!/bin/bash
    #SBATCH -J gpu-single-node
    #SBATCH -N 1
    #SBATCH --ntasks-per-node 1
    #SBATCH -m cyclic:cyclic
    #SBATCH -D /home/cvsupport/benchmarks/HPL/single_node/gpu
    #SBATCH -o %N.%j.%a.out
    #SBATCH -e %N.%j.%a.err
    #SBATCH -p gpu
    #SBATCH --exclusive

    echo $SLURM_JOB_NODELIST

    module purge
    module add hpl/intel-2016/11.3.1
    module add intel/mkl/64/11.3.2

    export PSM2_SDMA=0
    export I_MPI_DEBUG=5
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_FALLBACK=no

    xhpl_intel64_dynamic

Submit the script to SLURM, this time to the GPU queue:

    [username@login01 ~]$ sbatch HPLgpu-demo.job
    Submitted batch job 882552
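Once a job has finished, the accuracy check reported by the HPL testing program can be read from the job's output file. The sketch below is one possible way to summarise it, assuming the residual check lines in the output end in `PASSED` or `FAILED`; the helper name `hpl_summary` is hypothetical, and the output file name depends on the `-o` pattern used in the job script.

```shell
#!/bin/bash
# Illustrative sketch only: count the residual checks that passed or
# failed in an HPL output file. Assumes each check line ends with the
# word PASSED or FAILED. hpl_summary is a hypothetical helper.
hpl_summary() {
    local file=$1 passed failed
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    # grep -c exits non-zero when the count is 0, so ignore its status.
    passed=$(grep -c 'PASSED' "$file" || true)
    failed=$(grep -c 'FAILED' "$file" || true)
    echo "passed=$passed failed=$failed"
    # Succeed only if at least one check passed and none failed.
    [ "$failed" -eq 0 ] && [ "$passed" -gt 0 ]
}
```

It would be called with the path of the job's output file, for example `hpl_summary <output file>`, and its exit status can be used in a wrapper script to flag nodes whose run did not pass.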

## Further Information