Parallelism

This page gives hints on how to set parameters for a parallel calculation with the ABINIT package.

Introduction

Running ABINIT in parallel (MPI 10 processors) can be as simple as:

mpirun -n 10 abinit run.abi > log 2> err

or (MPI 10 processors + OpenMP 4 threads):

export OMP_NUM_THREADS=4
mpirun -n 10 abinit run.abi > log 2> err   

The command mpirun might have to be replaced by mpiexec, depending on your system. In these examples, the standard output of the application is redirected to log, while err collects the standard error. Note that the control of the output files needs to be adapted in the parallel case: for massively parallel runs, one cannot afford to create all of the output files that are usually produced. See the abinit help file for more explanation and for how to change the default behavior (_LOG/_NOLOG files), as illustrated below.
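For instance, to our understanding of the abinit help file, the per-process log files can be suppressed by placing an empty file named _NOLOG in the working directory (while an empty file named _LOG forces their creation):

touch _NOLOG    # suppress per-process log files (behavior assumed from the abinit help file)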

  • For ground-state calculations, the code has been parallelized (MPI-based parallelism) over the k-points, the spins, the spinor components, the bands, and the FFT grid / plane-wave coefficients. For the k-point and spin parallelisations (using MPI), the communication load is generally very small and the parallel efficiency very good, provided the number of MPI processes divides the number of k-points in the IBZ. However, the number of nodes that can be exploited with this kind of k-point/spin distribution might be small, and depends strongly on the physics of the problem. A combined FFT / band parallelisation (LOBPCG with paral_kgb 1) is available [Bottin2008], and has shown very large speed-ups (>1000) on powerful computers with a large number of processors and a high-speed interconnect. The combination of FFT / band / k-point / spin parallelism is also available, and quite efficient on such computers. It works for norm-conserving as well as PAW calculations. Automatic determination of the best combination of parallelism levels is available. Use of MPI-IO is mandatory for the largest speed-ups to be observed. An illustrative input sketch is given after this list.

  • Chebyshev filtering (Chebfi) is another method to solve the linear eigenvalue problem, and can be used as an SCF solver in ABINIT [Levitt2015]. The design goal is for Chebfi to replace LOBPCG as the solver of choice for large-scale computations in ABINIT. By performing fewer orthogonalizations and diagonalizations than LOBPCG, it scales to higher processor counts. A manual explaining how to use Chebfi is available here

  • For ground-state calculations with a set of images (e.g. the nudged elastic band method, the string method, path-integral molecular dynamics, the genetic algorithm), MPI-based parallelism is used: the workload for the different images is distributed over groups of processes. This parallelization level can be combined with the parallelism described above, leading to speed-ups beyond 5000. An illustrative input sketch is given after this list.

  • For ground-state calculations, GPUs can be used. Two GPU programming models are available: OpenMP offload (OpenMP v5+), compatible with NVIDIA and AMD accelerators, and Kokkos+CUDA, compatible with NVIDIA accelerators. See the gpu_option keyword.
    Obviously, to benefit from GPU acceleration, ABINIT has to be compiled in a specific way, using a GPU-compatible compiler (nvhpc, aocc, gcc), activating the relevant compilation options and linking against the specific libraries (CUDA toolkit, ROCm, …).
    This implementation is still EXPERIMENTAL (January 2024).

  • For ground-state calculations, the wavelet part of ABINIT (BigDFT) is also very well parallelized: MPI band parallelism, combined with GPUs.

  • For response calculations, the code has been MPI-parallelized over k-points, spins and bands, as well as over perturbations. For the k-point, spin and band parallelisation, the communication load is also rather small and, unlike for the GS calculations, the number of nodes that can be used in parallel is large, nearly independently of the physics of the problem. Parallelism over perturbations is very similar to the parallelism over images in the ground-state case (so, very efficient), although the load-balancing problem for perturbations with different numbers of k-points is not addressed at present. Use of MPI-IO is mandatory for the largest speed-ups to be observed. An illustrative input sketch is given after this list.

  • GW calculations are MPI-parallelized over k-points. They are also parallelized over transitions (valence-to-conduction band pairs), but the two parallelisations cannot be used at the same time at present. The transition parallelism has been shown to allow speed-ups as large as 300. An illustrative input sketch is given after this list.

  • Ground-state, response-function, and GW parallel calculations can also be done using OpenMP parallelism, even combined with MPI parallelism.
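As an illustration of the combined k-point / band / FFT distribution described in the first item above, here is a minimal input sketch. All values (process counts, number of bands) are hypothetical and must be adapted to the actual system; see the documentation of paral_kgb, np_spkpt, npband, npfft and bandpp.

# Hypothetical distribution: 4 k-point/spin groups x 8 band groups x 2 FFT groups,
# i.e. 4 x 8 x 2 = 64 MPI processes in total (to be matched by mpirun -n 64)
paral_kgb 1
np_spkpt  4     # processes at the k-point/spin level
npband    8     # processes at the band level
npfft     2     # processes at the FFT level
bandpp    2     # bands treated together within a band group
nband     128   # chosen here as a multiple of npband * bandpp

Alternatively, the automatic determination mentioned above can be used: to our understanding, setting autoparal 1 together with max_ncpus in a trial run makes ABINIT print a list of acceptable distributions, with their estimated efficiency, for up to that number of CPUs, and then stop, so that the best combination can be copied into the production input:

autoparal 1
max_ncpus 64    # report candidate distributions for up to 64 CPUs (assumed behavior)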
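For the image parallelism (nudged elastic band, string method, path-integral molecular dynamics, …), the relevant variable is npimage; the values below are again only illustrative.

# Hypothetical run with 12 images distributed over 4 groups of MPI processes;
# each group then treats 3 images, and the k-point/band/FFT parallelism above
# applies inside each group
nimage   12
npimage  4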
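For the response-function case, the parallelism over perturbations is activated with paral_rf and sized with nppert (hypothetical values below).

# Hypothetical DFPT run: the perturbations are distributed over 3 groups of
# MPI processes; the k-point/band parallelism is used within each group
paral_rf 1
nppert   3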
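For GW calculations, the parallelization level is chosen with gwpara. To the best of our recollection, gwpara 1 distributes the work over k-points and gwpara 2 over bands/transitions; check the gwpara documentation of your ABINIT version before relying on this.

# Assumed meaning: parallelize the screening/self-energy computation over
# bands/transitions rather than over k-points
gwpara 2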

basic:

  • autoparal AUTOmatisation of the PARALlelism
  • paral_atom activate PARALelization over (paw) ATOMic sites
  • paral_kgb activate PARALelization over K-point, G-vectors and Bands
  • paral_rf Activate PARALlelization over Response Function perturbations

useful:

  • bandpp BAND Per Processor
  • chkparal CHecK whether the PARALelism is adequate
  • gpu_option GPU: OPTION to choose the implementation
  • gwpara GW PARAllelization level
  • max_ncpus MAXimum Number of CPUS
  • np_spkpt Number of Processors at the SPin and K-Point Level
  • npband Number of Processors at the BAND level
  • npfft Number of Processors at the FFT level
  • nphf Number of Processors for (Hartree)-Fock exact exchange
  • npimage Number of Processors at the IMAGE level
  • nppert Number of Processors at the PERTurbation level
  • npspinor Number of Processors at the SPINOR level

expert:

Selected Input Files

gpu_kokkos:

gpu_omp:

paral:

Tutorials