Gromacs-2021.3 (commonly used GPU queues)


Description:

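This page uses the Singularity container image /fs00/software/singularity-images/ngc_gromacs_2021.3.sif to run Gromacs energy minimization (em), equilibration (nvt, npt), and production simulation (md), and compares their performance on the public shared 100% GPU queues 722080tiib, 72rtxib, and 723090ib.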

Queue overview:

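| Queue | Nodes | CPUs per node | Memory per node (GB) | Avg. memory per core (GB) | CPU clock (GHz) | GPUs per node | Memory per GPU (GB) | Theoretical peak (TFLOPS) |
|---|---|---|---|---|---|---|---|---|
| 83a100ib | 1 | 64 | 512 | 8 | 2.6 | 8 | 40 | FP64: 82.92; FP32: n/a |
| 723090ib | 2 | 48 | 512 | 10.7 | 2.8 | 8 | 24 | FP64: 4.30; FP32: 569.28 |
| 722080tiib | 4 | 16 | 128 | 8.0 | 3.0 | 4 | 11 | FP64: 3.07; FP32: 215.17 |
| 72rtxib | 3 | 16 | 128 | 8.0 | 3.0 | 4 | 24 | FP64: 2.30; FP32: 195.74 |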
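Jobs on these queues request GPUs rather than cores, so it helps to know roughly how many CPU cores and how much memory come with each GPU. A minimal sketch that derives both from the table, assuming (this is not stated above) that a node's cores are split evenly among its GPUs; the function name `per_gpu_resources` is hypothetical:

```shell
#!/bin/bash
# Cores per GPU and memory per core from the queue table.
# Assumption: a node's cores are divided evenly among its GPUs.
per_gpu_resources() {
  local cores=$1 gpus=$2 mem_gb=$3
  # %d prints whole cores per GPU; memory per core to one decimal
  awk -v c="$cores" -v g="$gpus" -v m="$mem_gb" \
    'BEGIN { printf "%d cores/GPU, %.1f GB/core\n", c / g, m / c }'
}

# queue, CPUs per node, GPUs per node, memory per node (GB), copied from the table
while read -r queue cores gpus mem; do
  printf '%-11s %s\n' "$queue" "$(per_gpu_resources "$cores" "$gpus" "$mem")"
done <<'EOF'
83a100ib 64 8 512
723090ib 48 8 512
722080tiib 16 4 128
72rtxib 16 4 128
EOF
```

Under this assumption 72rtxib and 722080tiib get 4 cores per GPU and 723090ib gets 6, which is consistent with the per-GPU and per-core prices quoted in the conclusions (1.8/4 = 0.45, 1.2/4 = 0.3, 1.8/6 = 0.3 CNY). For 83a100ib the even split gives 8 cores per GPU (4.8/8 = 0.6 CNY/core/hour, not the quoted 0.3), so the actual allocation on that queue is worth checking.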

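An earlier test report on Gromacs-2021.3 simulated a 102808-atom system (464 residues, 9 nt DNA, 31709 SOL, 94 NA, 94 CL) for 50 ns with all interactions computed on the GPU. The queues ranked 83a100ib (above 250 ns/day) > 723090ib (above 220 ns/day) > 72rtxib (above 180 ns/day) > 722080tiib (above 170 ns/day). However, the 83a100ib and 723090ib queues show an NJOBS count above 80 most of the time, so the author usually avoids these two queues for the preparatory stages that precede the production simulation.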

File location:

/fs00/software/singularity-images/ngc_gromacs_2021.3.sif

Submission scripts:

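Energy minimization (em.lsf)

#!/bin/bash
#BSUB -q 72rtxib
#BSUB -gpu "num=1"
module load singularity/latest
# OpenMP threads per rank = number of slots LSF lists in $LSB_HOSTS
export OMP_NUM_THREADS=$(echo $LSB_HOSTS | awk '{print NF}')
# Run every gmx command inside the NGC GROMACS container with GPU access (--nv)
SINGULARITY="singularity run --nv /fs00/software/singularity-images/ngc_gromacs_2021.3.sif"
${SINGULARITY} gmx grompp -f minim.mdp -c 1aki_solv_ions.gro -p topol.top -o em.tpr
# Short-range nonbonded on the GPU; 2 thread-MPI ranks share the single GPU
${SINGULARITY} gmx mdrun -nb gpu -ntmpi 2 -deffnm em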
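The `OMP_NUM_THREADS` line works because LSF lists one host name per allocated slot in `$LSB_HOSTS`, so counting the words gives the slot count. The same logic as a small testable function (the function name and host names are hypothetical):

```shell
#!/bin/bash
# Count job slots the way the scripts above do: one word per slot in $LSB_HOSTS.
slot_count() {
  echo "$1" | awk '{print NF}'
}

# Hypothetical allocation of 4 slots on a node called gpu01
slot_count "gpu01 gpu01 gpu01 gpu01"   # -> 4
```

mdrun treats `OMP_NUM_THREADS` as the number of OpenMP threads per rank, so with `-ntmpi 2` the job starts roughly twice as many threads as it has slots. If mdrun warns about oversubscribing the CPU cores, lowering `-ntmpi` or passing `-ntomp` explicitly is worth trying.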

Equilibration (nvt)

#!/bin/bash
#BSUB -q 72rtxib
#BSUB -gpu "num=1"
module load singularity/latest
export OMP_NUM_THREADS=$(echo $LSB_HOSTS | awk '{print NF}')
SINGULARITY="singularity run --nv /fs00/software/singularity-images/ngc_gromacs_2021.3.sif"
# Start from the minimized structure; -r sets the position-restraint reference
${SINGULARITY} gmx grompp -f nvt.mdp -c em.gro -r em.gro -p topol.top -o nvt.tpr
${SINGULARITY} gmx mdrun -nb gpu -ntmpi 2 -deffnm nvt

Equilibration (npt)

#!/bin/bash
#BSUB -q 72rtxib
#BSUB -gpu "num=1"
module load singularity/latest
export OMP_NUM_THREADS=$(echo $LSB_HOSTS | awk '{print NF}')
SINGULARITY="singularity run --nv /fs00/software/singularity-images/ngc_gromacs_2021.3.sif"
# -t nvt.cpt continues from the NVT checkpoint (velocities and state)
${SINGULARITY} gmx grompp -f npt.mdp -c nvt.gro -r nvt.gro -t nvt.cpt -p topol.top -o npt.tpr
${SINGULARITY} gmx mdrun -nb gpu -ntmpi 2 -deffnm npt
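Each of these scripts consumes the previous stage's output (em.gro, then nvt.gro/nvt.cpt), so they can be queued together with LSF job dependencies instead of being submitted one by one. A minimal sketch, assuming the scripts are saved as em.lsf, nvt.lsf and npt.lsf (only em.lsf is named above; the other names, the function `submit_chain` and the `BSUB_CMD` dry-run override are hypothetical):

```shell
#!/bin/bash
# Queue stages as an LSF dependency chain: each job waits until the previous
# one has finished successfully (done(...) means exit status 0).
# BSUB_CMD can be set to "echo" for a dry run.
BSUB_CMD=${BSUB_CMD:-bsub}

submit_chain() {
  local prev="" stage
  for stage in "$@"; do
    if [ -z "$prev" ]; then
      $BSUB_CMD -J "$stage" < "$stage.lsf"
    else
      $BSUB_CMD -J "$stage" -w "done($prev)" < "$stage.lsf"
    fi
    prev=$stage
  done
}

# On the cluster:  submit_chain em nvt npt
```

Because the dependency refers to job names, the names should be unique among your pending and running jobs so that the condition is unambiguous.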

Production simulation (md)

#!/bin/bash
#BSUB -q 723090ib
#BSUB -gpu "num=1"
module load singularity/latest
export OMP_NUM_THREADS=$(echo $LSB_HOSTS | awk '{print NF}')
SINGULARITY="singularity run --nv /fs00/software/singularity-images/ngc_gromacs_2021.3.sif"
${SINGULARITY} gmx grompp -f md.mdp -c npt.gro -t npt.cpt -p topol.top -o md_0_1.tpr
# Offload nonbonded, bonded, PME (including FFT) and update/constraints to the GPU
${SINGULARITY} gmx mdrun -nb gpu -bonded gpu -update gpu -pme gpu -pmefft gpu -deffnm md_0_1
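The ns/day and hour/ns values in the md columns of the test tables appear to be the figures mdrun prints on the `Performance:` line near the end of its log. A sketch for reading them from md_0_1.log, assuming the standard log layout (ns/day first, then hour/ns); the function name is hypothetical:

```shell
#!/bin/bash
# Print "ns/day hour/ns" from the Performance line of an mdrun log file.
md_performance() {
  awk '/^Performance:/ { print $2, $3 }' "$1"
}

# After the md job finishes:  md_performance md_0_1.log
```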

Full workflow (pdb2gmx through md)

Alternatively, the following commands can be adapted and submitted as a single job script:

#!/bin/bash
#BSUB -q 723090ib
#BSUB -gpu "num=1"
module load singularity/latest
set -e  # abort at the first failing step
export OMP_NUM_THREADS=$(echo $LSB_HOSTS | awk '{print NF}')
SINGULARITY="singularity run --nv /fs00/software/singularity-images/ngc_gromacs_2021.3.sif"
# Menu choices are piped into the containerized gmx (echo runs on the host, gmx in the container).
# The numbers (force field 4, group 13, energy terms 10/16/18, group 4) depend on your system; check them interactively first.
echo 4 | ${SINGULARITY} gmx pdb2gmx -f protein.pdb -o protein_processed.gro -water tip3p -ignh -merge all
${SINGULARITY} gmx editconf -f protein_processed.gro -o pro_newbox.gro -c -d 1.0 -bt cubic
${SINGULARITY} gmx solvate -cp pro_newbox.gro -cs spc216.gro -o pro_solv.gro -p topol.top
${SINGULARITY} gmx grompp -f ../MDP/ions.mdp -c pro_solv.gro -p topol.top -o ions.tpr
echo 13 | ${SINGULARITY} gmx genion -s ions.tpr -o pro_solv_ions.gro -p topol.top -pname NA -nname CL -neutral
${SINGULARITY} gmx grompp -f ../MDP/minim.mdp -c pro_solv_ions.gro -p topol.top -o em.tpr
${SINGULARITY} gmx mdrun -v -deffnm em
echo 10 0 | ${SINGULARITY} gmx energy -f em.edr -o potential.xvg
${SINGULARITY} gmx grompp -f ../MDP/nvt.mdp -c em.gro -r em.gro -p topol.top -o nvt.tpr
${SINGULARITY} gmx mdrun -deffnm nvt
echo 16 0 | ${SINGULARITY} gmx energy -f nvt.edr -o temperature.xvg
${SINGULARITY} gmx grompp -f ../MDP/npt.mdp -c nvt.gro -r nvt.gro -t nvt.cpt -p topol.top -o npt.tpr
${SINGULARITY} gmx mdrun -deffnm npt
echo 18 0 | ${SINGULARITY} gmx energy -f npt.edr -o pressure.xvg
${SINGULARITY} gmx grompp -f ../MDP/md.mdp -c npt.gro -t npt.cpt -p topol.top -o md.tpr
${SINGULARITY} gmx mdrun -v -deffnm md
echo 4 4 | ${SINGULARITY} gmx rms -f md.xtc -s md.tpr -o rmsd.xvg
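The workflow writes potential.xvg, temperature.xvg and pressure.xvg to check relaxation and equilibration. A quick way to get an average without plotting, assuming the usual .xvg layout (header lines starting with `#` or `@`, time in column 1, the selected term in column 2); the function name `xvg_mean` is hypothetical:

```shell
#!/bin/bash
# Average one data column of a GROMACS .xvg file, skipping '#' and '@' header lines.
xvg_mean() {
  local file=$1 col=${2:-2}
  awk -v c="$col" '
    /^[#@]/ { next }                 # comments and plot-format lines
    NF >= c { sum += $c; n++ }       # data rows
    END { if (n == 0) exit 1; printf "%.3f\n", sum / n }
  ' "$file"
}

# After the job:  xvg_mean temperature.xvg
```

`gmx energy` also prints averages when it finishes; for equilibration checks, averaging only the later part of the run is often more informative.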

Software information:

GROMACS version:    2021.3-dev-20210818-11266ae-dirty-unknown
Precision:          mixed
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.9-sse2-avx-avx2-avx2_128-avx512
CUDA driver:        11.20
CUDA runtime:       11.40



Test case:

Atoms: 218234 (401 protein residues, 68414 SOL, 9 ion residues)

nsteps = 100000000 ; 200 ns
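The `; 200 ns` comment is nsteps × dt. The time step is not shown in this section; assuming dt = 0.002 ps (2 fs), the arithmetic is as follows (the function name is hypothetical):

```shell
#!/bin/bash
# Simulated length in ns from the .mdp values nsteps and dt (dt in ps).
# dt = 0.002 ps is an assumption; take the real value from md.mdp.
sim_ns() {
  awk -v n="$1" -v dt="$2" 'BEGIN { printf "%g\n", n * dt / 1000 }'
}

sim_ns 100000000 0.002   # 1e8 steps x 0.002 ps = 2e5 ps -> 200
```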

eScience Center GPU tests: energy minimization (em), equilibration (nvt, npt), and production simulation (md) each used 1 GPU.

| Task 1 | em | nvt | npt | md |
|:---:|:---:|:---:|:---:|:---:|
| Queue | 72rtxib | 722080tiib | 722080tiib | 723090ib |
| CPU time (s) | 1168.45 | 13960.33 | 42378.71 | |
| Run time (s) | 79 | 1648 | 5586 | 117.428 ns/day (0.204 hour/ns) |
| Turnaround time (s) | 197 | 1732 | 5661 | |

| Task 2 | em | nvt | npt | md |
|:---:|:---:|:---:|:---:|:---:|
| Queue | 72rtxib | 722080tiib | 722080tiib | 722080tiib |
| CPU time (s) | 1399.30 | 15732.66 | 40568.04 | |
| Run time (s) | 93 | 1905 | 5236 | 106.862 ns/day (0.225 hour/ns) |
| Turnaround time (s) | 181 | 1991 | 5479 | |

| Task 3 | em | nvt | npt | md |
|:---:|:---:|:---:|:---:|:---:|
| Queue | 72rtxib | 72rtxib | 72rtxib | 72rtxib |
| CPU time (s) | 1368.11 | 5422.49 | 5613.74 | |
| Run time (s) | 92 | 355 | 366 | 103.213 ns/day (0.233 hour/ns) |
| Turnaround time (s) | 180 | 451 | 451 | |

| Task 4 | em | nvt | npt | md |
|:---:|:---:|:---:|:---:|:---:|
| Queue | 72rtxib | 72rtxib | 72rtxib | 722080tiib |
| CPU time (s) | 1321.15 | 5441.60 | 5618.87 | |
| Run time (s) | 89 | 356 | 369 | 111.807 ns/day (0.215 hour/ns) |
| Turnaround time (s) | 266 | 440 | 435 | |

| Task 5 | em | nvt | npt | md |
|:---:|:---:|:---:|:---:|:---:|
| Queue | 72rtxib | 72rtxib | 72rtxib | 72rtxib |
| CPU time (s) | 1044.17 | 5422.94 | 5768.44 | |
| Run time (s) | 72 | 354 | 380 | 110.534 ns/day (0.217 hour/ns) |
| Turnaround time (s) | 162 | 440 | 431 | |

| Task 6 | em | nvt | npt | md |
|:---:|:---:|:---:|:---:|:---:|
| Queue | 723090ib | 723090ib | 723090ib | 723090ib |
| CPU time (s) | 1569.17 | 7133.74 | 6677.25 | |
| Run time (s) | 81 | 326 | 325 | 114.362 ns/day (0.210 hour/ns) |
| Turnaround time (s) | 75 | 320 | 300 | |

| Task 7 | em | nvt | npt | md |
|:---:|:---:|:---:|:---:|:---:|
| Queue | 723090ib | 723090ib | 723090ib | 722080tiib |
| CPU time (s) | 1970.56 | 5665.71 | 6841.73 | |
| Run time (s) | 91 | 253 | 327 | 111.409 ns/day (0.215 hour/ns) |
| Turnaround time (s) | 123 | 251 | 328 | |

| Task 8 | em | nvt | npt | md |
|:---:|:---:|:---:|:---:|:---:|
| Queue | 72rtxib | 72rtxib | 72rtxib | 72rtxib |
| CPU time (s) | 1234.24 | 5540.59 | 5528.91 | |
| Run time (s) | 108 | 363 | 370 | 114.570 ns/day (0.209 hour/ns) |
| Turnaround time (s) | 85 | 364 | 363 | |

| Task 9 | em | nvt | npt | md |
|:---:|:---:|:---:|:---:|:---:|
| Queue | 723090ib | 723090ib | 723090ib | 723090ib |
| CPU time (s) | 2016.10 | 7633.83 | 7983.58 | |
| Run time (s) | 93 | 342 | 361 | 115.695 ns/day (0.207 hour/ns) |
| Turnaround time (s) | 130 | 377 | 356 | |

| Task 10 | em | nvt | npt | md |
|:---:|:---:|:---:|:---:|:---:|
| Queue | 723090ib | 723090ib | 723090ib | 72rtxib |
| CPU time (s) | 1483.84 | 7025.65 | 7034.90 | |
| Run time (s) | 68 | 317 | 333 | 102.324 ns/day (0.235 hour/ns) |
| Turnaround time (s) | 70 | 319 | 316 | |
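The hour/ns values in the md column are simply 24 divided by ns/day, and the same numbers give the expected wall time of a run. A sketch with hypothetical helper names; queue waiting time is not included:

```shell
#!/bin/bash
# hour/ns from mdrun performance in ns/day
hours_per_ns() {
  awk -v r="$1" 'BEGIN { printf "%.3f\n", 24 / r }'
}
# Wall-clock hours for a run of length_ns at ns_per_day
wall_hours() {
  awk -v len="$1" -v r="$2" 'BEGIN { printf "%.2f\n", len * 24 / r }'
}

hours_per_ns 117.428     # Task 1 md -> 0.204
wall_hours 200 117.428   # 200 ns test case -> 40.88
```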


Conclusions:


Energy minimization (em): in the 72rtxib and 723090ib queues, Run time was 88.83 ± 12.45 and 83.25 ± 11.44 s, respectively;

Equilibration (nvt): in the 722080tiib, 72rtxib, and 723090ib queues, Run time was 1776.50 ± 181.73, 357.00 ± 4.08, and 309.50 ± 39.06 s, respectively;

Equilibration (npt): in the 722080tiib, 72rtxib, and 723090ib queues, Run time was 5411.00 ± 247.49, 371.25 ± 6.08, and 336.50 ± 16.68 s, respectively;

Production simulation (md), 200 ns of the 218234-atom system: performance in the 722080tiib, 72rtxib, and 723090ib queues differed little, at 110.03 ± 2.75, 107.66 ± 5.90, and 115.83 ± 1.54 ns/day, respectively.
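The ± values in these conclusions agree with the mean and sample standard deviation (n - 1 in the denominator) of the per-task values in the tables; for example, the 72rtxib nvt run times of Tasks 3, 4, 5 and 8 (355, 356, 354, 363 s) give 357.00 ± 4.08. A small helper to recompute such figures (the name `mean_sd` is hypothetical):

```shell
#!/bin/bash
# Mean and sample standard deviation (n - 1 denominator) of the arguments,
# printed as "mean sd" with two decimals.
mean_sd() {
  printf '%s\n' "$@" | awk '
    { x[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1
      mean = sum / NR
      for (i = 1; i <= NR; i++) ss += (x[i] - mean) ^ 2
      sd = (NR > 1) ? sqrt(ss / (NR - 1)) : 0
      printf "%.2f %.2f\n", mean, sd
    }'
}

mean_sd 355 356 354 363   # nvt on 72rtxib -> 357.00 4.08
mean_sd 325 327 361 333   # npt on 723090ib -> 336.50 16.68
```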

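In summary, for the energy minimization (em) and equilibration (nvt, npt) stages, **use the 72rtxib queue, which has fewer queued jobs**. For the production (md) stage, choose a queue **based on the number of queued jobs** (in the author's experience, queued jobs: 72rtxib < 722080tiib < 723090ib < 83a100ib) and on GPU pricing (for on-campus and Collaborative Innovation Center users: 72rtxib 1.8 CNY/GPU/hour = 0.45 CNY/core/hour; 722080tiib 1.2 CNY/GPU/hour = 0.3 CNY/core/hour; 723090ib 1.8 CNY/GPU/hour = 0.3 CNY/core/hour; 83a100ib 4.8 CNY/GPU/hour = 0.3 CNY/core/hour).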
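Speed and price can be combined into a rough cost per production run: wall hours = length (ns) × 24 / (ns/day), and charge = hours × GPUs × price per GPU-hour. A sketch with a hypothetical function name, using a per-GPU price such as those listed above; queue waiting time and the preparatory stages are ignored:

```shell
#!/bin/bash
# Estimated wall time and GPU charge (CNY) of an md run.
#   hours = length_ns * 24 / ns_per_day ; cost = hours * gpus * price_per_gpu_hour
md_cost() {
  local length_ns=$1 ns_per_day=$2 price=$3 gpus=${4:-1}
  awk -v l="$length_ns" -v r="$ns_per_day" -v p="$price" -v g="$gpus" \
    'BEGIN { h = l * 24 / r; printf "%.1f h, %.2f CNY\n", h, h * g * p }'
}

# 200 ns at a hypothetical 120 ns/day on a 1.2 CNY/GPU/hour queue
md_cost 200 120 1.2   # -> 40.0 h, 48.00 CNY
```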

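The submission scripts above do not deal with Gromacs parallel efficiency (**simply setting "num=4" does not make a job use 4 GPUs at once on the cluster**). For explanations, see http://bbs.keinsci.com/thread-13861-1-1.html and https://developer.nvidia.com/blog/creating-faster-molecular-dynamics-simulations-with-gromacs-2020/. In the experience of senior users, **two GPU cards are only worthwhile for systems above 500000 atoms**, because Gromacs gains little from multi-GPU parallelization. Amber's GPU acceleration is another option, but it requires a 3090 or Tesla A100 card. The Gromacs command for 4 GPUs is: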

gmx mdrun -deffnm $file.pdb.md -ntmpi 4 -ntomp 7 -npme 1 -nb gpu -pme gpu -bonded gpu -pmefft gpu -v
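The command above can be wrapped in an LSF job script. This is a sketch under several assumptions: the queue choice is illustrative, `$file` is kept as the placeholder from the command above, `-ntomp` is derived from the allocated slots instead of hard-coding 7 (4 × 7 = 28 threads exceeds the 16 cores of a whole 72rtxib node), and the GPU direct-communication environment variables are the ones described in the NVIDIA blog linked above for GROMACS 2020, so check that your version still honors them. The job part only runs inside an LSF job (when `LSB_JOBID` is set); the helper name `ntomp_per_rank` is hypothetical.

```shell
#!/bin/bash
#BSUB -q 723090ib
#BSUB -gpu "num=4"
# Sketch only: queue, file name and environment variables are assumptions.

# OpenMP threads per thread-MPI rank = allocated slots / ranks (at least 1)
ntomp_per_rank() {
  local slots=$1 ranks=$2
  local t=$(( slots / ranks ))
  echo $(( t > 0 ? t : 1 ))
}

if [ -n "${LSB_JOBID:-}" ]; then
  module load singularity/latest
  SINGULARITY="singularity run --nv /fs00/software/singularity-images/ngc_gromacs_2021.3.sif"
  : "${file:?set file to the system name used in -deffnm}"
  NTMPI=4
  NTOMP=$(ntomp_per_rank "$(echo $LSB_HOSTS | awk '{print NF}')" "$NTMPI")
  # GPU direct communication switches described in the NVIDIA GROMACS 2020 blog
  export GMX_GPU_DD_COMMS=true
  export GMX_GPU_PME_PP_COMMS=true
  export GMX_FORCE_UPDATE_DEFAULT_GPU=true
  ${SINGULARITY} gmx mdrun -deffnm "$file.pdb.md" -ntmpi "$NTMPI" -ntomp "$NTOMP" \
    -npme 1 -nb gpu -pme gpu -bonded gpu -pmefft gpu -v
fi
```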