ZJUSCT HPC101 Supercomputing Short Semester Lab 1

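Environment: a MacBook Pro with an Apple M1 Max, running Debian 11.7.0 ARM64 in Parallels Desktop virtual machines.

Lab guide

For reference only. Adjust the paths to match your own setup.

(I did not configure one VM and then clone it; I set up all four at the same time, so what follows ended up looking like a deployment script.)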


Building OpenMPI

cd
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.bz2
tar xvjf openmpi-4.1.5.tar.bz2
cd openmpi-4.1.5
./configure
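make -j
sudo make install   # installs to /usr/local, where the HPL makefile below expects mpicc and libmpi.so
sudo ldconfig       # refresh the shared-library cache so libmpi.so is found at run time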
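
Before moving on, it can help to confirm that the MPI wrappers are actually reachable, since the later steps call mpicc and mpirun. The following is only a small sketch; missing_tools is a helper name made up for this post, not part of Open MPI.

```shell
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper, not an Open MPI tool.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the two wrappers that the later steps rely on.
missing="$(missing_tools mpicc mpirun)"
if [ -n "$missing" ]; then
    echo "not found on PATH: $missing" >&2
fi
```

If this prints anything, the install step probably did not run, or /usr/local/bin is not on PATH.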

Building BLAS

cd
wget https://www.netlib.org/blas/blas-3.11.0.tgz
tar xvzf blas-3.11.0.tgz
cd BLAS-3.11.0
make -j

Building CBLAS

cd
wget https://www.netlib.org/blas/blast-forum/cblas.tgz
tar xvzf cblas.tgz
cd CBLAS

patch Makefile.in << EOF
25c25
< BLLIB = /Users/julie/Documents/Boulot/lapack-dev/lapack/trunk/blas_LINUX.a
---
> BLLIB = /home/45gfg9/BLAS-3.11.0/blas_LINUX.a
EOF

patch testing/c_sblat1.f << EOF
214c214
< CALL STEST1(SNRM2TEST(N,SX,INCX),STEMP,STEMP,SFAC)
---
> CALL STEST1(SNRM2TEST(N,SX,INCX),STEMP(1),STEMP,SFAC)
218c218
< CALL STEST1(SASUMTEST(N,SX,INCX),STEMP,STEMP,SFAC)
---
> CALL STEST1(SASUMTEST(N,SX,INCX),STEMP(1),STEMP,SFAC)
EOF

patch testing/c_dblat1.f << EOF
214c214
< CALL STEST1(DNRM2TEST(N,SX,INCX),STEMP,STEMP,SFAC)
---
> CALL STEST1(DNRM2TEST(N,SX,INCX),STEMP(1),STEMP,SFAC)
218c218
< CALL STEST1(DASUMTEST(N,SX,INCX),STEMP,STEMP,SFAC)
---
> CALL STEST1(DASUMTEST(N,SX,INCX),STEMP(1),STEMP,SFAC)
EOF

make -j
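
The HPL makefile in the next step links against the two archives built so far, so it is worth checking that they exist before going further. A minimal sketch, assuming the same home-directory layout as the commands above; missing_files is a hypothetical helper name.

```shell
# Print every path in the argument list that does not exist or is empty.
# (missing_files is a hypothetical helper introduced for illustration.)
missing_files() {
    for f in "$@"; do
        [ -s "$f" ] || printf '%s\n' "$f"
    done
}

# The archives produced by the BLAS and CBLAS builds above.
missing_files \
    "$HOME/BLAS-3.11.0/blas_LINUX.a" \
    "$HOME/CBLAS/lib/cblas_LINUX.a"
```

No output means both archives are in place; any path that is printed points at a build that did not finish.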

Building HPL

Linking fails here unless libgfortran.so is specified explicitly (details near the end).

cd
wget https://netlib.org/benchmark/hpl/hpl-2.3.tar.gz
tar xvzf hpl-2.3.tar.gz
cd hpl-2.3

patch setup/Make.Linux_ATHLON_CBLAS -o Make.Linux_AArch64 << 'EOF'
64c64
< ARCH = Linux_ATHLON_CBLAS
---
> ARCH = Linux_AArch64
70c70
< TOPdir = $(HOME)/hpl
---
> TOPdir = $(HOME)/hpl-2.3
84,86c84,86
< MPdir = /usr/local/mpi
< MPinc = -I$(MPdir)/include
< MPlib = $(MPdir)/lib/libmpich.a
---
> MPdir = /usr/local
> MPinc = -I$(MPdir)/include/openmpi
> MPlib = $(MPdir)/lib/libmpi.so
95c95
< LAdir = $(HOME)/netlib/ARCHIVES/Linux_ATHLON
---
> LAdir = $(HOME)
97c97
< LAlib = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
---
> LAlib = $(LAdir)/CBLAS/lib/cblas_LINUX.a $(LAdir)/BLAS-3.11.0/blas_LINUX.a
145c145
< HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
---
> HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib) /usr/lib/gcc/aarch64-linux-gnu/10/libgfortran.so
169c169
< CC = /usr/bin/gcc
---
> CC = /usr/local/bin/mpicc
173c173
< LINKER = /usr/bin/gcc
---
> LINKER = $(CC)
EOF

patch Make.top << EOF
47c47
< arch = UNKNOWN
---
> arch = Linux_AArch64
EOF

make -j arch=Linux_AArch64

Generate an SSH key, copy it to the other nodes, and create the MPI hostfile

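mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -P ""   # default identity path, so ssh-copy-id and ssh find the key without -i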
ssh-copy-id Debian-Cluster-A.local
ssh-copy-id Debian-Cluster-B.local
ssh-copy-id Debian-Cluster-C.local
ssh-copy-id Debian-Cluster-D.local

cat > ~/cluster_hostfile << EOF
Debian-Cluster-A.local slots=2
Debian-Cluster-B.local slots=2
Debian-Cluster-C.local slots=2
Debian-Cluster-D.local slots=2
EOF

mpirun --hostfile ~/cluster_hostfile uname -a
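
As far as I understand Open MPI's behaviour, mpirun without -np starts one process per slot, so the slots= values in the hostfile decide how many ranks run. The sketch below adds them up; total_slots is a hypothetical helper, and it deliberately ignores hosts that have no slots= entry, because Open MPI works out their slot count by itself.

```shell
# Sum the slots=N values in an Open MPI hostfile.
# Hosts without a slots= entry are skipped (assumption: Open MPI
# determines their slot count on its own, so it cannot be read here).
total_slots() {
    awk '
        $1 ~ /^#/ { next }    # skip comment lines
        {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=[0-9]+$/) { split($i, kv, "="); total += kv[2] }
        }
        END { print total + 0 }
    ' "$1"
}

# Only run against the real hostfile if it exists.
if [ -f "$HOME/cluster_hostfile" ]; then
    total_slots "$HOME/cluster_hostfile"
fi
```

For the four-node hostfile above, each with slots=2, this prints 8.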

Running xhpl

cd ~/hpl-2.3/bin/Linux_AArch64
mpirun --hostfile ~/cluster_hostfile ./xhpl

FAQ

Error when building CBLAS:

c_sblat1.f:218:48:

218 | CALL STEST1(SASUMTEST(N,SX,INCX),STEMP,STEMP,SFAC)
| 1
Error: Rank mismatch in argument ‘strue1’ at (1) (scalar and rank-1)

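Lines 214 and 218 fail with the same error. Change the STEMP that the `1` marker points at (the second argument) to STEMP(1). I don't know Fortran, but this does look like a rank mismatch between arguments: the call passes the array STEMP where STEST1 expects a scalar, and gfortran 10 and later treat such mismatches as errors by default. c_dblat1.f needs the same fix. See the patch above.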

Error when building HPL:

/usr/bin/ld: /home/45gfg9/CBLAS/lib/cblas_LINUX.a(cblas_dtrsv.o): in function `cblas_dtrsv':
cblas_dtrsv.c:(.text+0x130): undefined reference to `dtrsv_'
/usr/bin/ld: /home/45gfg9/CBLAS/lib/cblas_LINUX.a(cblas_dgemm.o): in function `cblas_dgemm':
cblas_dgemm.c:(.text+0xd0): undefined reference to `dgemm_'
/usr/bin/ld: cblas_dgemm.c:(.text+0x1ac): undefined reference to `dgemm_'
/usr/bin/ld: /home/45gfg9/CBLAS/lib/cblas_LINUX.a(cblas_dtrsm.o): in function `cblas_dtrsm':
cblas_dtrsm.c:(.text+0x1b0): undefined reference to `dtrsm_'
collect2: error: ld returned 1 exit status

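CBLAS cannot find BLAS. When setting LAlib in the HPL makefile, make sure blas_LINUX.a is listed, and that it comes after cblas_LINUX.a: the linker resolves static archives from left to right, so the library that provides dgemm_ and friends has to follow the one that calls them. See the patch above.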
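
The ordering rule can be checked mechanically. This is just a sketch that does a string comparison on the LAlib value; cblas_before_blas is a made-up helper name, and it assumes the archive file names used in this post.

```shell
# Return 0 if cblas_LINUX.a is listed before blas_LINUX.a in the given string.
# (cblas_before_blas is a hypothetical helper; it only compares text.)
cblas_before_blas() {
    case "$1" in
        */cblas_LINUX.a*/blas_LINUX.a*) return 0 ;;
        *) return 1 ;;
    esac
}

# The LAlib value from the HPL patch, single-quoted so $(LAdir) stays literal.
lalib='$(LAdir)/CBLAS/lib/cblas_LINUX.a $(LAdir)/BLAS-3.11.0/blas_LINUX.a'
if cblas_before_blas "$lalib"; then
    echo "link order looks right"
else
    echo "blas_LINUX.a must come after cblas_LINUX.a" >&2
fi
```

It does not inspect the archives themselves, so it only catches a missing or swapped entry in the makefile.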

Error when building HPL:

/usr/bin/ld: /home/45gfg9/BLAS-3.11.0/blas_LINUX.a(xerbla.o): in function `xerbla_':
xerbla.f:(.text+0x54): undefined reference to `_gfortran_st_write'
/usr/bin/ld: xerbla.f:(.text+0x60): undefined reference to `_gfortran_string_len_trim'
/usr/bin/ld: xerbla.f:(.text+0x7c): undefined reference to `_gfortran_transfer_character_write'
/usr/bin/ld: xerbla.f:(.text+0x8c): undefined reference to `_gfortran_transfer_integer_write'
/usr/bin/ld: xerbla.f:(.text+0x94): undefined reference to `_gfortran_st_write_done'
/usr/bin/ld: xerbla.f:(.text+0xa4): undefined reference to `_gfortran_stop_string'
collect2: error: ld returned 1 exit status

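libgfortran cannot be found. (Why? Most likely because the link step is driven by mpicc, a C compiler wrapper, which unlike gfortran does not add the Fortran runtime library on its own.)

Make sure gfortran is installed (it has to be, since building CBLAS needs it). Locate libgfortran on the system: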

find /usr -name 'libgfortran.so*'

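Pick one of the .so files and add it at the corresponding place in the makefile. Any of them will do, because if several turn up they are most likely the same file.
If you pick the .a instead, you get: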

/usr/bin/ld: /usr/lib/gcc/aarch64-linux-gnu/10/libgfortran.a(fpu.o): undefined reference to symbol 'feraiseexcept@@GLIBC_2.17'
/usr/bin/ld: /lib/aarch64-linux-gnu/libm.so.6: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status

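glibc is territory I don't understand. The message does suggest that the static libgfortran.a needs feraiseexcept from libm, and libm is not on the link line. In any case, once again, see the patch above.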
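
To check the guess that several libgfortran.so hits from find are really one file, the paths can be resolved through their symlinks. A minimal sketch using readlink -f from GNU coreutils; real_targets is a hypothetical helper name.

```shell
# Resolve each argument through its symlinks and print the distinct targets.
# (real_targets is a hypothetical helper introduced for illustration.)
real_targets() {
    for f in "$@"; do
        readlink -f "$f"
    done | sort -u
}

# Resolve every libgfortran.so* under /usr; prints nothing if none exist.
find /usr -name 'libgfortran.so*' 2>/dev/null | while IFS= read -r so; do
    real_targets "$so"
done | sort -u
```

If only one path comes out, every candidate is the same library and it does not matter which one goes into the makefile.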


https://heap.45gfg9.net/t/ZJU/2023-HPC101/43cc89f4f5ad/
Author: 45gfg9
Published: 2023-05-29