[Ubuntu] Installing and running DCGM

peanut0613 2022. 7. 4. 17:01

DCGM is a way to collect GPU hardware features without relying on nvidia-smi. It exposes features that nvidia-smi does not.

AMI used on AWS: Deep Learning AMI GPU CUDA 11.4.1 (Ubuntu 18.04) 20211204

ubuntu 18.04
cuda 11.4
python3.7

< Installation >

Follow the instructions at https://developer.nvidia.com/dcgm as written.

Ubuntu LTS
Set up the CUDA network repository meta-data, GPG key. The example shown below is for Ubuntu 20.04 on x86_64:
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"

Install DCGM
$ sudo apt-get update \
&& sudo apt-get install -y datacenter-gpu-manager

Set up the DCGM service
$ sudo systemctl --now enable nvidia-dcgm

However, I am running on Ubuntu 18.04, so I changed the links slightly.

I browsed https://developer.download.nvidia.com/compute/cuda/repos/, found the files for Ubuntu 18.04, and used those links instead.
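The link substitution described above could be scripted roughly as follows. This is only a sketch: the helper names `cuda_repo_distro` and `cuda_keyring_url` are made up for illustration, and the keyring file name is simply the one used in this post, so a newer release may need a different file.

```shell
#!/bin/sh
# Map an Ubuntu VERSION_ID such as "18.04" to the directory name used under
# https://developer.download.nvidia.com/compute/cuda/repos/ (e.g. "ubuntu1804").
cuda_repo_distro() {
    printf 'ubuntu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Build the keyring URL for a given Ubuntu version on x86_64.
# Assumption: the keyring file is still named cuda-keyring_1.0-1_all.deb.
cuda_keyring_url() {
    printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/x86_64/cuda-keyring_1.0-1_all.deb\n' \
        "$(cuda_repo_distro "$1")"
}

# Example usage (not run automatically):
#   . /etc/os-release                      # defines VERSION_ID, e.g. 18.04
#   wget "$(cuda_keyring_url "$VERSION_ID")"
```

Reading the version from /etc/os-release avoids hand-editing the URL, but it is still worth checking in a browser that the resulting directory actually exists.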

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-keyring_1.0-1_all.deb

sudo dpkg -i cuda-keyring_1.0-1_all.deb

** Error (run the sudo lsof command below and kill the listed PID, repeating until lsof prints nothing, then retry dpkg)

dpkg: error: dpkg frontend is locked by another process

> sudo lsof /var/lib/dpkg/lock-frontend
> sudo kill -9 <PID>
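A gentler alternative to kill -9 is to wait until the lock is released, since on a freshly booted instance it is often held by a background apt job that finishes on its own. A minimal sketch follows; the function name `wait_for_dpkg_lock` and the `LSOF_CMD` override are hypothetical, and the default check is the same sudo lsof call shown above.

```shell
#!/bin/sh
# Wait until no process holds the dpkg frontend lock.
# $1: lock file (default /var/lib/dpkg/lock-frontend)
# $2: maximum number of seconds to wait (default 300)
# LSOF_CMD is a hypothetical override hook; by default "sudo lsof" is used.
wait_for_dpkg_lock() {
    lock_file="${1:-/var/lib/dpkg/lock-frontend}"
    max_wait="${2:-300}"
    waited=0
    # lsof exits with status 0 while some process still has the file open.
    while ${LSOF_CMD:-sudo lsof} "$lock_file" >/dev/null 2>&1; do
        if [ "$waited" -ge "$max_wait" ]; then
            echo "lock still held after ${max_wait}s: $lock_file" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example usage (not run automatically):
#   wait_for_dpkg_lock && sudo dpkg -i cuda-keyring_1.0-1_all.deb
```

If the function times out, killing the process as shown above is still available as a last resort.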

sudo add-apt-repository "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"

** Error

N: Updating from such a repository can't be done securely, and is therefore disabled by default.

N: See apt-secure(8) manpage for repository creation and user configuration details.

Try changing https to http in the repository URL.
This error occurs if the immediately preceding command was not run.

sudo apt-get update && sudo apt-get install -y datacenter-gpu-manager

sudo systemctl --now enable nvidia-dcgm

< Commands >

For the commands, see https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
A group has to be created before running them, otherwise errors occur.
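The group creation step itself is not listed among the commands in this post. Based on the linked user guide, it could look roughly like the sketch below. The wrapper names `dcgm_create_group` and `dcgm_add_gpus`, the `DCGMI` override variable, and the group name `my_gpus` are assumptions for illustration. The group ID passed to the second call is the one printed by the first.

```shell
#!/bin/sh
# DCGMI is a hypothetical override so the helpers can be dry-run,
# e.g. with DCGMI=echo; by default the real dcgmi binary is called.

# Create an empty group; dcgmi prints the new group's ID on success.
dcgm_create_group() {
    ${DCGMI:-dcgmi} group -c "$1"
}

# Add GPUs (comma-separated IDs, e.g. 0,1) to an existing group.
dcgm_add_gpus() {
    ${DCGMI:-dcgmi} group -g "$1" -a "$2"
}

# Example usage (not run automatically):
#   dcgm_create_group my_gpus     # note the group ID it reports
#   dcgm_add_gpus 1 0             # assuming the reported ID was 1
```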

dcgmi group -l

dcgmi discovery -l

dcgmi group -g 1 -i

dcgmi profile -l -i 0

dcgmi dmon -e 1001,1004,1005

dcgmi dmon -e 1001,1004,1005 > dcgmi-log.csv  # every second, sample the features for the given field IDs and save them to dcgmi-log.csv
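As far as I know, field IDs 1001, 1004 and 1005 correspond to graphics engine activity, tensor pipe activity and DRAM activity. Redirecting dmon writes its whitespace-aligned table as is, so the file is not comma-separated even with a .csv name. The sketch below converts it, assuming each data row starts with "GPU <id>" followed by the values; that layout is an assumption about the dmon output and may differ between DCGM versions. The function name `dmon_to_csv` and the column names in the usage example are hypothetical.

```shell
#!/bin/sh
# Convert dcgmi dmon output (read from stdin) to comma-separated rows.
# $1: optional CSV header line to print first.
# Assumption: data rows look like "GPU 0   0.000   0.000   0.012";
# all other lines (column titles, "ID", blanks) are skipped.
dmon_to_csv() {
    header="${1:-}"
    if [ -n "$header" ]; then
        printf '%s\n' "$header"
    fi
    awk '
        $1 == "GPU" {
            line = $2                      # GPU id
            for (i = 3; i <= NF; i++)      # remaining metric values
                line = line "," $i
            print line
        }
    '
}

# Example usage (not run automatically); -c 10 stops after 10 samples:
#   dcgmi dmon -e 1001,1004,1005 -c 10 \
#     | dmon_to_csv "gpu,gr_engine_active,tensor_active,dram_active" > dcgmi-log.csv
```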
