AMD ROCm System Management Interface (SMI) Input Plugin

This plugin queries the rocm-smi binary to pull GPU statistics, including memory and GPU usage, temperatures, and more.

Global configuration options

In addition to the plugin-specific configuration settings, plugins support additional global and plugin configuration settings. These settings are used to modify metrics, tags, and fields, or to create aliases and configure ordering, etc. See CONFIGURATION.md for more details.

Configuration

# Query statistics from AMD Graphics cards using rocm-smi binary
[[inputs.amd_rocm_smi]]
  ## Optional: path to rocm-smi binary, defaults to $PATH via exec.LookPath
  # bin_path = "/opt/rocm/bin/rocm-smi"

  ## Optional: timeout for GPU polling
  # timeout = "5s"
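
The two options above can be illustrated with a minimal Go sketch: use `bin_path` when it is set, otherwise search `$PATH` with `exec.LookPath`, and bound each poll with the timeout. The helper names `resolveBinary` and `runWithTimeout` are hypothetical and are not taken from the plugin source; the flags passed in `main` are a small subset of those listed under Troubleshooting.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// resolveBinary returns binPath unchanged when it is set; otherwise it
// searches $PATH for defaultName. Hypothetical helper for illustration.
func resolveBinary(binPath, defaultName string) (string, error) {
	if binPath != "" {
		return binPath, nil
	}
	return exec.LookPath(defaultName)
}

// runWithTimeout runs the binary and kills it if it does not finish
// within the given timeout. Hypothetical helper for illustration.
func runWithTimeout(bin string, timeout time.Duration, args ...string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return exec.CommandContext(ctx, bin, args...).Output()
}

func main() {
	// An empty bin_path means "look the binary up in $PATH".
	bin, err := resolveBinary("", "rocm-smi")
	if err != nil {
		fmt.Println("rocm-smi not found in $PATH:", err)
		return
	}
	out, err := runWithTimeout(bin, 5*time.Second, "--showuniqueid", "--json")
	if err != nil {
		fmt.Println("polling failed or timed out:", err)
		return
	}
	fmt.Printf("read %d bytes from %s\n", len(out), bin)
}
```

On a machine without ROCm installed, the sketch simply reports that the binary was not found.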

Metrics

  • measurement: amd_rocm_smi
    • tags

      • name (entry name assigned by rocm-smi executable)
      • gpu_id (id of the GPU according to rocm-smi)
      • gpu_unique_id (unique id of the GPU)
    • fields

      • driver_version (integer)
      • fan_speed (integer)
      • memory_total (integer, B)
      • memory_used (integer, B)
      • memory_free (integer, B)
      • temperature_sensor_edge (float, Celsius)
      • temperature_sensor_junction (float, Celsius)
      • temperature_sensor_memory (float, Celsius)
      • utilization_gpu (integer, percentage)
      • utilization_memory (integer, percentage)
      • clocks_current_sm (integer, MHz)
      • clocks_current_memory (integer, MHz)
      • power_draw (float, Watt)

Troubleshooting

Check the full output by running the rocm-smi binary manually.

Linux:

rocm-smi -o -l -m -M -g -c -t -u -i -f -p -P -s -S -v --showreplaycount --showpids --showdriverversion --showmemvendor --showfwinfo --showproductname --showserial --showuniqueid --showbus --showpendingpages --showpagesinfo --showretiredpages --showunreservablepages --showmemuse --showvoltage --showtopo --showtopoweight --showtopohops --showtopotype --showtoponuma --showmeminfo all --json

Please include the output of this command, together with the ROCm version, when opening a GitHub issue.

Example Output

amd_rocm_smi,gpu_id=0x6861,gpu_unique_id=0x2150e7d042a1124,host=ali47xl,name=card0 clocks_current_memory=167i,clocks_current_sm=852i,driver_version=51114i,fan_speed=14i,memory_free=17145282560i,memory_total=17163091968i,memory_used=17809408i,power_draw=7,temperature_sensor_edge=28,temperature_sensor_junction=29,temperature_sensor_memory=92,utilization_gpu=0i 1630572551000000000
amd_rocm_smi,gpu_id=0x6861,gpu_unique_id=0x2150e7d042a1124,host=ali47xl,name=card0 clocks_current_memory=167i,clocks_current_sm=852i,driver_version=51114i,fan_speed=14i,memory_free=17145282560i,memory_total=17163091968i,memory_used=17809408i,power_draw=7,temperature_sensor_edge=29,temperature_sensor_junction=30,temperature_sensor_memory=91,utilization_gpu=0i 1630572701000000000
amd_rocm_smi,gpu_id=0x6861,gpu_unique_id=0x2150e7d042a1124,host=ali47xl,name=card0 clocks_current_memory=167i,clocks_current_sm=852i,driver_version=51114i,fan_speed=14i,memory_free=17145282560i,memory_total=17163091968i,memory_used=17809408i,power_draw=7,temperature_sensor_edge=29,temperature_sensor_junction=29,temperature_sensor_memory=92,utilization_gpu=0i 1630572749000000000

Limitations and notices

Please note that this plugin has been developed and tested on a limited number of versions and a small set of GPUs. Currently, the latest ROCm version tested is 4.3.0. Depending on the device and driver versions, the amount of information provided by rocm-smi can vary, so some fields may start or stop appearing in the metrics after updates. The rocm-smi JSON output is not perfectly homogeneous and may change in the future, so parsing and unmarshaling can start failing after a ROCm update.
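
One way to cope with fields that come and go is to decode the JSON into generic maps and emit a field only when its key is present and parses cleanly. The Go sketch below illustrates the idea; it is not the plugin's implementation. The JSON shape (an object keyed by card name whose values are strings), the key `Temperature (Sensor edge) (C)`, and the helper names `parseCards` and `floatField` are all assumptions made for illustration, and the sample payload is invented rather than real rocm-smi output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// parseCards decodes the payload into one generic map per card, so that
// unknown or missing keys do not cause the decode to fail.
func parseCards(data []byte) (map[string]map[string]interface{}, error) {
	var cards map[string]map[string]interface{}
	if err := json.Unmarshal(data, &cards); err != nil {
		return nil, err
	}
	return cards, nil
}

// floatField returns the value and true only when key exists, holds a
// string, and that string parses as a float; otherwise it returns false.
func floatField(card map[string]interface{}, key string) (float64, bool) {
	raw, ok := card[key].(string)
	if !ok {
		return 0, false
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	// Illustrative payload only; real rocm-smi output may differ.
	sample := []byte(`{"card0": {"Temperature (Sensor edge) (C)": "28.0"}}`)

	cards, err := parseCards(sample)
	if err != nil {
		fmt.Println("unmarshal failed:", err)
		return
	}
	for name, card := range cards {
		fields := map[string]interface{}{}
		if v, ok := floatField(card, "Temperature (Sensor edge) (C)"); ok {
			fields["temperature_sensor_edge"] = v
		}
		// A key that is absent is skipped instead of failing the card.
		if v, ok := floatField(card, "Temperature (Sensor junction) (C)"); ok {
			fields["temperature_sensor_junction"] = v
		}
		fmt.Println(name, fields)
	}
}
```

The trade-off of this approach is that a renamed key silently drops a field rather than raising an error, so comparing the emitted fields against the list under Metrics after a ROCm update remains worthwhile.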

Inspired by the current state of the art of the nvidia-smi plugin.