# AMD ROCm System Management Interface (SMI) Input Plugin

This plugin queries the [`rocm-smi`][1] binary to collect GPU statistics,
including memory and GPU usage, temperatures, and other metrics.

[1]: https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/master/python_smi_tools

## Global configuration options <!-- @/docs/includes/plugin_config.md -->

In addition to the plugin-specific configuration settings, plugins support
additional global and plugin configuration settings. These settings are used to
modify metrics, tags, and fields, create aliases, and configure ordering, etc.
See [CONFIGURATION.md][CONFIGURATION.md] for more details.

[CONFIGURATION.md]: ../../../docs/CONFIGURATION.md#plugins

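As an illustration, a few of these common settings could be applied to this
plugin as sketched below. The option names (`interval`, `alias`, `tagexclude`,
and the `tags` sub-table) come from Telegraf's general plugin configuration; the
values are placeholders chosen for this example, not recommendations.

```toml
[[inputs.amd_rocm_smi]]
  ## Poll this plugin less often than the agent default (example value)
  interval = "30s"
  ## Name this plugin instance in log messages (hypothetical name)
  alias = "rocm_gpus"
  ## Drop a tag that is not needed downstream (example choice)
  tagexclude = ["gpu_unique_id"]
  ## Attach a static tag to every metric (hypothetical key and value)
  [inputs.amd_rocm_smi.tags]
    rack = "a1"
```
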
## Startup error behavior options

In addition to the plugin-specific and global configuration settings, the
plugin supports options for specifying the behavior when experiencing startup
errors using the `startup_error_behavior` setting. Available values are:

- `error`: Telegraf will stop and exit in case of startup errors. This is the
  default behavior.
- `ignore`: Telegraf will ignore startup errors for this plugin and disable it,
  but continue processing for all other plugins.
- `retry`: NOT AVAILABLE

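For example, on hosts where this plugin may fail to start, the setting could be
used as in the minimal sketch below. It assumes the option is placed alongside
the other plugin settings; the chosen value is only one of the options listed
above.

```toml
[[inputs.amd_rocm_smi]]
  ## Disable this plugin on startup errors instead of stopping Telegraf
  startup_error_behavior = "ignore"
```
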
## Configuration

```toml @sample.conf
# Query statistics from AMD Graphics cards using rocm-smi binary
[[inputs.amd_rocm_smi]]
  ## Optional: path to rocm-smi binary, defaults to $PATH via exec.LookPath
  # bin_path = "/opt/rocm/bin/rocm-smi"

  ## Optional: timeout for GPU polling
  # timeout = "5s"
```

## Metrics

- measurement: `amd_rocm_smi`
  - tags
    - `name` (entry name assigned by rocm-smi executable)
    - `gpu_id` (id of the GPU according to rocm-smi)
    - `gpu_unique_id` (unique id of the GPU)

  - fields
    - `driver_version` (integer)
    - `fan_speed` (integer)
    - `memory_total` (integer, B)
    - `memory_used` (integer, B)
    - `memory_free` (integer, B)
    - `temperature_sensor_edge` (float, Celsius)
    - `temperature_sensor_junction` (float, Celsius)
    - `temperature_sensor_memory` (float, Celsius)
    - `utilization_gpu` (integer, percentage)
    - `utilization_memory` (integer, percentage)
    - `clocks_current_sm` (integer, MHz)
    - `clocks_current_memory` (integer, MHz)
    - `power_draw` (float, Watt)

## Troubleshooting

Check the full output by running the `rocm-smi` binary manually.

Linux:

```sh
rocm-smi -o -l -m -M -g -c -t -u -i -f -p -P -s -S -v --showreplaycount --showpids --showdriverversion --showmemvendor --showfwinfo --showproductname --showserial --showuniqueid --showbus --showpendingpages --showpagesinfo --showretiredpages --showunreservablepages --showmemuse --showvoltage --showtopo --showtopoweight --showtopohops --showtopotype --showtoponuma --showmeminfo all --json
```

Please include the output of this command, together with the ROCm version, when
opening a GitHub issue.

## Example Output

```text
amd_rocm_smi,gpu_id=0x6861,gpu_unique_id=0x2150e7d042a1124,host=ali47xl,name=card0 clocks_current_memory=167i,clocks_current_sm=852i,driver_version=51114i,fan_speed=14i,memory_free=17145282560i,memory_total=17163091968i,memory_used=17809408i,power_draw=7,temperature_sensor_edge=28,temperature_sensor_junction=29,temperature_sensor_memory=92,utilization_gpu=0i 1630572551000000000
amd_rocm_smi,gpu_id=0x6861,gpu_unique_id=0x2150e7d042a1124,host=ali47xl,name=card0 clocks_current_memory=167i,clocks_current_sm=852i,driver_version=51114i,fan_speed=14i,memory_free=17145282560i,memory_total=17163091968i,memory_used=17809408i,power_draw=7,temperature_sensor_edge=29,temperature_sensor_junction=30,temperature_sensor_memory=91,utilization_gpu=0i 1630572701000000000
amd_rocm_smi,gpu_id=0x6861,gpu_unique_id=0x2150e7d042a1124,host=ali47xl,name=card0 clocks_current_memory=167i,clocks_current_sm=852i,driver_version=51114i,fan_speed=14i,memory_free=17145282560i,memory_total=17163091968i,memory_used=17809408i,power_draw=7,temperature_sensor_edge=29,temperature_sensor_junction=29,temperature_sensor_memory=92,utilization_gpu=0i 1630572749000000000
```

## Limitations and notices

Please note that this plugin has been developed and tested on a limited number
of versions and a small set of GPUs. Currently the latest ROCm version tested is
4.3.0. Depending on the device and driver versions, the amount of information
provided by `rocm-smi` can vary, so some fields may start or stop appearing in
the metrics upon updates. The `rocm-smi` JSON output is not perfectly
homogeneous and may change in the future, hence parsing and unmarshalling can
start failing after updating ROCm.

Inspired by the current state of the art of the `nvidia-smi` plugin.