# llama.cpp Observability Solution

This document explains the observability solution implemented for llama.cpp when used with Docker Model Runner.

## Overview

The llama.cpp observability solution provides comprehensive monitoring of the llama.cpp inference engine, which Docker Model Runner uses to execute and serve LLM models. This solution allows you to:

- Monitor key performance metrics of llama.cpp
- Track resource utilization
- Identify performance bottlenecks
- Optimize model inference

## Architecture

The observability solution consists of the following components:

1. **llama.cpp Metrics Exporter**: A standalone Go service that collects metrics from the llama.cpp API and exposes them in Prometheus format
2. **Prometheus**: Collects and stores the metrics
3. **Grafana Dashboard**: Visualizes the metrics in a comprehensive dashboard

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  llama.cpp  │ --> │   Metrics   │ --> │ Prometheus  │
│  (Docker    │     │  Exporter   │     │             │
│ Model Runner│     │             │     │             │
└─────────────┘     └─────────────┘     └─────────────┘
       |                                       |
       |                                       v
       |                                ┌─────────────┐
       └------------------------------> │   Grafana   │
                                        │  Dashboard  │
                                        └─────────────┘
```

## Metrics Collected

The following metrics are collected from llama.cpp:

### Performance Metrics
- `llamacpp_tokens_per_second`: Token generation speed
- `llamacpp_batch_latency_seconds`: Batch processing latency
- `llamacpp_first_token_latency_seconds`: Time to first token

### Resource Utilization
- `llamacpp_memory_usage_bytes`: Memory usage
- `llamacpp_total_memory_bytes`: Total available memory
- `llamacpp_cpu_utilization_percent`: CPU utilization
- `llamacpp_gpu_utilization_percent`: GPU utilization (if available)
- `llamacpp_temperature_celsius`: GPU temperature (if available)

### Model Metrics
- `llamacpp_model_size_bytes`: Model size in bytes
- `llamacpp_model_parameters`: Number of model parameters
- `llamacpp_context_size_tokens`: Current context size
- `llamacpp_max_context_size_tokens`: Maximum context size

### KV Cache Metrics
- `llamacpp_kv_cache_usage_bytes`: KV cache memory usage
- `llamacpp_kv_cache_limit_bytes`: KV cache memory limit

### System Metrics
- `llamacpp_thread_count`: Number of threads used
- `llamacpp_status`: Current status (idle, loading, running)
- `llamacpp_batch_size`: Current batch size
- `llamacpp_optimal_batch_size`: Optimal batch size

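The exporter publishes these series in the standard Prometheus text exposition format on its `/metrics` endpoint. The snippet below is illustrative only; the exact HELP strings, label set (the `model` label shown here is an assumption), and values will vary:

```
# HELP llamacpp_tokens_per_second Token generation speed
# TYPE llamacpp_tokens_per_second gauge
llamacpp_tokens_per_second{model="ai/llama3.2:1B-Q8_0"} 42.7
# HELP llamacpp_kv_cache_usage_bytes KV cache memory usage
# TYPE llamacpp_kv_cache_usage_bytes gauge
llamacpp_kv_cache_usage_bytes{model="ai/llama3.2:1B-Q8_0"} 268435456
```
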
## Setup and Configuration

The llama.cpp exporter and observability solution are automatically configured in the Docker Compose setup. Here are the key configuration points:

### Environment Variables

The following environment variables can be used to configure the exporter (a sample service definition using them appears under Docker Compose below):

- `LLAMACPP_BASE_URL`: URL of the llama.cpp API (default: `http://model-runner.docker.internal/engines/llama.cpp/v1`)
- `LLAMACPP_MODEL`: Model name used to label metrics (default: taken from `LLM_MODEL_NAME`, falling back to `ai/llama3.2:1B-Q8_0`)
- `LLAMACPP_EXPORTER_ADDR`: Address on which to expose metrics (default: `:9100`)
- `LLAMACPP_SCRAPE_INTERVAL`: Interval between metrics scrapes (default: `5s`)
- `LLAMACPP_CLIENT_TIMEOUT`: HTTP client timeout (default: `3s`)

### Docker Compose

The `compose.yaml` file includes the llama.cpp exporter service, which is configured to start automatically with the rest of the stack.

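As an illustration, a minimal service definition might look like the following. The service name, build context, and port mapping are assumptions; refer to the actual `compose.yaml` in the repository for the authoritative definition:

```yaml
services:
  llamacpp-exporter:
    build: ./llamacpp-exporter   # hypothetical build context
    environment:
      - LLAMACPP_BASE_URL=http://model-runner.docker.internal/engines/llama.cpp/v1
      # Fall back to the default model if LLM_MODEL_NAME is unset
      - LLAMACPP_MODEL=${LLM_MODEL_NAME:-ai/llama3.2:1B-Q8_0}
      - LLAMACPP_EXPORTER_ADDR=:9100
      - LLAMACPP_SCRAPE_INTERVAL=5s
      - LLAMACPP_CLIENT_TIMEOUT=3s
    ports:
      - "9100:9100"
```
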
### Prometheus

Prometheus is configured to scrape metrics from the llama.cpp exporter on its `/metrics` endpoint.

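A minimal scrape configuration for `prometheus.yml` could look like this; the job name and target host are assumptions based on the service and port described above:

```yaml
scrape_configs:
  - job_name: "llamacpp"                      # assumed job name
    scrape_interval: 5s                       # matches LLAMACPP_SCRAPE_INTERVAL
    static_configs:
      - targets: ["llamacpp-exporter:9100"]   # assumed Compose service name
```
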
### Grafana

A preconfigured Grafana dashboard is provided to visualize the llama.cpp metrics. It can be accessed at http://localhost:3001 (default credentials: `admin`/`admin`).

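If you want to build additional panels, PromQL queries against the gauges above are straightforward. These examples are illustrative and not taken from the shipped dashboard:

```
# Average token throughput over the last five minutes
avg_over_time(llamacpp_tokens_per_second[5m])

# KV cache utilization as a fraction of its limit
llamacpp_kv_cache_usage_bytes / llamacpp_kv_cache_limit_bytes
```
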
## Troubleshooting

If metrics are not appearing:

1. Check that the exporter is running: `docker compose ps | grep llamacpp-exporter`
2. Check the exporter logs: `docker compose logs llamacpp-exporter`
3. Verify connectivity to llama.cpp: `curl http://model-runner.docker.internal/engines/llama.cpp/v1/stats`
4. Verify that the exporter is exposing metrics: `curl http://localhost:9100/metrics`
5. Check the Prometheus targets page: http://localhost:9091/targets

## Extending the Solution

To add more metrics or customize the existing ones:

1. Modify the `pkg/llamacpp/metrics.go` file (see the sketch after this list)
2. Update the exporter to collect the new metrics
3. Update the Grafana dashboard to visualize the new metrics

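As a rough sketch, registering an additional gauge with the Prometheus Go client might look like this. The metric, variable names, and `model` label are assumptions for illustration; the actual structure of `pkg/llamacpp/metrics.go` may differ:

```go
package llamacpp

import (
	"github.com/prometheus/client_golang/prometheus"
)

// QueueDepth is a hypothetical new metric tracking queued inference
// requests; it follows the llamacpp_* naming used by the other gauges.
var QueueDepth = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "llamacpp_queue_depth",
		Help: "Number of inference requests waiting to be processed.",
	},
	[]string{"model"}, // same label the existing metrics are assumed to carry
)

func init() {
	// Register with the default registry so promhttp serves it on /metrics.
	prometheus.MustRegister(QueueDepth)
}

// RecordQueueDepth would be called from the exporter's scrape loop after
// reading the value from the llama.cpp stats endpoint.
func RecordQueueDepth(model string, depth float64) {
	QueueDepth.WithLabelValues(model).Set(depth)
}
```
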
## Limitations

- Metrics collection depends on the llama.cpp API being accessible
- Some metrics may not be available, depending on the llama.cpp version
- GPU metrics are only available when running with GPU support
