# llama.cpp Observability Solution

This document explains the observability solution implemented for llama.cpp when used with Docker Model Runner.

## Overview

The llama.cpp observability solution provides comprehensive monitoring of the llama.cpp inference engine, which is used by Docker Model Runner to execute and serve LLM models. This solution allows you to:

- Monitor key performance metrics of llama.cpp
- Track resource utilization
- Identify performance bottlenecks
- Optimize model inference

## Architecture

The observability solution consists of the following components:

1. **llama.cpp Metrics Exporter**: A standalone Go service that collects metrics from the llama.cpp API and exposes them in Prometheus format
2. **Prometheus**: Collects and stores the metrics
3. **Grafana Dashboard**: Visualizes the metrics in a comprehensive dashboard

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  llama.cpp   │ --> │   Metrics    │ --> │  Prometheus  │
│  (Docker     │     │   Exporter   │     │              │
│ Model Runner)│     │              │     │              │
└──────────────┘     └──────────────┘     └──────────────┘
       |                                         |
       |                                         v
       |                                  ┌──────────────┐
       └--------------------------------> │   Grafana    │
                                          │  Dashboard   │
                                          └──────────────┘
```

## Metrics Collected

The following metrics are collected from llama.cpp:

### Performance Metrics
- `llamacpp_tokens_per_second`: Token generation speed
- `llamacpp_batch_latency_seconds`: Batch processing latency
- `llamacpp_first_token_latency_seconds`: Time to first token

### Resource Utilization
- `llamacpp_memory_usage_bytes`: Memory usage
- `llamacpp_total_memory_bytes`: Total available memory
- `llamacpp_cpu_utilization_percent`: CPU utilization
- `llamacpp_gpu_utilization_percent`: GPU utilization (if available)
- `llamacpp_temperature_celsius`: GPU temperature (if available)

### Model Metrics
- `llamacpp_model_size_bytes`: Model size in bytes
- `llamacpp_model_parameters`: Number of model parameters
- `llamacpp_context_size_tokens`: Current context size
- `llamacpp_max_context_size_tokens`: Maximum context size

### KV Cache Metrics
- `llamacpp_kv_cache_usage_bytes`: KV cache memory usage
- `llamacpp_kv_cache_limit_bytes`: KV cache memory limit

### System Metrics
- `llamacpp_thread_count`: Number of threads used
- `llamacpp_status`: Current status (idle, loading, running)
- `llamacpp_batch_size`: Current batch size
- `llamacpp_optimal_batch_size`: Optimal batch size
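
For reference, here is a truncated, purely illustrative example of what the exporter exposes on `/metrics`. The values are made up, and the `model` label key is an assumption based on `LLAMACPP_MODEL` being used to label metrics:

```
llamacpp_tokens_per_second{model="ai/llama3.2:1B-Q8_0"} 42.7
llamacpp_first_token_latency_seconds{model="ai/llama3.2:1B-Q8_0"} 0.183
llamacpp_memory_usage_bytes{model="ai/llama3.2:1B-Q8_0"} 1.4e+09
llamacpp_kv_cache_usage_bytes{model="ai/llama3.2:1B-Q8_0"} 2.68e+08
```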

## Setup and Configuration

The llama.cpp exporter and observability solution are automatically configured in the Docker Compose setup. Here are the key configuration points:

### Environment Variables

The following environment variables can be used to configure the exporter:

- `LLAMACPP_BASE_URL`: URL for the llama.cpp API (default: `http://model-runner.docker.internal/engines/llama.cpp/v1`)
- `LLAMACPP_MODEL`: Model name for labeling metrics (default: from `LLM_MODEL_NAME` or `ai/llama3.2:1B-Q8_0`)
- `LLAMACPP_EXPORTER_ADDR`: Address to expose metrics on (default: `:9100`)
- `LLAMACPP_SCRAPE_INTERVAL`: Interval between metrics scrapes (default: `5s`)
- `LLAMACPP_CLIENT_TIMEOUT`: HTTP client timeout (default: `3s`)

### Docker Compose

The `compose.yaml` file includes the llama.cpp exporter service, which is configured to start automatically with the rest of the stack.
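
As a rough sketch, the service definition might look like the following. The build context is an assumption — check `compose.yaml` for the actual definition; the service name `llamacpp-exporter` matches the one used in the troubleshooting commands below:

```yaml
services:
  llamacpp-exporter:
    build: ./cmd/llamacpp-exporter   # hypothetical build context
    ports:
      - "9100:9100"                  # matches the default LLAMACPP_EXPORTER_ADDR
    environment:
      - LLAMACPP_BASE_URL=http://model-runner.docker.internal/engines/llama.cpp/v1
      - LLAMACPP_MODEL=ai/llama3.2:1B-Q8_0
      - LLAMACPP_SCRAPE_INTERVAL=5s
```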

### Prometheus

Prometheus is configured to scrape metrics from the llama.cpp exporter on the `/metrics` endpoint.
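
A minimal scrape job for the exporter could look like this; the job name is an assumption, so see the `prometheus.yml` shipped with this repo for the actual configuration:

```yaml
scrape_configs:
  - job_name: "llamacpp-exporter"
    metrics_path: /metrics
    scrape_interval: 5s
    static_configs:
      - targets: ["llamacpp-exporter:9100"]
```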

### Grafana

A preconfigured Grafana dashboard is provided to visualize llama.cpp metrics. It can be accessed at http://localhost:3001 (default credentials: admin/admin).
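
If you ever need to recreate the data source wiring, a Grafana provisioning file along these lines would work. The file path and the Prometheus URL are assumptions based on Prometheus's default in-network port:

```yaml
# e.g. grafana/provisioning/datasources/prometheus.yml (hypothetical path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```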

## Troubleshooting

If metrics are not appearing:

1. Check that the exporter is running: `docker compose ps | grep llamacpp-exporter`
2. Check the exporter logs: `docker compose logs llamacpp-exporter`
3. Verify connectivity to llama.cpp: `curl http://model-runner.docker.internal/engines/llama.cpp/v1/stats`
4. Verify that the exporter is exposing metrics: `curl http://localhost:9100/metrics`
5. Check the Prometheus targets page: http://localhost:9091/targets
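
If all of the above pass, a final sanity check is to query one of the exported metrics in the Prometheus UI (http://localhost:9091), for example:

```
# Average token throughput per model over the last 5 minutes
avg by (model) (avg_over_time(llamacpp_tokens_per_second[5m]))
```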

## Extending the Solution

To add more metrics or customize the existing ones:

1. Modify the `pkg/llamacpp/metrics.go` file (see the sketch after this list)
2. Update the exporter to collect the new metrics
3. Update the Grafana dashboard to visualize the new metrics
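
For example, registering an additional gauge with the Prometheus Go client might look like the sketch below. Everything here is hypothetical: the metric, label, and function names are illustrative, not part of the existing exporter:

```go
// Hypothetical addition to pkg/llamacpp/metrics.go.
package llamacpp

import "github.com/prometheus/client_golang/prometheus"

// queuedRequests tracks how many requests are waiting in the
// llama.cpp queue (an assumed stat, for illustration only).
var queuedRequests = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "llamacpp_queued_requests",
		Help: "Number of requests waiting in the llama.cpp queue.",
	},
	[]string{"model"},
)

func init() {
	// Register with the default registry so the gauge shows up
	// on the exporter's /metrics endpoint.
	prometheus.MustRegister(queuedRequests)
}

// recordQueueDepth would be called from the scrape loop after
// fetching stats from the llama.cpp API.
func recordQueueDepth(model string, depth float64) {
	queuedRequests.WithLabelValues(model).Set(depth)
}
```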

## Limitations

- Metrics collection depends on the llama.cpp API being reachable from the exporter
- Some metrics may not be available, depending on the llama.cpp version
- GPU metrics are only available when running with GPU support