---
date: 2026-03-30
id: natural-language-log-exploration
title: Natural Language Log Exploration
description: Search, filter, and analyze logs in SigNoz by asking questions in plain English through your AI assistant.
doc_type: howto
---

import GetHelp from '@/components/shared/get-help.md'

A product manager posts in #incidents:

> Multiple sellers are complaining that products they updated hours ago still show old prices in search results. The catalog page shows the right data, but search is stale.

You have SigNoz collecting logs and the MCP server connected to your AI assistant. You know search is powered by an indexing pipeline, but you don't know the internals.

## Prerequisites

- Connect your AI assistant to SigNoz using the [MCP Server guide](https://signoz.io/docs/ai/signoz-mcp-server).
- Make sure your services are sending logs to SigNoz. See [Send Logs to SigNoz](https://signoz.io/docs/userguide/logs) if you haven't set this up.

## Step 1: Search for the Symptom

```
Show me recent error or warning logs from any service related to search indexing or index lag in the last 6 hours.
```

The results cluster in `search-indexer`:

```
Found 34 logs matching across 2 services:

1. 14:52:11 WARN search-indexer - "Index lag exceeds threshold: 4h12m behind head (threshold: 15m)"
2. 14:47:03 WARN search-indexer - "Index lag exceeds threshold: 4h07m behind head (threshold: 15m)"
3. 14:42:01 WARN search-indexer - "Index lag exceeds threshold: 4h02m behind head (threshold: 15m)"
4. 14:22:18 WARN search-indexer - "Consumer group rebalance completed, partition assignment unchanged"
5. 13:15:44 WARN search-indexer - "Batch processing rate: 12 events/sec (normal: ~340 events/sec)"
...
```

The search indexer is more than four hours behind, and its processing rate has dropped from ~340 events/sec to 12. That explains the stale results. But there are no errors, only slowness. Why is it crawling?
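
Those two numbers are enough for a quick sanity check. As a back-of-envelope sketch (assuming events arrive at roughly the normal ~340 events/sec, which the logs do not state directly), the index falls behind by almost a full second for every second of wall-clock time:

```python
# Back-of-envelope check on the lag numbers in the warnings above.
# Assumption (not from the logs): events arrive at roughly the normal
# processing rate of ~340 events/sec.

incoming_rate = 340    # events/sec arriving (assumed)
processing_rate = 12   # events/sec actually indexed (from the logs)

# Seconds of lag accrued per wall-clock second while degraded:
lag_growth = 1 - processing_rate / incoming_rate   # ~0.965

hours_degraded = 4
lag_minutes = lag_growth * hours_degraded * 60
print(f"~{lag_minutes:.0f} minutes of lag accrued over {hours_degraded}h")
# → ~232 minutes of lag accrued over 4h
```

That is consistent with the 4h+ lag the indexer reports after roughly four hours of degraded throughput.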

## Step 2: Understand Why Throughput Dropped

```
Show me logs from search-indexer in the last 6 hours that mention "skip", "drop", "malformed", "parse", or "invalid".
```

The volume is striking:

```
Found 9,847 logs matching:

1. 14:51:58 WARN "Skipping malformed event: missing required field 'sku_id' (event_source: catalog-pipeline)"
2. 14:51:57 WARN "Skipping malformed event: field 'price' is not numeric: 'USD29.99' (event_source: catalog-pipeline)"
3. 14:51:55 WARN "Parse retry exhausted for event, moving to dead letter queue (event_source: catalog-pipeline)"
...
```

Nearly 10,000 malformed events in 6 hours. The indexer is spending all its time retrying bad data and dead-lettering it, so valid events are stuck behind the flood. Every bad event comes from `catalog-pipeline`.

```
How many "Skipping malformed event" warnings has search-indexer logged per hour over the last 24 hours?
```

The hourly breakdown shows a clear inflection point:

```
Malformed event warnings per hour (search-indexer):

  00:00 - 09:59 UTC: 0-3/hour (baseline noise)
  10:00 - 10:59: 2
  11:00 - 11:59: 1,847 <-- spike
  12:00 - 12:59: 1,923
  13:00 - 13:59: 1,812
  14:00 - 14:59: 1,690 (ongoing)
```

The malformed events started at 11:00 UTC. Something changed in `catalog-pipeline` around that time.
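
The inflection-point reading is mechanical enough to sketch. A minimal version using the hourly counts reported above (the dict literal and the 100x threshold are illustrative choices, not SigNoz internals):

```python
# Hourly counts as reported by the aggregation above; the threshold is an
# illustrative choice, not SigNoz internals.
hourly_counts = {10: 2, 11: 1847, 12: 1923, 13: 1812, 14: 1690}
baseline_per_hour = 3  # "0-3/hour (baseline noise)"

# First hour whose count is two orders of magnitude above baseline:
spike_start = next(
    hour
    for hour, count in sorted(hourly_counts.items())
    if count > 100 * baseline_per_hour
)
print(f"Malformed-event spike began at {spike_start}:00 UTC")
# → Malformed-event spike began at 11:00 UTC
```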

## Step 3: Trace the Root Cause Upstream

```
Show me INFO and WARN logs from catalog-pipeline between 10:45 and 11:15 UTC today. I am looking for deployments, config changes, or schema changes.
```

The deployment logs tell the story:

```
Found 28 logs:

1. 10:52:03 INFO "Deployment started: catalog-pipeline v2.14.0 -> v2.15.0 (deployer: ci-bot)"
2. 10:52:18 INFO "Migration applied: product_event_schema_v3"
3. 10:52:19 INFO "Event format updated: sku_id field moved from root to nested product.identifiers.sku_id"
4. 10:52:19 INFO "Event format updated: price field changed from cents (int) to formatted string (e.g. 'USD29.99')"
5. 10:52:31 INFO "Deployment complete: catalog-pipeline v2.15.0 healthy"
6. 10:53:01 INFO "Backfill started: reprocessing 14,291 products with new schema"
7. 11:01:12 INFO "Backfill complete: 14,291 events published"
```

`catalog-pipeline` v2.15.0 changed the event schema in two breaking ways: it moved `sku_id` into a nested path and changed `price` from integer cents to a formatted string. The search indexer still expects the old schema. Every event from the new version fails validation. On top of that, the backfill re-published 14,291 products in the new format, flooding the indexer with unparseable data.
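
The two warnings from Step 2 line up exactly with these schema changes. A hypothetical sketch of the old-schema validation implied by the log messages makes the failure mode concrete (the field checks come from the logs; the function itself is not the real `search-indexer` code):

```python
# Hypothetical sketch of the old-schema validation implied by the warnings;
# field names come from the logs, the code itself is not search-indexer's.

def validate_event(event: dict) -> list[str]:
    """Return validation errors under the old (v2.14.0) event schema."""
    errors = []
    if "sku_id" not in event:           # old schema: sku_id at the root
        errors.append("missing required field 'sku_id'")
    price = event.get("price")
    if not isinstance(price, int):      # old schema: integer price in cents
        errors.append(f"field 'price' is not numeric: {price!r}")
    return errors

old_event = {"sku_id": "A-100", "price": 2999}          # v2.14.0 shape
new_event = {"product": {"identifiers": {"sku_id": "A-100"}},
             "price": "USD29.99"}                        # v2.15.0 shape

assert validate_event(old_event) == []
print(validate_event(new_event))
# → ["missing required field 'sku_id'", "field 'price' is not numeric: 'USD29.99'"]
```

Both breaking changes trip the old-schema checks, so every v2.15.0 event is skipped or dead-lettered.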

## Step 4: Scope the Impact

```
How many events has search-indexer moved to the dead letter queue in the last 6 hours? What is the current index lag?
```

The damage:

```
Dead letter queue (last 6 hours):
  - Events moved to DLQ: 9,214
  - Estimated unique products affected: ~6,800

Current index lag: 4h17m behind head
Indexer throughput: 12 events/sec (normal: 340 events/sec)
```

Roughly 6,800 products have stale search data, and the lag keeps growing because new events from v2.15.0 continue arriving in the broken format. The fix: either roll back `catalog-pipeline` to v2.14.0, or deploy a hotfix to `search-indexer` that handles both schema versions. The 9,214 dead-lettered events will need to be replayed after the fix.
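
If the hotfix route is chosen, its core is a normalizer that accepts both schema versions and emits one canonical shape. This is a hedged sketch based only on the fields seen in the logs; `normalize_event` and its price-string format are assumptions, not real indexer or SigNoz code:

```python
# Hypothetical dual-schema normalizer for the search-indexer hotfix.
# Fields are taken from the deployment logs; everything else is assumed.
import re

def normalize_event(event: dict) -> dict:
    # sku_id: at the root (v2.14.0) or nested under product.identifiers (v2.15.0)
    sku_id = event.get("sku_id") or (
        event.get("product", {}).get("identifiers", {}).get("sku_id")
    )
    if sku_id is None:
        raise ValueError("event has no sku_id in either schema")

    # price: integer cents (v2.14.0) or a formatted string like 'USD29.99' (v2.15.0)
    price = event["price"]
    if isinstance(price, str):
        match = re.fullmatch(r"[A-Z]{3}(\d+)\.(\d{2})", price)
        if match is None:
            raise ValueError(f"unparseable price: {price!r}")
        price = int(match.group(1)) * 100 + int(match.group(2))

    return {"sku_id": sku_id, "price": price}

# Both versions normalize to the same canonical event:
assert normalize_event({"sku_id": "A-100", "price": 2999}) == \
       normalize_event({"product": {"identifiers": {"sku_id": "A-100"}},
                        "price": "USD29.99"})
```

Rolling back to v2.14.0 is the faster mitigation; a normalizer like this lets v2.15.0 events keep flowing while the dead-lettered backlog is replayed.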

## Tips for Your Own Investigations

1. **Start with what you know.** The Slack message, the error alert, the customer complaint. Search for that first.
2. **Follow the thread.** When results mention another service, a timeout, or an error code, ask about that next.
3. **Scope before you dig.** Once you know what is failing, check how many errors there are, when they started, and whether they are increasing.
4. **Find the boundary.** Zoom in on the moment errors started. The logs right before the first error often reveal the trigger.

<Admonition type="tip">
If a field like `service.name` is not available, ask the assistant to discover fields: _"What resource attributes are available for logs?"_ Field availability depends on how your services are instrumented.
</Admonition>

<details>
<ToggleHeading>
## Under the Hood
</ToggleHeading>

During this investigation, the MCP server called these tools:

| Step | MCP Tool | What It Did |
|------|----------|-------------|
| 1 | `signoz_search_logs` | Searched across all services for warning/error logs matching search indexing keywords |
| 2 | `signoz_search_logs` | Found malformed event warnings in the indexer, revealing an upstream data quality issue |
| 2 | `signoz_aggregate_logs` | Computed malformed event counts per hour to pinpoint when the problem started |
| 3 | `signoz_search_logs` | Found deployment and schema migration logs in catalog-pipeline around the start time |
| 4 | `signoz_aggregate_logs` | Counted dead-lettered events to measure the blast radius |

</details>

## Next Steps

- [Latency Spike Explainer](https://signoz.io/docs/ai/use-cases/latency-spike-explainer) - Ask "why is this slow?" and trace the bottleneck.
- [Reconstruct a Bug from a Trace ID](https://signoz.io/docs/ai/use-cases/reconstruct-bug-from-trace-id) - Debug a support ticket with a trace ID.

<GetHelp />