Pull request overview
Improves the document extraction workflow by avoiding redundant OCR when a document already has extracted text, preserving OCR spatial (TSV) data for re-extraction, and enhancing OS “open file” error messages for headless environments.
Changes:
- Skip OCR when `ExtractedText` is already present and pass through cached OCR TSV (`ExtractData`) to the LLM prompt.
- Extend `startExtractionOverlay` to accept/preserve `extractData` and update all call sites accordingly.
- Improve `xdg-open` / opener failure messaging (headless display detection) and add focused tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| internal/app/model.go | Passes ExtractData into extraction overlay and skips OCR early when ExtractedText already exists. |
| internal/app/extraction.go | Adds extractData parameter; skips OCR when cached text exists; preserves TSV data in LLM sources. |
| internal/app/extraction_test.go | Adds coverage for “skip OCR with existing text” and TSV preservation cases. |
| internal/app/docopen.go | Improves actionable error messages for missing opener and headless display failures; adds hasDisplay(). |
| internal/app/docopen_test.go | Adds tests for new opener error wrapping behavior. |
Comments suppressed due to low confidence (1)
internal/app/extraction.go:297
- For cached text on PDFs, the initial `TextSource` is always labeled as `pdftotext` with a “digital text” description. If the prior run was OCR-based (scanned PDF), `extractData` (TSV) will likely be non-empty and the tool/description should reflect OCR (e.g. `tesseract`) so the LLM prompt/UI aren’t misleading. Consider branching on `len(extractData) > 0` here to pick the correct tool/desc (and possibly step detail).
case mime == extract.MIMEApplicationPDF:
tool = "pdftotext"
desc = "Digital text extracted directly from the PDF."
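The branching suggested in this comment could look roughly like the sketch below. `sourceLabel`, its parameters, the OCR description text, and the fallback branch are all hypothetical names and wording chosen for illustration; only the `pdftotext` label and its description come from the snippet above.

```go
package main

import "fmt"

// sourceLabel picks the tool name and description for cached text.
// Hypothetical helper: the signature, the OCR wording, and the default
// branch are assumptions, not code from this PR.
func sourceLabel(isPDF bool, extractData []byte) (tool, desc string) {
	switch {
	case len(extractData) > 0:
		// TSV layout data is present, so the cached text most likely
		// came from an OCR run rather than from digital PDF text.
		return "tesseract", "Text recognized by OCR in a previous run."
	case isPDF:
		return "pdftotext", "Digital text extracted directly from the PDF."
	default:
		return "cached", "Text cached from a previous extraction."
	}
}

func main() {
	tool, desc := sourceLabel(true, []byte("level\tpage_num\ttext\n"))
	fmt.Println(tool, "-", desc)

	tool, desc = sourceLabel(true, nil)
	fmt.Println(tool, "-", desc)
}
```

Checking the TSV case first means a scanned PDF is never reported as digital text, which is the mislabeling the comment warns about.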
When re-extracting a document that already has text from a previous run, skip the OCR/text extraction steps and feed the existing text directly to the LLM for structured data extraction. This avoids redundant OCR work and makes re-extraction fast when only the LLM pass is needed.

- startExtractionOverlay skips OCR when extractedText is non-empty
- Images with cached OCR text now show the text step (normally hidden)
- Previous ExtractData (TSV) is preserved in sources for spatial layout
- afterDocumentSave also skips OCR in its early bailout check

closes #711

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
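As a rough illustration of the decision this commit describes, the sketch below derives a plan from the cached values. `extractionPlan` and `planExtraction` are hypothetical and do not appear in the PR; the real logic lives inside `startExtractionOverlay`, whose signature is not shown here.

```go
package main

import "fmt"

// extractionPlan is a hypothetical summary of what a re-extraction would do.
type extractionPlan struct {
	RunOCR       bool   // run the OCR/text extraction steps
	ShowTextStep bool   // show the text step in the overlay
	LayoutTSV    []byte // previously captured TSV forwarded to the LLM
}

// planExtraction skips OCR when cached text exists and carries the cached
// TSV forward so spatial layout is still available to the LLM.
func planExtraction(extractedText string, extractData []byte) extractionPlan {
	if extractedText == "" {
		// Nothing cached: the full pipeline runs; step visibility then
		// follows whatever the normal rules are (not modeled here).
		return extractionPlan{RunOCR: true}
	}
	return extractionPlan{
		RunOCR:       false,
		ShowTextStep: true,
		LayoutTSV:    extractData,
	}
}

func main() {
	p := planExtraction("Invoice 42", []byte("level\ttext\n"))
	fmt.Println(p.RunOCR, p.ShowTextStep, len(p.LayoutTSV) > 0)
}
```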
52b7569 to 1e52267
1e52267 to 3943739
Pull request overview
This PR improves the document extraction workflow by reusing previously extracted text to avoid redundant OCR work, adds an “extract” keybinding on document tabs, and makes OS “open document” failures more actionable on headless systems.
Changes:
- Skip OCR when a document already has extracted text, and pass along existing TSV/layout data into the LLM prompt.
- Add `r` keybinding (and status hint) in edit mode on document tabs to trigger extraction for the selected document.
- Persist and display extraction metadata (extraction model + serialized extraction operations), and improve `xdg-open`/opener error messages for headless environments.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/data/store.go | Extends document listing columns and updates extraction persistence to include model/ops. |
| internal/data/models.go | Adds ExtractionModel and ExtractionOps fields to Document. |
| internal/data/meta_generated.go | Adds generated column constants for extraction_model and extraction_ops. |
| internal/data/fts_test.go | Updates test callsite for new UpdateDocumentExtraction signature. |
| internal/app/view.go | Adds edit-mode status hint for r extract on document tabs. |
| internal/app/tables.go | Adds “Model” column value rendering for document tables. |
| internal/app/model.go | Wires r in edit mode to trigger extraction; passes ExtractData into overlay. |
| internal/app/extraction_test.go | Adds tests for “skip OCR when existing text” and model-used behavior. |
| internal/app/extraction.go | Implements OCR skip when cached text exists; persists extraction model/ops. |
| internal/app/docopen_test.go | Adds cross-platform ExitError helper and new opener error tests. |
| internal/app/docopen.go | Adds extractSelectedDocument and improves opener error messaging/headless detection. |
| internal/app/columns_generated.go | Adds generated document column index for “Model”. |
| internal/app/coldefs.go | Adds “Model” column definition for document tables. |
Comments suppressed due to low confidence (1)
internal/data/store.go:1234
- UpdateDocumentExtraction unconditionally updates ocr_data/extraction_model/extraction_ops even when callers pass zero values (e.g. data=nil). In the new “skip OCR on re-extraction” flow, this will overwrite previously stored TSV layout data with NULL when OCR is skipped, which contradicts the goal of preserving spatial data. Build the updates map conditionally (only include ocr_data when a new value is present / explicitly clearing), and similarly avoid overwriting extraction_model/extraction_ops unless a successful LLM run produced new values.
updates := map[string]any{
ColExtractData: data,
ColExtractionModel: model,
ColExtractionOps: ops,
}
if text != "" {
updates[ColExtractedText] = text
}
return s.db.Model(&Document{}).Where(ColID+" = ?", id).Updates(updates).Error
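One way to build the map conditionally, as the comment suggests, is sketched below. The helper name and the `extracted_text` column string are assumptions; the other column names are taken from the comment text. The sketch only covers the "skip zero values" half of the suggestion and offers no way to clear a column explicitly.

```go
package main

import "fmt"

// Placeholder column names; the real constants are generated elsewhere.
const (
	colExtractedText   = "extracted_text" // assumed name
	colExtractData     = "ocr_data"
	colExtractionModel = "extraction_model"
	colExtractionOps   = "extraction_ops"
)

// buildExtractionUpdates adds a column only when the caller supplied a
// non-zero value, so a skipped OCR pass (data == nil) leaves the stored
// TSV untouched instead of overwriting it with NULL.
func buildExtractionUpdates(text string, data []byte, model string, ops []byte) map[string]any {
	updates := map[string]any{}
	if text != "" {
		updates[colExtractedText] = text
	}
	if len(data) > 0 {
		updates[colExtractData] = data
	}
	if model != "" {
		updates[colExtractionModel] = model
	}
	if len(ops) > 0 {
		updates[colExtractionOps] = ops
	}
	return updates
}

func main() {
	// LLM-only re-extraction: no new text or TSV, but a model and ops.
	u := buildExtractionUpdates("", nil, "some-model", []byte(`[]`))
	_, touchesOCR := u[colExtractData]
	fmt.Println(len(u), touchesOCR)
}
```

A caller would presumably return early when the map comes back empty rather than issuing an update with no columns.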
…ss server

Instead of showing a cryptic "open: exit status 3", detect when xdg-open fails because no display server is available and show a message explaining the likely cause and environment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add an explicit `r` keybinding in edit mode on document tabs to trigger extraction on the selected document. This makes the OCR-skip feature from the previous commit discoverable and testable -- previously extraction only ran as a side effect of saving a document form.

The status bar shows `r extract` on document tabs. If no extraction tools or LLM are configured, a status error is shown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3943739 to e725cb2
e725cb2 to 0c2209f
Pull request overview
This PR improves the document extraction workflow by enabling faster re-extraction using cached text/OCR artifacts, adds a dedicated extraction keybinding in the TUI, and persists extraction provenance (model + operations) on documents for inspection/audit.
Changes:
- Skip OCR when a document already has extracted text, while preserving any previously captured TSV/layout data for LLM re-extraction.
- Add `r` keybinding (Edit mode, document tabs) to open the extraction overlay for the selected document, plus footer hint updates.
- Persist `extraction_model` and `extraction_ops` on `documents`, and display the model in the Documents table.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| internal/data/store.go | Includes extraction_model in document list queries; expands UpdateDocumentExtraction to persist model/ops with sparse updates. |
| internal/data/models.go | Adds ExtractionModel and ExtractionOps fields to the Document model. |
| internal/data/meta_generated.go | Adds generated column constants for extraction_model and extraction_ops. |
| internal/data/fts_test.go | Updates FTS test call sites for the new UpdateDocumentExtraction signature. |
| internal/app/view.go | Adds an “extract” help hint on document tabs in Edit mode. |
| internal/app/tables.go | Adds a visible “Model” column value in document row rendering. |
| internal/app/model.go | Adds r handling in edit-mode key dispatch; passes ExtractData into extraction overlay; skips OCR when cached text exists. |
| internal/app/forms.go | Treats the new “Model” column like other read-only columns for inline edit routing. |
| internal/app/extraction.go | Implements cached-text OCR skipping, carries forward TSV data, and persists extraction model + operations metadata. |
| internal/app/extraction_test.go | Adds tests for r keybinding, OCR skip behavior, TSV preservation, and model-used logic. |
| internal/app/docopen.go | Adds extractSelectedDocument; improves opener error messages for headless environments and adds display detection. |
| internal/app/docopen_test.go | Adds cross-platform ExitError helper and tests for improved opener error messaging paths. |
| internal/app/coldefs.go | Adds “Model” to document column definitions. |
| internal/app/columns_generated.go | Updates generated document column enum to include documentColModel. |
0c2209f to 024ed7c
024ed7c to 3f51ecb
Pull request overview
This PR improves the document extraction workflow by enabling faster re-extraction (reusing previously extracted text/TSV), adding a dedicated extract keybinding in document edit mode, and persisting/displaying extraction provenance (model + ops) for auditability.
Changes:
- Skip OCR when a document already has extracted text; preserve prior TSV (layout) data for LLM-only re-extraction.
- Add `r` keybinding (+ status hint) on document tabs to open the extraction overlay for the selected document.
- Persist `extraction_model` and `extraction_ops` on documents and show the model in the Documents table; improve `xdg-open` headless error messaging.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| internal/data/store.go | Select extraction_model when listing docs; extend UpdateDocumentExtraction to persist model/ops and avoid zero-value overwrites. |
| internal/data/models.go | Add ExtractionModel and ExtractionOps fields to Document. |
| internal/data/meta_generated.go | Add column constants for extraction_model / extraction_ops. |
| internal/data/fts_test.go | Update call site for new UpdateDocumentExtraction signature. |
| internal/app/view.go | Add edit-mode footer hint for the new extract action on document tabs. |
| internal/app/tables.go | Add “Model” cell to document rows. |
| internal/app/model.go | Bind r in edit mode; pass ExtractData into extraction overlay; skip OCR when cached text exists. |
| internal/app/forms.go | Route clicks on the new Model column to the document edit form (read-only column behavior). |
| internal/app/extraction_test.go | Add tests for r keybinding, OCR-skip behavior, TSV preservation, and model-label selection. |
| internal/app/extraction.go | Implement OCR skip when cached text exists; preserve TSV in sources; persist extraction model + ops; add marshalOps + extractionModelUsed. |
| internal/app/docopen_test.go | Add cross-platform helper-process ExitError generation and new wrapOpenerError test cases. |
| internal/app/docopen.go | Add extractSelectedDocument; improve opener error messages for headless/remote environments; add hasDisplay. |
| internal/app/columns_generated.go | Add documentColModel constant. |
| internal/app/coldefs.go | Add “Model” column definition to document table column specs. |
)

Store which LLM model produced the extraction and the raw operations JSON alongside the document. The model column is visible in the documents table; the ops blob is stored for future inspection (see #766).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
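Serializing the operations for storage might look like the sketch below. The PR mentions a `marshalOps` helper, but the `operation` struct, its fields, and the nil-on-empty behavior shown here are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// operation is a hypothetical shape for one extraction operation; the
// real type in the PR is not shown in this conversation.
type operation struct {
	Action string         `json:"action"`
	Table  string         `json:"table"`
	Data   map[string]any `json:"data,omitempty"`
}

// marshalOps serializes operations to JSON for storage next to the
// document. It returns nil when there is nothing to store, so a caller
// doing sparse updates can leave an existing blob untouched.
func marshalOps(ops []operation) ([]byte, error) {
	if len(ops) == 0 {
		return nil, nil
	}
	return json.Marshal(ops)
}

func main() {
	blob, err := marshalOps([]operation{{Action: "create", Table: "documents"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(blob))
}
```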
3f51ecb to 3ab7e71
Summary
- `r` keybinding in edit mode on document tabs to trigger extraction on the selected document
- `xdg-open` error messages on headless/remote servers: detect missing `DISPLAY`/`WAYLAND_DISPLAY` and surface an actionable message instead of a cryptic exit code

closes #711
closes #764