Sapana-Micro-Software/Document-Extract-I

Document-Extract-I (DoclingSwift)

Native Swift tooling to convert plain text, Markdown, and PDF into a unified document model with Markdown or JSON export. Inspired by concepts from Docling, but implemented in-process on Apple platforms (PDFKit + optional Vision OCR). This repo is not a drop-in replacement for upstream Docling.

Requirements

  • Swift 6.x (swift-tools-version: 6.0)
  • macOS 13+ or iOS 16+ (library targets match Package.swift)
  • Full Xcode is strongly recommended so tests (Swift Testing) and PDF tooling behave predictably.

Quick start

swift build -c release
swift run docling-swift convert path/to/note.md --format markdown

Outputs go to stdout unless you pass --output-file or --output-directory.

CLI: docling-swift

Default subcommand is convert:

swift run docling-swift convert <paths...> [options]
swift run docling-swift convert --help

| Flag | Meaning |
| --- | --- |
| positional | One or more input paths: .txt, .text, .md, .markdown, .pdf. |
| --format | markdown or json (default markdown). |
| --output-file | Exact output path (single input only). Mutually exclusive with --output-directory. |
| --output-directory | Folder receiving <basename>.md or <basename>.json. Required when you pass multiple inputs. With a single input it writes one file named after that input’s basename. Must be a directory, not something like out.md; use --output-file for a concrete filename. |
| --jobs | Parallel PDF page work and concurrent file conversions (defaults to the active processor count). |
| --progress / --no-progress | Progress on stderr (default on). Keeps stdout free for piping. |
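
The output rules above (stdout by default, --output-file for exactly one input, --output-directory required for several inputs) can be modelled in a few lines. The following is only a sketch of those documented rules, not the code docling-swift actually runs: OutputFormat, Destination, OutputPlanError, and planOutputs are hypothetical names introduced for illustration, and only Foundation is used.

```swift
import Foundation

// Hypothetical model of the documented output rules; not DoclingSwift's real code.
enum OutputFormat {
    case markdown, json
    var fileExtension: String { self == .markdown ? "md" : "json" }
}

enum Destination: Equatable {
    case stdout
    case file(URL)
}

enum OutputPlanError: Error, Equatable {
    case bothFileAndDirectory          // the two flags are mutually exclusive
    case outputFileNeedsSingleInput    // --output-file accepts one input only
    case multipleInputsNeedDirectory   // several inputs require --output-directory
}

func planOutputs(
    inputs: [URL],
    format: OutputFormat,
    outputFile: URL?,
    outputDirectory: URL?
) throws -> [Destination] {
    if outputFile != nil && outputDirectory != nil {
        throw OutputPlanError.bothFileAndDirectory
    }
    if let file = outputFile {
        guard inputs.count == 1 else { throw OutputPlanError.outputFileNeedsSingleInput }
        return [.file(file)]
    }
    if let directory = outputDirectory {
        // <basename>.md or <basename>.json inside the directory.
        return inputs.map { input in
            let basename = input.deletingPathExtension().lastPathComponent
            return .file(
                directory
                    .appendingPathComponent(basename)
                    .appendingPathExtension(format.fileExtension)
            )
        }
    }
    guard inputs.count <= 1 else { throw OutputPlanError.multipleInputsNeedDirectory }
    return inputs.map { _ in .stdout }
}

// Usage: two inputs, JSON, written into a directory.
let plan = try planOutputs(
    inputs: [URL(fileURLWithPath: "a.txt"), URL(fileURLWithPath: "b.pdf")],
    format: .json,
    outputFile: nil,
    outputDirectory: URL(fileURLWithPath: "/tmp/out", isDirectory: true)
)
for case .file(let url) in plan {
    print(url.lastPathComponent)
}
```

The sketch does not say what happens when two inputs share a basename (for example a.txt and a.pdf); the README does not document that case, so check the CLI's behaviour before relying on it.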

Examples

# Pipe-friendly (stdout)
swift run docling-swift convert paper.pdf --format markdown

# Explicit output file
swift run docling-swift convert paper.pdf --format markdown --output-file paper.md

# Put output next to other files using the PDF basename
swift run docling-swift convert paper.pdf --format markdown --output-directory ~/Downloads

# Several documents at once → directory required
swift run docling-swift convert a.txt b.pdf --format json --output-directory ./out --jobs 4

Library: DoclingCore

When you depend on the package through SwiftPM, from another package or from Xcode:

import DoclingCore

let result = try DocumentConverter().convert(source: url)
print(result.document.exportMarkdown())
// or structured export:
let json = try result.document.exportJSON(options: ExportOptions(prettyPrintJSON: true))

// Multiple files, parallel batches (preserve order):
let urls = [...]
let results = try DocumentConverter(
    options: DocumentConverterOptions(
        maximumParallelPDFPages: 8,
        maximumParallelFiles: 4,
        progress: { fraction, msg in  }
    )
).convertBatch(sources: urls)

How DoclingSwift's behaviour compares with the Python Docling is summarized by DoclingCapabilityMatrix.summaryMarkdown and DoclingCapabilitySummary in DoclingCore.
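
To consume DoclingCore from another package, the manifest might look roughly like the sketch below. Several details are assumptions rather than facts from this README: the repository URL is inferred from the repository name, the branch main is a guess (no releases are published, so there is no version tag to pin), the library product is assumed to be called DoclingCore like the module, and MyApp is a hypothetical consumer target.

```swift
// swift-tools-version: 6.0
import PackageDescription

let package = Package(
    name: "MyApp", // hypothetical consumer package
    platforms: [.macOS(.v13), .iOS(.v16)], // matches the stated requirements
    dependencies: [
        // Assumed URL and branch; adjust to the real repository location.
        .package(
            url: "https://github.com/Sapana-Micro-Software/Document-Extract-I",
            branch: "main"
        ),
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [
                // Assumed product name; check the package's Package.swift.
                .product(name: "DoclingCore", package: "Document-Extract-I"),
            ]
        ),
    ]
)
```

Pinning to a branch is only a stopgap; once the repository publishes tagged releases, a version requirement such as from: would give more reproducible builds.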

Building and testing

swift build             # Debug
swift build -c release  # Release

Tests use Swift Testing. Prefer:

make test               # wraps `scripts/swift-test.sh`; picks the Xcode developer directory when only the Command Line Tools are selected
swift test              # plain SwiftPM (works when `swift test` can find and link Swift Testing; the package pins macOS linker search paths for typical Xcode installs)

If swift test fails with a missing _TestingInterop symbol or a similar error, install Xcode and select it:

sudo xcode-select -s /Applications/Xcode.app/Contents/Developer

(or run make test, which adjusts DEVELOPER_DIR when Xcode.app is present.)

Relation to upstream Docling

Upstream docling-project/docling supports many formats, layout models, and integrations. DoclingSwift focuses on Apple-native conversion for a subset of formats. See DoclingCore for the capability summary.
