Doc-Textify

Doc-Textify is a TypeScript library and command-line tool that extracts and cleans text from various document formats.

🚀 Features

  • Multi-format support:

    • Microsoft Word (.docx)
    • PowerPoint (.pptx)
    • Excel (.xlsx)
    • OpenOffice/LibreOffice (.odt, .odp, .ods)
    • PDF (.pdf)
    • Plain text (.txt)
    • HTML (.html, .htm)
  • Content cleaning: removes extra whitespace, handles custom line delimiters.

  • Configurable options: set newline delimiter, minimum characters to extract, and toggle error logging.
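As a rough illustration of what the cleaning and option handling described above could look like, here is a minimal sketch. It is not doc-textify's actual implementation: `cleanText` and `CleanOptions` are hypothetical names, and the reading of `minCharsToExtract` (return nothing when the cleaned text is shorter than the threshold) is an assumption based on the option descriptions in this README.

```typescript
// Hypothetical options shape; the field names mirror the options listed in this README.
interface CleanOptions {
  newlineDelimiter: string;
  minCharsToExtract: number;
}

// Illustrative only: NOT the library's real cleaning code.
function cleanText(raw: string, options: CleanOptions): string {
  const lines = raw
    .split(/\r?\n/) // split on LF or CRLF line endings
    .map((line) => line.replace(/\s+/g, " ").trim()) // collapse runs of whitespace
    .filter((line) => line.length > 0); // drop lines that end up empty

  const cleaned = lines.join(options.newlineDelimiter);

  // Assumed semantics: below the threshold, output nothing; 0 disables the check.
  return cleaned.length >= options.minCharsToExtract ? cleaned : "";
}
```

With `newlineDelimiter: "\n"` and `minCharsToExtract: 0`, an input with padded and blank lines would come back as compact, single-spaced lines joined by the delimiter.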

📦 Library Usage

Install the package and import it in your project:

npm install doc-textify --save

import { docTextify } from 'doc-textify'

// async/await version
try {
    const text = await docTextify('path/to/file.pdf')
    console.log(text)
} catch (err) {
    console.error(err)
}

// or promise-chain version
docTextify('path/to/file.pdf')
    .then(text => console.log(text))
    .catch(err => console.error(err))

Default options:

try {
    const text = await docTextify('path/to/file.pdf', {
        newlineDelimiter: '\n', // delimiter inserted between lines of output
        minCharsToExtract: 0, // minimum number of characters required to output the content; 0 disables the check
        outputErrorToConsole: true // log errors to the console
    })
    console.log(text)
} catch (err) {
    console.error(err)
}
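When several documents need processing, one failing file should not abort the rest. The sketch below shows one possible way to collect per-file results. `extractAll`, `Extractor`, and `ExtractionResult` are hypothetical names introduced for illustration; the extractor is passed in as a parameter so the helper does not assume anything about doc-textify beyond "a function that takes a path and returns a promise of text".

```typescript
// Any function that turns a file path into extracted text (assumed shape).
type Extractor = (path: string) => Promise<string>;

// Hypothetical result record: either `text` or `error` is set.
interface ExtractionResult {
  path: string;
  text?: string;
  error?: string;
}

// Process files one by one and record failures instead of throwing.
async function extractAll(paths: string[], extract: Extractor): Promise<ExtractionResult[]> {
  const results: ExtractionResult[] = [];
  for (const path of paths) {
    try {
      results.push({ path, text: await extract(path) });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ path, error: message });
    }
  }
  return results;
}
```

In a project that has doc-textify installed, the extractor could be something like `(p) => docTextify(p, { outputErrorToConsole: false })`, so that errors are reported once in the result list rather than also being logged to the console.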

🚀 CLI Usage (Optional)

If you prefer a ready-made command, the doc-textify CLI wraps the same functionality.

Installation

Global install to use the doc-textify command anywhere:

npm install -g doc-textify

Or install locally:

npm install doc-textify --save

Command

doc-textify <path/to/document> [options]

Options

Option                    Description                                Default
-n, --newlineDelimiter    Line delimiter to insert                   "\n"
-m, --minCharsToExtract   Minimum number of characters to extract    0 (disabled)
-h, --help                Display help message

Example

doc-textify document.docx -n "\r\n" -m 20 > output.txt
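If the CLI is driven from a script rather than typed by hand, the argument list can be assembled programmatically. The helper below is a small sketch: `buildCliArgs` and `CliOptions` are hypothetical names, and only the `-n` and `-m` flags from the options table are used. This README does not say whether the CLI expects a literal escape sequence such as `\r\n` or the raw control characters, so the delimiter is passed through unchanged.

```typescript
// Hypothetical option bag; the fields map to the -n and -m flags in the table above.
interface CliOptions {
  newlineDelimiter?: string;
  minCharsToExtract?: number;
}

// Build the argument array for: doc-textify <path/to/document> [options]
function buildCliArgs(documentPath: string, options: CliOptions = {}): string[] {
  const args: string[] = [documentPath];
  if (options.newlineDelimiter !== undefined) {
    args.push("-n", options.newlineDelimiter); // passed through as-is, no unescaping
  }
  if (options.minCharsToExtract !== undefined) {
    args.push("-m", String(options.minCharsToExtract));
  }
  return args;
}
```

The resulting array can be handed to Node's `execFile` from `node:child_process` with `doc-textify` as the command, which avoids shell quoting issues for the delimiter value.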

📥 Installation from Source

git clone https://github.com/johaven/doc-textify.git
cd doc-textify
npm install
npm run build    # outputs compiled files into /dist
npm run test     # test parsing

🤝 Contributing

  1. Fork the repository
  2. Create a branch: git checkout -b feature/my-feature
  3. Commit your changes: git commit -m "Add my feature"
  4. Push to your branch: git push origin feature/my-feature
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License.