TIKA-4578 #2462
base: main
TIKA-4578: Add Docker build configuration for tika-grpc
- Add Dockerfile with Ubuntu base, Java 17, OCR, and font support
- Add docker-build.sh script to build and optionally push images
- Add start-tika-grpc.sh entrypoint script
- Include all tika-pipes plugins (fetchers, emitters, iterators)
- Include parser packages (standard, extended, ML, scientific, sqlite3, NLP)
- Add README with usage instructions and examples
- Support multi-arch builds and multiple registries (Docker Hub, ECR, ACR)
TIKA-4578: Integrate Docker build into Maven lifecycle
- Add two exec-maven-plugin configurations matching reference pattern
- First plugin: runs TikaGrpcServer (exec:java)
- Second plugin: chmod and docker-build.sh executions
- Add validate phase execution to chmod +x the docker-build.sh script
- Add package phase execution to run prepare-docker-image
- Add skip.docker.build property (default: true) to control execution
- Update README with Maven integration instructions
TIKA-4578: Pass all environment variables from Maven to docker-build.sh
- Pass MULTI_ARCH, AWS_REGION, AWS_ACCOUNT_ID, AZURE_REGISTRY_NAME, DOCKER_ID, PROJECT_NAME, RELEASE_IMAGE_TAG
- Update README with comprehensive Maven + env var examples
- Enable full control: MULTI_ARCH=false DOCKER_ID=ndipiazza PROJECT_NAME=tika-grpc RELEASE_IMAGE_TAG=4.0.0-SNAPSHOT mvn clean package -Dskip.docker.build=false
- Clean up duplicate examples in README
Pull request overview
This PR adds Docker image build capabilities to the tika-grpc module, enabling automated builds during the Maven package phase. The implementation supports pushing images to Docker Hub, AWS ECR, and Azure Container Registry, with optional multi-architecture build support.
Key changes:
- Maven integration via exec-maven-plugin to trigger Docker builds during package phase
- Shell scripts for Docker image building and container startup
- Profile-based activation using environment variables (DOCKER_ID, AWS_ACCOUNT_ID, AZURE_REGISTRY_NAME)
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| tika-grpc/pom.xml | Adds Maven profiles and exec-maven-plugin configuration to trigger Docker builds based on environment variables |
| tika-grpc/docker-build/docker-build.sh | Shell script that assembles Docker context, handles registry authentication, and executes docker build with appropriate tags |
| tika-grpc/docker-build/start-tika-grpc.sh | Container entrypoint script that configures and starts the Tika gRPC server with environment-based settings |
| tika-grpc/docker-build/Dockerfile | Defines Ubuntu-based image with Java, Tesseract OCR, GDAL, and font support for the Tika gRPC server |
| tika-grpc/docker-build/README.md | Comprehensive documentation covering build options, environment variables, and usage examples |
| tika-grpc/README.md | Updated main README with quick start documentation for Docker builds |
Comments suppressed due to low confidence (1)
tika-grpc/docker-build/docker-build.sh:95
- The buildx builder 'tikabuilder' is created but may already exist from a previous run, which will cause the 'docker buildx create' command to fail. Add the '--use' flag or check if the builder exists first, or use 'docker buildx create --name tikabuilder --driver docker-container --bootstrap || true' to avoid errors on re-runs.
docker buildx create --name tikabuilder
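The "check if the builder exists first" option from the suppressed comment could look roughly like the sketch below. It is not taken from docker-build.sh; the function name `ensure_builder` is hypothetical, and the sketch relies only on `docker buildx inspect` exiting non-zero when the named builder is missing.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): reuse an existing buildx builder
# or create it, so that re-running the build script does not fail.
ensure_builder() {
  local name="$1"
  # `docker buildx inspect NAME` exits non-zero when NAME does not exist.
  if docker buildx inspect "$name" >/dev/null 2>&1; then
    # Builder is left over from a previous run: just select it.
    docker buildx use "$name"
  else
    # First run: create the builder and select it in one step.
    docker buildx create --name "$name" --driver docker-container --use
  fi
}

# Usage (requires Docker with the buildx plugin):
#   ensure_builder tikabuilder
```

Compared with appending `|| true` to `docker buildx create`, this keeps real creation failures visible instead of swallowing them.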
On Tue, Dec 16, 2025 at 1:44 PM Nicholas DiPiazza wrote:
https://issues.apache.org/jira/browse/TIKA-4578
This adds the ability to run builds like this:
MULTI_ARCH=false \
DOCKER_ID=ndipiazza \
PROJECT_NAME=tika-grpc \
RELEASE_IMAGE_TAG=4.0.0-SNAPSHOT \
mvn package -DskipTests -f tika-grpc
View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/s064ofjilbw5g09r82mtlb0mj
===================================================================================================
Done running docker build with tag -t ndipiazza/tika-grpc:4.0.0-SNAPSHOT
===================================================================================================
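The tag in the output above is consistent with composing `DOCKER_ID/PROJECT_NAME:RELEASE_IMAGE_TAG`. A minimal sketch of how such tags might be derived from the environment variables follows. The function `image_tags` is hypothetical and not copied from docker-build.sh; the ECR and ACR hostnames use the standard registry naming of those services, and the `tika-grpc` default for `PROJECT_NAME` is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print one image tag per configured
# registry, based on the environment variables named in this pull request.
image_tags() {
  local project="${PROJECT_NAME:-tika-grpc}"   # default is an assumption
  local tag="${RELEASE_IMAGE_TAG:-}"
  if [ -z "$tag" ]; then
    echo "ERROR: RELEASE_IMAGE_TAG must be set" >&2
    return 1
  fi
  # Docker Hub: <docker id>/<project>:<tag>
  if [ -n "${DOCKER_ID:-}" ]; then
    echo "${DOCKER_ID}/${project}:${tag}"
  fi
  # AWS ECR: <account>.dkr.ecr.<region>.amazonaws.com/<project>:<tag>
  if [ -n "${AWS_ACCOUNT_ID:-}" ] && [ -n "${AWS_REGION:-}" ]; then
    echo "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${project}:${tag}"
  fi
  # Azure Container Registry: <registry>.azurecr.io/<project>:<tag>
  if [ -n "${AZURE_REGISTRY_NAME:-}" ]; then
    echo "${AZURE_REGISTRY_NAME}.azurecr.io/${project}:${tag}"
  fi
}
```

With only `DOCKER_ID=ndipiazza`, `PROJECT_NAME=tika-grpc`, and `RELEASE_IMAGE_TAG=4.0.0-SNAPSHOT` set, the function prints `ndipiazza/tika-grpc:4.0.0-SNAPSHOT`, matching the tag reported in the build output.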
------------------------------
Commit Summary
- 08df7e7 TIKA-4578: Add Docker build configuration for tika-grpc
- ffb8a42 TIKA-4578: Integrate Docker build into Maven lifecycle
- ffbe00a TIKA-4578: Pass all environment variables from Maven to docker-build.sh
- daea452 TIKA-4578 - Add profiles to enable Docker build for AWS, Azure, and Docker Hub
- 75d1208 TIKA-4578 - Add profiles to enable Docker build for AWS, Azure, and Docker Hub
File Changes (6 files: https://github.com/apache/tika/pull/2462/files)
- *M* tika-grpc/README.md (56) https://github.com/apache/tika/pull/2462/files#diff-8eb1dd37bbabb938fa86e44aae1037caea81daa9f0de2214bc515871424882f6
- *A* tika-grpc/docker-build/Dockerfile (39) https://github.com/apache/tika/pull/2462/files#diff-e19d5934b994905dcd957ddaf3e1948fa19dfd05e165d2a0440ae2bf08598593
- *A* tika-grpc/docker-build/README.md (170) https://github.com/apache/tika/pull/2462/files#diff-950b4c55c8551c3951b039259629270e3a6a4f41e0c6ab7198ebaaf2b4e36c87
- *A* tika-grpc/docker-build/docker-build.sh (113) https://github.com/apache/tika/pull/2462/files#diff-3665f4ba56a8416c1010c9658624c9c95e67c97672fa70ad321829e70722e217
- *A* tika-grpc/docker-build/start-tika-grpc.sh (29) https://github.com/apache/tika/pull/2462/files#diff-de3ec4b62dfe6e52a270160654a736246fcde8e488bb301d323006881844f476
- *M* tika-grpc/pom.xml (83) https://github.com/apache/tika/pull/2462/files#diff-8dd3f2e428e7f3f7d7fe38b6a26f0f494f2269b3215619da4ad5a6c9cdebb24b
Patch Links:
- https://github.com/apache/tika/pull/2462.patch
- https://github.com/apache/tika/pull/2462.diff
Security and Robustness:
- Pin base image to ubuntu:22.04 instead of ubuntu:latest for reproducible builds
- Pin tonistiigi/binfmt to specific digest (sha256:8de6f2dec...) to prevent supply chain attacks
- Add Docker installation check at script start with clear error message
- Add error handling for AWS/Azure authentication failures
- Add buildx builder cleanup (rm) after multi-arch build
Code Quality:
- Remove hardcoded VERSION default in Dockerfile ARG
- Change start-tika-grpc.sh shebang from /bin/sh to /bin/bash for consistency
- Merge duplicate exec-maven-plugin declarations into single plugin config
- Replace chmod exec with maven-antrun-plugin for cross-platform compatibility
- Add WARNING prefix to plugin skip messages for better visibility
Documentation:
- Clarify two build activation mechanisms (env vars vs -Dskip.docker.build)
- Update all examples to show environment variable activation (no -Dskip.docker.build needed)
- Add prerequisite step to build from root before Docker build
- Add note about multi-arch --push behavior requiring authentication
- Use -pl :tika-grpc -am in examples to build only required modules
- Remove unnecessary 'clean' from example commands
All 18 review comments addressed. Multi-arch builds now use pinned digest for security. Documentation clearly explains activation precedence.
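As an illustration of the installation check and authentication error handling listed above, here is a minimal sketch. The function names `require_cmd` and `ecr_login` are hypothetical and the code is not taken from docker-build.sh; it assumes the AWS CLI v2 command `aws ecr get-login-password` and the usual `docker login --password-stdin` pattern for ECR.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the PR) sketching the checks described above.

# Fail early with a clear message when a required tool is missing.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "ERROR: '$1' is not installed or not on PATH." >&2
    return 1
  fi
}

# Log in to AWS ECR and report authentication failures explicitly.
# Arguments: <aws account id> <aws region>
ecr_login() {
  local account="$1" region="$2" password
  local registry="${account}.dkr.ecr.${region}.amazonaws.com"
  # Fetch the token first so a failing `aws` call is not hidden by a pipe.
  if ! password="$(aws ecr get-login-password --region "$region")"; then
    echo "ERROR: could not obtain an ECR login password for ${region}." >&2
    return 1
  fi
  if ! printf '%s' "$password" | docker login --username AWS --password-stdin "$registry"; then
    echo "ERROR: docker login to ${registry} failed." >&2
    return 1
  fi
}

# Usage (requires Docker and the AWS CLI):
#   require_cmd docker || exit 1
#   ecr_login "$AWS_ACCOUNT_ID" "$AWS_REGION" || exit 1
```

Capturing the token in a variable before piping it avoids depending on `set -o pipefail` to notice a failed `aws` call.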
@tballison, I'm interested in what you think of this one.
The test job failed due to a flaky 500 error from the dependency server.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.
- Fix README example showing Maven property syntax instead of environment variable
- Add 'set -e' to docker-build.sh for automatic error handling
- Remove unnecessary '-r' flag from cp commands for files
- Add buildx builder existence check to avoid duplicate creation errors
- Add proper error handling for docker build commands with cleanup
- Add date comment for binfmt digest verification
- Pass VERSION build arg to docker build commands
- Move COPY commands after RUN in Dockerfile for better cache efficiency
- Add non-root user (tika) to run container for security
- Add TIKA_VERSION validation in start-tika-grpc.sh
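The TIKA_VERSION validation mentioned in the last item might be sketched as follows. The PR text does not say what the actual check in start-tika-grpc.sh looks like, so the function name `validate_tika_version` and the allowed character set are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical validation (not from the PR): reject an empty version or one
# containing characters outside letters, digits, dot, underscore, and hyphen.
validate_tika_version() {
  local version="${1:-}"
  case "$version" in
    "")
      echo "ERROR: TIKA_VERSION is not set." >&2
      return 1
      ;;
    *[!A-Za-z0-9._-]*)
      echo "ERROR: TIKA_VERSION contains unexpected characters: $version" >&2
      return 1
      ;;
  esac
}

# Usage at the top of an entrypoint script:
#   validate_tika_version "${TIKA_VERSION:-}" || exit 1
```

Failing before the JVM starts gives a clearer message than a missing-jar error caused by an empty or malformed version string.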