@nddipiazza
Contributor

https://issues.apache.org/jira/browse/TIKA-4578

This adds the ability to build the tika-grpc Docker image as part of the Maven build, for example:

MULTI_ARCH=false \
DOCKER_ID=ndipiazza \
PROJECT_NAME=tika-grpc \
RELEASE_IMAGE_TAG=4.0.0-SNAPSHOT \
mvn package -DskipTests -f tika-grpc
View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/s064ofjilbw5g09r82mtlb0mj
 ===================================================================================================
 Done running docker build with tag -t ndipiazza/tika-grpc:4.0.0-SNAPSHOT
 ===================================================================================================

Nicholas DiPiazza added 5 commits December 16, 2025 13:50
- Add Dockerfile with Ubuntu base, Java 17, OCR, and font support
- Add docker-build.sh script to build and optionally push images
- Add start-tika-grpc.sh entrypoint script
- Include all tika-pipes plugins (fetchers, emitters, iterators)
- Include parser packages (standard, extended, ML, scientific, sqlite3, NLP)
- Add README with usage instructions and examples
- Support multi-arch builds and multiple registries (Docker Hub, ECR, ACR)
- Add two exec-maven-plugin configurations matching reference pattern
- First plugin: runs TikaGrpcServer (exec:java)
- Second plugin: chmod and docker-build.sh executions
- Add validate phase execution to chmod +x the docker-build.sh script
- Add package phase execution to run prepare-docker-image
- Add skip.docker.build property (default: true) to control execution
- Update README with Maven integration instructions
- Pass MULTI_ARCH, AWS_REGION, AWS_ACCOUNT_ID, AZURE_REGISTRY_NAME, DOCKER_ID, PROJECT_NAME, RELEASE_IMAGE_TAG
- Update README with comprehensive Maven + env var examples
- Enable full control: MULTI_ARCH=false DOCKER_ID=ndipiazza PROJECT_NAME=tika-grpc RELEASE_IMAGE_TAG=4.0.0-SNAPSHOT mvn clean package -Dskip.docker.build=false
- Clean up duplicate examples in README
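
For readers following along, here is a minimal sketch of how the environment variables listed above could map to image references for the three registries. `image_refs` is a hypothetical helper, not code taken from docker-build.sh; the ECR and ACR host names follow the standard `<account>.dkr.ecr.<region>.amazonaws.com` and `<registry>.azurecr.io` patterns.

```shell
#!/bin/sh
# Sketch only: image_refs is a hypothetical helper, not part of this PR.
# Prints one image reference per registry that has its variables set.
image_refs() {
    if [ -z "${PROJECT_NAME:-}" ] || [ -z "${RELEASE_IMAGE_TAG:-}" ]; then
        echo "ERROR: PROJECT_NAME and RELEASE_IMAGE_TAG must be set" >&2
        return 1
    fi
    # Docker Hub: <user>/<name>:<tag>
    if [ -n "${DOCKER_ID:-}" ]; then
        echo "${DOCKER_ID}/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}"
    fi
    # AWS ECR: <account>.dkr.ecr.<region>.amazonaws.com/<name>:<tag>
    if [ -n "${AWS_ACCOUNT_ID:-}" ] && [ -n "${AWS_REGION:-}" ]; then
        echo "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}"
    fi
    # Azure Container Registry: <registry>.azurecr.io/<name>:<tag>
    if [ -n "${AZURE_REGISTRY_NAME:-}" ]; then
        echo "${AZURE_REGISTRY_NAME}.azurecr.io/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}"
    fi
}
```

With `DOCKER_ID=ndipiazza`, `PROJECT_NAME=tika-grpc`, and `RELEASE_IMAGE_TAG=4.0.0-SNAPSHOT` and no cloud registry variables set, this prints `ndipiazza/tika-grpc:4.0.0-SNAPSHOT`, the same tag shown in the build output in the PR description.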
Copilot AI left a comment

Pull request overview

This PR adds Docker image build capabilities to the tika-grpc module, enabling automated builds during the Maven package phase. The implementation supports pushing images to Docker Hub, AWS ECR, and Azure Container Registry, with optional multi-architecture build support.

Key changes:

  • Maven integration via exec-maven-plugin to trigger Docker builds during package phase
  • Shell scripts for Docker image building and container startup
  • Profile-based activation using environment variables (DOCKER_ID, AWS_ACCOUNT_ID, AZURE_REGISTRY_NAME)

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 18 comments.

| File | Description |
| --- | --- |
| tika-grpc/pom.xml | Adds Maven profiles and exec-maven-plugin configuration to trigger Docker builds based on environment variables |
| tika-grpc/docker-build/docker-build.sh | Shell script that assembles Docker context, handles registry authentication, and executes docker build with appropriate tags |
| tika-grpc/docker-build/start-tika-grpc.sh | Container entrypoint script that configures and starts the Tika gRPC server with environment-based settings |
| tika-grpc/docker-build/Dockerfile | Defines Ubuntu-based image with Java, Tesseract OCR, GDAL, and font support for the Tika gRPC server |
| tika-grpc/docker-build/README.md | Comprehensive documentation covering build options, environment variables, and usage examples |
| tika-grpc/README.md | Updated main README with quick start documentation for Docker builds |

Comments suppressed due to low confidence (1)

tika-grpc/docker-build/docker-build.sh:95

  • The buildx builder 'tikabuilder' is created but may already exist from a previous run, which will cause the 'docker buildx create' command to fail. Add the '--use' flag or check if the builder exists first, or use 'docker buildx create --name tikabuilder --driver docker-container --bootstrap || true' to avoid errors on re-runs.
  docker buildx create --name tikabuilder
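
One way to do the existence check suggested in this comment is sketched below. `ensure_builder` is a hypothetical function name, not something from docker-build.sh; it relies only on `docker buildx inspect`, `docker buildx use`, and `docker buildx create`, and assumes that `inspect` exits non-zero when the named builder does not exist.

```shell
#!/bin/sh
# Sketch only: ensure_builder is a hypothetical helper, not part of this PR.
# $1: builder name (for example, tikabuilder)
ensure_builder() {
    if docker buildx inspect "$1" >/dev/null 2>&1; then
        # The builder is left over from a previous run: reuse it.
        docker buildx use "$1"
    else
        # No such builder yet: create it and make it the current one.
        docker buildx create --name "$1" --driver docker-container --use
    fi
}
```

Calling `ensure_builder tikabuilder` before the multi-arch build should make re-runs idempotent, since `docker buildx create` is only reached when the builder is missing.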

@Ivan-Radovanovic
Ivan-Radovanovic commented Dec 17, 2025 via email

Security and Robustness:
- Pin base image to ubuntu:22.04 instead of ubuntu:latest for reproducible builds
- Pin tonistiigi/binfmt to specific digest (sha256:8de6f2dec...) to prevent supply chain attacks
- Add Docker installation check at script start with clear error message
- Add error handling for AWS/Azure authentication failures
- Add buildx builder cleanup (rm) after multi-arch build

Code Quality:
- Remove hardcoded VERSION default in Dockerfile ARG
- Change start-tika-grpc.sh shebang from /bin/sh to /bin/bash for consistency
- Merge duplicate exec-maven-plugin declarations into single plugin config
- Replace chmod exec with maven-antrun-plugin for cross-platform compatibility
- Add WARNING prefix to plugin skip messages for better visibility

Documentation:
- Clarify two build activation mechanisms (env vars vs -Dskip.docker.build)
- Update all examples to show environment variable activation (no -Dskip.docker.build needed)
- Add prerequisite step to build from root before Docker build
- Add note about multi-arch --push behavior requiring authentication
- Use -pl :tika-grpc -am in examples to build only required modules
- Remove unnecessary 'clean' from example commands

All 18 review comments addressed.
Multi-arch builds now use pinned digest for security.
Documentation clearly explains activation precedence.
@nddipiazza requested a review from tballison on December 17, 2025 at 21:28
@nddipiazza
Contributor Author

@tballison I'm interested in what you think of this one.
Once releases are done, we would want to add a GitHub Action to publish the tika-grpc Docker image.

@nddipiazza
Contributor Author

The test job failed due to a flaky 500 error from the dependency server.

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.


- Fix README example showing Maven property syntax instead of environment variable
- Add 'set -e' to docker-build.sh for automatic error handling
- Remove unnecessary '-r' flag from cp commands for files
- Add buildx builder existence check to avoid duplicate creation errors
- Add proper error handling for docker build commands with cleanup
- Add date comment for binfmt digest verification
- Pass VERSION build arg to docker build commands
- Move COPY commands after RUN in Dockerfile for better cache efficiency
- Add non-root user (tika) to run container for security
- Add TIKA_VERSION validation in start-tika-grpc.sh
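
The "error handling for docker build commands with cleanup" item could look roughly like the sketch below. `build_and_cleanup` is a hypothetical function, not the code in docker-build.sh; it assumes the builder is disposable and should be removed whether or not the build succeeds.

```shell
#!/bin/sh
# Sketch only: build_and_cleanup is a hypothetical helper, not part of this PR.
# $1: builder name; remaining arguments are passed to 'docker buildx build'
build_and_cleanup() {
    bac_builder="$1"
    shift
    bac_status=0
    # Capture the exit code instead of letting 'set -e' abort before cleanup.
    docker buildx build --builder "$bac_builder" "$@" || bac_status=$?
    # Always remove the builder; a failed removal should not hide the build result.
    docker buildx rm "$bac_builder" >/dev/null 2>&1 || true
    return "$bac_status"
}
```

The `|| bac_status=$?` form keeps the function usable in a script that runs with `set -e`, while still returning the original build exit code to the caller.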