TIKA-4595: Dynamic Fetcher/Emitter Management with ConfigStore Support #2489
Merged
+1,575
−20
- Added SAVE_FETCHER, DELETE_FETCHER, LIST_FETCHERS, GET_FETCHER commands
- Added SAVE_EMITTER, DELETE_EMITTER, LIST_EMITTERS, GET_EMITTER commands
- Implemented PipesClient public API methods for runtime configuration
- Implemented PipesServer command handlers
- Added deleteComponent() and getComponentConfig() to AbstractComponentManager
- Added wrapper methods to FetcherManager and EmitterManager
- Added remove() method to ConfigStore interface and implementations
- All tests passing

- saveFetcher now calls both fetcherManager.saveFetcher() and pipesClient.saveFetcher()
- This ensures fetchers are available in the forked PipesServer process
- Implemented deleteFetcher to call both managers as well
- Fixes FetcherNotFoundException when using dynamic fetchers via gRPC

The issue was that fetchers saved via gRPC were only stored in the gRPC server's FetcherManager, but when pipesClient.process() forks a new PipesServer process, that process has its own FetcherManager and doesn't have access to the dynamically created fetchers. Now both are updated.
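The dual-update fix described in this commit can be pictured roughly as follows. This is only a sketch under stated assumptions: `FetcherSink`, `InMemoryFetcherSink`, and `DualFetcherRegistry` are hypothetical names invented for illustration, and the real FetcherManager and PipesClient signatures are not shown in this PR text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal view of anything that can hold fetcher configs.
// It stands in for both the local manager and the forked-process client.
interface FetcherSink {
    void saveFetcher(String id, String configJson);
    void deleteFetcher(String id);
    boolean hasFetcher(String id);
}

// In-memory stand-in used for both sides in this sketch.
class InMemoryFetcherSink implements FetcherSink {
    private final Map<String, String> configs = new HashMap<>();

    @Override
    public void saveFetcher(String id, String configJson) {
        configs.put(id, configJson);
    }

    @Override
    public void deleteFetcher(String id) {
        configs.remove(id);
    }

    @Override
    public boolean hasFetcher(String id) {
        return configs.containsKey(id);
    }
}

// Forwards every change to both targets so neither side misses a fetcher.
class DualFetcherRegistry {
    private final FetcherSink localManager;
    private final FetcherSink forkedClient;

    DualFetcherRegistry(FetcherSink localManager, FetcherSink forkedClient) {
        this.localManager = localManager;
        this.forkedClient = forkedClient;
    }

    void saveFetcher(String id, String configJson) {
        localManager.saveFetcher(id, configJson);
        forkedClient.saveFetcher(id, configJson);
    }

    void deleteFetcher(String id) {
        localManager.deleteFetcher(id);
        forkedClient.deleteFetcher(id);
    }
}
```

The sketch does not handle the case where the first call succeeds and the second fails, which a real implementation would likely need to consider.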
…rver
- Created FileBasedConfigStore that persists to JSON file
- Created FileBasedConfigStoreFactory with @extension annotation
- Updated PipesServer.initializeResources() to create and use ConfigStore
- Both gRPC server and forked PipesServer can now share fetcher configs via file

This enables dynamic fetcher management across JVM processes:
1. gRPC saves fetcher → writes to config file
2. PipesServer starts → reads from same file
3. Both JVMs share the same fetcher configuration
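The file-sharing idea in the three steps above can be sketched with two store instances pointing at the same path. This is not the PR's FileBasedConfigStore: the class name `FileBackedStoreSketch` is hypothetical, and `java.util.Properties` is swapped in for JSON purely to keep the example free of third-party dependencies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical file-backed store. The PR persists JSON; a Properties file
// is used here only so the sketch runs on the JDK alone.
class FileBackedStoreSketch {
    private final Path file;

    FileBackedStoreSketch(Path file) {
        this.file = file;
    }

    // Re-read the file on every call so writes from another instance
    // (or another JVM) become visible without shared memory.
    private Properties load() {
        Properties props = new Properties();
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                props.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return props;
    }

    // Rewrite the whole file after each change.
    private void store(Properties props) {
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "component configs");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void put(String id, String config) {
        Properties props = load();
        props.setProperty(id, config);
        store(props);
    }

    String get(String id) {
        return load().getProperty(id);
    }

    void remove(String id) {
        Properties props = load();
        props.remove(id);
        store(props);
    }
}
```

One instance can play the gRPC server and another the forked PipesServer. The sketch does no file locking, so concurrent writers could overwrite each other.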

- Added direct handling for 'file' type in ConfigStoreFactory.createConfigStore()
- File-based store is in core, not a plugin, so needs special handling
- Avoids ClassNotFoundException when trying to load 'file' as a class name
- Also added remove() method to IgniteConfigStore for interface compliance
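The special-casing described here follows a common pattern: check for built-in type names first, and only then treat the value as a class name. The sketch below illustrates that pattern with hypothetical names (`StoreSketch`, `FileStoreSketch`, `StoreFactorySketch`); it is not the actual ConfigStoreFactory code.

```java
// Hypothetical store type used only by this sketch.
interface StoreSketch {
    String kind();
}

class FileStoreSketch implements StoreSketch {
    @Override
    public String kind() {
        return "file";
    }
}

class StoreFactorySketch {
    static StoreSketch create(String type) {
        // Built-in type: handled directly and never passed to the class loader,
        // which is what avoids a ClassNotFoundException for the name "file".
        if ("file".equals(type)) {
            return new FileStoreSketch();
        }
        // Anything else is treated as a fully qualified class name.
        try {
            return (StoreSketch) Class.forName(type)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Unknown config store type: " + type, e);
        }
    }
}
```

With this shape, `StoreFactorySketch.create("file")` returns the built-in store, while an unrecognized name surfaces as an `IllegalArgumentException` that wraps the reflection failure.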
ExtensionConfig is sent over sockets between PipesClient and PipesServer, so it needs to implement Serializable. Records can implement Serializable and all fields (String, String, String) are already serializable. Fixes NotSerializableException when calling saveFetcher via gRPC.
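As a rough illustration of why this works, a record with three String components can be written to and read back from an object stream once it declares `Serializable`. The record name `ExtensionConfigSketch` and its component names are assumptions made for this example; the commit only states that the fields are three Strings.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for ExtensionConfig; component names are assumed.
record ExtensionConfigSketch(String id, String name, String json) implements Serializable {}

class SerializationRoundTrip {
    // Serialize to bytes and back, much as an object stream over a socket would.
    static ExtensionConfigSketch roundTrip(ExtensionConfigSketch config) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(config);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ExtensionConfigSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }
}
```

Without `implements Serializable`, the `writeObject` call would fail with `NotSerializableException`, which matches the error this commit reports fixing.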

- Created IgniteStoreServer class that runs as embedded server
- TikaGrpcServer starts Ignite server on startup (if ignite ConfigStore configured)
- IgniteConfigStore now acts as client-only (clientMode=true)
- No external Ignite dependency needed in Docker/Kubernetes
- Server runs in background daemon thread within tika-grpc process
- Clients (gRPC + forked PipesServer) connect to embedded server

Architecture:

┌─────────────────────────────────┐
│        tika-grpc Process        │
│  ┌──────────────────────────┐   │
│  │    IgniteStoreServer     │   │
│  │  (server mode, daemon)   │   │
│  └────────▲─────────────────┘   │
│           │                     │
│  ┌────────┴─────────────────┐   │
│  │    IgniteConfigStore     │   │
│  │      (client mode)       │   │
│  └──────────────────────────┘   │
└─────────────────────────────────┘
            ▲
            │ (client connection)
            │
   ┌────────┴─────────────────┐
   │   PipesServer (forked)   │
   │    IgniteConfigStore     │
   │      (client mode)       │
   └──────────────────────────┘

- Set workDirectory to /tmp/ignite-work in IgniteStoreServer
- Set workDirectory to /tmp/ignite-work in IgniteConfigStore
- Avoids 'Work directory does not exist and cannot be created: /work' error
- Uses system property ignite.work.dir if set, defaults to /tmp/ignite-work
- Ensures Ignite can write to work directory in Docker containers

- Changed from /tmp/ignite-work to /var/cache/tika/ignite-work
- Aligns with Tika's standard cache location
- /var/cache/tika is already used for plugins and other Tika cache data
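The work-directory lookup described in these two commits amounts to a system property with a fallback. A minimal sketch follows, assuming the property name `ignite.work.dir` and the later default `/var/cache/tika/ignite-work` from the commit text. The class name `IgniteWorkDirSketch` is hypothetical, and the step that passes the result to Ignite is omitted.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class IgniteWorkDirSketch {
    // Property name and default are taken from the commit messages above.
    static final String PROPERTY = "ignite.work.dir";
    static final String DEFAULT_DIR = "/var/cache/tika/ignite-work";

    // Use the system property when it is set, otherwise the default location.
    static Path resolve() {
        return Paths.get(System.getProperty(PROPERTY, DEFAULT_DIR));
    }
}
```

Launching the JVM with `-Dignite.work.dir=/some/writable/dir` would override the default, which can be useful in containers where `/var/cache/tika` is not writable.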

- Find Ignite plugin's classloader from plugin manager
- Load IgniteStoreServer and CacheMode using plugin classloader
- Fixes NoClassDefFoundError for H2 classes
- Ensures all Ignite dependencies (including H2) are available
- Plugin classloader has all dependencies from lib/ directory

- Set setPeerClassLoadingEnabled(true) in IgniteConfigStore
- Must match server configuration
- Fixes: Remote node has peer class loading enabled flag different from local
- Both server and client now have peerClassLoading=true

- Set setPeerClassLoadingEnabled(false) on both server and client
- Fixes ClassCastException due to class loaded by different classloaders
- Server uses plugin classloader, client uses app classloader
- Peer class loading causes the same class to be in both, creating conflicts
- We don't need peer class loading for our use case

- Made tika-pipes-ignite a required (non-optional) dependency of tika-grpc
- Added ignite.version and h2.version properties
- Removed reflection-based classloader lookup
- Direct instantiation of IgniteStoreServer
- Avoids all PF4J plugin classloader issues
- Ignite classes now on main classpath

- Set IGNITE_ENABLE_OBJECT_INPUT_FILTER_AUTOCONFIGURATION=false
- Fixes: Failed to autoconfigure Ignite Object Input Filter
- Ignite was conflicting with existing serialization filter
- Apply in both IgniteStoreServer and IgniteConfigStore

- Handle 'ignite' type directly in ConfigStoreFactory
- Load IgniteConfigStoreFactory via reflection
- Works in forked PipesServer without plugin system
- Matches pattern used for 'file' type
- Fixes: ClassNotFoundException: ignite in forked process
…at it's implemented
Overview
This PR implements dynamic fetcher, emitter, and pipe iterator management for Apache Tika Pipes through a new ConfigStore abstraction, enabling runtime configuration changes without restarts.
Key Features
1. Dynamic Configuration Management API
- saveFetcher(FetcherConfig) - Save fetcher at runtime
- deleteFetcher(String fetcherId) - Remove fetcher
- updateFetcher(String fetcherId, byte[] config) - Update existing fetcher
- saveEmitter(EmitterConfig) - Save emitter at runtime
- deleteEmitter(String emitterId) - Remove emitter
- updateEmitter(String emitterId, byte[] config) - Update existing emitter
- savePipeIterator(PipeIteratorConfig) - Save pipe iterator at runtime
- deletePipeIterator(String iteratorId) - Remove pipe iterator
- updatePipeIterator(String iteratorId, byte[] config) - Update existing pipe iterator
2. ConfigStore Abstraction
Built-in Implementations:
Storage Support:
3. Cross-JVM Configuration Sharing
4. gRPC API Integration
- SaveFetcher/DeleteFetcher/UpdateFetcher RPC endpoints
- SaveEmitter/DeleteEmitter/UpdateEmitter RPC endpoints
- SavePipeIterator/DeletePipeIterator/UpdatePipeIterator RPC endpoints
Implementation Details
FileBasedConfigStore
path parameter
IgniteConfigStore Enhancements
Configuration Examples
File-Based:
{
  "pipes": {
    "configStoreType": "file",
    "configStoreParams": "{\"path\": \"/tmp/tika-config-store.json\"}"
  }
}
Ignite:
{
  "pipes": {
    "configStoreType": "ignite",
    "configStoreParams": "{\"cacheName\": \"tika-config\", \"cacheMode\": \"REPLICATED\"}",
    "forkedJvmArgs": [
      "--add-opens=java.base/java.nio=ALL-UNNAMED",
      "--add-opens=java.base/java.util=ALL-UNNAMED",
      "--add-opens=java.base/java.lang=ALL-UNNAMED",
      "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
      "-DIGNITE_ENABLE_OBJECT_INPUT_FILTER_AUTOCONFIGURATION=false"
    ]
  }
}
Testing
E2E Tests Added
-Dcorpa.numdocs=N
Both tests verify:
Backward Compatibility
✅ Fully backward compatible
configStoreType
Related Issues
Fixes: TIKA-4595
Migration Guide
For users wanting dynamic component management:
File-based (recommended for single-instance):
Add configStoreType: file to pipes config
Ignite (for multi-instance/distributed):
configStoreType: ignite
Files Changed
Performance Impact