feat(backend): Replace MLMD with KFP Server APIs by HumairAK · Pull Request #12430 · kubeflow/pipelines

HumairAK · 2025-11-10T22:10:09Z

Description of your changes:

This PR removes MLMD as per the KEP here

Resolves: #11760

Overview

Core Change: Replaced MLMD (ML Metadata) service with direct database storage via KFP API server.

This is a major architectural shift that eliminates the external ML Metadata service dependency and consolidates all artifact and task metadata operations directly into the KFP API server with MySQL/database backend.

NOTE: Migration and UI changes will follow this PR. UI will be in a broken state until then. The UI change is a blocker to merge the mlmd-removal branch to master.

Components Removed

MLMD Service Infrastructure

metadata-writer component (backend/metadata_writer/)
- Python-based service that wrote execution metadata to MLMD
- Dockerfile and all source code removed
metadata-grpc deployment
- MLMD gRPC service and envoy proxy
- All Kustomize manifests and configurations removed
- DNS configuration patches removed from all deployment variants
MLMD Client Library (backend/src/v2/metadata/)
- ~1,800 lines of Go client code removed
- Client, converter, and test utilities deleted

Deployment Changes

Removed from all Kustomization variants (standalone, multiuser, kubernetes-native)
Removed metadata-writer from CI image builds
Removed metadata service from proxy NO_PROXY configurations
Removed metadata-grpc port forwarding from integration test workflows

Components Added

New API Layer

Artifact Service API (`backend/api/v2beta1/artifact.proto`)

CRUD Operations:
- CreateArtifact - Create single artifact
- GetArtifact - Retrieve artifact by ID
- ListArtifacts - Query artifacts with filtering
- BatchCreateArtifacts - Bulk artifact creation
Artifact Task Operations:
- CreateArtifactTask - Track artifact usage in tasks
- ListArtifactTasks - Query artifact-task relationships
- BatchCreateArtifactTasks - Bulk task-artifact linking
Generated Clients:
- Go HTTP client (~4,000 lines)
- Python HTTP client (~3,500 lines)
- Swagger documentation

Extended Run Service API (`backend/api/v2beta1/run.proto`)

New Task Endpoints:
- CreateTask - Create pipeline task execution record
- GetTask - Retrieve task details
- ListTasks - Query tasks with filtering
- UpdateTask - Update task status/metadata
- BatchUpdateTasks - Efficient bulk task updates
ViewMode Feature:
- BASIC - Minimal response (IDs, status, timestamps)
- RUNTIME_ONLY - Include runtime details without full spec
- FULL - Complete task/run details with spec
- Reduces payload size for list operations by 80%+
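
As a rough illustration of how a view mode can shrink list responses, here is a minimal sketch. The `ViewMode` constants, the `Task` struct, and `ApplyViewMode` are hypothetical names chosen for this example; they are not the generated proto types from this PR.

```go
package main

import "fmt"

// ViewMode mirrors the three levels named above. The Go names are illustrative only.
type ViewMode int

const (
	ViewBasic       ViewMode = iota // IDs, status, timestamps
	ViewRuntimeOnly                 // adds runtime details, omits the spec
	ViewFull                        // everything, including the spec
)

// Task is a hypothetical, trimmed-down task record.
type Task struct {
	ID      string
	State   string
	PodName string // runtime detail
	Spec    string // potentially large pipeline spec
}

// ApplyViewMode returns a copy of task with the fields outside mode cleared.
func ApplyViewMode(task Task, mode ViewMode) Task {
	out := Task{ID: task.ID, State: task.State}
	if mode >= ViewRuntimeOnly {
		out.PodName = task.PodName
	}
	if mode >= ViewFull {
		out.Spec = task.Spec
	}
	return out
}

func main() {
	task := Task{ID: "t1", State: "SUCCEEDED", PodName: "pod-a", Spec: "large spec"}
	fmt.Printf("%+v\n", ApplyViewMode(task, ViewBasic))
	fmt.Printf("%+v\n", ApplyViewMode(task, ViewRuntimeOnly))
}
```

The payload saving comes from never serializing the large fields, so the actual reduction depends on how big the spec is relative to the rest of the record.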

Storage Layer

Artifact Storage (`backend/src/apiserver/storage/artifact_store.go`)

Direct MySQL table for artifacts
Stores: name, URI, type, metadata, custom properties
Supports filtering by run_id, task_name, artifact_type
~300 lines with comprehensive test coverage

Artifact Task Store (`backend/src/apiserver/storage/artifact_task_store.go`)

Junction table linking artifacts to tasks
Tracks: IO type (input/output), producer task, artifact metadata
Bulk insert optimization for batch operations
~400 lines with test coverage

Enhanced Task Store (`backend/src/apiserver/storage/task_store.go`)

Expanded from ~500 to ~1,400 lines
Added task state tracking (PENDING, RUNNING, SUCCEEDED, FAILED, etc.)
Input/output artifact and parameter tracking
Pod information (name, namespace, type)
Batch update support for efficient status synchronization

API Server Implementation

Artifact Server (`backend/src/apiserver/server/artifact_server.go`)

Implements all artifact service endpoints
Request validation and conversion
Pagination support for list operations
~600 lines with 1,000+ lines of tests

Extended Run Server (`backend/src/apiserver/server/run_server.go`)

Added task CRUD operation handlers
ViewMode implementation for optimized responses
Batch task update endpoint
~350 lines of new code with comprehensive tests

Client Infrastructure

KFP API Client (`backend/src/v2/apiclient/`)

New client package for driver/launcher to call API server
OAuth2/OIDC authentication support
Retry logic and error handling
Mock implementation for testing
~800 lines total

Driver/Launcher Refactoring

Parameter/Artifact Resolution (`backend/src/v2/driver/resolver/`)

Extracted resolution logic from monolithic resolve.go (~1,100 lines removed)
New focused modules:
- parameters.go - Parameter resolution (~560 lines)
- artifacts.go - Artifact resolution (~314 lines)
- resolve.go - Orchestration (~90 lines)
Improved testability and maintainability

Driver Changes (`backend/src/v2/driver/`)

Removed MLMD client dependency
Added KFP API client for task/artifact operations
Refactored execution flow to use API server
Container/DAG execution updated for new storage model

Launcher Changes (`backend/src/v2/cmd/launcher-v2/`)

Replaced MLMD calls with API server calls
Uses batch updater for efficient status reporting
Artifact publishing through artifact API

Batch Updater (`backend/src/v2/component/batch_updater.go`)

Efficient batching mechanism for task updates
Reduces API calls during execution
Configurable batch size and flush intervals
~250 lines with interfaces for testing
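
The batching idea can be sketched as follows. This is not the code in `batch_updater.go`; `Batcher`, `TaskUpdate`, and the `send` callback are assumptions made for illustration, and the time-based flush is left to the caller (for example a `time.Ticker` that calls `Flush`).

```go
package main

import (
	"fmt"
	"sync"
)

// TaskUpdate is a hypothetical status update for one task.
type TaskUpdate struct {
	TaskID string
	State  string
}

// Batcher queues updates and hands them to send in groups of at most size.
type Batcher struct {
	mu      sync.Mutex
	size    int
	pending []TaskUpdate
	send    func([]TaskUpdate) error
}

// NewBatcher builds a Batcher that flushes automatically once size updates are queued.
func NewBatcher(size int, send func([]TaskUpdate) error) *Batcher {
	return &Batcher{size: size, send: send}
}

// Queue adds an update and flushes when the batch is full.
func (b *Batcher) Queue(u TaskUpdate) error {
	b.mu.Lock()
	b.pending = append(b.pending, u)
	full := len(b.pending) >= b.size
	b.mu.Unlock()
	if full {
		return b.Flush()
	}
	return nil
}

// Flush sends whatever is pending. It is a no-op when the queue is empty.
func (b *Batcher) Flush() error {
	b.mu.Lock()
	batch := b.pending
	b.pending = nil
	b.mu.Unlock()
	if len(batch) == 0 {
		return nil
	}
	return b.send(batch)
}

func main() {
	calls := 0
	b := NewBatcher(2, func(batch []TaskUpdate) error {
		calls++
		fmt.Printf("call %d: %d updates\n", calls, len(batch))
		return nil
	})
	for _, id := range []string{"a", "b", "c"} {
		_ = b.Queue(TaskUpdate{TaskID: id, State: "RUNNING"})
	}
	_ = b.Flush() // send the remainder before exiting
}
```

Three queued updates result in two calls to `send` instead of three, which is the API-call reduction the description refers to.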

Testing Infrastructure

Test Data Pipelines (`backend/src/v2/driver/test_data/`)

15+ new compiled pipeline YAMLs for integration testing:
- cache_test.yaml - Cache hit/miss scenarios
- componentInput.yaml - Input parameter testing
- k8s_parameters.yaml - Kubernetes-specific features
- oneof_simple.yaml - Conditional execution
- nested_naming_conflicts.yaml - Name resolution edge cases
- Loop iteration scenarios
- Optional input handling
- And more...

Test Coverage

Storage layer: ~650 lines of tests for artifact/task stores
API server: ~1,700 lines of tests for artifact/run servers
Driver: ~1,400 lines of new integration tests
Setup utilities: ~900 lines of test infrastructure

Utility Additions

Scope Path (`backend/src/common/util/scope_path.go`)

Hierarchical DAG navigation for nested pipelines
Tracks execution context through task hierarchy
Used for parameter/artifact resolution
~230 lines with tests
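
One way to picture hierarchical navigation through nested DAGs is a stack of task names. The sketch below is an assumption about the general shape of such a helper, not the contents of `scope_path.go`; `ScopePath`, `Push`, and `Parent` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// ScopePath is a hypothetical list of task names from the root DAG down to the current task.
type ScopePath struct {
	segments []string
}

// Push returns a new path with name appended; the receiver is left unchanged.
func (p ScopePath) Push(name string) ScopePath {
	next := make([]string, len(p.segments), len(p.segments)+1)
	copy(next, p.segments)
	return ScopePath{segments: append(next, name)}
}

// Parent returns the enclosing scope, or false when called on the empty path.
func (p ScopePath) Parent() (ScopePath, bool) {
	if len(p.segments) == 0 {
		return p, false
	}
	return ScopePath{segments: p.segments[:len(p.segments)-1]}, true
}

// String joins the segments with dots for logging and lookups.
func (p ScopePath) String() string { return strings.Join(p.segments, ".") }

func main() {
	inner := ScopePath{}.Push("root").Push("for-loop-1").Push("train")
	fmt.Println(inner)
	parent, _ := inner.Parent()
	fmt.Println(parent)
}
```

A resolver can walk `Parent()` upward until it finds the scope that defines a given parameter or artifact.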

Proto Helpers (`backend/src/common/util/proto_helpers.go`)

Conversion utilities for proto messages
Type-safe helpers for common operations
~44 lines

YAML Parser (`backend/src/common/util/yaml_parser.go`)

Pipeline spec parsing utilities
~108 lines

Key Behavioral Changes

Artifact Tracking

Before: Driver writes to MLMD via gRPC, launcher writes execution metadata via metadata-writer
After: Driver/launcher call artifact API endpoints directly, writes to MySQL

Task State Management

Before: State inferred from MLMD execution contexts
After: Explicit task records with status, pod info, I/O tracking in task_store

Performance Optimizations

ViewMode: List operations can request minimal data, reducing response size dramatically
Batch Updates: Task status updates batched to reduce API overhead
Direct DB Access: Eliminates gRPC hop to separate MLMD service

API Response Size

ListRuns with VIEW_MODE=DEFAULT: ~80% smaller payloads
Improves UI responsiveness for pipeline listing

Migration Considerations

Database Schema

New tables: artifacts, artifact_tasks
Extended tasks table with new columns
Proto test golden files updated to reflect new response formats

Backwards Compatibility

API endpoints maintain backward compatibility
Existing pipeline specs continue to work
No changes required to user-facing SDK

Deployment

Simpler deployment (2 fewer services)
Reduced resource requirements (no metadata-grpc, metadata-writer pods)
Fewer network policies needed

Testing Strategy

Unit Tests

Comprehensive coverage for all new storage/server components
Mock implementations for API client
Isolated testing of resolver logic

Integration Tests

15+ compiled test pipelines covering edge cases
Driver integration tests with real Kubernetes API server
Task/artifact lifecycle validation

Golden File Updates

Proto test golden files regenerated
Reflects new API response structure

Files Changed Summary

Total files changed: ~550
Lines added: ~50,000
Lines removed: ~15,000
Net addition: ~35,000 (mostly generated client code and tests)

Breakdown

Generated API clients (Go/Python): ~15,000 lines
Test code and test data: ~10,000 lines
Storage layer implementation: ~2,000 lines
API server implementation: ~1,500 lines
Driver/launcher refactoring: ~1,000 lines
Removed MLMD code: ~15,000 lines

Risks & Considerations

Testing

Extensive test coverage added
Integration tests validate end-to-end flows
Proto compatibility tests ensure API stability

Performance

Direct database access should be faster than gRPC → MLMD → DB
Batch updates reduce API call overhead
ViewMode optimization for large lists

Operational

Simpler deployment reduces operational complexity
Fewer moving parts = fewer failure modes
All metadata operations auditable through API server logs

Recommended Follow-up

Monitor database performance under load with new artifact tables
Consider adding database indexes if artifact queries become slow
Document migration path for existing MLMD data (if applicable)
Update deployment documentation to reflect MLMD removal
Performance benchmarking comparing MLMD vs. direct storage

Conclusion

This is an architectural improvement that:

Reduces system complexity
Improves maintainability
Maintains API compatibility
Includes comprehensive testing
Simplifies deployment

Checklist:

You have signed off your commits
The title for your pull request (PR) should follow our title convention. Learn more about the pull request title convention used in this repository.

google-oss-prow · 2025-11-10T22:10:22Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from humairak. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

HumairAK · 2025-11-12T20:15:28Z

Upgrade test failures are expected until we add migration logic (to follow this PR). Note also that UI changes are not included here; those will follow this PR as well.

CarterFendley · 2025-11-14T20:05:15Z

First off, this is amazing! Not sure where you find the time 😂

A couple questions because this overlaps with an area of interest. My understanding is that this PR is reporting / updating the status of tasks (components) directly from the launcher such as here. So to check my understanding, this means that we are moving completely away from the persistence agent, correct? I have been running into issues with the persistence agent at scale & with short lived workflows so I am excited about new approaches.

Secondly, I see the added RPCs to update task state. Are these the counterpart to the ones used by the V1 persistence agent to populate the tasks table here? If this is the case, should we remove the V2 equivalent, which, unless I am mistaken, seems to be currently unused (even before this PR)?

droctothorpe · 2025-11-16T14:49:15Z

Insanely impressive, @HumairAK! I look forward to going through it in-depth.

Please let us know if there are any specific areas you want us to sequence first / prioritize with our reviews.

> Document migration path for existing MLMD data (if applicable)

^ This will be critical for existing workloads.

HumairAK · 2025-11-18T17:20:49Z

@CarterFendley

For your first point, the persistence agent (PA) is still required to report the overall status of the Run. It monitors the Argo WF resource, and we still require this to report on failures not encountered during driver/launcher runs (e.g. pod scheduling failures). So we still require external monitoring of a run. I will also be moving the update status propagation logic to the API server in this PR after some offline discussions with Matt/Nelesh.

For your second point, the tasks table in v1 is being removed; it is only used for caching today and is not utilized by any other APIs. It is a bit abused and part of an incomplete implementation of a different approach that was intended by previous maintainers. As such, this change will be part of the next KFP major version bump (3.0). All the data required for KFP runs in the tasks table is persisted in MLMD, and we can use this for migration (namely just cache fingerprints).

> Please let us know if there are any specific areas you want us to sequence first / prioritize with our reviews.

@droctothorpe as per our discussion today, I would suggest you review the higher level changes first, e.g. Proto files, Gorm Models, Authorization and related changes - consideration for things like migration etc.

HumairAK

per discussion with @mprahl / @nsingla

HumairAK · 2025-11-18T17:25:02Z

+  rpc UpdateTasksBulk(UpdateTasksBulkRequest) returns (UpdateTasksBulkResponse) {
+    option (google.api.http) = {
+      post: "/apis/v2beta1/tasks:batchUpdate"
+      body: "*"
+    };
+    option (grpc.gateway.protoc_gen_openapiv2.options.openapiv2_operation) = {
+      operation_id: "batch_update_tasks"
+      summary: "Updates multiple tasks in bulk."
+      tags: "RunService"
+    };
+  }


Get rid of bulk operations, make individual calls to update status from launcher/driver, and move status/artifact propagation into the API server. There is concern around race conditions, so we will need to update tasks in this order.

For an update task request:

1. Update the task.
2. Fetch the run.
3. Propagate statuses up the DAG.
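
The ordering above can be sketched with an in-memory stand-in. Everything here is hypothetical: `Run`, `Task`, `UpdateTask`, and in particular the `aggregate` rule (any failed child fails the parent, all succeeded children succeed it) are assumptions for illustration, not the API server's actual logic. A single mutex stands in for whatever transaction or locking the real server would use to avoid the race.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical task record; ParentID is empty for the root DAG.
type Task struct {
	ID, ParentID, State string
}

// Run is an in-memory stand-in for a run and its tasks.
type Run struct {
	mu    sync.Mutex
	tasks map[string]*Task
}

// aggregate derives a parent's state from its children (assumed rule).
func aggregate(children []*Task) string {
	allSucceeded := true
	for _, c := range children {
		if c.State == "FAILED" {
			return "FAILED"
		}
		if c.State != "SUCCEEDED" {
			allSucceeded = false
		}
	}
	if allSucceeded {
		return "SUCCEEDED"
	}
	return "RUNNING"
}

// UpdateTask performs the three steps under one lock so that concurrent
// updates cannot interleave between the write and the propagation.
func (r *Run) UpdateTask(id, state string) error {
	r.mu.Lock()
	defer r.mu.Unlock()

	task, ok := r.tasks[id]
	if !ok {
		return fmt.Errorf("task %q not found", id)
	}
	task.State = state // step 1: update the task

	// Steps 2 and 3: read the run's current tasks and recompute each ancestor.
	for parentID := task.ParentID; parentID != ""; {
		parent, ok := r.tasks[parentID]
		if !ok {
			return fmt.Errorf("parent %q not found", parentID)
		}
		var children []*Task
		for _, c := range r.tasks {
			if c.ParentID == parentID {
				children = append(children, c)
			}
		}
		parent.State = aggregate(children)
		parentID = parent.ParentID
	}
	return nil
}

func main() {
	run := &Run{tasks: map[string]*Task{
		"root": {ID: "root", State: "RUNNING"},
		"a":    {ID: "a", ParentID: "root", State: "RUNNING"},
		"b":    {ID: "b", ParentID: "root", State: "RUNNING"},
	}}
	_ = run.UpdateTask("a", "SUCCEEDED")
	fmt.Println("root after a:", run.tasks["root"].State)
	_ = run.UpdateTask("b", "SUCCEEDED")
	fmt.Println("root after b:", run.tasks["root"].State)
}
```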

So is UpdateTasksBulk to be removed?

We discussed this offline and this can be done in a follow up PR.

HumairAK · 2025-11-18T17:59:45Z

Instead of aggregating to default roles, create a new SA for driver/launcher to use when making calls to the API server. Have sync.py ensure that this SA and the required RBAC are created in Kubeflow profiles.

CarterFendley · 2025-11-18T23:47:28Z

Thanks for the response @HumairAK!

> PA is still required to report the overall status of the Run... we still require this to report on failures not encountered during driver/launcher runs (e.g. pod schedule failures, etc.)

Interesting, good to know!

> the tasks table in v1 is being removed; it is only used for caching today and is not utilized by any other APIs.

The other place I have seen it used previously was in the task_details attribute of the GetRun API return (see here). Looks like this will be replaced in your PR.

> I will also be moving the update status propagation logic to the api server in this PR

That part went over my head lol.

So I am mostly concerned with the ability to get run / component information (status / runtime) primarily through the SDK. At the moment this depends on the PA (only partially for V2) and is why I am asking about these components. As mentioned, I have noticed some instability when handling many workflows. Since you expect the PA to exist in V3 too, want to make sure we are able to scale that properly.

Since I do not know the timeline for V3, maybe it is worthwhile implementing something in V2 to help us with this in the meantime. Potentially building in some metrics and suggested scaling behavior of the PA deployment or similar. Any suggestions where I should continue discussion on this? Any existing similar issues / threads you are familiar with?

HumairAK · 2025-11-20T14:34:31Z

@CarterFendley

> The other place I have seen it used previously was in the task_details attribute of the GetRun API return

It's used to populate the run details field for the runs object, but it's mostly just a copy of the run's associated Argo workflow status field (node statuses). We will likely drop this field in the next major version upgrade.

> Since you expect the PA to exist in V3 too, want to make sure we are able to scale that properly.

Our current intent is to get rid of the PA, as we see it as unnecessary overhead for merely reporting run status. Either we consolidate this logic into the KFP server, or we move it to a separate dedicated controller that uses controller-runtime. Either way, we'll certainly keep scalability in mind.

mprahl · 2025-11-20T19:22:50Z

GPT5.1 Codex review:

### Review Findings
1. **Artifact-task records always marked as plain outputs** 
   `CreateArtifact` and `CreateArtifactsBulk` ignore the `request.type` field and hardcode every `ArtifactTask` as `IOType_OUTPUT`, even when the caller explicitly sets `IOType_ITERATOR_OUTPUT` for loop iterations or other specialized output modes. This drops iterator semantics, so parent DAGs can no longer distinguish per-iteration outputs and downstream resolvers will treat every propagated artifact as a flat output. 
   ```go
   // backend/src/apiserver/server/artifact_server.go, lines 87-95
   artifactTask := &apiv2beta1.ArtifactTask{
     ArtifactId: artifact.UUID,
     TaskId:     task.UUID,
     RunId:      request.GetRunId(),
     Type:       apiv2beta1.IOType_OUTPUT,
     Producer:   producer,
     Key:        request.GetProducerKey(),
   }
   ```

   The same hardcoding occurs in the bulk path (artifactReq loop). The server should honor request.GetType() so iterator outputs and ONE_OF outputs survive.
2. **Namespace filtering in ListTasks uses only “get” permissions**
When tasks are listed by namespace (no run or parent filter) the API checks for Verb: get on the runs resource, not list. That means any subject that can “get” a single run in the namespace can enumerate all tasks in that namespace, even if they were denied list permission. This is a privilege escalation and breaks RBAC expectations—namespace-wide enumeration should require list.

resourceAttributes := &authorizationv1.ResourceAttributes{
  Namespace: namespace,
  Verb:      common.RbacResourceVerbGet,
  Group:     common.RbacPipelinesGroup,
  Version:   common.RbacPipelinesVersion,
  Resource:  common.RbacResourceTypeRuns,
}
err := s.resourceManager.IsAuthorized(ctx, resourceAttributes)

Please change Verb to common.RbacResourceVerbList (or split into a separate authorization path) so listing tasks adheres to Kubernetes RBAC semantics.
3. **End users lack RBAC to the new artifact APIs**
The aggregated “view/edit” cluster roles that profiles install still grant only pipelines, runs, experiments, etc. They don’t include the new artifacts resource, so any user who only has the standard aggregate-to-kubeflow-pipelines-view role will get 403s when the UI starts calling ListArtifacts/GetArtifact. Only the pipeline-runner SA can hit these endpoints today.

- apiGroups:
    - pipelines.kubeflow.org
  resources:
    - runs
  verbs:
    - get
    - list
    - readArtifact

Please extend the aggregated roles (both “view” and “edit” flavors) with resources: artifacts and appropriate verbs (get, list, maybe create where required) so non-admin users retain parity with the old MLMD-backed functionality.

Open Questions / Follow-ups

Should artifact creation also validate that the artifact namespace matches the associated task/run namespace? Right now the server trusts the client-provided namespace, which could leave artifacts in namespaces where the run owner has no access.

Suggested Next Steps

Fix the artifact-task type handling and add regression tests for iterator outputs.
Adjust the namespace authorization in ListTasks.
Update the RBAC manifests (and any profile-controller templating) so standard users gain the new permissions before the UI is switched over.
Let me know if you’d like a focused re-test after these are addressed.

mprahl · 2025-11-20T19:30:50Z

Claude 4.5 review:

PR #12430: MLMD Removal - Action Items

PR: #12430
Branch: mlmd-removal
Review Date: 2025-11-20
Overall Status: ⚠️ 3 Issues to Address (1 Critical, 2 Important)

🚨 Critical Issues (Must Fix Before Merge)

1. Terminal State Enforcement Missing

Priority: 🔴 CRITICAL
Effort: ~4 hours
Severity: Data integrity issue

Problem

The design requires preventing task updates when the parent run is in a terminal state (SUCCEEDED, FAILED, or CANCELED). This check is not implemented.

Impact

Launchers could update tasks after a run completes, leading to inconsistent state.

Required Fix

File: backend/src/apiserver/server/run_server.go

Location: UpdateTask() function (line ~685)

Add this code before the authorization check:

func (s *RunServer) UpdateTask(ctx context.Context, request *apiv2beta1.UpdateTaskRequest) (*apiv2beta1.PipelineTaskDetail, error) {
    taskID := request.GetTaskId()
    
    // Get existing task
    existingTask, err := s.resourceManager.GetTask(taskID)
    if err != nil {
        return nil, util.Wrap(err, "Failed to get existing task for authorization")
    }
    
    // ✅ ADD THIS: Check if run is in terminal state
    run, err := s.resourceManager.GetRun(existingTask.RunUUID)
    if err != nil {
        return nil, util.Wrap(err, "Failed to get run to check terminal state")
    }
    
    terminalStates := []model.RuntimeState{
        model.RuntimeStateSucceeded,
        model.RuntimeStateFailed,
        model.RuntimeStateCanceled,
    }
    
    for _, terminalState := range terminalStates {
        if run.State == terminalState {
            return nil, util.NewInvalidInputError(
                "Cannot update task %s: parent run %s is in terminal state %s",
                taskID, existingTask.RunUUID, terminalState,
            )
        }
    }
    
    // Continue with existing authorization and update logic...
}

Also apply to: UpdateTasksBulk() function

Required Test

File: backend/src/apiserver/server/run_server_tasks_test.go

func TestUpdateTask_TerminalState_Rejected(t *testing.T) {
    // Setup
    _, resourceManager := setupTestEnv() // client manager is not needed in this test
    runSrv := NewRunServer(resourceManager, nil)
    
    // Create run and task
    run := createTestRun(t, resourceManager, "test-run")
    task := createTestTask(t, runSrv, run.UUID, "test-task")
    
    // Mark run as SUCCEEDED (terminal state)
    resourceManager.UpdateRun(run.UUID, &model.Run{State: model.RuntimeStateSucceeded})
    
    // Attempt to update task - should fail
    _, err := runSrv.UpdateTask(context.Background(), &apiv2beta1.UpdateTaskRequest{
        TaskId: task.GetTaskId(),
        Task: &apiv2beta1.PipelineTaskDetail{
            TaskId: task.GetTaskId(),
            State:  apiv2beta1.PipelineTaskDetail_FAILED,
        },
    })
    
    // Assert: Update should be rejected
    assert.Error(t, err)
    assert.Contains(t, err.Error(), "terminal state")
}

⚠️ Important Issues (Should Fix Before Merge to Master)

2. Cache Fingerprint Not Cleared on Failure

Priority: 🟡 MEDIUM
Effort: ~2 hours
Severity: Potential false cache hits

Problem

When a task fails, its cache fingerprint is not explicitly cleared. While the cache detection queries for status=SUCCEEDED, it's safer to explicitly clear the fingerprint to prevent any edge cases.

Impact

Low risk of false cache hits if the query logic changes in the future.

Required Fix

File: backend/src/v2/component/launcher_v2.go

Location: Execute() function deferred error handler (line ~200)

Modify the defer block:

func (l *LauncherV2) Execute(ctx context.Context) (executionErr error) {
    defer func() {
        if executionErr != nil {
            l.options.Task.State = apiV2beta1.PipelineTaskDetail_FAILED
            l.options.Task.CacheFingerprint = ""  // ✅ ADD THIS LINE
            l.options.Task.StatusMetadata = &apiV2beta1.PipelineTaskDetail_StatusMetadata{
                Message: executionErr.Error(),
            }
        }
        l.options.Task.EndTime = timestamppb.New(time.Now())
        l.batchUpdater.QueueTaskUpdate(l.options.Task)
        
        // ... rest of defer logic
    }()
    
    // ... rest of Execute function
}

Required Test

File: backend/src/v2/component/launcher_v2_test.go

func TestLauncher_FailedExecution_ClearFingerprint(t *testing.T) {
    // Setup launcher with mocked failure
    launcher := setupTestLauncher(t)
    
    // Set initial fingerprint
    launcher.options.Task.CacheFingerprint = "test-fingerprint-123"
    
    // Execute (will fail due to mock)
    err := launcher.Execute(context.Background())
    
    // Assert: Fingerprint should be cleared
    assert.Error(t, err)
    assert.Equal(t, "", launcher.options.Task.CacheFingerprint)
    assert.Equal(t, apiV2beta1.PipelineTaskDetail_FAILED, launcher.options.Task.State)
}

3. Exit Handler Task Type Not Detected

Priority: 🟡 MEDIUM
Effort: ~3 hours
Severity: Tasks misclassified as generic DAG

Problem

The proto defines EXIT_HANDLER task type, but there's no explicit detection logic in the driver. Exit handler tasks will be classified as generic DAG type.

Impact

Exit handler tasks won't have the correct type in the database, making it harder to query or display them correctly.

Required Fix

File: backend/src/v2/driver/dag.go

Location: Type detection switch statement (line ~103)

Add exit handler detection:

// Determine type of DAG task
switch {
case iterationCount != nil:
    count := int64(*iterationCount)
    taskToCreate.TypeAttributes = &gc.PipelineTaskDetail_TypeAttributes{IterationCount: &count}
    taskToCreate.Type = gc.PipelineTaskDetail_LOOP
    taskToCreate.DisplayName = "Loop"
    execution.IterationCount = util.IntPointer(int(count))

// ✅ ADD THIS CASE
case strings.HasPrefix(opts.TaskName, "exit-handler-"):
    taskToCreate.Type = gc.PipelineTaskDetail_EXIT_HANDLER
    taskToCreate.DisplayName = "Exit Handler"

case condition != "":
    taskToCreate.Type = gc.PipelineTaskDetail_CONDITION_BRANCH
    taskToCreate.DisplayName = "Condition Branch"

case strings.HasPrefix(opts.TaskName, "condition") && !strings.HasPrefix(opts.TaskName, "condition-branch"):
    taskToCreate.Type = gc.PipelineTaskDetail_CONDITION
    taskToCreate.DisplayName = "Condition"

default:
    taskToCreate.Type = gc.PipelineTaskDetail_DAG
}

Required Test

File: backend/src/v2/driver/dag_test.go

func TestDAG_ExitHandler_TypeSet(t *testing.T) {
    opts := common.Options{
        TaskName: "exit-handler-cleanup",
        // ... other required fields
    }
    
    execution, err := DAG(context.Background(), opts, mockClientManager)
    
    assert.NoError(t, err)
    // Verify task was created with EXIT_HANDLER type
    task, _ := mockClientManager.KFPAPIClient().GetTask(context.Background(), 
        &apiv2beta1.GetTaskRequest{TaskId: execution.TaskID})
    assert.Equal(t, apiv2beta1.PipelineTaskDetail_EXIT_HANDLER, task.Type)
}

📝 Documentation Updates

Update Design Document

Priority: 🟢 LOW
Effort: ~1 hour

File: proposals/12147-mlmd-removal/design-details.md

Change Required

The design states metrics will be stored in a separate metrics table, but the implementation stores them in the artifacts table with special types.

Update Section: "Metrics" (line ~307)

From:

"we'll instead leverage the Metrics table in KFP which is currently unused"

To:

"Metrics are stored in the artifacts table with dedicated artifact types (Artifact_Metric, Artifact_ClassificationMetric, Artifact_SlicedClassificationMetric). They are distinguished by having no URI and storing data in the number_value or metadata fields."

📋 Summary Table

| Issue | Priority | Effort | Files to Modify | Tests Required |
|---|---|---|---|---|
| 1. Terminal State | 🔴 Critical | 4h | `run_server.go` | `run_server_tasks_test.go` |
| 2. Cache Fingerprint | 🟡 Medium | 2h | `launcher_v2.go` | `launcher_v2_test.go` |
| 3. Exit Handler | 🟡 Medium | 3h | `dag.go` | `dag_test.go` |
| 4. Documentation | 🟢 Low | 1h | `design-details.md` | N/A |
| Total | | 10h | 4 files | 3 test files |

✅ Merge Recommendations

For `mlmd-removal` Branch

Status: ⚠️ Conditional Approval

Requirements:

✅ Must fix: Issue 1 (Terminal State Enforcement)

Timeline: 1 day

For `master` Branch

Status: 🚫 Not Ready

Requirements:

✅ Must fix: Issue 1 (Terminal State Enforcement)
⚠️ Should fix: Issue 2 (Cache Fingerprint)
⚠️ Should fix: Issue 3 (Exit Handler)
📝 Should update: Issue 4 (Documentation)

Timeline: 2-3 days

🎯 Next Steps

Immediate (Before merging to mlmd-removal):
- Implement terminal state enforcement
- Add terminal state tests
- Test manually with concurrent runs
Before merging to master:
- Clear cache fingerprint on failure
- Add exit handler detection
- Update design documentation
- Run full integration test suite
- Verify all new tests pass
Post-merge (Follow-up PRs as planned):
- Migration implementation
- Frontend changes

📞 Contact

For questions or clarifications about these action items, refer to the detailed review in BACKEND_VERIFICATION_CHECKLIST.md.

Reviewer: AI Assistant
Date: 2025-11-20

hbelmiro · 2026-02-06T17:11:37Z

+	// Create an artifact entry in the database.
+	CreateArtifact(artifact *model.Artifact) (*model.Artifact, error)
+
+	// Fetches an artifact with a given id.
+	GetArtifact(id string) (*model.Artifact, error)
+
+	// Fetches artifacts for given filtering and listing options.
+	ListArtifacts(filterContext *model.FilterContext, opts *list.Options) ([]*model.Artifact, int, string, error)


nit: It is a convention to begin a comment with the name of the exported element

Suggested change

-	// Create an artifact entry in the database.
-	CreateArtifact(artifact *model.Artifact) (*model.Artifact, error)
-
-	// Fetches an artifact with a given id.
-	GetArtifact(id string) (*model.Artifact, error)
-
-	// Fetches artifacts for given filtering and listing options.
-	ListArtifacts(filterContext *model.FilterContext, opts *list.Options) ([]*model.Artifact, int, string, error)
+	// CreateArtifact creates an artifact entry in the database.
+	CreateArtifact(artifact *model.Artifact) (*model.Artifact, error)
+
+	// GetArtifact fetches an artifact with a given id.
+	GetArtifact(id string) (*model.Artifact, error)
+
+	// ListArtifacts fetches artifacts for given filtering and listing options.
+	ListArtifacts(filterContext *model.FilterContext, opts *list.Options) ([]*model.Artifact, int, string, error)

hbelmiro · 2026-02-06T17:27:26Z

+	ListArtifacts(filterContext *model.FilterContext, opts *list.Options) ([]*model.Artifact, int, string, error)
+}
+
+type ArtifactStore struct {


ArtifactStore is exported, but consumers only use ArtifactStoreInterface. Exporting both makes the interface (which is good for encapsulation) pointless: external code can bypass it and depend on the concrete type directly.
Since this is new code, I suggest making the struct unexported (artifactStore) and returning the interface from the constructor. Then you can also rename the interface from ArtifactStoreInterface to ArtifactStore.
Same applies to ArtifactTaskStore.
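
A minimal sketch of the suggested shape, under the assumption that an in-memory map stands in for the real database-backed store. The method set is trimmed, and `Artifact` here is a hypothetical two-field type rather than `model.Artifact`.

```go
package main

import "fmt"

// Artifact is a trimmed, hypothetical model type.
type Artifact struct {
	UUID string
	Name string
}

// ArtifactStore is the only exported name; callers depend on this interface.
type ArtifactStore interface {
	CreateArtifact(a *Artifact) (*Artifact, error)
	GetArtifact(id string) (*Artifact, error)
}

// artifactStore is unexported, so code outside the package cannot name the concrete type.
type artifactStore struct {
	byID map[string]*Artifact
}

// NewArtifactStore returns the interface rather than the struct.
func NewArtifactStore() ArtifactStore {
	return &artifactStore{byID: map[string]*Artifact{}}
}

// CreateArtifact stores a and returns it.
func (s *artifactStore) CreateArtifact(a *Artifact) (*Artifact, error) {
	if a.UUID == "" {
		return nil, fmt.Errorf("artifact UUID must be set")
	}
	s.byID[a.UUID] = a
	return a, nil
}

// GetArtifact fetches the artifact with the given id.
func (s *artifactStore) GetArtifact(id string) (*Artifact, error) {
	a, ok := s.byID[id]
	if !ok {
		return nil, fmt.Errorf("artifact %q not found", id)
	}
	return a, nil
}

func main() {
	store := NewArtifactStore()
	_, _ = store.CreateArtifact(&Artifact{UUID: "u1", Name: "model"})
	got, err := store.GetArtifact("u1")
	fmt.Println(got.Name, err)
}
```

With this layout, tests can swap in a fake by implementing `ArtifactStore`, and the concrete struct can change without affecting callers.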

hbelmiro · 2026-02-06T17:41:45Z

+			&numberValue,
+		)
+		if err != nil {
+			return artifacts, err


Suggested change

-			return artifacts, err
+			return nil, err

hbelmiro · 2026-02-06T17:43:03Z

+		if metadataBytes != nil {
+			err = metadata.Scan(metadataBytes)
+			if err != nil {
+				return artifacts, util.NewInternalServerError(err, "Failed to parse artifact metadata")


Suggested change

-				return artifacts, util.NewInternalServerError(err, "Failed to parse artifact metadata")
+				return nil, util.NewInternalServerError(err, "Failed to parse artifact metadata")

hbelmiro · 2026-02-06T17:50:42Z

+	// Fetches an artifact with a given id.
+	GetArtifact(id string) (*model.Artifact, error)
+
+	// Fetches artifacts for given filtering and listing options.


The 4 return values ([]*model.Artifact, int, string, error) are not documented. A reader has to trace through the implementation to know that int is the total count and string is the next page token.
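
One possible way to address this is a doc comment plus named results, sketched below. The signature is deliberately simplified (the real method takes a filter context and list options), and `ArtifactLister`, `memLister`, and the offset-based page token are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"strconv"
)

// Artifact is a hypothetical stand-in for model.Artifact.
type Artifact struct{ Name string }

// ArtifactLister shows how named results can document a four-value return.
type ArtifactLister interface {
	// ListArtifacts returns one page of artifacts.
	//
	//   - artifacts: the contents of the requested page
	//   - totalSize: the number of matching artifacts across all pages
	//   - nextPageToken: token for the following page, empty on the last page
	//   - err: non-nil if the request could not be served
	ListArtifacts(pageSize int, pageToken string) (artifacts []*Artifact, totalSize int, nextPageToken string, err error)
}

// memLister is an in-memory implementation whose page token is simply an offset.
type memLister struct{ all []*Artifact }

var _ ArtifactLister = memLister{}

func (m memLister) ListArtifacts(pageSize int, pageToken string) ([]*Artifact, int, string, error) {
	if pageSize <= 0 {
		return nil, 0, "", fmt.Errorf("page size must be positive, got %d", pageSize)
	}
	start := 0
	if pageToken != "" {
		n, err := strconv.Atoi(pageToken)
		if err != nil || n < 0 || n > len(m.all) {
			return nil, 0, "", fmt.Errorf("invalid page token %q", pageToken)
		}
		start = n
	}
	end := start + pageSize
	next := ""
	if end < len(m.all) {
		next = strconv.Itoa(end)
	} else {
		end = len(m.all)
	}
	return m.all[start:end], len(m.all), next, nil
}

func main() {
	l := memLister{all: []*Artifact{{Name: "a"}, {Name: "b"}, {Name: "c"}}}
	page, total, next, err := l.ListArtifacts(2, "")
	fmt.Println(len(page), total, next, err)
}
```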

hbelmiro · 2026-02-06T18:37:20Z

+
+	rows, err := tx.Query(rowsSQL, rowsArgs...)
+	if err != nil {
+		tx.Rollback()


tx.Rollback() error is discarded here (and on lines 228, 233, 240, 244, 249). If the rollback itself fails, there's no indication.

Consider using a single defer after Begin instead of scattered rollback calls:

defer func() {
	if err != nil {
		if rbErr := tx.Rollback(); rbErr != nil {
			glog.Errorf("Failed to rollback: %v", rbErr)
		}
	}
}()
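
For the single deferred rollback to observe the final error, the enclosing function needs a named error result. The sketch below shows that pattern in isolation. It uses a small `txn` interface and a `fakeTx` so it runs without a database (both are hypothetical; `*sql.Tx` has the same two methods), and the standard `log` package instead of glog. Ignoring `sql.ErrTxDone` is an added assumption, to avoid logging a spurious failure when the rollback follows a failed commit.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
	"log"
)

// txn is the subset of *sql.Tx used here.
type txn interface {
	Commit() error
	Rollback() error
}

// fakeTx records calls so the example runs without a database.
type fakeTx struct{ committed, rolledBack bool }

func (f *fakeTx) Commit() error   { f.committed = true; return nil }
func (f *fakeTx) Rollback() error { f.rolledBack = true; return nil }

var errBoom = errors.New("boom")

// withTx runs work and commits on success. The named result err lets the
// one deferred function see the error from every return path.
func withTx(tx txn, work func() error) (err error) {
	defer func() {
		if err == nil {
			return
		}
		if rbErr := tx.Rollback(); rbErr != nil && !errors.Is(rbErr, sql.ErrTxDone) {
			log.Printf("failed to roll back: %v", rbErr)
		}
	}()
	if err = work(); err != nil {
		return err
	}
	return tx.Commit()
}

func main() {
	tx := &fakeTx{}
	err := withTx(tx, func() error { return errBoom })
	fmt.Println(err, tx.rolledBack, tx.committed)
}
```

Note that a locally shadowed `err` (for example `if err := rows.Err(); err != nil`) would not be visible to the deferred function unless it is returned through the named result.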

hbelmiro · 2026-02-06T18:39:08Z

+	sizeRow, err := tx.Query(sizeSQL, sizeArgs...)
+	if err != nil {
+		tx.Rollback()
+		return errorF(err)
+	}
+	if err := sizeRow.Err(); err != nil {
+		tx.Rollback()
+		return errorF(err)
+	}
+	totalSize, err := list.ScanRowToTotalSize(sizeRow)
+	if err != nil {
+		tx.Rollback()
+		return errorF(err)
+	}
+	defer sizeRow.Close()


Same here about closing sizeRow and discarding rollback errors.

hbelmiro · 2026-02-06T18:40:34Z

+
+	err = tx.Commit()
+	if err != nil {
+		glog.Errorf("Failed to commit transaction to list artifacts")


Same as earlier regarding logging and returning error.

hbelmiro · 2026-02-06T18:46:34Z

+	}
+
+	npt, err := opts.NextPageToken(artifacts[opts.PageSize])
+	return artifacts[:opts.PageSize], totalSize, npt, err


Suggested change

-	return artifacts[:opts.PageSize], totalSize, npt, err
+	if err != nil {
+		return nil, 0, "", err
+	}
+	return artifacts[:opts.PageSize], totalSize, npt, nil

hbelmiro · 2026-02-06T18:58:59Z

+
+	artifacts, err := s.scanRows(r)
+	if err != nil || len(artifacts) > 1 {
+		return nil, util.NewInternalServerError(err, "Failed to get artifact: %v", err.Error())


err can be nil here. If scanRows succeeds but returns more than 1 artifact, err is nil and err.Error() on line 289 panics. These two conditions should be separate checks.
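
A sketch of the separated checks, assuming a hypothetical helper `exactlyOne` that works on plain strings instead of `*model.Artifact`; how the zero-result case should be reported is also an assumption here:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel for the zero-result case.
var errNotFound = errors.New("artifact not found")

// exactlyOne handles each failure mode on its own, so err is never
// dereferenced while nil.
func exactlyOne(artifacts []string, err error) (string, error) {
	if err != nil {
		return "", fmt.Errorf("failed to get artifact: %w", err)
	}
	if len(artifacts) > 1 {
		return "", fmt.Errorf("expected at most one artifact, got %d", len(artifacts))
	}
	if len(artifacts) == 0 {
		return "", errNotFound
	}
	return artifacts[0], nil
}

func main() {
	// Two rows with a nil error: the original combined check would panic here.
	_, err := exactlyOne([]string{"a", "b"}, nil)
	fmt.Println(err)
}
```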

hbelmiro · 2026-02-10T13:09:28Z

+	return "artifacts"
+}
+
+func (a Artifact) GetSortByFieldPrefix(s string) string {


nit: Unused parameter

Suggested change

-func (a Artifact) GetSortByFieldPrefix(s string) string {
+func (a Artifact) GetSortByFieldPrefix(string) string {

hbelmiro · 2026-02-10T13:25:11Z

+type Artifact struct {
+	UUID            string       `gorm:"column:UUID; not null; primaryKey; type:varchar(191);"`
+	Namespace       string       `gorm:"column:Namespace; not null; type:varchar(63); index:idx_type_namespace,priority:1;"`
+	Type            ArtifactType `gorm:"column:Type; default:null; index:idx_type_namespace,priority:2;"`


This is a nullable column. It should be a pointer.

hbelmiro · 2026-02-10T13:27:03Z

+	Namespace       string       `gorm:"column:Namespace; not null; type:varchar(63); index:idx_type_namespace,priority:1;"`
+	Type            ArtifactType `gorm:"column:Type; default:null; index:idx_type_namespace,priority:2;"`
+	URI             *string      `gorm:"column:URI; type:text;"`
+	Name            string       `gorm:"column:Name; type:varchar(128); default:null;"`


Should be a pointer.

hbelmiro · 2026-02-10T13:27:20Z

+	Type            ArtifactType `gorm:"column:Type; default:null; index:idx_type_namespace,priority:2;"`
+	URI             *string      `gorm:"column:URI; type:text;"`
+	Name            string       `gorm:"column:Name; type:varchar(128); default:null;"`
+	Description     string       `gorm:"column:Description; type:text; default:null;"`


Should be a pointer.

hbelmiro · 2026-02-10T13:50:16Z

+	return "", false
+}
+
+func (a Artifact) GetFieldValue(name string) interface{} {


Suggested change

-func (a Artifact) GetFieldValue(name string) interface{} {
+func (a Artifact) GetFieldValue(name string) any {

hbelmiro · 2026-02-10T15:44:33Z

+	if err != nil || len(artifactTasks) > 1 {
+		return nil, util.NewInternalServerError(err, "Failed to get artifact-task: %v", err.Error())


The condition err != nil || len(artifactTasks) > 1 leads into util.NewInternalServerError(err, ...). When err is nil but len(artifactTasks) > 1, err.Error() panics. Use separate checks.

hbelmiro · 2026-02-10T15:46:22Z

+			taskIDs = append(taskIDs, filterContext.ID)
+		case model.RunResourceType:
+			runIDs = append(runIDs, filterContext.ID)
+		}


Any other filterContext.Type is skipped with no error or log. Fail fast in a default branch.
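
A sketch of the fail-fast default branch. The `ResourceType` constants and `FilterContext` struct below are simplified stand-ins for the `model` package types, and `splitIDs` is a hypothetical helper:

```go
package main

import "fmt"

// ResourceType and FilterContext are simplified stand-ins for the model types.
type ResourceType string

const (
	TaskResourceType ResourceType = "Task"
	RunResourceType  ResourceType = "Run"
)

type FilterContext struct {
	Type ResourceType
	ID   string
}

// splitIDs groups IDs by resource type and rejects anything it does not know
// instead of silently skipping it.
func splitIDs(contexts []FilterContext) (taskIDs, runIDs []string, err error) {
	for _, fc := range contexts {
		switch fc.Type {
		case TaskResourceType:
			taskIDs = append(taskIDs, fc.ID)
		case RunResourceType:
			runIDs = append(runIDs, fc.ID)
		default:
			return nil, nil, fmt.Errorf("unsupported filter context type %q", fc.Type)
		}
	}
	return taskIDs, runIDs, nil
}

func main() {
	_, _, err := splitIDs([]FilterContext{{Type: "Experiment", ID: "e1"}})
	fmt.Println(err)
}
```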

hbelmiro · 2026-02-10T16:07:18Z

+	if err != nil {
+		return nil, util.NewInternalServerError(err, "Failed to start transaction for creating artifact-tasks")
+	}
+	defer tx.Rollback()


Unhandled error.

hbelmiro · 2026-02-10T16:11:08Z

+			&key,
+		)
+		if err != nil {
+			return artifactTasks, err


Suggested change

-			return artifactTasks, err
+			return nil, err

hbelmiro · 2026-02-10T20:22:51Z

+		glog.Errorf("Failed to commit transaction to list artifact-tasks")
+		return errorF(err)


Suggested change

-		glog.Errorf("Failed to commit transaction to list artifact-tasks")
-		return errorF(err)
+		return errorF(err)

hbelmiro · 2026-02-11T12:42:02Z

@@ -79,6 +91,267 @@ func NewTaskStore(db *DB, time util.TimeInterface, uuid util.UUIDGeneratorInterf
 	}
 }

+// scanTaskRow scans a single row into a model.Task. It expects the column order to match taskColumns.
+func scanTaskRow(rowscanner interface{ Scan(dest ...any) error }) (*model.Task, error) {


All calls pass *sql.Rows as argument. Can we change the parameter type?

Suggested change

-func scanTaskRow(rowscanner interface{ Scan(dest ...any) error }) (*model.Task, error) {
+func scanTaskRow(rowscanner *sql.Rows) (*model.Task, error) {

hbelmiro · 2026-02-11T13:21:15Z

+			fmt.Printf("scan error is %v", err)
+			return tasks, err


Suggested change

-			fmt.Printf("scan error is %v", err)
-			return tasks, err
+			return nil, err

hbelmiro · 2026-02-11T13:29:21Z

 	if err != nil {
-		return util.NewInternalServerError(err, "Failed to check existing tasks")
+		if err == sql.ErrNoRows {


Comparing errors with equality operators fails on wrapped errors.

Suggested change

-		if err == sql.ErrNoRows {
+		if errors.Is(err, sql.ErrNoRows) {
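
To illustrate the difference, the sketch below wraps `sql.ErrNoRows` with `%w` and compares it both ways. `findTask`, `equalsNoRows`, and `isNoRows` are hypothetical helpers written for this example.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// findTask is a hypothetical lookup that wraps the sentinel with context.
func findTask(exists bool) error {
	if !exists {
		return fmt.Errorf("checking existing tasks: %w", sql.ErrNoRows)
	}
	return nil
}

// equalsNoRows uses ==, which only matches the bare sentinel.
func equalsNoRows(err error) bool { return err == sql.ErrNoRows }

// isNoRows uses errors.Is, which walks the wrap chain.
func isNoRows(err error) bool { return errors.Is(err, sql.ErrNoRows) }

func main() {
	err := findTask(false)
	fmt.Println(equalsNoRows(err)) // false: the error is wrapped
	fmt.Println(isNoRows(err))     // true
}
```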

hbelmiro · 2026-02-11T13:33:15Z

+	if err != nil {
+		return nil, util.NewInternalServerError(err, "Failed to start transaction for task update")
+	}
+	defer tx.Rollback() // Will be no-op if Commit() succeeds


Unhandled error.

hbelmiro · 2026-02-11T13:49:52Z

+		// to include the value hash to avoid collisions.
+		valueHash, err := hashProtoValue(p.GetValue())
+		if err != nil {
+			glog.Errorf("Failed to hash parameter value: %v", err)


It shouldn't continue in case of error.

hbelmiro · 2026-02-11T14:57:29Z

-	// Creates a new metric entry.
-	CreateMetric(metric *model.RunMetric) (err error)
+	// CreateV1Metric Creates a new metric entry.
+	// Deprecated: use CreateMetric instead.


There's no such method.

hbelmiro · 2026-02-11T15:35:33Z

+		if run == nil || run.UUID == "" {
+			continue
+		}


Are there any valid cases in which this would happen? Otherwise it should return an error or be removed to avoid masking bugs.

hbelmiro · 2026-02-11T15:54:06Z

+func (r *ResourceManager) ListArtifacts(filterContexts []*model.FilterContext, opts *list.Options) ([]*model.Artifact, int, string, error) {
+	// Use the first filter context for now (artifacts are typically filtered by namespace)
+	var filterContext *model.FilterContext
+	if len(filterContexts) > 0 {
+		filterContext = filterContexts[0]
+	}
+


If only one FilterContext is used, why accept a slice?

Suggested change

-func (r *ResourceManager) ListArtifacts(filterContexts []*model.FilterContext, opts *list.Options) ([]*model.Artifact, int, string, error) {
-	// Use the first filter context for now (artifacts are typically filtered by namespace)
-	var filterContext *model.FilterContext
-	if len(filterContexts) > 0 {
-		filterContext = filterContexts[0]
-	}
+func (r *ResourceManager) ListArtifacts(filterContext *model.FilterContext, opts *list.Options) ([]*model.Artifact, int, string, error) {

hbelmiro · 2026-02-11T15:57:38Z

 // Fetches a run with a given id.
+// GetRun fetches a run with full task hydration (backward compatible).


The comment got duplicated.

hbelmiro · 2026-02-11T17:42:20Z

-func (r *ResourceManager) ReportMetric(metric *model.RunMetric) error {
-	err := r.runStore.CreateMetric(metric)
+// ReportMetric Read metrics as ordinary artifacts instead.
+// Creates a run metric entry. Deprecated.


Suggested change

-// Creates a run metric entry. Deprecated.
+// Creates a run metric entry.
+// Deprecated.

hbelmiro · 2026-02-12T14:40:09Z

+		// Set the validated namespace
+		modelArtifact.Namespace = namespace
+
+		artifact, err := s.resourceManager.CreateArtifact(modelArtifact)


A mid-loop failure leaves partial state. Consider adding a CreateArtifacts batch method at the store layer (like CreateArtifactTasks) and calling it here.

hbelmiro · 2026-02-16T13:52:31Z

+	return s.listRunsWithHydration(ctx, pageToken, pageSize, sortBy, opts, namespace, experimentId, true)
+}
+
+func (s *BaseRunServer) listRunsWithHydration(ctx context.Context, pageToken string, pageSize int, sortBy string, opts *list.Options, namespace string, experimentID string, hydrateTasks bool) ([]*model.Run, int, string, error) {


pageToken, pageSize and sortBy are unused and can be removed.

hbelmiro · 2026-02-16T13:55:48Z

-// Reports run metrics.
-// Supports v1beta1 API.
+// ReportRunMetricsV1 reports run metrics.
+// Supports v1beta1 API. Deprecated.


Suggested change

-// Supports v1beta1 API. Deprecated.
+// Supports v1beta1 API.
+// Deprecated.

juliusvonkohout · 2026-03-26T10:19:52Z

I just hope that after this is merged we still get one last release that has v1 as well and, with this PR, a usable v2. Many companies need a chance to migrate production pipelines from v1 to v2. For that we first need a release with a reasonably secure, reliable and scalable v2 (with this PR) that still includes v1. Then a few months later we can cut another release that removes v1.

mprahl · 2026-04-09T18:12:59Z

+	if opts.Namespace == "" {
+		return fmt.Errorf("namespace is required")
+	}
+	if opts.Task.GetTaskInfo().GetName() != "" {


** 🤖 AI Review **

validateRootDAG() now dereferences opts.Task here, but the compiled root-driver shape still omits task entirely. That makes normal root DAG runs panic before the root task is created. Could this treat opts.Task == nil as the expected root case and add a regression test using a real compiled root-driver invocation?

mprahl · 2026-04-09T18:13:00Z

@@ -1,14 +0,0 @@
-apiVersion: kustomize.config.k8s.io/v1beta1


** 🤖 AI Review **

Deleting this base leaves a few install/tooling entry points still referencing base/metadata/** and base/pipeline/metadata-writer/** (env/dev-kind, platform-agnostic-postgresql, platform-agnostic-multi-user-legacy, and hack/{release,format}.sh). PR head still has broken kustomize/release paths until those references are updated or removed in the same rollout.

mprahl · 2026-04-09T18:13:01Z

+	}
+
+	// Fetch task and artifact for validation and authorization
+	task, err := s.resourceManager.GetTask(at.GetTaskId())


** 🤖 AI Review **

We fetch the task/artifact here for auth, but we never validate that artifact_task.run_id == task.RunUUID. The same ownership gap exists in CreateArtifact() / CreateArtifactsBulk(), which authorize on the submitted artifact namespace before linking to a task/run. Could the server derive run/namespace from the fetched task and reject mismatches across all artifact-create paths?

mprahl · 2026-04-09T18:13:02Z

-	cm.cacheClient = cacheClient
+
+	// Initialize connection to new KFP v2beta1 API server
+	apiCfg := apiclient.FromEnv()


** 🤖 AI Review **

This now ignores the existing --ml_pipeline_server_address / --ml_pipeline_server_port plumbing and only honors KFP_API_ADDRESS / KFP_API_PORT. The compiler, driver, and launcher still thread the flag-based endpoint through, so non-default installs will silently dial the hard-coded default here. Could the new client path reuse those existing options or at least honor both config sources?
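
One possible way to honor both sources is a small resolver that prefers the flag value and falls back to the environment. This is only a sketch: the precedence order, the `resolveAddress` helper, and the fallback value are assumptions, and only the flag and variable names come from the comment above.

```go
package main

import (
	"fmt"
	"os"
)

// resolveAddress picks the API server address with an assumed precedence:
// explicit flag value, then the KFP_API_ADDRESS environment variable, then
// the caller-supplied fallback. getenv is injected so tests need no real env.
func resolveAddress(flagValue string, getenv func(string) string, fallback string) string {
	if flagValue != "" {
		return flagValue
	}
	if v := getenv("KFP_API_ADDRESS"); v != "" {
		return v
	}
	return fallback
}

func main() {
	// In a real binary flagValue would come from --ml_pipeline_server_address.
	// "localhost" is a placeholder fallback, not the project's default.
	fmt.Println(resolveAddress("", os.Getenv, "localhost"))
}
```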

mprahl · 2026-04-09T18:13:04Z

+			if err != nil {
+				return execution, fmt.Errorf("failed to create artifact tasks: %w", err)
+			}
+			execution.TaskID = createdTask.TaskId


** 🤖 AI Review **

Could this also set taskToCreate.TaskId = createdTask.TaskId? The deferred error/status paths later operate on taskToCreate, and the update branch above sends UpdateTaskRequest{Task: taskToCreate} without a separate path ID. Today that means cleanup/status propagation can end up targeting no task even though CreateTask() succeeded.

mprahl · 2026-04-09T18:13:23Z

-	Payload            LargeText        `gorm:"column:Payload; default:null;"`
+	UUID             string     `gorm:"column:UUID; not null; primaryKey; type:varchar(191);"`
+	Namespace        string     `gorm:"column:Namespace; not null; type:varchar(63);"`
+	RunUUID          string     `gorm:"column:RunUUID; type:varchar(191); not null; index:idx_parent_run,priority:1;"`


** 🤖 AI Review **

model.Task now renamed this field to RunUUID, but validation.LengthSpecs still references the old Go field name RunID. Since ValidateModel() looks fields up by struct name via reflection, task validation now trips an internal error instead of enforcing the length check. Could the validation spec be updated to RunUUID with a targeted test?
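
The failure mode can be reproduced with a few lines of reflection. The sketch below is not the project's `ValidateModel()`; `checkMaxLen` and the trimmed `Task` struct are hypothetical and only show why a spec keyed on the old Go field name stops working after a rename.

```go
package main

import (
	"fmt"
	"reflect"
)

// Task is a trimmed stand-in for model.Task after the rename.
type Task struct {
	UUID    string
	RunUUID string
}

// checkMaxLen looks a string field up by its Go name and enforces a length
// limit. A stale name yields an error instead of a length check.
func checkMaxLen(model any, field string, limit int) error {
	v := reflect.Indirect(reflect.ValueOf(model))
	if v.Kind() != reflect.Struct {
		return fmt.Errorf("expected a struct, got %s", v.Kind())
	}
	f := v.FieldByName(field)
	if !f.IsValid() {
		return fmt.Errorf("length spec references unknown field %q", field)
	}
	if f.Kind() != reflect.String {
		return fmt.Errorf("field %q is not a string", field)
	}
	if f.Len() > limit {
		return fmt.Errorf("%s exceeds %d characters", field, limit)
	}
	return nil
}

func main() {
	task := Task{UUID: "t1", RunUUID: "run-1"}
	fmt.Println(checkMaxLen(task, "RunID", 191))   // stale name: lookup error
	fmt.Println(checkMaxLen(task, "RunUUID", 191)) // current name: <nil>
}
```

A targeted test along these lines, run against every entry in the length spec, would catch a renamed field at unit-test time.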

LukaszCmielowski · 2026-05-25T07:49:02Z

@mprahl @hbelmiro any ETA for that work?

Remove the MLMD dependency by storing artifacts and task state directly in KFP's API server and database, and switch the runtime path to use the new v2beta1 artifact and task APIs instead of the metadata service.

This change:
- adds artifact/task models, storage, API surface, and generated clients
- updates the driver, launcher, importer, auth, and workspace artifact flows to use the KFP API end to end
- removes MLMD deployment/manifests and refreshes the frontend, CI, tests, and generated outputs needed to keep the rebased branch passing

Signed-off-by: Humair Khan <HumairAK@users.noreply.github.com>

Port the merged task API contract fixes from the rebased runtime branch back onto mlmd-removal-11, restore the missing runtime pod-name helper, and refresh the run proto goldens so the backend unit suite passes on the squashed integration branch.

Signed-off-by: Humair Khan <HumairAK@users.noreply.github.com>

Populate the top-level run_id on runtime task API requests, update the v2 integration cache test to the merged PipelineTask HTTP model names, and stop the manifest smoke test from probing metadata overlays that this branch removes.

Signed-off-by: Humair Khan <HumairAK@users.noreply.github.com>

Stop pre-populating artifact IDs on create requests, map bulk-created IDs back onto queued artifact-task links before creating those relationships, and update the manifest release helper to stop rewriting metadata image tags that this branch removes.

Signed-off-by: Humair Khan <HumairAK@users.noreply.github.com>

Filter duplicate artifact-task relationships before the bulk create call so the runtime does not trip the artifact_tasks unique constraint, and add a unit test covering the dedupe path.

Signed-off-by: Humair Khan <HumairAK@users.noreply.github.com>

Rely on the artifact APIs to create the initial output artifact-task link when a new artifact is created, and align the local mock behavior with that server contract so the runtime path and tests stay consistent.

Signed-off-by: Humair Khan <HumairAK@users.noreply.github.com>

google-oss-prow Bot added the do-not-merge/work-in-progress label Nov 10, 2025

google-oss-prow Bot requested a review from droctothorpe November 10, 2025 22:10

google-oss-prow Bot added the size/XXL label Nov 10, 2025

google-oss-prow Bot requested review from gmfrasca, nsingla and zazulam November 10, 2025 22:10

HumairAK changed the title ~~WIP: remove mlmd~~ WIP: feat(backend): Replace MLMD with KFP Server APIs Nov 10, 2025

HumairAK force-pushed the mlmd-removal-11 branch 7 times, most recently from 48b441e to bac691f Compare November 12, 2025 14:31

HumairAK changed the title ~~WIP: feat(backend): Replace MLMD with KFP Server APIs~~ feat(backend): Replace MLMD with KFP Server APIs Nov 12, 2025

google-oss-prow Bot removed the do-not-merge/work-in-progress label Nov 12, 2025

HumairAK requested review from mprahl and removed request for gmfrasca November 12, 2025 14:38

HumairAK commented Nov 18, 2025

HumairAK force-pushed the mlmd-removal-11 branch 2 times, most recently from 9763470 to cb02722 Compare November 20, 2025 14:21

hbelmiro reviewed Feb 6, 2026

hbelmiro reviewed Feb 10, 2026

hbelmiro reviewed Feb 11, 2026

hbelmiro reviewed Feb 16, 2026

droctothorpe mentioned this pull request Feb 17, 2026

feat(backend/frontend): Add Pod Status tab for each component #12780

Closed

2 tasks

jeffspahr mentioned this pull request Feb 25, 2026

feat(#12169): Show retry attempts for pipeline tasks in UI #12596

Closed

2 tasks

HumairAK force-pushed the mlmd-removal-11 branch 3 times, most recently from 1512455 to 89065dd Compare March 24, 2026 15:38

mprahl reviewed Apr 9, 2026

HumairAK force-pushed the mlmd-removal-11 branch 3 times, most recently from bd8350c to 3be92f4 Compare May 6, 2026 13:11

a-reich mentioned this pull request May 26, 2026

V2beta1PipelineTaskDetail inputs and outputs always empty #9858

Closed

HumairAK force-pushed the mlmd-removal-11 branch from 3be92f4 to fc81f28 Compare June 1, 2026 19:23

HumairAK added 2 commits June 1, 2026 15:41

HumairAK force-pushed the mlmd-removal-11 branch from fc81f28 to 2ff0446 Compare June 1, 2026 19:53

HumairAK added 4 commits June 1, 2026 16:39

ntny mentioned this pull request Jun 16, 2026

feat(proposals): add KEP-12843 pod lifecycle failure support and visualization #13517

Open

google-oss-prow Bot commented Nov 10, 2025

HumairAK commented Nov 12, 2025

CarterFendley commented Nov 14, 2025

droctothorpe commented Nov 16, 2025
