
Deploy ToolHive Operator into OpenShift (#1063) #1253


Open

wants to merge 1 commit into main

Conversation

RoddieKieley

* Update helm chart resources and seccompProfile type values for OKD environment.
* Update MCPServer Deployment with check for XDG_CONFIG_HOME and HOME env vars being set.
* Add OpenShift environment detection by way of route v1 API availability or OPENSHIFT_OPERATOR env var override.

Signed-off-by: Roddie Kieley <[email protected]>
@RoddieKieley
Author

When running go-task operator-test I observed the following problem first:

"shell": executable file not found in $PATH

I also saw a number of test failures that didn't all seem to be associated with the changes in this PR. I passed the terminal output through Cursor -> o3 with a request to write the content out to markdown, which I'm attaching:
1063-cursor-o3-feedback.md

I used the 'Option B' fix for the "shell" issue above, and that worked locally for me on Fedora 42. I'm not sure how well it would suit other environments, however, so I'm not including it in the PR itself.

Either way, feedback is welcome, and I'd be curious whether you've had any luck with OKD, @jhrozek?

@ChrisJBurns
Collaborator

For awareness, I will be deploying OpenShift into our environment next week for @jhrozek to play around with ToolHive installations

@dmartinol

Hey @ChrisJBurns, could you also suggest a default strategy for the XDG_CONFIG_HOME env vars? ATM we're hardcoding /tmp, which is probably not ideal, WDYT?

Contributor

@jhrozek jhrozek left a comment

Thank you for the patch, @RoddieKieley! We'll work with @ChrisJBurns on setting up the OKD cluster so we can test properly.

defaultRetryInterval = 3 * time.Second
)

var isOpenShift *bool = nil
Contributor

(Not needed in this PR; we can iterate on the patches in subsequent PRs as well.)
I wonder if using sync.Once would be cleaner than using a global variable?
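
A minimal sketch of that approach, assuming a hypothetical detectOpenShift helper that wraps the route v1 / OPENSHIFT_OPERATOR detection from this PR (only the context and sync imports are needed beyond that; names are illustrative):

  var (
        openShiftOnce sync.Once
        openShift     bool
  )

  // IsOpenShift runs the platform detection at most once per process and
  // caches the result, replacing the package-level *bool.
  func IsOpenShift(ctx context.Context) bool {
        openShiftOnce.Do(func() {
              // detectOpenShift is a hypothetical helper wrapping the existing
              // route v1 API lookup / OPENSHIFT_OPERATOR override.
              openShift = detectOpenShift(ctx)
        })
        return openShift
  }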

break
}

time.Sleep(defaultRetryInterval)
Contributor

Should we consider using ExponentialBackoff from k8s.io/apimachinery/pkg/util/wait?
(Again, can be done later)
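
For example, something along these lines, where checkReady is a hypothetical wait.ConditionFunc standing in for whatever condition the current sleep loop is waiting on, and the backoff values are just assumptions (import k8s.io/apimachinery/pkg/util/wait):

  backoff := wait.Backoff{
        Duration: defaultRetryInterval, // 3s initial delay, as today
        Factor:   2.0,
        Jitter:   0.1,
        Steps:    5,
  }
  // checkReady has the signature func() (done bool, err error)
  if err := wait.ExponentialBackoff(backoff, checkReady); err != nil {
        return err
  }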

podTemplateSpec = ensurePodTemplateConfig(podTemplateSpec, containerLabels)
isOpenShift := false
// Get a config to talk to the apiserver
cfg, err := clientconfig.GetConfig()
Contributor

I wonder if we should use rest.InClusterConfig here?
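
Something like the following could prefer the in-cluster service account config while keeping a local-development fallback (sketch only; the clientconfig alias matches the import used in the diff above, and the helper name is made up):

  // getRESTConfig prefers the mounted service account configuration and only
  // falls back to controller-runtime's loading rules (kubeconfig, flags, env)
  // when the operator runs outside a cluster, e.g. during local development.
  func getRESTConfig() (*rest.Config, error) {
        if cfg, err := rest.InClusterConfig(); err == nil {
              return cfg, nil
        }
        return clientconfig.GetConfig()
  }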

Collaborator

Agreed, we have some examples here with mkp.

Contributor

@jhrozek jhrozek Aug 7, 2025

Actually, I spent a little more time looking at the test failures and one of the tests fails because the code now expects a config but the unit test doesn't have one. I think the cleanest solution would be to have an enum of supported platforms:

  type Platform int

  const (
        PlatformKubernetes Platform = iota
        PlatformOpenShift
  )

and an interface to detect a platform:

  type PlatformDetector interface {
        DetectPlatform(ctx context.Context) (Platform, error)
  }

this way we could mock the detection in the tests.
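
To make the mocking concrete, a unit test could then supply a fixed-value fake along these lines (the PlatformDetector field on the reconciler is hypothetical, just to show the injection point):

  type fakeDetector struct {
        platform Platform
  }

  func (f *fakeDetector) DetectPlatform(_ context.Context) (Platform, error) {
        return f.platform, nil
  }

  // e.g. in a test:
  // r := &MCPServerReconciler{PlatformDetector: &fakeDetector{platform: PlatformOpenShift}}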

btw I realize we kept you waiting for a review and now I'm asking for a change; of course I'd be happy with a lower-cost solution, or I'd be happy to contribute to your branch.

Collaborator

Yeah, I agree here; this would also largely avoid the isOpenShift flags below (and any future isEKS etc.). If we had a nice way of supporting multiple platforms without having toggles passed into functions, that would be quite nice.

func ensureRequiredEnvVars(env []corev1.EnvVar) []corev1.EnvVar {
// Check for the existence of the XDG_CONFIG_HOME and HOME environment variables
// and set them to /tmp if they don't exist
xdgConfigHomeFound := false
Contributor

@ChrisJBurns do we have a ticket to ensure the proxy runner doesn't write any files? I think in general the proxy runner should be able to run with a read-only filesystem.

Collaborator

Oops, almost missed this! Are you asking about the general securityContext for the ProxyRunner (readOnlyRootFilesystem: true) or specifically about the code itself? I think for Kubernetes it doesn't write to any files, and even if it did, I can't think of any reason why they would be used, mainly because the client configurations only matter for local setups. That said, I don't see us setting the securityContext in the code, so I can create a ticket to do some best-practice work there, like we do for the actual MCPServer itself.
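
For reference, the kind of restricted container securityContext being discussed might look roughly like this on the operator side (uses corev1 from k8s.io/api/core/v1; the specific field values are assumptions, not what the code sets today):

  allowPrivEsc := false
  readOnlyRootFS := true
  runAsNonRoot := true
  proxyRunnerSecCtx := &corev1.SecurityContext{
        AllowPrivilegeEscalation: &allowPrivEsc,
        ReadOnlyRootFilesystem:   &readOnlyRootFS,
        RunAsNonRoot:             &runAsNonRoot,
        Capabilities: &corev1.Capabilities{
              Drop: []corev1.Capability{"ALL"},
        },
  }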

Collaborator

Created #1285 to track the ProxyRunner secContext

@jhrozek
Contributor

jhrozek commented Aug 4, 2025

@RoddieKieley good news! Thanks to @ChrisJBurns we have an OKD cluster now 🚀

@@ -621,6 +623,36 @@ func (r *MCPServerReconciler) deploymentForMCPServer(m *mcpv1alpha1.MCPServer) *
return dep
}

func ensureRequiredEnvVars(env []corev1.EnvVar) []corev1.EnvVar {
Contributor

It seems like the fact that we are adding the variables during the initial deployment, but not accounting for them in deploymentNeedsUpdate, is causing some unit test failures. deploymentNeedsUpdate compares container.Env with expectedProxyEnv, but it should account for the environment variables automatically added by ensureRequiredEnvVars as well.
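
One low-cost way to address that could be to normalize the expected env through the same helper before comparing, roughly like this (the DeepEqual call stands in for whatever comparison deploymentNeedsUpdate actually performs today):

  // Run the expected env through the same helper so both sides include the
  // automatically injected variables before comparing.
  expectedEnv := ensureRequiredEnvVars(expectedProxyEnv)
  if !reflect.DeepEqual(container.Env, expectedEnv) {
        return true
  }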


# -- Container security context settings for the operator
containerSecurityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 1000
Contributor

@ChrisJBurns would you prefer if this change was done in a separate values-openshift.yaml file?

Either way it looks like we need to bump the chart version (deploy/charts/operator/Chart.yaml and deploy/charts/operator/README.md)

Collaborator

@ChrisJBurns ChrisJBurns Aug 7, 2025

So I think you'll be able to override this anyway during deployment. In GitOps you just pass in:

 containerSecurityContext:
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
    runAsNonRoot: true

This should override the default values. However, if you're not using GitOps, then a separate values file like the values-openshift.yaml you mentioned, where you override the values you want, is an alternative. I'd like to avoid removing the runAsUser value from the default values, because we want to opt for sensible defaults that will work on most setups. For those that need something different, that's where the overrides above come in.

requests:
cpu: 10m
memory: 64Mi
memory: 192Mi
Collaborator

Same here

@@ -85,10 +86,10 @@ operator:
resources:
limits:
cpu: 500m
memory: 128Mi
memory: 384Mi
Collaborator

Is there a reason we want to bump these up? You could override these values at deploy time if needed.

Collaborator

@ChrisJBurns ChrisJBurns left a comment

A few more comments from me. I'm wondering if we should instead do the OpenShift checks in the Operator and then pass the desired securityContexts to the ProxyRunner? This avoids the need for the isOpenShift flags at the runtime layer when it comes to the podTemplateSpec, and also allows us to remove some of the securityContext code from the runtime layer.
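
A rough sketch of that split, reusing the Platform enum suggested above (the function name and the exact defaults are assumptions): the operator picks the pod-level securityContext based on the detected platform, and the runtime layer just applies whatever it is handed.

  func podSecurityContextFor(p Platform) *corev1.PodSecurityContext {
        if p == PlatformOpenShift {
              // Leave UID/GID unset so the OpenShift SCC can assign them.
              return &corev1.PodSecurityContext{}
        }
        runAsUser := int64(1000)
        runAsGroup := int64(1000)
        return &corev1.PodSecurityContext{
              RunAsUser:  &runAsUser,
              RunAsGroup: &runAsGroup,
        }
  }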

@@ -940,6 +955,31 @@ func ensurePodTemplateConfig(
podTemplateSpec.Spec.SecurityContext = podTemplateSpec.Spec.SecurityContext.WithRunAsGroup(int64(1000))
}
}

if isOpenShift {
if podTemplateSpec.Spec.SecurityContext.RunAsUser != nil {
Collaborator

This may show my lack of OpenShift knowledge here 😆, but is there a reason we are setting these to nil if they aren't nil in the podTemplateSpec?

Another point here, maybe not now but in the future: I'd like to keep the securityContext code as simple as possible in the runtime layer so we don't have to do additional checks. I think a longer-term goal would be to drive it from the podTemplateSpec that's passed in, so the Operator is responsible for passing in the sensible defaults and at the same time can detect whether it's OpenShift or not (so it doesn't pass anything in).
