-
Notifications
You must be signed in to change notification settings - Fork 85
Deploy ToolHive Operator into OpenShift (#1063) #1253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
RoddieKieley
commented
Aug 1, 2025
* Update helm chart resources and seccompProfile type values for OKD environment. * Update MCPServer Deployment with check for XDG_CONFIG_HOME and HOME env vars being set. * Add OpenShift environment detection by way of route v1 API availability or OPENSHIFT_OPERATOR env var override. Signed-off-by: Roddie Kieley <[email protected]>
When running go-task operator-test I observed the following problem first:
As well as a number of test failures that didn't seem all associated with the changes in this PR. I passed the terminal output through cursor -> o3 with a request to write the content out to markdown which I'm attaching: I utilized the 'Option B' Fix for the "shell" issue above and that worked locally for me on Fedora 42. Not sure how other environment friendly that is however so not including it in the PR itself. Either way feedback welcome and would be curious if you've had any luck with OKD @jhrozek ? |
For awareness, I will be deploying OpenShift into our environment next week for @jhrozek to play around with ToolHive installations |
Hey @ChrisJBurns could you also suggest a default strategy for the XDG_CONFIG_HOME env vars? ATM we're hardcoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the patch @RoddieKieley ! We'll work with @ChrisJBurns on setting up the OKD cluster so we can test properly.
defaultRetryInterval = 3 * time.Second | ||
) | ||
|
||
var isOpenShift *bool = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Not needed to include the PR, we can iterate on the patches in subsequent PRs as well)
I wonder if using sync.Once would be cleaner than using a global variable?
break | ||
} | ||
|
||
time.Sleep(defaultRetryInterval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider using ExponentialBackoff
from k8s.io/apimachinery/pkg/util/wait
?
(Again, can be done later)
podTemplateSpec = ensurePodTemplateConfig(podTemplateSpec, containerLabels) | ||
isOpenShift := false | ||
// Get a config to talk to the apiserver | ||
cfg, err := clientconfig.GetConfig() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should use rest.InClusterConfig
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, we have some examples here with mkp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I spent a little more time looking at the test failures and one of the tests fails because the code now expects a config but the unit test doesn't have one. I think the cleanest solution would be to have an enum of supported platforms:
type Platform int
const (
PlatformKubernetes Platform = iota
PlatformOpenShift
)
and an interface to detect a platform:
type PlatformDetector interface {
DetectPlatform(ctx context.Context) (Platform, error)
}
this way we could mock the detection in the tests.
btw I realize we let you waiting for a review and now I'm asking for a change, of course I'd be happy with a lower-cost solution or I'll be happy to contribute to your branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree here, this would somewhat avoid the isOpenShift flags below as well (and any future isEKS etc). If we have a nice way of being able to support multiple platforms without having toggles passed into functions that would be quite nice.
func ensureRequiredEnvVars(env []corev1.EnvVar) []corev1.EnvVar { | ||
// Check for the existence of the XDG_CONFIG_HOME and HOME environment variables | ||
// and set them to /tmp if they don't exist | ||
xdgConfigHomeFound := false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChrisJBurns do we have a ticket to ensure the proxy runner doesn't write any files? I think in general the proxy runner should be able to run with read-only filesystem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooops almost missed this! Are you speaking about the general securityContext
for the ProxyRunner (readOnlyFileSystem: true
) or specifically about the code itself? I think for Kubernetes, it doesn't write to any files, and even if it did, I can't think of any reason why they would be used. Mainly because the Client configurations only matter for local setups. However, that being said, I don't see us setting the securityContext
in the code, so I can create a ticket to do some best-practice stuff there like we do for the actual MCPServer itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #1285 to track the ProxyRunner secContext
@RoddieKieley good news! Thanks to @ChrisJBurns we have an OKD cluster now 🚀 |
@@ -621,6 +623,36 @@ func (r *MCPServerReconciler) deploymentForMCPServer(m *mcpv1alpha1.MCPServer) * | |||
return dep | |||
} | |||
|
|||
func ensureRequiredEnvVars(env []corev1.EnvVar) []corev1.EnvVar { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like the fact that we are adding the variables during the initial deployment but not adding them in deploymentNeedsUpdate
is causing some unit test failures.
It seems like deploymentNeedsUpdate
is comparing container.Env
with expectedProxyEnv
, but it should account for the automatically added environment variables fromensureRequiredEnvVars
as well.
|
||
# -- Container security context settings for the operator | ||
containerSecurityContext: | ||
allowPrivilegeEscalation: false | ||
readOnlyRootFilesystem: true | ||
runAsNonRoot: true | ||
runAsUser: 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChrisJBurns would you prefer if this change was done in a separate values-openshift.yaml
file?
Either way it looks like we need to bump the chart version (deploy/charts/operator/Chart.yaml
and deploy/charts/operator/README.md
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think you'll be able to override this anyways during deployment. In GitOps you just pass in:
containerSecurityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
This should override the default values. However if you're not using GitOps, then a separate values file like you mentioned values-openshift.yaml
where you override the values you want is an alternative. I'd like to avoid removing the runAsUser
value from the default values because we want to opt for sensible defaults that will work on most setups. For those that need something different, that's where the overrides above come in.
requests: | ||
cpu: 10m | ||
memory: 64Mi | ||
memory: 192Mi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
@@ -85,10 +86,10 @@ operator: | |||
resources: | |||
limits: | |||
cpu: 500m | |||
memory: 128Mi | |||
memory: 384Mi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we want to bump these up? You could override these values at deploy time if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I few more comments from me. I'm wondering if we instead do the OpenShift checks in the Operator and then pass in the desired securityContext
's to the ProxyRunner? This avoids the need for the isOpenShift
flags at the runtime layer when it comes to the podTemplateSpec
and also allows us to remove some of the securityContext code from the runtime layer.
@@ -940,6 +955,31 @@ func ensurePodTemplateConfig( | |||
podTemplateSpec.Spec.SecurityContext = podTemplateSpec.Spec.SecurityContext.WithRunAsGroup(int64(1000)) | |||
} | |||
} | |||
|
|||
if isOpenShift { | |||
if podTemplateSpec.Spec.SecurityContext.RunAsUser != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may show my lack of OpenShift knowledge here 😆 , but is there a reason we are setting these to nil
if they aren't nil
in the podTemplateSpec
?
Another point here, maybe not now, but in future, I'd like to keep the securityContext code as simple as possible in the runtime layer so we don't have to do additional checks. I think a longer-term goal would be to drive it from the podTemplateSpec that's passed in. So the Operator is responsible for passing in the sensible defaults and at the same time can detect if it's OpenShift or not (so it doesn't pass in anything)