Update Argo version #74

Closed
@deliahu

Description

Long-running jobs are reported as errored: they are actually still running, but Cortex shows their status as error.

To Reproduce

Run any long-running job (e.g. a training job with a lot of steps).

Possible Solution / Implementation

This is due to a known bug in Argo that has already been fixed upstream. The fix should be released with v2.3.
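
To check whether a given executor predates the fix, the version it reports in its startup log line (v2.2.1 in the trace in this issue) can be compared against v2.3. The sketch below is only an illustration: `executor_version`, `parse_version`, and `predates_fix` are hypothetical helpers, not part of Argo or Cortex, and it assumes version strings of the form `v2.2.1`.

```python
import re

FIXED_IN = "v2.3"  # release expected to carry the fix, per this issue


def parse_version(version):
    """Turn 'v2.2.1' into (2, 2, 1). Hypothetical helper."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def executor_version(log_line):
    """Extract the version from an 'Executor (version: ...' log line, or None."""
    match = re.search(r"Executor \(version: (v[\d.]+),", log_line)
    return match.group(1) if match else None


def predates_fix(version, fixed_in=FIXED_IN):
    """True if `version` is older than the release that should contain the fix."""
    def pad(v):
        # Pad to three components so (2, 3) compares like (2, 3, 0).
        return (parse_version(v) + (0, 0, 0))[:3]
    return pad(version) < pad(fixed_in)
```

For example, `predates_fix("v2.2.1")` is true, while `predates_fix("v2.3")` is false.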

Stack Trace

time="2019-01-15T01:05:13Z" level=info msg="Creating a docker executor"
time="2019-01-15T01:05:13Z" level=info msg="Executor (version: v2.2.1, build_date: 2018-10-11T16:27:29Z) initialized with template:\narchiveLocation: {}\ninputs: {}\nmetadata:\n  labels:\n    appName: iris\n    argo: \"true\"\n    workloadId: wrguzieqiqtj6gic4wlx\n    workloadType: data-job\nname: wrguzieqiqtj6gic4wlx\noutputs: {}\nresource:\n  action: create\n  failureCondition: status.applicationState.state in (FAILED,SUBMISSION_FAILED,UNKNOWN)\n  manifest: '{\"kind\":\"SparkApplication\",\"apiVersion\":\"sparkoperator.k8s.io/v1alpha1\",\"metadata\":{\"name\":\"wrguzieqiqtj6gic4wlx\",\"namespace\":\"cortex\",\"creationTimestamp\":null,\"labels\":{\"appName\":\"iris\",\"workloadId\":\"wrguzieqiqtj6gic4wlx\",\"workloadType\":\"data-job\"},\"ownerReferences\":[{\"apiVersion\":\"argoproj.io/v1alpha1\",\"kind\":\"Workflow\",\"name\":\"argo-iris-7wvh7\",\"uid\":\"9bc7aee5-1861-11e9-a664-026185601f56\",\"blockOwnerDeletion\":false}]},\"spec\":{\"type\":\"Python\",\"mode\":\"cluster\",\"image\":\"764403040460.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest\",\"imagePullPolicy\":\"Always\",\"mainApplicationFile\":\"local:///src/spark_job/spark_job.py\",\"arguments\":[\"--workload-id=wrguzieqiqtj6gic4wlx\n    --context=s3://cortex-cluster-david/apps/iris/contexts/83b914617c417fbe26b59c7839d77914b43f02c48fb52c713a5d4c094dff0bd.msgpack\n    --cache-dir=/mnt/context --raw-features=ad88a38e98b74d6dfc1a718b51c3da769141f46ddd0cd8fb96ca9fd01a71521,9bcb141ed9f8ee4f52c55a40e16815d2a3eddbdc118cc81751e762c73134820,ffa5ac7dad3f546d5b867892d4bf820ba34e83362042c894dc86527295b9352,bacc1fe5d16b0dbbb3527ea9de82409f67b926d0243a549cb6ed844ef76cdd7,b02d30d78186521e2399ba6a93e0f20459eb72e661234c9cf88cd289a78af30\n    
--aggregates=5566592b7087b3cf822b7a7a56085fcfd1678b363db00fefefab725f7892d37,0c8177a3736da4987b7629346ca461eced6f090c40edbbe9b67595ca00f73ce,ea21d88863bf2fa675550ae8e57d301d247b311b8f7441e1a342f0803a352e8,9fdd4490123c08fc5da726e3ca597f9b0cf9f0f7b7762a5fca7ea7b036df33b,2a55250299f0a8935f8aafba8cef94e5f7776c8e4c480e2d7f42fc5ca336d36,d59e72278cf560705ddb8cd11dbbb9dd2592165d458f51ddefa425f35fe0a3d,77b9f029fcaac3f513e549042f2f64eee9e73ad0d2c7549ec167943f2adbd46,d3e9aaa136d3e338ac1654e4da3ce5fbe6f6e6517e375672efd9ce3b20bedef,a8a3c0802612f715243577910061579af8cc49bc44ffc8f580e9746c264a237\n    --transformed-features=8cc4ed1ed6c711dc775f51d8d129f5b0030ae109c9ae1605f0530c3ecdf511b,1d26f726a9c1879096147a7219418752140c6572a86d29a8a273ac8474c4622,79df34c5ba324066ec281a70ca9f45dd027e24a243bc137efc24eb1f928b708,ba8d0bc773530313233b0c51a27526ab6f13db97b02c9b8114bd9f0667793c1,15533d18f1de1b545c1a32dd6f6df24481f07fe314f78219ad6bab782276703\n    --training-datasets=27fd4e1f80338ab4fbe8f2c67d58b6d521d2576f3cd492ad1ace5c80d94e413\n    
--ingest\"],\"driver\":{\"cores\":10,\"memory\":\"512000k\",\"envSecretKeyRefs\":{\"AWS_ACCESS_KEY_ID\":{\"name\":\"aws-credentials\",\"key\":\"AWS_ACCESS_KEY_ID\"},\"AWS_SECRET_ACCESS_KEY\":{\"name\":\"aws-credentials\",\"key\":\"AWS_SECRET_ACCESS_KEY\"}},\"labels\":{\"appName\":\"iris\",\"userFacing\":\"true\",\"workloadId\":\"wrguzieqiqtj6gic4wlx\",\"workloadType\":\"data-job\"},\"podName\":\"wrguzieqiqtj6gic4wlx\",\"serviceAccount\":\"spark\"},\"executor\":{\"cores\":1,\"memory\":\"512000k\",\"envSecretKeyRefs\":{\"AWS_ACCESS_KEY_ID\":{\"name\":\"aws-credentials\",\"key\":\"AWS_ACCESS_KEY_ID\"},\"AWS_SECRET_ACCESS_KEY\":{\"name\":\"aws-credentials\",\"key\":\"AWS_SECRET_ACCESS_KEY\"}},\"labels\":{\"appName\":\"iris\",\"workloadId\":\"wrguzieqiqtj6gic4wlx\",\"workloadType\":\"data-job\"},\"instances\":1},\"deps\":{\"pyFiles\":[\"local:///src/spark_job/spark_util.py\",\"local:///src/lib/*.py\"]},\"restartPolicy\":{\"type\":\"Never\"},\"pythonVersion\":\"3\"},\"status\":{\"lastSubmissionAttemptTime\":null,\"completionTime\":null,\"driverInfo\":{},\"applicationState\":{\"state\":\"\",\"errorMessage\":\"\"}}}'\n  successCondition: status.applicationState.state in (COMPLETED)\n"
time="2019-01-15T01:05:13Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2019-01-15T01:05:13Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o name"
time="2019-01-15T01:05:17Z" level=info msg=sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx
time="2019-01-15T01:05:17Z" level=info msg="Waiting for conditions: status.applicationState.state in (COMPLETED)"
time="2019-01-15T01:05:17Z" level=info msg="Failing for conditions: status.applicationState.state in (FAILED,SUBMISSION_FAILED,UNKNOWN)"
time="2019-01-15T01:05:17Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:05:20Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:05:20Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:05:20Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:05:20Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:05:41Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:05:41Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:05:41Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:05:41Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:07:17Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"
time="2019-01-15T01:07:17Z" level=warning msg="Command for kubectl get -w for sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx exited. Getting return value using Wait"
time="2019-01-15T01:07:17Z" level=info msg="readJSon failed for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx but cmd.Wait for kubectl get -w command did not error"
time="2019-01-15T01:07:17Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx resulted in retryable error EOF"
time="2019-01-15T01:07:22Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:07:22Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:07:22Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:07:22Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:07:22Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:09:08Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"
time="2019-01-15T01:09:08Z" level=warning msg="Command for kubectl get -w for sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx exited. Getting return value using Wait"
time="2019-01-15T01:09:08Z" level=info msg="readJSon failed for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx but cmd.Wait for kubectl get -w command did not error"
time="2019-01-15T01:09:08Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx resulted in retryable error EOF"
time="2019-01-15T01:09:28Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:09:28Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:09:28Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:09:28Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:09:28Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:10:51Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"
time="2019-01-15T01:10:51Z" level=warning msg="Command for kubectl get -w for sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx exited. Getting return value using Wait"
time="2019-01-15T01:10:51Z" level=info msg="readJSon failed for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx but cmd.Wait for kubectl get -w command did not error"
time="2019-01-15T01:10:51Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx resulted in retryable error EOF"
time="2019-01-15T01:12:11Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:12:11Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:12:11Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:12:11Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:12:11Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:12:25Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:12:25Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:12:25Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:12:25Z" level=info msg="0/1 success conditions matched"
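
The pattern in the trace above (the `kubectl get -w` stream ends with EOF, then the watch is restarted after a pause) can be pulled out of the log mechanically. The following is a minimal sketch, assuming the log is available as a list of lines in the `time="..." level=... msg=...` format shown above; `parse_line` and `watch_restart_delays` are names made up for this example.

```python
import re
from datetime import datetime

LINE_RE = re.compile(r'^time="([^"]+)" level=(\w+) msg=(.*)$')


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
    return ts, match.group(2), match.group(3)


def watch_restart_delays(lines):
    """Seconds between each 'retryable error EOF' and the next watch command."""
    delays = []
    eof_at = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # e.g. continuation lines of the long manifest message
        ts, _level, msg = parsed
        if "retryable error EOF" in msg:
            eof_at = ts
        elif eof_at is not None and "kubectl get " in msg and " -w -o json" in msg:
            delays.append(int((ts - eof_at).total_seconds()))
            eof_at = None
    return delays
```

Applied to the trace above, this yields delays of 5, 20, and 80 seconds, which is consistent with a pause that grows between successive watch restarts.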

Labels

bug