Added helm chart for observability #21

Open · wants to merge 1 commit into base: master
4 changes: 2 additions & 2 deletions README.md
@@ -113,11 +113,11 @@ Hyperflow provides two key Helm charts:
To run a sample workflow on a clean Kubernetes cluster, you should do the following:
- Install the `hyperflow-ops` chart
```
helm upgrade --dependency-update -i hf-ops hyperflow-ops
helm upgrade --dependency-update -i hf-ops charts/hyperflow-ops
```
- Install the `hyperflow-run` chart (preferably in a separate namespace; see the sketch below)
```
helm upgrade --dependency-update -i hf-run-montage hyperflow-run
helm upgrade --dependency-update -i hf-run-montage charts/hyperflow-run
```
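For example, to put the run chart in its own namespace, a minimal sketch (the namespace name `hf-run` is just an example, not something the charts require):
```
helm upgrade --dependency-update -i hf-run-montage charts/hyperflow-run \
  --namespace hf-run --create-namespace
```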
- Once all pods are up and running or completed, you can manually run the workflow as follows:
```
4 changes: 2 additions & 2 deletions charts/hyperflow-engine/templates/deployment.yml
@@ -64,9 +64,9 @@ spec:
- name: HF_VAR_ENABLE_TRACING
value: "false"
- name: HF_VAR_ENABLE_OTEL
value: "false"
value: "1"
- name: HF_VAR_OPT_URL
value: nil
value: "http://hf-obs-opentelemetry-collector"
- name: HF_VAR_function
# The source of this function can be found here
# https://github.com/hyperflow-wms/hyperflow/blob/master/functions/kubernetes/k8sCommand.js
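This hunk switches HF_VAR_ENABLE_OTEL on and points HF_VAR_OPT_URL at the collector service created by the new observability chart. A quick way to sanity-check that the engine's namespace can resolve that service name is a throwaway DNS lookup; a sketch (`dns-check` is just a disposable pod name, and if the engine and the observability release live in different namespaces the bare name needs a `.namespace` suffix):
```
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup hf-obs-opentelemetry-collector
```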
20 changes: 20 additions & 0 deletions charts/hyperflow-observability/Chart.yaml
@@ -0,0 +1,20 @@
apiVersion: v2
name: hyperflow-observability
description: Helm chart to deploy observability stack
type: application
version: 0.1.0
appVersion: "1.0"

dependencies:
- name: opensearch
version: "2.34.0"
repository: https://opensearch-project.github.io/helm-charts/
- name: opensearch-dashboards
version: "2.30.0"
repository: https://opensearch-project.github.io/helm-charts/
- name: data-prepper
version: "0.3.1"
repository: https://opensearch-project.github.io/helm-charts/
- name: opentelemetry-collector
version: "0.126.0"
repository: https://open-telemetry.github.io/opentelemetry-helm-charts
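The dependencies above are resolved from the OpenSearch and OpenTelemetry Helm repositories. The `--dependency-update` flag used in the install commands pulls them automatically; a sketch of fetching them manually ahead of time, in case you want to pre-cache the subcharts:
```
helm repo add opensearch https://opensearch-project.github.io/helm-charts/
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm dependency update charts/hyperflow-observability
```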
23 changes: 23 additions & 0 deletions charts/hyperflow-observability/README.md
@@ -0,0 +1,23 @@
# HyperFlow K8S monitoring

```
helm upgrade --dependency-update -i hf-obs charts/hyperflow-observability
```

## Open OpenSearch Dashboards

```
kubectl port-forward svc/hf-obs-opensearch-dashboards 5601:5601
```

Navigate to
http://localhost:5601/

Go to Dashboards Management -> Index Patterns

Create index patterns:
- hyperflow_traces
- hyperflow_metrics
- hyperflow_logs

Go to Discover and choose one of the new index patterns as the source.
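If you prefer to script the index-pattern setup instead of clicking through the UI, a hedged sketch using the Dashboards saved-objects API (assumes the port-forward above is still active; the `osd-xsrf` header is required by OpenSearch Dashboards, and the pattern titles match the indices written by the Data Prepper pipelines):
```
for idx in hyperflow_traces hyperflow_metrics hyperflow_logs; do
  curl -s -X POST "http://localhost:5601/api/saved_objects/index-pattern/${idx}" \
    -H "osd-xsrf: true" -H "Content-Type: application/json" \
    -d "{\"attributes\": {\"title\": \"${idx}\"}}"
done
```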
40 changes: 40 additions & 0 deletions charts/hyperflow-observability/templates/metric-rules.yaml
@@ -0,0 +1,40 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: hyperflow-rules
labels:
app: kube-prometheus-stack
spec:
groups:
- name: hyperflow-deployment-metrics
interval: 1s
rules:
- record: hyperflow_deployment_status_replicas_available
expr: |
kube_deployment_status_replicas_available
* on(namespace, deployment) group_left(label_origin)
kube_deployment_labels{label_origin="hyperflow"}
- name: node_cpu_usage
interval: 5s
rules:
- record: node_cpu_usage_percent
Reviewer comment (Contributor): @balis I would recommend a 1-2 sentence explanation per metric of what it does, so the next people can have a less steep learning curve.

expr: |
100 * (
sum by (node) (
rate(container_cpu_usage_seconds_total{container!=""}[1m])
)
/
sum by (node) (
kube_node_status_allocatable{resource="cpu", unit="core"}
)
)
- name: node_memory_usage
interval: 5s
rules:
- record: node_memory_usage_percent
expr: |
(
sum(container_memory_working_set_bytes) by (node)
/
sum(kube_node_status_allocatable{resource="memory"}) by (node)
) * 100
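Once the Prometheus operator loads this rule, the recorded series can be spot-checked against Prometheus before they ever reach the collector; a sketch, assuming the Prometheus service is the `monitoring-prometheus:9090` one that the collector scrape configs below federate from:
```
kubectl port-forward svc/monitoring-prometheus 9090:9090 &
curl -s 'http://localhost:9090/api/v1/query?query=node_cpu_usage_percent'
curl -s 'http://localhost:9090/api/v1/query?query=hyperflow_deployment_status_replicas_available'
```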
220 changes: 220 additions & 0 deletions charts/hyperflow-observability/values.yaml
@@ -0,0 +1,220 @@
opensearch:
replicas: 1

config:
opensearch.yml: |
Reviewer comment (Contributor): @balis this looks very worrying, we need to talk about this

cluster.name: opensearch-cluster
network.host: 0.0.0.0
plugins:
security:
disabled: true
extraEnvs:
- name: OPENSEARCH_JAVA_OPTS
value: "-Xms512m -Xmx512m"
- name: OPENSEARCH_INITIAL_ADMIN_PASSWORD
value: "Hyperflow1!"

opensearch-dashboards:
opensearchHosts: "http://opensearch-cluster-master:9200"

extraEnvs:
- name: DISABLE_SECURITY_DASHBOARDS_PLUGIN
value: "true"

resources:
requests:
cpu: "200m"
memory: 0.5Gi
limits:
cpu: "1"
memory: 3Gi

data-prepper:
pipelineConfig:
enabled: true
config:
entry-pipeline:
delay: "100"
source:
otel_trace_source:
ssl: false
sink:
- pipeline:
name: "raw-pipeline"
- pipeline:
name: "service-map-pipeline"
raw-pipeline:
source:
pipeline:
name: "entry-pipeline"
processor:
- otel_trace_raw:
sink:
- opensearch:
hosts: [ "http://opensearch-cluster-master:9200" ]
insecure: true
username: admin
password: "Hyperflow1!"
index_type: custom
index: hyperflow_traces
service-map-pipeline:
delay: "100"
source:
pipeline:
name: "entry-pipeline"
processor:
- service_map_stateful:
sink:
- opensearch:
hosts: [ "http://opensearch-cluster-master:9200" ]
insecure: true
username: admin
password: "Hyperflow1!"
index_type: trace-analytics-service-map

metrics-pipeline:
source:
otel_metrics_source:
ssl: false
sink:
- opensearch:
hosts: [ "http://opensearch-cluster-master:9200" ]
insecure: true
username: admin
password: "Hyperflow1!"
index_type: custom
index: hyperflow_metrics

logs-pipeline:
source:
otel_logs_source:
ssl: false
sink:
- opensearch:
hosts: [ "http://opensearch-cluster-master:9200" ]
insecure: true
username: admin
password: "Hyperflow1!"
index: hyperflow_logs

opentelemetry-collector:
mode: "statefulset"

image:
repository: "otel/opentelemetry-collector"
tag: "0.123.0"

command:
name: "otelcol"

resources:
requests:
cpu: 1
memory: 5Gi
limits:
cpu: 2
memory: 5Gi

config:
extensions:
health_check:
endpoint: 0.0.0.0:13133
pprof:
endpoint: 0.0.0.0:1777
zpages:
endpoint: 0.0.0.0:55679

receivers:
otlp:
protocols:
http:
endpoint: 0.0.0.0:4318
prometheus:
config:
scrape_configs:
- job_name: "kube-state-metrics"
scrape_interval: 1s
metrics_path: /federate
honor_labels: true
params:
match[]:
- '{label_origin="hyperflow"}'
static_configs:
- targets: [ "monitoring-prometheus:9090" ]
metric_relabel_configs:
- source_labels: [ __name__ ]
regex: "kube_deployment_labels"
action: drop
- job_name: "cpu-by-node"
scrape_interval: 5s
metrics_path: /federate
honor_labels: true
params:
match[]:
- 'node_cpu_usage_percent'
static_configs:
- targets: [ "monitoring-prometheus:9090" ]
- job_name: "memory-by-node"
scrape_interval: 5s
metrics_path: /federate
honor_labels: true
params:
match[]:
- 'node_memory_usage_percent'
static_configs:
- targets: [ "monitoring-prometheus:9090" ]
- job_name: "rabbitmq-exporter"
scrape_interval: 1s
static_configs:
- targets: [ "hf-ops-prometheus-rabbitmq-exporter:9419" ]
metric_relabel_configs:
- source_labels: [ __name__ ]
regex: "rabbitmq_queue_messages_ready"
action: keep

processors:
batch: { }
Reviewer comment (Contributor): @balis are you sure this has to be defined as empty?

filter:
metrics:
exclude:
match_type: regexp
metric_names:
- "up"
- "scrape_.*"


exporters:
otlp/traces:
endpoint: hf-obs-data-prepper:21890
tls:
insecure: true
insecure_skip_verify: true
otlp/metrics:
endpoint: hf-obs-data-prepper:21891
tls:
insecure: true
insecure_skip_verify: true
otlp/logs:
endpoint: hf-obs-data-prepper:21892
tls:
insecure: true
insecure_skip_verify: true
debug:
verbosity: detailed

service:
pipelines:
traces:
receivers: [ otlp ]
processors: [ batch ]
exporters: [ debug, otlp/traces ]
metrics:
receivers: [ otlp, prometheus ]
processors: [ batch, filter ]
exporters: [ debug, otlp/metrics ]
logs:
receivers: [ otlp ]
processors: [ batch ]
exporters: [ debug, otlp/logs ]

extensions: [ health_check, pprof, zpages ]
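To verify that these pipelines are actually landing data in OpenSearch, a sketch that port-forwards the OpenSearch service and lists the hyperflow indices (no credentials are needed here because the values above disable the security plugin; the service name is assumed to match the `opensearch-cluster-master` host used by the Data Prepper sinks):
```
kubectl port-forward svc/opensearch-cluster-master 9200:9200 &
curl -s 'http://localhost:9200/_cat/indices/hyperflow_*?v'
```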
2 changes: 1 addition & 1 deletion charts/hyperflow-ops/values.yaml
@@ -6,7 +6,7 @@ worker-pools:
enable-rabbitmq: &enable-rabbit-mq true
enable-kube-prometheus-stack: &enable-kube-prometheus-stack true
enable-alert-manager: &enable-alert-manager false
enable-grafana: &enable-grafana true
enable-grafana: &enable-grafana false
enable-prometheus-operator: &enable-prometheus-operator true
enable-prometheus: &enable-prometheus true

22 changes: 22 additions & 0 deletions charts/hyperflow-run/values.yaml
@@ -173,6 +173,8 @@ hyperflow-engine:
value: "${enableTracing}"
- name: HF_VAR_ENABLE_OTEL
value: "${enableOtel}"
- name: HF_VAR_OPT_URL
value: "http://hf-obs-opentelemetry-collector"
Reviewer comment (Contributor): @balis I don't like the fact that this is defined both here and in c12659e#diff-7800e510fef5761baa4ff5930e280adbc39c087c52583ca395d8aa5d38c86dc6R69; we should talk about why it is in two places.

- name: HF_VAR_OT_PARENT_ID
value: "${optParentId}"
- name: HF_VAR_OT_TRACE_ID
@@ -197,6 +199,26 @@ hyperflow-engine:
valueFrom:
fieldRef:
fieldPath: spec.serviceAccountName
- name: HF_LOG_CPU_REQUEST
valueFrom:
resourceFieldRef:
containerName: test
resource: requests.cpu
- name: HF_LOG_CPU_LIMIT
valueFrom:
resourceFieldRef:
containerName: test
resource: limits.cpu
- name: HF_LOG_MEM_REQUEST
valueFrom:
resourceFieldRef:
containerName: test
resource: requests.memory
- name: HF_LOG_MEM_LIMIT
valueFrom:
resourceFieldRef:
containerName: test
resource: limits.memory
- name: HF_VAR_FS_MONIT_ENABLED
value: "0"
- name: HF_VAR_FS_MONIT_COMMAND
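The HF_LOG_* variables added above use the downward API (`resourceFieldRef`) to expose the container's own CPU and memory requests and limits as environment variables. A quick sketch to confirm they are populated in a running engine pod (`deploy/hyperflow-engine` is an assumed name here; substitute whatever the engine Deployment is actually called):
```
kubectl exec deploy/hyperflow-engine -- printenv | grep '^HF_LOG_'
```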