PoC use case
The PoC demonstrates two edges connected to a central cloud. Each edge comprises a metric generator whose metrics are scraped by Prometheus. Prometheus remote-writes the metrics to Thanos (running in the central cloud) for long-term storage and analysis. The remote write is intercepted by our processor proxy running at each edge location, and the processor applies various transformations to the collected metrics.
In this PoC we showcase two transformations:
- On the east cloud/edge, we showcase the need to control bandwidth utilization. We identify the increase in bandwidth utilization using alerts, and our manager triggers the addition of a filter transformation on the processor running on the east cloud/edge. This limits the volume of data sent from the east edge.
- On the west cloud/edge, we showcase the need to change the frequency of metric collection for a particular app/cluster. This scenario arises when, for example, an issue is identified at a node in a cluster and you need metrics for that node at a higher frequency to fix the issue. In the PoC, the issue is also identified with the help of alerts on the metrics.
The two PoC scenarios can also happen simultaneously. In both, we also showcase how the applied transformation is reverted once the issue is resolved. We instrument the issue by increasing the metric values and revert it by reducing the values again.
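To make the data path above concrete, the sketch below shows roughly how a Prometheus instance at one edge could be configured to scrape the metric generator and remote-write through the local processor proxy. The scrape target, port numbers, and the processor's receive path are illustrative assumptions; the actual values are defined by the PoC's compose files.

```yaml
# Illustrative sketch only -- endpoints, ports, and paths are assumed, not taken from the PoC.
global:
  scrape_interval: 30s                        # default collection frequency used in the demo

scrape_configs:
  - job_name: metricgen
    static_configs:
      - targets: ["metricgen1:8000"]          # hypothetical metric generator endpoint

remote_write:
  # Remote-write to the local processor proxy (not directly to Thanos), so the
  # processor can apply or remove transformations before forwarding to thanos-receive.
  - url: http://pmf_processsor1:9090/api/v1/receive   # hypothetical processor endpoint
```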
Environment Setup
The PoC requires Docker installed on the machine to test the scenario. The following containers are started when the PoC environment is brought up.
Central containers:
- thanos-receive
- thanos-query
- thanos-ruler
- ruler-config
- alertmanager
- manager
Edge containers:
- metricgen1,2
- prometheus1,2
- pmf_processsor1,2
Understanding the Rules specification
The rules are specified by the user in YAML format. The manager parses them and applies them to the Thanos ruler with the help of the ruler-config container. The rules file that the user specifies as part of the demo is `demo_2rules.yaml`. As per this file, we apply two rules:
- The first rule, `rule_1`, is applied to metrics received from the east edge. The rule is triggered when the metric `app_A_network_metric_0{IP="192.168.1.3"}` crosses the value of 200. The specification also mentions what needs to be done when the alert is triggered (`firing_action`). In the demo for this cloud, we add a new transform that filters out all other metrics and allows just `app_A_network_metric_0` for IP 192.168.1.3 from cloud east. The `resolved_action` specifies what is to be done when the alert is resolved; here, we simply remove the added transform.
- The second rule, `rule_2`, is applied to metrics received from the west edge. The rule is triggered when the metric `cluster_hardware_metric_0{node="0"}` crosses the value of 200. The `firing_action` is to increase the collection frequency of `cluster_hardware_metric_0` for `node:0` to 5 seconds.
Note that the actions (firing and resolved) are stored by the manager and applied to the corresponding edge processor when the alert is fired/resolved.
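The exact schema of `demo_2rules.yaml` is defined by the manager and is not reproduced here; the sketch below is only a rough illustration of the concepts described above (an alert expression plus a `firing_action` and a `resolved_action`). All field names and transform parameters shown are assumptions, not the actual file contents.

```yaml
# Illustrative sketch only -- field names and values are assumed, not copied from demo_2rules.yaml.
rules:
  - name: rule_1                                            # east edge: bandwidth control
    expr: app_A_network_metric_0{IP="192.168.1.3"} > 200
    firing_action:
      add_transform:
        type: filter                                        # allow only this metric/IP, drop the rest
        allow: app_A_network_metric_0{IP="192.168.1.3"}
    resolved_action:
      remove_transform:
        type: filter
  - name: rule_2                                            # west edge: higher collection frequency
    expr: cluster_hardware_metric_0{node="0"} > 200
    firing_action:
      add_transform:
        type: frequency
        metric: cluster_hardware_metric_0{node="0"}
        interval: 5s
```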
Running the PoC story
- Bring up the PoC environment:
  `docker-compose -f docker-compose-quay.yml up -d`
- Add the rules and actions (in the form of transformations) corresponding to the rules:
  `curl -X POST --data-binary @demo_2rules.yaml -H "Content-type: text/x-yaml" http://0.0.0.0:5010/api/v1/rules`
- Confirm the two rules are added correctly in the Thanos ruler UI: http://0.0.0.0:10903/rules
- Confirm the metrics are flowing correctly in the Thanos query UI: http://0.0.0.0:19192/
  Search for `cluster_hardware_metric_0` and `app_A_network_metric_0`. You should see 6 metrics (3 for each edge) for each of these. The edge can be identified by the `processor` label in the metric.
- Next, we instrument the issue scenario with a metric value change. We apply the value change to the metricgen for each edge (metricgen1 and metricgen2):
  - For the issue at the east edge:
    `curl -X POST --data-binary @change_appmetric.yml -H "Content-type: text/x-yaml" http://0.0.0.0:5002`
  - For the issue at the west edge:
    `curl -X POST --data-binary @change_hwmetric.yml -H "Content-type: text/x-yaml" http://0.0.0.0:5003`
- You can visualize the change in the metric values in the Thanos query UI (in graph mode).
- The alert should also be firing and can be seen in the Thanos ruler UI (http://0.0.0.0:10903/alerts). P.S. Wait for 30 sec to 1 min before checking this step.
- The manager triggers the addition of the corresponding transforms. To check the added transforms, go to the Manager API at http://0.0.0.0:5010/apidocs/#/Processor%20Configuration/getProcessorConfig and use `east` and `west` as the processor id. P.S. This step is just for extra confirmation and is optional.
- To visualize the transformations, use the Thanos query UI (in graph mode):
  - If you search `app_A_network_metric_0{processor="east"}`, you will see that the metric with label `IP:192.168.1.3` has a changing value, while the other metrics with labels `IP:192.168.1.1` and `IP:192.168.1.2` show no change (a straight line, or they stop completely). This is because we have filtered and allowed `app_A_network_metric_0` only for the app with `IP:192.168.1.3`.
  - If you search `cluster_hardware_metric_0{processor="west"}`, you will see the metric with label `node:'0'` showing a nice sinusoidal wave, because its value is changing every 5 sec. The other metrics with labels `node:'1'` and `node:'2'` will be blocky, showing that their frequency is still 30 sec. This showcases the transformation happening in an automated fashion.
- We will revert the issue only in edge cloud east, to showcase that we can act on a specific cloud as well:
  `curl -X POST --data-binary @change_appmetricRESET.yml -H "Content-type: text/x-yaml" http://0.0.0.0:5002`
  After some time, you should see alert `rule_1` resolved in the Thanos ruler UI (http://0.0.0.0:10903/alerts) and should see the metric `app_A_network_metric_0{processor="east"}` coming in again for the labels `IP:192.168.1.1` and `IP:192.168.1.2` as well, demonstrating the removal of the `filter` transform from edge cloud 1 (east).
- P.S. One can revert the transformation applied to the west edge as well using the command below:
  `curl -X POST --data-binary @change_hwmetricRESET.yml -H "Content-type: text/x-yaml" http://0.0.0.0:5003`
- Note: The PoC can also be tested using the OTel Collector instead of Prometheus. For this, the only step that changes is the first one ("Bring up the PoC environment"), which becomes:
  `docker-compose -f docker-compose-otel.yml up -d`
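In the OTel variant, the collector takes over the scrape-and-forward role that Prometheus plays above. The sketch below shows the rough shape such a collector configuration could take, assuming a Prometheus receiver and a Prometheus remote-write exporter; the endpoints, ports, and intervals are illustrative assumptions, and the actual configuration ships with docker-compose-otel.yml.

```yaml
# Illustrative sketch only -- endpoints, ports, and intervals are assumed, not taken from the PoC.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: metricgen
          scrape_interval: 30s
          static_configs:
            - targets: ["metricgen1:8000"]    # hypothetical metric generator endpoint

exporters:
  prometheusremotewrite:
    # Forward via the edge processor proxy so transformations can still be applied.
    endpoint: http://pmf_processsor1:9090/api/v1/receive    # hypothetical processor endpoint

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]
```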