Metrics & monitoring
In order to support monitoring of replication relationships, VolSync exports a number of metrics that can be scraped with Prometheus. These metrics permit monitoring whether volumes are “in sync” and how long the synchronization iterations take.
Available metrics
The following metrics are provided by VolSync for each replication object (source or destination):
- volsync_missed_intervals_total
This is a count of the number of times that a replication iteration failed to complete before the next scheduled start. This metric is only valid for objects that have a schedule (.spec.trigger.schedule) specified. For example, when using the rsync mover with a schedule on the source but not on the destination, only the metric for the source side is meaningful.
- volsync_sync_duration_seconds
This is a summary of the time required for each sync iteration. By monitoring this value it is possible to determine how much “slack” exists in the synchronization schedule (i.e., how much shorter the sync duration is than the interval between scheduled starts).
- volsync_volume_out_of_sync
This is a gauge that has the value of either “0” or “1”, with a “1” indicating that the volumes are not currently synchronized. This may be due to an error that is preventing synchronization or because the most recent synchronization iteration failed to complete prior to when the next should have started. This metric also requires a schedule to be defined.
Each of the above metrics includes the following labels to assist with monitoring and alerting:
- obj_name
This is the name of the VolSync CustomResource.
- obj_namespace
This is the Kubernetes Namespace that contains the CustomResource.
- role
This contains the value of either “source” or “destination” depending on whether the CR is a ReplicationSource or a ReplicationDestination.
- method
This indicates the synchronization method being used. Currently, this is either “rsync” or “rclone”.
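The labels above make it possible to pinpoint which object a given sample belongs to. As a minimal sketch (not part of VolSync; the function name out_of_sync_objects is hypothetical), the following shell function reads metrics in the Prometheus text format on stdin and prints the label set of every object whose volsync_volume_out_of_sync gauge is 1. It assumes that label values do not contain a closing brace.

```shell
# List every replication object whose volsync_volume_out_of_sync gauge is 1.
# Reads Prometheus text-format metrics on stdin.
# Returns non-zero if at least one object is out of sync.
out_of_sync_objects() {
  awk '
    /^volsync_volume_out_of_sync[{]/ {
      # The label set runs from the opening to the closing brace;
      # the sample value is the text after the closing brace.
      open_brace = index($0, "{")
      close_brace = index($0, "}")
      value = substr($0, close_brace + 2) + 0
      if (value == 1) {
        print substr($0, open_brace, close_brace - open_brace + 1)
        found = 1
      }
    }
    END { exit found ? 1 : 0 }
  '
}
```

The function could be fed from the metrics endpoint of the controller, for instance with curl -s http://127.0.0.1:8080/metrics | out_of_sync_objects, and its exit status used in a simple health check script.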
As an example, the raw data below comes from a single rsync-based relationship that is replicating data using the ReplicationSource dsrc in the srcns namespace to the ReplicationDestination dest in the dstns namespace.
$ curl -s http://127.0.0.1:8080/metrics | grep volsync
# HELP volsync_missed_intervals_total The number of times a synchronization failed to complete before the next scheduled start
# TYPE volsync_missed_intervals_total counter
volsync_missed_intervals_total{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 0
volsync_missed_intervals_total{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 0
# HELP volsync_sync_duration_seconds Duration of the synchronization interval in seconds
# TYPE volsync_sync_duration_seconds summary
volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.5"} 179.725047058
volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.9"} 544.86628289
volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.99"} 544.86628289
volsync_sync_duration_seconds_sum{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 828.711667153
volsync_sync_duration_seconds_count{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 3
volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.5"} 11.547060835
volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.9"} 12.013468222
volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.99"} 12.013468222
volsync_sync_duration_seconds_sum{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 33.317039014
volsync_sync_duration_seconds_count{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 3
# HELP volsync_volume_out_of_sync Set to 1 if the volume is not properly synchronized
# TYPE volsync_volume_out_of_sync gauge
volsync_volume_out_of_sync{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 0
volsync_volume_out_of_sync{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 0
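Since volsync_sync_duration_seconds is a summary, the mean iteration time for each object can be derived by dividing its _sum sample by its _count sample. The following is a minimal sketch of that calculation, assuming awk and sort are available; the function name mean_sync_duration is hypothetical and not something VolSync provides.

```shell
# Print the mean sync duration in seconds (sum / count) per replication object.
# Reads Prometheus text-format metrics on stdin.
mean_sync_duration() {
  awk '
    /^volsync_sync_duration_seconds_(sum|count)[{]/ {
      # Split "name{labels} value" into its three parts.
      open_brace = index($0, "{")
      close_brace = index($0, "}")
      name = substr($0, 1, open_brace - 1)
      labels = substr($0, open_brace, close_brace - open_brace + 1)
      value = substr($0, close_brace + 2) + 0
      if (name ~ /_sum$/) sum[labels] = value
      else count[labels] = value
    }
    END {
      # Skip objects that have not completed any iteration yet.
      for (l in count)
        if (count[l] > 0) printf "%s %.3f\n", l, sum[l] / count[l]
    }
  ' | sort
}
```

Piping the raw data above into this function (for example, curl -s http://127.0.0.1:8080/metrics | mean_sync_duration) gives a mean of about 276.237 seconds for dest and about 11.106 seconds for dsrc. Comparing these figures with the schedule interval gives a rough idea of the available slack.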
Obtaining metrics
The above metrics can be collected by Prometheus. If the cluster does not already have a running instance set to scrape metrics, one will need to be started.
Configuring Prometheus
The following steps start a simple Prometheus instance to scrape metrics from VolSync. Some platforms may already have a running Prometheus operator or instance, making these steps unnecessary.
Start the Prometheus operator:
$ kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.46.0/bundle.yaml
Start Prometheus by applying the following block of yaml via:
$ kubectl create ns volsync-system
$ kubectl -n volsync-system apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - configmaps
    verbs: ["get"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: volsync-system  # Change if necessary!
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      control-plane: volsync-controller
  resources:
    requests:
      memory: 400Mi
If necessary, create a monitoring configuration in the openshift-user-workload-monitoring namespace and enable user workload monitoring:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    # Allocate persistent storage for user Prometheus
    prometheus:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
    # Allocate persistent storage for user Thanos Ruler
    thanosRuler:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Allocate persistent storage for alertmanager
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
    # Enable user workload monitoring stack
    enableUserWorkload: true
    # Allocate persistent storage for cluster prometheus
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
Monitoring VolSync
The metrics port for VolSync is (by default) protected via kube-rbac-proxy. In order to grant Prometheus the ability to scrape the metrics, its ServiceAccount must be granted access to the volsync-metrics-reader ClusterRole. This can be accomplished with the following command (substitute the namespace and ServiceAccount name of the Prometheus server):
$ kubectl create clusterrolebinding metrics --clusterrole=volsync-metrics-reader --serviceaccount=<namespace>:<service-account-name>
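One way to confirm that the binding took effect is to ask the API server whether the ServiceAccount may read the metrics path. The sketch below wraps kubectl auth can-i in a hypothetical helper named check_metrics_access. It assumes that the volsync-metrics-reader ClusterRole grants get on the /metrics non-resource URL and that the calling user is allowed to impersonate ServiceAccounts.

```shell
# Ask the API server whether a ServiceAccount may GET the /metrics path.
# Usage: check_metrics_access <namespace> <service-account-name>
# kubectl prints "yes" or "no" and exits non-zero when access is denied.
check_metrics_access() {
  ns=$1
  sa=$2
  kubectl auth can-i get /metrics \
    --as="system:serviceaccount:${ns}:${sa}"
}
```

For the Prometheus instance created earlier, this would be invoked as check_metrics_access volsync-system prometheus.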
Optionally, authentication of the metrics port can be disabled by setting the Helm chart value metrics.disableAuth to true when deploying VolSync.
A ServiceMonitor needs to be defined in order to scrape metrics. If the ServiceMonitor CRD was defined in the cluster when the VolSync chart was deployed, this has already been added. If not, apply the following into the namespace where VolSync is deployed. Note that the control-plane labels may need to be adjusted.
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: volsync-monitor
  namespace: volsync-system
  labels:
    control-plane: volsync-controller
spec:
  endpoints:
    - interval: 30s
      path: /metrics
      port: https
      scheme: https
      tlsConfig:
        # Using self-signed cert for connection
        insecureSkipVerify: true
  selector:
    matchLabels:
      control-plane: volsync-controller