Metrics & monitoring

In order to support monitoring of replication relationships, VolSync exports a number of metrics that can be scraped with Prometheus. These metrics permit monitoring whether volumes are “in sync” and how long the synchronization iterations take.

Available metrics

The following metrics are provided by VolSync for each replication object (source or destination):

volsync_missed_intervals_total: This is a count of the number of times that a replication iteration failed to complete before the next scheduled start. This metric is only valid for objects that have a schedule (.spec.trigger.schedule) specified. For example, when using the rsync mover with a schedule on the source but not on the destination, only the metric for the source side is meaningful.
volsync_sync_duration_seconds: This is a summary of the time required for each sync iteration. By monitoring this value it is possible to determine how much “slack” exists in the synchronization schedule (i.e., how much less is the sync duration than the schedule frequency).
volsync_volume_out_of_sync: This is a gauge that has the value of either “0” or “1”, with a “1” indicating that the volumes are not currently synchronized. This may be due to an error that is preventing synchronization or because the most recent synchronization iteration failed to complete prior to when the next should have started. This metric also requires a schedule to be defined.

Each of the above metrics include the following labels to assist with monitoring and alerting:

obj_name: This is the name of the VolSync CustomResource
obj_namespace: This is the Kubernetes Namespace that contains the CustomResource
role: This contains the value of either “source” or “destination” depending on whether the CR is a ReplicationSource or a ReplicationDestination.
method: This indicates the synchronization method being used. Currently, “rsync” or “rclone”.

As an example, the below raw data comes from a single rsync-based relationship that is replicating data using the ReplicationSource dsrc in the srcns namespace to the ReplicationDestination dest in the dstns namespace.

Example raw metrics data

 $ curl -s http://127.0.0.1:8080/metrics | grep volsync

 # HELP volsync_missed_intervals_total The number of times a synchronization failed to complete before the next scheduled start
 # TYPE volsync_missed_intervals_total counter
 volsync_missed_intervals_total{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 0
 volsync_missed_intervals_total{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 0
 # HELP volsync_sync_duration_seconds Duration of the synchronization interval in seconds
 # TYPE volsync_sync_duration_seconds summary
 volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.5"} 179.725047058
 volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.9"} 544.86628289
 volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.99"} 544.86628289
 volsync_sync_duration_seconds_sum{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 828.711667153
 volsync_sync_duration_seconds_count{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 3
 volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.5"} 11.547060835
 volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.9"} 12.013468222
 volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.99"} 12.013468222
 volsync_sync_duration_seconds_sum{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 33.317039014
 volsync_sync_duration_seconds_count{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 3
 # HELP volsync_volume_out_of_sync Set to 1 if the volume is not properly synchronized
 # TYPE volsync_volume_out_of_sync gauge
 volsync_volume_out_of_sync{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 0
 volsync_volume_out_of_sync{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 0

Obtaining metrics

The above metrics can be collected by Prometheus. If the cluster does not already have a running instance set to scrape metrics, one will need to be started.

Configuring Prometheus

The following steps start a simple Prometheus instance to scrape metrics from VolSync. Some platforms may already have a running Prometheus operator or instance, making these steps unnecessary.

Start the Prometheus operator:

$ kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.46.0/bundle.yaml

Start Prometheus by applying the following block of yaml via:

$ kubectl create ns volsync-system
$ kubectl -n volsync-system apply -f -

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - configmaps
    verbs: ["get"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: volsync-system  # Change if necessary!
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      control-plane: volsync-controller
  resources:
    requests:
      memory: 400Mi

If necessary, create a monitoring configuration in the openshift-user-workload-monitoring namespace and enable user workload monitoring:

Example user workload monitoring configuration

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    # Allocate persistent storage for user Prometheus
    prometheus:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
    # Allocate persistent storage for user Thanos Ruler
    thanosRuler:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi

Enabling user workload monitoring

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Allocate persistent storage for alertmanager
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
    # Enable user workload monitoring stack
    enableUserWorkload: true
    # Allocate persistent storage for cluster prometheus
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi

Monitoring VolSync

The metrics port for VolSync is (by default) protected via kube-auth-proxy. In order to grant Prometheus the ability to scrape the metrics, its ServiceAccount must be granted access to the volsync-metrics-reader ClusterRole. This can be accomplished by (substitute in the namespace & SA name of the Prometheus server):

$ kubectl create clusterrolebinding metrics --clusterrole=volsync-metrics-reader --serviceaccount=<namespace>:<service-account-name>

Optionally, authentication of the metrics port can be disabled by setting the Helm chart value metrics.disableAuth to false when deploying VolSync.

A ServiceMonitor needs to be defined in order to scrape metrics. If the ServiceMonitor CRD was defined in the cluster when the VolSync chart was deployed, this has already been added. If not, apply the following into the namespace where VolSync is deployed. Note that the control-plane labels may need to be adjusted.

VolSync ServiceMonitor

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: volsync-monitor
  namespace: volsync-system
  labels:
    control-plane: volsync-controller
spec:
  endpoints:
    - interval: 30s
      path: /metrics
      port: https
      scheme: https
      tlsConfig:
        # Using self-signed cert for connection
        insecureSkipVerify: true
  selector:
    matchLabels:
      control-plane: volsync-controller