
Reliable Kubernetes on a Raspberry Pi Cluster: Monitoring

Scott Jones
Published in CodeX · Jan 16, 2021


If you have followed the previous articles in this series, you will have a 3-node k3s cluster running on Raspberry Pis. But how do you know it's working? How do you know when a node is struggling, or worse, down? Monitoring has been crucial to the health of my cluster, pointing out the pain points where improvements were needed. In this article I will work through how I set it up so you can follow along with your own cluster.

Part 1: Introduction
Part 2: The Foundations
Part 3: Storage
Part 4: Monitoring
Part 5: Security

Node-Exporter

The first critical piece of the puzzle is a way of exporting health information from each node. Node-Exporter is a brilliant utility you can install as a service and leave running; it exposes hardware and OS metrics (CPU, memory, disk, network) over HTTP in a format Prometheus can scrape. You want to do this on every node in your cluster.
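For reference, once it is running, Node-Exporter serves plain-text metrics on port 9100 (its default). A few sample lines, with illustrative values only:

node_cpu_seconds_total{cpu="0",mode="idle"} 389218.54
node_memory_MemAvailable_bytes 2.1473e+09
node_filesystem_avail_bytes{mountpoint="/"} 1.0225e+10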

First, you need to download and extract it.

$ curl -SL https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-armv7.tar.gz > node_exporter.tar.gz && sudo tar -xvf node_exporter.tar.gz -C /usr/local/bin/ --strip-components=1
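Note that this grabs the armv7 build. If your Pis are running a 64-bit OS, you will likely want the linux-arm64 archive from the same releases page instead; a quick way to check which one you need:

$ uname -m
# armv7l  -> linux-armv7 build
# aarch64 -> linux-arm64 build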

Once we have it, we need to set it up as a systemd service to ensure it restarts after a reboot. Create a file /etc/systemd/system/nodeexporter.service:

[Unit]
Description=NodeExporter
[Service]
TimeoutStartSec=0
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
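Optionally, if you want systemd to bring the exporter back up should it ever crash, you can also add a restart policy under the [Service] section (an optional extra, not required for the rest of this guide):

Restart=on-failure
RestartSec=5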

Then you need to register it:

$ sudo systemctl daemon-reload \
&& sudo systemctl enable nodeexporter \
&& sudo systemctl start nodeexporter
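Before moving on, it is worth confirming the service is healthy and actually serving metrics. Assuming the defaults above:

$ sudo systemctl status nodeexporter
$ curl -s http://localhost:9100/metrics | head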

Prometheus

We now need to configure and deploy Prometheus to our cluster. Thankfully, this is as simple as a single file. Create prometheus.yaml as below:

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    app: prometheus
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yaml: |
    global:
      scrape_interval: 15s
      external_labels:
        monitor: 'k3s-monitor'
    scrape_configs:
      - job_name: 'prometheus'
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'K3S'
        static_configs:
          - targets: ['k3s-master:9100', 'k3s-node1:9100', 'k3s-node2:9100']
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus
          volumeMounts:
            - name: config-volume
              mountPath: /etc/prometheus/prometheus.yml
              subPath: prometheus.yaml
          ports:
            - containerPort: 9090
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
---
kind: Service
apiVersion: v1
metadata:
  namespace: monitoring
  name: prometheus-service
spec:
  selector:
    app: prometheus
  ports:
    - name: promui
      protocol: TCP
      port: 9090
      targetPort: 9090

Apply this in the usual way:

$ sudo kubectl apply -f prometheus.yaml

This sets up our scrape config for Prometheus and deploys it into our cluster in the monitoring namespace. We create a service so other pods in the cluster can access it, but we don't give it a load balancer or ingress route because we explicitly do not want to expose it outside the cluster. Check it is up and running with the following command:

$ sudo kubectl get pods -n monitoring

If all has been successful, you will get output like the one below, with the Prometheus pod shown as Running.

Successfully running prometheus
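Since the service is only reachable from inside the cluster, a temporary port-forward is a handy way to open the Prometheus UI and confirm all three node targets are up (a quick check, assuming the service name from the manifest above):

$ sudo kubectl port-forward -n monitoring svc/prometheus-service 9090:9090
# then browse to http://localhost:9090/targets on that machine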

Now that we have something collating all our data, we need something to display it.

Grafana

Grafana is the perfect tool for visualizing all this data from within our cluster. It has built-in support for Prometheus, and a large library of community-created dashboards covering most of the standard monitoring you are going to want to do.

First things first, we need to create our manifest file, grafana.yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-nfs-volume
  namespace: monitoring
  labels:
    directory: grafana
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: slow
  nfs:
    path: <<NFS-SERVER-PATH>>
    server: <<NFS-SERVER-CLUSTER-IP>>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-nfs-claim
  namespace: monitoring
spec:
  storageClassName: slow
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      directory: grafana
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
      name: grafana
    spec:
      securityContext:
        runAsUser: 1001
        runAsGroup: 1001
      containers:
        - name: grafana
          image: grafana/grafana
          imagePullPolicy: Always
          volumeMounts:
            - name: grafana-nfs-volume
              mountPath: "/var/lib/grafana"
      volumes:
        - name: grafana-nfs-volume
          persistentVolumeClaim:
            claimName: grafana-nfs-claim
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: monitoring
spec:
  selector:
    app: grafana
  ports:
    - port: 3000
      name: grafana
      protocol: TCP
      targetPort: 3000
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: grafana-route
  namespace: monitoring
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`grafana.internal`)
      kind: Rule
      services:
        - name: grafana-service
          port: 3000
  tls:
    certResolver: cloudflare

There is one thing you will need to update in the above YAML: the persistent storage. Replace the <<NFS-SERVER-PATH>> and <<NFS-SERVER-CLUSTER-IP>> placeholders with the export path and cluster IP of the NFS server we set up earlier.
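For example, the nfs block in the PersistentVolume might end up looking something like this (hypothetical values, substitute your own):

nfs:
  path: /mnt/storage/grafana
  server: 10.43.0.15

Once you have done this, apply it as normal: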

$ sudo kubectl apply -f grafana.yaml

When it has successfully started up, listing the pods will show one more running pod:

$ sudo kubectl get pods -n monitoring
Get pods output for 2 running pods

You should now be able to go to https://grafana.internal and see a running Grafana instance.

Grafana login page
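If the page does not load, it is worth confirming that the Traefik route was actually created (a quick check, assuming the Traefik CRDs set up in the earlier articles):

$ sudo kubectl get ingressroute -n monitoring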

The default username and password are admin and admin. This will need changing, and obviously there is more you can do to secure the instance properly (a dedicated user, OAuth, and so on), but securing it further is out of scope for this article.

The first thing you want to do here is create a new data source; there is a button on the home dashboard to do that. You want a Prometheus data source, and the only thing you have to input is the URL: http://prometheus-service:9090. Hit Save & Test and you should see a success message.

Successfully adding a data source
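If you would rather keep this configuration in version control than click through the UI, Grafana can also load data sources from provisioning files mounted under /etc/grafana/provisioning/datasources/. A minimal sketch of such a file (this provisioning approach is an alternative, not part of the manifests above):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-service:9090
    isDefault: true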

Once we have a data source, we need a dashboard to display it. Head over to the dashboard import page (https://grafana.internal/dashboard/import). We are going to use the Prometheus Node Exporter Full dashboard (https://grafana.com/grafana/dashboards/1860), so drop that URL (or just its ID, 1860) into the import box and hit Load.

Importing our dashboard

Select your Prometheus data source, hit Import, and there we have it: a dashboard with which we can monitor the state of our cluster. Be sure to mark it as a favorite so it is easier to find later.

Working dashboard
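The dashboard covers most day-to-day needs, but you can also query Prometheus directly when digging into a specific problem. For example, this PromQL expression (using metric names exposed by Node-Exporter) shows available memory as a fraction of total on each node:

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes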

We now have a way of viewing cluster health and a starting point for diagnosis should anything actually go awry in the cluster. It is not the full picture, since we do not yet have application-specific metrics, but node health is a major step towards being your own sysadmin! Next time we will take a look at security, ensuring access to the things running on your cluster can be configured and locked down.
