Ops Guides

A collection of guides made for Ops.

1 - Deployment

Learn to deploy Chall-Manager, either for production or development purposes.

You can deploy the Chall-Manager in many ways. The following table summarizes the properties of each one.

Name                       Maintained   Isolation   Scalable   Janitor
Kubernetes (with Pulumi)
Kubernetes
Docker                                               ✅¹         ❌²
Binary                                                           ❌²

¹ Autoscaling is possible with an orchestrator (e.g. Docker Swarm).

² Janitoring could be configured through a cron service on the host machine.

Kubernetes (with Pulumi)

This deployment strategy guarantees you a valid infrastructure with regard to our functionalities and security guidelines. Moreover, if you are afraid of Pulumi you’ll have trouble creating scenarios, so it’s a good place to start!

The requirements are:

  • a distributed block storage solution such as Longhorn, if you want replicas.
  • an etcd cluster, if you want to scale.
  • an OpenTelemetry Collector, if you want telemetry data.
  • an origin namespace in which the chall-manager will run.
# Get the repository and its own Pulumi factory
git clone git@github.com:ctfer-io/chall-manager.git
cd chall-manager/deploy

# Use it directly!
# Don't forget to configure your stack if necessary (see the example below).
# Refer to Pulumi's documentation if needed.
pulumi up
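If your stack needs configuration, a minimal sketch looks like the following; the available configuration keys are defined by the factory itself, so list them with pulumi config or check the deploy directory before setting anything.

# Prepare a dedicated stack and set its configuration.
# The key names are placeholders, not actual chall-manager configuration keys.
pulumi stack init production   # the stack name is arbitrary
pulumi config set <key> <value>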

Now, you’re done!

Microservices architecture of chall-manager deployed in a Kubernetes cluster.

Kubernetes

With this deployment strategy, you are embracing the hard path of setting up chall-manager for production. You’ll have to handle the functionalities and the security yourself, and you won’t implement variability easily. We still highly recommend deploying with Pulumi, but if you love YAML, here is the doc.

The requirements are:

  • a distributed block storage solution such as Longhorn, if you want replicas.
  • an etcd cluster, if you want to scale.
  • an OpenTelemetry Collector, if you want telemetry data.
  • an origin namespace in which the chall-manager will run (see the command right after this list).
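The manifests below run chall-manager itself in this origin namespace, named source-ns throughout this guide (an arbitrary name you can change consistently); create it first if it does not already exist.

kubectl create namespace source-ns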

We’ll deploy the following:

  • a target Namespace to deploy instances into.
  • a ServiceAccount for chall-manager to deploy instances in the target namespace.
  • a Role and its RoleBinding to assign the permissions required for the ServiceAccount to deploy resources. Please do not grant non-namespaced permissions, as they would enable a CTF player to pivot into the cluster and thus break isolation.
  • a set of 4 NetworkPolicies to ensure security by default in the target namespace.
  • a PersistentVolumeClaim to replicate the Chall-Manager filesystem data among the replicas.
  • a Deployment for the Chall-Manager pods.
  • a Service to expose those Chall-Manager pods, required for the janitor and the CTF platform communications.
  • a CronJob for the Chall-Manager-Janitor.

First of all, we work on the target namespace into which chall-manager will deploy challenge instances. These steps are mandatory in order to obtain a secure and robust deployment, without players being able to pivot within your Kubernetes cluster and from there to your applications (databases, monitoring, etc.).

The first step is to create the target namespace.

target-namespace.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: target-ns

To deploy challenge instances into this target namespace, we are going to need three resources: the ServiceAccount, the Role and its RoleBinding. This ServiceAccount should not be shared with other applications, and we are detailing here one way to build its permissions. As a Kubernetes administrator, you can modify those steps to aggregate roles, create a cluster-wide Role and RoleBinding, etc. Nevertheless, we trust our documented approach to be wiser for maintenance and accessibility.

Adjust the role permissions to your needs. You can list the available namespaced resources using kubectl api-resources --namespaced=true -o wide.

role.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chall-manager-role
  namespace: target-ns
  labels:
    app: chall-manager
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  - persistentvolumeclaims
  - pods
  - resourcequotas
  - secrets
  - services
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apps
  resources:
  - deployments
  - replicasets
  - statefulsets
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  - networkpolicies
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch

Then, create the ServiceAccount the Role will be bound to.

service-account.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: chall-manager-sa
  namespace: source-ns
  labels:
    app: chall-manager

Finally, bind the Role and ServiceAccount.

role-binding.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chall-manager-role-binding
  namespace: target-ns
  labels:
    app: chall-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chall-manager-role
subjects:
- kind: ServiceAccount
  name: chall-manager-sa
  namespace: source-ns
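Once these three manifests are applied, you can spot-check the resulting permissions, for example:

kubectl auth can-i create deployments \
  --as=system:serviceaccount:source-ns:chall-manager-sa \
  --namespace target-ns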

Now, we will prepare the isolation of scenarios to avoid pivoting within the infrastructure.

First, we start by denying all networking.

netpol-deny-all.yaml

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-deny-all
  namespace: target-ns
  labels:
    app: chall-manager
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Then, make sure that intra-cluster communications are not allowed from this namespace to any other.

netpol-inter-ns.yaml

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-inter-ns
  namespace: target-ns
  labels:
    app: chall-manager
spec:
  egress:
  - to:
    - namespaceSelector:
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values:
          - target-ns
  podSelector: {}
  policyTypes:
  - Egress

For complex scenarios that require multiple pods, we need to be able to resolve intra-cluster DNS entries.

netpol-dns.yaml

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-dns
  namespace: target-ns
  labels:
    app: chall-manager
spec:
  egress:
  - ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
    to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
  podSelector: {}
  policyTypes:
  - Egress

Ultimately, our challenges will probably need to access the internet, or our players will operate online, so we need to grant access to public internet addresses.

netpol-internet.yaml

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-internet
  namespace: target-ns
  labels:
    app: chall-manager
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 10.0.0.0/8
        - 172.16.0.0/12
        - 192.168.0.0/16
  podSelector: {}
  policyTypes:
  - Egress

At this step, no communication will be accepted by the target namespace. Every scenario will need to define its own NetworkPolicies regarding its inter-pods and exposed services communications.
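For illustration only, a scenario exposing a single web challenge could ship a NetworkPolicy such as the following sketch; the challenge label and port are assumptions made for this example, not values set by chall-manager.

netpol-example-challenge.yaml (illustrative)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-example-challenge
  namespace: target-ns
spec:
  podSelector:
    matchLabels:
      challenge: example # hypothetical label the scenario sets on its pods
  ingress:
  - ports:
    - port: 8080 # hypothetical port the challenge listens on
      protocol: TCP
  policyTypes:
  - Ingress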

Before starting the chall-manager, we need to create the PersistentVolumeClaim to write the data to.

pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: chall-manager-pvc
  namespace: source-ns
  labels:
    app: chall-manager
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 2Gi # arbitrary, you may need more or less
  storageClassName: longhorn # or anything else compatible

We’ll now deploy the chall-manager and provide it with the ServiceAccount we created before. For additional configuration elements, refer to the CLI documentation (chall-manager -h).

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: chall-manager-deploy
  namespace: source-ns
  labels:
    app: chall-manager
spec:
  replicas: 1 # scale if necessary
  selector:
    matchLabels:
      app: chall-manager
  template:
    metadata:
      namespace: source-ns
      labels:
        app: chall-manager
    spec:
      containers:
      - name: chall-manager
        image: ctferio/chall-manager:v1.0.0
        env:
        - name: PORT
          value: "8080"
        - name: DIR
          value: /etc/chall-manager
        - name: LOCK_KIND
          value: local # or "etcd" if you have an etcd cluster
        - name: KUBERNETES_TARGET_NAMESPACE
          value: target-ns
        ports:
        - name: grpc
          containerPort: 8080
          protocol: TCP
        volumeMounts:
        - name: dir
          mountPath: /etc/chall-manager
      serviceAccountName: chall-manager-sa
      # If you have an etcd cluster, we recommend adding an initContainer that waits
      # for the cluster to be up and running before starting chall-manager,
      # otherwise it will fail to handle requests.
      volumes:
      - name: dir
        persistentVolumeClaim:
          claimName: chall-manager-pvc

We need to expose the pods to integrate chall-manager with a CTF platform, and to enable the janitor to run.

service.yaml

apiVersion: v1
kind: Service
metadata:
  name: chall-manager-svc
  namespace: source-ns
  labels:
    app: chall-manager
spec:
  ports:
  - name: grpc
    port: 8080
    targetPort: 8080
    protocol: TCP
  # if you are using the chall-manager gateway (its REST API), don't forget to add an entry here
  selector:
    app: chall-manager

Now, to enable janitoring, we have to create the CronJob for the chall-manager-janitor.

cronjob.yaml

apiVersion: batch/v1
kind: CronJob
metadata:
  name: chall-manager-janitor
  namespace: source-ns
  labels:
    app: chall-manager
spec:
  schedule: "*/1 * * * *" # run every minute; adjust the schedule if necessary
  jobTemplate:
    spec:
      template:
        metadata:
          namespace: source-ns
          labels:
            app: chall-manager
        spec:
          containers:
            - name: chall-manager-janitor
              image: ctferio/chall-manager-janitor:v1.0.0
              env: 
              - name: URL
                value: chall-manager-svc:8080

Finally, deploy them all.

kubectl apply -f target-namespace.yaml \
  -f role.yaml -f service-account.yaml -f role-binding.yaml \
  -f netpol-deny-all.yaml -f netpol-inter-ns.yaml -f netpol-dns.yaml -f netpol-internet.yaml \
  -f pvc.yaml -f deployment.yaml -f service.yaml \
  -f cronjob.yaml
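Once applied, a quick way to check that everything is up and running (using the resource names from this guide):

kubectl -n source-ns get pods,svc,cronjobs
kubectl -n source-ns logs deploy/chall-manager-deploy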

Docker

To deploy the Docker container on a host machine, run the following. It comes with a limited set of features, and thus needs additional configuration for the Pulumi providers to communicate with their targets.

docker run -p 8080:8080 -v ./data:/etc/chall-manager ctferio/chall-manager:v1.0.0

For the janitor, you may use a cron service on your host machine. In this case, you may also want to create a specific network to isolate them from other adjacent services.
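For example, a crontab entry running the janitor image every minute could look like the following sketch; it assumes chall-manager is published on localhost:8080 as above, so adapt the networking to your setup.

*/1 * * * * docker run --rm --network host -e URL=localhost:8080 ctferio/chall-manager-janitor:v1.0.0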

Binary

To deploy the binary on a host machine, run the following. It comes with a limited set of features, and thus needs additional configuration for the Pulumi providers to communicate with their targets; a minimal configuration sketch follows the command.

# Download the binary from https://github.com/ctfer-io/chall-manager/releases, then run it
./chall-manager
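The binary reads the same settings shown in the Kubernetes Deployment above as environment variables; a minimal local sketch:

# Values mirror the Deployment example above; adjust them to your needs.
PORT=8080 DIR=/etc/chall-manager LOCK_KIND=local ./chall-manager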

For the janitor, you may use a cron service on your host machine.

2 - Monitoring

What are the signals to capture once in production, and how to deal with them?

Once in production, the chall-manager provides its functionalities to the end-users.

But production can suffer from a lot of disruptions: network latencies, interruption of services, an unexpected bug, chaos engineering going a bit too far… How can we monitor the chall-manager to make sure everything goes well? What should we monitor to quickly understand what is going on?

Metrics

A first approach to monitor what is going on inside the chall-manager is through its metrics.

Name         Type    Description
challenges   int64   The number of registered challenges.
instances    int64   The number of registered instances.

You can use them to build dashboards, KPIs or anything else. They can help you better understand the usage trends of chall-manager throughout an event.

Tracing

A way to go deeper in understanding what is going on inside chall-manager is through tracing.

First of all, it will provide you with information on latencies in the distributed lock system and in Pulumi manipulations. Secondly, it will also provide you with Service Performance Monitoring (SPM).

You can configure the OpenTelemetry Collector to produce RED metrics from the spans through the spanmetrics connector. When a Jaeger instance is bound to both the OpenTelemetry Collector and the Prometheus server containing the metrics, you can monitor performance AND visualize what happens.
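As an illustration, a minimal Collector configuration wiring the spanmetrics connector between a traces pipeline and a metrics pipeline could look like the following sketch; it assumes the contrib distribution of the Collector, and the endpoints are placeholders.

otel-collector-config.yaml (illustrative)

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

connectors:
  spanmetrics: {}

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889 # scraped by your Prometheus server

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]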

An example view of the Service Performance Monitoring in Jaeger, using the OpenTelemetry Collector and Prometheus server.

Through the use of those metrics and tracing capabilities, you could build alert thresholds and automate responses or on-call alerts with the Alertmanager.

A reference architecture to achieve this description follows.

graph TD
    subgraph Monitoring
        COLLECTOR["OpenTelemetry Collector"]
        PROM["Prometheus"]
        JAEGER["Jaeger"]
        ALERTMANAGER["AlertManager"]
        GRAFANA["Grafana"]

        COLLECTOR --> PROM
        JAEGER --> COLLECTOR
        JAEGER --> PROM
        ALERTMANAGER --> PROM
        GRAFANA --> PROM
    end

    subgraph Chall-Manager
        CM["Chall-Manager"]
        CMJ["Chall-Manager-Janitor"]
        ETCD["etcd cluster"]

        CMJ --> CM

        CM --> |OTLP| COLLECTOR
        CM --> ETCD
    end