Chall-Manager solves the Challenge on Demand problem through a generalization technically based upon Pulumi, and embodies the State-of-the-Art Continuous Deployment practices.

Warning

Currently entering private alpha phase, for any issue: ctfer-io@protonmail.com

Chall-Manager is a MicroService whose goal is to manage challenges and their instances. Each of those instances are self-operated by the source (either team or user) which rebalance the concerns to empowers them.

It eases the job of Administrators, Operators, Chall-Makers and Players through various features, such as hot infrastructure update, non-vendor-specific API, telemetry and scalability.

Thanks to a generic Scenario concept, the Chall-Manager could be operated anywhere (e.g. a Kubernetes cluster, AWS, GCP, microVM, on-premise infrastructures) and run everything (e.g. containers, VMs, IoT, FPGA, custom resources).

1.1 - Ops Guides

A collection of guides made for Ops.

1.1.1 - Deployment

Learn to deploy Chall-Manager, either for production or development purposes.

You can deploy the Chall-Manager in many ways. The following table summarize the properties of each one.

Name	Maintained	Isolation	Scalable	Janitor
Kubernetes (with Pulumi)	✅	✅	✅	✅
Kubernetes	❌	✅	✅	✅
Docker	❌	✅	✅¹	❌²
Binary	❌	❌	❌	❌²

¹ Autoscaling is possible with an hypervisor (e.g. Docker Swarm).

² Cron could be configured through a cron on the host machine.

Kubernetes (with Pulumi)

Note

We highly recommend the use of this deployment strategy.

We use it to test the chall-manager, and will ease parallel deployments.

This deployment strategy guarantee you a valid infrastructure regarding our functionalities and security guidelines. Moreover, if you are afraid of Pulumi you’ll have trouble creating scenarios, so it’s a good place to start !

The requirements are:

a distributed block storage solution such as Longhorn, if you want replicas.
an etcd cluster, if you want to scale.
an OpenTelemetry Collector, if you want telemetry data.
an origin namespace in which the chall-manager will run.

# Get the repository and its own Pulumi factory
git clone git@github.com:ctfer-io/chall-manager.git
cd chall-manager/deploy

# Use it straightly !
# Don't forget to configure your stack if necessary.
# Refer to Pulumi's doc if necessary.
pulumi up

Now, you’re done !

Kubernetes

With this deployment strategy, you are embracing the hard path of setting up a chall-manager to production. You’ll have to handle the functionalities, the security, and you won’t implement variability easily. We still highly recommend you deploying with Pulumi, but if you love YAMLs, here is the doc.

The requirements are:

a distributed block storage solution such as Longhorn, if you want replicas.
an etcd cluster, if you want to scale.
an OpenTelemetry Collector, if you want telemetry data.
an origin namespace in which the chall-manager will run.

We’ll deploy the following:

a target Namespace to deploy instances into.
a ServiceAccount for chall-manager to deploy instances in the target namespace.
a Role and its RoleBinding to assign permissions required for the ServiceAccount to deploy resources. Please do not give non-namespaced permissions as it would enable a CTF player to pivot into the cluster thus break isolation.
a set of 4 NetworkPolicies to ensure security by default in the target namespace.
a PersistentVolumeClaim to replicate the Chall-Manager filesystem data among the replicas.
a Deployment for the Chall-Manager pods.
a Service to expose those Chall-Manager pods, required for the janitor and the CTF platform communications.
a CronJob for the Chall-Manager-Janitor.

First of all, we are working on the target namespace the chall-manager will deploy challenge instances to. Those steps are mandatory in order to obtain a secure and robust deployment, without players being able to pivot in your Kubernetes cluster, thus to your applications (databases, monitoring, etc.).

The first step is to create the target namespace.

target-namespace.yaml


apiVersion: v1
kind: Namespace
metadata:
  name: target-ns

To deploy challenge instances into this target namespace, we are going to need 3 resources: the ServiceAccount, the Role and its RoleBinding. This ServiceAccount should not be shared with other applications, and we are here detailing 1 way to build its permissions. As a Kubernetes administrator, you can modify those steps to aggregate roles, create a Cluster-wide Role and RoleBindng, etc. Nevertheless, we trust our documented approach to be wiser for maintenance and accessibility.

Adjust the role permissions to your needs. You can do this using kubectl api-resources –-namespaced=true –o wide.

role.yaml


apiVersion: rbac.authorieation.k8s.io/v1
kind: Role
metadata:
  name: chall-manager-role
  namespace: target-ns
  labels:
    app: chall-manager
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  - persistentvolumeclaims
  - pods
  - resourcequotas
  - secrets
  - service
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apps
  resources:
  - deployments
  - replicasets
  - statefulsets
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  - networkpolicies
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch

The the ServiceAccount it will refer to.

service-account.yaml


apiVersion: v1
kind: ServiceAccount
metadata:
  name: chall-manager-sa
  metadata: source-ns
  labels:
    app: chall-manager

Finally, bind the Role and ServiceAccount.

role-binding.yaml


apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chall-manager-role-binding
  namespace: target-ns
  labels:
    app: chall-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chall-manager-role
subjects:
- kind: ServiceAccount
  name: chall-manager-sa
  namespace: source-ns

Now, we will prepare isolation of scenarios to avoid pivoting in the infrastructure.

First, we start by denying all networking.

netpol-deny-all.yaml


apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-deny-all
  namespace: target-ns
  labels:
    app: chall-manager
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Then, make sure that intra-cluster communications are not allowed from this namespace to any other.

netpol-inter-ns.yaml


apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-inter-ns
  namespace: target-ns
  labels:
    app: chall-manager
spec:
  egress:
  - to:
    - namespaceSelector:
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values:
          - target-ns
  podSelector: {}
  policyTypes:
  - Egress

For complex scenarios that require multiple pods, we need to be able to resolve intra-cluster DNS entries.

netpol-dns.yaml


apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-dns
  namespace: target-ns
  labels:
    app: chall-manager
spec:
  egress:
  - ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
    to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
  podSelector: {}
  policyTypes:
  - Egress

Ultimately our challenges will probably need to access internet, or our players will operate online, so we need to grant access to internet addresses.

netpol-internet.yaml


apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-internet
  namespace: target-ns
  labels:
    app: chall-manager
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        expect:
        - 10.0.0.0/8
        - 172.16.0.0/12
        - 192.168.0.0/16
  podSelector: {}
  policyTypes:
  - Egress

At this step, no communication will be accepted by the target namespace. Every scenario will need to define its own NetworkPolicies regarding its inter-pods and exposed services communications.

Before starting the chall-manager, we need to create the PersistentVolumeClaim to write the data to.

pvc.yaml


apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: chall-manager-pvc
  namespace: source-ns
  labels:
    app: chall-manager
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 2Gi # arbitrary, you may need more or less
  storageClassName: longhorn # or anything else compatible
  volumeName: chall-manager-pvc

We’ll now deploy the chall-manager and provide it the ServiceAccount we created before. For additionnal configuration elements, refer to the CLI documentation (chall-manager -h).

deployment.yaml


apiVersion: apps/v1
kind: Deployment
metadata:
  name: chall-manager-deploy
  namespace: source-ns
  labels:
    app: chall-manager
spec:
  replicas: 1 # scale if necessary
  selector:
    matchLabels:
      app: chall-manager
  template:
    metadata:
      namespace: source-ns
      labels:
        app: chall-manager
    spec:
      containers:
      - name: chall-manager
        image: ctferio/chall-manager:v1.0.0
        env:
        - name: PORT
          value: "8080"
        - name: DIR
          value: /etc/chall-manager
        - name: LOCK_KIND
          value: local # or "etcd" if you have an etcd cluster
        - name: KUBERNETES_TARGET_NAMESPACE
          value: target-ns
        ports:
        - name: grpc
          containerPort: 8080
          protocol: TCP
        volumeMounts:
        - name: dir
          mountPath: /etc/chall-manager
      serviceAccount: chall-manager-sa
      # if you have an etcd cluster, we recommend creating an InitContainer to wait for the cluster to be up and running before starting chall-manager, else it will fail to handle requests
      volumes:
      - name: dir
        persistentVolumeClaim:
          claimName: chall-manager-pvc

We need to expose the pods to integrate chall-manager with a CTF platform, and to enable the janitor to run.

service.yaml


apiVersion: v1
kind: Service
metadata:
  name: chall-manager-svc
  namespace: source-ns
  labels:
    app: chall-manager
spec:
  ports:
  - name: grpc
    port: 8080
    targePort: 8080
    protocol: TCP
  # if you are using the chall-manager gateway (its REST API), don't forget to add an entry here
  selector:
    app: chall-manager

Now, to enable the janitoring, we have to create the CronJob for the chall-manager-janitor.

cronjob.yaml


apiVersion: batch/v1
kind: CronJob
metadata:
  name: chall-manager-janitor
  namespace: source-ns
  labels:
    app: chall-manager
spec:
  schedule: "*/1 * * * *" # run every minute ; configure it elseway if necessary
  jobTemplate:
    spec:
      template:
        metadata:
          namespace: source-ns
          labels:
            app: chall-manager
        spec:
          containers:
            - name: chall-manager-janitor
              image: ctferio/chall-manager-janitor:v1.0.0
              env: 
              - name: URL
                value: chall-manager-svc:8080

Finally, deploy them all.

kubectl apply -f target-namespace.yaml \
  -f role.yaml -f service-account.yaml -f role-binding.yaml \
  -f netpol-deny-all.yaml -f netpol-inter-ns.yaml -f netpol-dns.yaml -f netpol-internet.yaml \
  -f pvc.yaml -f deployment.yaml -f service.yaml \
  -f cronjob.yaml

Docker

Warning

This mode does not cover the deployment of the janitor.

To deploy the docker container on a host machine, run the following. It will come with a limited set of features, thus will need additional configurations for the Pulumi providers to communicate with their targets.

docker run -p 8080:8080 -v ./data:/etc/chall-manager ctferio/chall-manager:v1.0.0

For the janitor, you may use a cron service on your host machine. In this case, you may also want to create a specific network to isolate them from other adjacent services.

Binary

Security

We highly discourage the use of this mode as it does not guarantee proper isolation. The chall-manager is basically a RCE-as-a-Service carrier, so if you run this on your host machine, prepare for dramatic issues.

Warning

This mode does not cover the deployment of the janitor.

To deploy the binary on a host machine, run the following. It will come with a limited set of features, thus will need additional configurations for the Pulumi providers to communicate with their targets.

# Download the binary from https://github.com/ctfer-io/chall-manager/releases, then run it
./chall-manager

For the janitor, you may use a cron service on your host machine.

1.1.2 - Monitoring

What are the signals to capture once in production, and how to deal with them ?

Once in production, the chall-manager provides its functionalities to the end-users.

But production can suffer from a lot of disruptions: network latencies, interruption of services, an unexpected bug, chaos engineering going a bit too far… How can we monitor the chall-manager to make sure everything goes fine ? What to monitor to quickly understand what is going on ?

Metrics

A first approach to monitor what is going on inside the chall-manager is through its metrics.

Warning

Metrics are exported by the OTLP client. If you did not configure an OTEL Collector, please refer to the deployment documentation.

Name	Type	Description
`challenges`	`int64`	The number of registered challenges.
`instances`	`int64`	The number of registered instances.

You can use them to build dashboards, build KPI or anything else. They can be interesting for you to better understand the tendencies of usage of chall-manager through an event.

Tracing

A way to go deeper in understanding what is going on inside chall-manager is through tracing.

First of all, it will provide you information of latencies in the distributed locks system and Pulumi manipulations. Secondly, it will also provide you Service Performance Monitoring (SPM).

Using the OpenTelemetry Collector, you can configure it to produce RED metrics on the spans through the spanmetrics connector. When a Jaeger is bound to both the OpenTelemetry Collector and the Prometheus containing the metrics, you can monitor performances AND visualize what happens.

An example view of the Service Performance Monitoring in Jaeger, using the OpenTelemetry Collector and Prometheus server.

Through the use of those metrics and tracing capabilities, you could build alerts thresholds and automate responses or on-call alerts with the alertmanager.

A reference architecture to achieve this description follows.

graph TD
    subgraph Monitoring
        COLLECTOR["OpenTelemetry Collector"]
        PROM["Prometheus"]
        JAEGER["Jaeger"]
        ALERTMANAGER["AlertManager"]
        GRAFANA["Grafana"]

        COLLECTOR --> PROM
        JAEGER --> COLLECTOR
        JAEGER --> PROM
        ALERTMANAGER --> PROM
        GRAFANA --> PROM
    end

    subgraph Chall-Manager
        CM["Chall-Manager"]
        CMJ["Chall-Manager-Janitor"]
        ETCD["etcd cluster"]

        CMJ --> CM

        CM --> |OTLP| COLLECTOR
        CM --> ETCD
    end

1.2 - ChallMaker Guides

A collection of guides made for ChallMakers.

1.2.1 - Create a Scenario

Create a Chall-Manager Scenario from scratch.

You are a ChallMaker or only curious ? You want to understand how the chall-manager can spin up challenge instances on demand ? You are at the best place for it then.

This tutorial will be split up in three parts:

Design your Pulumi factory

We call a “Pulumi factory” a golang code or binary that fits the chall-manager scenario API. For details on this API, refer to the SDK documentation.

The requirements are:

have go installed.
have pulumi installed.

Create a directory and start working in it.

mkdir my-challenge
cd $_

go mod init my-challenge

First of all, you’ll configure your Pulumi factory. The example below constitutes the minimal requirements, but you can add more configuration if necessary.

Pulumi.yaml


name: my-challenge
runtime: go
description: Some description that enable others understand my challenge scenario.

Then create your entrypoint base.

main.go


package main

import (
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
        // Scenario will go there

		return nil
	})
}

You will need to add github.com/pulumi/pulumi/sdk/v3/go to your dependencies: execute go mod tidy.

Starting from here, you can get configurations, add your resources and use various providers.

For this tutorial, we will create a challenge consuming the identity from the configuration and create an Amazon S3 Bucket. At the end, we will export the connection_info to match the SDK API.

main.go


package main

import (
    "github.com/pulumi/pulumi-aws/sdk/v6/go/aws/s3"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi/config"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
        // 1. Load config
		cfg := config.New(ctx, "my-challenge")
		config := map[string]string{
			"identity": cfg.Get("identity"),
		}

        // 2. Create resources
        _, err := s3.NewBucketV2(ctx, "example", &s3.BucketV2Args{
			Bucket: pulumi.String(config["identity"]),
			Tags: pulumi.StringMap{
				"Name":     pulumi.String("My Challenge Bucket"),
				"Identity": pulumi.String(config["identity"]),
			},
		})
		if err != nil {
			return err
		}

        // 3. Export outputs
        // This is a mockup connection info, please provide something meaningfull and executable
		ctx.Export("connection_info", pulumi.String("..."))
		return nil
	})
}

Don’t forget to run go mod tidy to add the required Go modules. Additionally, make sure to configure the chall-manager pods to get access to your AWS configuration through environment variables, and add a Provider configuration in your code if necessary.

Tips & Tricks

You can compile your code to make the challenge creation/update faster, but chall-manager will automatically do it anyway to enhance performances (avoid re-downloading Go modules and Pulumi providers, and compile the scenario). Such build could be performed through CGO_ENABLED=0 go build -o main path/to/main.go.

Add the following configuration in your Pulumi.yaml file to consume it, and set the binary path accordingly to the filesystem.

runtime:
  name: go
  options:
   binary: ./main

You can test it using the Pulumi CLI with for instance the following.

pulumi stack init # answer the questions
pulumi up         # preview and deploy

Make it ready for chall-manager

Now that your scenario is designed and coded accordingly to your artistic direction, you have to prepare it for the chall-manager to receive it. Make sure to remove all unnecessary files, and zip the directory it is contained within.

cd ..
zip -r my-challenge.zip ./my-challenge/*

And you’re done. Yes, it was that easy :)

But it could be even more using the SDK !

Tips & Tricks

You don’t need to archive all files.

If you don’t pre-compiled your scenario, you need to archive all source files.

If you prebuilt the scenario, you’ll only need to pack the main binary and Pulumi.yaml file.

1.2.2 - Software Development Kit

Sometimes, you don’t need big things. The SDK makes sure you don’t need to be a DevOps.

When you (a ChallMaker) want to deploy a single container specific for each source, you don’t want to understand how to deploy it to a specific provider. In fact, your technical expertise does not imply you are a Cloud expert… And it was not to expect ! Writing a 500-lines long scenario fitting the API only to deploy a container is a tedious job you don’t want to do more than once: create a deployment, the service, possibly the ingress, have a configuration and secrets to handle…

For this reason, we built a Software Development Kit to ease your use of chall-manager. It contains all the features of the chall-manager without passing you the issues of API compliance.

Additionnaly, we prepared some common use-cases factory to help you focus on your CTF, not the infrastructure:

Kubernetes ExposedMonopod

The community is free to create new pre-made recipes, and we welcome contributions to add new official ones. Please open an issue as a Request For Comments, and a Pull Request if possible to propose an implementation.

Build scenarios

Fitting the chall-manager scenario API imply inputs and outputs.

Despite it not being complex, it still requires work, and functionalities or evolutions does not guarantee you easy maintenance: offline compatibility with OCI registry, pre-configured providers, etc.

Indeed, if you are dealing with a chall-manager deployed in a Kubernetes cluster, the ...pulumi.ResourceOption contains a pre-configured provider such that every Kubernetes resources the scenario will create, they will be deployed in the proper namespace.

Inputs

Those are fetchable from the Pulumi configuration.

Name	Required	Description
`identity`	✅	the identity of the Challenge on Demand request

Outputs

Those should be exported from the Pulumi context.

Name	Required	Description
`connection_info`	✅	the connection information, as a string (e.g. `curl http://a4...d6.my-ctf.lan`)
`flag`	❌	the identity-specific flag the CTF platform should only validate for the given source

Kubernetes ExposedMonopod

When you want to deploy a challenge composed of a single container, on a Kubernetes cluster, you want it to be fast and easy.

Then, the Kubernetes ExposedMonopod fits your needs ! You can easily configure the container you are looking for and deploy it to production in the next seconds. The following shows you how easy it is to write a scenario that creates a Deployment with a single replica of a container, exposes a port through a service, then build the ingress specific to the identity and finally provide the connection information as a curl command.

main.go


package main

import (
	"github.com/ctfer-io/chall-manager/sdk"
	"github.com/ctfer-io/chall-manager/sdk/kubernetes"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	sdk.Run(func(req *sdk.Request, resp *sdk.Response, opts ...pulumi.ResourceOption) error {
		cm, err := kubernetes.NewExposedMonopod(req.Ctx, &kubernetes.ExposedMonopodArgs{
			Image:      "myprofile/my-challenge:latest",
			Port:       8080,
			ExposeType: kubernetes.ExposeIngress,
			Hostname:   "brefctf.ctfer.io",
			Identity:   req.Config.Identity,
		}, opts...)
		if err != nil {
			return err
		}

		resp.ConnectionInfo = pulumi.Sprintf("curl -v https://%s", cm.URL)
		return nil
	})
}

Requirements

To use ingresses, make sure your Kubernetes cluster can deal with them: have an ingress controller (e.g. Traefik), and DNS resolution points to the Kubernetes cluster.

The Kubernetes ExposedMonopod architecture for deployed resources.

1.2.3 - Update in production

How to update a challenge scenario once it is in production (instances are deployed) ?

So you have a challenge that made its way to production, but it contains a bug or an unexpected solve ? Yes, we understand your pain: you would like to patch this but expect services interruption… It is not a problem anymore !

A common worklow of a challenge fix happening in production.

We adopted the reflexions of The Update Framework to provide infrastructure update mecanisms with different properties.

What to do

You will have to update the scenario, of course. Once it is fixed and validated, archive the new version.

Then, you’ll have to pick up an Update Strategy.

Update Strategy	Require Robustness¹	Time efficiency	Cost efficiency	Availability	TL;DR;
Update in place	✅	✅	✅	✅	Efficient in time & cost ; require high maturity
Blue-Green	❌	✅	❌	✅	Efficient in time ; costfull
Recreate	❌	❌	✅	❌	Efficient in cost ; time consuming

¹ Robustness of both the provider and resources updates.

More information on the selection of those models and how they work internally is available in the design documentation.

You’ll only have to update the challenge, specifying the Update Strategy of your choice. Chall-Manager will temporarily block operations on this challenge, and update all existing instances. This makes the process predictible and reproductible, thus you can test in a pre-production environment before production. It also avoids human errors during fix, and lower the burden at scale.

1.2.4 - Use the flag variation engine

Use the flag variation engine to block shareflag, as a native feature of the Chall-Manager SDK.

Shareflag is considered by some as the worst part of competitions leading to unfair events, while some others consider this a strategy. We consider this a problem we could solve.

Context

In “standard” CTFs as we could most see them, it is impossible to solve this problem: if everyone has the same binary to reverse-engineer, how can you differentiate the flag per each team thus avoid shareflag ?

For this, you have to variate the flag for each source. One simple solution is to use the SDK.

Use the SDK

The SDK can variate a given input with human-readable equivalent characters in the ASCII-extended charset, making it handleable for CTF platforms (at least we expect it). If one character is out of those ASCII-character, it will be untouched.

To import this part of the SDK, execute the following.

go get github.com/ctfer-io/chall-manager/sdk

Then, in your scenario, you can create a constant that contains the “base flag” (i.e. the unvariated flag).

const flag = "my-supper-flag"

Finally, you can export the variated flag.

package main

import (
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"

	"github.com/ctfer-io/chall-manager/sdk"
)

const flag = "my-supper-flag"

func main() {
	sdk.Run(func(req *sdk.Request, resp *sdk.Response, opts ...pulumi.ResourceOption) error {
		// ...

		resp.ConnectionInfo = pulumi.String("...").ToStringOutput()
		resp.Flag = pulumi.Sprintf("BREFCTF{%s}", sdk.VariateFlag(req.Config.Identity, flag))
		return nil
	})
}

package main

import (
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi/config"

    "github.com/ctfer-io/chall-manager/sdk"
)

const flag = "my-supper-flag"

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// 1. Load config
		cfg := config.New(ctx, "no-sdk")
		config := map[string]string{
			"identity": cfg.Get("identity"),
		}

		// 2. Create resources
		// ...

		// 3. Export outputs
		ctx.Export("connection_info", pulumi.String("..."))
        ctx.Export("flag", pulumi.Sprintf("BREFCTF{%s}", sdk.VariateFlag(config["identity"], flag)))
		return nil
	})
}

If you want to use decorator around the flag (e.g. BREFCTF{}), don’t put it in the flag constant else it will be variated.

1.3 - Developer Guides

A collection of guides made for the developers using the Micro Service.

1.3.1 - Entensions

How to extend chall-manager capabilities ?

How can we extend chall-manager capabilities ? You cannot, in itself: there is no support for live-mutability of functionalities, plugins, nor there will be (immutability is both an operation and security principle, determinism is a requirement).

But as chall-manager is designed as a Micro Service, you only have to reuse it !

Hacking chall-manager API

Taking a few steps back, you can abstract the chall-manager API to fit your needs:

the connection_info is an Output data from the instance to the player.
the flag is an optional Output data from the instance to the backend. Then, if you want to pass additional data, you can use those commmunications buses.

Case studies

Follows some extension case studies.

MultiStep & MultiFlags

The original idea comes from the JointCyberRange, the following architecture is our proposal to solve the problem. In the “MultiStep challenge” problem they would like to have Jeopardy-style challenges constructed as a chain of steps, where each step has its own flag. To completly flag the challenge, the player have to get all flags in the proper order. Their target environment is an integration in CTFd as a plugin, and challenges deployed to Kubernetes.

Deploying those instances, isolating them, janitoring if necessary are requirements, thus would have been reimplemented. But chall-manager can deploy scenarios to whatever environment. Our proposal is then to cut the problem in two parts according to the Separation of Concerns Principle:

a CTFd plugin that implement a new challenge type, and communicate with chall-manager
chall-manager to deploy the instances

The connection_info can be unchanged from its native usage in chall-manager.

The flag output could contain a JSON object describing the flags chain and internal requirements.

Our suggested architecture for the JCR MultiStep challenge plugin for CTFd.

Through this architecture, JCR would be able to fit their needs shortly by capitalizing on the chall-manager capabilities, thus extend its goals. Moreover, it would enable them to provide the community a CTFd plugin that does not only fit Kubernetes, thanks to the genericity.

1.3.2 - Integrate with a CTF platform

Tips, tricks and guidelines on the integration of Chall-Manager in a CTF platform.

So you want to integrate chall-manager with a CTF platform ? Good job, you are contributing to the CTF ecosystem !

Here are the known integrations in CTF platforms:

CTFd: ctfer-io/ctfd-chall-manager

The design

The API is split in two services:

the ChallengeStore to handle the CRUD operations on challenges (ids, scenarios, etc.).
the InstanceManager to handle players CRUD operations on instances.

You’ll have to handle both services as part of the chall-manager API if you want proper integration.

We encourage you to add additional efforts around this integration, with for instance:

management views to monitor which challenge is started for every players.
pre-provisionning to better handle load spikes at the beginning of the event.
add rate limiting through a mana.
the support of OpenTelemetry for distributed tracing.

Use the proto

The chall-manager was conceived using a Model-Based Systems Engineering practice, so API models (the contracts) were written, and then the code was generated. This makes the .proto files the first-class citizens you may want to use in order to integrate chall-manager to a CTF platform. Those could be found in the subdirectories here. Refer the your proto-to-code tool for generating a client from those.

If you are using Golang, you can directly use the generated clients for the ChallengeStore and InstanceManager services API.

If you cannot or don’t want to use the proto files, you can use the gateway API.

Use the gateway

Because some languages don’t support gRPC, or you don’t want to, you can simply communicate with chall-manager through its JSON REST API.

To access this gateway, you have to start your chall-manager with the proper configuration:

either --gw as an arg or GATEWAY=true as a varenv
either --gw-swagger as an arg or GATEWAY_SWAGGER=true as a varenv

You can then reach the Swagger at http://my-chall-manager:9090/swagger/#, which should show you the following.

The chall-manager REST JSON API Swagger.

Use this Swagger to understand the API, and build your language-specific client in order to integrate chall-manager. We do not provide official language-specific REST JSON API clients.

1.4 - Design

Understand what led us to design Chall-Manager as such: what are the needs, what were the problems, how we solved them.

In this documentation chapter, we explain what led us to design the Chall-Manager as such. We go from the CTF limitations, the Challenge on Demand problem, the functionalities, the genericity of our approach, how we made sure of its usability, to the innovative testing strategy.

1.4.1 - Context

Inform yourself on the context of the Challenge on Demand problem.

A Capture The Flag (CTF) is an event that brings a set of players together, to challenge themselves and others on domain-specific problems. Those events could have for objective learning, or made for competing with cashprizes. They could be hold physically, virtually or hybrid.

CTFs are largely adopted in the cybersecurity community, taking place all over the world on a daily basis. In this community, plenty domains are played: web, forensic, Active Directory, cryptography, game hacking, telecommunications, steganography, mobile, OSINT, reverse engineering, programming, etc.

In general, a challenge is composed of a name, has a description, a set of hints, files and other data, shared for all. On top of those, the competition runs over points displayed on scoreboards: this is how people keep getting entertained throughout a continuous and hours-long rush. Most of the challenges find sufficient solutions to their needs with those functionalities, perhaps some does not…

If we consider other aspects of cybersecurity -as infrastructures, telecommunications, satellites, web3 and others- those solutions are not sufficient. They need specific deployments strategies, and are costfull to deploy even only once.

Nevertheless, with the emergence of the Infrastructure as Code paradigm, we think of infrastructures as reusable components, composing with pieces like for a puzzle. Technologies appeared to embrace this new paradigm, and where used by the cybersecurity community to build CTF infrastructures. Players could then share infrastructures to play those categories.

In reality, yes they could. But, de facto, they share the same infrastructure. If you are using a CTF to select the top players for a larger event, how would you be able to determine who performed better ? How do you assure their success is due to their sole effort and not a side effect of someone else work ? In the opposite direction, if you are a player who is loosing its mind on a challenge, you won’t be glad that someone broke the entire challenge thus your efforts are wortheless, isn’t it ?

What’s next ?

Read The Need to clarify the necessity of Challenge on Demand.

1.4.2 - The Need

Understand why it was a problem before chall-manager.

“Sharing is caring” they said… they were wrong. Sometimes, isolating things can get spicy, but it implies replicating infrastructures. If someone soft-locks its challenge, it would be its entire problem: it won’t affect other players experience. Furthermore, you could then imagine a network challenge where your goal is to chain a vulnerability to a Man-in-the-Middle attack to get access to a router and spy on communications to steal a flag ! Or a small company infrastructure that you would have to break into and get an administrator account ! And what if the infrastructure went down ? Then it would be isolated, other players could still play in their own environment. This idea can be really productive, enhance the possibilities of a CTF and enable realistic -if not realism- attacks to take place in a controlled manner.

This is the Challenge on Demand problem: giving every player or team its own instance of a challenge, isolated, to play complex scenarios. With a solution to this problem, players and teams could use a brand new set of knowledges they could not until this: pivoting from an AWS infrastructure to an on-premise company IT system, break into a virtual bank vault to squeeze out all the belongings, hack their own Industrial Control System or IoT assets… A solution to this problem would open a myriad possibilites.

The existing solutions

As it is a widespread known limitation in the community, people tried to solve the problem. They conceived solutions that would fit their need, considering a set of functionalities and requirements, then built, released and deployed them successfully.

Some of them are famous:

CTFd whale is a CTFd plugin able to spin up Docker containers on demand.
CTFd owl is an alternative to CTFd whale, less famous.
KubeCTF is another CTFd plugin made to spin up Kubernetes environments.
Klodd is a rCTF service also made to spin up Kubernetes environments.

Nevertheless, they partially solved the root problem: those solutions solved the problem in a context (Docker, Kubernetes), with a Domain Specific Language (DSL) that does not guarantee non-vendor-lock-in nor ease use and testing, and lack functionalities such as Hot Update.

An ideal solution to this problem require:

the use of a programmatic language, not a DSL (non-vendor-lock-in and functionalities)
the capacity to deploy an instance without the solution itself (non-vendor-lock-in)
the capacity to use largely-adopted technologies (e.g. Terraform, Ansible, etc., for functionalities)
the genericity of its approach to avoid re-implementing the solution for every service provider (functionalities)

There were no existing solutions that fits those requirements… Until now.

Grey litterature survey

Follows an exhaustive grey litterature survey on the solutions made to solve the Challenge on Demand problem. To enhance this one, please open an issue or a pull request, we would be glad to improve it !

Service	CTF platform	Genericity	Technical approach	Scalable
CTFd whale	CTFd	❌	Docker socket	❌¹
CTFd owl	CTFd	❌	Docker socket	❌¹
KubeCTF	Agnostic, CTFd plugin	❌	Use Kubernetes API and annotations	✅
Klodd	rCTF	❌	Use Kubernetes CRDs and a microservice	✅
HeroCTF's deploy-dynamic	Agnostic	❌	API wrapper around Docker socket	❌¹
4T$'s I	CTFd	❌	Use Kubernetes CRDs and a microservice	✅

¹ not considered scalable as it reuses a Docker socket thus require a whole VM. As such a VM is not scaled automatically (despite being technically feasible), the property aligns with the limitation.

Classification for Scalable:

✅ partially or completfully scalable. Classification does not goes further on criteria as the time to scale, autoscaling, load balancing, anticipated scaling, descaling, etc.
❌ only one instance is possible.

Litterature

More than a technical problem, the chall-manager also provides a solution to a scientific problem. In previous approaches for cybersecurity competitions, many referred to an hypothetic generic approach to the Challenge on Demand problem. None of them introduced a solution or even a realistic approach, until ours.

In those approaches to Challenge on Demand, we find:

“PAIDEUSIS: A Remote Hybrid Cyber Range for Hardware, Network, and IoT Security Training”, Bera et al. (2021) provide Challenge on Demand for Hardware systems (Industrial Control Systems, FPGA).
“Design of Remote Service Infrastructures for Hardware-based Capture-the-Flag Challenges”, Marongiu and Perra (2021) related to PAIDEUSIS, with the foundations for hardware-based Challenge on Demand.
“Lowering the Barriers to Capture the Flag administration and Participation”, Kevin Chung (2017) in which it is a limitation of CTFd, where picoCTF has an “autogen” challenge feature.
“Automatic Problem Generation for Capture-the-Flag Competitions”, Burket et al. (2015) discusses the problem of generating challenges on demand, regarding fairness evaluation for players.
“Scalable Learning Environments for Teaching Cybersecurity Hands-on”, Vykopal et al. (2021) discusses the problem of creating cybersecurity environments, similarly to Challenge on Demand, and provide guidelines for production environments at scale.
“GENICS: A framework for Generating Attack Scenarios for Cybersecurity Exercices on Industrial Control Systems”, Song et al. (2024) formulates a process to generate attack scenario applicable in a CTF context, technically possible through approaches like PAIDEUSIS.

Conclusions

Even if there are some solutions developed to help the community deal with the Challenge on Demand problem, we can see many limitations: only Docker and Kubernetes are covered, and none is able to provide the required genericity to consider other providers, even custom ones.

A production-ready solution would enable the community to explore new kinds of challenges, both technical and scientific.

This is why we created chall-manager: provide a free, open-source, generic, non-vendor-lock-in and ready-for-production solution. Feel free to build on top of it. Change the game.

What’s next ?

How we tackled down the complexity of building this system, starting from the architecture.

1.4.3 - Genericity

What is the layer of genericity ?

While trying to find a generic approach to deploy any infrastructure with a non-vendor-lock-in API, we looked at existing approaches. None of them proposed such an API, so we had to pave the way to a new future. But we did not know how to do it.

One day, after deploying infrastructures with Victor, we realised it was the solution. Indeed, Victor is able to deploy any Pulumi stack. This imply that the solution was already before our eyes: a Pulumi stack.

This consist the genericity layer, as easy as this.

Pulumi as the solution

To go from theory to practice, we had to make choices. One of the problem with a large genericity is it being… large, actually.

If you consider all ecosystems covered by Pulumi, to cover them you’ll require all runtimes installed on the host machine. For instance, the Pulumi Docker image is around 1.5 GB. This imply that a generic solution covering all ecosystems would be around 2 GB of memory.

Moreover, the enhancements you can propose in a language would have to be re-implemented similarly in every language, or transpiled. As transpilation is a heavy task, either manual or automatic but with a high error rate, it is not suitable for production.

Our choice was to focus on one language first (Golang), and later permit transpilation in other languages if technically automatable with a high success rate. With this choice, we would only have to deal with the Pulumi Go Docker image, around 200 MB (a 7.5 reduction factor). It could be even more reduced using minified images, using the Slim Toolkit or Chainguard Apko.

From the idea to an actual tool

With those ideas in mind, we had to transition from TRLs by implementing it in a tool. This tool could provide a service, thus the architecture was though as a Micro Service.

Doing so enable other Micro Services or CTF platforms to be developed and reuse the capabilities of chall-manager. We can then imagine plenty other challenges kind that would require Challenge on Demand:

King of the Hill
Attack & Defense (1 vs 1, 1 vs n, 1 vs bot)
MultiSteps & MultiFlags (for Jeopardy)

We already plan creating 1 and 2.

What’s next

Understand the Architecture of the microservice.

1.4.4 - Architecture

All architecture documents, from the API to the Kubernetes deployment.

In the process of architecting a microservice, a first thing is to design the API exposed to the other microservices. Once this done, you implement it, write the underlying software that provides the service, and think of its deployment architecture to fulfill the goals.

API

We decided to avoid API conception issues by using a Model Based Systems Engineering method, to focus on the service to provide. This imply less maintenance, an improved quality and stability.

This API revolve around a simplistic class diagram as follows.

classDiagram
    class Challenge {
        + id: String!
        + hash: String!
        + timeout: Duration
        + until: DateTime
    }

    class Instance {
        + id: String!
        + since: DateTime!
        + last_renew: DateTime!
        + until: DateTime
        + connection_info: String!
        + flag: String
    }

    Challenge "*" *--> "1" Instance: instances

Given those two classes, we applied the Separation of Concerns Principle and split it in two cooperating services:

the ChallengeStore to handle the challenges configurations
the InstanceManager to be the carrier of challenge instances on demand

Then, we described their objects and methods in protobuf, and using buf we generated the Golang code for those services. The gRPC API would then be usable by any service that would want to make use of the service.

Additionally, for documentation and ease of integration, we wanted a REST JSON API: through a gateway and a swagger, that was performed still in a code-generation approach.

Software

Based on the generated code, we had to implement the functionalities to fulfill our goals.

Nothing very special here, it was basically features implementation. This was made possible thanks to the Pulumi API that can be manipulated directly as a Go module with the auto package.

Notice we decided to avoid depending on a file-database such as an S3-compatible (AWS S3 or MinIO) as those services may not be compatible either to offline contexts or business contexts (due to the MinIO license being GNU AGPL-v3.0).

Nevertheless, we required a solution for distributed locking systems that would not fit into chall-manager. For this, we choosed etcd. File storage replication would be handled by another solution like Longhorn or Ceph. This is further detailed in High-Availability.

In our design, we deploy an etcd instance rather than using the Kubernetes already existing one. By doing so, we avoid deep integration of our proposal into the cluster which enables multiple instances to run in parallel inside an already existing cluster. Additionnaly, it avoids innapropriate service intimacy and shared persistence issues described as good development practices in Micro Services architectures by Taibi et al. (2020) and Bogard (2017).

Deployment

Then, multiple deployment targets could be discussed. As we always think of online, offline and on-premise contexts, we wanted to be able to deploy on a non-vendor-specific hypervisor everyone would be free to use. We choosed Kubernetes.

To lift the requirements and goals of the software, we had to deploy an etcd cluster, a deployment for chall-manager and a cronjob for the janitor.

The Kubernetes deployment of a chall-manager.

One opportunity of this approach is, as chall-manager is as close as possible to the Kubernetes provider (intra-cluster), it will be easily feasible to manage resources in it. Using this, ChallMaker and Ops could deploy instances on Demand with a minimal effort. This is the reason why we recommend an alternative deployment in the Ops guides, to be able to deploy challenges in the same Kubernetes cluster.

Exposed vs. Scenario

Finally, another aspect of deployment is the exposed and scenario APIs. The exposed one has been previously described, while the scenario one applies on scenarios once executed by the InstanceManager.

This API will be described in the Software Development Kit.

What’s next ?

In the next section, you will understand how data consistency is performed while maintaining High Availability.

1.4.5 - High Availability

How we balanced Availability and Consistency for acceptable performances.

When designing an highly available application, the Availability and Consistency are often deffered to the database layer. This database explains its tradeoff(s) according to the CAP theorem. Nevertheless, chall-manager does not use a database for simplicity of use.

First of all, some definitions:

Availability is the characteristic of an application to be reachable to the final user of services.
High Availability contains Availability, with minimal interruptions of services and response times as low as possible.

To perform High Availability, you can use update strategies such as a Rolling Update with a maximum unavailability rate, replicate instances, etc. But it depends on a property: the application must be scalable (many instances in parallel should not overlap in their workloads if not designed so), have availability and consistency mecanisms integrated.

One question then arise: how do we assure consistency with a maximal availability ?

Fallback

As the chall-manager should scale, the locking mecanism must be distributed. In case of a network failure, it imply that whatever the final decision, the implementation should provide a recovery mecanism of the locks.

Transactions

A first reflex would be to think of transactions. They guarantee data consistency, so would be good candidates. Nevertheless, as we are not using a database that implements it, we would have to design and implement those ourselves. Such an implementation is costfull and should only happen when reusability is a key aspect. As it is not the case here (specific to chall-manager), it was a counter-argument.

Mutex

Another reflex would be to think of (distributed) mutexes. Through a mutual exclusion, we would be sure to guarantee data consistency, but nearly no availability. Indeed, a single mutex would lock out all incoming requests until the operation is completly performed. Even if it would be easy to implement, it does not match our requirements.

Such implementation would look like the following.

sequenceDiagram
    Upstream ->> API: Request
    API ->>+ Challenge: Lock
    Challenge ->> Challenge: Perform CRUD operation
    Challenge ->>- API: Unlock
    API ->> Upstream: Response

Multiple mutex

We need something finer than (distributed) mutex: if a challenge A is under a CRUD operation, we don’t need challenge B to not be able to handle another CRUD operation !

We can imagine one (distributed) mutex per challenge such that they won’t stuck one another. Ok, that’s fine… But what about instances ?

The same problem arise, the same solution: we can construct a chain of mutexes such that to perform a CRUD operation on an Instance, we lock the Challenge first, then the Instance, unlock the Challenge, execute the operation, and unlock the Instance. An API call from an upstream source or service is represented with this strategy as follows.

sequenceDiagram
    Upstream ->> API: Request
    API ->>+ Challenge: Lock
    Challenge ->> Instance: Lock
    Challenge ->>- API: Unlock
    API ->>+ Instance: Request
    Instance ->> Instance: Perform CRUD operation
    Instance ->>- Challenge: Unlock
    Instance ->> API: Response
    API ->> Upstream: Response

One last thing, what if we want to query all challenges information (to build a dashboard, janitor outdated instances, …) ? We would need a “Stop The World"-like mutex from which every challenge mutex would require context-relock before operation. To differenciate this from the Garbage Collector ideas, we call this a “Top-of-the-World” aka TOTW.

The previous would now become the following.

sequenceDiagram
    Upstream ->> API: Request
    API ->>+ TOTW: Lock
    TOTW ->>+ Challenge: Lock
    TOTW ->>- API: Unlock
    Challenge ->> Instance: Lock
    Challenge ->>- TOTW: Unlock
    API ->>+ Instance: Request
    Instance ->> Instance: Perform CRUD operation
    Instance ->>- Challenge: Unlock
    Instance ->> API: Response
    API ->> Upstream: Response

That would guarantee consistency of the data while having availability on the API resources. Nevertheless, this availability is not high availability: we could enhance further.

Writer-Preference Reader-Writer Distributed Lock

All CRUD operations are not equal, and can be split in two groups:

reading (Query, Read)
writing (Create, Update, Delete).

The reading operations does not affect the state of an object, while writing ones does. Moreover, in the case of chall-manager, reading operations are nearly instantaneous and writing ones are at least 10-seconds long. How to deal with those unbalanced operations ?

As soon as discussions on OS began, researchers worked on the similar question and found solutions. They called this one the “reader-writer problem”.

In our specific case, we want writer-preference as they would largely affect the state of the resources. Using the Courtois et al. (1971) second problem solution for writer-preference reader-writer solution, we would need 5 locks and 2 counters.

For the technical implementation, we have multiple solutions: etcd, redis or valkey. We decided to choose etcd because it was already used by Kubernetes, and the etcd client v3 already implement mutex and counters.

The triple chain of writer-preference reader-writer distributed locks.

With this approach, we could ensure data consistency throughout all replicas of chall-manager, and high-availability.

CRDT

Can a Conflict-Free Replicated data Type have been a solution ?

In our case, we are not looking for eventual consistency, but strict consistency. Moreover, using CRDT is costfull in development, integration and operation, so if avoidable they should be. CRDT are not the best tool to use here.

What’s next ?

Based on the guarantee of consistency and high availability, inform you on the other major problem: Hot Update.

1.4.6 - Hot Update

How do we handle the update of an infrastructure once it runs ?

When a challenge is affected by a bug leading to the impossibility of solving it, or if an unexpected solve is found, you will most probably want it fixed for fairness.

If this happens on challenge with low requirements, you will fix the description, files, hints, etc. But what about challenges that require infrastructures ? You may fix the scenario, but it won’t fix the already-existing instances. If no instance has been deployed, it is fine: fixing the scenario is sufficient. Once instances has been deployed, we require a mecanism to perform this update automatically.

A reflex when dealing with a question of updates and deliveries is to refer to The Update Framework. With the embodiement of its principles, we wanted to provide macro models for hot update mecanisms. To do this, we listed a bunch of deployment strategies, analysed their inner concepts and group them on this.

Precise model	Applicable	Macro model
Blue-Green deployment	✅	Blue-Green
Canary deployment	✅
Rolling deployment	✅
A/B testing	✅
Shadow deployment	✅
Red-Black deployment	✅
Highlander deployment	✅	Recreate
Recreate deployment	✅	Recreate
Update-in-place deployment	✅	Update-in-place
Immutable infrastructure	❌
Feature toggles	❌
Dark launches	❌
Ramped deployment	❌
Serverless deployment	❌
Multi-cloud deployment	❌

Strategies were classified not applicable when they did not include update mecanisms.

With the 3 macro models, we define 3 hot update strategies.

Blue-Green

The blue-green update strategy starts a new instance with the new scenario, and once it is done shuts down the old one.

It requires both instances in parallel, thus is a resources-consuming update strategy. Nevertheless, it reduces services interruptions to low or none. To the extreme, the infrastructure should be able to handle two times the instances load.

sequenceDiagram
    Upstream ->>+ API: Request
    API ->>+ New Instance: Start instance
    New Instance ->>- API: Instance up & running
    API ->>+ Old Instance: Stop instance
    Old Instance ->>- API: Done
    API ->>- Upstream: Respond new instance info

Recreate

The recreate update strategy shuts down the old one, then starts a new instance with the new scenario.

It is a resources-saving update strategy, but imply services interruptions enough time to stop the old instance and starts a new one. To the extreme, it does not require additional resources more than one time the instance load.

sequenceDiagram
    Upstream ->>+ API: Request
    API ->>+ Old Instance: Stop instance
    Old Instance ->>- API: Done
    API ->>+ New Instance: Start instance
    New Instance ->>- API: Instance up & running
    API ->>- Upstream: Respond refreshed instance info

Update-in-place

The update-in-place update strategy loads the new scenario and update resources in live.

It is a resource-saving update strategy, that imply low to none services interruptions, but require robustness in the update mecanisms. If the update mecanisms are not robust, we do not recommend this one as it could soft-lock resources in the providers. To the extreme, it does not require additional resources more than one time the instance load.

sequenceDiagram
    Upstream ->>+ API: Request
    API ->>+ Instance: Update instance
    Instance ->>- API: Refreshed instance
    API ->>- Upstream: Respond refreshed instance info

Overall

Update Strategy	Require Robustness¹	Time efficiency	Cost efficiency	Availability	TL;DR;
Update in place	✅	✅	✅	✅	Efficient in time & cost ; require high maturity
Blue-Green	❌	✅	❌	✅	Efficient in time ; costfull
Recreate	❌	❌	✅	❌	Efficient in cost ; time consuming

¹ Robustness of both the provider and resources updates.

What’s next ?

How did we incorporate security in such a powerfull service ? Find answers in Security.

1.4.7 - Security

Learn how we designed security in a “RCE-as-a-Service” system, and how we used its features for security purposes.

The problem with the genericity of chall-manager resides in its capacity to execute any Golang code as long as it fits in a Pulumi stack i.e. anything. For this reason, there are multiple concerns to address when using the chall-manager.

Nevertheless, it also provides actionable responses to security concerns, such as shareflag and bias.

Authentication & Authorization

A question regarding such a security concern of an “RCE-as-a-Service” system is to throw authentication and authorization at it. Technically, it could fit and get justified.

Nevertheless, we think that, first of all, chall-manager replicas should not be exposed to end users and untrusted services thus Ops should put mTLS in place between trusted services and restraint communications to the bare minimum, and secondly, the Separation of Concerns Principle imply authentication and authorization are another goal thus should be achieved by another service.

Finally, authentication and authorization may but justifiable if Chall-Manager was operated as a Service. As this would not be the case with a Community Edition, we consider it out of scope.

Kubernetes

If deployed as part of a Kubernetes cluster, with a ServiceAccount and a specific namespace to deploy instances, the chall-manager is able to mutate the architecture on the fly. To minimize the effect of such mutations, we recommend you provide this ServiceAccount a Role with a limited set of verbs on api groups. Those resources should only be namespaced.

To build this Role for your needs, you can use the command kubectl api-resources –-namespaced=true –o wide to visualize a cluster resources and applicable verbs.

An extract of a the resources of a Kubernetes cluster and their applicable verbs.

More details on Kubernetes RBAC.

Shareflag

One of the actionable response provided by the chall-manager is through an anti-shareflag mecanism.

Each instance deployed by the chall-manager can return in its scenario a specific flag. This flag will then be used by the upstream CTF platform to ensure the source -and only it- found the solution.

Moreover, each instance-specific flag could be derived from an original constant one using the flag variation engine.

ChallOps bias

As each instance is an infrastructure in itself, variations could bias them: lighter network policies, easier brute force, etc. A scenario is not biased by essence.

If we make a risk analysis on the chall-manager capabilities and possibilities for an event, we have to consider a biased ChallMaker or Ops that produces uneven-balanced scenarios.

For this reason, the chall-manager API does not expose the source identifier of the request to the scenario code, but an identity. This is declined as follows. It strictly identifies an infrastructure identifier, the challenge the instance was requested from, and the source identifier for this instance.

A visualization of how views are split apart to avoid the ChallOps bias.

Notice the identity is limited to 16 hexadecimals, making it compatible to multiple uses like a DNS name or a PRNG seed. This increases the possibilites of collisions, but can still cover $16^{16} = 18.446.744.073.709.551.616$ combinations, trusted sufficient for a CTF ($f(x,y) = x \times y - 16^{16}$, find roots: $x \times y=16^{16} \Leftrightarrow y=\frac{16^{16}}{x}$ so roots are given by the couple $(x, \frac{16^{16}}{x})$ with $x$ the number of challenges. With largely enough challenges e.g. 200, there is still place for $\frac{16^{16}}{200} \simeq 9.2 \times 10^{16}$ instances each).

What’s next ?

Learn how we dealt with the resources expirations.

1.4.8 - Expiration

Find how we handle fairness in the use of infrastructure resources with expirations.

Context

During the CTF, we don’t want players to be capable of manipulating the infrastructure at their will: starting instances are costful, require computational capabilities, etc. It is mandatory to control this while providing the players the power to manipulate their instances at their own will.

For this reason, one goal of the chall-manager is to provide ephemeral (or not) scenarios. Ephemeral imply lifetimes, expirations and deletions. To implement this, for each Challenge the ChallMaker and Ops can set a timeout in seconds after which the Instance will be deleted once up & running, or an until date after which the instance will be deleted whatever the timeout. When an Instance is deployed, its start date is saved, and every update is stored for traceability. A participant (or a dependent service) can then renew an instance on demand for additional time, as long as it is under the until date of the challenge. This is based on a hypothesis that a challenge should be solved after $n$ minutes.

Note

The timeout should be evaluated based on expert’s point of view regarding the complexity of the conceived challenge, with a consideration of the participant skill sets (an expert can be expected to solve an introduction challenge in seconds, while a beginer can take several minutes).

There is no “rule of the thumb”, but we recommend double-testing the challenge by both a domain-expert for technical difficulty and another ChallMaker unrelated to this domain.

Deleting instances when outdated then becomes a new goal of the system, thus we cannot extend the chall-manager as it would be a rupture of the Separation of Concerns Principle: it is the goal of another service, chall-manager-janitor. This is also justified by the frequency model applied to the janitor, which is unrelated to the chall-manager service itself. With such approach, other players could use the resources. Nevertheless, it requires a mecanism to wipe out infrastructure resources after a given time. Some tools exist to do so.

Tool	Environment
hjacobs/kube-janitor	Kubernetes
kubernetes-sig/boskos	Kubernetes
rancher/aws-janitor	AWS

Despite tools exist, they are context-specifics thus are limited: each one has its own mecanism and only 1 environment is considered. As of genericity, we want a generic approach able to handle all ecosystems without the need for specific implementations. For instance, if a ChallMaker decides to cover a unique, private and offline ecosystem, how could (s)he do ?

That is why the janitor must have the same level of genericity as chall-manager itself. Despite it is not optimal for specifics providers, we except this genericity to be a better tradeoff than covering a limited set of technologies. This modular approach enable covering new providers (vendor-specifics, public or private) without involving CTFer.io in the loop.

How it works

By using the chall-manager API, the janitor looks up at expiration dates. Once an instance is expired, it simply deletes it. Using a cron, the janitor could then monitor the instances frequently.

flowchart LR
    subgraph Chall-Manager
        CM[Chall-Manager]
        Etcd

        CM --> Etcd
    end

    CMJ[Chall-Manager-Janitor]
    CMJ -->|gRPC| CM

If two janitors triggers in parallel, the API will maintain consistency. Errors code are to expect, but no data inconsistency.

As it does not plugs into a specific provider mecanism nor requirement, it guarantees platform agnosticity. Whatever the scenario, the chall-manager-janitor will be able to handle it.

Follows the algorithm used to determine the instance until date based on a challenge configuration for both until and timeout. Renewing an instance re-execute this to ensure consistency with the challenge configuration. Based on the instance until date, the janitor will determine whether to delete it or not ($instance.until > now() \Rightarrow delete(instance)$).

flowchart LR
    Start[Compute until]

    Start-->until{"until == nil ?"}
    until---|true|timeout1{"timeout == nil ?"}
    timeout1---|true|out1["nil"]
    timeout1---|false|out2["now()+timeout"]

    until---|false|timeout2{"timeout == nil ?"}
    timeout2---|true|out3{"until"}
    timeout2---|false|out4{"min(now()+timeout,until)"}

What’s next ?

Listening to the community, we decided to improve further with a Software Development Kit.

1.4.9 - Software Development Kit

Learn how we designed a SDK to ease the use of chall-manager for non-DevOps people.

A first comment on chall-manager was that it required ChallMaker and Ops to be DevOps. Indeed, if we expect people to be providers’ experts to deploy a challenge, when there expertise is on a cybersecurity aspect… well, it is incoherent.

To avoid this, we took a few steps back and asked ourselves: for a beginner, what are the deployment practices that could arise form the use of chall-manager ?

A naive approach was to consider the deployment of a single Docker container in a Cloud provider (Kubernetes, GCP, AWS, etc.). For this reason, we implemented the minimal requirements to effectively deploy a Docker container in a Kubernetes cluster, exposed through an Ingress or a NodePort. The results were hundreds-line-long, so confirmed we cannot expect non-professionnals to do it.

Based on this experiment, we decided to reuse this Pulumi scenario to build a Software Development Kit to empower the ChallMaker. The references architectures contained in the SDK are available here. The rule of thumb with them is to infer the most possible things, to have a mimimum configuration for the end user.

Other features are available in the SDK.

Flag variation engine

Commonly, each challenge has its own flag. This suffers a big limitation that we can come up to: as each instance is specific to a source, we can define the flag on the fly. But this flag must not be shared with other players or it will enable shareflag.

For this reason, we provide the ability to mutate a string (expected to be the flag): for each character, if there is a variant in the ASCII-extended charset, select one of them randomly and based on the identity.

Variation rules

The variation rules follows, and if a character is not part of it, it is not mutated (each variant has its mutations evenly distributed):

a, A, 4, @, ª, À, Á, Â, Ã, Ä, Å, à, á, â, ã, ä, å
b, B, 8, ß
c, C, (, ¢, ©, Ç, ç
d, D, Ð
e, E, €, &, £, È, É, Ê, Ë, è, é, ê, ë, 3
f, F, ƒ
g, G
h, H, #
i, I, 1, !, Ì, Í, Î, Ï, ì, í, î, ï
j, J
k, K
l, L
m, M
n, N, Ñ, ñ
o, O, 0, ¤, °, º, Ò, Ó, Ô, Õ, Ö, Ø, ø, ò, ó, ô, õ, ö, ð
p, P
q, Q
r, R, ®
s, S, 5, $, š, Š, §
t, T, 7, †
u, U, µ, Ù, Ú, Û, Ü, ù, ú, û, ü
v, V
w, W
x, X, ×
y, Y, Ÿ, ¥, Ý, ý, ÿ
z, Z, ž, Ž
, -, _, ~

Tips & Tricks

If you want to use a decorator (e.g. BREFCTF{...}), do not put it in the flag to variate. More info here.

Limitations

We are aware that this proposition does not solve all issues: if people share their write-up, they will be able to flag. This limitation is considered out of our scope, as we don’t think the Challenge on Demand solution fits this use case.

Nevertheless, our differentiation strategy can be the basis of a proper solution to the APG-problem (Automatic Program Generation): we are able to write one scenario that will differentiate the instances per source. This could fit the input of an APG-solution.

Moreover, it considers a precise scenario of advanced malicious collaborative sources, where shareflag consider malicious collaborative sources only (more “accessible” by definition).

What’s next ?

The final step from there is to ensure the quality of our work, with testing.

1.4.10 - Testing

Building something cool is a thing, assuring its quality is another. Learn how we dealt with this specific service Integration, Verification and Validation, especially through Romeo.

Generalities

In the goal of asserting the quality of a software, testing becomes a powerful tool: it provides trust in a codebase. But testing, in itself, is a whole domain of IT. Some common requirements are:

having those tests written with a programmation language (most often the same one as the software, enable technical skills sharing among Software Engineers)
documented (the high-level strategy should be auditable to quickly assess quality practices)
reproducible (could be run in two distinct environments and produce the same results)
explainable (each test case sould document its goal(s))
systematically run, or if not possible, as frequent as possible (detect regressions as soon as possible)

To fulfill those requirements, the strategy can contain many phases which each focused on a specific aspect or condition of the software under tests.

In the following, we provide the high-level testing strategy of the chall-manager Micro Service. Software Engineers are invited to read it as such strategy is rare: we challenged the practices to push them beyond what the community does, with Romeo.

Testing strategy

The testing strategy contains multiple phases:

unit tests to ensure core functions behave as expected ; those does not depend on network, os, files, etc. out thus the code and only it.
functional tests to ensure the system behaves as expected ; those require the system to run hence a place to do so.
integration tests to ensure the system is integrated properly given a set of targets ; those require whether the production environment or a clone of it.

Additional Quality Assurance steps could be found, like Stress Tests to assess Service Level Objectives under high-load conditions.

In the case of chall-manager, we write both the code and tests in Golang. To deploy infrastructures for tests, we use the Pulumi automation API and on-premise Kubernetes cluster (built using Lab and L3).

Unit tests

The unit tests revolve around the code and are isolated from anything else: no network, no files, no port listening, other tests, etc. This provides them the capacity to get run on every machine and always produce the same results: idempotence.

As chall-manager is mostly a wrapper around the Pulumi automation API to deploy scenarios, it could not be much tested using this step. Code coverage could barely reach $10\%$ thus confidence is not sufficient.

Convention is to prefix these tests Test_U_ hence could be run using go test ./... -run=^Test_U_ -v.

They most often have the same structure, based on Table-Driven Testing (TDT). Some diverge to fit specific needs (e.g. no regression on an issue related to a state machine).

package xxx_test

import (
    "testing"

    "github.com/stretchr/testify/assert"
)

func Test_U_XXX(t *testing.T) {
    t.Parallel()

    var tests = map[string]struct{
        // ... inputs, expected outputs
    }{
        // ... test cases each named uniquely
        // ... fuzz crashers if applicable, named "Fuzz_<id>" and a description of why it crashed
    }

    for testname, tt := range tests {
        t.Run(testname, func(t *testing.T) {
            assert := assert.New(t)

            // ... the test body

            // ... assert expected outputs
        })
    }
}

Functional tests

The functional tests revolve around the system during execution, and tests behaviors of services (e.g. state machines). Those should be reproducible, but some unexpected behaviors could arise: network interruptions, no disk available, etc. but should not be considered as the first source of a failing test (the underlying infrastructure is most often large enough to run the test, without interruption).

Convention is to prefix these tests Test_F_ hence could be run using go test ./... -run=^Test_F_ -v. They require a Docker image (build artifact) to be built and pushed to a registry. For Monitoring purposes, the chall-manager binary is built with the -cover flag to instrument the Go binary such that it exports its coverage data to filesystem. As they require a Kubernetes cluster to run, you must define the environment variable K8S_BASE with the DNS base URL to reach this cluster. Cluster should fit the requirements for deployment.

Their structure depends on what needs to be tested, but follows TDT approach if applicable. Deployment is performed using the Pulumi factory.

package xxx_test

import (
    "os"
    "path"
    "testing"

    "github.com/stretchr/testify/assert"
)

func Test_F_XXX(t *testing.T) {
    // ... a description of what is the goal of this test: inputs, outputs, behaviors

    cwd, _ := os.Getwd()
	integration.ProgramTest(t, &integration.ProgramTestOptions{
		Quick:       true,
		SkipRefresh: true,
		Dir:         path.Join(cwd, ".."), // target the "deploy" directory at the root of the repository
		Config: map[string]string{
			// ... more configuration
		},
		ExtraRuntimeValidation: func(t *testing.T, stack integration.RuntimeValidationStackInfo) {
            // If TDT, do it here
            
			assert := assert.New(t)

            // ... the test body

            // ... assert expected outputs
        },
    })
}

Integration tests

The integration tests revolve around the use of the system in the production environment (or most often a clone of it, to ensure no service interruptions on the production in case of a sudden outage). In the case of chall-manager, we ensure the examples can be launched by the chall-manager. This requires us the use of multiple providers, thus a specific configuration (to sum it up, more secrets than the functional tests).

Convention is to prefix these tests Test_I_ hence could be run using go test ./... -run=^Test_I_ -v. They require a Docker image (build artifact) to be built and pushed to a registry. For Monitoring purposes, the chall-manager binary is built with the -cover flag to instrument the Go binary such that it exports its coverage data to filesystem. As they require a Kubernetes cluster to run, you must define the environment variable K8S_BASE with the DNS base URL to reach this cluster. Cluster should fit the requirements for deployment.

Their structure depends on what needs to be tested, but follows TDT approach if applicable. Deployment is performed using the Pulumi factory.

package xxx_test

import (
    "os"
    "path"
    "testing"

    "github.com/stretchr/testify/assert"
)

func Test_F_XXX(t *testing.T) {
    // ... a description of what is the goal of this test: inputs, outputs, behaviors

    cwd, _ := os.Getwd()
	integration.ProgramTest(t, &integration.ProgramTestOptions{
		Quick:       true,
		SkipRefresh: true,
		Dir:         path.Join(cwd, ".."), // target the "deploy" directory at the root of the repository
		Config: map[string]string{
			// ... configuration
		},
		ExtraRuntimeValidation: func(t *testing.T, stack integration.RuntimeValidationStackInfo) {
            // If TDT, do it here

			assert := assert.New(t)

            // ... the test body

            // ... assert expected outputs
        },
    })
}

Monitoring coverages

Beyond testing for Quality Assurance, we also want to monitor what portion of the code is actually tested. This helps Software Development and Quality Assurance engineers to pilot where to focus the efforts, and in the case of chall-manager, what conditions where not covered at all during the whole process (e.g. an API method of a Service, or a common error).

By monitoring it and through public display, we challenge ourselves to improve to an acceptable level (e.g. $\ge 85.00\%$).

Disclaimer

Do not run after coverages: $100\%$ code coverage imply no room for changes, and could be a burden to develop, run and maintain.

What you must cover are the major and minor functionalities, not all possible node in the Control Flow Graph. A good way to start this is by writing the tests by only looking at the models definition files (contracts, types, documentation).

When a Pull Request is opened (whether dependabot for automatic updates, a bot or an actual contributor), the tests are run thus helps us understand the internal changes. If the coverage decreases suddenly with the PR, reviewers would ask the PR author(s) to work on tests improvement. It also makes sure that the contribution won’t have breaking changes, thus no regressions on the covered code.

For security reasons, the tests that require platform deployments require a first review by a maintainer.

To perform monitoring of those coverages, we integrate Romeo in the testing strategy as follows.

Coverages extract performed on the high-level testing strategy used for chall-manager. Values are fictive.

By combining multiple code coverages we build an aggregated code coverage higher than what standard Go tests could do.

1.5 - Tutorials

A set of tutorials to use the Chall-Manager.

1.5.1 - A complete example

Let’s go from a challenge idea to chall-manager deploying instances !

In the current documentation page, we will go through every step a ChallMaker will encounter, from the concept to players deploying instances in production.

flowchart LR
    Concept --> Build
    Build --> Scenario
    Scenario --> Pack
    Pack --> Deploy

    subgraph Build
        B1["Docker image"]
    end

    subgraph Scenario
        S1["Pulumi.yaml"]
        subgraph ExposedMonopod
            S2["main"]
        end
        S1 -.-> S2
    end

    subgraph Pack
        P1["scenario.zip"]
    end

The concept

Imagine you are a ChallMaker who wants to challenge players on web application pentest.

The challenge is an application that require a licence to unlock functionalities, especially one that can read files vulnerable to local file inclusion. The initial access is due to an admin:admin account. The state of the application will be updated once the licence check is bypassed. Its artistic direction is considered out of scope for now, but you’ll find this example all along our documentation !

flowchart LR
    A["Web app access"]
    A --> |"Initial access with admin:admin"| B
    B["Administrator account"]
    B --> |"License validation"| C 
    C["Unlock vulnerable feature"]
    C --> |"Local File Inclusion"| D
    D["Read flag"]

Obviously, you don’t want players to impact others during their journey: Challenge on Demand is a solution.

Build the challenge

Using your bests software engineering skills, you conceive the application in the language of your choice, with the framework you are used to. You quickly test it, everthing behaves as expected, so you build a Write-Up for acceptance by reviewers.

This challenge is then packed into a Docker image: account/challenge:latest We will then want to deploy this Docker image for every source that wants it.

Construct the scenario

To deploy this scenario, we don’t have big needs: one container, and a Kubernetes cluster. We’ll use the Kubernetes ExposedMonopod to ease its deployment.

First of all, we create the Pulumi.yaml file to handle the scenario. We write it to handle a pre-compiled binary of the scenario, for better performances.

Pulumi.yaml


name: stateful-webapp
description: The scenario to deploy the stateful web app challenge.
runtime:
  name: go
  options:
   binary: ./main

Create the Go module using go mod init example. Then, we write the scenario file.

main.go


package main

import (
	"github.com/ctfer-io/chall-manager/sdk"
	"github.com/ctfer-io/chall-manager/sdk/kubernetes"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	sdk.Run(func(req *sdk.Request, resp *sdk.Response, opts ...pulumi.ResourceOption) error {
		cm, err := kubernetes.NewExposedMonopod(req.Ctx, &kubernetes.ExposedMonopodArgs{
			Image:      "account/challenge:latest", // challenge Docker image
			Port:       8080,                       // pod listens on port 8080
			ExposeType: kubernetes.ExposeIngress,   // expose the challenge through an ingress (HTTP)
			Hostname:   "brefctf.ctfer-io.lab",     // CTF hostname
			Identity:   req.Config.Identity,        // identity will be prepended to hostname
		}, opts...)
		if err != nil {
			return err
		}

		resp.ConnectionInfo = pulumi.Sprintf("curl -v https://%s", cm.URL) // a simple web server
		return nil
	})
}

Download the required dependencies using go mod tidy.

To test it you can open a terminal and execute pulumi up. It requires your host machine to have a kubeconfig or a ServiceAccount token in its filesystem, i.e. you are able to execute commands like kubectl get pods -A.

Finally, compile using CGO_ENABLED=0 go build ./main main.go.

Send it to chall-manager

The challenge is ready to be deployed. To give those information to chall-manager, you have to build the scenario zip archive. As the scenario has been compiled, we only have to archive the Pulumi.yaml and main files.

zip -r scenario.zip Pulumi.yaml main

Then, you have to create the challenge (e.g. some-challenge) in chall-manager. You can do it using the gRPC API or the HTTP gateway. We’ll use chall-manager-cli to do so easily.

chall-manager-cli --url chall-manager:8080 challenge create \
    --id "some-challenge" \
    --file scenario.zip

Now, chall-manager is able to deploy our challenge for players.

Deploy instances

To deploy instances, we’ll mock a player (e.g. mocking-bird) Challenge on Demand request using chall-manager-cli. In reality, it would be to the CTF platform to handle the previous step and this one, but it is considered out of scope.

chall-manager-cli --url chall-manager:8080 instance create \
    --challenge_id "some-challenge" \
    --source_id "mocking-bird"

This will return us the connection information to our instance of the challenge.

1.6 - Security

Explanations on the Security problems that could arise from a chall-manager deployment.

RCE-as-a-Service

Through the whole documentation, we often refer to chall-manager as an RCE-as-a-Service platform. Indeed it executes scenarios on Demand, without authentication nor authorization.

For this reason, we recommend deployments to be deeply burried in the infrastructure, with firewall rules or network policies, encrypted tunnels between the dependent service(s), and anything else applicable.

Under no condition you should launch it exposed to participants and untrusted services. If not, secrets could be exfiltrated, host platform could be compromised, etc.

Kubernetes

If you are not using the recommended architecture, please make sure to not deploy instances in the same namespace as the chall-manager instances are deployed into. Elseway, players may pivot through the service and use the API for malicious purposes.

Additionally, please make sure the ServiceAccount the chall-manager Pods use has only its required permissions, and if possible, only on namespaced resources. To build this, you can use kubectl api-resources –-namespaced=true –o wide.

As the chall-manager could become costful to deploy and maintain at scale, you may want to share the deployments between multiple plateforms. Notice the Community Edition does not provide isolation capabilities, so secrets, files, etc. are shared along all scenarios.

1.7 - Glossary

The concepts used or introduced by the Chall-Manager.

Challenge on Demand

The capacity of a CTF platform to empower a source to deploy its own challenges autonomously.

Scenario

It is the refinement of an artistic direction for a CTF.

In the case of Chall-Manager, it could be compared as the recipe of deployment for a given challenge. Technically, the scenario is a Pulumi entrypoint written in Go that conforms to the SDK. When launched, it deploys the source’s infrastructure and return data such as the connection information or an instance-specific flag.

Source

Either a team or user at the origin of a request. For abstraction purposes, we consider it being the same under the use of the “source” term.

Identity

An identity tie a challenge, a source and an instance request together. This last one is random (crypto) thus can’t be guessed. It enable the chall-manager to strictly identify resources as part of separate instances running at the same spot, and provide the scenario a reproductible random seed in case of update (idempotence is not guaranteed through the challenge lifecycle).

Instance

An instance is the product of a scenario, once launched with an identity.

Player

A player is a CTF participant who is going to manipulate instances of challenges throughout the lifetime of the event.

ChallMaker

The designer of the challenge, often with a security expert profile on the category (s)he contributes to. This is an essential role for a CTF event, as without them, the CTF would simply not exist !

Notice it is the responsibility of the ChallMaker to make its challenge playable, not the Ops. If you can’t make your challenge run into pre-prod/prod, you can’t blame the Ops.

(S)He collaborates with plenty profiles:

other ChallMakers to debate its ideas and assess the difficulty.
Ops to make sure its challenges can reach production smoothly.
Admins to discuss the technical feasibility of its challenges, for instance if it requires FPGAs, online platforms as GCP or AWS, etc. or report on the status of the CTF.
an artistic direction, graphical designer, etc. to assist on the coherence of the challenge in the whole artistic process.

Ops

The Operator of the event who ensure the infrastructure is up and running, everything runs untroubled thus players can compete. They do not need to be security experts, but might probably be due to the community a CTF brings.

They are the rulers of the infrastructure, its architecture and its incidents. ChallMakers have both fear and admiration on them, as they enable playing complex scenarios but are one click away of destructing everything.

(S)He collaborates with various profiles:

other Ops as a rubber ducky, a mental support during an outage or simply to work in group.
ChallMakers to assist writing the scenarios in case of a difficulty or a specific infrastructure architecture or requirement.
Admins to report on the current lifecycle of the infrastructures, the incidents, or provide ideas for evolutions such as a partnership.
a technical leader to centralize the reflexions on architectures and means to enable the artistic direction achieving their goals.

Administrator

The Administrator is the showcase of the event. (S)He takes responsibility and decisions during the creation process of the event, make sure to synchronize teams throughout the development of the artistic and technical ideas, and manage partnerships if necessary. They are the managers through the whole event, before and after, not only during the CTF.

(S)He basically collaborates with everyone, which is a double-edged sword: you take the gratification of the whole effort, but have no time to rest.

2 - CTFd-chall-manager

CTFd-chall-manager is the official chall-manager integration on CTFd.

Warning

Currently entering private alpha phase, for any issue: ctfer-io@protonmail.com

2.1 - Get started

Ready to use quickstart to use chall-manager on CTFd.

2.1.1 - Setup

Learn how to install the plugin in your CTFd.

Goal

This tutorial will guide you through the installation and configuration of the CTFd-chall-manager plugin to use chall-manager.

Prerequisites

Ensure you have chall-manager running before starting this tutorial. You can find the relevant documentation for setup instructions.

Install the plugin

If you are not using the docker-compose.yml file, you need to clone the repository into CTFd/CTFd/plugins/ctfd-chall-manager.

# Clone the CTFd repository
git clone https://github.com/CTFd/CTFd

# Clone the plugin repository
git clone https://github.com/ctfer-io/ctfd-chall-manager CTFd/CTFd/plugins/ctfd-chall-manager

# Start Redis with Docker
docker run -d -p 6379:6379 redis:<version>

## Configure plugin to use redis serveur
export REDIS_URL=redis://localhost:6379

# Start CTFd
cd CTFd
python3 -m venv venv 
source venv/bin/activate 
pip3 install -r requirements.txt
python3 serve.py

# Clone the plugin repository
git clone https://github.com/ctfer-io/ctfd-chall-manager

# Create Docker network
docker network create testing

# Start Redis with Docker
docker run -d --name redis-svc --network testing redis:<version>

# Start CTFd with Docker
docker run -d -p 8000:8000 -e REDIS_URL=redis://redis-svc:6379 -v ./ctfd-chall-manager:/opt/CTFd/CTFd/plugins/ctfd-chall-manager --network testing ctfd/ctfd:<version>

After completing this step, you should be able to access the plugin settings configuration in the CTFd UI.

If the plugin does not appear, verify your container volume mounts, then check the CTFd logs for import module entries, such as:

Configure the plugin to use chall-manager

To connect the plugin to chall-manager, go to the plugin settings.

The default configuration is:

Adjust the plugin settings to match your environment, ensuring CTFd can communicate with chall-manager. For instance:

CTFd can successfully reach chall-manager

What’s next?

Congratulations! At this point, your setup is ready to use chall-manager for your CTF events.

Create a challenge

2.1.2 - Create a challenge

Create a challenge and understand basic configuration.

Goal

In this tutorial, we will create a dynamic_iac challenge, a new challenge type introduced by the plugin.
If you are unfamiliar with the new attributes of the dynamic_iac challenge type, please refer to the related design.
For guidance on maintenance operations (e.g., modifying challenge attributes), please refer to the relevant guides. For details on the Infra-as-Code scenario, consult the appropriate documentation.

Create the Challenge

For this example, we will create a challenge where each player gets their own instance. The instance will cost user 2 mana units, must be destroyed after 10 minutes without maximum due date, and using the scenario from the no-sdk examples.

Here are the basic CTFd settings:

Key	Value
Name	example
Category	example
Message	example
Initial Value	500
Decay Function	Logarithmic
Decay	10
Minimum Value	10

First, configure the scope. Since we want each player to have their own instance, disable the global scope.

Next, set the mana cost. Players will need to spend 2 mana to deploy their own instance of the challenge.

As mentioned, we want instances to be destroyed after 10 minutes of usage (600 seconds), without any due date.

Let the Until value empty, and configure de Timeout value at 600.

Then, provide the scenario archive. For this example, we’ll use demo-deploy.zip from the no-sdk examples.

Finally, click Create to set up the challenge.

Note

When you click Create, the upload process to the chall-manager may take several seconds. Please be patient.

What’s Next?

Congratulations! Your CTF installation is now configured to use the chall-manager for this challenge.

2.1.3 - Play the challenge

Visualize what the player will see during your CTF.

Goal

In this tutorial we will see all actions a user has access to in order to control its instances.

Prerequisites

At this step, we assume that you are a CTF player, the infrastructure is already configured and you understand the key concepts, if not please refer to associated design.

Differents challenges mode

You can combine the values Until and Timeout to have 4 modes according to your needs:

Players can control all combinaisons, but each has these specificities.

Note

If global scope is enabled, only admins can launch the instance, see associated guide.

For all challenges, the default view (instance is not booted) will display a button to launch the instance, the mana cost and the remaining mana for current user.

To start the instance, click on the button.

None

This mode allows you to use your instance with no duration limit. Its deployment can cost you mana, but if you want to regain it, you need to destroy the instance. If you want to reset your instance (either because it is soft-locked or you corrupted it), you can restart it.

Until

This mode allows you to use the same actions as None mode, but the instance will be destroyed by the Janitor at a due date. Don’t worry, your mana will be automatically regained if the instance is janitored.

Timeout

This mode allows you to use the same actions as None and Until, but the instance will be destroyed by the Janitor n seconds after their start.

If you see that the Janitor will destroy you instance soon, you can renew the instance (reset the timer).

Both

With Timeout and Until, the plugin will display the timeout mode buttons, but Chall-Manager will take care of restricting or not the possibility of renewing the challenge based on the design.

What’s next ?

Congrat’s ! If you’ve made it this far, you’ll probably want to make a challenge for your event, and the plugin documentation doesn’t include explanations on how to do this, so please refer to the associated documentation.

Otherwise, you can learn more about the plugin’s design or advanced user guides here:

2.2 - Guides

Step by step guides for advanced usage.

2.2.1 - Settings

Manage the plugin settings.

Goal

This guide assumes you are a CTF administrator and you understand the key concepts. Before or during your event, you may need to configure or update the plugin. At the moment, you can configure the total amount of mana for Source or and the chall-manager API URL.

Configure with environment variables

To configure the plugin at CTFd startup, you can use the next environment variables:

Variable	Default	Description
PLUGIN_SETTINGS_CM_API_URL	http://localhost:9090/api/v1	URL of Chall-Manager API
PLUGIN_SETTINGS_CM_MANA_TOTAL	0	Maximum mana that source are allowed to use

Note

The environment variable lookup is triggered at CTFd first startup. To modify settings, you need to change it on CTFd UI.

Configure in UI

To configure or perform an update, Go to CTFd Admin Panel > Plugins > chall-manager > Settings, (1) select the text input, edit it, then (2) submit form, as shown below:

Warning

We strongly recommends you to NOT edit the chall-manager API URL during your event.

2.2.2 - Challenge

Manage your dynamic_iac challenge.

Goal

Here, we assume that you are a CTF administrator, the infrastructure is already configured and you understand the key concepts. During or before your event you may need to change challenge attributes.

For all updates, go to the CTFd Admin Panel and edit the challenge (https://CTFD_URL/admin/challenges).

Change the global scope

When you arrive on the modification page, the value displayed is the one configured at the challenge level.

By editing this value, you trigger instances destruction, see workflow belows:

flowchart LR
    A[Update] --> B
    B[Update on CTFd backend] --> C 
    C{challenge global ?} 
    C -->|True|F[Destroy all instances]
    C -->|False| E[Destroy global instance] 
    E --> H[Send update payload to CM]
    F --> H

Change the mana cost

When you arrive on the modification page, the value displayed is the one configured at the challenge level.

By editing this value, you do not edit the existing coupons of this challenge. Also, you can organize sales periods.

Change Timeout

When you arrive on the modification page, the value displayed is the one configured at the challenge level.

You can change or reset this value, Chall-Manager will update all the computed until for instances.

Change Until

When you arrive on the modification page, the value displayed is the one configured at the challenge level.

You can change or reset this value, Chall-Manager will update all the computed until for instances.

Change the scenario

When you arrive on the modification page, you can download the current scenario archive.

By editing this value, you need to provide an update strategy.

The update can be long, depends on the update gap and the strategy.

2.2.3 - Panel

Prepare your event or perform batch deployment.

Goal

Here, we assume that you are a CTF administrator, the infrastructure is already configured and you understand the key concepts.

During your event you may need to deploy instances for users we will see several use cases. If mana is enabled, 1 coupon will be created for each user.

Note

Even if users don’t have the required mana, the instance will be created here.

Deploy instances for users

There may be several reasons why administrators create instances for users. Here, we’ll imagine that the CTF hasn’t started yet, that it’s a CTF in team mode and that 10 teams are expected. For the 10 teams, we want to deploy 1 instance of the “generated-flag” challenge.

Go to CTFd Admin Panel > Plugins > chall-manager > Panel such as:

By clicking on the button, you will send several requests to CTFd. To make the user experience easier, we add a progress bar.

Once the progress bar is done (this mean instances are created on chall-manager), you can go the instances monitoring panel.

You can use this method during your event to deploy a hidden challenge (deploy instances before the challenge is release) for instance.

Deploy a global scope instance

Here, we’ll imagine that you did configure a challenge with global scope enabled. For the global challenge, you must put the pattern at 0.

Like the previous use case, you will get a progress bar while the instance are getting deployed.

2.2.4 - Instances

Monitor and Manage instances.

Goal

This guide assumes you are a CTF administrator with a properly configured infrastructure and an understanding of key concepts.

During your event, you may need to monitor or manage instances associated with Sources.

How to do it

Monitoring

To monitor instances, go to the plugin settings in the CTFd Admin Panel > Plugins > chall-manager > Instances section.

Administration

You can Renew or Destroy an individual instance by clicking the corresponding button in the instance row.

To perform actions on multiple instances, (1) select the instances using the checkboxes, (2) click the associated button, (3) then confirm your choice, as shown below:

2.2.5 - Mana

Monitor mana usage.

Goal

Here, we assume that you are a CTF administrator, the infrastructure is already configured and you understand the key concepts.

During your event you may need to monitor the consumed mana for Source.

How to do it

Access the plugin settings configuration in the CTFd Admin Panel > Plugins > chall-manager > Mana

2.3 - Design

Understand the CTFd integration of chall-manager.

2.3.1 - Architecture

Learn how we design the plugin.

Concept

As explained in the chall-manager documentation, we avoid exposing its API to prevent the risk of direct resource manipulation by players.

CTFd inherently provides functionalities like authentication, team management, scoring, and flag handling. By adding our plugin, CTFd can serve as both a challenge management platform for administrators and a request manager that acts as a proxy with user authentication, mana limitations, and more.

Overview

The basic architecture is straightforward: we have created new API endpoints for both administrators and users. These endpoints mainly handle CRUD operations on challenge instances.

API

AdminInstance

This endpoint allows administrators to perform CRUD operations on challengeId for a specified sourceId. Essentially, this endpoint forwards requests to the chall-manager for processing.

UserInstance

Unlike the AdminInstance endpoint, this one does not accept sourceId as a parameter. Instead, it automatically identifies the source issuing the request and checks mana availability before forwarding the request to the chall-manager.

UserMana

This endpoint handles GET requests to display the remaining mana of the source issuing the request.

Detailed Overview

The following diagram provides a more detailed view of how your browser interacts with the API endpoints and how these endpoints are mapped to the corresponding chall-manager endpoints.

2.3.2 - Challenge

Learn how we design the dynamic_iac challenge.

Overview

The plugin introduces a new challenge type called dynamic_iac, allowing the deployment of instances per user or team (referred to as a source). It works seamlessly with both user modes defined by CTFd (individual users or teams).

What’s new about this challenge type?

To get started, navigate to the challenge creation panel in CTFd and select the dynamic_iac challenge type.

Here’s what the plugin adds:

Global Scope

The global scope is a boolean setting that allows a single instance to be shared among all players.

For instance, in the following setup, challenges 1, 2, and 3 have the global scope enabled, while challenge 4 does not:

In this scenario, player X (yellow) and player Y (blue) will each have their own instances for challenge 4, but will share the same instance for challenges 1, 2, and 3. We recommend enabling this feature for static, stateless applications (e.g., websites).

Mana Cost

The mana cost is an integer representing the price users must pay in mana to deploy their own instance. Mana is refunded when the instance is destroyed. This system helps control the impact users have on the platform. For more details, see how mana works.

Until

The until setting allows you to specify a date and time at which instances will be automatically destroyed by the Janitor.

Example:
As a CTF administrator running a week-long event, you want a challenge available only on the first day. Set the Janitoring Strategy to Until, and configure the end date to DD/MM/YYYY 23:59.

As a player, you can start your instance anytime before this date and destroy it whenever you like before the deadline.

Timeout

The timeout is an integer that specifies, in seconds, how long after starting an instance the Janitor will automatically destroy it.

Example:
As a CTF administrator or challenge creator, you estimate that your challenge takes about 30 minutes to solve. Set the Janitoring Strategy to Timeout and set the value to 1800 seconds (30 minutes).

As a player, once you start your instance, it will be destroyed after 30 minutes unless renewed. You can also manually destroy the instance at any time to reclaim your mana.

Tips & Tricks

You can either combine the Until and Timeout values or leave both undefined. Find more info here.

Scenario Archive

The scenario is a zip archive defining the challenge deployment as a Golang Pulumi project. You can use examples or create your own using the SDK and Pulumi.

2.3.3 - Flag

Learn how we use flag variation to prevent shareflag.

Concept

Cheating is a part of competitions, especially cash-prize is involved. The player experience is particularly frustrating, so we try to minimize some of the common cases of cheating.

We’ve concentrated on the two most common cases, namely ShareFlag and FlagHolding:

ShareFlag: get or share flags with an another team.
FlagHolding: store flags to submit them on the very last moment.

How this work

We redefine the submit method of the dynamic_iac challenge to verify the generated flag against the current source.

Here’s the algorithm:

flowchart LR
    Submission --> A{Instance On ?}
    A -->|True| B{Instance I flag ?}
    A -->|False| Expired
    B -->|True| C{submission == I.flag}
    B -->|False| D{CTFd C flag ?}
    C -->|True| Correct
    C -->|False| D
    D -->|True| E{submission == C.flag}
    D -->|False| Incorrect
    E -->|True| Correct
    E -->|False| Incorrect

    %% I/O
    Submission
    Expired
    Correct
    Incorrect

On the submit method, we get all informations of the instance on chall-manager.

We want the instance ON to prevent FlagHolding and make sure the submitted flag is the one in the instance.
We check if the instance define a flag (if you don’t use the flag variation on the sdk).
If the instance define a flag, we check if the submission is correct.
If the instance does not define a flag or if the submission is incorrect, we check if a flag is defined on CTFd (also use as fallback).
Now, it is the classic behavior of CTFd.

Conclusion

Addressing the shareflag issue is crucial to maintaining fairness in CTF competitions. By redefining the submission process with methods like the dynamic_iac challenge, we ensure that only legitimate efforts are rewarded, preserving the true spirit and integrity of the competition.

FAQ

Why don’t we use the CTFd Flag system for generated flag ?

That’s simple: if we generate the flag on CTFd, each team can submit any generated flag, which doesn’t address the shareflag issue.

Why must the instance be ON to submit?

There are two main reasons for this requirement:

To prevent ShareFlag: the instance must be up and running to retrieve the generated flag from chall-manager, ensuring flags are unique and not shared between participants.
To prevent FlagHolding: keeping the instance active forces players to consume their mana when submitting flags, encouraging continuous participation in the competition. Additionnaly, the generated flag is valid for a running instance and will be regenerated in case of recreation.

Why the CTFd Flag is consider as “fallback” ?

As we said before, the CTFd Flag system allows users to submit the same flag. We use this system to prevent connection error or latency with chall-manager or if the generated flag is invalid for synthax error (we choose the extended ASCII so it should not happen).

2.3.4 - Source

Learn how the Source abstraction is implemented in the plugin.

Concept

As explained here, the Source refers to the team or user that initiates a request. This abstraction allows compatibility with CTFd operating in either “users” or “teams” mode, making the plugin versatile and usable in both modes.

To enable sharing across all users, the sourceId is set to 0. The table below summarizes the different scenarios:

user_mode	global scope	sourceId
users	FALSE	current_user.id
teams	FALSE	current_user.team_id
users	TRUE	0
teams	TRUE	0

Example workflow

Instance creation

Here an example of sourceId usage in the instance creation process:

flowchart LR
    A[Launch] --> B{CTFd mode}
    B -->|users| C[sourceId = user.id]
    B -->|teams| D[sourceId = user.team_id]
    D --> E{challenge.global_scope}
    C --> E
    E -->|True| F[sourceId = 0]
    E -->|False| End1
    F --> End1

    End1(((1)))

The rest of the workflows is detailed in mana section.

2.3.5 - Mana

Discover how we designed mana to prevent infrastructure overload.

Concept

As a CTF administrator, it’s important to manage infrastructure resources such as VMs, containers, or Kubernetes Pods effectively. To avoid overloading the infrastructure by deploying excessive instances, we introduced a “mana” system that assigns a cost to each challenge.

For instance, consider these three challenges:

challenge1: a static web server in the web category;
challenge2: a Unix shell challenge in the pwn category;
challenge3: a Windows server VM in the forensic category.

Challenge3 demands more resources than Challenge2, and Challenge2 requires more than Challenge1.

Suppose each team has a mana limit of 5, with the following costs assigned:

Name	Cost
challenge1	1
challenge2	2
challenge3	5

In this scenario, teams can deploy instances for Challenge1 and Challenge2 simultaneously but must remove both to deploy Challenge3.

Example:

Team 1: Focused on pwn challenges, will likely prioritize deploying Challenge2.
Team 2: Specializes in Windows forensics, making Challenge3 their main priority.

This system helps administrators control the use of resources on CTF infrastructures.

Combined with the flag variation engine feature of chall-manager, mana also helps minimize Flag Holding. To progress, players must submit flags and destroy instances to reclaim mana. If players choose to hoard flags until the last moment, mana limits their ability to continue, adding a strategic element to the competition.

How it works

Mana Total

To set the total mana available to users, go to the Settings section of the plugin and adjust the “Mana Total” field. By default, this feature is disabled with a value of 0. The total mana can be modified during the event without any impact on existing configurations.

Mana Cost

Each dynamic_iac challenge is assigned a mana cost, which can be set to 0 if no cost is desired. Changes to mana costs can be made during the event, but previously issued coupons will retain the cost at the time of their creation.

Coupons

When a user requests an instance, a coupon is generated as long as they have enough mana. Administrators can bypass restrictions and deploy instances directly through the admin panel, creating coupons without limits. Each coupon stores the sourceId, challengeId, and challengeManaCost at the time of creation.

Example workflow

Instance creation

Here an example of the usage of mana and the coupon creation while the instance create process:

flowchart LR
    End1(((1)))

    End1 --> G{Mana is enabled ?}
    G -->|True|H{Source can afford ?}
    G -->|False|I[Deploy instance on CM]
    H -->|True|J[Create coupon]
    H -->|False|End
    J --> I
    I --> K{Instance is deployed ?}
    K -->|True|M[End]
    K -->|False|N{Mana is enabled?}
    N -->|True|L[Destroy coupon]
    N -->|False|M
    L --> M

Detailed process:

We check that Mana Total is greater than 0.
If mana is enable, we check that the Source can afford the current Mana Cost.
If Source cannot afford the instance, the process end.
If Source can afford, we create a coupon.
While the coupon is created, or the mana feature is disable, we create the instance on chall-manager.
We check that the instance is running on Chall-Manager.
If it’s running correctly, we end the process.
If not, we destroy the coupon if exists, then end process with an error.

Synchronicity

Due to the vital role of mana, we have to ensure its consistency: elseway it could be possible to brute-force the use of Chall-Manager to either run all possible challenges thus bypassing restrictions.

To provide this strict consistency, we implemented a Writer-preference Reader-Writer problem based upon Courtois et al. (1971) work. We model the problem such as administration tasks are Writers and players Readers, hence administration has priority over players. For scalability, we needed a distributed lock system, and due to CTFd’s use of Redis, we choosed to reuse it.

FAQ

Why aren’t previous coupon prices updated?

We modeled the mana mechanism on a pricing strategy similar to retail: prices can fluctuate during promotions, and when an item is returned, the customer receives the price paid at purchase. This approach allows CTF administrators to adjust challenge costs dynamically while maintaining consistency.

Why can administrators bypass the mana check?

During CTF events, administrators might need to deploy instances for users who have exhausted their mana. While coupons will still be generated and mana consumed, administrators can set the cost to 0 at the challenge level to waive the charge if needed.

2.3.6 - Testing

Learn how we test the plugin.

Purpose

Developing in an Open Source ecosystem is good, being active in responding to problems is better, detecting problems before they affect the end-user is the ideal goal!

Following this philosophy, we put a lot of effort into CI testing, mainly using Cypress, and analyze code security via CodeQL.

GitHub Actions

From a technology point of view, we use GitHub Actions whenever possible to automate our tests.

Security

To ensure ongoing security, we enable advanced security analysis on the repository and conduct periodic scans with CodeQL on each pull request.

Testing

Our tests are systematically executed on each push and pull request, following a two-stage process.

The first stage involves Cypress for plugin integration, where we validate correct test cases to ensure comprehensive coverage. The second stage is executed to check for potential error with invalid cases. This stage runs after the Cypress tests to save time and due to technical constraints, such as the creation of challenges (preconditions).

To detect potential divergences with CTFd mode, we execute the tests in mode: users and mode: teams.