Expiration

Find how we handle fairness in the use of infrastructure resources with expirations.

Context

During the CTF, we don’t want players to be capable of manipulating the infrastructure at their will: starting instances are costful, require computational capabilities, etc. It is mandatory to control this while providing the players the power to manipulate their instances at their own will.

For this reason, one goal of the chall-manager is to provide ephemeral (or not) scenarios. Ephemeral imply lifetimes, expirations and deletions. To implement this, for each Challenge the ChallMaker and Ops can set a timeout in seconds after which the Instance will be deleted once up & running, or an until date after which the instance will be deleted whatever the timeout. When an Instance is deployed, its start date is saved, and every update is stored for traceability. A participant (or a dependent service) can then renew an instance on demand for additional time, as long as it is under the until date of the challenge. This is based on a hypothesis that a challenge should be solved after \(n\) minutes.

Deleting instances when outdated then becomes a new goal of the system, thus we cannot extend the chall-manager as it would be a rupture of the Separation of Concerns Principle: it is the goal of another service, chall-manager-janitor. This is also justified by the frequency model applied to the janitor, which is unrelated to the chall-manager service itself. With such approach, other players could use the resources. Nevertheless, it requires a mecanism to wipe out infrastructure resources after a given time. Some tools exist to do so.

Tool Environment
hjacobs/kube-janitor Kubernetes
kubernetes-sig/boskos Kubernetes
rancher/aws-janitor AWS

Despite tools exist, they are context-specifics thus are limited: each one has its own mecanism and only 1 environment is considered. As of genericity, we want a generic approach able to handle all ecosystems without the need for specific implementations. For instance, if a ChallMaker decides to cover a unique, private and offline ecosystem, how could (s)he do ?

That is why the janitor must have the same level of genericity as chall-manager itself. Despite it is not optimal for specifics providers, we except this genericity to be a better tradeoff than covering a limited set of technologies. This modular approach enable covering new providers (vendor-specifics, public or private) without involving CTFer.io in the loop.

How it works

By using the chall-manager API, the janitor looks up at expiration dates. Once an instance is expired, it simply deletes it. Using a cron, the janitor could then monitor the instances frequently.

flowchart LR
    subgraph Chall-Manager
        CM[Chall-Manager]
        Etcd

        CM --> Etcd
    end

    CMJ[Chall-Manager-Janitor]
    CMJ -->|gRPC| CM

If two janitors triggers in parallel, the API will maintain consistency. Errors code are to expect, but no data inconsistency.

As it does not plugs into a specific provider mecanism nor requirement, it guarantees platform agnosticity. Whatever the scenario, the chall-manager-janitor will be able to handle it.

Follows the algorithm used to determine the instance until date based on a challenge configuration for both until and timeout. Renewing an instance re-execute this to ensure consistency with the challenge configuration. Based on the instance until date, the janitor will determine whether to delete it or not (\(instance.until > now() \Rightarrow delete(instance)\)).

flowchart LR
    Start[Compute until]

    Start-->until{"until == nil ?"}
    until---|true|timeout1{"timeout == nil ?"}
    timeout1---|true|out1["nil"]
    timeout1---|false|out2["now()+timeout"]

    until---|false|timeout2{"timeout == nil ?"}
    timeout2---|true|out3{"until"}
    timeout2---|false|out4{"min(now()+timeout,until)"}

What’s next ?

Listening to the community, we decided to improve further with a Software Development Kit.