Caches "support" in tekton pipelines

This article is an exploratory one. It might contain examples and commands that do not work for the reader.

The idea of this article is to explore the different ways to "handle" caches when using tektoncd/pipeline or openshift-pipelines. It should evolve as time passes, as it explores both what is possible today and what could be built in. If someday tektoncd/pipeline or any other tektoncd component provides this feature, this article will be either adapted or marked as deprecated.

Introduction
What caching looks like in Tekton
Authoring cache friendly tasks (example with go)
Remote resolution explorations
- tekton-wrap-pipeline
- tekton-cache-pipeline
CustomTask explorations
Thoughts on a built-in API

Introduction

Cache and caching are very abstract, wide concepts. In the context of a CI pipeline, a cache can take several forms depending on what the pipeline and tasks are doing. If we are building an image, the cache is most likely related to the "base image" and the layers reused across multiple builds. If we are building a Go project, it could be about fetching the dependencies (if there is no vendor folder) or about the build cache (the Go compiler cache, to speed up the next compilation).

In the rest of this article, we will focus on filesystem-based caching. Things like a cluster-wide registry for OCI images, a Maven proxy or a Go modules proxy to get dependencies quicker are out of the scope of this article, as they don’t have anything to do with Tekton itself.

As of today (end of 2022), there is nothing specific related to caches in the tektoncd/pipeline API. This means it is up to the Task authors, users or tektoncd component integrators to manage caches as they see fit. Let’s explore different ways this can be handled.

What caching looks like in Tekton

Even though there are a lot of different ways to define caching, how it "looks" in Tekton is relatively straightforward. Filesystem-based caching in Tekton maps really well to workspaces. Most if not all of the rest of this article will use workspaces as the feature to handle caching.

There are two parts to this for Tekton:

- How Tasks are written to easily use/interact with caches. For example, a go build Task should be able to use a GOCACHE if provided, but doesn’t need to know anything about how it is provided. This is the easiest part and the least opinionated one as well.
- How to provide the cache content to the TaskRun. This is the tricky part, and there are a lot of different possible ways to do that.

Authoring cache friendly tasks (example with `go`)

This is the easy part of supporting caches in Tekton, and also the most important one for Task authors. This part is about how to write tasks that can easily be hooked up, somehow, to some caching mechanism. We’ll go through an example with Go to draw a decent picture.

As the author of a Task, especially if we aim to make this Task as shareable as possible, this is the only thing we should care about. Someone else (the user of this Task, or a tool) will have to provide the cache.

The `Task` definition

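Go uses GOCACHE to locate and store the build cache (compilation) and GOMODCACHE for the dependencies listed in go.mod. To write a Task that can use both, we declare two optional workspaces and export the matching variable only when a workspace is bound.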

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: golang-build
spec:
  description: >-
    This Task is a Golang task to build Go projects.
  params:
  - name: package
    description: base package to build in
  - name: packages
    description: "packages to build (default: ./cmd/...)"
    default: "./cmd/..."
  # Additional parameters
  # […]
  workspaces:
  - name: source
  - name: cache-go (1)
    optional: true
  - name: cache-gomod (2)
    optional: true
  steps:
  - name: build
    image: docker.io/library/golang:1.18
    workingDir: $(workspaces.source.path)
    script: |
      # POSIX test: without a shebang the script runs with /bin/sh, which may not support [[ ]]
      if [ "$(workspaces.cache-go.bound)" = "true" ]; then (3)
        export GOCACHE=$(workspaces.cache-go.path)
      fi
      if [ "$(workspaces.cache-gomod.bound)" = "true" ]; then (4)
        export GOMODCACHE=$(workspaces.cache-gomod.path)
      fi
      go build $(params.flags) $(params.packages)
    env:
    - name: GOOS
      value: #[…]

1	This defines an optional workspace for setting up the `GOCACHE`.
2	This defines an optional workspace for setting up the `GOMODCACHE`.
3	If something is bound to the `cache-go`, export `GOCACHE` to point to the workspace.
4	If something is bound to the `cache-gomod`, export `GOMODCACHE` to point to the workspace.
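
To try this binding logic outside a cluster, it can be reduced to a small helper. This is only a sketch: `export_if_bound` is a hypothetical name, and the literal values passed below stand in for `$(workspaces.<name>.bound)` and `$(workspaces.<name>.path)`, which Tekton substitutes before the script runs.

```shell
#!/bin/sh
# export_if_bound VAR BOUND PATH
# Exports VAR=PATH only when BOUND is the string "true".
# (hypothetical helper, not part of Tekton)
export_if_bound() {
  if [ "$2" = "true" ]; then
    export "$1=$3"
  fi
}

# Simulate a run where only the cache-go workspace is bound.
export_if_bound GOCACHE "true" "/workspace/cache-go"
export_if_bound GOMODCACHE "false" ""

# Unset variables mean the go toolchain keeps its own default locations.
echo "GOCACHE=${GOCACHE:-<go default>} GOMODCACHE=${GOMODCACHE:-<go default>}"
```

When a workspace is not bound, nothing is exported and the build behaves exactly as it would without any cache support.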

Providing the cache

Now, the idea is to provide the cache at runtime, using a PVC for example. Let’s give an example with a Pipeline and a PipelineRun.

First, we need one PVC for the go build cache and one for the gomod cache. Let’s create a 4Gi PVC for the former and a 2Gi PVC for the latter.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: go-cache
spec:
  resources:
    requests:
      storage: 4Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gomod-cache
spec:
  resources:
    requests:
      storage: 2Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany

The Pipeline will fetch some sources using the git-clone Task, then build them with our go Task.

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: build-go-with-optional-cache
spec:
  workspaces:
  - name: shared-workspace
  - name: go-cache (1)
    optional: true
  - name: gomod-cache (2)
    optional: true
  params:
  - name: git-url
    default: https://github.com/tektoncd/pipeline
  tasks:
  - name: fetch-repository
    taskRef:
      name: git-clone
    workspaces:
    - name: output
      workspace: shared-workspace
    params:
    - name: url
      value: $(params.git-url)
    - name: subdirectory
      value: ""
    - name: deleteExisting
      value: "true"
  - name: build
    taskRef:
      name: golang-build
    runAfter:
    - fetch-repository
    workspaces:
    - name: source
      workspace: shared-workspace
    - name: cache-go (3)
      workspace: go-cache
    - name: cache-gomod (4)
      workspace: gomod-cache

1	As in the `Task`, we define a workspace for `GOCACHE`.
2	As in the `Task`, we define a workspace for `GOMODCACHE`.
3	We bind the Pipeline’s `go-cache` workspace to the `cache-go` workspace defined in the `Task`.
4	We bind the Pipeline’s `gomod-cache` workspace to the `cache-gomod` workspace defined in the `Task`.

Now the PipelineRun.

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: build-go-with-cache-run
spec:
  pipelineRef:
    name: build-go-with-optional-cache
  params:
  - name: git-url
    value: https://github.com/tektoncd/pipeline
  workspaces:
  - name: shared-workspace
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteMany
        resources:
          requests:
            storage: 100Mi
  - name: go-cache
    persistentVolumeClaim: (1)
      claimName: go-cache
  - name: gomod-cache
    persistentVolumeClaim: (2)
      claimName: gomod-cache

1	We bind our `go-cache` PVC to the `go-cache` workspace.
2	We bind our `gomod-cache` PVC to the `gomod-cache` workspace.

Shortcomings

There are a few possible shortcomings with this approach:

- Depending on the class of the persistent storage, it might be tricky to get those PVCs provisioned. In addition, the cluster might have quotas on the number of PVCs, and these "always there" PVCs would count against that quota.
- Similarly, depending on the access mode (ReadWriteMany, ReadWriteOnce, …), it may force the whole pipeline to run on the same node, or make the cache read-only (which would be way less useful).

Full cache support in a Task

Following what we just did, we could enhance our Task definition so that it handles caching itself. At authoring time, we can add steps that pull and push some folder/workspace.

There would be 2 steps to add: cache-fetch and cache-upload. For each implementation, the content/tool/workflow of those steps would differ.

Using `oci` image

cache-fetch, which would consist of:
- computing a unique hash from something that identifies the cache (for go, we would use the go.sum file)
- trying to fetch an image tagged with that hash
  - if it doesn’t exist, we just warn and carry on
  - if it succeeds, we extract it to the go-cache workspace/folder

cache-upload, which would consist of:
- computing the same hash (for go, again from the go.sum file)
- pushing the image tagged with that hash
- possible optimization: compare the cache content (or image tarball) with the one we fetched; if it is identical, we don’t even need to push
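
The two steps above can be sketched in plain shell. This is a minimal sketch under stated assumptions, not an existing implementation: `cache_key`, `dir_digest`, `cache_fetch` and `cache_upload` are hypothetical names, `CACHE_PULL` and `CACHE_PUSH` are hypothetical hooks naming a command that takes an image reference and a directory (any registry client could sit behind them), and `sha256sum` from GNU coreutils is assumed to be available in the step image.

```shell
#!/bin/sh
# cache_key SRC_DIR
# Prints a hash derived from the content of every go.sum file under SRC_DIR.
# Only file contents are hashed, so the key does not depend on the checkout path.
cache_key() {
  find "$1" -type f -name go.sum | LC_ALL=C sort | while IFS= read -r f; do
    sha256sum < "$f"
  done | sha256sum | cut -d' ' -f1
}

# dir_digest DIR
# Prints a digest of the relative paths and contents of all files in DIR,
# used to detect whether the cache changed since it was fetched.
dir_digest() {
  (cd "$1" && find . -type f | LC_ALL=C sort | while IFS= read -r f; do
    printf '%s\n' "$f"
    sha256sum < "$f"
  done) | sha256sum | cut -d' ' -f1
}

# cache_fetch REPO SRC_DIR CACHE_DIR
# Tries to restore the cache; a missing image only produces a warning.
cache_fetch() {
  ref="$1:$(cache_key "$2")"
  if ! ${CACHE_PULL:-false} "$ref" "$3"; then
    echo "warning: no cache found at $ref, continuing without it" >&2
  fi
}

# cache_upload REPO SRC_DIR CACHE_DIR [FETCHED_DIGEST]
# Pushes the cache, unless its digest equals the one recorded after the fetch.
cache_upload() {
  ref="$1:$(cache_key "$2")"
  if [ -n "${4:-}" ] && [ "$(dir_digest "$3")" = "$4" ]; then
    echo "cache unchanged, skipping push of $ref" >&2
    return 0
  fi
  ${CACHE_PUSH:-false} "$ref" "$3"
}
```

Since steps run in separate containers, the digest computed right after the fetch would have to be passed to the upload step through a file outside the cache folder; that plumbing is left out of this sketch.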

Using `rsync`

This needs to be completed

Remote resolution explorations

The general idea behind this exploration is very similar to tekton-wrap-pipeline.

It is a Tekton resolver that allows running a `Pipeline` with `emptyDir` workspaces, using different means to transfer data from one `Task` to the other.

It is an experiment around not using PVCs for sharing workspace data in a Pipeline.

The idea, adapted to caching, would be to enhance the Task with steps that pull and push the cache(s) in the correct workspaces, bound to `emptyDir`.

This needs to be completed

`tekton-wrap-pipeline`

We can use tekton-wrap-pipeline directly. If we take the previous example, we can rewrite the PipelineRun above as follows.

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: simple-pipelinerun
spec:
  serviceAccountName: mysa
  pipelineRef:
    resolver: wrap (1)
    params:
    - name: pipelineref
      value: build-go-with-optional-cache
    - name: workspaces
      value: go-cache,gomod-cache (2)
    - name: target
      value: quay.io/vdemeest/pipelinerun-$(context.pipelineRun.namespace)-{{workspace}}:latest (3)
    - name: base
      value: quay.io/vdemeest/pipelinerun-$(context.pipelineRun.namespace)-{{workspace}}:latest (4)
  params:
  - name: git-url
    value: https://github.com/tektoncd/pipeline
  workspaces:
  - name: shared-workspace
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteMany
        resources:
          requests:
            storage: 100Mi
  - name: go-cache (5)
    emptyDir: {}
  - name: gomod-cache (6)
    emptyDir: {}

1	We are using the remote resolution with the resolver named `wrap` (provided by `tekton-wrap-pipeline`)
2	These are the 2 workspaces we want the "wrapper" to handle. In short: for each of those 2 workspaces, the content is restored from and saved to an OCI image (a different one per workspace).
3	This is the naming template for the target image to use, one "image" per namespace and go-cache/gomod-cache.
4	This is the naming template for the base image to use, using the same to ensure we "keep" the content from one run to the other.
5	We "bind" the go-cache workspace to an emptyDir just to "pass validation".
6	Same with gomod-cache: we "bind" the gomod-cache workspace to an emptyDir just to "pass validation".

This approach has a few shortcomings as of today:

- Using a base image means we need to "create" the repository prior to the first run (otherwise, it will fail to get the content of the cache because the image doesn’t exist).
- As proposed, it shares the go-cache and gomod-cache between all runs using this in the same namespace. Fiddling with target and base allows you to decide what to use, but it still doesn’t take the go.sum into account.
- As of today, it only works with OCI images.
- As of today, it needs extra auth to be able to push/pull the cache to/from an OCI image registry.
- tekton-wrap-pipeline appends layers, which means that at some point the image will be too big and have too many layers. In our case, we don’t necessarily care about the layers, only about the final content.

`tekton-cache-pipeline`

We can build on top of tekton-wrap-pipeline to provide an "easier" way to set up a cache.

As said above, tekton-wrap-pipeline appends layers, which means that at some point the image will be too big and have too many layers. In our case, we don’t necessarily care about the layers, only about the final content. What we want here is a way to get some content from a given hash. The idea is to be able to write the following PipelineRun.

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: simple-pipelinerun
spec:
  serviceAccountName: mysa
  pipelineRef:
    resolver: cache (1)
    params:
    - name: pipelineref
      value: build-go-with-optional-cache
    - name: workspaces
      value: go-cache,gomod-cache (2)
    - name: files (3)
      value: "**/go.sum"
    - name: target (4)
      value: quay.io/vdemeest/cache/{{workspace}}:{{hash}}
  params:
  - name: git-url
    value: https://github.com/tektoncd/pipeline
  workspaces:
  - name: shared-workspace
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteMany
        resources:
          requests:
            storage: 100Mi
  - name: go-cache
    emptyDir: {}
  - name: gomod-cache
    emptyDir: {}

1	We are using our `tekton-cache-pipeline` resolver :)
2	These are the 2 workspaces we want the "wrapper" to handle. In short: for each of those 2 workspaces, the content is restored from and saved to an OCI image (a different one per workspace).
3	Which files to compute the hash from. The idea here is that we compute the hash and try to fetch the content (using an OCI image for now) tagged with that hash; if it doesn’t exist, we don’t fetch anything. In any case, we’ll push an image tagged with that hash at the end.
4	Very similar to `tekton-wrap-pipeline`: this is the naming pattern for the image we push the cache to and pull it from.
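
Since this resolver does not exist yet, the following only illustrates how the `target` naming pattern could expand, one image reference per workspace. It is a sketch under assumptions: `render_target` and `render_targets` are hypothetical helpers, the `{{workspace}}` and `{{hash}}` placeholders are taken from the PipelineRun above, and the workspace names and hash are assumed not to contain `|` or `&` (which `sed` would treat specially).

```shell
#!/bin/sh
# render_target TEMPLATE WORKSPACE HASH
# Replaces the {{workspace}} and {{hash}} placeholders in TEMPLATE.
render_target() {
  printf '%s\n' "$1" | sed -e "s|{{workspace}}|$2|g" -e "s|{{hash}}|$3|g"
}

# render_targets TEMPLATE WORKSPACES_CSV HASH
# Prints one rendered reference per comma-separated workspace name.
render_targets() {
  printf '%s\n' "$2" | tr ',' '\n' | while IFS= read -r ws; do
    render_target "$1" "$ws" "$3"
  done
}

# "abc123" stands in for the hash computed from the files matching **/go.sum.
render_targets 'quay.io/vdemeest/cache/{{workspace}}:{{hash}}' 'go-cache,gomod-cache' 'abc123'
# prints:
#   quay.io/vdemeest/cache/go-cache:abc123
#   quay.io/vdemeest/cache/gomod-cache:abc123
```

Keying the tag on the hash rather than on `latest` is what lets each distinct go.sum get its own image instead of an ever-growing stack of layers.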

This needs to be implemented

`CustomTask` explorations

The idea is very similar to remote resolution, but using a CustomTask instead.

This needs to be completed

Thoughts on a built-in API

This needs to be completed
