Application Storage

RAIL provides persistent storage that you can mount in your containers. The clusters provide a set of StorageClasses, which are all described here. There is also S3 compatible object storage available, provided directly by the NREC IaaS platform in both the BGO and OSL regions. It is important to understand the distinction between block storage, file storage and object storage in order to choose the best strategy for your workload.

Making the correct storage choice for your workload is crucial. You need to consider the amount of data needed, whether the application pods need to share data, and the amount of IOPS generated in a production environment. If you are unsure, please reach out to the RAIL core team for consultation.

Persistent storage provided by RAIL

Each RAIL cluster provides storage from an internal Ceph cluster, and these Ceph clusters provide both block storage and file system storage. The available StorageClasses are:

rook-ceph-block (the current default)
rook-ceph-block-ec
rook-cephfs

The persistent volumes are represented by a PersistentVolumeClaim (short name: pvc).

Note

The current default persistent storage class will change to rook-ceph-block-ec when the Ceph clusters in RAIL reach the Tentacle release, as erasure coded pools will then provide IOPS equivalent to a replicated pool.

The rook-ceph-block class

The rook-ceph-block class provides a RADOS Block Device (RBD) to a single pod. Think of it as a single hard drive formatted with the ext4 file system, which is why the volume is mountable only by a single pod. In the underlying Ceph cluster, the data is replicated across nodes with three replicas. Thus, the data stored in a rook-ceph-block persistent volume imposes a 3x write amplification on the storage available to the Ceph cluster. Still, this storage class provides the most IOPS of the available alternatives at this point in time.
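
For reference, a minimal PersistentVolumeClaim requesting this class explicitly could look like the sketch below (the claim name and size are placeholders to adapt to your workload):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vol-block
spec:
  storageClassName: rook-ceph-block
  accessModes:
   - ReadWriteOncePod
  resources:
    requests:
      storage: 1Gi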

The rook-ceph-block-ec class

The rook-ceph-block-ec class also provides an RBD device mountable by a single pod. However, instead of creating a replicated hard drive image in the underlying Ceph cluster, it utilizes an erasure coded pool, which brings the storage overhead (write amplification) down to 1.5x while retaining redundancy. At this point in time this comes with a modest IOPS penalty, but for most workloads it is acceptable. As noted above, this will become the default storage class in the near future, and existing volumes in this class will have their IOPS penalty removed.

The rook-cephfs class

This storage class is fundamentally different from the other classes, as the volume provided is backed by CephFS rather than an RBD image. It provides a POSIX compliant file system which is mountable across multiple pods in a cluster, i.e. shared storage. This is conceptually similar to, for example, NFS. Each pvc in this class is allotted a separate file system tree isolated from other pvcs. Use this class if your application depends on sharing data through a file system across several pods (see the ReadWriteMany example below).

Using persistent storage

You can create a PersistentVolumeClaim with a manifest like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vol1c
spec:
  storageClassName: rook-ceph-block-ec
  accessModes:
   - ReadWriteOncePod
  resources:
    requests:
      storage: 150Mi

and then you mount it into your container with:

apiVersion: v1
kind: Pod
metadata:
  name: ...
spec:
  securityContext:
    fsGroup: 100
  volumes:
  - name: vol1
    persistentVolumeClaim:
      claimName: vol1c
  containers:
  - name: ...
    image: ...
    volumeMounts:
    - name: vol1
      mountPath: /mnt/vol1

The manifests above create a volume that can only be mounted in a single Pod, as requested by the specified ReadWriteOncePod access mode.

Alternatively, you can specify the access mode ReadWriteOnce, which allows multiple Pods running on the same Node to access the volume. This is a bit harder to use, since you then have to influence the pod affinity to ensure that the pods are scheduled together.
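
For illustration, a claim using ReadWriteOnce could look like the following sketch, which differs from the first example above only in the access mode (the claim name is a placeholder):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vol-rwo
spec:
  storageClassName: rook-ceph-block-ec
  accessModes:
   - ReadWriteOnce
  resources:
    requests:
      storage: 150Mi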

This is an example that tries to make the given Pod run on the same node where the http-daemon pod runs:

kind: Pod
spec:
  affinity:
    podAffinity:
      # In order to be able to mount the RWO volume we need to make sure
      # the Pod runs on the same node where the http-daemon pod runs.
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values:
            - http-daemon
        topologyKey: kubernetes.io/hostname

To use the access mode ReadWriteMany, which is required to mount the volume across several pods at once, you must use the rook-cephfs class:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfsvol
spec:
  storageClassName: rook-cephfs
  accessModes:
   - ReadWriteMany
  resources:
    requests:
      storage: 150Mi

which you can then mount in several pods, for example in a Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ...
  namespace: ...
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ...
  template:
    metadata:
      labels:
        app: ...
    spec:
      volumes:
      - name: cephfs
        persistentVolumeClaim:
          claimName: cephfsvol
      containers:
      - name: ...
        image: ...
        volumeMounts:
        - mountPath: /opt/shared
          name: cephfs

File ownership

When the root directory of the volume is created, it is owned by root and this ownership cannot be changed. However, it will also be associated with the group specified by securityContext.fsGroup and will have g+rxs permissions. Additionally, your container processes will belong to the specified group.

If you don’t specify securityContext.fsGroup, the default is 1, which should work fine for most use cases.
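
As an illustrative sketch, if your container image runs with group 100, setting fsGroup to that group gives the processes access to the volume through the group permissions described above (the group id is only an example):

kind: Pod
spec:
  securityContext:
    # The volume root stays owned by root, is associated with group 100
    # with g+rxs permissions, and the container processes get 100 as a
    # supplementary group.
    fsGroup: 100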

Resizing volumes

You specify the requested size of the volume with resources.requests.storage.

If you run out of space, RAIL allows you to increase this size on a live volume.

You cannot decrease the size of a live volume. If you need a smaller volume, you have to create a new one and copy the files over.
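
For example, to grow the vol1c claim created earlier, you could raise the value in its manifest and apply it again (growing to 1Gi is only an illustration):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vol1c
spec:
  storageClassName: rook-ceph-block-ec
  accessModes:
   - ReadWriteOncePod
  resources:
    requests:
      # Increased from the original 150Mi; the file system in the volume
      # is expanded while the volume remains in use.
      storage: 1Gi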

Backup

To be answered…

  • How can you recover if you lose your pvc?

  • How can you restore a pvc to the state it had some time ago?

Object storage provided by NREC

Many applications have the ability to consume S3 compatible object storage. This provides stateless, potentially multiregional, cloud native storage. If your workload can natively use object storage, that is most often the recommended approach. It is worth noting that access to the object buckets provided by NREC is bound to the NREC project: if your project creates several buckets, all EC2 credentials in that project will have access to all of the buckets in said project. If you want to isolate the object storage from other workloads, ask the NREC team for a new project with access to object storage, and then use that project to create a new bucket.

Different applications configure access to object storage in different ways, so no example is given here. However, ObjectStore creation in the postgres database documentation provides an application specific example.

Warning

Mounting an overlay file system (e.g. S3FS) on top of object storage imposes quite a severe performance overhead, and is generally slow and inefficient. This may not be a big problem if the workload has low IO demands, but it will quickly cause problems if the application has more than modest IO needs. Think twice before utilizing overlay file systems on S3 compatible storage.