Application Storage¶
RAIL provides persistent storage that you can mount in your containers. The clusters
provide StorageClasses, which are all described here. There is also S3 compatible
object storage available, provided directly by the NREC IaaS platform in both the
BGO and OSL regions. It’s important to understand the distinction between block storage,
file storage and object storage in order to choose the best strategy for your workload.
Making the correct storage choice for your workload is crucial. You need to consider the amount of data needed, whether the application pods need to share data, and the amount of IOPS generated in a production environment. If you are unsure, please reach out to the RAIL core team for consultation.
Persistent storage provided by RAIL¶
Each RAIL cluster provides storage from an internal Ceph cluster, which offers both
block storage and file system storage. The StorageClasses available are:
rook-ceph-block (the current default)
rook-ceph-block-ec
rook-cephfs
The persistent volumes are represented by a PersistentVolumeClaim (shortname pvc).
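You can inspect the available classes and your existing claims with kubectl, for example:
kubectl get storageclass
kubectl get pvc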
Note
The current default persistent storage will change to rook-ceph-block-ec when
the Ceph clusters in RAIL reach the Tentacle release, as erasure coded pools will at that
point provide an equivalent amount of IOPS as a replicated pool.
The rook-ceph-block class¶
The rook-ceph-block class provides a RADOS Block Device (RBD) to a single pod. Think of
it as a single hard drive formatted with the ext4 file system, which explains why
the volume is mountable only by a single pod. In the underlying Ceph cluster, the data
is replicated across nodes with three replicas. Thus, the data stored in a
rook-ceph-block persistent volume imposes a 3x write amplification on the storage
available to the Ceph cluster. Still, this storage class currently provides the most IOPS
of the available alternatives.
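A minimal claim in this class could look like the sketch below; the claim name blockvol and the size are just examples, and since rook-ceph-block is currently the default class, the storageClassName line may also be left out:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: blockvol
spec:
  storageClassName: rook-ceph-block
  accessModes:
  - ReadWriteOncePod
  resources:
    requests:
      storage: 1Gi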
The rook-ceph-block-ec class¶
The rook-ceph-block-ec class also provides an RBD device mountable by a single pod.
However, instead of creating a replicated disk image in the underlying Ceph cluster,
it utilizes an erasure coded pool, which brings the storage overhead
(write amplification) down to 1.5x while retaining redundancy. For example, a 100 GiB
volume consumes roughly 150 GiB of raw Ceph capacity in this class, compared to 300 GiB
in the replicated rook-ceph-block class. At this point in time this comes with a modest
IOPS penalty, but for most workloads that is acceptable. As noted, this will become
the default storage class in the near future, and existing volumes in this class will
have their IOPS penalty removed at that point.
The rook-cephfs class¶
This storage class is fundamentally different from the other classes, as the volume
provided is backed by CephFS rather than an RBD image. It provides a POSIX compliant
file system which is mountable across multiple pods in a cluster, i.e. shared storage. This
is conceptually similar to, for example, NFS. Each pvc in this class is allotted a
separate file system tree isolated from other pvcs. Use this class if your application
depends on sharing data through a file system in order to work across several pods.
Using persistent storage¶
You can create a PersistentVolumeClaim with a manifest like this:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vol1c
spec:
  storageClassName: rook-ceph-block-ec
  accessModes:
  - ReadWriteOncePod
  resources:
    requests:
      storage: 150Mi
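If you save the manifest to a file, for example pvc.yaml (any file name will do), you can create the claim and check that it is bound with:
kubectl apply -f pvc.yaml
kubectl get pvc vol1c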
You then mount it into your container with a Pod manifest like this:
apiVersion: v1
kind: Pod
metadata:
  name: ...
spec:
  securityContext:
    fsGroup: 100
  volumes:
  - name: vol1
    persistentVolumeClaim:
      claimName: vol1c
  containers:
  - name: ...
    image: ...
    volumeMounts:
    - name: vol1
      mountPath: /mnt/vol1
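Once the Pod is running you can verify that the volume is mounted; my-pod is a placeholder for whatever name you gave the Pod above:
kubectl exec my-pod -- df -h /mnt/vol1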
The manifests above will create a volume that can only be mounted in a single
Pod. This is what the specified ReadWriteOncePod access mode requests.
Alternatively you can specify the access mode ReadWriteOnce, which allows multiple
Pods running on the same Node to access the volume. This is a bit harder to use, since you
then have to influence the pod affinity to ensure
that the pods are scheduled together.
The following example tries to make the given Pod run on the same node as the http-deamon pod:
kind: Pod
spec:
  affinity:
    podAffinity:
      # In order to be able to mount the RWO volume we need to make sure
      # the Pod runs on the same node where the http-deamon pod runs.
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values:
            - http-deamon
        topologyKey: kubernetes.io/hostname
To use the access mode ReadWriteMany, which is required to mount the volume across
several pods, you must use the rook-cephfs class:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfsvol
spec:
  storageClassName: rook-cephfs
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 150Mi
which you can then mount in several pods, for example in a Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ...
  namespace: ...
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ...
  template:
    metadata:
      labels:
        app: ...
    spec:
      volumes:
      - name: cephfs
        persistentVolumeClaim:
          claimName: cephfsvol
      containers:
      - name: ...
        volumeMounts:
        - mountPath: /opt/shared
          name: cephfs
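To confirm that the replicas actually share the volume, you can write a file from one pod and read it from the other. The pod names below are placeholders; look up the real names with kubectl get pods first:
kubectl exec <first-replica> -- sh -c 'echo hello > /opt/shared/hello.txt'
kubectl exec <second-replica> -- cat /opt/shared/hello.txt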
File ownership¶
When the root directory of the volume is created, it is owned by root and this
ownership cannot be changed. However, it will also be associated with the group
specified by securityContext.fsGroup and will have g+rxs permissions.
Additionally, your container processes will belong to the specified group.
If you don’t specify securityContext.fsGroup, the default is 1 which should
work fine for most use cases.
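As an illustration, the sketch below uses the same group ID (100) as the Pod example earlier; the comments describe the behaviour explained above:
spec:
  securityContext:
    fsGroup: 100        # the volume root is owned by root:100 with g+rxs
  containers:
  - name: ...
    image: ...
    # processes in this container also belong to group 100, so they can
    # create files in the volume even though root owns the root directory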
Resizing volumes¶
You specify the requested size of the volume with resources.requests.storage.
If you run out of space, RAIL allows you to increase this size on a live volume.
You are not able to decrease the size of a live volume; in that case you need to create a new, smaller volume and copy the files over.
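To grow a volume you can either raise resources.requests.storage in the manifest and re-apply it, or patch the claim directly. For example, to grow the vol1c claim from earlier to 1Gi:
kubectl patch pvc vol1c -p '{"spec":{"resources":{"requests":{"storage":"1Gi"}}}}'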
Backup¶
To be answered…
How can you recover if you lose your pvc?
How can you restore a pvc to the state it had some time ago?
Object storage provided by NREC¶
Many applications have the ability to consume S3 compatible object storage. This provides stateless, potentially multiregional, cloud native storage. If your workload can use object storage natively, that is most often the recommended approach. It’s worth noting that access to the object buckets provided by NREC is bound to the NREC project. If your project creates several buckets, all the EC2 credentials in that project will have access to all the buckets in that project. If you want to isolate the object storage from other workloads, you should ask the NREC team for a new project with access to object storage, and then use that project to create a new bucket.
Different applications configure access to object storage in different ways, so no example is given here. However, ObjectStore creation in the postgres database documentation provides an application specific example.
Warning
Mounting an overlay file system (e.g. S3FS) on top of object storage imposes a quite severe performance overhead, and is generally slow and inefficient. This may not be a big problem if the workload has low IO demands, but it will quickly cause problems if the application has more than modest IO needs. Think twice before utilizing overlay file systems on S3 compatible storage.