Ceph is an open-source, software-defined storage platform providing unified object, block, and file storage with automatic data distribution and no single point of failure.

Addresses below are RFC 5737 documentation ranges or placeholders - swap in your own.

Table of Contents#

  1. Overview
  2. Architecture
  3. CRUSH Algorithm and Failure Domains
  4. Core Components
  5. Storage Interfaces
  6. Release Versions
  7. Installation
  8. Cluster Creation
  9. Node and OSD Management
  10. Capacity Planning
  11. Monitoring and Alerting
  12. Troubleshooting
  13. See Also
  14. Sources

1. Overview#

Ceph provides massively scalable storage that operates on commodity hardware. A single Ceph cluster can serve object storage (via RADOS Gateway), block storage (via RBD), and a POSIX-compatible filesystem (via CephFS), all backed by the same underlying RADOS (Reliable Autonomic Distributed Object Store) layer.

Key characteristics:

  • No single point of failure - all components can be deployed redundantly
  • Self-healing - automatically detects and recovers from disk, node, and rack failures
  • Self-managing - rebalances data when capacity is added or removed
  • Linearly scalable - performance and capacity grow with the cluster
  • Unified storage - object, block, and file from one platform

2. Architecture#

Ceph's architecture revolves around RADOS, the object store that underpins everything:

+-------------------+-------------------+-------------------+
|   RADOS Gateway   |       RBD         |      CephFS       |
|  (Object / S3)    |  (Block Device)   |   (Filesystem)    |
+-------------------+-------------------+-------------------+
|                       librados                            |
+-----------------------------------------------------------+
|                        RADOS                              |
|  +--------+  +--------+  +--------+  +--------+          |
|  |  OSD   |  |  OSD   |  |  OSD   |  |  OSD   |  ...     |
|  +--------+  +--------+  +--------+  +--------+          |
|  +--------+  +--------+  +--------+                       |
|  |  MON   |  |  MGR   |  |  MDS   |  (MDS for CephFS)    |
|  +--------+  +--------+  +--------+                       |
+-----------------------------------------------------------+

Storage Pools#

Pools are logical partitions within RADOS. Each pool has configurable:

  • Replication factor (size) or erasure coding profile
  • Placement group (PG) count - determines data distribution granularity
  • CRUSH rules - control which failure domains store the data

# Create a replicated pool with 128 PGs
ceph osd pool create mypool 128 128 replicated

# Create an erasure-coded pool (k=4 data, m=2 parity)
ceph osd erasure-code-profile set ec-42 k=4 m=2
ceph osd pool create ec-pool 128 128 erasure ec-42

3. CRUSH Algorithm and Failure Domains#

The CRUSH (Controlled Replication Under Scalable Hashing) algorithm determines where data is placed without requiring a central lookup table. Clients compute placement locally using the CRUSH map.
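
The idea of computing placement locally can be illustrated with a deliberately simplified sketch. It is not CRUSH: Ceph uses its own hash function and then walks the CRUSH map, whereas this example only maps an object name to a placement group number with cksum. The helper name pg_for_object and the object name are hypothetical.

```shell
# Simplified illustration only, not the real algorithm: cksum (a CRC)
# stands in for Ceph's hash, and the CRUSH map walk is left out entirely.
# pg_for_object is a hypothetical helper, not a Ceph command.
pg_for_object() {
  name=$1
  pg_num=$2
  # cksum prints "<crc> <byte-count>"; keep only the CRC
  hash=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
  echo $(( hash % pg_num ))
}

# Every client that knows pg_num derives the same PG without a lookup service
pg_for_object "vm-disk-0001" 128
```

Because the result depends only on the object name and the pool's PG count, two clients never need to agree through a central table; in a real cluster the CRUSH map then determines which OSDs hold that PG.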

CRUSH Map Hierarchy#

The CRUSH map defines a hierarchy of physical locations:

root default
  rack rack1
    host node1
      osd.0
      osd.1
    host node2
      osd.2
      osd.3
  rack rack2
    host node3
      osd.4
      osd.5
    host node4
      osd.6
      osd.7

Failure Domains#

Failure domains ensure that replicas are spread across distinct physical boundaries. The most common levels, from broadest to narrowest:

Level      | Description                       | Typical Rule
-----------+-----------------------------------+----------------------------------
datacenter | Separate data centers or regions  | Multi-site / stretch clusters
room       | Server rooms within a building    | Large single-site deployments
rack       | Individual server racks           | Production minimum recommendation
host       | Individual servers                | Small clusters (3-5 nodes)
osd        | Individual disks                  | Testing only, no real redundancy

# View the current CRUSH map
ceph osd crush tree

# Create a replicated CRUSH rule that spreads replicas across racks
# Arguments: <rule-name> <root> <failure-domain-type> [<device-class>]
ceph osd crush rule create-replicated rack-rule default rack

# Assign a pool to a specific CRUSH rule
ceph osd pool set mypool crush_rule rack-rule
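
Before binding a pool to a rack-level rule, it helps to check that the hierarchy has enough racks for the pool's replica count. The sketch below hard-codes the example hierarchy from this section; rack_of and distinct_racks are hypothetical helpers for illustration, not Ceph commands.

```shell
# Hypothetical lookup mirroring the example hierarchy: OSD name -> rack
rack_of() {
  case "$1" in
    osd.0|osd.1|osd.2|osd.3) echo rack1 ;;
    osd.4|osd.5|osd.6|osd.7) echo rack2 ;;
  esac
}

# Count how many distinct racks a set of OSDs spans
distinct_racks() {
  for osd in "$@"; do
    rack_of "$osd"
  done | sort -u | wc -l | tr -d ' '
}

echo "osd.0 osd.2 osd.5 span $(distinct_racks osd.0 osd.2 osd.5) rack(s)"
```

The example hierarchy contains only two racks, so a pool with size=3 bound to a rack-level rule could not place its third replica, and its PGs would stay undersized until a third rack is added.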

Editing the CRUSH Map#

# Export, decompile, edit, recompile, inject
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# Edit crushmap.txt as needed
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin

4. Core Components#

Ceph OSDs (Object Storage Daemons)#

Each OSD manages one storage device. OSDs handle data replication, recovery, rebalancing, and heartbeat checks with peer OSDs and monitors.

  • BlueStore (default since Luminous) - purpose-built storage backend, bypasses the local filesystem
  • One OSD per physical disk is the standard deployment model
  • Minimum 3 OSDs for a production replicated pool

Ceph Monitors (MON)#

Monitors maintain the authoritative copy of the cluster map, which includes:

  • Monitor map - list of monitors
  • OSD map - OSD states and locations
  • CRUSH map - data placement rules
  • MDS map - metadata server states (for CephFS)
  • Manager map - active/standby managers

Deploy an odd number (3 or 5) for Paxos quorum.
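
The reason for an odd count follows from majority arithmetic: a quorum needs more than half of the monitors. A small sketch (the function names are illustrative only):

```shell
# Smallest majority of n monitors
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Monitors that can fail while a majority still remains
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
  echo "$n monitors: quorum $(quorum_size "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

Four monitors tolerate only one failure, the same as three, which is why an even count adds cost without adding resilience.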

Ceph Managers (MGR)#

Managers collect cluster-wide metrics and host modules:

  • Dashboard - web-based monitoring and management UI
  • Prometheus - exposes metrics for Prometheus scraping
  • Balancer - automatic PG distribution optimization
  • Orchestrator - interfaces with cephadm or Rook for deployment

Deploy at least two (one active, one standby).

Ceph MDS (Metadata Server)#

Required only for CephFS. Manages the POSIX filesystem namespace (directory hierarchy, permissions, timestamps). Deploy at least 2 for high availability.

5. Storage Interfaces#

Interface                | Protocol                 | Use Case
-------------------------+--------------------------+------------------------------------------
RBD (RADOS Block Device) | Kernel module or librbd  | VM disks, Kubernetes PVs, database storage
CephFS                   | Kernel or FUSE mount     | Shared filesystems, home directories, HPC
RADOS Gateway (RGW)      | S3 / Swift REST API      | Object storage, backups, media archives

6. Release Versions#

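Ceph release names follow alphabetical order and a marine (cephalopod) theme. Each stable release is supported for approximately 2 years.

Release | Version | Status (release year) | Key Features
--------+---------+-----------------------+---------------------------------------------------
Squid   | 19.x    | Current (2024)        | Improved RGW multisite, crimson OSD tech preview
Reef    | 18.x    | Stable (2023)         | RBD image mirroring improvements, dashboard overhaul
Quincy  | 17.x    | Maintenance (2022)    | cephadm maturity, msgr2 default
Pacific | 16.x    | EOL (2021)            | cephadm orchestrator, stretch clusters

For production deployments, target a stable release that is still actively maintained and has already received several point releases (Reef 18.x as of 2024). Upstream Ceph does not designate LTS releases; every stable release gets roughly the same two-year maintenance window.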

7. Installation#

Prerequisites#

  • Minimum 3 nodes for production (1 node for testing)
  • Each node: dedicated OS disk plus one or more OSD disks
  • All nodes reachable via SSH from the bootstrap host
  • NTP synchronized across all nodes
  • Python 3.6+ and container runtime (Podman preferred, Docker supported)

Network Configuration#

Ceph recommends two separate networks:

  • Public network - client-to-cluster traffic (MON, RGW, RBD, CephFS)
  • Cluster network - OSD-to-OSD replication and recovery traffic

# Example: /etc/ceph/ceph.conf network settings
[global]
public_network = 192.0.2.0/24
cluster_network = 198.51.100.0/24

Debian/Ubuntu:

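# Add the Ceph release key and repository
# (apt-key is deprecated, so the key goes into a dedicated keyring instead)
sudo install -d -m 0755 /etc/apt/keyrings
wget -q -O- 'https://download.ceph.com/keys/release.asc' | \
  sudo gpg --dearmor -o /etc/apt/keyrings/ceph.gpg
echo "deb [signed-by=/etc/apt/keyrings/ceph.gpg] https://download.ceph.com/debian-reef/ $(lsb_release -sc) main" | \
  sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt update
sudo apt install -y cephadm ceph-common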

RHEL/CentOS/Rocky:

sudo dnf install -y centos-release-ceph-reef
sudo dnf install -y cephadm ceph-common

Arch Linux:

# cephadm is available in the AUR
sudo pacman -S ceph-common

Installing via cephadm (Alternative Bootstrap)#

If your distribution does not package cephadm, download it directly:

curl --silent --remote-name \
  https://download.ceph.com/rpm-reef/el9/noarch/cephadm
chmod +x cephadm
sudo ./cephadm add-repo --release reef
sudo ./cephadm install

8. Cluster Creation#

Bootstrapping#

Initialize the first monitor on the bootstrap node:

sudo cephadm bootstrap \
  --mon-ip <mon-ip-address> \
  --initial-dashboard-user admin \
  --initial-dashboard-password <password> \
  --cluster-network <cluster-cidr>

This creates the initial MON and MGR daemons and enables the dashboard at https://<mon-ip>:8443.

Enabling the Ceph CLI#

# Install the ceph CLI wrapper
sudo cephadm install ceph-common

# Verify connectivity
ceph -s

9. Node and OSD Management#

Adding Nodes#

# Copy the SSH key to the new host
ssh-copy-id -f -i /etc/ceph/ceph.pub root@<new-host>

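# Add the host to the cluster
# (the _admin label also copies ceph.conf and the admin keyring to the host;
# omit --labels _admin for hosts that should not hold them)
ceph orch host add <hostname> <ip-address> --labels _admin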

# Verify the host is registered
ceph orch host ls

Adding OSDs#

# List available devices on all hosts
ceph orch device ls

# Add all available devices as OSDs
ceph orch apply osd --all-available-devices

# Or add a specific device
ceph orch daemon add osd <hostname>:/dev/<device>

Removing an OSD#

# Mark the OSD out so data migrates away
ceph osd out osd.<id>

# Wait for rebalancing to complete
ceph -w

# Remove the OSD daemon (takes the numeric ID, without the osd. prefix)
ceph orch osd rm <id>

# Verify removal
ceph osd tree
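
Watching ceph -w by hand works, but the wait can also be scripted. The sketch below is a generic polling helper (wait_until is a hypothetical name, not part of Ceph) that retries any command until it succeeds; the commented usage line assumes it is paired with ceph osd safe-to-destroy and an example OSD ID of 7.

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it exits 0 or ATTEMPTS is exhausted.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}

# Assumed usage: poll every 30 s, for up to an hour, until osd.7 holds no
# data that is still needed, then proceed with the removal.
# wait_until 120 30 ceph osd safe-to-destroy osd.7
```

Bounding the number of attempts keeps an unattended script from waiting forever if the cluster cannot finish rebalancing, for example because it is too full.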

Replacing a Failed Disk#

# Destroy the old OSD (preserves the ID for the replacement; numeric ID only)
ceph orch osd rm <id> --replace

# Insert new disk, then redeploy
ceph orch daemon add osd <hostname>:/dev/<new-device>

10. Capacity Planning#

Sizing Guidelines#

Component  | Recommendation
-----------+-------------------------------------------------------
MON        | 2 GiB RAM minimum; SSD or NVMe for MON data
OSD (HDD)  | 5 GiB RAM per OSD daemon
OSD (NVMe) | 5-7 GiB RAM per OSD daemon
MGR        | 1 GiB RAM minimum
MDS        | 1-4 GiB RAM per active MDS (scales with metadata load)
Network    | 10 GbE minimum for production; 25 GbE recommended

Calculating Usable Space#

For a replicated pool with size=3:

Usable = Raw Capacity / Replication Factor
Example: 12 x 8 TB HDDs = 96 TB raw / 3 = 32 TB usable

For an erasure-coded pool with k=4, m=2:

Usable = Raw Capacity x (k / (k + m))
Example: 96 TB raw x (4/6) = 64 TB usable
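
Both formulas can be wrapped in small shell helpers for quick what-if checks. This is a minimal sketch using integer arithmetic (results are truncated to whole TB); the function names are illustrative.

```shell
# usable_replicated RAW_TB SIZE
usable_replicated() {
  echo $(( $1 / $2 ))
}

# usable_ec RAW_TB K M
usable_ec() {
  echo $(( $1 * $2 / ($2 + $3) ))
}

echo "replicated, size=3: $(usable_replicated 96 3) TB usable"
echo "erasure-coded, k=4 m=2: $(usable_ec 96 4 2) TB usable"
```

These figures are upper bounds: OSDs raise a nearfull warning at 85% by default, so plan to keep real usage well below the computed value.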

PG Count#

The recommended PG count per pool depends on the number of OSDs:

# Let Ceph auto-tune PG count (autoscaler available since Nautilus,
# enabled by default for new pools since Octopus)
ceph osd pool set <pool-name> pg_autoscale_mode on

Manual calculation:

Total PGs = (OSDs x 100) / Replication Factor
Round up to the nearest power of 2.
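
The manual calculation can be sketched as a shell function; pg_count is an illustrative name, and the target of 100 PGs per OSD is the figure from the formula above.

```shell
# pg_count OSDS REPLICATION_FACTOR
pg_count() {
  osds=$1
  replicas=$2
  # Ceiling division so a fractional target is never rounded down
  target=$(( (osds * 100 + replicas - 1) / replicas ))
  pgs=1
  # Double until the target is reached: the next power of 2 at or above it
  while [ "$pgs" -lt "$target" ]; do
    pgs=$(( pgs * 2 ))
  done
  echo "$pgs"
}

echo "12 OSDs, size=3: $(pg_count 12 3) PGs"
```

For 12 OSDs and size=3 the raw target is 400, which rounds up to 512. The result is a total across all pools, so divide it among pools according to their expected share of the data.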

11. Monitoring and Alerting#

Cluster Health#

# Quick status overview
ceph -s

# Detailed health messages
ceph health detail

# Watch cluster events in real time
ceph -w
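
For cron jobs or external monitoring checks, the health string can be turned into an exit code. A minimal sketch, assuming the first word printed by ceph health is HEALTH_OK, HEALTH_WARN, or HEALTH_ERR; health_to_exit is a hypothetical helper and the 0/1/2/3 mapping follows the common Nagios-style convention.

```shell
# health_to_exit "<output of ceph health>"
# Returns 0 = OK, 1 = warning, 2 = error, 3 = unknown
health_to_exit() {
  case "$1" in
    HEALTH_OK*)   return 0 ;;
    HEALTH_WARN*) return 1 ;;
    HEALTH_ERR*)  return 2 ;;
    *)            return 3 ;;
  esac
}

# Assumed usage on a node with the ceph CLI installed:
# health_to_exit "$(ceph health)"
health_to_exit "HEALTH_OK" && echo "cluster healthy"
```

Treating any unrecognized output as "unknown" rather than "OK" means a failed or timed-out ceph call is surfaced instead of silently passing.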

Ceph Dashboard#

The built-in dashboard provides a web UI for monitoring and basic management:

# Enable the dashboard module
ceph mgr module enable dashboard

# Create a self-signed certificate
ceph dashboard create-self-signed-cert

# Set or reset admin password
ceph dashboard ac-user-set-password admin -i <password-file>

Access at https://<mgr-host>:8443.

Prometheus Integration#

# Enable the Prometheus module
ceph mgr module enable prometheus

# Metrics are exposed at http://<mgr-host>:9283/metrics

Recommended Prometheus alert rules:

Alert                 | Condition                                             | Severity
----------------------+-------------------------------------------------------+---------
Cluster health not OK | ceph_health_status != 0                               | Warning
OSD down              | ceph_osd_up == 0 for any OSD                          | Critical
Near-full OSD         | ceph_osd_stat_bytes_used / ceph_osd_stat_bytes > 0.85 | Warning
PGs not active+clean  | ceph_pg_active < ceph_pg_total for >5 min             | Warning
Pool near full        | ceph_pool_bytes_used / ceph_pool_max_avail > 0.80     | Critical

Email Alerts#

# Enable the built-in alert module
ceph mgr module enable alerts
ceph config set mgr mgr/alerts/smtp_host <smtp-server>
ceph config set mgr mgr/alerts/smtp_destination <email>
ceph config set mgr mgr/alerts/smtp_sender ceph@<domain>

12. Troubleshooting#

Issue | Cause | Solution
------+-------+---------
HEALTH_WARN: clock skew detected | NTP not synchronized across nodes | Install and configure chrony or ntpd on all nodes; verify with ceph time-sync-status
HEALTH_ERR: X osds down | Disk failure, network issue, or OSD process crash | Check ceph orch ps --daemon-type osd and systemctl status ceph-<fsid>@osd.<id> (cephadm deployments; ceph-osd@<id> on package-based installs), read OSD logs with cephadm logs --name osd.<id> or under /var/log/ceph/, and verify network connectivity
HEALTH_WARN: too few PGs per OSD | PG count too low for the cluster size | Increase pg_num or set pg_autoscale_mode to on for the affected pools
PGs stuck in degraded or undersized | Not enough OSDs to satisfy replication | Add OSDs, or temporarily reduce pool min_size (not recommended for production)
PGs stuck in remapped | Data migration after topology change | Wait for rebalancing; monitor with ceph pg stat
MON quorum lost | Majority of monitors unreachable | Restore network, restart MON services; if unrecoverable, rebuild from surviving MON
Slow requests / blocked ops | Slow OSD disk, network congestion | Check ceph daemon osd.<id> dump_ops_in_flight, review disk I/O with iostat
HEALTH_WARN: nearfull osd(s) | OSD usage has passed the nearfull ratio (85% by default) | Add capacity, rebalance with CRUSH weight adjustments, or delete unused data

See Also#

Sources#