Ceph · ArchWorks

Ceph is an open-source, software-defined storage platform providing unified object, block, and file storage with automatic data distribution and no single point of failure.

Addresses below are RFC 5737 documentation ranges or placeholders - swap in your own.

Table of Contents#

Overview
Architecture
CRUSH Algorithm and Failure Domains
Core Components
Storage Interfaces
Release Versions
Installation
Cluster Creation
Node and OSD Management
Capacity Planning
Monitoring and Alerting
Troubleshooting
See Also
Sources

1. Overview#

Ceph provides massively scalable storage that operates on commodity hardware. A single Ceph cluster can serve object storage (via RADOS Gateway), block storage (via RBD), and a POSIX-compatible filesystem (via CephFS), all backed by the same underlying RADOS (Reliable Autonomic Distributed Object Store) layer.

Key characteristics:

No single point of failure - all components can be deployed redundantly
Self-healing - automatically detects and recovers from disk, node, and rack failures
Self-managing - rebalances data when capacity is added or removed
Linearly scalable - performance and capacity grow with the cluster
Unified storage - object, block, and file from one platform

2. Architecture#

Ceph's architecture revolves around RADOS, the object store that underpins everything:

+-------------------+-------------------+-------------------+
|   RADOS Gateway   |       RBD         |      CephFS       |
|  (Object / S3)    |  (Block Device)   |   (Filesystem)    |
+-------------------+-------------------+-------------------+
|                       librados                            |
+-----------------------------------------------------------+
|                        RADOS                              |
|  +--------+  +--------+  +--------+  +--------+           |
|  |  OSD   |  |  OSD   |  |  OSD   |  |  OSD   |  ...      |
|  +--------+  +--------+  +--------+  +--------+           |
|  +--------+  +--------+  +--------+                       |
|  |  MON   |  |  MGR   |  |  MDS   |  (MDS for CephFS)     |
|  +--------+  +--------+  +--------+                       |
+-----------------------------------------------------------+

Storage Pools#

Pools are logical partitions within RADOS. Each pool has configurable:

Replication factor (size) or erasure coding profile
Placement group (PG) count - determines data distribution granularity
CRUSH rules - control which failure domains store the data

# Create a replicated pool with 128 PGs
ceph osd pool create mypool 128 128 replicated

# Create an erasure-coded pool (k=4 data, m=2 parity)
ceph osd erasure-code-profile set ec-42 k=4 m=2
ceph osd pool create ec-pool 128 128 erasure ec-42
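To compare what the two pool types above cost and tolerate, the arithmetic can be sketched in Python. The function and key names are illustrative only (they are not part of any Ceph API), and the replicated case assumes the default replica count of 3.

```python
def replicated_profile(size):
    """Replicated pool: `size` full copies of every object."""
    return {
        "failure_domains_needed": size,      # one copy per failure domain
        "losses_survivable": size - 1,       # data survives while one copy remains
        "raw_bytes_per_stored_byte": float(size),
    }


def erasure_profile(k, m):
    """Erasure-coded pool: k data chunks plus m coding chunks per object."""
    return {
        "failure_domains_needed": k + m,     # one chunk per failure domain
        "losses_survivable": m,              # any k chunks can rebuild the object
        "raw_bytes_per_stored_byte": (k + m) / k,
    }


print(replicated_profile(3))   # assumed default size for "mypool"
print(erasure_profile(4, 2))   # the ec-42 profile
```

Both layouts survive the loss of two failure domains without losing data, but the k=4, m=2 profile needs six failure domains and stores 1.5 raw bytes per byte written instead of 3.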

3. CRUSH Algorithm and Failure Domains#

The CRUSH (Controlled Replication Under Scalable Hashing) algorithm determines where data is placed without requiring a central lookup table. Clients compute placement locally using the CRUSH map.

CRUSH Map Hierarchy#

The CRUSH map defines a hierarchy of physical locations:

root default
  rack rack1
    host node1
      osd.0
      osd.1
    host node2
      osd.2
      osd.3
  rack rack2
    host node3
      osd.4
      osd.5
    host node4
      osd.6
      osd.7
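The idea that any client can compute placement from the map alone, with no lookup table, can be illustrated with a toy model. The sketch below is not the CRUSH algorithm: it substitutes plain highest-random-weight (rendezvous) hashing, ignores weights and placement groups, and uses hypothetical helper names. It only shows that a deterministic hash over the hierarchy above yields the same OSDs on every client while keeping replicas in different racks.

```python
import hashlib

# Hierarchy copied from the example CRUSH map above: rack -> host -> OSDs.
HIERARCHY = {
    "rack1": {"node1": ["osd.0", "osd.1"], "node2": ["osd.2", "osd.3"]},
    "rack2": {"node3": ["osd.4", "osd.5"], "node4": ["osd.6", "osd.7"]},
}


def _score(key, candidate):
    """Deterministic pseudo-random score for a (key, candidate) pair."""
    digest = hashlib.sha256(f"{key}/{candidate}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def _pick(key, candidates, count=1):
    """Return the `count` highest-scoring candidates for this key."""
    return sorted(candidates, key=lambda c: _score(key, c), reverse=True)[:count]


def place(object_name, replicas=2):
    """Choose one OSD in each of `replicas` distinct racks."""
    if replicas > len(HIERARCHY):
        raise ValueError("not enough racks for the requested replica count")
    osds = []
    for rack in _pick(object_name, HIERARCHY, replicas):
        host = _pick(object_name, HIERARCHY[rack])[0]
        osds.append(_pick(object_name, HIERARCHY[rack][host])[0])
    return osds


print(place("my-object"))
```

Calling `place("my-object")` on any machine that holds the same hierarchy returns the same two OSDs, one per rack, which is the property the real CRUSH map provides at much larger scale.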

Failure Domains#

Failure domains ensure that replicas are spread across distinct physical boundaries. The most common levels, from broadest to narrowest:

Level	Description	Typical Use
`datacenter`	Separate data centers or regions	Multi-site / stretch clusters
`room`	Server rooms within a building	Large single-site deployments
`rack`	Individual server racks	Production minimum recommendation
`host`	Individual servers	Small clusters (3-5 nodes)
`osd`	Individual disks	Testing only, no real redundancy

# View the current CRUSH map
ceph osd crush tree

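# Create a CRUSH rule that replicates across racks
# (arguments: rule name, root, failure domain; an optional fourth
# argument selects a device class such as hdd or ssd)
ceph osd crush rule create-replicated rack-rule default rack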

# Assign a pool to a specific CRUSH rule
ceph osd pool set mypool crush_rule rack-rule

Editing the CRUSH Map#

# Export, decompile, edit, recompile, inject
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# Edit crushmap.txt as needed
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin

4. Core Components#

Ceph OSDs (Object Storage Daemons)#

Each OSD manages one storage device. OSDs handle data replication, recovery, rebalancing, and heartbeat checks with peer OSDs and monitors.

BlueStore (default since Luminous) - purpose-built storage backend, bypasses the local filesystem
One OSD per physical disk is the standard deployment model
Minimum 3 OSDs for a production replicated pool

Ceph Monitors (MON)#

Monitors maintain the authoritative copy of the cluster map, which includes:

Monitor map - list of monitors
OSD map - OSD states and locations
CRUSH map - data placement rules
MDS map - metadata server states (for CephFS)
Manager map - active/standby managers

Deploy an odd number (3 or 5): Paxos needs a strict majority of monitors to form a quorum, so an even count adds no extra failure tolerance.
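The majority arithmetic behind that recommendation is short enough to check directly. This is a minimal sketch with illustrative function names, assuming only that a quorum is a strict majority of the configured monitors.

```python
def quorum_size(monitors):
    """Smallest strict majority of the configured monitors."""
    return monitors // 2 + 1


def tolerated_failures(monitors):
    """Monitors that can fail while a quorum can still form."""
    return monitors - quorum_size(monitors)


for n in (3, 4, 5):
    print(f"{n} MONs: quorum={quorum_size(n)}, tolerates={tolerated_failures(n)}")
```

Four monitors tolerate one failure, the same as three, which is why the next useful step up from 3 is 5.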

Ceph Managers (MGR)#

Managers collect cluster-wide metrics and host modules:

Dashboard - web-based monitoring and management UI
Prometheus - exposes metrics for Prometheus scraping
Balancer - automatic PG distribution optimization
Orchestrator - interfaces with cephadm or Rook for deployment

Deploy at least 2 (one active, one standby).

Ceph MDS (Metadata Server)#

Required only for CephFS. Manages the POSIX filesystem namespace (directory hierarchy, permissions, timestamps). Deploy at least 2 for high availability.

5. Storage Interfaces#

Interface	Protocol	Use Case
RBD (RADOS Block Device)	Kernel module or librbd	VM disks, Kubernetes PVs, database storage
CephFS	Kernel or FUSE mount	Shared filesystems, home directories, HPC
RADOS Gateway (RGW)	S3 / Swift REST API	Object storage, backups, media archives

6. Release Versions#

Ceph uses alphabetical marine creature names. Each release is supported for approximately 2 years.

Release	Version	Status	Key Features
Squid	19.x	Current (2024)	Improved RGW multisite, crimson OSD tech preview
Reef	18.x	Stable (2023)	RBD image mirroring improvements, dashboard overhaul
Quincy	17.x	Maintenance (2022)	cephadm maturity, msgr2 default
Pacific	16.x	EOL (2021)	cephadm orchestrator, stretch clusters

Ceph no longer designates LTS releases. For production, target a release that is still in active support (Reef 18.x or Squid 19.x as of 2024) and avoid EOL versions.

7. Installation#

Prerequisites#

Minimum 3 nodes for production (1 node for testing)
Each node: dedicated OS disk plus one or more OSD disks
All nodes reachable via SSH from the bootstrap host
NTP synchronized across all nodes
Python 3.6+ and container runtime (Podman preferred, Docker supported)

Network Configuration#

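Ceph works with a single public network; a separate cluster network is optional and is typically added to keep replication and recovery traffic off the client-facing links: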

Public network - client-to-cluster traffic (MON, RGW, RBD, CephFS)
Cluster network - OSD-to-OSD replication and recovery traffic

# Example: /etc/ceph/ceph.conf network settings
[global]
public_network = 192.0.2.0/24
cluster_network = 198.51.100.0/24
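Before rolling out these settings, it can help to confirm that the two CIDRs do not overlap and that each node's addresses fall in the intended network. The sketch below uses Python's standard `ipaddress` module; the helper names and the host address `192.0.2.10` are made up for illustration.

```python
import ipaddress

# Values taken from the example ceph.conf above.
public = ipaddress.ip_network("192.0.2.0/24")
cluster = ipaddress.ip_network("198.51.100.0/24")


def networks_are_distinct(a, b):
    """True when the two networks share no addresses."""
    return not a.overlaps(b)


def on_network(addr, network):
    """True when the address belongs to the given network."""
    return ipaddress.ip_address(addr) in network


print(networks_are_distinct(public, cluster))
print(on_network("192.0.2.10", public))   # hypothetical MON address
```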

Installing via Package Manager (Recommended)#

Debian/Ubuntu:

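# Add the Ceph release key and repository (apt-key is deprecated)
sudo mkdir -p /etc/apt/keyrings
wget -q -O- 'https://download.ceph.com/keys/release.asc' | \
  sudo gpg --dearmor -o /etc/apt/keyrings/ceph.gpg
echo "deb [signed-by=/etc/apt/keyrings/ceph.gpg] https://download.ceph.com/debian-reef/ $(lsb_release -sc) main" | \
  sudo tee /etc/apt/sources.list.d/ceph.list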
sudo apt update
sudo apt install -y cephadm ceph-common

RHEL/CentOS/Rocky:

sudo dnf install -y centos-release-ceph-reef
sudo dnf install -y cephadm ceph-common

Arch Linux:

# cephadm is available in the AUR
sudo pacman -S ceph-common

Installing via cephadm (Alternative Bootstrap)#

If your distribution does not package cephadm, download it directly:

curl --silent --remote-name \
  https://download.ceph.com/rpm-reef/el9/noarch/cephadm
chmod +x cephadm
sudo ./cephadm add-repo --release reef
sudo ./cephadm install

8. Cluster Creation#

Bootstrapping#

Initialize the first monitor on the bootstrap node:

sudo cephadm bootstrap \
  --mon-ip <mon-ip-address> \
  --initial-dashboard-user admin \
  --initial-dashboard-password <password> \
  --cluster-network <cluster-cidr>

This creates the initial MON and MGR daemons and enables the dashboard at https://<mon-ip>:8443.

Enabling the Ceph CLI#

# Install the ceph CLI wrapper
sudo cephadm install ceph-common

# Verify connectivity
ceph -s

9. Node and OSD Management#

Adding Nodes#

# Copy the SSH key to the new host
ssh-copy-id -f -i /etc/ceph/ceph.pub root@<new-host>

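# Add the host to the cluster (the _admin label also copies ceph.conf and the
# admin keyring to it; omit the label for ordinary storage hosts)
ceph orch host add <hostname> <ip-address> --labels _admin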

# Verify the host is registered
ceph orch host ls

Adding OSDs#

# List available devices on all hosts
ceph orch device ls

# Add all available devices as OSDs
ceph orch apply osd --all-available-devices

# Or add a specific device
ceph orch daemon add osd <hostname>:/dev/<device>

Removing an OSD#

# Mark the OSD out so data migrates away
ceph osd out osd.<id>

# Wait for rebalancing to complete
ceph -w

# Remove the OSD daemon (takes the numeric ID, without the "osd." prefix)
ceph orch osd rm <id>

# Verify removal
ceph osd tree

Replacing a Failed Disk#

# Drain the old OSD and mark it destroyed (preserves its ID for the replacement)
ceph orch osd rm <id> --replace

# Insert new disk, then redeploy
ceph orch daemon add osd <hostname>:/dev/<new-device>

10. Capacity Planning#

Sizing Guidelines#

Component	Recommendation
MON	2 GiB RAM minimum; SSD or NVMe for MON data
OSD (HDD)	5 GiB RAM per OSD daemon
OSD (NVMe)	5-7 GiB RAM per OSD daemon
MGR	1 GiB RAM minimum
MDS	1-4 GiB RAM per active MDS (scales with metadata load)
Network	10 GbE minimum for production; 25 GbE recommended

Calculating Usable Space#

For a replicated pool with size=3:

Usable = Raw Capacity / Replication Factor
Example: 12 x 8 TB HDDs = 96 TB raw; 96 TB / 3 = 32 TB usable

For an erasure-coded pool with k=4, m=2:

Usable = Raw Capacity x (k / (k + m))
Example: 96 TB raw x (4/6) = 64 TB usable
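Both formulas are easy to wrap in small helpers for planning spreadsheets or scripts. The sketch below is illustrative: the function names are made up, and the optional headroom factor assumes you want to stay under the 85% nearfull threshold mentioned in the troubleshooting table.

```python
def usable_replicated(raw_tb, size):
    """Usable capacity of a replicated pool with `size` copies."""
    return raw_tb / size


def usable_erasure(raw_tb, k, m):
    """Usable capacity of an erasure-coded pool with k data and m coding chunks."""
    return raw_tb * k / (k + m)


def with_headroom(usable_tb, nearfull_ratio=0.85):
    """Capacity you can fill before OSDs reach the nearfull ratio."""
    return usable_tb * nearfull_ratio


raw = 12 * 8  # 12 x 8 TB HDDs
print(usable_replicated(raw, 3))   # 32.0
print(usable_erasure(raw, 4, 2))   # 64.0
```

Treat the results as upper bounds: uneven data distribution and recovery headroom reduce what is practically available.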

PG Count#

The recommended PG count depends on the number of OSDs and the replication factor. The simplest option is to let the autoscaler manage it:

# Let Ceph auto-tune PG count (the autoscaler has been available since Nautilus)
ceph osd pool set <pool-name> pg_autoscale_mode on

Manual calculation:

Total PGs = (OSDs x 100) / Replication Factor
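Round up to the nearest power of 2. The result is a total for the cluster; divide it among pools according to their expected share of the data.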
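The manual calculation, including the power-of-two rounding, can be sketched as follows. The function name is illustrative, and the target of 100 PGs per OSD is taken from the formula above.

```python
def recommended_total_pgs(osds, replication_factor, pgs_per_osd=100):
    """Apply (OSDs x pgs_per_osd) / replication factor, rounded up to a power of 2."""
    target = osds * pgs_per_osd / replication_factor
    power = 1
    while power < target:
        power *= 2
    return power


print(recommended_total_pgs(3, 3))    # 100 -> 128
print(recommended_total_pgs(12, 3))   # 400 -> 512
```

For a 3-OSD cluster with size=3 this yields 128, the value used in the pool creation examples earlier. For erasure-coded pools, pass k + m as the replication factor.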

11. Monitoring and Alerting#

Cluster Health#

# Quick status overview
ceph -s

# Detailed health messages
ceph health detail

# Watch cluster events in real time
ceph -w

Ceph Dashboard#

The built-in dashboard provides a web UI for monitoring and basic management:

# Enable the dashboard module
ceph mgr module enable dashboard

# Create a self-signed certificate
ceph dashboard create-self-signed-cert

# Set or reset admin password
ceph dashboard ac-user-set-password admin -i <password-file>

Access at https://<mgr-host>:8443.

Prometheus Integration#

# Enable the Prometheus module
ceph mgr module enable prometheus

# Metrics are exposed at http://<mgr-host>:9283/metrics

Recommended Prometheus alert rules:

Alert	Condition	Severity
Cluster health not OK	`ceph_health_status != 0`	Warning
OSD down	`ceph_osd_up == 0` for any OSD	Critical
Near-full OSD	`ceph_osd_stat_bytes_used / ceph_osd_stat_bytes > 0.85`	Warning
PGs not active+clean	`ceph_pg_active < ceph_pg_total` for >5 min	Warning
Pool near full	`ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail) > 0.80`	Critical

Email Alerts#

# Enable the built-in alert module
ceph mgr module enable alerts
ceph config set mgr mgr/alerts/smtp_host <smtp-server>
ceph config set mgr mgr/alerts/smtp_destination <email>
ceph config set mgr mgr/alerts/smtp_sender ceph@<domain>

12. Troubleshooting#

Issue	Cause	Solution
`HEALTH_WARN: clock skew detected`	NTP not synchronized across nodes	Install and configure chrony or ntpd on all nodes; verify with `ceph time-sync-status`
`HEALTH_WARN: X osds down`	Disk failure, network issue, or OSD process crash	Check daemon state with `ceph orch ps`, read logs with `cephadm logs --name osd.<id>` on the affected host, and verify network connectivity
`HEALTH_WARN: too few PGs per OSD`	PG count too low for the cluster size	Increase PG count or enable `pg_autoscale_mode on` for affected pools
PGs stuck in `degraded` or `undersized`	Not enough OSDs to satisfy replication	Add OSDs, or temporarily reduce pool `min_size` (not recommended for production)
PGs stuck in `remapped`	Data migration after topology change	Wait for rebalancing; monitor with `ceph pg stat`
MON quorum lost	Majority of monitors unreachable	Restore network, restart MON services; if unrecoverable, rebuild from surviving MON
Slow requests / blocked ops	Slow OSD disk, network congestion	Check `ceph daemon osd.<id> dump_ops_in_flight`, review disk I/O with `iostat`
`HEALTH_WARN: nearfull osd(s)`	OSD approaching 85% capacity	Add capacity, rebalance with CRUSH weight adjustments, or delete unused data