Ceph is an open-source, software-defined storage platform providing unified object, block, and file storage with automatic data distribution and no single point of failure.
Addresses below are RFC 5737 documentation ranges or placeholders - swap in your own.
Table of Contents#
- Overview
- Architecture
- CRUSH Algorithm and Failure Domains
- Core Components
- Storage Interfaces
- Release Versions
- Installation
- Cluster Creation
- Node and OSD Management
- Capacity Planning
- Monitoring and Alerting
- Troubleshooting
- See Also
- Sources
1. Overview#
Ceph provides massively scalable storage that operates on commodity hardware. A single Ceph cluster can serve object storage (via RADOS Gateway), block storage (via RBD), and a POSIX-compatible filesystem (via CephFS), all backed by the same underlying RADOS (Reliable Autonomic Distributed Object Store) layer.
Key characteristics:
- No single point of failure - all components can be deployed redundantly
- Self-healing - automatically detects and recovers from disk, node, and rack failures
- Self-managing - rebalances data when capacity is added or removed
- Linearly scalable - performance and capacity grow with the cluster
- Unified storage - object, block, and file from one platform
2. Architecture#
Ceph's architecture revolves around RADOS, the object store that underpins everything:
+-------------------+-------------------+-------------------+
|   RADOS Gateway   |        RBD        |      CephFS       |
|   (Object / S3)   |  (Block Device)   |   (Filesystem)    |
+-------------------+-------------------+-------------------+
|                         librados                          |
+-----------------------------------------------------------+
|                           RADOS                           |
|  +--------+  +--------+  +--------+  +--------+           |
|  |  OSD   |  |  OSD   |  |  OSD   |  |  OSD   |  ...      |
|  +--------+  +--------+  +--------+  +--------+           |
|  +--------+  +--------+  +--------+                       |
|  |  MON   |  |  MGR   |  |  MDS   |  (MDS for CephFS)     |
|  +--------+  +--------+  +--------+                       |
+-----------------------------------------------------------+
Storage Pools#
Pools are logical partitions within RADOS. Each pool has configurable:
- Replication factor (size) or erasure coding profile
- Placement group (PG) count - determines data distribution granularity
- CRUSH rules - control which failure domains store the data
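To make the size, erasure-coding, and PG-count settings more concrete, here is a minimal sketch of the arithmetic behind them. The function names are hypothetical helpers, not Ceph commands, and the formulas are the same rules of thumb given under Capacity Planning (raw capacity divided by the replication factor, raw capacity times k/(k+m), and roughly 100 PGs per OSD divided by the replication factor, rounded up to a power of 2). Capacities are plain integers in whatever unit you pass in.

```shell
#!/bin/sh
# Hypothetical helpers; these are rules of thumb, not Ceph APIs.

# Usable capacity of a replicated pool: raw / size.
usable_replicated() {
  raw=$1 size=$2
  echo $(( raw / size ))
}

# Usable capacity of an erasure-coded pool: raw * k / (k + m).
usable_erasure() {
  raw=$1 k=$2 m=$3
  echo $(( raw * k / (k + m) ))
}

# Suggested total PG count: ceil(osds * 100 / size),
# then rounded up to the next power of 2.
suggested_pgs() {
  osds=$1 size=$2
  target=$(( (osds * 100 + size - 1) / size ))
  pgs=1
  while [ "$pgs" -lt "$target" ]; do
    pgs=$(( pgs * 2 ))
  done
  echo "$pgs"
}

usable_replicated 96 3   # 96 TB raw, size=3      -> prints 32
usable_erasure 96 4 2    # 96 TB raw, k=4, m=2    -> prints 64
suggested_pgs 12 3       # 12 OSDs, size=3        -> prints 512
```

For an erasure-coded pool, pass k+m as the second argument to `suggested_pgs`, since each PG then spans k+m OSDs. These numbers are only a starting point; the PG autoscaler may settle on different values.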
# Create a replicated pool with 128 PGs
ceph osd pool create mypool 128 128 replicated
# Create an erasure-coded pool (k=4 data, m=2 parity)
ceph osd erasure-code-profile set ec-42 k=4 m=2
ceph osd pool create ec-pool 128 128 erasure ec-42
3. CRUSH Algorithm and Failure Domains#
The CRUSH (Controlled Replication Under Scalable Hashing) algorithm determines where data is placed without requiring a central lookup table. Clients compute placement locally using the CRUSH map.
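The idea that placement is computed rather than looked up can be illustrated with a toy sketch. This is not Ceph's algorithm: Ceph hashes object names with its own hash function and then runs CRUSH to map each placement group to OSDs. The hypothetical helper below substitutes `cksum` and a plain modulo purely to show that any client holding the same inputs arrives at the same answer without asking a central service.

```shell
#!/bin/sh
# Toy stand-in for the object-to-PG step (hypothetical helper, not a Ceph tool).
# Assumption: cksum (CRC) replaces Ceph's real hash, for illustration only.
toy_pg_for_object() {
  name=$1 pg_num=$2
  # cksum prints "<crc> <byte-count>"; keep only the CRC.
  hash=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
  echo $(( hash % pg_num ))
}

toy_pg_for_object "rbd_data.1234" 128
toy_pg_for_object "rbd_data.1234" 128   # same inputs, same PG number
```

On a running cluster, `ceph osd map <pool> <object>` reports the real PG and OSD set for an object.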
CRUSH Map Hierarchy#
The CRUSH map defines a hierarchy of physical locations:
root default
  rack rack1
    host node1
      osd.0
      osd.1
    host node2
      osd.2
      osd.3
  rack rack2
    host node3
      osd.4
      osd.5
    host node4
      osd.6
      osd.7
Failure Domains#
Failure domains ensure that replicas are spread across distinct physical boundaries. The most common levels, from broadest to narrowest:
| Level | Description | Typical Rule |
|---|---|---|
| datacenter | Separate data centers or regions | Multi-site / stretch clusters |
| room | Server rooms within a building | Large single-site deployments |
| rack | Individual server racks | Production minimum recommendation |
| host | Individual servers | Small clusters (3-5 nodes) |
| osd | Individual disks | Testing only, no real redundancy |
# View the current CRUSH map
ceph osd crush tree
# Set a CRUSH rule to replicate across racks
# Syntax: create-replicated <rule-name> <root> <failure-domain> [<device-class>]
ceph osd crush rule create-replicated rack-rule default rack
# Assign a pool to a specific CRUSH rule
ceph osd pool set mypool crush_rule rack-rule
Editing the CRUSH Map#
# Export, decompile, edit, recompile, inject
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# Edit crushmap.txt as needed
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin
4. Core Components#
Ceph OSDs (Object Storage Daemons)#
Each OSD manages one storage device. OSDs handle data replication, recovery, rebalancing, and heartbeat checks with peer OSDs and monitors.
- BlueStore (default since Luminous) - purpose-built storage backend, bypasses the local filesystem
- One OSD per physical disk is the standard deployment model
- Minimum 3 OSDs for a production replicated pool
Ceph Monitors (MON)#
Monitors maintain the authoritative copy of the cluster map, which includes:
- Monitor map - list of monitors
- OSD map - OSD states and locations
- CRUSH map - data placement rules
- MDS map - metadata server states (for CephFS)
- Manager map - active/standby managers
Deploy an odd number (3 or 5) for Paxos quorum.
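The reason for an odd count follows from majority arithmetic: a quorum needs more than half of the monitors, so adding a fourth monitor raises the quorum size without raising the number of failures the cluster can absorb. A minimal sketch, using hypothetical helper names:

```shell
#!/bin/sh
# Monitors needed for a majority quorum out of n.
mon_quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Monitor failures that can be absorbed while a majority remains.
mon_failures_tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
  echo "$n monitors: quorum=$(mon_quorum_size "$n")," \
       "tolerated failures=$(mon_failures_tolerated "$n")"
done
```

Three and four monitors both tolerate one failure; five tolerate two.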
Ceph Managers (MGR)#
Managers collect cluster-wide metrics and host modules:
- Dashboard - web-based monitoring and management UI
- Prometheus - exposes metrics for Prometheus scraping
- Balancer - automatic PG distribution optimization
- Orchestrator - interfaces with cephadm or Rook for deployment
Deploy 2 (one active, one standby) minimum.
Ceph MDS (Metadata Server)#
Required only for CephFS. Manages the POSIX filesystem namespace (directory hierarchy, permissions, timestamps). Deploy at least 2 for high availability.
5. Storage Interfaces#
| Interface | Protocol | Use Case |
|---|---|---|
| RBD (RADOS Block Device) | Kernel module or librbd | VM disks, Kubernetes PVs, database storage |
| CephFS | Kernel or FUSE mount | Shared filesystems, home directories, HPC |
| RADOS Gateway (RGW) | S3 / Swift REST API | Object storage, backups, media archives |
6. Release Versions#
Ceph uses alphabetical marine creature names. Each release is supported for approximately 2 years.
| Release | Version | Status | Key Features |
|---|---|---|---|
| Squid | 19.x | Current (2024) | Improved RGW multisite, crimson OSD tech preview |
| Reef | 18.x | Stable (2023) | RBD image mirroring improvements, dashboard overhaul |
| Quincy | 17.x | Maintenance (2022) | cephadm maturity, msgr2 default |
| Pacific | 16.x | EOL (2021) | cephadm orchestrator, stretch clusters |
Upstream Ceph does not designate LTS releases. For production deployments, target a stable release that is still within its support window and has already received several point releases (Reef 18.x as of 2024).
7. Installation#
Prerequisites#
- Minimum 3 nodes for production (1 node for testing)
- Each node: dedicated OS disk plus one or more OSD disks
- All nodes reachable via SSH from the bootstrap host
- NTP synchronized across all nodes
- Python 3.6+ and container runtime (Podman preferred, Docker supported)
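Some of these prerequisites can be checked with a short script before bootstrapping. The sketch below covers only the container runtime and Python items; the function names are hypothetical, and it does not test SSH reachability or time synchronization. cephadm also ships its own `cephadm check-host` command, which is likely the better choice once cephadm is installed.

```shell
#!/bin/sh
# Hypothetical preflight helpers mirroring part of the prerequisites list.

# True if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Print the preferred container runtime (Podman first, then Docker).
pick_container_runtime() {
  for candidate in podman docker; do
    if have_cmd "$candidate"; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# True if python3 exists and reports version 3.6 or newer.
python_ok() {
  have_cmd python3 &&
    python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 6) else 1)'
}

preflight() {
  if rt=$(pick_container_runtime); then
    echo "OK: container runtime: $rt"
  else
    echo "WARN: neither podman nor docker found"
  fi
  if python_ok; then
    echo "OK: python3 >= 3.6"
  else
    echo "WARN: python3 >= 3.6 not found"
  fi
}

preflight
```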
Network Configuration#
Ceph recommends two separate networks:
- Public network - client-to-cluster traffic (MON, RGW, RBD, CephFS)
- Cluster network - OSD-to-OSD replication and recovery traffic
# Example: /etc/ceph/ceph.conf network settings
[global]
public_network = 192.0.2.0/24
cluster_network = 198.51.100.0/24
Installing via Package Manager (Recommended)#
Debian/Ubuntu:
# Add the Ceph repository
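# apt-key is deprecated on current Debian/Ubuntu; store the key in a keyring file instead
wget -q -O- 'https://download.ceph.com/keys/release.asc' | \
  sudo gpg --dearmor -o /usr/share/keyrings/ceph.gpg
echo "deb [signed-by=/usr/share/keyrings/ceph.gpg] https://download.ceph.com/debian-reef/ $(lsb_release -sc) main" | \
  sudo tee /etc/apt/sources.list.d/ceph.list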
sudo apt update
sudo apt install -y cephadm ceph-common
RHEL/CentOS/Rocky:
sudo dnf install -y centos-release-ceph-reef
sudo dnf install -y cephadm ceph-common
Arch Linux:
# cephadm is available in the AUR
sudo pacman -S ceph-common
Installing via cephadm (Alternative Bootstrap)#
If your distribution does not package cephadm, download it directly:
curl --silent --remote-name \
https://download.ceph.com/rpm-reef/el9/noarch/cephadm
chmod +x cephadm
sudo ./cephadm add-repo --release reef
sudo ./cephadm install
8. Cluster Creation#
Bootstrapping#
Initialize the first monitor on the bootstrap node:
sudo cephadm bootstrap \
  --mon-ip <mon-ip-address> \
  --initial-dashboard-user admin \
  --initial-dashboard-password <password> \
  --cluster-network <cluster-cidr>
This creates the initial MON and MGR daemons and enables the dashboard at https://<mon-ip>:8443.
Enabling the Ceph CLI#
# Install the ceph CLI wrapper
sudo cephadm install ceph-common
# Verify connectivity
ceph -s
9. Node and OSD Management#
Adding Nodes#
# Copy the SSH key to the new host
ssh-copy-id -f -i /etc/ceph/ceph.pub root@<new-host>
# Add the host to the cluster
ceph orch host add <hostname> <ip-address> --labels _admin
# Verify the host is registered
ceph orch host ls
Adding OSDs#
# List available devices on all hosts
ceph orch device ls
# Add all available devices as OSDs
ceph orch apply osd --all-available-devices
# Or add a specific device
ceph orch daemon add osd <hostname>:/dev/<device>
Removing an OSD#
# Mark the OSD out so data migrates away
ceph osd out osd.<id>
# Wait for rebalancing to complete
ceph -w
# Remove the OSD daemon
# (ceph orch osd rm takes the numeric OSD ID, e.g. 3 rather than osd.3)
ceph orch osd rm <id>
# Verify removal
ceph osd tree
Replacing a Failed Disk#
# Destroy the old OSD (preserves ID for replacement)
ceph orch osd rm <id> --replace
# Insert new disk, then redeploy
ceph orch daemon add osd <hostname>:/dev/<new-device>
10. Capacity Planning#
Sizing Guidelines#
| Component | Recommendation |
|---|---|
| MON | 2 GiB RAM minimum; SSD or NVMe for MON data |
| OSD (HDD) | 5 GiB RAM per OSD daemon |
| OSD (NVMe) | 5-7 GiB RAM per OSD daemon |
| MGR | 1 GiB RAM minimum |
| MDS | 1-4 GiB RAM per active MDS (scales with metadata load) |
| Network | 10 GbE minimum for production; 25 GbE recommended |
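As a rough illustration of how the table translates into a per-host figure, the sketch below sums the daemon allowances for one machine. The helper name is hypothetical, it takes the upper bound (7 GiB) for NVMe OSDs, and it leaves out MDS daemons and operating system overhead, which you would need to add on top.

```shell
#!/bin/sh
# Rough per-host RAM estimate in GiB from the sizing table (hypothetical helper).
# Args: HDD OSD count, NVMe OSD count, MON count, MGR count on this host.
host_ram_gib() {
  hdd=$1 nvme=$2 mon=$3 mgr=$4
  # 5 GiB per HDD OSD, 7 GiB per NVMe OSD, 2 GiB per MON, 1 GiB per MGR.
  echo $(( hdd * 5 + nvme * 7 + mon * 2 + mgr * 1 ))
}

host_ram_gib 12 0 1 1   # 12 HDD OSDs plus a co-located MON and MGR -> prints 63
```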
Calculating Usable Space#
For a replicated pool with size=3:
Usable = Raw Capacity / Replication Factor
Example: 12 x 8 TB HDDs = 96 TB raw / 3 = 32 TB usable
For an erasure-coded pool with k=4, m=2:
Usable = Raw Capacity x (k / (k + m))
Example: 96 TB raw x (4/6) = 64 TB usable
PG Count#
The recommended PG count per pool depends on the number of OSDs:
# Let Ceph auto-tune PG count (autoscaler available since Nautilus; on by default for new pools since Octopus)
ceph osd pool set <pool-name> pg_autoscale_mode on
Manual calculation:
Total PGs = (OSDs x 100) / Replication Factor
Round up to the nearest power of 2.
11. Monitoring and Alerting#
Cluster Health#
# Quick status overview
ceph -s
# Detailed health messages
ceph health detail
# Watch cluster events in real time
ceph -w
Ceph Dashboard#
The built-in dashboard provides a web UI for monitoring and basic management:
# Enable the dashboard module
ceph mgr module enable dashboard
# Create a self-signed certificate
ceph dashboard create-self-signed-cert
# Set or reset admin password
ceph dashboard ac-user-set-password admin -i <password-file>
Access at https://<mgr-host>:8443.
Prometheus Integration#
# Enable the Prometheus module
ceph mgr module enable prometheus
# Metrics are exposed at http://<mgr-host>:9283/metrics
Recommended Prometheus alert rules:
| Alert | Condition | Severity |
|---|---|---|
| Cluster health not OK | ceph_health_status != 0 | Warning |
| OSD down | ceph_osd_up == 0 for any OSD | Critical |
| Near-full OSD | ceph_osd_stat_bytes_used / ceph_osd_stat_bytes > 0.85 | Warning |
| PGs not active+clean | ceph_pg_active < ceph_pg_total for >5 min | Warning |
| Pool near full | ceph_pool_bytes_used / ceph_pool_max_avail > 0.80 | Critical |
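Before wiring these conditions into Prometheus, it can help to check one of them by hand against the raw metrics endpoint. The sketch below implements the "OSD down" condition as a filter over the text exposition format. The helper name is hypothetical, and the sample line format (a `ceph_daemon` label on `ceph_osd_up`) is an assumption about what the module emits; check the output on your own cluster.

```shell
#!/bin/sh
# Hypothetical filter: read Prometheus text-format metrics on stdin and
# print the ceph_daemon label of every OSD whose ceph_osd_up value is 0.
# Assumed line shape: ceph_osd_up{ceph_daemon="osd.1"} 0.0
down_osds() {
  awk '
    /^ceph_osd_up\{/ && $NF + 0 == 0 {
      if (match($0, /ceph_daemon="[^"]*"/)) {
        # Skip the 13-character prefix ceph_daemon=" and drop the closing quote.
        print substr($0, RSTART + 13, RLENGTH - 14)
      }
    }
  '
}

# Usage against a live manager (placeholder host, as elsewhere in this page):
#   curl -s http://<mgr-host>:9283/metrics | down_osds
```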
Email Alerts#
# Enable the built-in alert module
ceph mgr module enable alerts
ceph config set mgr mgr/alerts/smtp_host <smtp-server>
ceph config set mgr mgr/alerts/smtp_destination <email>
ceph config set mgr mgr/alerts/smtp_sender ceph@<domain>
12. Troubleshooting#
| Issue | Cause | Solution |
|---|---|---|
| HEALTH_WARN: clock skew detected | NTP not synchronized across nodes | Install and configure chrony or ntpd on all nodes; verify with ceph time-sync-status |
| HEALTH_ERR: X osds down | Disk failure, network issue, or OSD process crash | Check systemctl status ceph-osd@<id> (on cephadm deployments the unit is ceph-<fsid>@osd.<id>), OSD logs in /var/log/ceph/, and network connectivity |
| HEALTH_WARN: too few PGs per OSD | PG count too low for the cluster size | Increase PG count or enable pg_autoscale_mode on for affected pools |
| PGs stuck in degraded or undersized | Not enough OSDs to satisfy replication | Add OSDs, or temporarily reduce pool min_size (not recommended for production) |
| PGs stuck in remapped | Data migration after topology change | Wait for rebalancing; monitor with ceph pg stat |
| MON quorum lost | Majority of monitors unreachable | Restore network, restart MON services; if unrecoverable, rebuild from surviving MON |
| Slow requests / blocked ops | Slow OSD disk, network congestion | Check ceph daemon osd.<id> dump_ops_in_flight, review disk I/O with iostat |
| HEALTH_WARN: nearfull osd(s) | OSD approaching 85% capacity | Add capacity, rebalance with CRUSH weight adjustments, or delete unused data |