# Configuration Schema Design

## Overview
The configuration system uses YAML files organized in a hierarchical structure. Configuration is split across three levels:
- **Cluster-level config**: Global settings, network topology, service defaults
- **Service-level config**: Service-specific settings (one file per service under `services/`)
- **Node-level config**: Per-node settings, roles, and service overrides

## Directory Structure
```
configs/
├── cluster.yaml              # Cluster-wide configuration
├── services/                 # Service-specific configurations
│   ├── kubernetes.yaml
│   ├── ceph.yaml
│   ├── kafka.yaml
│   ├── mqtt.yaml
│   └── dns.yaml
└── nodes/                    # Per-node configurations
    ├── master-01.yaml
    ├── worker-01.yaml
    ├── kafka-01.yaml
    └── ...
```

## Cluster Configuration (cluster.yaml)

```yaml
cluster:
  name: "production-cluster"
  domain: "cluster.local"

network:
  pod_cidr: "10.244.0.0/16"
  service_cidr: "10.96.0.0/12"
  dns_servers:
    - "10.96.0.10"

nodes:
  # List of all nodes in the cluster
  - name: "master-01"
    hostname: "master-01.cluster.local"
    ip: "192.168.1.10"
    roles: ["master", "control-plane"]

  - name: "worker-01"
    hostname: "worker-01.cluster.local"
    ip: "192.168.1.20"
    roles: ["worker"]

  - name: "kafka-01"
    hostname: "kafka-01.cluster.local"
    ip: "192.168.1.30"
    roles: ["worker", "kafka-broker"]

  - name: "ceph-01"
    hostname: "ceph-01.cluster.local"
    ip: "192.168.1.40"
    roles: ["worker", "ceph-osd", "ceph-mon"]

services:
  # Which services are enabled cluster-wide
  enabled:
    - kubernetes
    - ceph
    - kafka
    - mqtt
    - dns
```

## Node Configuration (nodes/{node-name}.yaml)

```yaml
node:
  name: "master-01"
  roles:
    - "master"
    - "control-plane"

  # Node-specific overrides
  hostname: "master-01.cluster.local"
  ip: "192.168.1.10"

  # Hardware/resource hints
  resources:
    cpu_cores: 8
    memory_gb: 32
    storage_gb: 500

# Services to run on this node
services:
  kubernetes:
    enabled: true
    type: "master"
    components:
      - "kube-apiserver"
      - "kube-controller-manager"
      - "kube-scheduler"
      - "etcd"

  ceph:
    enabled: false

  kafka:
    enabled: false

  mqtt:
    enabled: false

  dns:
    enabled: true
    type: "coredns"
```
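
The Overview describes cluster-level service defaults and node-level service overrides, but this document does not spell out how the two are combined. The sketch below shows one plausible approach, a recursive merge in which node values win. The function name `merge_config`, the merge semantics, and the `config.port` default are assumptions for illustration, not the project's confirmed behavior.

```python
from copy import deepcopy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a copy of defaults with overrides applied.

    Nested mappings are merged key by key; scalars and lists in
    overrides replace the default value outright (an assumption).
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical cluster-level defaults for the dns service.
dns_defaults = {"enabled": False, "type": "coredns", "config": {"port": 53}}
# Node-level override, similar to the master-01 example above.
dns_override = {"enabled": True}

effective = merge_config(dns_defaults, dns_override)
print(effective)
```

Replacing lists wholesale, rather than concatenating them, keeps a node file able to remove a default entry; a real implementation may choose differently.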

## Service Configuration (services/kubernetes.yaml)

```yaml
service:
  name: "kubernetes"
  version: "1.28"

# Service-specific configuration
config:
  api_server:
    port: 6443
    bind_address: "0.0.0.0"

  kubelet:
    cgroup_driver: "systemd"
    container_runtime: "containerd"

  network_plugin: "calico"

  feature_gates:
    # Example entry; gate names must be recognized by the configured Kubernetes version
    - "EphemeralContainers=true"

# Systemd unit configuration
systemd:
  unit_file: "kubelet.service"
  wants:
    - "containerd.service"
  after:
    - "containerd.service"
    - "network-online.target"
```
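
As an illustration of how the `systemd` section might be consumed, the sketch below turns its `wants` and `after` lists into the `[Unit]` section of a unit file. `Wants=` and `After=` are standard systemd directives that accept space-separated unit names. The function name `render_unit_section` is hypothetical, and the project's generator scripts may work differently.

```python
def render_unit_section(systemd_cfg: dict) -> str:
    """Render Wants=/After= lines for a [Unit] section from the parsed YAML."""
    lines = ["[Unit]"]
    wants = systemd_cfg.get("wants", [])
    after = systemd_cfg.get("after", [])
    if wants:
        lines.append("Wants=" + " ".join(wants))
    if after:
        lines.append("After=" + " ".join(after))
    return "\n".join(lines) + "\n"


# The systemd mapping from services/kubernetes.yaml, as already-parsed data.
kubelet_systemd = {
    "unit_file": "kubelet.service",
    "wants": ["containerd.service"],
    "after": ["containerd.service", "network-online.target"],
}

unit_text = render_unit_section(kubelet_systemd)
print(unit_text)
```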

## Role Definitions

### Predefined Roles
- **master**: Kubernetes control plane node
- **worker**: Kubernetes worker node
- **kafka-broker**: Kafka message broker
- **kafka-controller**: Kafka controller (KRaft mode)
- **ceph-mon**: Ceph monitor daemon
- **ceph-osd**: Ceph object storage daemon
- **ceph-mds**: Ceph metadata server
- **mqtt-broker**: MQTT message broker
- **dns-server**: DNS server

### Custom Roles
Users can define custom roles by creating role definition files in the `roles/` directory.

## Configuration Validation Rules

1. Each node must have at least one role
2. At least one node must have the "master" role
3. Service configurations must match enabled services
4. IP addresses must be unique across nodes
5. Node names must be valid DNS names
6. Required service dependencies must be met
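
A minimal sketch of how rules 1, 2, 4, and 5 could be checked against an already-parsed `cluster.yaml` follows. It is not the code in `tools/validate-config.py`. The function name `validate_cluster` and the reading of "valid DNS name" as a single lowercase RFC 1123 label are assumptions. Rules 3 and 6 are left out because they depend on service dependency data not shown in this document.

```python
import ipaddress
import re

# Assumption: a node name is one lowercase RFC 1123 label (1-63 characters).
DNS_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def validate_cluster(cluster: dict) -> list:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    nodes = cluster.get("nodes") or []
    seen_ips = {}  # normalized IP -> name of the first node using it
    for node in nodes:
        name = str(node.get("name", ""))
        roles = node.get("roles") or []
        if not DNS_LABEL.fullmatch(name):  # rule 5
            errors.append(f"node name {name!r} is not a valid DNS label")
        if not roles:  # rule 1
            errors.append(f"node {name!r} has no roles")
        try:
            ip = ipaddress.ip_address(str(node.get("ip", "")))
        except ValueError:
            errors.append(f"node {name!r} has an invalid ip")
            continue
        if ip in seen_ips:  # rule 4
            errors.append(f"ip {ip} is used by {seen_ips[ip]!r} and {name!r}")
        else:
            seen_ips[ip] = name
    if not any("master" in (n.get("roles") or []) for n in nodes):  # rule 2
        errors.append("no node has the 'master' role")
    return errors


good = {"nodes": [
    {"name": "master-01", "ip": "192.168.1.10", "roles": ["master"]},
    {"name": "worker-01", "ip": "192.168.1.20", "roles": ["worker"]},
]}
bad = {"nodes": [
    {"name": "Worker_01", "ip": "192.168.1.20", "roles": []},
    {"name": "worker-02", "ip": "192.168.1.20", "roles": ["worker"]},
]}

print(validate_cluster(good))
for message in validate_cluster(bad):
    print(message)
```

Collecting every error instead of stopping at the first one lets an operator fix a whole configuration in a single pass.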

## Single-ISO Deployment Model

This system uses a **single bootable ISO** that can be installed on any node in the cluster. Node identity is detected automatically at first boot.

### ISO Contents
The ISO contains configurations for the **entire cluster**:
```
/etc/cluster-config/
├── cluster.yaml              # Full cluster topology (all nodes)
├── services/                 # All service configs
│   ├── kubernetes.yaml
│   ├── ceph.yaml
│   ├── kafka.yaml
│   ├── mqtt.yaml
│   └── dns.yaml
└── nodes/                    # Configs for every node in cluster
    ├── master-01.yaml
    ├── worker-01.yaml
    ├── kafka-01.yaml
    ├── storage-01.yaml
    └── ...
```
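
Because the ISO must carry a file for every service and node, an ISO build step could derive the expected file list from `cluster.yaml`. The sketch below assumes that each entry in `services.enabled` and each entry in `nodes` maps to a file of the same name. The function name `expected_files` is hypothetical.

```python
from pathlib import PurePosixPath


def expected_files(cluster: dict, root: str = "/etc/cluster-config") -> list:
    """List the config paths the ISO is expected to contain for this cluster."""
    base = PurePosixPath(root)
    paths = [base / "cluster.yaml"]
    # One service file per enabled service (assumed naming: <service>.yaml).
    paths += [base / "services" / f"{svc}.yaml"
              for svc in cluster["services"]["enabled"]]
    # One node file per node listed in the cluster topology.
    paths += [base / "nodes" / f"{node['name']}.yaml"
              for node in cluster["nodes"]]
    return [str(p) for p in paths]


cluster = {
    "services": {"enabled": ["kubernetes", "dns"]},
    "nodes": [{"name": "master-01"}, {"name": "worker-01"}],
}

for path in expected_files(cluster):
    print(path)
```

Comparing this list with the files actually present would catch a node that appears in `cluster.yaml` but has no matching file under `nodes/`.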

### Boot-time Configuration Resolution (First Boot)

1. **System boots** from the ISO
2. **Very early in boot**: `cluster-detect.service` starts (before other services)
3. **Node detection** (`cluster-detect.sh`):
   - Try to identify node by **MAC address** (compare against `hardware.mac_addresses` in node configs)
   - Fall back to **IP address** detection (if the node has a static IP or a DHCP reservation)
   - Fall back to **hostname** detection
   - Final fallback: **interactive prompt** on the console asking the user to select the node identity
4. **Once identified**:
   - Create symlink: `/etc/cluster-config/current-node.yaml` → `/etc/cluster-config/nodes/{detected-node}.yaml`
   - Write `/etc/cluster-config/node-identity` containing the node name
5. **Role activation** (`cluster-activate-roles.sh`):
   - Read roles from `current-node.yaml`
   - Map roles to systemd targets:
     - `master` → `kubernetes-master.target`
     - `worker` → `kubernetes-worker.target`
     - `kafka-broker` → `kafka.target`
     - `ceph-osd` → `ceph-osd.target`
     - etc.
   - Enable and start appropriate targets
6. **Service startup**:
   - Systemd targets pull in their service units
   - Services read configs from `/etc/cluster-config/services/` and `/etc/cluster-config/current-node.yaml`
   - Services start in dependency order
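
The detection fallback chain in step 3 and the role mapping in step 5 can be sketched as follows. The real logic lives in the shell scripts `cluster-detect.sh` and `cluster-activate-roles.sh`; this Python version only illustrates the order of checks. The function names, the top-level placement of `hardware.mac_addresses` in a node file, and the sample MAC address are assumptions. Only the four mappings listed above are included, since the remaining ones are elided in this document.

```python
from typing import Optional

# Role-to-target mapping as listed in step 5; further entries are not shown there.
ROLE_TARGETS = {
    "master": "kubernetes-master.target",
    "worker": "kubernetes-worker.target",
    "kafka-broker": "kafka.target",
    "ceph-osd": "ceph-osd.target",
}


def detect_node(nodes: dict, macs: set, ips: set, hostname: str) -> Optional[str]:
    """Match the local machine against parsed node files: MAC, then IP, then hostname."""
    local_macs = {m.lower() for m in macs}
    for name, cfg in nodes.items():
        known = cfg.get("hardware", {}).get("mac_addresses", [])
        if local_macs & {m.lower() for m in known}:
            return name
    for name, cfg in nodes.items():
        if cfg["node"].get("ip") in ips:
            return name
    for name, cfg in nodes.items():
        if cfg["node"].get("hostname") == hostname:
            return name
    return None  # the caller would fall back to the interactive console prompt


def targets_for(roles: list) -> list:
    """Map roles to systemd targets, skipping roles with no known target."""
    return [ROLE_TARGETS[r] for r in roles if r in ROLE_TARGETS]


nodes = {
    "master-01": {
        "node": {"name": "master-01", "hostname": "master-01.cluster.local",
                 "ip": "192.168.1.10", "roles": ["master", "control-plane"]},
        "hardware": {"mac_addresses": ["52:54:00:AA:BB:01"]},  # made-up MAC
    },
    "kafka-01": {
        "node": {"name": "kafka-01", "hostname": "kafka-01.cluster.local",
                 "ip": "192.168.1.30", "roles": ["worker", "kafka-broker"]},
    },
}

name = detect_node(nodes, macs={"52:54:00:aa:bb:01"}, ips=set(), hostname="localhost")
print(name, targets_for(nodes[name]["node"]["roles"]))
```

MAC addresses are compared case-insensitively because tools report them in differing cases.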

### Normal Boot (Subsequent Boots)

1. System boots
2. `cluster-detect.service` runs and finds the existing `/etc/cluster-config/node-identity` file
3. It skips detection and proceeds to activate the saved roles
4. Services start normally based on persisted systemd target enablement
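
The difference between first boot and later boots comes down to whether the `node-identity` file exists. The sketch below shows that check together with the symlink and identity file written on first boot. The function name `resolve_identity` and the `detect` callback are hypothetical, and the demo runs in a temporary directory instead of `/etc/cluster-config`.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


def resolve_identity(config_dir: Path,
                     detect: Callable[[], Optional[str]]) -> Optional[str]:
    """Return the node name, running detection only if none is persisted."""
    identity_file = config_dir / "node-identity"
    if identity_file.exists():
        # Subsequent boot: reuse the saved identity and skip detection.
        return identity_file.read_text().strip()
    name = detect()
    if name is None:
        return None
    link = config_dir / "current-node.yaml"
    if link.is_symlink() or link.exists():
        link.unlink()  # replace a stale link from an interrupted first boot
    os.symlink(config_dir / "nodes" / f"{name}.yaml", link)
    identity_file.write_text(name + "\n")
    return name


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp)
    (cfg / "nodes").mkdir()
    (cfg / "nodes" / "master-01.yaml").write_text("node:\n  name: master-01\n")
    first = resolve_identity(cfg, lambda: "master-01")   # first boot: detects
    second = resolve_identity(cfg, lambda: None)         # later boot: detection unused
    link_target = os.readlink(cfg / "current-node.yaml")

print(first, second)
```

Writing the identity file last means an interrupted first boot is retried in full on the next start.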

## Implementation Status

- ✅ Configuration schema defined
- ✅ Configuration validator tool (`tools/validate-config.py`)
- ✅ Node detection script (`tools/cluster-detect.sh`)
- ✅ Role activation script (`tools/cluster-activate-roles.sh`)
- ✅ Environment file generator (`tools/generate-environment-files.sh`)
- ✅ Systemd service units and targets (19 units total)
- ✅ Service unit files (containerd, kubelet, kube-apiserver, etcd, kafka, ceph, mqtt, coredns)
- ✅ Service configuration generators (8 scripts)
- ⏳ Certificate/key generation (Kubernetes PKI, Ceph keys)
- ⏳ Network configuration on boot
- ⏳ ISO builder tool
- ⏳ Cluster bootstrapping (multi-master, join tokens)

See [IMPLEMENTATION.md](IMPLEMENTATION.md) for a complete architecture overview.