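feat: production deployment setup (#91)

- docker-compose.override.yml: production override (resource limits, log driver, security hardening, read-only filesystem)
- nginx/nginx.conf: production Nginx config (HTTPS, HTTP/2, security headers, rate limiting, WebSocket, Gzip)
- nginx/certbot-renew.sh: Let's Encrypt certificate auto-renewal script
- backup/backup-db.sh: database backup script (daily/weekly/monthly retention, compression, optional encryption, S3 upload, WeCom notification)
- backup/restore-db.sh: database restore script (restore by file, by date, or from the latest backup)
- monitoring/prometheus.yml: Prometheus scrape config (all microservices + infrastructure)
- monitoring/alert_rules.yml: alert rules (CPU > 80%, memory > 85%, disk > 90%, service down, etc.)
- monitoring/docker-compose.monitoring.yml: monitoring stack (Prometheus + Grafana + Node Exporter + cAdvisor + AlertManager)
- monitoring/alertmanager.yml: AlertManager routing and WeCom notification
- monitoring/grafana/provisioning/datasources.yml: Grafana datasource provisioning
- logging/docker-compose.logging.yml: logging stack (Loki + Promtail, lightweight option)
- logging/loki-config.yml: Loki storage config (30-day retention)
- logging/promtail-config.yml: Promtail collection config (Docker containers, Nginx, system logs)
- server-setup.sh: server initialization script (system update, Docker, firewall, SSH hardening, Fail2Ban, kernel tuning)
- README.md: full production deployment guide (server requirements, deployment steps, monitoring, backup, operations runbook, troubleshooting)
- .env.production.example: production environment variable template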
bot_dev2 3 days ago
parent
commit
35272c3fd0

+ 53
- 0
deploy/production/.env.production.example

@@ -0,0 +1,53 @@
1
+# ============================================================
2
+# Production environment configuration
3
+# Copy this file to .env and replace every placeholder
4
+# ============================================================
5
+
6
+# ==================== Domain ====================
7
+DOMAIN=water.example.com
8
+CERTBOT_EMAIL=admin@example.com
9
+
10
+# ==================== Database ====================
11
+POSTGRES_DB=water_management
12
+POSTGRES_USER=water
13
+POSTGRES_PASSWORD=<set a strong password, at least 16 characters>
14
+
15
+# ==================== Redis ====================
16
+REDIS_PASSWORD=<set a strong password, at least 16 characters>
17
+
18
+# ==================== EMQX MQTT ====================
19
+EMQX_ADMIN_USER=admin
20
+EMQX_ADMIN_PASSWORD=<set a strong password>
21
+
22
+# ==================== MinIO object storage ====================
23
+MINIO_USER=minioadmin
24
+MINIO_PASSWORD=<set a strong password>
25
+
26
+# ==================== GeoServer GIS ====================
27
+GEOSERVER_USER=admin
28
+GEOSERVER_PASSWORD=<set a strong password>
29
+
30
+# ==================== Image settings ====================
31
+REGISTRY=
32
+IMAGE_TAG=latest
33
+
34
+# ==================== Spring settings ====================
35
+SPRING_PROFILES=prod
36
+
37
+# ==================== Monitoring ====================
38
+GRAFANA_ADMIN_USER=admin
39
+GRAFANA_ADMIN_PASSWORD=<set a strong password>
40
+GRAFANA_ROOT_URL=https://grafana.water.example.com
41
+
42
+# ==================== Backup ====================
43
+S3_ENABLED=false
44
+S3_BUCKET=water-backups
45
+S3_ENDPOINT=http://minio:9000
46
+S3_ACCESS_KEY=minioadmin
47
+S3_SECRET_KEY=<set a strong password>
48
+ENCRYPT_ENABLED=false
49
+ENCRYPT_PASSPHRASE=<encryption passphrase>
50
+WECOM_WEBHOOK=
51
+
52
+# ==================== Kafka ====================
53
+KAFKA_ADVERTISED=kafka
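
Reviewer note: the template asks the operator to replace every `<...>` placeholder before use. A minimal sketch of a pre-flight check for that follows. The function name `check_env_placeholders` is hypothetical and not part of this commit; it assumes placeholders always take the form `KEY=<...>` as in the template above.

```shell
#!/usr/bin/env bash
# Sketch: list variables in an env file whose value is still a <...> placeholder.
# check_env_placeholders is a hypothetical helper, not part of this commit.
check_env_placeholders() {
    local env_file="$1"
    local pending
    # Match KEY=<anything> and keep only the key name.
    # grep exits 1 when nothing matches, so the result is checked via $pending instead.
    pending=$(grep -E '^[A-Za-z_][A-Za-z0-9_]*=<.*>[[:space:]]*$' "$env_file" | cut -d= -f1) || true
    if [ -n "$pending" ]; then
        echo "$pending"
        return 1
    fi
    return 0
}
```

Running `check_env_placeholders .env` before `docker compose up` would print the unset keys and return non-zero, so a deploy script could stop early instead of starting services with placeholder passwords.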

+ 583
- 0
deploy/production/README.md

@@ -0,0 +1,583 @@
1
+# 生产环境部署指南
2
+
3
+> 智慧水务管理系统 - 生产环境完整部署文档
4
+
5
+---
6
+
7
+## 目录
8
+
9
+1. [服务器配置要求](#服务器配置要求)
10
+2. [前置准备](#前置准备)
11
+3. [服务器初始化](#服务器初始化)
12
+4. [应用部署](#应用部署)
13
+5. [HTTPS 配置](#https-配置)
14
+6. [监控告警](#监控告警)
15
+7. [日志收集](#日志收集)
16
+8. [数据库备份策略](#数据库备份策略)
17
+9. [运维手册](#运维手册)
18
+10. [故障排查](#故障排查)
19
+
20
+---
21
+
22
+## 服务器配置要求
23
+
24
+### 最低配置(小规模部署,≤500 设备)
25
+
26
+| 配置项 | 要求 |
27
+|--------|------|
28
+| CPU | 4 核 |
29
+| 内存 | 8 GB |
30
+| 系统盘 | 100 GB SSD |
31
+| 数据盘 | 200 GB SSD |
32
+| 网络 | 10 Mbps |
33
+| 系统 | Ubuntu 22.04 LTS / CentOS 8+ |
34
+
35
+### 推荐配置(中等规模,500-5000 设备)
36
+
37
+| 配置项 | 要求 |
38
+|--------|------|
39
+| CPU | 8 核 |
40
+| 内存 | 16 GB |
41
+| 系统盘 | 100 GB SSD |
42
+| 数据盘 | 500 GB SSD |
43
+| 网络 | 100 Mbps |
44
+| 系统 | Ubuntu 22.04 LTS |
45
+
46
+### 高可用配置(大规模,>5000 设备)
47
+
48
+| 配置项 | 要求 |
49
+|--------|------|
50
+| 应用服务器 | 2+ 台,8C16G,负载均衡 |
51
+| 数据库 | 主从架构,8C32G |
52
+| Redis | 哨兵/集群模式 |
53
+| Kafka | 3 节点集群 |
54
+| 存储 | MinIO 分布式或云存储 |
55
+
56
+---
57
+
58
+## 前置准备
59
+
60
+### 1. 域名配置
61
+
62
+- 准备主域名(如 `water.example.com`)用于 Web 访问
63
+- 准备 MQTT 域名(如 `mqtt.example.com`)用于设备接入
64
+- 配置 DNS A 记录指向服务器 IP
65
+
66
+### 2. 环境变量
67
+
68
+复制并编辑 `.env` 文件:
69
+
70
+```bash
71
+cp deploy/production/.env.production.example .env
72
+```
73
+
74
+**必须修改的配置:**
75
+
76
+```bash
77
+# 数据库
78
+POSTGRES_DB=water_management
79
+POSTGRES_USER=water
80
+POSTGRES_PASSWORD=<强密码>
81
+
82
+# Redis
83
+REDIS_PASSWORD=<强密码>
84
+
85
+# EMQX MQTT
86
+EMQX_ADMIN_USER=admin
87
+EMQX_ADMIN_PASSWORD=<强密码>
88
+
89
+# MinIO
90
+MINIO_USER=minioadmin
91
+MINIO_PASSWORD=<强密码>
92
+
93
+# 镜像标签
94
+IMAGE_TAG=latest
95
+
96
+# 域名
97
+DOMAIN=water.example.com
98
+```
99
+
100
+### 3. SSL 证书
101
+
102
+准备 Let's Encrypt 证书(服务器初始化后可自动申请)。
103
+
104
+---
105
+
106
+## 服务器初始化
107
+
108
+### 步骤 1: 执行初始化脚本
109
+
110
+```bash
111
+# 以 root 或 sudo 执行
112
+sudo chmod +x deploy/production/server-setup.sh
113
+sudo deploy/production/server-setup.sh
114
+```
115
+
116
+该脚本将自动完成:
117
+- ✅ 系统包更新
118
+- ✅ Docker & Docker Compose 安装
119
+- ✅ 防火墙配置(仅开放 22/80/443/1883)
120
+- ✅ SSH 安全加固
121
+- ✅ Fail2Ban 防暴力破解
122
+- ✅ 创建部署用户 (`deploy`)
123
+- ✅ 创建目录结构
124
+- ✅ 内核参数优化
125
+- ✅ 定时任务配置
126
+
127
+### 步骤 2: 配置 SSH 公钥
128
+
129
+```bash
130
+# 在本地机器执行
131
+ssh-copy-id deploy@your-server-ip
132
+```
133
+
134
+### 步骤 3: 上传代码
135
+
136
+```bash
137
+# 在服务器执行
138
+sudo -u deploy git clone http://git.xayunmei.com/bot_ym/water-management-system.git /opt/water-management
139
+cd /opt/water-management
140
+cp deploy/production/.env.production.example .env
141
+vim .env  # 编辑环境变量
142
+```
143
+
144
+---
145
+
146
+## 应用部署
147
+
148
+### 首次部署
149
+
150
+```bash
151
+cd /opt/water-management
152
+
153
+# 构建镜像(如使用远程镜像仓库可跳过)
154
+docker compose build
155
+
156
+# 启动所有服务
157
+docker compose \
158
+  -f docker-compose.yml \
159
+  -f deploy/production/docker-compose.override.yml \
160
+  up -d
161
+
162
+# 查看服务状态
163
+docker compose -f docker-compose.yml -f deploy/production/docker-compose.override.yml ps
164
+
165
+# 查看日志
166
+docker compose logs -f
167
+```
168
+
169
+### 更新部署
170
+
171
+```bash
172
+cd /opt/water-management
173
+
174
+# 拉取最新代码
175
+git pull origin master
176
+
177
+# 拉取最新镜像
178
+docker compose -f docker-compose.yml -f deploy/production/docker-compose.override.yml pull
179
+
180
+# 滚动更新
181
+docker compose -f docker-compose.yml -f deploy/production/docker-compose.override.yml up -d --remove-orphans
182
+
183
+# 清理旧镜像
184
+docker image prune -f
185
+```
186
+
187
+### 使用 CI/CD 自动部署
188
+
189
+```bash
190
+# 通过 scripts/deploy.sh 脚本
191
+./scripts/deploy.sh \
192
+  --env production \
193
+  --host your-server-ip \
194
+  --user deploy \
195
+  --tag latest
196
+```
197
+
198
+---
199
+
200
+## HTTPS 配置
201
+
202
+### 首次申请证书
203
+
204
+```bash
205
+# 安装 certbot
206
+sudo apt install certbot
207
+
208
+# 确保 Nginx 已启动且 80 端口可访问
209
+# 申请证书(替换域名和邮箱)
210
+sudo certbot certonly --webroot \
211
+  -w /var/www/certbot \
212
+  -d water.example.com \
213
+  --email admin@example.com \
214
+  --agree-tos \
215
+  --no-eff-email
216
+```
217
+
218
+### 自动续期
219
+
220
+证书续期脚本已配置为定时任务(每天凌晨 3 点检查),也可手动执行:
221
+
222
+```bash
223
+sudo deploy/production/nginx/certbot-renew.sh
224
+```
225
+
226
+手动续期:
227
+
228
+```bash
229
+sudo certbot renew --dry-run
230
+sudo certbot renew
231
+```
232
+
233
+### 验证 HTTPS
234
+
235
+```bash
236
+curl -I https://water.example.com
237
+# 应返回 HTTP/2 200 及 HSTS header
238
+```
239
+
240
+---
241
+
242
+## 监控告警
243
+
244
+### 部署监控栈
245
+
246
+```bash
247
+cd /opt/water-management
248
+
249
+# 启动监控组件
250
+docker compose \
251
+  -f deploy/production/monitoring/docker-compose.monitoring.yml \
252
+  up -d
253
+
254
+# 查看状态
255
+docker compose -f deploy/production/monitoring/docker-compose.monitoring.yml ps
256
+```
257
+
258
+### 组件说明
259
+
260
+| 组件 | 端口 | 用途 |
261
+|------|------|------|
262
+| Prometheus | 9090 | 指标收集与存储 |
263
+| Grafana | 3000 | 可视化仪表盘 |
264
+| Node Exporter | 9100 | 主机指标 |
265
+| cAdvisor | 8880 | 容器指标 |
266
+| AlertManager | 9093 | 告警管理 |
267
+| Postgres Exporter | 9187 | PostgreSQL 指标 |
268
+| Redis Exporter | 9121 | Redis 指标 |
269
+
270
+### 访问 Grafana
271
+
272
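+1. Open Grafana on port 3000. The port list in this document marks that port as local-only and the firewall opens only 22/80/443/1883, so use an SSH tunnel or an Nginx reverse proxy with HTTPS instead of `http://your-server-ip:3000`
273
+2. Sign in with `GRAFANA_ADMIN_USER` / `GRAFANA_ADMIN_PASSWORD` from `.env` if the monitoring stack passes them to Grafana; otherwise Grafana's built-in default is `admin` / `admin`, which must be changed on first login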
274
+3. 数据源已自动配置(Prometheus + PostgreSQL)
275
+
276
+### 告警规则
277
+
278
+已配置以下告警(详见 `monitoring/alert_rules.yml`):
279
+
280
+| 告警名称 | 触发条件 | 严重级别 |
281
+|----------|----------|----------|
282
+| HighCPUUsage | CPU > 80% 持续 5 分钟 | Warning |
283
+| CriticalCPUUsage | CPU > 95% 持续 2 分钟 | Critical |
284
+| HighMemoryUsage | 内存 > 85% 持续 5 分钟 | Warning |
285
+| HighDiskUsage | 磁盘 > 90% 持续 5 分钟 | Critical |
286
+| ServiceDown | 服务宕机 1 分钟 | Critical |
287
+| ServiceSlowResponse | 响应时间 > 5s 持续 5 分钟 | Warning |
288
+| HighErrorRate | 5xx 错误率 > 5% | Critical |
289
+| PostgresHighConnections | 连接数 > 80% | Warning |
290
+
291
+### 企业微信告警配置
292
+
293
+编辑 `monitoring/alertmanager.yml`,配置 Webhook:
294
+
295
+```yaml
296
+receivers:
297
+  - name: 'wecom'
298
+    webhook_configs:
299
+      - url: 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY'
300
+```
301
+
302
+---
303
+
304
+## 日志收集
305
+
306
+### 部署日志栈(Loki 轻量方案)
307
+
308
+```bash
309
+cd /opt/water-management
310
+
311
+docker compose \
312
+  -f deploy/production/logging/docker-compose.logging.yml \
313
+  up -d
314
+```
315
+
316
+### 组件说明
317
+
318
+| 组件 | 端口 | 用途 |
319
+|------|------|------|
320
+| Loki | 3100 | 日志聚合存储 |
321
+| Promtail | 9080 | 日志收集代理 |
322
+
323
+### 在 Grafana 中查看日志
324
+
325
+1. 打开 Grafana → Explore
326
+2. 选择 Loki 数据源
327
+3. 使用 LogQL 查询:
328
+   ```
329
+   {container="wm-gateway"} |= "error"
330
+   {service="revenue"} | json | level="ERROR"
331
+   ```
332
+
333
+### 日志保留策略
334
+
335
+- 日志保留 30 天
336
+- 自动清理过期日志
337
+
338
+---
339
+
340
+## 数据库备份策略
341
+
342
+### 备份方式
343
+
344
+| 类型 | 频率 | 保留时间 | 说明 |
345
+|------|------|----------|------|
346
+| 每日备份 | 每天凌晨 2 点 | 7 天 | 全量 pg_dump |
347
+| 每周备份 | 每周日凌晨 | 4 周 | 全量 pg_dump |
348
+| 每月备份 | 每月 1 号 | 12 个月 | 全量 pg_dump |
349
+
350
+### 手动备份
351
+
352
+```bash
353
+# 执行备份
354
+deploy/production/backup/backup-db.sh
355
+
356
+# 列出备份
357
+deploy/production/backup/restore-db.sh --list
358
+```
359
+
360
+### 数据库恢复
361
+
362
+```bash
363
+# 恢复最近备份
364
+deploy/production/backup/restore-db.sh --latest
365
+
366
+# 恢复指定日期
367
+deploy/production/backup/restore-db.sh --date 2026-06-15
368
+
369
+# 恢复指定文件
370
+deploy/production/backup/restore-db.sh --file /opt/water-management/backups/daily/wm_water_management_20260615_020000.sql.gz
371
+```
372
+
373
+### 远程备份(MinIO/S3)
374
+
375
+在 `.env` 中配置:
376
+
377
+```bash
378
+S3_ENABLED=true
379
+S3_BUCKET=water-backups
380
+S3_ENDPOINT=http://minio:9000
381
+S3_ACCESS_KEY=minioadmin
382
+S3_SECRET_KEY=<密码>
383
+```
384
+
385
+---
386
+
387
+## 运维手册
388
+
389
+### 常用命令
390
+
391
+```bash
392
+# ===== 服务管理 =====
393
+# 查看所有服务状态
394
+docker compose -f docker-compose.yml -f deploy/production/docker-compose.override.yml ps
395
+
396
+# 重启单个服务
397
+docker compose -f docker-compose.yml -f deploy/production/docker-compose.override.yml restart gateway
398
+
399
+# 查看服务日志
400
+docker compose logs -f --tail=100 gateway
401
+
402
+# 进入容器
403
+docker exec -it wm-gateway bash
404
+
405
+# ===== 数据库 =====
406
+# 连接数据库
407
+docker exec -it wm-postgres psql -U water -d water_management
408
+
409
+# 查看活跃连接
410
+docker exec -it wm-postgres psql -U water -c "SELECT count(*) FROM pg_stat_activity WHERE state='active';"
411
+
412
+# ===== Redis =====
413
+# 查看 Redis 信息
414
+docker exec -it wm-redis redis-cli -a <密码> info
415
+
416
+# ===== 资源监控 =====
417
+# 查看容器资源使用
418
+docker stats --no-stream
419
+
420
+# 查看磁盘使用
421
+df -h
422
+docker system df
423
+```
424
+
425
+### 扩缩容
426
+
427
+```bash
428
+# 水平扩展应用实例(需要负载均衡器)
429
+docker compose -f docker-compose.yml -f deploy/production/docker-compose.override.yml up -d --scale gateway=2
430
+
431
+# 修改资源限制
432
+# 编辑 deploy/production/docker-compose.override.yml 中的 deploy.resources
433
+```
434
+
435
+### 证书管理
436
+
437
+```bash
438
+# 查看证书过期时间
439
+sudo openssl x509 -enddate -noout -in /etc/letsencrypt/live/water.example.com/fullchain.pem
440
+
441
+# 手动续期
442
+sudo certbot renew
443
+
444
+# 重新申请
445
+sudo certbot certonly --webroot -w /var/www/certbot -d water.example.com --force-renewal
446
+```
447
+
448
+---
449
+
450
+## 故障排查
451
+
452
+### 服务无法启动
453
+
454
+```bash
455
+# 查看详细日志
456
+docker compose logs --tail=200 <service-name>
457
+
458
+# 检查端口冲突
459
+ss -tlnp | grep <port>
460
+
461
+# 检查资源
462
+docker stats --no-stream
463
+df -h
464
+free -h
465
+```
466
+
467
+### 数据库连接问题
468
+
469
+```bash
470
+# 检查 PostgreSQL 是否运行
471
+docker exec -it wm-postgres pg_isready
472
+
473
+# 检查连接数
474
+docker exec -it wm-postgres psql -U water -c "SELECT count(*) FROM pg_stat_activity;"
475
+
476
+# 检查慢查询
477
+docker exec -it wm-postgres psql -U water -c "SELECT pid, now()-query_start AS duration, query FROM pg_stat_activity WHERE state='active' ORDER BY duration DESC LIMIT 5;"
478
+```
479
+
480
+### 内存不足
481
+
482
+```bash
483
+# 查看内存使用
484
+docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
485
+
486
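+# Remove unused images and containers (frees disk, not RAM). --volumes is omitted on purpose: it can delete data volumes that no container currently references
487
+docker system prune -a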
488
+
489
+# 调整 JVM 内存参数
490
+# 编辑 docker-compose.override.yml 中的 JAVA_OPTS
491
+```
492
+
493
+### 磁盘空间不足
494
+
495
+```bash
496
+# 查看磁盘使用
497
+df -h
498
+du -sh /opt/water-management/*
499
+docker system df -v
500
+
501
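+# Remove old daily backups only; weekly and monthly backups are kept longer and are pruned by backup-db.sh
502
+find /opt/water-management/backups/daily -name "*.sql.gz*" -mtime +30 -delete
503
+
504
+# Clean up Docker (check `docker volume ls` first: volume prune can delete data in volumes no container references)
505
+docker system prune -f
506
+docker volume prune -f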
507
+```
508
+
509
+### MQTT 连接问题
510
+
511
+```bash
512
+# 检查 EMQX 状态
513
+docker exec -it wm-emqx emqx status
514
+
515
+# 查看 EMQX 日志
516
+docker logs wm-emqx
517
+
518
+# 测试 MQTT 连接
519
+mosquitto_pub -h localhost -p 1883 -t "test" -m "hello"
520
+```
521
+
522
+---
523
+
524
+## 附录
525
+
526
+### 目录结构
527
+
528
+```
529
+/opt/water-management/
530
+├── .env                           # 环境变量
531
+├── docker-compose.yml             # 基础编排
532
+├── deploy/
533
+│   └── production/
534
+│       ├── docker-compose.override.yml  # 生产覆盖配置
535
+│       ├── nginx/
536
+│       │   ├── nginx.conf               # Nginx 配置
537
+│       │   └── certbot-renew.sh         # 证书续期
538
+│       ├── backup/
539
+│       │   ├── backup-db.sh             # 数据库备份
540
+│       │   └── restore-db.sh            # 数据库恢复
541
+│       ├── monitoring/
542
+│       │   ├── docker-compose.monitoring.yml
543
+│       │   ├── prometheus.yml
544
+│       │   ├── alert_rules.yml
545
+│       │   └── grafana/
546
+│       ├── logging/
547
+│       │   ├── docker-compose.logging.yml
548
+│       │   ├── loki-config.yml
549
+│       │   └── promtail-config.yml
550
+│       ├── server-setup.sh              # 服务器初始化
551
+│       └── README.md                    # 本文档
552
+├── backups/                       # 数据库备份
553
+│   ├── daily/
554
+│   ├── weekly/
555
+│   └── monthly/
556
+└── logs/                          # 应用日志
557
+```
558
+
559
+### 端口清单
560
+
561
+| 服务 | 端口 | 用途 |
562
+|------|------|------|
563
+| Nginx HTTP | 80 | Web 访问(跳转 HTTPS) |
564
+| Nginx HTTPS | 443 | Web 安全访问 |
565
+| MQTT | 1883 | 物联网设备接入 |
566
+| Gateway | 8080 | API 网关(内部) |
567
+| PostgreSQL | 5432 | 数据库(仅本地) |
568
+| Redis | 6379 | 缓存(仅本地) |
569
+| Grafana | 3000 | 监控面板(仅本地) |
570
+| Prometheus | 9090 | 指标查询(仅本地) |
571
+
572
+### 安全建议
573
+
574
+1. **定期更新**:每月执行系统更新和 Docker 镜像更新
575
+2. **密码管理**:使用强密码,定期轮换
576
+3. **访问控制**:生产环境仅通过 HTTPS 访问,内部端口不对外开放
577
+4. **备份验证**:每月至少恢复一次备份进行验证
578
+5. **日志审计**:定期检查访问日志和错误日志
579
+6. **证书监控**:确保证书自动续期正常
580
+
581
+---
582
+
583
+*文档版本: 1.0 | 最后更新: 2026-06-16*
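
Reviewer note: the README repeats the same two `-f` flags in almost every `docker compose` command, and a few commands (for example `docker compose logs -f`) omit them, so they run without the production override. A small wrapper would keep the invocations consistent. This is only a sketch: `dc` and `COMPOSE_CMD` are hypothetical names, and the wrapper assumes it is called from `/opt/water-management`.

```shell
#!/usr/bin/env bash
# Sketch: wrap the compose invocation used throughout the README.
# dc and COMPOSE_CMD are hypothetical names, not part of this commit.
COMPOSE_CMD="${COMPOSE_CMD:-docker compose}"

dc() {
    # $COMPOSE_CMD is left unquoted on purpose so "docker compose" splits into two words.
    $COMPOSE_CMD \
        -f docker-compose.yml \
        -f deploy/production/docker-compose.override.yml \
        "$@"
}
```

With this in the shell profile of the `deploy` user, `dc ps`, `dc logs -f --tail=100 gateway`, and `dc up -d --remove-orphans` all pick up the same file set.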

+ 248
- 0
deploy/production/backup/backup-db.sh

@@ -0,0 +1,248 @@
1
+#!/bin/bash
2
+# ============================================================
3
+# Database backup script
4
+# 
5
+# Features:
6
+#   - Full PostgreSQL backup (pg_dump)
7
+#   - Retention: daily (7 days) + weekly (4 weeks) + monthly (12 months)
8
+#   - Compression + optional encryption
9
+#   - Upload to MinIO/S3 or local archive
10
+#   - WeCom notification of the backup result
11
+#
12
+# Cron entry (runs daily at 02:00):
13
+#   0 2 * * * /opt/water-management/deploy/production/backup/backup-db.sh >> /var/log/wm-backup.log 2>&1
14
+# ============================================================
15
+set -euo pipefail
16
+
17
+# ==================== 配置 ====================
18
+# 数据库配置
19
+DB_HOST="${DB_HOST:-localhost}"
20
+DB_PORT="${DB_PORT:-5432}"
21
+DB_NAME="${POSTGRES_DB:-water_management}"
22
+DB_USER="${POSTGRES_USER:-water}"
23
+DB_PASSWORD="${POSTGRES_PASSWORD:-}"
24
+PGPASSWORD="${DB_PASSWORD}"
25
+export PGPASSWORD
26
+
27
+# 备份目录
28
+BACKUP_BASE="${BACKUP_DIR:-/opt/water-management/backups}"
29
+BACKUP_DAILY="${BACKUP_BASE}/daily"
30
+BACKUP_WEEKLY="${BACKUP_BASE}/weekly"
31
+BACKUP_MONTHLY="${BACKUP_BASE}/monthly"
32
+
33
+# 保留策略
34
+DAILY_RETENTION=7
35
+WEEKLY_RETENTION=4
36
+MONTHLY_RETENTION=12
37
+
38
+# MinIO/S3 配置(可选)
39
+S3_ENABLED="${S3_ENABLED:-false}"
40
+S3_BUCKET="${S3_BUCKET:-water-backups}"
41
+S3_ENDPOINT="${S3_ENDPOINT:-http://localhost:9000}"
42
+S3_ACCESS_KEY="${S3_ACCESS_KEY:-}"
43
+S3_SECRET_KEY="${S3_SECRET_KEY:-}"
44
+
45
+# 加密配置(可选)
46
+ENCRYPT_ENABLED="${ENCRYPT_ENABLED:-false}"
47
+ENCRYPT_PASSPHRASE="${ENCRYPT_PASSPHRASE:-}"
48
+
49
+# 企业微信 Webhook
50
+WECOM_WEBHOOK="${WECOM_WEBHOOK:-}"
51
+
52
+# ==================== 初始化 ====================
53
+TIMESTAMP=$(date +%Y%m%d_%H%M%S)
54
+DATE=$(date +%Y-%m-%d)
55
+DAY_OF_WEEK=$(date +%u)  # 1=Monday, 7=Sunday
56
+DAY_OF_MONTH=$(date +%d)
57
+
58
+log() {
59
+    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >&2  # stderr, so BACKUP_PATH=$(do_full_backup ...) captures only the path
60
+}
61
+
62
+# 创建备份目录
63
+mkdir -p "$BACKUP_DAILY" "$BACKUP_WEEKLY" "$BACKUP_MONTHLY"
64
+
65
+# ==================== 备份函数 ====================
66
+
67
+# 全量备份
68
+do_full_backup() {
69
+    local target_dir="$1"
70
+    local backup_name="wm_${DB_NAME}_${TIMESTAMP}.sql.gz"
71
+    local backup_path="${target_dir}/${backup_name}"
72
+    
73
+    log "📦 开始全量备份: ${DB_NAME}"
74
+    
75
+    # pg_dump 全量备份 + gzip 压缩
76
+    pg_dump \
77
+        -h "$DB_HOST" \
78
+        -p "$DB_PORT" \
79
+        -U "$DB_USER" \
80
+        -d "$DB_NAME" \
81
+        --format=plain \
82
+        --verbose \
83
+        --no-owner \
84
+        --no-privileges \
85
+        --clean \
86
+        --if-exists \
87
+        2>/dev/null | gzip -9 > "$backup_path"
88
+    
89
+    local size
90
+    size=$(du -sh "$backup_path" | cut -f1)
91
+    log "✅ 备份完成: ${backup_name} (${size})"
92
+    
93
+    # 可选加密
94
+    if [ "$ENCRYPT_ENABLED" = "true" ] && [ -n "$ENCRYPT_PASSPHRASE" ]; then
95
+        log "🔐 加密备份文件..."
96
+        openssl enc -aes-256-cbc -salt -pbkdf2 \
97
+            -in "$backup_path" \
98
+            -out "${backup_path}.enc" \
99
+            -pass "pass:${ENCRYPT_PASSPHRASE}"
100
+        rm "$backup_path"
101
+        backup_path="${backup_path}.enc"
102
+        log "✅ 加密完成"
103
+    fi
104
+    
105
+    # 上传到 S3/MinIO
106
+    if [ "$S3_ENABLED" = "true" ]; then
107
+        upload_to_s3 "$backup_path"
108
+    fi
109
+    
110
+    echo "$backup_path"
111
+}
112
+
113
+# 上传到 S3/MinIO
114
+upload_to_s3() {
115
+    local file="$1"
116
+    local filename
117
+    filename=$(basename "$file")
118
+    local s3_path="s3://${S3_BUCKET}/db/${DATE}/${filename}"
119
+    
120
+    log "☁️ 上传到 S3: ${s3_path}"
121
+    
122
+    # 使用 mc (MinIO Client) 或 aws cli
123
+    if command -v mc &>/dev/null; then
124
+        mc alias set wm-s3 "$S3_ENDPOINT" "$S3_ACCESS_KEY" "$S3_SECRET_KEY" 2>/dev/null
125
+        mc cp "$file" "wm-s3/${S3_BUCKET}/db/${DATE}/${filename}" 2>/dev/null
126
+    elif command -v aws &>/dev/null; then
127
+        aws --endpoint-url "$S3_ENDPOINT" s3 cp "$file" "$s3_path" 2>/dev/null
128
+    else
129
+        log "⚠️ 未找到 mc 或 aws 命令,跳过 S3 上传"
130
+        return 0
131
+    fi
132
+    
133
+    log "✅ S3 上传完成"
134
+}
135
+
136
+# ==================== 保留策略 ====================
137
+
138
+# 清理过期备份
139
+cleanup_old_backups() {
140
+    log "🧹 执行备份保留策略..."
141
+    
142
+    # Daily backups: keep the N most recent files (plain .sql.gz and encrypted .sql.gz.enc)
143
+    local daily_count
144
+    daily_count=$(find "$BACKUP_DAILY" -maxdepth 1 -type f -name '*.sql.gz*' | wc -l)
145
+    if [ "$daily_count" -gt "$DAILY_RETENTION" ]; then
146
+        local to_delete=$((daily_count - DAILY_RETENTION))
147
+        log "  Pruning daily backups: deleting the ${to_delete} oldest files"
148
+        ls -1t "$BACKUP_DAILY"/*.sql.gz* | tail -n "$to_delete" | xargs rm -f
149
+    fi
150
+    
151
+    # Weekly backups: keep the N most recent files
152
+    local weekly_count
153
+    weekly_count=$(find "$BACKUP_WEEKLY" -maxdepth 1 -type f -name '*.sql.gz*' | wc -l)
154
+    if [ "$weekly_count" -gt "$WEEKLY_RETENTION" ]; then
155
+        local to_delete=$((weekly_count - WEEKLY_RETENTION))
156
+        log "  Pruning weekly backups: deleting the ${to_delete} oldest files"
157
+        ls -1t "$BACKUP_WEEKLY"/*.sql.gz* | tail -n "$to_delete" | xargs rm -f
158
+    fi
159
+    
160
+    # Monthly backups: keep the N most recent files
161
+    local monthly_count
162
+    monthly_count=$(find "$BACKUP_MONTHLY" -maxdepth 1 -type f -name '*.sql.gz*' | wc -l)
163
+    if [ "$monthly_count" -gt "$MONTHLY_RETENTION" ]; then
164
+        local to_delete=$((monthly_count - MONTHLY_RETENTION))
165
+        log "  Pruning monthly backups: deleting the ${to_delete} oldest files"
166
+        ls -1t "$BACKUP_MONTHLY"/*.sql.gz* | tail -n "$to_delete" | xargs rm -f
167
+    fi
168
+    
169
+    log "✅ 保留策略执行完毕"
170
+}
171
+
172
+# ==================== 通知 ====================
173
+
174
+send_notification() {
175
+    local status="$1"
176
+    local message="$2"
177
+    local backup_path="$3"
178
+    
179
+    if [ -z "$WECOM_WEBHOOK" ]; then
180
+        log "⚠️ 未配置企业微信 Webhook,跳过通知"
181
+        return 0
182
+    fi
183
+    
184
+    local emoji="✅"
185
+    local color="info"
186
+    if [ "$status" = "failure" ]; then
187
+        emoji="❌"
188
+        color="warning"
189
+    fi
190
+    
191
+    local backup_size="N/A"
192
+    if [ -f "$backup_path" ]; then
193
+        backup_size=$(du -sh "$backup_path" | cut -f1)
194
+    fi
195
+    
196
+    local payload
197
+    payload=$(cat <<EOF
198
+{
199
+    "msgtype": "markdown",
200
+    "markdown": {
201
+        "content": "## ${emoji} 数据库备份通知\n\n**数据库:** ${DB_NAME}\n**状态:** <font color=\"${color}\">${status}</font>\n**大小:** ${backup_size}\n**时间:** $(date '+%Y-%m-%d %H:%M:%S')\n\n${message}"
202
+    }
203
+}
204
+EOF
205
+)
206
+    
207
+    curl -s -X POST "$WECOM_WEBHOOK" \
208
+        -H "Content-Type: application/json" \
209
+        -d "$payload" > /dev/null 2>&1 || true
210
+    
211
+    log "📤 通知已发送"
212
+}
213
+
214
+# ==================== 主逻辑 ====================
215
+
216
+log "========================================="
217
+log "数据库备份开始 - ${DB_NAME}"
218
+log "========================================="
219
+
220
+BACKUP_PATH=""
221
+STATUS="success"
222
+MESSAGE=""
223
+
224
+# 执行每日备份
225
+log "📅 执行每日备份..."
226
+BACKUP_PATH=$(do_full_backup "$BACKUP_DAILY")
227
+
228
+# 周日执行每周备份
229
+if [ "$DAY_OF_WEEK" = "7" ]; then
230
+    log "📅 今天是周日,执行每周备份..."
231
+    do_full_backup "$BACKUP_WEEKLY" > /dev/null
232
+fi
233
+
234
+# 每月1号执行每月备份
235
+if [ "$DAY_OF_MONTH" = "01" ]; then
236
+    log "📅 今天是月初,执行每月备份..."
237
+    do_full_backup "$BACKUP_MONTHLY" > /dev/null
238
+fi
239
+
240
+# 清理过期备份
241
+cleanup_old_backups
242
+
243
+# 发送通知
244
+send_notification "$STATUS" "每日全量备份完成" "$BACKUP_PATH"
245
+
246
+log "========================================="
247
+log "备份全部完成"
248
+log "========================================="
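
Reviewer note: `cleanup_old_backups` repeats the same count-and-delete logic three times and orders files by modification time. Because `do_full_backup` embeds a `YYYYmmdd_HHMMSS` timestamp in every file name, the retention step could instead sort by name, which survives a `touch` or a copy that resets mtimes. The sketch below illustrates that idea; `prune_keep_newest` is a hypothetical helper, and it assumes each directory holds backups of a single database so that name order equals time order.

```shell
#!/usr/bin/env bash
# Sketch: keep only the newest N backups in one directory.
# prune_keep_newest is a hypothetical helper, not part of this commit.
# It relies on the wm_<db>_<YYYYmmdd_HHMMSS> naming from do_full_backup,
# so a byte-wise sort of the names is also a chronological sort.
prune_keep_newest() {
    local dir="$1" keep="$2"
    local -a files
    # Matches both plain (.sql.gz) and encrypted (.sql.gz.enc) backups, oldest first.
    mapfile -t files < <(find "$dir" -maxdepth 1 -type f -name 'wm_*.sql.gz*' | LC_ALL=C sort)
    local excess=$(( ${#files[@]} - keep ))
    # Nothing to do when the directory holds at most $keep backups.
    (( excess > 0 )) || return 0
    # The first $excess entries are the oldest ones.
    rm -f -- "${files[@]:0:excess}"
}
```

Usage would mirror the existing policy: `prune_keep_newest "$BACKUP_DAILY" "$DAILY_RETENTION"`, then the same call for the weekly and monthly directories.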

+ 244
- 0
deploy/production/backup/restore-db.sh

@@ -0,0 +1,244 @@
1
+#!/bin/bash
2
+# ============================================================
3
+# Database restore script
4
+# 
5
+# Usage:
6
+#   ./restore-db.sh --file /path/to/backup.sql.gz
7
+#   ./restore-db.sh --latest                          # restore the most recent backup
8
+#   ./restore-db.sh --date 2026-06-15                 # restore the backup from a given date
9
+#   ./restore-db.sh --list                            # list all available backups
10
+# ============================================================
11
+set -euo pipefail
12
+
13
+# ==================== 配置 ====================
14
+DB_HOST="${DB_HOST:-localhost}"
15
+DB_PORT="${DB_PORT:-5432}"
16
+DB_NAME="${POSTGRES_DB:-water_management}"
17
+DB_USER="${POSTGRES_USER:-water}"
18
+DB_PASSWORD="${POSTGRES_PASSWORD:-}"
19
+PGPASSWORD="${DB_PASSWORD}"
20
+export PGPASSWORD
21
+
22
+BACKUP_BASE="${BACKUP_DIR:-/opt/water-management/backups}"
23
+BACKUP_DAILY="${BACKUP_BASE}/daily"
24
+BACKUP_WEEKLY="${BACKUP_BASE}/weekly"
25
+BACKUP_MONTHLY="${BACKUP_BASE}/monthly"
26
+
27
+# ==================== 参数 ====================
28
+BACKUP_FILE=""
29
+MODE="file"  # file | latest | date | list
30
+TARGET_DATE=""
31
+
32
+while [[ $# -gt 0 ]]; do
33
+    case $1 in
34
+        --file)    BACKUP_FILE="$2"; MODE="file"; shift 2 ;;
35
+        --latest)  MODE="latest"; shift ;;
36
+        --date)    TARGET_DATE="$2"; MODE="date"; shift 2 ;;
37
+        --list)    MODE="list"; shift ;;
38
+        -h|--help)
39
+            echo "用法: $0 [选项]"
40
+            echo ""
41
+            echo "选项:"
42
+            echo "  --file PATH      指定备份文件路径"
43
+            echo "  --latest         恢复最近一次备份"
44
+            echo "  --date YYYY-MM-DD 恢复指定日期的备份"
45
+            echo "  --list           列出所有可用备份"
46
+            echo ""
47
+            echo "示例:"
48
+            echo "  $0 --file /opt/water-management/backups/daily/wm_water_management_20260615_020000.sql.gz"
49
+            echo "  $0 --latest"
50
+            echo "  $0 --date 2026-06-15"
51
+            exit 0
52
+            ;;
53
+        *)
54
+            echo "❌ 未知参数: $1"
55
+            exit 1
56
+            ;;
57
+    esac
58
+done
59
+
60
+# ==================== 函数 ====================
61
+
62
+log() {
63
+    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >&2  # stderr, so $(decrypt_if_needed ...) captures only the file path
64
+}
65
+
66
+# 列出所有可用备份
67
+list_backups() {
68
+    echo "========================================="
69
+    echo " 可用备份列表"
70
+    echo "========================================="
71
+    echo ""
72
+    
73
+    echo "📅 每日备份:"
74
+    if [ -d "$BACKUP_DAILY" ]; then
75
+        ls -lht "$BACKUP_DAILY"/*.sql.gz 2>/dev/null | head -10 || echo "  (无)"
76
+    fi
77
+    echo ""
78
+    
79
+    echo "📅 每周备份:"
80
+    if [ -d "$BACKUP_WEEKLY" ]; then
81
+        ls -lht "$BACKUP_WEEKLY"/*.sql.gz 2>/dev/null | head -10 || echo "  (无)"
82
+    fi
83
+    echo ""
84
+    
85
+    echo "📅 每月备份:"
86
+    if [ -d "$BACKUP_MONTHLY" ]; then
87
+        ls -lht "$BACKUP_MONTHLY"/*.sql.gz 2>/dev/null | head -10 || echo "  (无)"
88
+    fi
89
+    echo ""
90
+}
91
+
92
+# 查找最近的备份
93
+find_latest_backup() {
94
+    local latest=""
95
+    latest=$(ls -1t "$BACKUP_DAILY"/*.sql.gz* 2>/dev/null | head -1 || echo "")
96
+    
97
+    if [ -z "$latest" ]; then
98
+        latest=$(ls -1t "$BACKUP_WEEKLY"/*.sql.gz* 2>/dev/null | head -1 || echo "")
99
+    fi
100
+    
101
+    if [ -z "$latest" ]; then
102
+        latest=$(ls -1t "$BACKUP_MONTHLY"/*.sql.gz* 2>/dev/null | head -1 || echo "")
103
+    fi
104
+    
105
+    echo "$latest"
106
+}
107
+
108
+# 查找指定日期的备份
109
+find_backup_by_date() {
110
+    local date_str="$1"
111
+    local date_compact
112
+    date_compact=$(echo "$date_str" | tr -d '-')
113
+    
114
+    local found=""
115
+    found=$(ls -1 "$BACKUP_DAILY"/*${date_compact}*.sql.gz* 2>/dev/null | head -1 || echo "")
116
+    
117
+    echo "$found"
118
+}
119
+
120
+# 解密备份文件(如果需要)
121
+decrypt_if_needed() {
122
+    local file="$1"
123
+    
124
+    if [[ "$file" == *.enc ]]; then
125
+        log "🔐 检测到加密文件,正在解密..."
126
+        local decrypted="${file%.enc}"
127
+        openssl enc -d -aes-256-cbc -pbkdf2 \
128
+            -in "$file" \
129
+            -out "$decrypted" \
130
+            -pass "pass:${ENCRYPT_PASSPHRASE:-}"
131
+        echo "$decrypted"
132
+    else
133
+        echo "$file"
134
+    fi
135
+}
136
+
137
+# 恢复数据库
138
+restore_database() {
139
+    local backup_file="$1"
140
+    
141
+    if [ ! -f "$backup_file" ]; then
142
+        log "❌ 备份文件不存在: $backup_file"
143
+        exit 1
144
+    fi
145
+    
146
+    local file_size
147
+    file_size=$(du -sh "$backup_file" | cut -f1)
148
+    
149
+    log "========================================="
150
+    log " 数据库恢复"
151
+    log "========================================="
152
+    log "备份文件: $backup_file"
153
+    log "文件大小: $file_size"
154
+    log "目标数据库: $DB_NAME"
155
+    log "目标主机: $DB_HOST:$DB_PORT"
156
+    log ""
157
+    
158
+    # 确认操作
159
+    read -r -p "⚠️  This will overwrite the current database. Continue? (yes/no): " confirm
160
+    if [ "$confirm" != "yes" ]; then
161
+        log "❌ 操作已取消"
162
+        exit 0
163
+    fi
164
+    
165
+    log "🔄 开始恢复..."
166
+    
167
+    # 解密(如果需要)
168
+    local actual_file rc=0
169
+    actual_file=$(decrypt_if_needed "$backup_file")
170
+    
171
+    # Restore. Under pipefail, "|| rc=$?" records a failure of gunzip, psql, or tail without letting set -e abort first
172
+    if [[ "$actual_file" == *.gz ]]; then
173
+        gunzip -c "$actual_file" | psql \
174
+            -h "$DB_HOST" \
175
+            -p "$DB_PORT" \
176
+            -U "$DB_USER" \
177
+            -d "$DB_NAME" \
178
+            --single-transaction \
179
+            -v ON_ERROR_STOP=1 \
180
+            2>&1 | tail -20 || rc=$?
181
+    else
182
+        psql \
183
+            -h "$DB_HOST" \
184
+            -p "$DB_PORT" \
185
+            -U "$DB_USER" \
186
+            -d "$DB_NAME" \
187
+            --single-transaction \
188
+            -v ON_ERROR_STOP=1 \
189
+            -f "$actual_file" \
190
+            2>&1 | tail -20 || rc=$?
191
+    fi
192
+    
193
+    if [ "$rc" -eq 0 ]; then
194
+        log "✅ 数据库恢复成功"
195
+    else
196
+        log "❌ 数据库恢复可能存在问题,请检查日志"
197
+        exit 1
198
+    fi
199
+    
200
+    # 清理解密的临时文件
201
+    if [ "$actual_file" != "$backup_file" ] && [ -f "$actual_file" ]; then
202
+        rm -f "$actual_file"
203
+    fi
204
+}
205
+
206
+# ==================== 主逻辑 ====================
207
+
208
+case "$MODE" in
209
+    list)
210
+        list_backups
211
+        ;;
212
+    latest)
213
+        BACKUP_FILE=$(find_latest_backup)
214
+        if [ -z "$BACKUP_FILE" ]; then
215
+            log "❌ 未找到任何备份文件"
216
+            exit 1
217
+        fi
218
+        restore_database "$BACKUP_FILE"
219
+        ;;
220
+    date)
221
+        if [ -z "$TARGET_DATE" ]; then
222
+            log "❌ 必须指定日期: --date YYYY-MM-DD"
223
+            exit 1
224
+        fi
225
+        BACKUP_FILE=$(find_backup_by_date "$TARGET_DATE")
226
+        if [ -z "$BACKUP_FILE" ]; then
227
+            log "❌ 未找到 ${TARGET_DATE} 的备份文件"
228
+            list_backups
229
+            exit 1
230
+        fi
231
+        restore_database "$BACKUP_FILE"
232
+        ;;
233
+    file)
234
+        if [ -z "$BACKUP_FILE" ]; then
235
+            log "❌ 必须指定备份文件: --file /path/to/backup.sql.gz"
236
+            exit 1
237
+        fi
238
+        restore_database "$BACKUP_FILE"
239
+        ;;
240
+    *)
241
+        echo "❌ 未知模式: $MODE"
242
+        exit 1
243
+        ;;
244
+esac
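
Reviewer note: `restore_database` only checks that the backup file exists. If `pg_dump` aborts in `backup-db.sh`, the `| gzip -9 >` pipeline can still leave a small but valid gzip file with no SQL inside, and `--latest` would then pick it up. A sanity check before the confirmation prompt would catch that case. The sketch below is one way to do it; `verify_backup` is a hypothetical helper, and it only covers unencrypted `.gz` files (an `.enc` file would need to be decrypted first).

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a backup file before restoring it.
# verify_backup is a hypothetical helper, not part of restore-db.sh.
verify_backup() {
    local file="$1"
    # Reject missing or zero-byte files.
    [ -s "$file" ] || { echo "empty or missing: $file" >&2; return 1; }
    case "$file" in
        *.gz)
            # gzip -t tests the stream and fails on corrupt or truncated data.
            gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
            # A valid gzip can still wrap zero bytes of SQL; require at least one byte.
            if [ -z "$(gzip -dc "$file" 2>/dev/null | head -c 1)" ]; then
                echo "no data inside: $file" >&2
                return 1
            fi
            ;;
    esac
    return 0
}
```

Calling `verify_backup "$backup_file" || exit 1` right after the existence check would stop the restore before the operator is asked to confirm overwriting the database.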

+ 526
- 0
deploy/production/docker-compose.override.yml

@@ -0,0 +1,526 @@
1
+# ============================================================
2
+# Production Docker Compose override
3
+# Usage: docker compose -f docker-compose.yml -f deploy/production/docker-compose.override.yml up -d
4
+# ============================================================
5
+
6
+services:
7
+  # ==================== Infrastructure ====================
8
+
9
+  postgres:
10
+    restart: always
11
+    environment:
12
+      POSTGRES_DB: ${POSTGRES_DB}
13
+      POSTGRES_USER: ${POSTGRES_USER}
14
+      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
15
+    ports:
16
+      - "127.0.0.1:5432:5432"
17
+    volumes:
18
+      - pgdata:/var/lib/postgresql/data
19
+      - /opt/water-management/config/postgres:/etc/postgresql:ro
20
+    deploy:
21
+      resources:
22
+        limits:
23
+          cpus: '2.0'
24
+          memory: 2G
25
+        reservations:
26
+          cpus: '0.5'
27
+          memory: 512M
28
+    logging:
29
+      driver: json-file
30
+      options:
31
+        max-size: "50m"
32
+        max-file: "5"
33
+    security_opt:
34
+      - no-new-privileges:true
35
+    healthcheck:
36
+      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
37
+      interval: 30s
38
+      timeout: 10s
39
+      retries: 5
40
+      start_period: 60s
41
+
+  redis:
+    restart: always
+    command: >
+      redis-server
+      --requirepass ${REDIS_PASSWORD}
+      --maxmemory 512mb
+      --maxmemory-policy allkeys-lru
+      --appendonly yes
+      --appendfsync everysec
+    ports:
+      - "127.0.0.1:6379:6379"
+    deploy:
+      resources:
+        limits:
+          cpus: '0.5'
+          memory: 768M
+        reservations:
+          cpus: '0.25'
+          memory: 256M
+    logging:
+      driver: json-file
+      options:
+        max-size: "20m"
+        max-file: "3"
+    security_opt:
+      - no-new-privileges:true
+    read_only: true
+    tmpfs:
+      - /tmp
+      - /run
+
+  tdengine:
+    restart: always
+    ports:
+      - "127.0.0.1:6030:6030"
+      - "127.0.0.1:6041:6041"
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 2G
+        reservations:
+          cpus: '0.5'
+          memory: 512M
+    logging:
+      driver: json-file
+      options:
+        max-size: "50m"
+        max-file: "5"
+
+  kafka:
+    restart: always
+    ports:
+      - "127.0.0.1:9092:9092"
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 2G
+        reservations:
+          cpus: '0.5'
+          memory: 512M
+    logging:
+      driver: json-file
+      options:
+        max-size: "50m"
+        max-file: "5"
+
+  emqx:
+    restart: always
+    ports:
+      - "1883:1883"
+      - "127.0.0.1:8083:8083"
+      - "127.0.0.1:18083:18083"
+    environment:
+      EMQX_DASHBOARD__DEFAULT_USERNAME: ${EMQX_ADMIN_USER}
+      EMQX_DASHBOARD__DEFAULT_PASSWORD: ${EMQX_ADMIN_PASSWORD}
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 1G
+        reservations:
+          cpus: '0.25'
+          memory: 256M
+    logging:
+      driver: json-file
+      options:
+        max-size: "30m"
+        max-file: "3"
+
+  nacos:
+    restart: always
+    ports:
+      - "127.0.0.1:8848:8848"
+      - "127.0.0.1:9848:9848"
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 1G
+        reservations:
+          cpus: '0.25'
+          memory: 256M
+    logging:
+      driver: json-file
+      options:
+        max-size: "30m"
+        max-file: "3"
+
+  minio:
+    restart: always
+    ports:
+      - "127.0.0.1:9000:9000"
+      - "127.0.0.1:9001:9001"
+    environment:
+      MINIO_ROOT_USER: ${MINIO_USER}
+      MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 1G
+        reservations:
+          cpus: '0.25'
+          memory: 256M
+    logging:
+      driver: json-file
+      options:
+        max-size: "30m"
+        max-file: "3"
+
+  # ==================== Application services ====================
+
+  gateway:
+    restart: always
+    image: ${REGISTRY:-}water/wm-gateway:${IMAGE_TAG:-latest}
+    build: !reset null
+    ports:
+      - "127.0.0.1:8080:8080"
+    environment:
+      SPRING_PROFILES_ACTIVE: prod
+      SPRING_CLOUD_NACOS_DISCOVERY_SERVER_ADDR: nacos:8848
+      SPRING_CLOUD_NACOS_CONFIG_SERVER_ADDR: nacos:8848
+      SPRING_DATA_REDIS_HOST: redis
+      SPRING_DATA_REDIS_PASSWORD: ${REDIS_PASSWORD}
+      JAVA_OPTS: "-Xms256m -Xmx512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 768M
+        reservations:
+          cpus: '0.25'
+          memory: 256M
+    logging:
+      driver: json-file
+      options:
+        max-size: "50m"
+        max-file: "5"
+    security_opt:
+      - no-new-privileges:true
+    read_only: true
+    tmpfs:
+      - /tmp
+
+  base:
+    restart: always
+    image: ${REGISTRY:-}water/wm-base:${IMAGE_TAG:-latest}
+    build: !reset null
+    ports:
+      - "127.0.0.1:8091:8081"
+    environment:
+      SPRING_PROFILES_ACTIVE: prod
+      SPRING_CLOUD_NACOS_DISCOVERY_SERVER_ADDR: nacos:8848
+      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/${POSTGRES_DB}
+      SPRING_DATASOURCE_USERNAME: ${POSTGRES_USER}
+      SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD}
+      SPRING_DATA_REDIS_HOST: redis
+      SPRING_DATA_REDIS_PASSWORD: ${REDIS_PASSWORD}
+      JAVA_OPTS: "-Xms256m -Xmx512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 768M
+        reservations:
+          cpus: '0.25'
+          memory: 256M
+    logging:
+      driver: json-file
+      options:
+        max-size: "50m"
+        max-file: "5"
+    security_opt:
+      - no-new-privileges:true
+    read_only: true
+    tmpfs:
+      - /tmp
+
+  iot:
+    restart: always
+    image: ${REGISTRY:-}water/wm-iot:${IMAGE_TAG:-latest}
+    build: !reset null
+    ports:
+      - "127.0.0.1:8092:8082"
+    environment:
+      SPRING_PROFILES_ACTIVE: prod
+      SPRING_CLOUD_NACOS_DISCOVERY_SERVER_ADDR: nacos:8848
+      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/${POSTGRES_DB}
+      SPRING_DATASOURCE_USERNAME: ${POSTGRES_USER}
+      SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD}
+      MQTT_BROKER_URL: tcp://emqx:1883
+      JAVA_OPTS: "-Xms256m -Xmx512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 768M
+        reservations:
+          cpus: '0.25'
+          memory: 256M
+    logging:
+      driver: json-file
+      options:
+        max-size: "50m"
+        max-file: "5"
+    security_opt:
+      - no-new-privileges:true
+    read_only: true
+    tmpfs:
+      - /tmp
+
+  data-engine:
+    restart: always
+    image: ${REGISTRY:-}water/wm-data-engine:${IMAGE_TAG:-latest}
+    build: !reset null
+    ports:
+      - "127.0.0.1:8093:8083"
+    environment:
+      SPRING_PROFILES_ACTIVE: prod
+      SPRING_CLOUD_NACOS_DISCOVERY_SERVER_ADDR: nacos:8848
+      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/${POSTGRES_DB}
+      SPRING_DATASOURCE_USERNAME: ${POSTGRES_USER}
+      SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD}
+      JAVA_OPTS: "-Xms512m -Xmx1g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
+    deploy:
+      resources:
+        limits:
+          cpus: '1.5'
+          memory: 1536M
+        reservations:
+          cpus: '0.5'
+          memory: 512M
+    logging:
+      driver: json-file
+      options:
+        max-size: "50m"
+        max-file: "5"
+    security_opt:
+      - no-new-privileges:true
+    read_only: true
+    tmpfs:
+      - /tmp
+
+  bpm:
+    restart: always
+    image: ${REGISTRY:-}water/wm-bpm:${IMAGE_TAG:-latest}
+    build: !reset null
+    ports:
+      - "127.0.0.1:8094:8084"
+    environment:
+      SPRING_PROFILES_ACTIVE: prod
+      SPRING_CLOUD_NACOS_DISCOVERY_SERVER_ADDR: nacos:8848
+      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/${POSTGRES_DB}
+      SPRING_DATASOURCE_USERNAME: ${POSTGRES_USER}
+      SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD}
+      JAVA_OPTS: "-Xms256m -Xmx512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 768M
+        reservations:
+          cpus: '0.25'
+          memory: 256M
+    logging:
+      driver: json-file
+      options:
+        max-size: "50m"
+        max-file: "5"
+    security_opt:
+      - no-new-privileges:true
+    read_only: true
+    tmpfs:
+      - /tmp
+
+  production:
+    restart: always
+    image: ${REGISTRY:-}water/wm-production:${IMAGE_TAG:-latest}
+    build: !reset null
+    ports:
+      - "127.0.0.1:8095:8085"
+    environment:
+      SPRING_PROFILES_ACTIVE: prod
+      SPRING_CLOUD_NACOS_DISCOVERY_SERVER_ADDR: nacos:8848
+      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/${POSTGRES_DB}
+      SPRING_DATASOURCE_USERNAME: ${POSTGRES_USER}
+      SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD}
+      JAVA_OPTS: "-Xms256m -Xmx512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 768M
+        reservations:
+          cpus: '0.25'
+          memory: 256M
+    logging:
+      driver: json-file
+      options:
+        max-size: "50m"
+        max-file: "5"
+    security_opt:
+      - no-new-privileges:true
+    read_only: true
+    tmpfs:
+      - /tmp
+
+  revenue:
+    restart: always
+    image: ${REGISTRY:-}water/wm-revenue:${IMAGE_TAG:-latest}
+    build: !reset null
+    ports:
+      - "127.0.0.1:8096:8086"
+    environment:
+      SPRING_PROFILES_ACTIVE: prod
+      SPRING_CLOUD_NACOS_DISCOVERY_SERVER_ADDR: nacos:8848
+      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/${POSTGRES_DB}
+      SPRING_DATASOURCE_USERNAME: ${POSTGRES_USER}
+      SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD}
+      JAVA_OPTS: "-Xms256m -Xmx512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 768M
+        reservations:
+          cpus: '0.25'
+          memory: 256M
+    logging:
+      driver: json-file
+      options:
+        max-size: "50m"
+        max-file: "5"
+    security_opt:
+      - no-new-privileges:true
+    read_only: true
+    tmpfs:
+      - /tmp
+
+  patrol:
+    restart: always
+    image: ${REGISTRY:-}water/wm-patrol:${IMAGE_TAG:-latest}
+    build: !reset null
+    ports:
+      - "127.0.0.1:8097:8087"
+    environment:
+      SPRING_PROFILES_ACTIVE: prod
+      SPRING_CLOUD_NACOS_DISCOVERY_SERVER_ADDR: nacos:8848
+      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/${POSTGRES_DB}
+      SPRING_DATASOURCE_USERNAME: ${POSTGRES_USER}
+      SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD}
+      JAVA_OPTS: "-Xms256m -Xmx512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 768M
+        reservations:
+          cpus: '0.25'
+          memory: 256M
+    logging:
+      driver: json-file
+      options:
+        max-size: "50m"
+        max-file: "5"
+    security_opt:
+      - no-new-privileges:true
+    read_only: true
+    tmpfs:
+      - /tmp
+
+  notify:
+    restart: always
+    image: ${REGISTRY:-}water/wm-notify:${IMAGE_TAG:-latest}
+    build: !reset null
+    ports:
+      - "127.0.0.1:8099:8089"
+    environment:
+      SPRING_PROFILES_ACTIVE: prod
+      SPRING_CLOUD_NACOS_DISCOVERY_SERVER_ADDR: nacos:8848
+      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/${POSTGRES_DB}
+      SPRING_DATASOURCE_USERNAME: ${POSTGRES_USER}
+      SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD}
+      JAVA_OPTS: "-Xms128m -Xmx256m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
+    deploy:
+      resources:
+        limits:
+          cpus: '0.5'
+          memory: 384M
+        reservations:
+          cpus: '0.1'
+          memory: 128M
+    logging:
+      driver: json-file
+      options:
+        max-size: "30m"
+        max-file: "3"
+    security_opt:
+      - no-new-privileges:true
+    read_only: true
+    tmpfs:
+      - /tmp
+
+  job:
+    restart: always
+    image: ${REGISTRY:-}water/wm-job:${IMAGE_TAG:-latest}
+    build: !reset null
+    ports:
+      - "127.0.0.1:8100:8090"
+    environment:
+      SPRING_PROFILES_ACTIVE: prod
+      SPRING_CLOUD_NACOS_DISCOVERY_SERVER_ADDR: nacos:8848
+      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/${POSTGRES_DB}
+      SPRING_DATASOURCE_USERNAME: ${POSTGRES_USER}
+      SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD}
+      JAVA_OPTS: "-Xms128m -Xmx256m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
+    deploy:
+      resources:
+        limits:
+          cpus: '0.5'
+          memory: 384M
+        reservations:
+          cpus: '0.1'
+          memory: 128M
+    logging:
+      driver: json-file
+      options:
+        max-size: "30m"
+        max-file: "3"
+    security_opt:
+      - no-new-privileges:true
+    read_only: true
+    tmpfs:
+      - /tmp
+
+  # ==================== Frontend ====================
+
+  frontend:
+    restart: always
+    image: ${REGISTRY:-}water/frontend:${IMAGE_TAG:-latest}
+    build: !reset null
+    ports:
+      - "80:80"
+      - "443:443"
+    volumes:
+      - ./deploy/production/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
+      - /etc/letsencrypt:/etc/letsencrypt:ro
+      - /var/www/certbot:/var/www/certbot:ro
+    deploy:
+      resources:
+        limits:
+          cpus: '0.5'
+          memory: 256M
+        reservations:
+          cpus: '0.1'
+          memory: 64M
+    logging:
+      driver: json-file
+      options:
+        max-size: "20m"
+        max-file: "3"
+    security_opt:
+      - no-new-privileges:true
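
Each Java service in this file pairs a JVM heap cap (`-Xmx`) with a larger container memory limit, which leaves room for metaspace, thread stacks, and other non-heap memory. The sketch below computes that headroom for a given pair. `to_mib` and `headroom_mib` are hypothetical helper names, and only the `M` and `G` suffixes that appear in this file are handled.

```shell
# Hypothetical helpers: convert Compose/JVM size strings to MiB and compare them.
# Only the suffixes used in this file are handled: M/m and G/g.
to_mib() {
    case "$1" in
        *[Gg]) echo $(( ${1%[Gg]} * 1024 )) ;;
        *[Mm]) echo $(( ${1%[Mm]} )) ;;
        *)     echo "unsupported size: $1" >&2; return 1 ;;
    esac
}

# headroom_mib LIMIT XMX: memory left for non-heap use inside the container.
headroom_mib() {
    limit=$(to_mib "$1") || return 1
    heap=$(to_mib "$2") || return 1
    echo $(( limit - heap ))
}

headroom_mib 768M 512m    # gateway: prints 256
headroom_mib 1536M 1g     # data-engine: prints 512
headroom_mib 384M 256m    # notify and job: prints 128
```

If a service is killed for exceeding its limit while the heap is not full, this headroom is the first number to revisit.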

deploy/production/logging/docker-compose.logging.yml

@@ -0,0 +1,88 @@
+# ============================================================
+# Logging stack Docker Compose file (lightweight Loki setup)
+#
+# Usage:
+#   docker compose -f deploy/production/logging/docker-compose.logging.yml up -d
+#
+# Components:
+#   - Loki (log aggregation and storage)
+#   - Promtail (log collection agent)
+#   - Optional: Filebeat (alternative to Promtail; not defined in this file)
+# ============================================================
+
+services:
+  # ==================== Loki (log aggregation) ====================
+  loki:
+    image: grafana/loki:2.9.6
+    container_name: wm-loki
+    restart: always
+    command: -config.file=/etc/loki/loki-config.yml
+    ports:
+      - "127.0.0.1:3100:3100"
+    volumes:
+      - ./loki-config.yml:/etc/loki/loki-config.yml:ro
+      - loki_data:/loki
+    networks:
+      - wm-network
+      - logging
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 1G
+        reservations:
+          cpus: '0.25'
+          memory: 256M
+    logging:
+      driver: json-file
+      options:
+        max-size: "30m"
+        max-file: "3"
+    healthcheck:
+      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3100/ready"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+
+  # ==================== Promtail (log collection) ====================
+  promtail:
+    image: grafana/promtail:2.9.6
+    container_name: wm-promtail
+    restart: always
+    command: -config.file=/etc/promtail/promtail-config.yml
+    volumes:
+      - ./promtail-config.yml:/etc/promtail/promtail-config.yml:ro
+      - /var/log:/var/log:ro
+      - /var/lib/docker/containers:/var/lib/docker/containers:ro
+      - /var/run/docker.sock:/var/run/docker.sock:ro
+    networks:
+      - logging
+      - wm-network
+    depends_on:
+      - loki
+    deploy:
+      resources:
+        limits:
+          cpus: '0.3'
+          memory: 256M
+        reservations:
+          cpus: '0.1'
+          memory: 64M
+    logging:
+      driver: json-file
+      options:
+        max-size: "10m"
+        max-file: "3"
+
+# ==================== Volumes ====================
+volumes:
+  loki_data:
+
+# ==================== Networks ====================
+networks:
+  logging:
+    driver: bridge
+    name: wm-logging
+  wm-network:
+    external: true
+    name: wm-network
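
Because `wm-network` is declared `external: true`, Compose will not create it, so the network has to exist before this stack is started. A minimal sketch of an idempotent pre-flight step follows. `ensure_network` is a hypothetical helper, and the `DOCKER` variable is an assumption of this sketch that lets the command be swapped out for a stub; `docker network inspect` and `docker network create` are the standard Docker CLI subcommands.

```shell
# Hypothetical pre-flight helper: create the external network only if it is missing.
# DOCKER is an assumption of this sketch; it defaults to the real docker CLI.
ensure_network() {
    net="$1"
    if "${DOCKER:-docker}" network inspect "$net" >/dev/null 2>&1; then
        echo "exists: $net"
    else
        "${DOCKER:-docker}" network create "$net" >/dev/null && echo "created: $net"
    fi
}

# Possible use before starting the logging stack (shown as a comment only):
#   ensure_network wm-network
#   docker compose -f deploy/production/logging/docker-compose.logging.yml up -d
```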

deploy/production/logging/loki-config.yml

@@ -0,0 +1,64 @@
+# ============================================================
+# Loki configuration - log aggregation and storage
+# ============================================================
+
+auth_enabled: false
+
+server:
+  http_listen_port: 3100
+  grpc_listen_port: 9096
+  log_level: info
+
+common:
+  path_prefix: /loki
+  storage:
+    filesystem:
+      chunks_directory: /loki/chunks
+      rules_directory: /loki/rules
+  replication_factor: 1
+  ring:
+    kvstore:
+      store: inmemory
+
+schema_config:
+  configs:
+    - from: "2024-01-01"
+      store: tsdb
+      object_store: filesystem
+      schema: v13
+      index:
+        prefix: index_
+        period: 24h
+
+storage_config:
+  tsdb_shipper:
+    active_index_directory: /loki/tsdb-index
+    cache_location: /loki/tsdb-cache
+    cache_ttl: 24h
+
+limits_config:
+  retention_period: 30d
+  max_query_series: 500
+  max_query_parallelism: 4
+  max_entries_limit_per_query: 5000
+  ingestion_rate_mb: 10
+  ingestion_burst_size_mb: 20
+  per_stream_rate_limit: 5MB
+  per_stream_rate_limit_burst: 15MB
+
+chunk_store_config:
+  max_look_back_period: 0s
+
+table_manager:
+  retention_deletes_enabled: true
+  retention_period: 720h  # 30 days
+
+compactor:
+  working_directory: /loki/compactor
+  compaction_interval: 10m
+  retention_enabled: true
+  retention_delete_delay: 2h
+  shared_store: filesystem  # key name for the pinned grafana/loki:2.9.6 image; Loki 3.x uses delete_request_store
+
+analytics:
+  reporting_enabled: false
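
The file states the 30-day retention twice in different units: `limits_config.retention_period: 30d` and `table_manager.retention_period: 720h`. A small sketch for checking that such values agree is shown below. `to_hours` is a hypothetical helper and handles only the `d` and `h` suffixes seen here.

```shell
# Hypothetical helper: normalize a duration with a d or h suffix to hours.
to_hours() {
    case "$1" in
        *d) echo $(( ${1%d} * 24 )) ;;
        *h) echo $(( ${1%h} )) ;;
        *)  echo "unsupported duration: $1" >&2; return 1 ;;
    esac
}

# 30 days and 720 hours describe the same retention window.
if [ "$(to_hours 30d)" -eq "$(to_hours 720h)" ]; then
    echo "retention settings agree"
fi
```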

deploy/production/logging/promtail-config.yml

@@ -0,0 +1,108 @@
+# ============================================================
+# Promtail configuration - log collection agent
+# ============================================================
+
+server:
+  http_listen_port: 9080
+  grpc_listen_port: 0
+  log_level: info
+
+positions:
+  filename: /tmp/positions.yaml  # not on a volume, so read offsets reset when the container is recreated
+
+clients:
+  - url: http://loki:3100/loki/api/v1/push
+    batchwait: 1s
+    batchsize: 1048576  # 1MB
+    timeout: 10s
+
+scrape_configs:
+  # ==================== Docker container logs ====================
+  - job_name: docker
+    docker_sd_configs:
+      - host: unix:///var/run/docker.sock
+        refresh_interval: 5s
+        filters:
+          - name: name
+            values: ["wm-*"]
+    relabel_configs:
+      - source_labels: ['__meta_docker_container_name']
+        regex: '/(.*)'
+        target_label: 'container'
+      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
+        target_label: 'service'
+    pipeline_stages:
+      - json:
+          expressions:
+            level: level
+            msg: msg
+            time: time
+      - labels:
+          level:
+          service:
+      - timestamp:
+          source: time
+          format: RFC3339Nano
+          fallback_formats:
+            - "2006-01-02T15:04:05.000Z"
+      - output:
+          source: msg
+
+  # ==================== Nginx access logs ====================
+  - job_name: nginx-access
+    static_configs:
+      - targets:
+          - localhost
+        labels:
+          job: nginx-access
+          __path__: /var/lib/docker/containers/**/wm-frontend*.log  # verify: json-file logs are named <container-id>-json.log
+    pipeline_stages:
+      - json:
+          expressions:
+            log: log
+      - regex:
+          expression: '^(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d+) (?P<body_bytes_sent>\d+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
+          source: log
+      - labels:
+          status:
+      - timestamp:
+          source: time_local
+          format: "02/Jan/2006:15:04:05 -0700"
+
+  # ==================== System logs ====================
+  - job_name: syslog
+    static_configs:
+      - targets:
+          - localhost
+        labels:
+          job: syslog
+          __path__: /var/log/syslog
+    pipeline_stages:
+      - regex:
+          expression: '^(?P<timestamp>\w+\s+\d+\s+\d+:\d+:\d+)\s+(?P<host>\S+)\s+(?P<process>\S+):\s+(?P<message>.+)$'
+      - labels:
+          host:
+          process:
+      - timestamp:
+          source: timestamp
+          format: "Jan  2 15:04:05"
+
+  # ==================== Docker system logs ====================
+  - job_name: docker-system
+    static_configs:
+      - targets:
+          - localhost
+        labels:
+          job: docker
+          __path__: /var/lib/docker/containers/*/*.log
+    pipeline_stages:
+      - json:
+          expressions:
+            stream: stream
+            log: log
+            time: time
+      - labels:
+          stream:
+      - timestamp:
+          source: time
+          format: RFC3339Nano
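
The `nginx-access` pipeline above depends on a regular expression for the combined access-log format. One way to sanity-check the field order outside Promtail is with `sed -E`. It has no named groups, so the sketch below uses positional groups in the same order (status is group 5) and `[^ ]+` in place of `\S+`. `nginx_status` is a hypothetical helper and the sample log line is made up.

```shell
# Hypothetical helper: extract the HTTP status from one combined-format access log line.
# Groups, in order: remote_addr, remote_user, time_local, request, status,
# body_bytes_sent, referer, user_agent.
nginx_status() {
    printf '%s\n' "$1" | sed -nE 's/^([^ ]+) - ([^ ]+) \[([^]]+)\] "([^"]*)" ([0-9]+) ([0-9]+) "([^"]*)" "([^"]*)".*$/\5/p'
}

# Made-up sample line for illustration.
nginx_status '203.0.113.7 - - [10/Oct/2024:13:55:36 +0800] "GET /api/health HTTP/1.1" 200 512 "-" "curl/8.5.0"'
```

A line that does not fit the format prints nothing, which mirrors how a non-matching regex stage leaves the `status` label unset.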

deploy/production/monitoring/alert_rules.yml

@@ -0,0 +1,184 @@
+# ============================================================
+# Prometheus alert rules
+# ============================================================
+
+groups:
+  # ==================== Host alerts ====================
+  - name: host_alerts
+    rules:
+      # CPU usage > 80% for 5 minutes
+      - alert: HighCPUUsage
+        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High host CPU usage ({{ $labels.instance }})"
+          description: "CPU usage is above 80%, current value: {{ $value | printf \"%.2f\" }}%"
+
+      # CPU usage > 95% for 2 minutes
+      - alert: CriticalCPUUsage
+        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
+        for: 2m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Critically high host CPU usage ({{ $labels.instance }})"
+          description: "CPU usage is above 95%, current value: {{ $value | printf \"%.2f\" }}%"
+
+      # Memory usage > 85%
+      - alert: HighMemoryUsage
+        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High host memory usage ({{ $labels.instance }})"
+          description: "Memory usage is above 85%, current value: {{ $value | printf \"%.2f\" }}%"
+
+      # Disk usage > 90%
+      - alert: HighDiskUsage
+        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 90
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Host disk space is low ({{ $labels.instance }})"
+          description: "Disk usage is above 90%, current value: {{ $value | printf \"%.2f\" }}%"
+
+      # Disk IO wait > 10% (averaged across cores, like the CPU rules above)
+      - alert: HighDiskIOWait
+        expr: avg by(instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 10
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High disk IO wait ({{ $labels.instance }})"
+          description: "IO wait is above 10%, current value: {{ $value | printf \"%.2f\" }}%"
+
+  # ==================== Container alerts ====================
+  - name: container_alerts
+    rules:
+      # Container CPU usage > 80%
+      - alert: ContainerHighCPU
+        expr: (sum by(name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) * 100) > 80
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High container CPU usage ({{ $labels.name }})"
+          description: "Container {{ $labels.name }} CPU usage is above 80%"
+
+      # Container memory usage > 85% of its limit (containers without a limit report 0 and are skipped)
+      - alert: ContainerHighMemory
+        expr: (container_memory_usage_bytes{name!=""} / (container_spec_memory_limit_bytes{name!=""} > 0) * 100) > 85
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High container memory usage ({{ $labels.name }})"
+          description: "Container {{ $labels.name }} memory usage is above 85%"
+
+      # Container restarts, counted as changes in the container start time
+      - alert: ContainerFrequentRestart
+        expr: changes(container_start_time_seconds{name!=""}[1h]) > 3
+        for: 0m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Container is restarting frequently ({{ $labels.name }})"
+          description: "Container {{ $labels.name }} restarted more than 3 times in the past hour"
+
+  # ==================== Service alerts ====================
+  - name: service_alerts
+    rules:
+      # Service down
+      - alert: ServiceDown
+        expr: up{job=~"wm-.*"} == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Service down ({{ $labels.job }})"
+          description: "Service {{ $labels.job }} has not responded for more than 1 minute"
+
+      # Service p95 response time > 5s
+      - alert: ServiceSlowResponse
+        expr: histogram_quantile(0.95, sum by(job, le) (rate(http_server_requests_seconds_bucket{job=~"wm-.*"}[5m]))) > 5
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Slow service responses ({{ $labels.job }})"
+          description: "Service {{ $labels.job }} 95th percentile response time is above 5 seconds"
+
+      # HTTP 5xx error rate > 5%
+      - alert: HighErrorRate
+        expr: |
+          (
+            sum by(job) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
+            /
+            sum by(job) (rate(http_server_requests_seconds_count[5m]))
+          ) * 100 > 5
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "High service error rate ({{ $labels.job }})"
+          description: "Service {{ $labels.job }} 5xx error rate is above 5%"
+
+  # ==================== Database alerts ====================
+  - name: database_alerts
+    rules:
+      # PostgreSQL connections > 80% of max_connections
+      - alert: PostgresHighConnections
+        expr: sum by(instance) (pg_stat_activity_count) / on(instance) pg_settings_max_connections * 100 > 80
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High PostgreSQL connection count"
+          description: "Database connection usage is above 80%"
+
+      # PostgreSQL deadlocks
+      - alert: PostgresDeadlock
+        expr: increase(pg_stat_database_deadlocks[5m]) > 0
+        for: 0m
+        labels:
+          severity: warning
+        annotations:
+          summary: "PostgreSQL deadlock detected"
+          description: "A deadlock occurred in the database"
+
+      # Redis memory usage > 85%
+      - alert: RedisHighMemory
+        expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 85
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High Redis memory usage"
+          description: "Redis memory usage is above 85%"
+
+  # ==================== Message queue alerts ====================
+  - name: mq_alerts
+    rules:
+      # Kafka consumer lag
+      - alert: KafkaConsumerLag
+        expr: kafka_consumergroup_lag_sum > 10000
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High Kafka consumer lag"
+          description: "Consumer group {{ $labels.consumergroup }} is more than 10000 messages behind"
+
+      # Too many EMQX connections
+      - alert: EmqxHighConnections
+        expr: emqx_connections_count > 10000
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High EMQX connection count"
+          description: "MQTT connection count is above 10000"
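
As a worked example of the `HighMemoryUsage` expression, `(1 - MemAvailable / MemTotal) * 100`, the sketch below reproduces the arithmetic with `awk`. `mem_used_pct` is a hypothetical helper and the byte counts are made-up sample values, not measurements.

```shell
# Hypothetical helper: same arithmetic as the HighMemoryUsage rule.
# Arguments: available bytes, total bytes. Prints the used percentage with two decimals.
mem_used_pct() {
    awk -v avail="$1" -v total="$2" 'BEGIN { printf "%.2f\n", (1 - avail / total) * 100 }'
}

# Made-up sample: 2 GiB available out of 16 GiB total.
mem_used_pct 2147483648 17179869184   # prints 87.50, which is above the 85% threshold
```

The rule only fires once the value has stayed above 85 for the full `for: 5m` window.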

deploy/production/monitoring/alertmanager.yml

@@ -0,0 +1,66 @@
+# ============================================================
+# AlertManager configuration - alert routing and notification
+# ============================================================
+
+global:
+  resolve_timeout: 5m
+
+# ==================== Alert templates ====================
+templates:
+  - '/etc/alertmanager/templates/*.tmpl'
+
+# ==================== Routing rules ====================
+route:
+  group_by: ['alertname', 'job']
+  group_wait: 30s
+  group_interval: 5m
+  repeat_interval: 4h
+  receiver: 'wecom'
+
+  routes:
+    # Critical alerts - notify immediately
+    - match:
+        severity: critical
+      receiver: 'wecom-critical'
+      group_wait: 10s
+      repeat_interval: 1h
+
+    # Regular alerts
+    - match:
+        severity: warning
+      receiver: 'wecom'
+      repeat_interval: 4h
+
+# ==================== Receivers ====================
+receivers:
+  # WeCom (Enterprise WeChat) notification - regular alerts
+  - name: 'wecom'
+    webhook_configs:
+      - url: 'http://wecom-alert-proxy:8080/alert'
+        send_resolved: true
+        http_config:
+          follow_redirects: true
+
+  # WeCom (Enterprise WeChat) notification - critical alerts
+  - name: 'wecom-critical'
+    webhook_configs:
+      - url: 'http://wecom-alert-proxy:8080/alert-critical'
+        send_resolved: true
+        http_config:
+          follow_redirects: true
+
+# ==================== Alert inhibition ====================
+inhibit_rules:
+  # Same alert: a firing critical inhibits the warning (only applies when both severities share an alertname)
+  - source_match:
+      severity: 'critical'
+    target_match:
+      severity: 'warning'
+    equal: ['alertname', 'instance']
+
+  # When a service is down, inhibit that service's other alerts
+  - source_match:
+      alertname: 'ServiceDown'
+    target_match_re:
+      alertname: 'ServiceSlowResponse|HighErrorRate'
+    equal: ['job']
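
The route tree above reduces to a simple decision on the `severity` label: `critical` goes to `wecom-critical`, while `warning` and anything unmatched fall back to `wecom`. The sketch below restates that decision for illustration only. `pick_receiver` is a hypothetical helper and is not how AlertManager itself is implemented.

```shell
# Hypothetical helper: mirror the receiver choice made by the route tree.
pick_receiver() {
    case "$1" in
        critical) echo "wecom-critical" ;;
        *)        echo "wecom" ;;   # the warning route and the top-level default both use wecom
    esac
}

pick_receiver critical   # prints wecom-critical
pick_receiver warning    # prints wecom
pick_receiver info       # prints wecom (top-level default receiver)
```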

deploy/production/monitoring/docker-compose.monitoring.yml

@@ -0,0 +1,259 @@
+# ============================================================
+# Monitoring stack Docker Compose file
+#
+# Usage:
+#   docker compose -f deploy/production/monitoring/docker-compose.monitoring.yml up -d
+#
+# Components:
+#   - Prometheus (metrics collection and alerting)
+#   - Grafana (dashboards)
+#   - Node Exporter (host metrics)
+#   - cAdvisor (container metrics)
+#   - AlertManager (alert management)
+# ============================================================
+
+services:
+  # ==================== Prometheus ====================
+  prometheus:
+    image: prom/prometheus:v2.51.0
+    container_name: wm-prometheus
+    restart: always
+    command:
+      - '--config.file=/etc/prometheus/prometheus.yml'
+      - '--storage.tsdb.path=/prometheus'
+      - '--storage.tsdb.retention.time=30d'
+      - '--storage.tsdb.retention.size=10GB'
+      - '--web.console.libraries=/etc/prometheus/console_libraries'
+      - '--web.console.templates=/etc/prometheus/consoles'
+      - '--web.enable-lifecycle'
+      - '--web.enable-admin-api'
+    ports:
+      - "127.0.0.1:9090:9090"
+    volumes:
+      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
+      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml:ro
+      - prometheus_data:/prometheus
+    networks:
+      - wm-network
+      - monitoring
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 2G
+        reservations:
+          cpus: '0.25'
+          memory: 512M
+    logging:
+      driver: json-file
+      options:
+        max-size: "50m"
+        max-file: "3"
+    healthcheck:
+      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+
58
+  # ==================== Grafana ====================
59
+  grafana:
60
+    image: grafana/grafana:10.4.0
61
+    container_name: wm-grafana
62
+    restart: always
63
+    ports:
64
+      - "127.0.0.1:3000:3000"
65
+    environment:
66
+      GF_SECURITY_ADMIN_USER: ${GRAFANA_ADMIN_USER:-admin}
67
+      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:?GRAFANA_ADMIN_PASSWORD must be set}
68
+      GF_SERVER_ROOT_URL: ${GRAFANA_ROOT_URL:-http://localhost:3000}
69
+      GF_INSTALL_PLUGINS: grafana-clock-panel,grafana-simple-json-datasource
70
+      GF_USERS_ALLOW_SIGN_UP: "false"
71
+      GF_AUTH_ANONYMOUS_ENABLED: "false"
72
+      GF_SECURITY_COOKIE_SECURE: "true"
73
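+      GF_SECURITY_STRICT_TRANSPORT_SECURITY: "true"
+      # Passed through so that provisioning/datasources.yml can expand the POSTGRES_PASSWORD variable
+      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?POSTGRES_PASSWORD must be set}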
74
+    volumes:
75
+      - ./grafana/provisioning/datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml:ro
76
+      - grafana_data:/var/lib/grafana
77
+    networks:
78
+      - monitoring
79
+      - wm-network
80
+    deploy:
81
+      resources:
82
+        limits:
83
+          cpus: '0.5'
84
+          memory: 512M
85
+        reservations:
86
+          cpus: '0.1'
87
+          memory: 128M
88
+    logging:
89
+      driver: json-file
90
+      options:
91
+        max-size: "30m"
92
+        max-file: "3"
93
+    healthcheck:
94
+      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000/api/health"]
95
+      interval: 30s
96
+      timeout: 10s
97
+      retries: 3
98
+
99
+  # ==================== Node Exporter ====================
100
+  node-exporter:
101
+    image: prom/node-exporter:v1.7.0
102
+    container_name: wm-node-exporter
103
+    restart: always
104
+    command:
105
+      - '--path.rootfs=/host'
106
+      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
107
+    ports:
108
+      - "127.0.0.1:9100:9100"
109
+    volumes:
110
+      - /:/host:ro,rslave
111
+    pid: host
112
+    networks:
113
+      - monitoring
114
+    deploy:
115
+      resources:
116
+        limits:
117
+          cpus: '0.2'
118
+          memory: 128M
119
+        reservations:
120
+          cpus: '0.05'
121
+          memory: 32M
122
+    logging:
123
+      driver: json-file
124
+      options:
125
+        max-size: "10m"
126
+        max-file: "3"
127
+
128
+  # ==================== cAdvisor ====================
129
+  cadvisor:
130
+    image: gcr.io/cadvisor/cadvisor:v0.49.1
131
+    container_name: wm-cadvisor
132
+    restart: always
133
+    ports:
134
+      - "127.0.0.1:8880:8080"
135
+    volumes:
136
+      - /:/rootfs:ro
137
+      - /var/run:/var/run:ro
138
+      - /sys:/sys:ro
139
+      - /var/lib/docker/:/var/lib/docker:ro
140
+      - /dev/disk/:/dev/disk:ro
141
+    privileged: true
142
+    devices:
143
+      - /dev/kmsg:/dev/kmsg
144
+    networks:
145
+      - monitoring
146
+    deploy:
147
+      resources:
148
+        limits:
149
+          cpus: '0.3'
150
+          memory: 256M
151
+        reservations:
152
+          cpus: '0.1'
153
+          memory: 64M
154
+    logging:
155
+      driver: json-file
156
+      options:
157
+        max-size: "20m"
158
+        max-file: "3"
159
+
160
+  # ==================== AlertManager ====================
161
+  alertmanager:
162
+    image: prom/alertmanager:v0.27.0
163
+    container_name: wm-alertmanager
164
+    restart: always
165
+    command:
166
+      - '--config.file=/etc/alertmanager/alertmanager.yml'
167
+      - '--storage.path=/alertmanager'
168
+    ports:
169
+      - "127.0.0.1:9093:9093"
170
+    volumes:
171
+      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
172
+      - alertmanager_data:/alertmanager
173
+    networks:
174
+      - monitoring
175
+    deploy:
176
+      resources:
177
+        limits:
178
+          cpus: '0.2'
179
+          memory: 256M
180
+        reservations:
181
+          cpus: '0.05'
182
+          memory: 64M
183
+    logging:
184
+      driver: json-file
185
+      options:
186
+        max-size: "10m"
187
+        max-file: "3"
188
+
189
+  # ==================== PostgreSQL Exporter ====================
190
+  postgres-exporter:
191
+    image: prometheuscommunity/postgres-exporter:v0.15.0
192
+    container_name: wm-postgres-exporter
193
+    restart: always
194
+    environment:
195
+      DATA_SOURCE_NAME: "postgresql://${POSTGRES_USER:-water}:${POSTGRES_PASSWORD:?POSTGRES_PASSWORD must be set}@postgres:5432/${POSTGRES_DB:-water_management}?sslmode=disable"
196
+    ports:
197
+      - "127.0.0.1:9187:9187"
198
+    # No depends_on: postgres is not defined in this file; it lives in the main stack and is reached over the external wm-network
200
+    networks:
201
+      - monitoring
202
+      - wm-network
203
+    deploy:
204
+      resources:
205
+        limits:
206
+          cpus: '0.2'
207
+          memory: 128M
208
+        reservations:
209
+          cpus: '0.05'
210
+          memory: 32M
211
+    logging:
212
+      driver: json-file
213
+      options:
214
+        max-size: "10m"
215
+        max-file: "3"
216
+
217
+  # ==================== Redis Exporter ====================
218
+  redis-exporter:
219
+    image: oliver006/redis_exporter:v1.58.0
220
+    container_name: wm-redis-exporter
221
+    restart: always
222
+    environment:
223
+      REDIS_ADDR: redis://redis:6379
224
+      REDIS_PASSWORD: ${REDIS_PASSWORD:?REDIS_PASSWORD must be set}
225
+    ports:
226
+      - "127.0.0.1:9121:9121"
227
+    # No depends_on: redis is not defined in this file; it lives in the main stack and is reached over the external wm-network
229
+    networks:
230
+      - monitoring
231
+      - wm-network
232
+    deploy:
233
+      resources:
234
+        limits:
235
+          cpus: '0.1'
236
+          memory: 64M
237
+        reservations:
238
+          cpus: '0.02'
239
+          memory: 16M
240
+    logging:
241
+      driver: json-file
242
+      options:
243
+        max-size: "10m"
244
+        max-file: "3"
245
+
246
+# ==================== Volumes ====================
247
+volumes:
248
+  prometheus_data:
249
+  grafana_data:
250
+  alertmanager_data:
251
+
252
+# ==================== Networks ====================
253
+networks:
254
+  monitoring:
255
+    driver: bridge
256
+    name: wm-monitoring
257
+  wm-network:
258
+    external: true
259
+    name: wm-network
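
Because several services in this stack take credentials from the environment, it can help to check those variables before bringing the stack up. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the 16-character minimum follows the hint in `.env.production.example`, and the list of rejected values is only an illustrative guess at common defaults.

```bash
#!/usr/bin/env bash
# Sketch: refuse to start the stack while a secret is unset, short, or a well-known default.
# Function names and the weak-value list are assumptions, not part of the commit.

# Return 0 if the value looks acceptable, 1 otherwise.
is_strong_secret() {
    local value="$1"
    case "$value" in
        ""|admin|public|water123|minioadmin) return 1 ;;
    esac
    [ "${#value}" -ge 16 ]
}

# Check each named environment variable and print the ones that fail.
check_required_secrets() {
    local name failed=0
    for name in "$@"; do
        if ! is_strong_secret "${!name:-}"; then
            echo "weak or missing: ${name}"
            failed=1
        fi
    done
    return "$failed"
}
```

A possible use is `check_required_secrets GRAFANA_ADMIN_PASSWORD POSTGRES_PASSWORD REDIS_PASSWORD && docker compose -f deploy/production/monitoring/docker-compose.monitoring.yml up -d`, so the stack starts only when all three checks pass.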

+ 59
- 0
deploy/production/monitoring/grafana/provisioning/datasources.yml

@@ -0,0 +1,59 @@
+# ============================================================
+# Grafana data source provisioning
+# Mounted at /etc/grafana/provisioning/datasources/datasources.yml
+# ============================================================
5
+
6
+apiVersion: 1
7
+
8
+deleteDatasources: []
9
+
10
+datasources:
11
+  # ==================== Prometheus ====================
12
+  - name: Prometheus
13
+    type: prometheus
14
+    access: proxy
15
+    url: http://prometheus:9090
16
+    isDefault: true
17
+    editable: false
18
+    jsonData:
19
+      timeInterval: "15s"
20
+      httpMethod: POST
21
+      queryTimeout: "60s"
22
+    version: 1
23
+
24
+  # ==================== Loki (logs) ====================
25
+  - name: Loki
26
+    type: loki
27
+    access: proxy
28
+    url: http://loki:3100
29
+    editable: false
30
+    jsonData:
31
+      maxLines: 1000
32
+    version: 1
33
+
34
+  # ==================== PostgreSQL ====================
35
+  - name: PostgreSQL
36
+    type: postgres
37
+    access: proxy
38
+    url: postgres:5432
39
+    database: water_management
40
+    user: water
41
+    editable: false
42
+    secureJsonData:
43
+      password: "${POSTGRES_PASSWORD}"
44
+    jsonData:
45
+      sslmode: "disable"
46
+      maxOpenConns: 5
47
+      maxIdleConns: 2
48
+      connMaxLifetime: 14400
49
+    version: 1
50
+
51
+  # ==================== AlertManager ====================
52
+  - name: AlertManager
53
+    type: alertmanager
54
+    access: proxy
55
+    url: http://alertmanager:9093
56
+    editable: false
57
+    jsonData:
58
+      implementation: prometheus
59
+    version: 1

+ 158
- 0
deploy/production/monitoring/prometheus.yml

@@ -0,0 +1,158 @@
1
+# ============================================================
2
+# Prometheus configuration - production monitoring
3
+# ============================================================
4
+
5
+global:
6
+  scrape_interval: 15s
7
+  evaluation_interval: 15s
8
+  scrape_timeout: 10s
9
+  external_labels:
10
+    cluster: water-management-prod
11
+    environment: production
12
+
13
+# ==================== Alert rules ====================
14
+rule_files:
15
+  - "alert_rules.yml"
16
+
17
+# ==================== AlertManager ====================
18
+alerting:
19
+  alertmanagers:
20
+    - static_configs:
21
+        - targets:
22
+            - alertmanager:9093
23
+
24
+# ==================== Scrape targets ====================
25
+scrape_configs:
26
+  # ==================== Prometheus itself ====================
27
+  - job_name: 'prometheus'
28
+    static_configs:
29
+      - targets: ['localhost:9090']
30
+    metrics_path: /metrics
31
+
32
+  # ==================== Node Exporter (host metrics) ====================
33
+  - job_name: 'node-exporter'
34
+    static_configs:
35
+      - targets: ['node-exporter:9100']
36
+    scrape_interval: 10s
37
+    relabel_configs:
38
+      - source_labels: [__address__]
39
+        regex: '([^:]+):\d+'
40
+        target_label: instance
41
+        replacement: '${1}'
42
+
43
+  # ==================== cAdvisor (container metrics) ====================
44
+  - job_name: 'cadvisor'
45
+    static_configs:
46
+      - targets: ['cadvisor:8080']
47
+    scrape_interval: 15s
48
+    metrics_path: /metrics
49
+
50
+  # ==================== Application services (Spring Boot Actuator) ====================
51
+  - job_name: 'wm-gateway'
52
+    metrics_path: /actuator/prometheus
53
+    static_configs:
54
+      - targets: ['wm-gateway:8080']
55
+        labels:
56
+          service: gateway
57
+          module: base
58
+
59
+  - job_name: 'wm-base'
60
+    metrics_path: /actuator/prometheus
61
+    static_configs:
62
+      - targets: ['wm-base:8081']
63
+        labels:
64
+          service: base
65
+          module: system
66
+
67
+  - job_name: 'wm-iot'
68
+    metrics_path: /actuator/prometheus
69
+    static_configs:
70
+      - targets: ['wm-iot:8082']
71
+        labels:
72
+          service: iot
73
+          module: data-engine
74
+
75
+  - job_name: 'wm-data-engine'
76
+    metrics_path: /actuator/prometheus
77
+    static_configs:
78
+      - targets: ['wm-data-engine:8083']
79
+        labels:
80
+          service: data-engine
81
+          module: data-engine
82
+
83
+  - job_name: 'wm-bpm'
84
+    metrics_path: /actuator/prometheus
85
+    static_configs:
86
+      - targets: ['wm-bpm:8084']
87
+        labels:
88
+          service: bpm
89
+          module: bpm
90
+
91
+  - job_name: 'wm-production'
92
+    metrics_path: /actuator/prometheus
93
+    static_configs:
94
+      - targets: ['wm-production:8085']
95
+        labels:
96
+          service: production
97
+          module: production
98
+
99
+  - job_name: 'wm-revenue'
100
+    metrics_path: /actuator/prometheus
101
+    static_configs:
102
+      - targets: ['wm-revenue:8086']
103
+        labels:
104
+          service: revenue
105
+          module: revenue
106
+
107
+  - job_name: 'wm-patrol'
108
+    metrics_path: /actuator/prometheus
109
+    static_configs:
110
+      - targets: ['wm-patrol:8087']
111
+        labels:
112
+          service: patrol
113
+          module: patrol
114
+
115
+  - job_name: 'wm-notify'
116
+    metrics_path: /actuator/prometheus
117
+    static_configs:
118
+      - targets: ['wm-notify:8089']
119
+        labels:
120
+          service: notify
121
+          module: notify
122
+
123
+  - job_name: 'wm-job'
124
+    metrics_path: /actuator/prometheus
125
+    static_configs:
126
+      - targets: ['wm-job:8090']
127
+        labels:
128
+          service: job
129
+          module: job
130
+
131
+  # ==================== PostgreSQL ====================
132
+  - job_name: 'postgres'
133
+    static_configs:
134
+      - targets: ['postgres-exporter:9187']
135
+    scrape_interval: 30s
136
+
137
+  # ==================== Redis ====================
138
+  - job_name: 'redis'
139
+    static_configs:
140
+      - targets: ['redis-exporter:9121']
141
+    scrape_interval: 30s
142
+
143
+  # ==================== Nginx ====================
144
+  - job_name: 'nginx'
145
+    static_configs:
146
+      - targets: ['wm-frontend:9113']
147
+    scrape_interval: 15s
148
+    metrics_path: /metrics
149
+
150
+  # ==================== EMQX MQTT ====================
151
+  - job_name: 'emqx'
152
+    static_configs:
153
+      - targets: ['wm-emqx:18083']
154
+    scrape_interval: 30s
155
+    metrics_path: /api/v5/prometheus/stats
156
+    basic_auth:
157
+      username: admin
158
+      password: public  # EMQX factory default; replace with the real credential (Prometheus also supports password_file)
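
The ten Spring Boot scrape jobs above differ only in service name, port and labels, so they could be generated rather than written by hand. The following is a minimal sketch: `render_scrape_job` is a hypothetical helper, and it emits only the `service` label (derived by dropping the `wm-` prefix), not the `module` label used in the file.

```bash
#!/usr/bin/env bash
# Sketch: render one Prometheus scrape job per "name:port" pair.
# render_scrape_job is a hypothetical helper; the pairs repeat targets from prometheus.yml.

render_scrape_job() {
    local name="$1" port="$2"
    cat <<EOF
  - job_name: '${name}'
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['${name}:${port}']
        labels:
          service: ${name#wm-}
EOF
}

# Split each entry at the colon and render the job block.
for entry in wm-gateway:8080 wm-base:8081 wm-iot:8082; do
    render_scrape_job "${entry%%:*}" "${entry##*:}"
done
```

The output is meant to be pasted under `scrape_configs:` and then validated, for example with `promtool check config`.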

+ 160
- 0
deploy/production/nginx/certbot-renew.sh

@@ -0,0 +1,160 @@
+#!/bin/bash
+# ============================================================
+# Let's Encrypt certificate auto-renewal script
+#
+# Initial certificate request:
+#   certbot certonly --webroot -w /var/www/certbot \
+#     -d your-domain.com -d www.your-domain.com \
+#     --email admin@your-domain.com --agree-tos --no-eff-email
+#
+# Cron entry for automatic renewal (checks daily at 03:00):
+#   0 3 * * * /opt/water-management/deploy/production/nginx/certbot-renew.sh >> /var/log/certbot-renew.log 2>&1
+# ============================================================
13
+set -euo pipefail
14
+
15
+# ==================== Configuration ====================
16
+DOMAIN="${DOMAIN:-water.example.com}"
17
+WEBROOT="/var/www/certbot"
18
+EMAIL="${CERTBOT_EMAIL:-admin@example.com}"
19
+NGINX_CONTAINER="${NGINX_CONTAINER:-wm-frontend}"
20
+RENEW_DAYS_BEFORE_EXPIRY=30
21
+
22
+# ==================== Functions ====================
23
+
24
+log() {
25
+    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
26
+}
27
+
+# Check whether the certificate needs renewal
+check_cert_expiry() {
+    local cert_path="/etc/letsencrypt/live/${DOMAIN}/fullchain.pem"
+
+    if [ ! -f "$cert_path" ]; then
+        log "⚠️ Certificate file not found: $cert_path"
+        return 1
+    fi
+
+    local expiry_date
+    expiry_date=$(openssl x509 -enddate -noout -in "$cert_path" | cut -d= -f2)
+    local expiry_epoch
+    expiry_epoch=$(date -d "$expiry_date" +%s)
+    local now_epoch
+    now_epoch=$(date +%s)
+    local days_left=$(( (expiry_epoch - now_epoch) / 86400 ))
+
+    log "📅 Certificate expires: $expiry_date (${days_left} days left)"
+
+    if [ "$days_left" -le "$RENEW_DAYS_BEFORE_EXPIRY" ]; then
+        log "⚠️ Certificate expires in ${days_left} days; renewal required"
+        return 0
+    else
+        log "✅ Certificate is valid; no renewal needed"
+        return 1
+    fi
+}
55
+
+# Request the certificate for the first time
+init_cert() {
+    log "🔐 Requesting the initial certificate..."
+
+    # Make sure the webroot directory exists
+    mkdir -p "$WEBROOT"
+
+    # Use the webroot method (Nginx does not need to be stopped).
+    # certbot is tested directly in the if: under `set -e`, a separate
+    # `$?` check would never reach the failure branch.
+    if certbot certonly --webroot \
+        -w "$WEBROOT" \
+        -d "$DOMAIN" \
+        --email "$EMAIL" \
+        --agree-tos \
+        --no-eff-email \
+        --non-interactive; then
+        log "✅ Certificate issued successfully"
+        reload_nginx
+    else
+        log "❌ Certificate request failed"
+        exit 1
+    fi
+}
80
+
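+# Renew the certificate
+renew_cert() {
+    log "🔄 Renewing the certificate..."
+
+    # certbot is tested directly in the if so that, under `set -e`,
+    # a failure still reaches the alerting branch below.
+    if certbot renew \
+        --webroot \
+        -w "$WEBROOT" \
+        --non-interactive \
+        --quiet; then
+        log "✅ Certificate renewed successfully"
+        reload_nginx
+    else
+        log "❌ Certificate renewal failed"
+        send_alert "Certificate renewal failed; please check manually!"
+        exit 1
+    fi
+}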
100
+
+# Reload Nginx
+reload_nginx() {
+    log "🔄 Reloading the Nginx configuration..."
+
+    if docker exec "$NGINX_CONTAINER" nginx -s reload 2>/dev/null; then
+        log "✅ Nginx reloaded"
+    else
+        log "⚠️ Reload via Docker failed; trying the system service..."
+        systemctl reload nginx 2>/dev/null || nginx -s reload 2>/dev/null || true
+    fi
+}
112
+
+# Send an alert (WeCom webhook)
+send_alert() {
+    local message="$1"
+    local webhook="${WECOM_WEBHOOK:-}"
+
+    if [ -z "$webhook" ]; then
+        log "⚠️ WeCom webhook not configured; skipping alert"
+        return 0
+    fi
+
+    local payload
+    payload=$(cat <<EOF
+{
+    "msgtype": "text",
+    "text": {
+        "content": "🚨 Certificate alert [${DOMAIN}]\n${message}\nTime: $(date '+%Y-%m-%d %H:%M:%S')"
+    }
+}
+EOF
+)
+
+    # Report success only when the webhook call actually succeeded
+    if curl -fsS -X POST "$webhook" \
+        -H "Content-Type: application/json" \
+        -d "$payload" > /dev/null 2>&1; then
+        log "📤 Alert sent"
+    else
+        log "⚠️ Failed to send alert"
+    fi
+}
140
+
+# ==================== Main ====================
+
+log "========================================="
+log "Let's Encrypt certificate management - ${DOMAIN}"
+log "========================================="
+
+# Check whether a certificate already exists
+if [ ! -d "/etc/letsencrypt/live/${DOMAIN}" ]; then
+    log "📋 No certificate found; requesting one"
+    init_cert
+else
+    # Check whether renewal is due
+    if check_cert_expiry; then
+        renew_cert
+    fi
+fi
+
+log "========================================="
+log "Done"
+log "========================================="
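
The renewal decision inside `check_cert_expiry` comes down to a small piece of integer arithmetic on two Unix timestamps. The following is a minimal sketch that isolates it into pure functions so it can be checked without a certificate on disk. The function names are hypothetical; the 30-day default mirrors `RENEW_DAYS_BEFORE_EXPIRY`.

```bash
#!/usr/bin/env bash
# Sketch of the renewal decision used by check_cert_expiry, split into pure functions.
# Function names are hypothetical; the 30-day default matches RENEW_DAYS_BEFORE_EXPIRY.

# Whole days between now and expiry (both given as Unix epoch seconds).
days_left() {
    local expiry_epoch="$1" now_epoch="$2"
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Exit status 0 when renewal is due, 1 otherwise.
needs_renewal() {
    local expiry_epoch="$1" now_epoch="$2" threshold_days="${3:-30}"
    [ "$(days_left "$expiry_epoch" "$now_epoch")" -le "$threshold_days" ]
}

# Example: a certificate expiring in 10 days is inside the 30-day window.
now=$(date +%s)
if needs_renewal "$(( now + 10 * 86400 ))" "$now"; then
    echo "renewal due"
fi
```

Because the division truncates, a certificate with 30 days and a few hours left counts as 30 days and is already treated as due.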

+ 283
- 0
deploy/production/nginx/nginx.conf

@@ -0,0 +1,283 @@
+# ============================================================
+# Production Nginx configuration
+# Supports HTTPS, HTTP/2, security headers, rate limiting, WebSocket
+# ============================================================
5
+
6
+user nginx;
7
+worker_processes auto;
8
+worker_rlimit_nofile 65535;
9
+pid /run/nginx.pid;
10
+error_log /var/log/nginx/error.log warn;
11
+
12
+events {
13
+    worker_connections 4096;
14
+    multi_accept on;
15
+    use epoll;
16
+}
17
+
18
+http {
19
+    include       /etc/nginx/mime.types;
20
+    default_type  application/octet-stream;
21
+
22
+    # ==================== Log formats ====================
23
+    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
24
+                    '$status $body_bytes_sent "$http_referer" '
25
+                    '"$http_user_agent" "$http_x_forwarded_for" '
26
+                    'rt=$request_time';
27
+
28
+    log_format json escape=json '{'
29
+        '"time":"$time_iso8601",'
30
+        '"remote_addr":"$remote_addr",'
31
+        '"request":"$request",'
32
+        '"status":$status,'
33
+        '"body_bytes_sent":$body_bytes_sent,'
34
+        '"referer":"$http_referer",'
35
+        '"user_agent":"$http_user_agent",'
36
+        '"request_time":$request_time,'
37
+        '"upstream_response_time":"$upstream_response_time"'
38
+    '}';
39
+
40
+    access_log /var/log/nginx/access.log json;
41
+
42
+    # ==================== Basic tuning ====================
43
+    sendfile           on;
44
+    tcp_nopush         on;
45
+    tcp_nodelay        on;
46
+    keepalive_timeout  65;
47
+    keepalive_requests 1000;
48
+    types_hash_max_size 2048;
49
+    server_names_hash_bucket_size 128;
50
+    client_max_body_size 50m;
51
+    client_body_buffer_size 128k;
52
+
53
+    # ==================== Gzip compression ====================
54
+    gzip on;
55
+    gzip_vary on;
56
+    gzip_proxied any;
57
+    gzip_comp_level 6;
58
+    gzip_buffers 16 8k;
59
+    gzip_http_version 1.1;
60
+    gzip_min_length 1024;
61
+    gzip_types
62
+        text/plain
63
+        text/css
64
+        text/xml
65
+        text/javascript
66
+        application/json
67
+        application/javascript
68
+        application/x-javascript
69
+        application/xml
70
+        application/xml+rss
71
+        application/vnd.ms-fontobject
72
+        font/opentype
73
+        image/svg+xml
74
+        image/x-icon;
75
+
+    # ==================== Rate limiting ====================
+    # Global API limit: 30 requests per second per client IP
+    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/s;
+    # Login endpoint limit: 10 requests per minute per client IP
+    limit_req_zone $binary_remote_addr zone=login_limit:10m rate=10r/m;
+    # WebSocket connection limit (per client IP)
+    limit_conn_zone $binary_remote_addr zone=ws_conn_limit:10m;
83
+
84
+    # ==================== Upstream services ====================
85
+    upstream gateway {
86
+        server wm-gateway:8080;
87
+        keepalive 32;
88
+    }
89
+
90
+    # ==================== HTTP → HTTPS redirect ====================
91
+    server {
92
+        listen 80;
93
+        listen [::]:80;
94
+        server_name _;
95
+
96
+        # Let's Encrypt (ACME) challenge path
97
+        location /.well-known/acme-challenge/ {
98
+            root /var/www/certbot;
99
+            allow all;
100
+        }
101
+
102
+        # All other requests get a 301 redirect to HTTPS
103
+        location / {
104
+            return 301 https://$host$request_uri;
105
+        }
106
+    }
107
+
108
+    # ==================== HTTPS main server ====================
109
+    server {
110
+        listen 443 ssl http2;
111
+        listen [::]:443 ssl http2;
112
+        server_name _;
113
+
+        # ==================== SSL certificates ====================
+        # Let's Encrypt certificate paths (issued by certbot).
+        # Nginx does not expand ${DOMAIN} on its own: render this file first
+        # (for example with envsubst) or replace ${DOMAIN} with the real domain.
+        ssl_certificate     /etc/letsencrypt/live/${DOMAIN}/fullchain.pem;
+        ssl_certificate_key /etc/letsencrypt/live/${DOMAIN}/privkey.pem;
+        ssl_trusted_certificate /etc/letsencrypt/live/${DOMAIN}/chain.pem;
119
+
120
+        # SSL tuning
121
+        ssl_session_timeout 1d;
122
+        ssl_session_cache shared:SSL:50m;
123
+        ssl_session_tickets off;
124
+
125
+        # SSL protocols and cipher suites
126
+        ssl_protocols TLSv1.2 TLSv1.3;
127
+        ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384;
128
+        ssl_prefer_server_ciphers off;
129
+
130
+        # OCSP Stapling
131
+        ssl_stapling on;
132
+        ssl_stapling_verify on;
133
+        resolver 8.8.8.8 8.8.4.4 valid=300s;
134
+        resolver_timeout 5s;
135
+
136
+        # ==================== Security headers ====================
137
+        add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;
138
+        add_header X-Frame-Options "SAMEORIGIN" always;
139
+        add_header X-Content-Type-Options "nosniff" always;
140
+        add_header X-XSS-Protection "1; mode=block" always;
141
+        add_header Referrer-Policy "strict-origin-when-cross-origin" always;
142
+        add_header Permissions-Policy "camera=(), microphone=(), geolocation=()" always;
143
+        add_header Content-Security-Policy "default-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; img-src 'self' data: https:; font-src 'self' data:; connect-src 'self' wss: ws: https:; frame-ancestors 'self';" always;
144
+
145
+        # Hide the Nginx version
146
+        server_tokens off;
147
+
148
+        # ==================== Frontend static assets ====================
149
+        root /usr/share/nginx/html;
150
+        index index.html;
151
+
152
+        location / {
153
+            try_files $uri $uri/ /index.html;
154
+
155
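+            # Do not cache HTML files, so that updates take effect immediately.
+            # Note: an add_header in this location replaces the inherited server-level headers.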
156
+            location ~* \.html$ {
157
+                expires -1;
158
+                add_header Cache-Control "no-cache, no-store, must-revalidate";
159
+            }
160
+        }
161
+
162
+        # Long-term caching for static assets (hashed file names)
163
+        location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg|woff|woff2|ttf|eot|map)$ {
164
+            expires 1y;
165
+            add_header Cache-Control "public, immutable";
166
+            access_log off;
167
+        }
168
+
169
+        # ==================== API reverse proxy ====================
170
+        location /api/ {
171
+            limit_req zone=api_limit burst=50 nodelay;
172
+            limit_req_status 429;
173
+
174
+            proxy_pass http://gateway/;
175
+            proxy_http_version 1.1;
176
+            proxy_set_header Host $host;
177
+            proxy_set_header X-Real-IP $remote_addr;
178
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
179
+            proxy_set_header X-Forwarded-Proto $scheme;
180
+            proxy_set_header X-Request-ID $request_id;
181
+            proxy_set_header Connection "";
182
+
183
+            proxy_connect_timeout 30s;
184
+            proxy_send_timeout 60s;
185
+            proxy_read_timeout 120s;
186
+
187
+            proxy_buffering on;
188
+            proxy_buffer_size 4k;
189
+            proxy_buffers 8 16k;
190
+            proxy_busy_buffers_size 32k;
191
+
192
+            # Upload size limit
193
+            client_max_body_size 50m;
194
+        }
195
+
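+        # Strict rate limiting for login endpoints
+        location ~* ^/api/(auth|login|oauth) {
+            limit_req zone=login_limit burst=5 nodelay;
+            limit_req_status 429;
+
+            # Strip the /api prefix so the gateway sees the same paths as requests
+            # handled by "location /api/" (a regex location cannot carry a URI in proxy_pass).
+            rewrite ^/api/(.*)$ /$1 break;
+            proxy_pass http://gateway;
+            proxy_http_version 1.1;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+            proxy_set_header X-Forwarded-Proto $scheme;
+            proxy_set_header Connection "";
+        }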
208
+
209
+        # ==================== WebSocket proxy ====================
210
+        location /ws/ {
211
+            limit_conn ws_conn_limit 10;
212
+
213
+            proxy_pass http://gateway/ws/;
214
+            proxy_http_version 1.1;
215
+            proxy_set_header Upgrade $http_upgrade;
216
+            proxy_set_header Connection "upgrade";
217
+            proxy_set_header Host $host;
218
+            proxy_set_header X-Real-IP $remote_addr;
219
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
220
+            proxy_set_header X-Forwarded-Proto $scheme;
221
+
222
+            proxy_connect_timeout 30s;
223
+            proxy_send_timeout 7d;
224
+            proxy_read_timeout 7d;
225
+        }
226
+
227
+        # ==================== MQTT over WebSocket ====================
228
+        location /mqtt {
229
+            proxy_pass http://wm-emqx:8083/mqtt;
230
+            proxy_http_version 1.1;
231
+            proxy_set_header Upgrade $http_upgrade;
232
+            proxy_set_header Connection "upgrade";
233
+            proxy_set_header Host $host;
234
+            proxy_read_timeout 3600s;
235
+        }
236
+
237
+        # ==================== GeoServer GIS proxy ====================
238
+        location /geoserver/ {
239
+            proxy_pass http://wm-geoserver:8080/geoserver/;
240
+            proxy_set_header Host $host;
241
+            proxy_set_header X-Real-IP $remote_addr;
242
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
243
+            proxy_set_header X-Forwarded-Proto $scheme;
244
+
245
+            # GIS payloads can be large
246
+            client_max_body_size 100m;
247
+            proxy_read_timeout 300s;
248
+        }
249
+
250
+        # ==================== MinIO object storage proxy ====================
251
+        location /minio/ {
252
+            proxy_pass http://wm-minio:9000/;
253
+            proxy_set_header Host $host;
254
+            proxy_set_header X-Real-IP $remote_addr;
255
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
256
+            proxy_set_header X-Forwarded-Proto $scheme;
257
+
258
+            client_max_body_size 500m;
259
+        }
260
+
261
+        # ==================== Health check ====================
262
+        location /health {
263
+            access_log off;
+            default_type application/json;
+            return 200 '{"status":"UP"}';
266
+        }
267
+
268
+        # Deny access to hidden files
269
+        location ~ /\. {
270
+            deny all;
271
+            access_log off;
272
+            log_not_found off;
273
+        }
274
+
275
+        # Error pages
276
+        error_page 404 /index.html;
277
+        error_page 500 502 503 504 /50x.html;
278
+        location = /50x.html {
279
+            root /usr/share/nginx/html;
280
+            internal;
281
+        }
282
+    }
283
+}
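
Nginx does not expand `${DOMAIN}` in its configuration, so the certificate paths in this file have to be filled in before the file is loaded. The following is a minimal sketch of one way to do that with `sed`: `render_nginx_conf` and the file names in the usage note are hypothetical. It substitutes only the literal text `${DOMAIN}` and leaves nginx variables such as `$host` untouched. `envsubst '${DOMAIN}'` would be an alternative where gettext is installed.

```bash
#!/usr/bin/env bash
# Sketch: substitute only ${DOMAIN} in an nginx template read from stdin.
# render_nginx_conf is a hypothetical helper; it assumes the domain contains no "|" or "&".

render_nginx_conf() {
    local domain="$1"
    # [$] matches a literal dollar sign, so nginx variables like $host pass through unchanged.
    sed "s|[\$]{DOMAIN}|${domain}|g"
}

# Example: render a single directive.
printf '%s\n' 'ssl_certificate /etc/letsencrypt/live/${DOMAIN}/fullchain.pem;' \
    | render_nginx_conf "water.example.com"
```

In a deployment this might be run as `render_nginx_conf "$DOMAIN" < nginx.conf.template > nginx.conf`, followed by `nginx -t` to validate the result.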

+ 379
- 0
deploy/production/server-setup.sh

@@ -0,0 +1,379 @@
+#!/bin/bash
+# ============================================================
+# Production server initialization script
+#
+# What it does:
+#   - System update
+#   - Docker installation
+#   - Firewall configuration (ufw)
+#   - SSH hardening
+#   - User creation
+#   - Directory structure creation
+#
+# Usage: sudo ./server-setup.sh
+# ============================================================
15
+set -euo pipefail
16
+
17
+# ==================== Configuration ====================
18
+DEPLOY_USER="${DEPLOY_USER:-deploy}"
19
+APP_DIR="${APP_DIR:-/opt/water-management}"
20
+BACKUP_DIR="${BACKUP_DIR:-/opt/water-management/backups}"
21
+LOG_DIR="${LOG_DIR:-/opt/water-management/logs}"
22
+
23
+# ==================== Functions ====================
24
+
25
+log() {
26
+    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
27
+}
28
+
29
+check_root() {
30
+    if [ "$(id -u)" -ne 0 ]; then
31
+        log "❌ Run this script as root or with sudo"
32
+        exit 1
33
+    fi
34
+}
35
+
36
+# ==================== System update ====================
37
+update_system() {
38
+    log "📦 Updating system packages..."
39
+    
40
+    if command -v apt-get &>/dev/null; then
41
+        apt-get update -y
42
+        apt-get upgrade -y
43
+        apt-get install -y \
44
+            curl \
45
+            wget \
46
+            git \
47
+            unzip \
48
+            software-properties-common \
49
+            apt-transport-https \
50
+            ca-certificates \
51
+            gnupg \
52
+            lsb-release \
53
+            htop \
54
+            iotop \
55
+            jq \
56
+            openssl \
57
+            fail2ban \
58
+            ufw
59
+    elif command -v yum &>/dev/null; then
60
+        yum update -y
61
+        yum install -y \
62
+            curl \
63
+            wget \
64
+            git \
65
+            unzip \
66
+            yum-utils \
67
+            htop \
68
+            iotop \
69
+            jq \
70
+            openssl \
71
+            fail2ban \
72
+            firewalld
73
+    else
74
+        log "⚠️ Unknown package manager; install the dependencies manually"
75
+    fi
76
+    
77
+    log "✅ System update complete"
78
+}
79
+
80
+# ==================== Docker installation ====================
81
+install_docker() {
82
+    if command -v docker &>/dev/null; then
83
+        log "✅ Docker already installed: $(docker --version)"
84
+        return 0
85
+    fi
86
+    
87
+    log "🐳 Installing Docker..."
88
+    
89
+    # Use the official install script
90
+    curl -fsSL https://get.docker.com | sh
91
+    
92
+    # Enable and start Docker
93
+    systemctl enable docker
94
+    systemctl start docker
95
+    
96
+    # Configure Docker registry mirrors (mainland China)
97
+    mkdir -p /etc/docker
98
+    cat > /etc/docker/daemon.json <<EOF
99
+{
100
+    "registry-mirrors": [
101
+        "https://docker.mirrors.ustc.edu.cn",
102
+        "https://hub-mirror.c.163.com"
103
+    ],
104
+    "log-driver": "json-file",
105
+    "log-opts": {
106
+        "max-size": "50m",
107
+        "max-file": "3"
108
+    },
109
+    "storage-driver": "overlay2",
110
+    "live-restore": true,
111
+    "default-ulimits": {
112
+        "nofile": {
113
+            "Name": "nofile",
114
+            "Hard": 65535,
115
+            "Soft": 65535
116
+        }
117
+    }
118
+}
119
+EOF
120
+    
121
+    systemctl restart docker
122
+    
+    # Install the standalone Docker Compose binary
+    COMPOSE_VERSION=$(curl -fsSL https://api.github.com/repos/docker/compose/releases/latest | grep -oP '"tag_name": "\K(.*)(?=")')
+    # Compose v2 release assets use a lowercase OS name (for example docker-compose-linux-x86_64)
+    curl -fSL "https://github.com/docker/compose/releases/download/${COMPOSE_VERSION}/docker-compose-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m)" \
+        -o /usr/local/bin/docker-compose
+    chmod +x /usr/local/bin/docker-compose
+    ln -sf /usr/local/bin/docker-compose /usr/bin/docker-compose
+
+    log "✅ Docker installed: $(docker --version)"
+    log "✅ Docker Compose installed: $(docker-compose --version)"
132
+}
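Resolving the Compose version from the GitHub `latest` endpoint means the installed version depends on when the script runs. The following is a minimal sketch of an alternative, assuming a pinned version is acceptable for this deployment. The helper name `compose_asset_url` and the version `v2.24.0` are illustrative and not part of this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the download URL for a standalone Compose binary.
# Arguments: version tag, OS name (as printed by `uname -s`), architecture (`uname -m`).
compose_asset_url() {
    local version="$1" os="$2" arch="$3"
    # Compose v2 assets are published with a lowercase OS name,
    # so normalise e.g. "Linux" to "linux".
    os=$(printf '%s' "$os" | tr '[:upper:]' '[:lower:]')
    printf 'https://github.com/docker/compose/releases/download/%s/docker-compose-%s-%s\n' \
        "$version" "$os" "$arch"
}

# Example: print the URL for the current machine with an illustrative pinned version.
compose_asset_url "v2.24.0" "$(uname -s)" "$(uname -m)"
```

Pinning trades automatic upgrades for reproducible installs; the version would then be bumped deliberately alongside other deployment changes.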
+
+# ==================== Firewall configuration ====================
+setup_firewall() {
+    log "🔥 Configuring firewall..."
+
+    if command -v ufw &>/dev/null; then
+        # Ubuntu/Debian - UFW
+        ufw default deny incoming
+        ufw default allow outgoing
+
+        # SSH: "limit" allows the port and throttles repeated connection attempts,
+        # so a separate "allow" rule is not needed
+        ufw limit 22/tcp comment "SSH"
+
+        # HTTP/HTTPS
+        ufw allow 80/tcp comment "HTTP"
+        ufw allow 443/tcp comment "HTTPS"
+
+        # MQTT (IoT device access)
+        ufw allow 1883/tcp comment "MQTT"
+
+        # Internal service ports (internal network only; adjust to your IP range)
+        # ufw allow from 10.0.0.0/8 to any port 8080
+        # ufw allow from 10.0.0.0/8 to any port 8848
+
+        ufw --force enable
+        log "✅ UFW firewall enabled"
+
+    elif command -v firewall-cmd &>/dev/null; then
+        # CentOS/RHEL - firewalld
+        systemctl enable firewalld
+        systemctl start firewalld
+
+        firewall-cmd --permanent --add-service=ssh
+        firewall-cmd --permanent --add-service=http
+        firewall-cmd --permanent --add-service=https
+        firewall-cmd --permanent --add-port=1883/tcp
+        firewall-cmd --reload
+
+        log "✅ firewalld configured"
+    fi
+}
+
+# ==================== SSH hardening ====================
+harden_ssh() {
+    log "🔒 Hardening SSH..."
+
+    local sshd_config="/etc/ssh/sshd_config"
+    local backup
+    backup="${sshd_config}.bak.$(date +%Y%m%d)"
+
+    # Back up the original config
+    cp "$sshd_config" "$backup"
+
+    # Disable root password login (key-based root login stays allowed)
+    sed -i 's/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/' "$sshd_config"
+
+    # Disable empty passwords
+    sed -i 's/^#*PermitEmptyPasswords.*/PermitEmptyPasswords no/' "$sshd_config"
+
+    # Enable public key authentication
+    sed -i 's/^#*PubkeyAuthentication.*/PubkeyAuthentication yes/' "$sshd_config"
+
+    # Disable password login (enable only after a public key is in place)
+    # sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' "$sshd_config"
+
+    # Limit authentication attempts
+    sed -i 's/^#*MaxAuthTries.*/MaxAuthTries 3/' "$sshd_config"
+
+    # Client keepalive checks
+    sed -i 's/^#*ClientAliveInterval.*/ClientAliveInterval 300/' "$sshd_config"
+    sed -i 's/^#*ClientAliveCountMax.*/ClientAliveCountMax 2/' "$sshd_config"
+
+    # Validate before restarting so a bad edit cannot lock us out
+    if ! sshd -t; then
+        log "❌ sshd_config validation failed; restoring backup"
+        cp "$backup" "$sshd_config"
+        return 1
+    fi
+
+    # The unit is named "sshd" on RHEL/CentOS and "ssh" on Debian/Ubuntu
+    systemctl restart sshd || systemctl restart ssh
+
+    log "✅ SSH hardening complete"
+}
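The `sed` substitutions in `harden_ssh` only change a directive that already appears in the file, commented or not; if the directive is absent, nothing is written. A minimal sketch of a helper that falls back to appending the directive is shown here. The name `set_sshd_option` is hypothetical, and the demo runs against a scratch file rather than the real `sshd_config`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set "key value" in an sshd_config-style file.
# Replaces an existing (possibly commented-out) directive, or appends it if missing.
# Assumes neither key nor value contains the "|" character used as the sed delimiter.
set_sshd_option() {
    local key="$1" value="$2" file="$3" tmp
    if grep -qE "^#*[[:space:]]*${key}[[:space:]]" "$file"; then
        tmp=$(mktemp)
        sed -E "s|^#*[[:space:]]*${key}[[:space:]].*|${key} ${value}|" "$file" > "$tmp"
        # Write back through the original path to keep its owner and permissions.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        printf '%s %s\n' "$key" "$value" >> "$file"
    fi
}

# Demo on a scratch file: one directive exists (commented), one is missing.
demo=$(mktemp)
printf '#MaxAuthTries 6\nPort 22\n' > "$demo"
set_sshd_option MaxAuthTries 3 "$demo"
set_sshd_option ClientAliveInterval 300 "$demo"
cat "$demo"
rm -f "$demo"
```

Running `sshd -t` after such edits remains a sensible final check before restarting the service.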
+
+# ==================== Fail2Ban configuration ====================
+setup_fail2ban() {
+    log "🛡️ Configuring Fail2Ban..."
+
+    cat > /etc/fail2ban/jail.local <<EOF
+[DEFAULT]
+bantime = 3600
+findtime = 600
+maxretry = 5
+# The systemd backend reads from the journal, so no logpath is set
+backend = systemd
+
+[sshd]
+enabled = true
+port = ssh
+filter = sshd
+maxretry = 3
+bantime = 86400
+EOF
+
+    systemctl enable fail2ban
+    systemctl restart fail2ban
+
+    log "✅ Fail2Ban configured"
+}
+
+# ==================== Create deploy user ====================
+create_deploy_user() {
+    log "👤 Creating deploy user: ${DEPLOY_USER}"
+
+    if id "$DEPLOY_USER" &>/dev/null; then
+        log "✅ User already exists: ${DEPLOY_USER}"
+    else
+        useradd -m -s /bin/bash "$DEPLOY_USER"
+        log "✅ User created: ${DEPLOY_USER}"
+    fi
+
+    # Add the user to the docker group whether or not the account already existed
+    usermod -aG docker "$DEPLOY_USER"
+
+    # Grant passwordless sudo for docker commands only
+    cat > "/etc/sudoers.d/${DEPLOY_USER}" <<EOF
+${DEPLOY_USER} ALL=(ALL) NOPASSWD: /usr/bin/docker, /usr/bin/docker-compose, /usr/local/bin/docker-compose
+EOF
+    chmod 440 "/etc/sudoers.d/${DEPLOY_USER}"
+
+    # Create the SSH directory (home is read from the passwd database instead of eval)
+    local user_home
+    user_home=$(getent passwd "$DEPLOY_USER" | cut -d: -f6)
+    mkdir -p "${user_home}/.ssh"
+    chmod 700 "${user_home}/.ssh"
+    chown -R "${DEPLOY_USER}:${DEPLOY_USER}" "${user_home}/.ssh"
+
+    log "✅ Deploy user configured"
+    log "⚠️  Add your public key to ${user_home}/.ssh/authorized_keys"
+}
+
+# ==================== Create directory structure ====================
+create_directories() {
+    log "📁 Creating application directories..."
+
+    mkdir -p "${APP_DIR}"
+    mkdir -p "${APP_DIR}/config"
+    mkdir -p "${APP_DIR}/config/postgres"
+    mkdir -p "${APP_DIR}/config/nginx"
+    mkdir -p "${APP_DIR}/config/certs"
+    mkdir -p "${BACKUP_DIR}/daily"
+    mkdir -p "${BACKUP_DIR}/weekly"
+    mkdir -p "${BACKUP_DIR}/monthly"
+    mkdir -p "${LOG_DIR}"
+
+    # The cron jobs run as ${DEPLOY_USER} and write to the backup and log
+    # directories, so that user must own them as well
+    chown -R "${DEPLOY_USER}:${DEPLOY_USER}" "${APP_DIR}" "${BACKUP_DIR}" "${LOG_DIR}"
+
+    log "✅ Directory structure created"
+}
+
+# ==================== Kernel tuning ====================
+optimize_kernel() {
+    log "⚙️ Tuning kernel parameters..."
+
+    cat > /etc/sysctl.d/99-water-management.conf <<EOF
+# TCP tuning
+net.core.somaxconn = 65535
+net.ipv4.tcp_max_syn_backlog = 65535
+net.ipv4.tcp_syncookies = 1
+net.ipv4.tcp_tw_reuse = 1
+net.ipv4.ip_local_port_range = 1024 65535
+
+# File descriptors
+fs.file-max = 655350
+fs.inotify.max_user_watches = 524288
+
+# Memory tuning
+vm.swappiness = 10
+vm.dirty_ratio = 15
+vm.dirty_background_ratio = 5
+
+# TCP keepalive
+net.ipv4.tcp_keepalive_time = 600
+net.ipv4.tcp_keepalive_intvl = 30
+net.ipv4.tcp_keepalive_probes = 3
+EOF
+
+    sysctl -p /etc/sysctl.d/99-water-management.conf
+
+    # File descriptor and process limits
+    cat > /etc/security/limits.d/99-water-management.conf <<EOF
+* soft nofile 65535
+* hard nofile 65535
+* soft nproc 65535
+* hard nproc 65535
+${DEPLOY_USER} soft nofile 65535
+${DEPLOY_USER} hard nofile 65535
+EOF
+
+    log "✅ Kernel tuning complete"
+}
+
+# ==================== Scheduled jobs ====================
+setup_cron() {
+    log "⏰ Configuring scheduled jobs..."
+
+    local cron_file="/etc/cron.d/water-management"
+
+    cat > "$cron_file" <<EOF
+# Database backup - daily at 02:00
+0 2 * * * ${DEPLOY_USER} ${APP_DIR}/deploy/production/backup/backup-db.sh >> ${LOG_DIR}/backup.log 2>&1
+
+# Certificate renewal - daily at 03:00
+0 3 * * * root ${APP_DIR}/deploy/production/nginx/certbot-renew.sh >> ${LOG_DIR}/certbot-renew.log 2>&1
+
+# Log cleanup - daily at 04:00
+0 4 * * * ${DEPLOY_USER} find ${LOG_DIR} -name "*.log" -mtime +30 -delete 2>/dev/null || true
+
+# Disk monitoring - hourly (mail is sent only when a filesystem exceeds 90%)
+0 * * * * ${DEPLOY_USER} out=\$(df -h | awk 'NR>1 && +\$5>90 {print "⚠️ Disk alert: "\$0}'); [ -n "\$out" ] && echo "\$out" | mail -s "Disk alert" root 2>/dev/null || true
+EOF
+
+    chmod 644 "$cron_file"
+
+    log "✅ Scheduled jobs configured"
+}
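The hourly disk check filters `df` output with `awk` directly inside the cron line, which makes the threshold logic hard to test. A minimal sketch of the same filter as a standalone function follows. The name `disk_alerts` is hypothetical, and the sketch assumes POSIX-format `df -P` output, where the fifth column is the usage percentage.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` style output on stdin and print one alert
# line per filesystem whose usage percentage exceeds the threshold (default 90).
disk_alerts() {
    local threshold="${1:-90}"
    # "$5 + 0" turns a value such as "95%" into the number 95; NR > 1 skips the header.
    awk -v limit="$threshold" 'NR > 1 && $5 + 0 > limit { print "Disk alert: " $0 }'
}

# Only produce output (and, in a cron job, only send mail) when there is something to report.
alerts=$(df -P | disk_alerts 90)
if [ -n "$alerts" ]; then
    printf '%s\n' "$alerts"
fi
```

Checking for non-empty output before notifying avoids an alert message every hour when all filesystems are below the threshold.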
+
+# ==================== Main flow ====================
+
+log "========================================="
+log " Production server initialization"
+log "========================================="
+
+check_root
+update_system
+install_docker
+setup_firewall
+harden_ssh
+setup_fail2ban
+create_deploy_user
+create_directories
+optimize_kernel
+setup_cron
+
+log "========================================="
+log " ✅ Server initialization complete!"
+log "========================================="
+log ""
+log "Next steps:"
+log "1. Add your SSH public key to /home/${DEPLOY_USER}/.ssh/authorized_keys"
+log "2. Configure the .env environment file: ${APP_DIR}/.env"
+log "3. Configure DNS records for the domain"
+log "4. Request a Let's Encrypt certificate"
+log "5. Deploy the application: docker compose -f docker-compose.yml -f deploy/production/docker-compose.override.yml up -d"
+log "6. Start the monitoring stack: docker compose -f deploy/production/monitoring/docker-compose.monitoring.yml up -d"
+log "7. Start the logging stack: docker compose -f deploy/production/logging/docker-compose.logging.yml up -d"
+log ""