feat: package system upgrade - Worker refactor, data reset, documentation and conventions update
All checks were successful
Build and deploy to test environment (no SSH) / build-and-deploy (push) Successful in 6m54s
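- Refactor the Worker startup flow and introduce a bootstrap module that manages dependency injection in one place
- Implement the package data reset service (daily/monthly/yearly reset cycles)
- Add package activation queuing, add-on pack binding, and logic for stockpiled packages pending real-name activation
- Add idempotent duplicate protection for order creation (Redis business key + distributed lock)
- Update AGENTS.md/CLAUDE.md: add commenting and idempotency conventions, remove the testing requirement
- Add full documentation for the package system upgrade (API docs, usage guide, feature summary, operations guide)
- Archive the OpenSpec package-system-upgrade change and sync specs to the main directory
- Add the queue types abstraction and Redis constant definitions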
279
docs/package-system-upgrade/运维指南.md
Normal file
@@ -0,0 +1,279 @@
# Package System Upgrade - Operations Guide

## Monitoring Metrics

### Asynq Queue Monitoring

| Metric | Description | Normal range | Alert threshold |
|------|------|---------|---------|
| `asynq_queue_size{queue="default"}` | Default queue length | < 100 | > 1000 |
| `asynq_queue_latency_seconds` | Task processing latency | < 5s | > 30s |
| `asynq_processed_total` | Number of processed tasks | Steadily increasing | - |
| `asynq_failed_total` | Number of failed tasks | Close to 0 | > 10/min |

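### Package Activation Monitoring

| Metric | Description | Normal range | Alert threshold |
|------|------|---------|---------|
| Queued package activation latency | Time from the main package expiring to the next package being activated | < 30s | > 1min |
| Real-name activation latency | Time from real-name verification completing to the package being activated | < 30s | > 1min |
| Pending-activation backlog | Number of packages with `status=0` | Normal fluctuation | Sustained growth |
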
### API Performance Monitoring

| Metric | Endpoint | Normal range | Alert threshold |
|------|------|---------|---------|
| Response time P95 | `/api/h5/packages/my-usage` | < 100ms | > 200ms |
| Response time P99 | `/api/h5/packages/my-usage` | < 200ms | > 500ms |
| Response time P95 | `/api/admin/package-usage/:id/daily-records` | < 150ms | > 300ms |

### Database Monitoring

| Metric | Description | Normal range | Alert threshold |
|------|------|---------|---------|
| Data reset execution time | Duration of a single reset batch | < 5s | > 10s |
| Package table row growth | Rows added to `tb_package_usage` per day | Normal fluctuation | Abnormal growth |
| Daily record table row count | `tb_package_usage_daily_record` | Steady growth | - |

---

## Alert Rules

### Example Prometheus Alert Rules

```yaml
groups:
  - name: package_system_alerts
    rules:
      # Package activation latency alert
      - alert: PackageActivationDelayHigh
        expr: histogram_quantile(0.95, rate(package_activation_duration_seconds_bucket[5m])) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Package activation latency too high"
          description: "Package activation P95 latency exceeds 1 minute, current value: {{ $value }}s"

      # Asynq queue backlog alert
      - alert: AsynqQueueBacklog
        expr: asynq_queue_size{queue="default"} > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Asynq task queue backlog"
          description: "Default queue holds more than 1000 tasks, current value: {{ $value }}"

      # Task failure rate alert (rate() is per second, not a percentage)
      - alert: AsynqTaskFailureRateHigh
        expr: rate(asynq_failed_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Asynq task failure rate too high"
          description: "Task failure rate exceeds 0.1 tasks/s, current value: {{ $value }}/s"

      # API response time alert
      - alert: PackageAPILatencyHigh
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{path=~"/api/h5/packages.*"}[5m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Package API response time too high"
          description: "P95 response time of package-related APIs exceeds 200ms"

      # Data reset execution time alert
      - alert: DataResetDurationHigh
        expr: package_data_reset_duration_seconds > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Data reset execution time too long"
          description: "A data reset batch took longer than 10 seconds"
```

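One detail worth checking when tuning these rules is the unit. PromQL's `rate()` returns a per-second value, so the `0.1` in `AsynqTaskFailureRateHigh` means 0.1 failed tasks per second, whereas the monitoring table states the `asynq_failed_total` threshold per minute (`> 10/min`). Converted, 10/min is roughly 0.167/s and 0.1/s is 6/min, so the two thresholds are not the same. The sketch below does that conversion; the function names `per_min_to_per_sec` and `per_sec_to_per_min` are invented for illustration and are not part of the project.

```bash
#!/usr/bin/env bash
# Convert a per-minute event count into the per-second rate that
# PromQL's rate() reports. LC_ALL=C keeps "." as the decimal separator.
per_min_to_per_sec() {
  LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.4f\n", n / 60 }'
}

# Convert a per-second rate back into events per minute.
per_sec_to_per_min() {
  LC_ALL=C awk -v r="$1" 'BEGIN { printf "%.1f\n", r * 60 }'
}

per_min_to_per_sec 10    # table threshold "> 10/min" as a per-second rate
per_sec_to_per_min 0.1   # rule threshold "> 0.1" as a per-minute count
```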
---

## Rollback Plan

### Scenario 1: Code Rollback

**Trigger conditions**:
- API endpoints misbehaving
- Business logic errors
- Severe performance degradation

**Rollback steps**:

```bash
# 1. Check out the previous stable version
git checkout <previous-stable-version-tag>

# 2. Rebuild the image
make build-docker

# 3. Redeploy
kubectl rollout restart deployment/cmp-api
kubectl rollout restart deployment/cmp-worker

# 4. Verify that the service is healthy
curl -s http://api-host/health | jq
```

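Step 4 checks the health endpoint once, and right after a rollout restart the pods may still be starting. A minimal sketch of a retry helper follows. `wait_until` is a hypothetical helper, not part of the project's tooling, and the attempt count and delay in the example comment are arbitrary.

```bash
#!/usr/bin/env bash
# wait_until: run a command repeatedly until it succeeds or attempts run out.
# Usage: wait_until <max_attempts> <sleep_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt failed.
wait_until() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only if another attempt will follow.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (not run here): poll the health endpoint for up to about a minute.
# wait_until 30 2 curl -fsS http://api-host/health
```

The same helper can wrap any check that signals success through its exit status. For the deployments themselves, `kubectl rollout status deployment/cmp-api --timeout=120s` already blocks until the rollout finishes or times out, so it may be enough without a retry loop.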
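**Notes**:
- A code rollback does not roll back database migrations
- Make sure the old code is compatible with the new database schema
- New columns use default values, so they do not affect the old code
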
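### Scenario 2: Database Rollback

**Trigger conditions**:
- The migration script is faulty
- Data corruption
- The feature must be reverted completely

**Prerequisites**:
- Confirm that the database has been backed up
- Confirm that the code has been rolled back to a compatible version
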
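For the backup prerequisite, one possible approach is a `pg_dump` in custom format taken just before the migration is rolled back. This is only a sketch: the helper names `backup_file_name` and `backup_database` and the file naming scheme are assumptions, and it reuses the `$DB_HOST`, `$DB_USER` and `$DB_NAME` variables that appear in the rollback steps.

```bash
#!/usr/bin/env bash
# backup_file_name: build a timestamped dump file name for a database.
backup_file_name() {
  local db_name=$1
  printf '%s_pre_rollback_%s.dump\n' "$db_name" "$(date +%Y%m%d_%H%M%S)"
}

# backup_database: dump the database in pg_dump's custom format (-Fc),
# which pg_restore can read later. Prints the dump file name on success.
backup_database() {
  local out
  out=$(backup_file_name "$DB_NAME")
  pg_dump -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -Fc -f "$out" && echo "$out"
}

# Example (not run here, needs a reachable database):
# backup_database
```

A dump produced this way can be restored with `pg_restore`. Whether a logical dump is fast enough depends on the database size, so an existing snapshot or backup job may be the better fit.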
**Rollback steps**:

```bash
# 1. Stop the API and Worker services
kubectl scale deployment/cmp-api --replicas=0
kubectl scale deployment/cmp-worker --replicas=0

# 2. Run the database rollback
make migrate-down STEPS=1

# 3. Verify the database schema
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "\d tb_package"
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "\d tb_package_usage"

# 4. Restart the services
kubectl scale deployment/cmp-api --replicas=3
kubectl scale deployment/cmp-worker --replicas=2
```

**Rollback script location**:
`migrations/000055_package_system_upgrade.down.sql`

### Scenario 3: Data Repair

**Case 1: Abnormal package status**

```sql
-- Find packages in an abnormal state (status = 1 but already past expires_at)
SELECT id, status, activated_at, expires_at
FROM tb_package_usage
WHERE status = 1 AND expires_at < NOW();

-- Fix: mark the expired packages as expired
UPDATE tb_package_usage
SET status = 3, updated_at = NOW()
WHERE status = 1 AND expires_at < NOW();
```

**Case 2: Add-on pack not invalidated correctly**

```sql
-- Find add-on packs that are still active although their main package has expired
SELECT pu.id, pu.status, pu.master_usage_id, master.status AS master_status
FROM tb_package_usage pu
JOIN tb_package_usage master ON pu.master_usage_id = master.id
WHERE pu.status = 1 AND master.status = 3;

-- Fix: mark these add-on packs as invalidated
UPDATE tb_package_usage
SET status = 4, updated_at = NOW()
WHERE id IN (
    SELECT pu.id
    FROM tb_package_usage pu
    JOIN tb_package_usage master ON pu.master_usage_id = master.id
    WHERE pu.status = 1 AND master.status = 3
);
```

**Case 3: Wrong data reset time**

```sql
-- Find packages whose next reset time is abnormal
SELECT id, data_reset_cycle, next_reset_at
FROM tb_package_usage
WHERE data_reset_cycle = 'daily' AND next_reset_at < NOW() - INTERVAL '1 day';

-- Fix: recompute the next reset time
UPDATE tb_package_usage
SET next_reset_at = DATE_TRUNC('day', NOW()) + INTERVAL '1 day',
    updated_at = NOW()
WHERE data_reset_cycle = 'daily' AND next_reset_at < NOW() - INTERVAL '1 day';
```

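The repair statement for case 3 only covers the daily cycle. If the monthly and yearly cycles follow the same calendar-boundary rule, the statement could be generated per cycle, as sketched below. This rests on assumptions: the guide only shows the value `'daily'`, so `'monthly'` and `'yearly'` as `data_reset_cycle` values, the boundary rule for those cycles, and the helper names `cycle_unit` and `repair_sql` are all hypothetical. The function only prints SQL, so the output can be reviewed before it is passed to `psql`.

```bash
#!/usr/bin/env bash
# cycle_unit: map a reset cycle to the matching PostgreSQL date unit.
# 'monthly' and 'yearly' are assumed values; only 'daily' appears in the guide.
cycle_unit() {
  case "$1" in
    daily)   echo day ;;
    monthly) echo month ;;
    yearly)  echo year ;;
    *)       echo "unknown cycle: $1" >&2; return 1 ;;
  esac
}

# repair_sql: print the UPDATE that moves stale next_reset_at values to the
# start of the next period. For 'daily' it matches the statement in case 3.
repair_sql() {
  local cycle=$1 unit
  unit=$(cycle_unit "$cycle") || return 1
  printf "UPDATE tb_package_usage SET next_reset_at = DATE_TRUNC('%s', NOW()) + INTERVAL '1 %s', updated_at = NOW() WHERE data_reset_cycle = '%s' AND next_reset_at < NOW() - INTERVAL '1 %s';\n" \
    "$unit" "$unit" "$cycle" "$unit"
}

repair_sql monthly   # prints the statement; nothing is executed
```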
---

## Routine Operations

### Manually Triggering a Data Reset

```bash
# Trigger through the API
curl -X POST http://api-host/api/admin/internal/trigger-data-reset \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

### Checking Asynq Queue Status

```bash
# Show the queue overview
asynq stats

# List pending tasks (subcommand syntax of recent asynq CLI versions)
asynq task ls --queue=default --state=pending

# List failed (archived) tasks
asynq task ls --queue=default --state=archived
```

### Retrying Failed Tasks

```bash
# Retry all failed (archived) tasks in the default queue
asynq task runall --queue=default --state=archived

# Retry a specific task
asynq task run --queue=default --id=<task_id>
```

---

## Capacity Planning

### Data Growth Estimate

| Table | Daily growth | Monthly growth | Yearly growth |
|----|---------|--------|--------|
| `tb_package_usage` | ~1,000 rows | ~30,000 rows | ~360,000 rows |
| `tb_package_usage_daily_record` | ~10,000 rows | ~300,000 rows | ~3,600,000 rows |
| `tb_card_daily_usage` | ~10,000 rows | ~300,000 rows | ~3,600,000 rows |

### Storage Estimate

| Table | Row size | Yearly storage |
|----|---------|---------|
| `tb_package_usage_daily_record` | ~100 bytes | ~360 MB |
| `tb_card_daily_usage` | ~80 bytes | ~288 MB |

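The yearly figures follow from rows per day × days per year × bytes per row. The sketch below reproduces them and makes it easy to plug in other growth rates. `yearly_storage_mb` is an illustrative name, and the 360-day default mirrors the 30-day months used in the growth table.

```bash
#!/usr/bin/env bash
# yearly_storage_mb: rows/day * days/year * bytes/row, in decimal megabytes.
# days defaults to 360 (12 months of 30 days), as in the growth table.
yearly_storage_mb() {
  local rows_per_day=$1 bytes_per_row=$2 days=${3:-360}
  echo $(( rows_per_day * days * bytes_per_row / 1000000 ))
}

yearly_storage_mb 10000 100   # tb_package_usage_daily_record
yearly_storage_mb 10000 80    # tb_card_daily_usage
```

These numbers cover raw row data only; indexes and per-row storage overhead would come on top, so the real footprint is likely to be larger.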
### Cleanup Strategy

```sql
-- Delete daily records older than 180 days (optional)
DELETE FROM tb_package_usage_daily_record
WHERE date < NOW() - INTERVAL '180 days';

DELETE FROM tb_card_daily_usage
WHERE usage_date < NOW() - INTERVAL '180 days';
```
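
On tables with millions of rows, a single `DELETE` can run for a long time. One option is to delete in batches, sketched below for PostgreSQL (the database the `psql` commands in this guide point to). The helper names `batch_delete_sql` and `purge_in_batches` and the batch size are assumptions. Because PostgreSQL has no `DELETE ... LIMIT`, the batch is selected by `ctid` in a subquery.

```bash
#!/usr/bin/env bash
# batch_delete_sql: build a DELETE that removes at most <batch> rows older
# than <days> days. Args: table, date column, retention days, batch size.
batch_delete_sql() {
  local table=$1 date_column=$2 days=$3 batch=$4
  printf "DELETE FROM %s WHERE ctid IN (SELECT ctid FROM %s WHERE %s < NOW() - INTERVAL '%s days' LIMIT %s);\n" \
    "$table" "$table" "$date_column" "$days" "$batch"
}

# purge_in_batches: repeat the batched DELETE until psql reports "DELETE 0".
# Uses the same $DB_HOST/$DB_USER/$DB_NAME variables as the rollback steps.
purge_in_batches() {
  local sql result
  sql=$(batch_delete_sql "$@")
  while :; do
    result=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "$sql") || return 1
    case "$result" in
      *"DELETE 0"*) break ;;
    esac
    sleep 1   # brief pause between batches to limit load
  done
}

# Example (not run here, needs a reachable database):
# purge_in_batches tb_card_daily_usage usage_date 180 5000
```

Calling `batch_delete_sql` on its own prints the statement, which is a convenient way to review it before running the loop.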