# Package System Upgrade - Operations Guide

## Monitoring Metrics

### Asynq Queue Monitoring

| Metric | Description | Normal range | Alert threshold |
|------|------|---------|---------|
| `asynq_queue_size{queue="default"}` | Length of the default queue | < 100 | > 1000 |
| `asynq_queue_latency_seconds` | Task processing latency | < 5s | > 30s |
| `asynq_processed_total` | Number of tasks processed | Steadily increasing | - |
| `asynq_failed_total` | Number of failed tasks | Close to 0 | > 10/min |

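As a rough illustration of how the normal ranges and alert thresholds above might be applied to sampled values, the sketch below sorts a reading into `normal`, `elevated`, or `alert`. The `Threshold` type, the `classify` helper, and the dictionary keys are hypothetical names introduced for this example; the table's "close to 0" is approximated here as "fewer than 1 failure per minute", which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Bounds for one metric (hypothetical helper type, not part of Asynq)."""
    normal_below: float  # readings under this value are in the normal range
    alert_above: float   # readings over this value should raise an alert


# Values copied from the table above; the keys are illustrative labels.
THRESHOLDS = {
    "queue_size": Threshold(normal_below=100, alert_above=1000),
    "queue_latency_seconds": Threshold(normal_below=5, alert_above=30),
    # "Close to 0" is interpreted as fewer than 1 failure per minute (assumption).
    "failed_per_minute": Threshold(normal_below=1, alert_above=10),
}


def classify(metric: str, value: float) -> str:
    """Return 'normal', 'elevated', or 'alert' for a single reading."""
    threshold = THRESHOLDS[metric]
    if value > threshold.alert_above:
        return "alert"
    if value < threshold.normal_below:
        return "normal"
    return "elevated"
```

Readings between the two bounds are reported as `elevated`, so they can be watched without paging anyone.
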
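### Package Activation Monitoring

| Metric | Description | Normal range | Alert threshold |
|------|------|---------|---------|
| Queued package activation delay | Time from the main package expiring to the next queued package activating | < 30s | > 1min |
| Real-name activation delay | Time from real-name verification completing to the package activating | < 30s | > 1min |
| Pending-activation backlog | Number of packages with `status=0` | Normal fluctuation | Sustained growth |
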
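The two delay metrics above are differences between a triggering event and the activation time, and the backlog metric alerts on a trend rather than a fixed number. A minimal sketch of both checks follows. The function names are hypothetical, and "sustained growth" is interpreted here as a strict increase across every consecutive sample, which is an assumption rather than the project's definition.

```python
from datetime import datetime, timedelta

ALERT_DELAY = timedelta(minutes=1)  # alert threshold from the table above


def activation_delay(trigger_at: datetime, activated_at: datetime) -> timedelta:
    """Delay between the trigger (expiry or real-name completion) and activation."""
    return activated_at - trigger_at


def is_backlog_growing(samples: list[int], min_points: int = 3) -> bool:
    """True when the pending count rose at every step over at least min_points samples."""
    if len(samples) < min_points:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))
```

For example, a package activated 90 seconds after its predecessor expired exceeds `ALERT_DELAY`, while a backlog series such as `[10, 12, 11, 20]` is not treated as sustained growth because it dips once.
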
### API Performance Monitoring

| Metric | Endpoint | Normal range | Alert threshold |
|------|------|---------|---------|
| P95 response time | `/api/h5/packages/my-usage` | < 100ms | > 200ms |
| P99 response time | `/api/h5/packages/my-usage` | < 200ms | > 500ms |
| P95 response time | `/api/admin/package-usage/:id/daily-records` | < 150ms | > 300ms |

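To make the P95 and P99 figures above concrete, the following sketch computes a nearest-rank percentile from raw latency samples. It is an illustration only: the `percentile` helper is a hypothetical name, and a Prometheus-based setup would normally estimate quantiles from histogram buckets instead of raw samples.

```python
import math


def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to keep the rank exact for whole-number inputs.
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


# 100 hypothetical response times in milliseconds.
latencies_ms = [80.0] * 94 + [150.0] * 5 + [400.0]
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

With these sample values, a single 400 ms outlier does not move P95 or P99, which is why percentile thresholds are less noisy than a maximum.
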
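### Database Monitoring

| Metric | Description | Normal range | Alert threshold |
|------|------|---------|---------|
| Data reset execution time | Time taken by a single reset batch | < 5s | > 10s |
| Package usage table row growth | New rows per day in `tb_package_usage` | Normal fluctuation | Abnormal growth |
| Daily record table row count | `tb_package_usage_daily_record` | Steady growth | - |
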
---

## Alert Rules

### Example Prometheus Alert Rules

```yaml
groups:
  - name: package_system_alerts
    rules:
      # Package activation delay alert
      - alert: PackageActivationDelayHigh
        expr: histogram_quantile(0.95, rate(package_activation_duration_seconds_bucket[5m])) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Package activation delay is too high"
          description: "Package activation P95 delay exceeds 1 minute, current value: {{ $value }}s"

      # Asynq queue backlog alert
      - alert: AsynqQueueBacklog
        expr: asynq_queue_size{queue="default"} > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Asynq task queue backlog"
          description: "The default queue holds more than 1000 tasks, current value: {{ $value }}"

      # Task failure rate alert
      # rate() returns failures per second, so multiply by 60 to compare
      # against the 10/min threshold from the monitoring table.
      - alert: AsynqTaskFailureRateHigh
        expr: rate(asynq_failed_total[5m]) * 60 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Asynq task failure rate is too high"
          description: "Task failures exceed 10 per minute, current value: {{ $value }}/min"

      # API response time alert
      - alert: PackageAPILatencyHigh
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{path=~"/api/h5/packages.*"}[5m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Package API response time is too high"
          description: "P95 response time of the package APIs exceeds 200ms"

      # Data reset execution time alert
      - alert: DataResetDurationHigh
        expr: package_data_reset_duration_seconds > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Data reset is taking too long"
          description: "A data reset batch took longer than 10 seconds"
```

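Two of the rules above rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified re-implementation of that idea for illustration, not Prometheus code. It assumes buckets are sorted by upper bound, end with a `+Inf` bucket, and start from a lower bound of 0; the bucket values in the example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Assumes at least two buckets, sorted by upper bound, the last one being +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Hypothetical activation-duration buckets in seconds: (le, cumulative count).
example = [(10.0, 50), (30.0, 90), (60.0, 98), (120.0, 100), (math.inf, 100)]
p95_seconds = histogram_quantile(0.95, example)
```

Because the result is interpolated, its accuracy depends on the bucket layout: a threshold such as 60 s is only meaningful if a bucket boundary sits at or near 60.
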
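---

## Rollback Plan

### Scenario 1: Code Rollback

**Trigger conditions**:
- API endpoints misbehave
- Business logic errors
- Severe performance degradation

**Rollback steps**:

```bash
# 1. Check out the previous stable version
git checkout <previous stable version tag>

# 2. Rebuild the image
make build-docker

# 3. Redeploy
# Note: `rollout restart` recreates pods from the current Deployment spec, so this
# assumes the rebuilt image is published under the tag the Deployments already use.
kubectl rollout restart deployment/cmp-api
kubectl rollout restart deployment/cmp-worker
kubectl rollout status deployment/cmp-api
kubectl rollout status deployment/cmp-worker

# 4. Verify the service is healthy
curl -fsS http://api-host/health | jq
```

**Notes**:
- A code rollback does not roll back database migrations.
- Make sure the old code is compatible with the new database schema.
- New columns have default values, so they do not break the old code.
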
### Scenario 2: Database Rollback

**Trigger conditions**:
- The migration script is faulty
- Data corruption
- The feature must be reverted completely

**Preconditions**:
- Confirm the database has been backed up
- Confirm the code has been rolled back to a compatible version

**Rollback steps**:

```bash
# 1. Stop the API and Worker services
kubectl scale deployment/cmp-api --replicas=0
kubectl scale deployment/cmp-worker --replicas=0

# 2. Roll back the database migration
make migrate-down STEPS=1

# 3. Verify the database schema
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "\d tb_package"
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "\d tb_package_usage"

# 4. Restart the services
kubectl scale deployment/cmp-api --replicas=3
kubectl scale deployment/cmp-worker --replicas=2
```

**Rollback script location**:
`migrations/000055_package_system_upgrade.down.sql`

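The ordering of the steps above matters: services are stopped before the migration is reverted, and they should not be restarted if the migration fails. One possible way to encode that, sketched under the assumption that the same `kubectl` and `make` commands are used, is to build the plan as data and stop at the first failing command. `build_rollback_plan` and `run_plan` are hypothetical helpers, and the schema verification step is left to the operator.

```python
import subprocess
from typing import Callable, Optional


def build_rollback_plan(api_replicas: int = 3, worker_replicas: int = 2,
                        steps: int = 1) -> list[list[str]]:
    """Commands from the rollback steps above, in execution order."""
    return [
        ["kubectl", "scale", "deployment/cmp-api", "--replicas=0"],
        ["kubectl", "scale", "deployment/cmp-worker", "--replicas=0"],
        ["make", "migrate-down", f"STEPS={steps}"],
        ["kubectl", "scale", "deployment/cmp-api", f"--replicas={api_replicas}"],
        ["kubectl", "scale", "deployment/cmp-worker", f"--replicas={worker_replicas}"],
    ]


def run_plan(plan: list[list[str]],
             run: Optional[Callable[[list[str]], int]] = None) -> int:
    """Run commands in order and return how many succeeded, stopping at the first failure."""
    if run is None:
        # Default runner executes the real command and returns its exit code.
        def run(cmd: list[str]) -> int:
            return subprocess.run(cmd).returncode
    completed = 0
    for cmd in plan:
        if run(cmd) != 0:
            break
        completed += 1
    return completed
```

If `run_plan` returns fewer steps than the plan contains, the services remain scaled down, which leaves room to inspect the database before bringing traffic back.
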
### Scenario 3: Data Repair

**Case 1: Abnormal package status**

```sql
-- Find packages that are still active (status = 1) but have already expired
SELECT id, status, activated_at, expires_at
FROM tb_package_usage
WHERE status = 1 AND expires_at < NOW();

-- Fix: mark the expired packages as expired (status = 3)
UPDATE tb_package_usage
SET status = 3, updated_at = NOW()
WHERE status = 1 AND expires_at < NOW();
```

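Before running a repair like the one above against production, it can help to rehearse it on throwaway data and check how many rows it touches. The sketch below does that with an in-memory SQLite database. It is an illustration only: the table is reduced to the columns the query needs, timestamps are stored as ISO strings, the current time is passed as a parameter because SQLite has no `NOW()`, and `expire_stale_packages` is a hypothetical helper.

```python
import sqlite3


def expire_stale_packages(conn: sqlite3.Connection, now: str) -> int:
    """Mark active (status = 1) packages that expired before `now` as expired (status = 3)."""
    with conn:  # commits on success, rolls back if an exception is raised
        cursor = conn.execute(
            "UPDATE tb_package_usage SET status = 3, updated_at = ? "
            "WHERE status = 1 AND expires_at < ?",
            (now, now),
        )
        return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tb_package_usage ("
    "id INTEGER PRIMARY KEY, status INTEGER, expires_at TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO tb_package_usage (id, status, expires_at) VALUES (?, ?, ?)",
    [
        (1, 1, "2024-01-01 00:00:00"),  # active but already expired: should be fixed
        (2, 1, "2099-01-01 00:00:00"),  # active and still valid: untouched
        (3, 0, "2024-01-01 00:00:00"),  # pending activation: untouched
    ],
)
fixed = expire_stale_packages(conn, now="2024-06-01 00:00:00")
```

Comparing the returned row count with the result of the `SELECT` is a cheap sanity check that the `UPDATE` touched only the rows that were inspected.
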
**Case 2: Add-on pack not invalidated correctly**

```sql
-- Find add-on packs that are still active although their main package has expired
SELECT pu.id, pu.status, pu.master_usage_id, master.status AS master_status
FROM tb_package_usage pu
JOIN tb_package_usage master ON pu.master_usage_id = master.id
WHERE pu.status = 1 AND master.status = 3;

-- Fix: mark these add-on packs as invalidated (status = 4)
UPDATE tb_package_usage
SET status = 4, updated_at = NOW()
WHERE id IN (
    SELECT pu.id
    FROM tb_package_usage pu
    JOIN tb_package_usage master ON pu.master_usage_id = master.id
    WHERE pu.status = 1 AND master.status = 3
);
```

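The self-join above pairs each add-on row with its main package row in the same table. A variation that may be useful for auditing is to collect the affected ids first and then update exactly those rows, so the list can be logged. The sketch below rehearses this on an in-memory SQLite database with a reduced schema; `invalidate_orphaned_addons` is a hypothetical helper, and the current time is passed in as a parameter.

```python
import sqlite3

FIND_SQL = """
SELECT pu.id
FROM tb_package_usage pu
JOIN tb_package_usage master ON pu.master_usage_id = master.id
WHERE pu.status = 1 AND master.status = 3
ORDER BY pu.id
"""


def invalidate_orphaned_addons(conn: sqlite3.Connection, now: str) -> list[int]:
    """Set status = 4 on active add-on packs whose main package is expired; return their ids."""
    with conn:  # select and update run in one transaction
        ids = [row[0] for row in conn.execute(FIND_SQL)]
        conn.executemany(
            "UPDATE tb_package_usage SET status = 4, updated_at = ? WHERE id = ?",
            [(now, package_id) for package_id in ids],
        )
    return ids


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tb_package_usage (id INTEGER PRIMARY KEY, status INTEGER, "
    "master_usage_id INTEGER, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO tb_package_usage (id, status, master_usage_id) VALUES (?, ?, ?)",
    [
        (1, 3, None),  # main package, expired
        (2, 1, 1),     # add-on still active on an expired main package: should be fixed
        (3, 1, None),  # main package, active
        (4, 1, 3),     # add-on on an active main package: untouched
        (5, 4, 1),     # add-on already invalidated: untouched
    ],
)
affected = invalidate_orphaned_addons(conn, now="2024-06-01 00:00:00")
```
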
**Case 3: Wrong data reset time**

```sql
-- Find daily-cycle packages whose next reset time is more than one day overdue
SELECT id, data_reset_cycle, next_reset_at
FROM tb_package_usage
WHERE data_reset_cycle = 'daily' AND next_reset_at < NOW() - INTERVAL '1 day';

-- Fix: recompute the next reset time as the start of tomorrow
UPDATE tb_package_usage
SET next_reset_at = DATE_TRUNC('day', NOW()) + INTERVAL '1 day',
    updated_at = NOW()
WHERE data_reset_cycle = 'daily' AND next_reset_at < NOW() - INTERVAL '1 day';
```

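The fix above sets the next reset to the start of the following day, which is what `DATE_TRUNC('day', NOW()) + INTERVAL '1 day'` evaluates to. The same calculation can be sketched in Python as shown below. Only the `'daily'` value appears in the queries above; the `'monthly'` and `'yearly'` cycle names, and the assumption that those cycles reset on calendar boundaries, are guesses made for illustration.

```python
from datetime import datetime, timedelta


def next_reset_at(now: datetime, cycle: str) -> datetime:
    """Start of the next day, month, or year after `now`, depending on the cycle."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if cycle == "daily":
        return midnight + timedelta(days=1)
    if cycle == "monthly":  # assumed cycle name
        first_of_month = midnight.replace(day=1)
        if first_of_month.month == 12:
            return first_of_month.replace(year=first_of_month.year + 1, month=1)
        return first_of_month.replace(month=first_of_month.month + 1)
    if cycle == "yearly":  # assumed cycle name
        return midnight.replace(year=midnight.year + 1, month=1, day=1)
    raise ValueError(f"unknown cycle: {cycle}")
```

Resetting the day to 1 before changing the month avoids invalid dates such as February 30 when `now` falls at the end of a long month.
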
---

## Daily Operations

### Manually Trigger a Data Reset

```bash
# Trigger via the API
curl -X POST http://api-host/api/admin/internal/trigger-data-reset \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

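When the trigger above needs to be called from a script instead of a shell, the same request can be built with the Python standard library. This is a minimal sketch: `build_trigger_request` is a hypothetical helper, and the host and token are placeholders, as in the `curl` example.

```python
import urllib.request


def build_trigger_request(base_url: str, admin_token: str) -> urllib.request.Request:
    """Build the POST request for the data reset trigger endpoint shown above."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/admin/internal/trigger-data-reset",
        method="POST",
        headers={"Authorization": f"Bearer {admin_token}"},
    )


request = build_trigger_request("http://api-host/", "example-token")
```

Sending it is then a matter of calling `urllib.request.urlopen(request, timeout=10)`, which raises `urllib.error.HTTPError` on a 4xx or 5xx response.
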
### View Asynq Queue Status

```bash
# Show the queue overview
asynq stats

# List pending tasks (queue name `default` matches the monitored queue)
asynq task ls --queue=default --state=pending

# List failed (archived) tasks
asynq task ls --queue=default --state=archived
```

### Retry Failed Tasks

```bash
# Retry all failed (archived) tasks
asynq task runall --queue=default --state=archived

# Retry a specific task
asynq task run --queue=default --id=<task_id>
```

---

## Capacity Planning

### Data Growth Estimate

| Table | Daily growth | Monthly growth | Yearly growth |
|----|---------|--------|--------|
| `tb_package_usage` | ~1,000 rows | ~30,000 rows | ~360,000 rows |
| `tb_package_usage_daily_record` | ~10,000 rows | ~300,000 rows | ~3,600,000 rows |
| `tb_card_daily_usage` | ~10,000 rows | ~300,000 rows | ~3,600,000 rows |

### Storage Estimate

| Table | Row size | Yearly storage |
|----|---------|---------|
| `tb_package_usage_daily_record` | ~100 bytes | ~360 MB |
| `tb_card_daily_usage` | ~80 bytes | ~288 MB |

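The figures in both tables follow from simple multiplication: a month is counted as 30 days, a year as 12 such months, and storage is rows times row size in decimal megabytes. The sketch below reproduces that arithmetic so the estimate can be rerun with different daily volumes. The function names are hypothetical, and the result covers raw row data only, not indexes or table overhead.

```python
def yearly_rows(rows_per_day: int, days_per_month: int = 30, months: int = 12) -> int:
    """Rows added per year, using the 30-day month assumed by the tables above."""
    return rows_per_day * days_per_month * months


def yearly_storage_mb(rows_per_day: int, bytes_per_row: int) -> float:
    """Raw row storage per year in decimal megabytes (indexes and overhead excluded)."""
    return yearly_rows(rows_per_day) * bytes_per_row / 1_000_000
```

For instance, `yearly_storage_mb(10_000, 100)` gives the 360 MB listed for `tb_package_usage_daily_record`.
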
### Cleanup Strategy

```sql
-- Delete daily records older than 180 days (optional)
DELETE FROM tb_package_usage_daily_record
WHERE date < NOW() - INTERVAL '180 days';

DELETE FROM tb_card_daily_usage
WHERE usage_date < NOW() - INTERVAL '180 days';
```

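On large tables, a single `DELETE` like the ones above can hold locks for a long time. One common alternative is to delete in small batches, each in its own transaction. The sketch below demonstrates the idea on an in-memory SQLite database. It is not the project's cleanup job: `purge_in_batches` is a hypothetical helper, SQLite's `rowid` stands in for the primary key, and the table and column names must come from trusted constants because they are interpolated into the SQL text.

```python
import sqlite3


def purge_in_batches(conn: sqlite3.Connection, table: str, date_column: str,
                     cutoff: str, batch_size: int = 1000) -> int:
    """Delete rows older than `cutoff` in batches and return the total number deleted."""
    sql = (
        f"DELETE FROM {table} WHERE rowid IN ("
        f"SELECT rowid FROM {table} WHERE {date_column} < ? LIMIT ?)"
    )
    total = 0
    while True:
        with conn:  # one transaction per batch keeps each lock short
            deleted = conn.execute(sql, (cutoff, batch_size)).rowcount
        total += deleted
        if deleted < batch_size:
            return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb_card_daily_usage (id INTEGER PRIMARY KEY, usage_date TEXT)")
conn.executemany(
    "INSERT INTO tb_card_daily_usage (usage_date) VALUES (?)",
    [("2023-01-01",)] * 25 + [("2024-06-01",)] * 5,  # 25 old rows, 5 recent rows
)
purged = purge_in_batches(conn, "tb_card_daily_usage", "usage_date",
                          cutoff="2024-01-01", batch_size=10)
```

The loop stops as soon as a batch comes back smaller than `batch_size`, which means the last old rows have been removed.
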