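# Package System Upgrade - Operations Guide
## Monitoring Metrics
### Asynq Queue Monitoring
| Metric | Description | Normal Range | Alert Threshold |
|------|------|---------|---------|
| `asynq_queue_size{queue="default"}` | Length of the default queue | < 100 | > 1000 |
| `asynq_queue_latency_seconds` | Task processing latency | < 5s | > 30s |
| `asynq_processed_total` | Number of tasks processed | Steadily increasing | - |
| `asynq_failed_total` | Number of failed tasks | Close to 0 | > 10/min |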
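### Package Activation Monitoring
| Metric | Description | Normal Range | Alert Threshold |
|------|------|---------|---------|
| Queued package activation latency | Time from the main package expiring until the next queued package is activated | < 30s | > 1min |
| Real-name activation latency | Time from real-name verification completing until the package is activated | < 30s | > 1min |
| Pending-activation backlog | Number of packages with `status=0` | Normal fluctuation | Sustained growth |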
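
The "sustained growth" threshold for the pending-activation backlog is not defined numerically. One possible reading, shown here as a minimal sketch, is "every one of the last N samples is higher than the one before it". The function name `is_sustained_growth`, the default window of 6 samples, and the sample values are all assumptions made for illustration, not part of the system.

```python
from typing import Sequence


def is_sustained_growth(samples: Sequence[int], min_points: int = 6) -> bool:
    """Return True if the last `min_points` samples are strictly increasing.

    `samples` is the count of status=0 packages, taken at a fixed interval.
    """
    if min_points < 2 or len(samples) < min_points:
        return False  # not enough data to call it a trend
    window = samples[-min_points:]
    return all(later > earlier for earlier, later in zip(window, window[1:]))


# Hypothetical backlog samples, one per scrape interval.
print(is_sustained_growth([12, 9, 14, 11, 13, 10]))   # False: normal fluctuation
print(is_sustained_growth([10, 12, 15, 19, 24, 30]))  # True: backlog keeps growing
```

A stricter or looser definition (for example, comparing averages of two windows) may suit a noisy backlog better; the point is only that the alert condition needs an explicit window.
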
### API Performance Monitoring
| Metric | Endpoint | Normal Range | Alert Threshold |
|------|------|---------|---------|
| P95 response time | `/api/h5/packages/my-usage` | < 100ms | > 200ms |
| P99 response time | `/api/h5/packages/my-usage` | < 200ms | > 500ms |
| P95 response time | `/api/admin/package-usage/:id/daily-records` | < 150ms | > 300ms |
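
For spot checks outside the metrics pipeline (for example, against latencies pulled from access logs), P95 and P99 can be computed directly from raw samples. The sketch below uses the nearest-rank method; the helper name `percentile` and the sample data are illustrative assumptions, and the result will differ slightly from a histogram-based estimate.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies of 1..100 ms, one request each.
latencies = list(range(1, 101))
print(percentile(latencies, 95))  # 95
print(percentile(latencies, 99))  # 99
```
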
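### Database Monitoring
| Metric | Description | Normal Range | Alert Threshold |
|------|------|---------|---------|
| Data reset execution time | Duration of a single reset batch | < 5s | > 10s |
| Package table row growth | Rows added to `tb_package_usage` per day | Normal fluctuation | Abnormal growth |
| Daily record table row count | `tb_package_usage_daily_record` | Normal growth | - |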
---
## Alert Rules
### Example Prometheus Alert Rules
```yaml
groups:
  - name: package_system_alerts
    rules:
      # Package activation latency alert
      - alert: PackageActivationDelayHigh
        expr: histogram_quantile(0.95, rate(package_activation_duration_seconds_bucket[5m])) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Package activation latency is too high"
          description: "Package activation P95 latency exceeds 1 minute, current value: {{ $value }}s"
      # Asynq queue backlog alert
      - alert: AsynqQueueBacklog
        expr: asynq_queue_size{queue="default"} > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Asynq task queue backlog"
          description: "The default queue holds more than 1000 tasks, current value: {{ $value }}"
      # Task failure rate alert
      - alert: AsynqTaskFailureRateHigh
        expr: rate(asynq_failed_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Asynq task failure rate is too high"
          description: "Task failures exceed 0.1 per second (6 per minute), current value: {{ $value }}/s"
      # API response time alert
      - alert: PackageAPILatencyHigh
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{path=~"/api/h5/packages.*"}[5m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Package API response time is too high"
          description: "P95 response time of package-related APIs exceeds 200ms"
      # Data reset execution time alert
      - alert: DataResetDurationHigh
        expr: package_data_reset_duration_seconds > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Data reset is taking too long"
          description: "A data reset batch took longer than 10 seconds"
```
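
`rate(asynq_failed_total[5m])` yields failures per second, so the `0.1` in the rule is 6 failures per minute, whereas the metrics table lists `> 10/min` (about 0.167/s) as the alert threshold. The sketch below shows the unit conversion on two counter readings. It is a simplification: the function name is hypothetical, and unlike Prometheus it does not handle counter resets or extrapolate to the window edges.

```python
def failures_per_minute(count_start: float, count_end: float, window_seconds: float) -> float:
    """Average failures per minute between two readings of a monotonically increasing counter."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (count_end - count_start) * 60 / window_seconds


# Hypothetical counter readings taken 5 minutes (300 s) apart.
print(failures_per_minute(1200, 1230, 300))  # 6.0  -> equals 0.1/s, the rule's threshold
print(failures_per_minute(1200, 1260, 300))  # 12.0 -> above the table's 10/min threshold
```

Whichever threshold is intended, the rule expression and the table should state it in the same unit.
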
---
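## Rollback Plan
### Scenario 1: Code Rollback
**Trigger conditions**
- API endpoints behave abnormally
- Business logic errors
- Severe performance degradation
**Rollback steps**
```bash
# 1. Check out the previous stable version
git checkout <previous-stable-version-tag>
# 2. Rebuild the image
make build-docker
# 3. Redeploy
kubectl rollout restart deployment/cmp-api
kubectl rollout restart deployment/cmp-worker
# 4. Verify that the service is healthy
curl -s http://api-host/health | jq
```
**Notes**
- A code rollback does not roll back database migrations
- Make sure the old code is compatible with the new database schema
- New columns use default values, so they do not affect the old code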
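### Scenario 2: Database Rollback
**Trigger conditions**
- The migration script is faulty
- Data corruption
- The feature must be removed entirely
**Prerequisites**
- Confirm that the database has been backed up
- Confirm that the code has been rolled back to a compatible version
**Rollback steps**
```bash
# 1. Stop the API and Worker services
kubectl scale deployment/cmp-api --replicas=0
kubectl scale deployment/cmp-worker --replicas=0
# 2. Run the database rollback
make migrate-down STEPS=1
# 3. Verify the database schema
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "\d tb_package"
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "\d tb_package_usage"
# 4. Restart the services
kubectl scale deployment/cmp-api --replicas=3
kubectl scale deployment/cmp-worker --replicas=2
```
**Rollback script location**
`migrations/000055_package_system_upgrade.down.sql`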
### Scenario 3: Data Repair
**Case 1: Abnormal package status**
```sql
-- Find packages that are still active after their expiry time
SELECT id, status, activated_at, expires_at
FROM tb_package_usage
WHERE status = 1 AND expires_at < NOW();
-- Fix: mark the overdue packages as expired
UPDATE tb_package_usage
SET status = 3, updated_at = NOW()
WHERE status = 1 AND expires_at < NOW();
```
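
When a repair like this is scripted, it can help to count the affected rows first and commit only if the update touches exactly that many. The sketch below illustrates the pattern with an in-memory SQLite database standing in for the real PostgreSQL instance; the table is reduced to the columns used in the query, the function name `expire_overdue` is hypothetical, and the current time is passed in as a parameter because SQLite has no `NOW()`.

```python
import sqlite3

STATUS_ACTIVE, STATUS_EXPIRED = 1, 3  # status codes as used in the repair SQL


def expire_overdue(conn: sqlite3.Connection, now: str) -> int:
    """Mark active-but-overdue rows as expired; roll back if the row count is unexpected."""
    where = "status = ? AND expires_at < ?"
    params = (STATUS_ACTIVE, now)
    expected = conn.execute(
        f"SELECT COUNT(*) FROM tb_package_usage WHERE {where}", params
    ).fetchone()[0]
    cur = conn.execute(
        f"UPDATE tb_package_usage SET status = ?, updated_at = ? WHERE {where}",
        (STATUS_EXPIRED, now, *params),
    )
    if cur.rowcount != expected:
        conn.rollback()
        raise RuntimeError(f"expected {expected} rows, updated {cur.rowcount}")
    conn.commit()
    return cur.rowcount


# Stand-in schema and data (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tb_package_usage ("
    "id INTEGER PRIMARY KEY, status INTEGER, expires_at TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO tb_package_usage (status, expires_at) VALUES (?, ?)",
    [(1, "2026-01-01 00:00:00"), (1, "2026-03-01 00:00:00"), (3, "2025-12-01 00:00:00")],
)
conn.commit()
print(expire_overdue(conn, "2026-02-12 00:00:00"))  # 1
```
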
**Case 2: Add-on pack not invalidated correctly**
```sql
-- Find add-on packs that are still active although their main package has expired
SELECT pu.id, pu.status, pu.master_usage_id, master.status AS master_status
FROM tb_package_usage pu
JOIN tb_package_usage master ON pu.master_usage_id = master.id
WHERE pu.status = 1 AND master.status = 3;
-- Fix: mark these add-on packs as invalidated
UPDATE tb_package_usage
SET status = 4, updated_at = NOW()
WHERE id IN (
    SELECT pu.id
    FROM tb_package_usage pu
    JOIN tb_package_usage master ON pu.master_usage_id = master.id
    WHERE pu.status = 1 AND master.status = 3
);
```
**Case 3: Incorrect data reset time**
```sql
-- Find packages whose next reset time is more than one day overdue
SELECT id, data_reset_cycle, next_reset_at
FROM tb_package_usage
WHERE data_reset_cycle = 'daily' AND next_reset_at < NOW() - INTERVAL '1 day';
-- Fix: recalculate the next reset time (start of tomorrow)
UPDATE tb_package_usage
SET next_reset_at = DATE_TRUNC('day', NOW()) + INTERVAL '1 day',
    updated_at = NOW()
WHERE data_reset_cycle = 'daily' AND next_reset_at < NOW() - INTERVAL '1 day';
```
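
The fix above sets `next_reset_at` to the start of the next day. The same calculation can be sketched in Python to sanity-check values before or after the repair. Only the `'daily'` cycle appears in the SQL; the `'monthly'` and `'yearly'` values and their calendar-boundary semantics are assumptions made here for illustration, and the timestamps are treated as naive values in the database's time zone.

```python
from datetime import datetime, timedelta


def next_reset_at(now: datetime, cycle: str) -> datetime:
    """Return the start of the next day, month, or year after `now`."""
    day_start = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if cycle == "daily":
        # Same as DATE_TRUNC('day', NOW()) + INTERVAL '1 day'
        return day_start + timedelta(days=1)
    if cycle == "monthly":  # assumed value, not confirmed by the source
        year, month = (now.year + 1, 1) if now.month == 12 else (now.year, now.month + 1)
        return day_start.replace(year=year, month=month, day=1)
    if cycle == "yearly":  # assumed value, not confirmed by the source
        return day_start.replace(year=now.year + 1, month=1, day=1)
    raise ValueError(f"unknown cycle: {cycle!r}")


now = datetime(2026, 2, 12, 14, 24, 15)
print(next_reset_at(now, "daily"))    # 2026-02-13 00:00:00
print(next_reset_at(now, "monthly"))  # 2026-03-01 00:00:00
print(next_reset_at(now, "yearly"))   # 2027-01-01 00:00:00
```
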
---
## Routine Operations
### Manually Trigger a Data Reset
```bash
# Trigger via the API
curl -X POST http://api-host/api/admin/internal/trigger-data-reset \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```
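### Check Asynq Queue Status
```bash
# Show the queue overview
asynq stats
# List pending tasks in the default queue
asynq task ls --queue=default --state=pending
# List failed (archived) tasks in the default queue
asynq task ls --queue=default --state=archived
```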
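### Retry Failed Tasks
```bash
# Retry all failed (archived) tasks in the default queue
asynq task runall --queue=default --state=archived
# Retry a specific task
asynq task run --queue=default --id=<task_id>
```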
---
## Capacity Planning
### Data Growth Estimates
| Table | Daily Increase | Monthly Increase | Yearly Increase |
|----|---------|--------|--------|
| `tb_package_usage` | ~1000 rows | ~30000 rows | ~360000 rows |
| `tb_package_usage_daily_record` | ~10000 rows | ~300000 rows | ~3600000 rows |
| `tb_card_daily_usage` | ~10000 rows | ~300000 rows | ~3600000 rows |
### Storage Estimates
| Table | Row Size | Yearly Storage |
|----|---------|---------|
| `tb_package_usage_daily_record` | ~100 bytes | ~360 MB |
| `tb_card_daily_usage` | ~80 bytes | ~288 MB |
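
The yearly figures follow from daily rows × days per year × row size, using a 360-day year (30 days × 12 months) as the growth table does. A minimal sketch of the arithmetic, with a hypothetical helper name, makes it easy to re-run the estimate with different inputs. It counts row data only; index and table overhead are not included.

```python
def yearly_storage_mb(rows_per_day: int, bytes_per_row: int, days_per_year: int = 360) -> float:
    """Estimated raw row storage per year in MB (1 MB = 1,000,000 bytes)."""
    return rows_per_day * days_per_year * bytes_per_row / 1_000_000


print(yearly_storage_mb(10_000, 100))  # 360.0 -> tb_package_usage_daily_record
print(yearly_storage_mb(10_000, 80))   # 288.0 -> tb_card_daily_usage
print(yearly_storage_mb(10_000, 100, days_per_year=365))  # 365.0 with a calendar year
```
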
### Cleanup Policy
```sql
-- Delete daily records older than 180 days (optional)
DELETE FROM tb_package_usage_daily_record
WHERE date < NOW() - INTERVAL '180 days';
DELETE FROM tb_card_daily_usage
WHERE usage_date < NOW() - INTERVAL '180 days';
```
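
At roughly 10000 new rows per day, the first cleanup run may have to remove a large number of rows in a single statement. One common way to limit the size of each transaction is to delete in batches. The sketch below shows the loop against an in-memory SQLite table standing in for PostgreSQL; the function name `delete_in_batches`, the batch size, and the presence of an `id` primary key on the table are assumptions.

```python
import sqlite3


def delete_in_batches(conn: sqlite3.Connection, table: str, date_column: str,
                      cutoff: str, batch_size: int = 1000) -> int:
    """Delete rows older than `cutoff` in batches, committing after each batch.

    `table` and `date_column` are interpolated into the SQL, so pass only
    trusted, hard-coded identifiers.
    """
    sql = (
        f"DELETE FROM {table} WHERE id IN ("
        f"SELECT id FROM {table} WHERE {date_column} < ? LIMIT ?)"
    )
    total = 0
    while True:
        cur = conn.execute(sql, (cutoff, batch_size))
        conn.commit()
        if cur.rowcount == 0:
            break  # nothing older than the cutoff remains
        total += cur.rowcount
    return total


# Stand-in table: 25 old rows and 5 recent rows (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb_package_usage_daily_record (id INTEGER PRIMARY KEY, date TEXT)")
rows = [("2025-01-01",)] * 25 + [("2026-02-01",)] * 5
conn.executemany("INSERT INTO tb_package_usage_daily_record (date) VALUES (?)", rows)
conn.commit()

deleted = delete_in_batches(conn, "tb_package_usage_daily_record", "date", "2025-08-16", batch_size=10)
print(deleted)  # 25
```
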