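feat: implement the IoT card polling system (supports tens of millions of cards)

All checks were successful
Build and deploy to test environment (no SSH) / build-and-deploy (push) Successful in 6m35s

Features:
- Real-name status polling (configurable interval)
- Card data-usage polling (tracks usage across month boundaries)
- Package checks with automatic suspension on overage
- Distributed concurrency control (Redis semaphore)
- Manual polling triggers (single card / batch / by condition)
- Data cleanup configuration and execution
- Alert rules and alert history
- Real-time monitoring statistics (queues / performance / concurrency)

Performance optimizations:
- Cache card information in Redis to reduce DB queries
- Pipeline batch writes to Redis
- Asynchronous writes of data-usage records
- Progressive initialization (100,000 cards per batch)

Load-testing tools (scripts/benchmark/):
- Mock Gateway that simulates the upstream service
- Test card generator
- Configuration initialization script
- Real-time monitoring script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>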
361
docs/polling-system/operations.md
Normal file
@@ -0,0 +1,361 @@

# Polling System Operations Guide

## Routine Monitoring

### 1. Monitoring Dashboard

Query the monitoring endpoints to get the system status:

```bash
# Overview statistics
curl http://localhost:3000/api/admin/polling-stats

# Queue status
curl http://localhost:3000/api/admin/polling-stats/queues

# Task statistics
curl http://localhost:3000/api/admin/polling-stats/tasks

# Initialization progress
curl http://localhost:3000/api/admin/polling-stats/init-progress
```

### 2. Key Metrics

| Metric | Normal range | Alert threshold | Notes |
|--------|--------------|-----------------|-------|
| Queue length | < 10000 | > 50000 | A severe backlog that needs attention |
| Success rate | > 95% | < 90% | Task execution success rate |
| Average duration | < 500ms | > 2000ms | Processing time per task |
| Concurrency usage | 50-80% | > 95% | Close to the limit; scale up |
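
The thresholds in the table can be checked mechanically. The following is a minimal sketch, not part of the polling system itself: the function name `level` and the label `watch` for the band between "normal" and "alert" are assumptions introduced here for illustration. It only covers metrics where a higher value is worse (queue length, average duration, concurrency usage); success rate would need the comparison inverted.

```bash
# level VALUE NORMAL_BELOW ALERT_ABOVE
# Prints "alert" when VALUE exceeds ALERT_ABOVE, "normal" when it is
# below NORMAL_BELOW, and "watch" (hypothetical label) in between.
level() {
  if [ "$1" -gt "$3" ]; then
    echo "alert"
  elif [ "$1" -lt "$2" ]; then
    echo "normal"
  else
    echo "watch"
  fi
}
```

For example, `level "$(redis-cli ZCARD polling:queue:realname)" 10000 50000` would classify the current real-name queue length against the first row of the table.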

### 3. Redis Monitoring Commands

```bash
# Check queue lengths
redis-cli ZCARD polling:queue:realname
redis-cli ZCARD polling:queue:carddata
redis-cli ZCARD polling:queue:package

# Check the manual-trigger queues
redis-cli LLEN polling:manual:realname
redis-cli LLEN polling:manual:carddata
redis-cli LLEN polling:manual:package

# Check the current concurrency
redis-cli GET polling:concurrency:current:realname
redis-cli GET polling:concurrency:current:carddata
redis-cli GET polling:concurrency:current:package

# Check statistics
redis-cli HGETALL polling:stats:realname
redis-cli HGETALL polling:stats:carddata
redis-cli HGETALL polling:stats:package

# Check initialization progress
redis-cli HGETALL polling:init:progress
```
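
Because the same three keys are read for every task type, the commands above can be wrapped in a loop. This is a sketch under stated assumptions: the function name `polling_snapshot` and the `REDIS_CLI` override variable are introduced here and do not exist in the project; the key names are the ones listed above.

```bash
# REDIS_CLI may be overridden, for example to add connection flags.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Print one line per task type: scheduled queue length,
# manual-trigger queue length, and current concurrency.
polling_snapshot() {
  for t in realname carddata package; do
    queued=$($REDIS_CLI ZCARD "polling:queue:$t")
    manual=$($REDIS_CLI LLEN "polling:manual:$t")
    running=$($REDIS_CLI GET "polling:concurrency:current:$t")
    # If the concurrency key is missing or empty, show 0.
    echo "$t queued=$queued manual=$manual running=${running:-0}"
  done
}
```

Running `polling_snapshot` under `watch -n 5` (after exporting the function or saving it as a script) gives a rough live view without calling the HTTP endpoints.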

## Alert Configuration

### 1. Default Alert Rules

We recommend configuring the following alert rules:

```bash
# Queue backlog alert
curl -X POST http://localhost:3000/api/admin/polling-alert-rules \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Queue backlog alert",
    "rule_type": "queue_backlog",
    "task_type": "realname",
    "threshold": 50000,
    "comparison": ">",
    "is_enabled": true,
    "notify_channels": ["webhook"],
    "webhook_url": "https://your-webhook-url"
  }'

# Success rate alert
curl -X POST http://localhost:3000/api/admin/polling-alert-rules \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Success rate alert",
    "rule_type": "success_rate",
    "task_type": "realname",
    "threshold": 90,
    "comparison": "<",
    "is_enabled": true,
    "notify_channels": ["webhook"],
    "webhook_url": "https://your-webhook-url"
  }'

# Average duration alert
curl -X POST http://localhost:3000/api/admin/polling-alert-rules \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Duration alert",
    "rule_type": "avg_duration",
    "task_type": "realname",
    "threshold": 2000,
    "comparison": ">",
    "is_enabled": true,
    "notify_channels": ["webhook"],
    "webhook_url": "https://your-webhook-url"
  }'
```
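
The rules above only cover the `realname` task type. A possible way to create the same rule for every task type is sketched below. The helper names `alert_rule_json` and `create_backlog_rules` and the `WEBHOOK_URL` variable are hypothetical, and the sketch assumes the endpoint accepts `carddata` and `package` as `task_type` values for these rule types, which the examples above do not confirm.

```bash
# alert_rule_json NAME RULE_TYPE TASK_TYPE THRESHOLD COMPARISON
# Prints the JSON body for one alert rule, using the same fields as above.
alert_rule_json() {
  printf '{"name": "%s", "rule_type": "%s", "task_type": "%s", "threshold": %s, "comparison": "%s", "is_enabled": true, "notify_channels": ["webhook"], "webhook_url": "%s"}\n' \
    "$1" "$2" "$3" "$4" "$5" "${WEBHOOK_URL:-https://your-webhook-url}"
}

# Create the queue backlog rule once per task type.
# curl reads the request body from stdin via "-d @-".
create_backlog_rules() {
  for t in realname carddata package; do
    alert_rule_json "Queue backlog alert ($t)" queue_backlog "$t" 50000 ">" |
      curl -X POST http://localhost:3000/api/admin/polling-alert-rules \
        -H "Content-Type: application/json" -d @-
  done
}
```

Generating the body in one place keeps the threshold and webhook settings consistent across task types; the values passed in should still be reviewed per task type.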

### 2. Querying Alert History

```bash
# View alert history
curl "http://localhost:3000/api/admin/polling-alert-history?page=1&page_size=20"

# Filter by rule
curl "http://localhost:3000/api/admin/polling-alert-history?rule_id=1"
```

## Troubleshooting

### Problem 1: Queue Backlog

**Symptom**: The queue length keeps growing and task processing cannot keep up.

**Investigation steps**:

1. Check concurrency usage
   ```bash
   redis-cli GET polling:concurrency:current:realname
   redis-cli GET polling:concurrency:config:realname
   ```

2. Check the Gateway API response time
   ```bash
   # Read the average duration from the statistics
   redis-cli HGET polling:stats:realname avg_duration_ms
   ```

3. Check whether there are many failures and retries
   ```bash
   redis-cli HGET polling:stats:realname failed
   ```
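
The first check can be turned into a percentage and compared with the 95% alert threshold from the key metrics table. The sketch below is illustrative only: `usage_pct` and `backlog_hint` are hypothetical names, the hint texts reflect one reasonable reading of the steps above rather than documented behavior, and both arguments are assumed to be plain integers (the source does not say how `polling:concurrency:config:realname` is encoded).

```bash
# usage_pct CURRENT MAX
# Prints the integer percentage of concurrency slots in use.
usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# backlog_hint CURRENT MAX
# Suggests where to look next. Above 95% usage the slots are the
# bottleneck; otherwise the backlog likely has another cause.
backlog_hint() {
  pct=$(usage_pct "$1" "$2")
  if [ "$pct" -gt 95 ]; then
    echo "saturated: check avg_duration_ms, then consider raising max_concurrency"
  else
    echo "slots available: check worker logs and the failed counter"
  fi
}
```

For example, `backlog_hint "$(redis-cli GET polling:concurrency:current:realname)" 100` compares the live value against a configured maximum of 100.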

**Resolution**:

1. Increase the concurrency limit
   ```bash
   curl -X PUT http://localhost:3000/api/admin/polling-concurrency/realname \
     -H "Content-Type: application/json" \
     -d '{"max_concurrency": 100}'
   ```

2. Temporarily disable non-critical polling configurations
   ```bash
   curl -X PUT http://localhost:3000/api/admin/polling-configs/1 \
     -H "Content-Type: application/json" \
     -d '{"status": 0}'
   ```

### Problem 2: High Task Failure Rate

**Symptom**: The success rate is below 90%.

**Investigation steps**:

1. Check the Worker log
   ```bash
   grep -i "error" logs/worker.log | tail -100
   ```
2. Check the Gateway service status
3. Check network connectivity

**Resolution**:

1. If the Gateway is at fault, contact the carrier to resolve it
2. If the network is at fault, check the firewall and DNS configuration
3. Temporarily lower the concurrency limit to reduce load

### Problem 3: Initialization Is Stuck

**Symptom**: The initialization progress does not change for a long time.

**Investigation steps**:

1. Check the initialization progress
   ```bash
   redis-cli HGETALL polling:init:progress
   ```

2. Check the Worker log for errors
   ```bash
   grep -i "初始化" logs/worker.log | tail -50
   ```

**Resolution**:

1. Restart the Worker service
2. If it keeps failing, check the database connection

### Problem 4: Concurrency Semaphore Leak

**Symptom**: The current concurrency value is abnormally high, but far fewer tasks are actually running.

**Investigation steps**:

```bash
# Check the current concurrency
redis-cli GET polling:concurrency:current:realname
```

**Resolution**:

Reset the semaphore:

```bash
curl -X POST http://localhost:3000/api/admin/polling-concurrency/reset \
  -H "Content-Type: application/json" \
  -d '{"task_type": "realname"}'
```

## Data Cleanup

### 1. View the Cleanup Configuration

```bash
curl http://localhost:3000/api/admin/data-cleanup-configs
```

### 2. Trigger a Cleanup Manually

```bash
# Preview what would be cleaned up
curl http://localhost:3000/api/admin/data-cleanup/preview

# Trigger the cleanup
curl -X POST http://localhost:3000/api/admin/data-cleanup/trigger

# Check cleanup progress
curl http://localhost:3000/api/admin/data-cleanup/progress
```

### 3. Adjust the Retention Period

```bash
curl -X PUT http://localhost:3000/api/admin/data-cleanup-configs/1 \
  -H "Content-Type: application/json" \
  -d '{"retention_days": 60}'
```

## Manual Trigger Operations

### 1. Single-Card Trigger

```bash
curl -X POST http://localhost:3000/api/admin/polling-manual-trigger/single \
  -H "Content-Type: application/json" \
  -d '{
    "card_id": 12345,
    "task_type": "realname"
  }'
```

### 2. Batch Trigger

```bash
curl -X POST http://localhost:3000/api/admin/polling-manual-trigger/batch \
  -H "Content-Type: application/json" \
  -d '{
    "card_ids": [12345, 12346, 12347],
    "task_type": "carddata"
  }'
```

### 3. Trigger by Condition

```bash
curl -X POST http://localhost:3000/api/admin/polling-manual-trigger/by-condition \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "realname",
    "carrier_id": 1,
    "status": 1
  }'
```

### 4. Cancel a Trigger

```bash
curl -X POST http://localhost:3000/api/admin/polling-manual-trigger/cancel \
  -H "Content-Type: application/json" \
  -d '{
    "trigger_id": "xxx"
  }'
```

## Performance Tuning

### 1. Concurrency Tuning

Adjust the concurrency limit according to the Gateway API response time and the available server resources:

| Scenario | Suggested concurrency |
|----------|-----------------------|
| Gateway response < 100ms | 100-200 |
| Gateway response 100-500ms | 50-100 |
| Gateway response > 500ms | 20-50 |
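
The lookup in the table can be scripted so that it is driven by the measured `avg_duration_ms`. This is a minimal sketch; the function name `suggested_concurrency` is hypothetical, and because the table does not say which row exactly 100 ms or 500 ms belongs to, the sketch places both in the middle band as an assumption.

```bash
# suggested_concurrency LATENCY_MS
# Prints the suggested concurrency range for a Gateway response time in ms.
# Boundary handling (100 and 500 fall in the middle band) is an assumption.
suggested_concurrency() {
  if [ "$1" -lt 100 ]; then
    echo "100-200"
  elif [ "$1" -le 500 ]; then
    echo "50-100"
  else
    echo "20-50"
  fi
}
```

For example, `suggested_concurrency "$(redis-cli HGET polling:stats:realname avg_duration_ms)"` works if that field holds an integer number of milliseconds.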
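
### 2. Polling Interval Tuning

Adjust the intervals according to business requirements:

| Task type | Suggested interval | Notes |
|-----------|--------------------|-------|
| Real-name check (not yet verified) | 60s | Real-name status must be detected quickly |
| Real-name check (verified) | 3600s | Status is stable; check infrequently |
| Data-usage check | 1800s | Once every 30 minutes |
| Package check | 1800s | Kept in step with the data-usage check |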
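
Interval and concurrency are linked: by Little's law, the average number of tasks in flight equals the task arrival rate multiplied by the time each task takes. The sketch below estimates the concurrency needed to poll a given number of cards once per interval. It assumes one Gateway call per card per interval and checks spread evenly over the interval; the function name `required_concurrency` is hypothetical.

```bash
# required_concurrency CARDS INTERVAL_S LATENCY_MS
# in-flight tasks = (CARDS / INTERVAL_S) * (LATENCY_MS / 1000), rounded up.
required_concurrency() {
  denom=$(( $2 * 1000 ))
  # Integer ceiling division: (a + b - 1) / b
  echo $(( ($1 * $3 + denom - 1) / denom ))
}
```

Under these assumptions, `required_concurrency 10000000 1800 100` prints `556`: polling ten million cards every 1800 s at 100 ms per call needs roughly 556 concurrent slots for that task type. If the result exceeds what the Gateway can sustain, a longer interval is the usual lever.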

### 3. Batch Processing

- Progressive initialization: 100,000 cards per batch, with a 1-second pause between batches
- Data cleanup: 10,000 rows per batch, to avoid long-running transactions

## Backup and Restore

### 1. Back Up the Configuration

```bash
# Back up the polling configuration tables
pg_dump -h $HOST -U $USER -d $DB -t tb_polling_config > polling_config_backup.sql
pg_dump -h $HOST -U $USER -d $DB -t tb_polling_concurrency_config > concurrency_config_backup.sql
pg_dump -h $HOST -U $USER -d $DB -t tb_polling_alert_rule > alert_rules_backup.sql
pg_dump -h $HOST -U $USER -d $DB -t tb_data_cleanup_config > cleanup_config_backup.sql
```
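
The four dumps can also be taken in one loop. The following is a sketch, not the project's tooling: the function name `backup_polling_tables` and the `PG_DUMP` override variable are assumptions, and the output files are named after the tables rather than using the file names shown above. `HOST`, `USER`, and `DB` are the same variables as in the commands above.

```bash
# PG_DUMP may be overridden, for example with a full path.
PG_DUMP="${PG_DUMP:-pg_dump}"

# backup_polling_tables DIR
# Writes one dump per configuration table into DIR as <table>.sql.
# Stops at the first failing dump and returns a non-zero status.
backup_polling_tables() {
  dir=$1
  mkdir -p "$dir" || return 1
  for table in tb_polling_config tb_polling_concurrency_config \
               tb_polling_alert_rule tb_data_cleanup_config; do
    $PG_DUMP -h "$HOST" -U "$USER" -d "$DB" -t "$table" > "$dir/$table.sql" || return 1
  done
}
```

A call such as `backup_polling_tables "backup-$(date +%Y%m%d)"` keeps each day's dumps in a separate directory.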
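
### 2. Restore the Configuration

```bash
psql -h $HOST -U $USER -d $DB < polling_config_backup.sql
```

A plain `pg_dump -t` dump contains the `CREATE TABLE` statement as well as the data, so restoring into a database where the table already exists reports errors and may conflict with existing rows. Drop or rename the existing table first, or take the backup with `--clean` or `--data-only`, depending on the situation.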

## Logs

### Log Locations

- Worker log: `logs/worker.log`
- API log: `logs/api.log`
- Access log: `logs/access.log`

### Key Log Keywords

The keywords are literal strings written to the logs, so they are listed as they appear and should be used unchanged with `grep`.

| Keyword | Meaning |
|---------|---------|
| `轮询调度器启动` | The Worker started successfully |
| `渐进式初始化` | Initialization is in progress |
| `实名检查完成` | A real-name check task finished |
| `流量检查完成` | A data-usage check task finished |
| `套餐检查完成` | A package check task finished |
| `告警触发` | An alert rule fired |
| `数据清理完成` | A cleanup task finished |
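
Counting how often each keyword appears gives a quick sense of recent activity. This is a small sketch that assumes one log event per line and that the keywords appear verbatim in `logs/worker.log`; the function names `count_keyword` and `log_summary` are hypothetical.

```bash
# count_keyword FILE KEYWORD
# Prints the number of lines in FILE that contain KEYWORD (fixed string).
count_keyword() {
  grep -cF -- "$2" "$1"
}

# log_summary FILE
# Prints one "keyword: count" line for a few of the keywords above.
log_summary() {
  for kw in 实名检查完成 流量检查完成 套餐检查完成 告警触发; do
    echo "$kw: $(count_keyword "$1" "$kw")"
  done
}
```

For example, `log_summary logs/worker.log` prints four lines, one per keyword.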