## Feature Overview
Refactor the data annotation module from a fixed YOLO operator to generic operator orchestration, giving it the same flexible operator-composition capability as the data cleansing module.
## Changes
### Step 1: Database Changes (DDL)
- Add a SQL migration script: scripts/db/data-annotation-operator-pipeline-migration.sql
- Modify the t_dm_auto_annotation_tasks table:
  - New columns: task_mode, executor_type, pipeline, output_dataset_id, created_by, stop_requested, started_at, heartbeat_at, run_token
  - New indexes: idx_status_created, idx_created_by
- Create the t_dm_annotation_task_operator_instance table, which stores operator instance details
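The shape of the schema change can be sketched with an in-memory SQLite stand-in. The column and index names come from the migration above; the column types and defaults are illustrative assumptions (the real script targets MySQL and may differ):

```python
import sqlite3

# In-memory stand-in for t_dm_auto_annotation_tasks; types/defaults are
# illustrative only -- the real migration script targets MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_dm_auto_annotation_tasks "
    "(id TEXT PRIMARY KEY, status TEXT, created_at TEXT)"
)

# The new columns introduced by the migration.
new_columns = [
    ("task_mode", "TEXT DEFAULT 'legacy_yolo'"),
    ("executor_type", "TEXT DEFAULT 'annotation_local'"),
    ("pipeline", "TEXT"),               # JSON-encoded operator pipeline
    ("output_dataset_id", "TEXT"),
    ("created_by", "TEXT"),
    ("stop_requested", "INTEGER DEFAULT 0"),
    ("started_at", "TEXT"),
    ("heartbeat_at", "TEXT"),
    ("run_token", "TEXT"),              # claim token for atomic task pickup
]
for name, decl in new_columns:
    conn.execute(f"ALTER TABLE t_dm_auto_annotation_tasks ADD COLUMN {name} {decl}")

# The new indexes introduced by the migration.
conn.execute("CREATE INDEX idx_status_created ON t_dm_auto_annotation_tasks (status, created_at)")
conn.execute("CREATE INDEX idx_created_by ON t_dm_auto_annotation_tasks (created_by)")

cols = [row[1] for row in conn.execute("PRAGMA table_info(t_dm_auto_annotation_tasks)")]
print(cols)
```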
### Step 2: API Layer Changes
- Extend the request models (schema/auto.py):
  - Add the OperatorPipelineStep model
  - Support the pipeline field while keeping the legacy YOLO fields for backward compatibility
  - Normalize alternate field spellings (operatorId/operator_id/id, overrides/settingsOverride/settings_override)
- Update the task creation service (service/auto.py):
  - Add a validate_file_ids() validation method
  - Add a _to_pipeline() compatibility mapping method
  - Write the new columns and integrate the operator instance table
  - Fix a fileIds deduplication accuracy issue
- Add API routes (interface/auto.py):
  - Add the /operator-tasks family of endpoints
  - Add stop endpoints (/auto/{id}/stop and /operator-tasks/{id}/stop)
  - Keep the legacy /auto endpoints for backward compatibility
- Align the ORM models (annotation_management.py):
  - Add all new DDL columns to AutoAnnotationTask
  - Add the AnnotationTaskOperatorInstance model
  - Add stopped to the status definitions
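The stop endpoints are cooperative: they flag the task and the worker checks the flag. A framework-independent sketch of that semantics (the `Task` and `request_stop` names are illustrative, not the real service API):

```python
from dataclasses import dataclass

# States from which a stop request makes sense; "stopped" is the new
# terminal state added to the status definitions.
STOPPABLE_STATES = {"pending", "running"}

@dataclass
class Task:
    id: str
    status: str
    stop_requested: bool = False

def request_stop(task: Task) -> bool:
    """Mark a task for cooperative stopping; the worker polls stop_requested."""
    if task.status not in STOPPABLE_STATES:
        return False  # already completed/failed/stopped: nothing to do
    task.stop_requested = True
    return True

running = Task(id="t1", status="running")
done = Task(id="t2", status="completed")
print(request_stop(running), running.stop_requested)  # True True
print(request_stop(done), done.stop_requested)        # False False
```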
### Step 3: Runtime Layer Changes
- Update the worker execution logic (auto_annotation_worker.py):
  - Implement atomic task claiming (run_token)
  - Replace hard-coded YOLO execution with generic pipeline execution
  - Add operator resolution and instantiation
  - Check stop_requested during execution
  - Keep the legacy_yolo mode for backward compatibility
  - Support multiple operator invocation styles (execute and __call__)
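The atomic claim prevents two workers from running the same task: a worker "wins" only if its conditional UPDATE actually changes a row. A minimal sketch using SQLite, assuming the claim works roughly like this (the exact query in auto_annotation_worker.py may differ):

```python
import sqlite3
import uuid
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, run_token TEXT)")
conn.execute("INSERT INTO tasks VALUES ('t1', 'pending', NULL)")

def try_claim(conn: sqlite3.Connection, task_id: str) -> Optional[str]:
    """Attempt to claim a task; return the run_token on success, None on loss."""
    token = uuid.uuid4().hex
    cur = conn.execute(
        "UPDATE tasks SET status = 'running', run_token = ? "
        "WHERE id = ? AND status = 'pending' AND run_token IS NULL",
        (token, task_id),
    )
    # rowcount == 1 means this worker's compare-and-set won the race.
    return token if cur.rowcount == 1 else None

first = try_claim(conn, "t1")   # this worker claims the task
second = try_claim(conn, "t1")  # a competing worker finds it already claimed
print(first is not None, second)  # True None
```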
### Step 4: Gradual Rollout
- Complete the YOLO operator metadata (metadata.yml):
  - Fill in the raw_id, language, modal, inputs, outputs, and settings fields
- Register the annotation operator (__init__.py):
  - Register the YOLO operator in the OPERATORS registry
  - Ensure the annotation package is loaded correctly
- Add whitelist control:
  - Support the AUTO_ANNOTATION_OPERATOR_WHITELIST environment variable
  - Restrict the available operators during gradual rollout
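The whitelist check can be sketched as follows. The environment variable name comes from the change itself; the comma-separated format and the "empty means allow all" behavior are assumptions:

```python
import os
from typing import Optional, Set

def allowed_operators() -> Optional[Set[str]]:
    """Parse the whitelist env var; None means no whitelist is configured."""
    raw = os.environ.get("AUTO_ANNOTATION_OPERATOR_WHITELIST", "").strip()
    if not raw:
        return None  # no whitelist: all registered operators are allowed
    return {name.strip() for name in raw.split(",") if name.strip()}

def is_operator_allowed(operator_id: str) -> bool:
    whitelist = allowed_operators()
    return whitelist is None or operator_id in whitelist

# With the rollout whitelist set, only the YOLO operator passes.
os.environ["AUTO_ANNOTATION_OPERATOR_WHITELIST"] = "ImageObjectDetectionBoundingBox"
print(is_operator_allowed("ImageObjectDetectionBoundingBox"))  # True
print(is_operator_allowed("SomeOtherOperator"))                # False
```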
## Key Features
### Backward Compatibility
- The legacy /auto endpoints are fully preserved
- Legacy request parameters are automatically mapped to a pipeline
- The legacy_yolo mode keeps the old logic running unchanged
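The legacy-to-pipeline mapping could look like the sketch below. The operator raw_id and the camelCase config keys appear elsewhere in this change; the snake_case override keys and defaults are assumptions, and the real mapping lives in `_to_pipeline()` in service/auto.py:

```python
# Hypothetical sketch of mapping a legacy YOLO config to a one-step pipeline.
def to_pipeline(config: dict) -> list:
    return [{
        "operatorId": "ImageObjectDetectionBoundingBox",
        "overrides": {
            "model_size": config.get("modelSize", "n"),
            "conf_threshold": config.get("confThreshold", 0.25),
            "target_classes": config.get("targetClasses", []),
        },
    }]

legacy = {"modelSize": "s", "confThreshold": 0.5, "targetClasses": [0, 2]}
pipeline = to_pipeline(legacy)
print(pipeline[0]["operatorId"])
print(pipeline[0]["overrides"])
```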
### New Capabilities
- Generic pipeline orchestration
- Multi-operator composition
- Task stop control
- Whitelist-based gradual rollout
### Reliability
- Atomic task claiming (prevents duplicate execution)
- Complete error handling and state management
- Detailed audit trail (operator instance table)
## Deployment
1. Run the DDL: `mysql < scripts/db/data-annotation-operator-pipeline-migration.sql`
2. Set the environment variable: `AUTO_ANNOTATION_OPERATOR_WHITELIST=ImageObjectDetectionBoundingBox`
3. Restart the services: `datamate-runtime` and `datamate-backend-python`
## Verification Steps
1. Compatibility mode: create a task through the legacy `/auto` endpoint
2. Generic orchestration: create a pipeline task through the new `/operator-tasks` endpoint
3. Atomic claim: verify the `run_token` mechanism
4. Stop: exercise the stop API
5. Whitelist: verify that non-whitelisted operators are rejected
## Related Files
- DDL: `scripts/db/data-annotation-operator-pipeline-migration.sql`
- API: `runtime/datamate-python/app/module/annotation/`
- Worker: `runtime/python-executor/datamate/auto_annotation_worker.py`
- Operator: `runtime/ops/annotation/image_object_detection_bounding_box/`
## Appendix: schema/auto.py

```python
"""Schemas for Auto Annotation tasks"""

from __future__ import annotations

import json
from typing import List, Optional, Dict, Any
from datetime import datetime

from pydantic import BaseModel, Field, ConfigDict, model_validator


class AutoAnnotationConfig(BaseModel):
    """Auto-annotation task configuration (aligned with the frontend payload)"""

    model_size: str = Field(alias="modelSize", description="Model size: n/s/m/l/x")
    conf_threshold: float = Field(alias="confThreshold", description="Confidence threshold, 0-1")
    target_classes: List[int] = Field(
        default_factory=list,
        alias="targetClasses",
        description="Target class ID list; empty means all classes",
    )
    output_dataset_name: Optional[str] = Field(
        default=None,
        alias="outputDatasetName",
        description="Name of the new dataset to write auto-annotation results into (optional)",
    )

    model_config = ConfigDict(populate_by_name=True)


class OperatorPipelineStep(BaseModel):
    """A single operator node in a generic operator pipeline"""

    operator_id: str = Field(alias="operatorId", description="Operator ID (raw_id)")
    overrides: Dict[str, Any] = Field(
        default_factory=dict,
        alias="overrides",
        description="Operator parameter overrides (maps to settings override)",
    )

    @model_validator(mode="before")
    @classmethod
    def normalize_compatible_fields(cls, value: Any):
        if not isinstance(value, dict):
            return value

        normalized = dict(value)

        if "operatorId" not in normalized:
            for key in ("operator_id", "id"):
                candidate = normalized.get(key)
                if candidate:
                    normalized["operatorId"] = candidate
                    break

        if "overrides" not in normalized:
            for key in ("settingsOverride", "settings_override"):
                candidate = normalized.get(key)
                if isinstance(candidate, str):
                    try:
                        candidate = json.loads(candidate)
                    except Exception:
                        candidate = None
                if isinstance(candidate, dict):
                    normalized["overrides"] = candidate
                    break

        return normalized

    model_config = ConfigDict(populate_by_name=True)


class CreateAutoAnnotationTaskRequest(BaseModel):
    """Request body for creating an auto-annotation task, aligned with the
    structure sent by the frontend CreateAutoAnnotationDialog"""

    name: str = Field(..., min_length=1, max_length=255, description="Task name")
    dataset_id: str = Field(..., alias="datasetId", description="Dataset ID")
    config: Optional[AutoAnnotationConfig] = Field(
        default=None,
        description="Legacy YOLO task configuration (backward compatible)",
    )
    pipeline: Optional[List[OperatorPipelineStep]] = Field(
        default=None,
        description="Generic operator pipeline definition",
    )
    task_mode: str = Field(
        default="legacy_yolo",
        alias="taskMode",
        description="Task mode: legacy_yolo/pipeline",
    )
    executor_type: str = Field(
        default="annotation_local",
        alias="executorType",
        description="Executor type",
    )
    output_dataset_name: Optional[str] = Field(
        default=None,
        alias="outputDatasetName",
        description="Output dataset name (takes precedence over config.outputDatasetName)",
    )
    file_ids: Optional[List[str]] = Field(
        None,
        alias="fileIds",
        description="List of file IDs to process; empty means all images in the dataset",
    )

    @model_validator(mode="after")
    def validate_config_or_pipeline(self):
        if self.config is None and not self.pipeline:
            raise ValueError("Either config or pipeline must be provided")
        return self

    model_config = ConfigDict(populate_by_name=True)


class AutoAnnotationTaskResponse(BaseModel):
    """Auto-annotation task response model (reusable for both list and detail views)"""

    id: str = Field(..., description="Task ID")
    name: str = Field(..., description="Task name")
    dataset_id: str = Field(..., alias="datasetId", description="Dataset ID")
    dataset_name: Optional[str] = Field(None, alias="datasetName", description="Dataset name")
    task_mode: Optional[str] = Field(None, alias="taskMode", description="Task mode")
    executor_type: Optional[str] = Field(None, alias="executorType", description="Executor type")
    pipeline: Optional[List[Dict[str, Any]]] = Field(None, description="Operator pipeline definition")
    source_datasets: Optional[List[str]] = Field(
        default=None,
        alias="sourceDatasets",
        description="Names of all datasets actually involved in this task",
    )
    config: Dict[str, Any] = Field(..., description="Task configuration")
    status: str = Field(..., description="Task status")
    progress: int = Field(..., description="Task progress, 0-100")
    total_images: int = Field(..., alias="totalImages", description="Total number of images")
    processed_images: int = Field(..., alias="processedImages", description="Number of processed images")
    detected_objects: int = Field(..., alias="detectedObjects", description="Total number of detected objects")
    output_path: Optional[str] = Field(None, alias="outputPath", description="Output path")
    output_dataset_id: Optional[str] = Field(
        None,
        alias="outputDatasetId",
        description="Output dataset ID",
    )
    stop_requested: Optional[bool] = Field(
        None,
        alias="stopRequested",
        description="Whether a stop has been requested",
    )
    error_message: Optional[str] = Field(None, alias="errorMessage", description="Error message")
    created_by: Optional[str] = Field(None, alias="createdBy", description="Creator")
    started_at: Optional[datetime] = Field(None, alias="startedAt", description="Start time")
    heartbeat_at: Optional[datetime] = Field(None, alias="heartbeatAt", description="Heartbeat time")
    created_at: datetime = Field(..., alias="createdAt", description="Creation time")
    updated_at: Optional[datetime] = Field(None, alias="updatedAt", description="Update time")
    completed_at: Optional[datetime] = Field(None, alias="completedAt", description="Completion time")

    model_config = ConfigDict(populate_by_name=True, from_attributes=True)


class AutoAnnotationTaskListResponse(BaseModel):
    """Auto-annotation task list response; the frontend currently consumes a
    plain array, so the pagination structure is reserved here"""

    content: List[AutoAnnotationTaskResponse] = Field(..., description="Task list")
    total: int = Field(..., description="Total count")

    model_config = ConfigDict(populate_by_name=True)
```
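The multi-spelling normalization in `OperatorPipelineStep` can be exercised in isolation. Below is a stdlib-only mirror of `normalize_compatible_fields`, extracted from the validator above so the alias handling can be demonstrated without pydantic:

```python
import json

# Standalone mirror of OperatorPipelineStep.normalize_compatible_fields,
# for illustration only; the authoritative logic is the pydantic validator.
def normalize_step(value):
    if not isinstance(value, dict):
        return value
    normalized = dict(value)
    # Accept operatorId / operator_id / id, in that priority order.
    if "operatorId" not in normalized:
        for key in ("operator_id", "id"):
            if normalized.get(key):
                normalized["operatorId"] = normalized[key]
                break
    # Accept overrides / settingsOverride / settings_override; string
    # values are parsed as JSON, and non-dict results are discarded.
    if "overrides" not in normalized:
        for key in ("settingsOverride", "settings_override"):
            candidate = normalized.get(key)
            if isinstance(candidate, str):
                try:
                    candidate = json.loads(candidate)
                except Exception:
                    candidate = None
            if isinstance(candidate, dict):
                normalized["overrides"] = candidate
                break
    return normalized

step = normalize_step({"operator_id": "op-1", "settingsOverride": '{"conf": 0.4}'})
print(step["operatorId"], step["overrides"])  # op-1 {'conf': 0.4}
```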