Jerry Yan 2f49fc4199 feat(annotation): support generic operator pipeline orchestration for data annotation
## Overview
Refactors the data annotation module from a fixed YOLO operator to generic operator pipeline orchestration, enabling flexible operator composition similar to the data cleaning module.

## Changes

### Step 1: Database changes (DDL)
- Add SQL migration script: scripts/db/data-annotation-operator-pipeline-migration.sql
- Alter table t_dm_auto_annotation_tasks:
  - New columns: task_mode, executor_type, pipeline, output_dataset_id, created_by, stop_requested, started_at, heartbeat_at, run_token
  - New indexes: idx_status_created, idx_created_by
- Create table t_dm_annotation_task_operator_instance to store operator-instance details (a model-level sketch follows this list)
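A minimal sketch of how the new table might look as a declarative model, assuming SQLAlchemy; the table and column names follow the DDL above, while the types, lengths, and defaults are assumptions:

```python
# Hypothetical mapping of the operator-instance audit table.
# Column names follow the migration; types and lengths are assumptions.
from sqlalchemy import JSON, Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class AnnotationTaskOperatorInstance(Base):
    __tablename__ = "t_dm_annotation_task_operator_instance"

    id = Column(String(64), primary_key=True)
    task_id = Column(String(64), index=True)   # parent row in t_dm_auto_annotation_tasks
    operator_id = Column(String(128))          # which operator this step runs
    step_index = Column(Integer)               # position within the pipeline
    settings = Column(JSON)                    # per-step overrides, if any
    status = Column(String(32), default="pending")
    created_at = Column(DateTime)
```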

### Step 2: API layer changes
- Extend the request models (schema/auto.py):
  - Add an OperatorPipelineStep model
  - Support a pipeline field while keeping the legacy YOLO fields for backward compatibility
  - Normalize alternate key spellings (operatorId/operator_id/id, overrides/settingsOverride/settings_override); see the normalization sketch after this list
- Update the task-creation service (service/auto.py):
  - Add a validate_file_ids() validation method
  - Add a _to_pipeline() compatibility-mapping method
  - Write the new fields and integrate the operator-instance table
  - Fix an accuracy issue in fileIds deduplication
- Add API routes (interface/auto.py):
  - Add the /operator-tasks family of endpoints
  - Add stop endpoints (/auto/{id}/stop and /operator-tasks/{id}/stop); see the route sketch after this list
  - Keep the legacy /auto endpoints for backward compatibility
- Align the ORM models (annotation_management.py):
  - Add all new DDL fields to AutoAnnotationTask
  - Add an AnnotationTaskOperatorInstance model
  - Add stopped to the task status definitions
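A minimal sketch of the key normalization described above, assuming pydantic v2; the field names come from the commit, while the validator logic is illustrative rather than the actual implementation in schema/auto.py:

```python
# Hypothetical sketch of OperatorPipelineStep input normalization.
# Accepts operatorId/operator_id/id and overrides/settingsOverride/settings_override.
from typing import Any
from pydantic import BaseModel, model_validator

class OperatorPipelineStep(BaseModel):
    operator_id: str
    overrides: dict[str, Any] = {}

    @model_validator(mode="before")
    @classmethod
    def _normalize_keys(cls, data: Any) -> Any:
        if not isinstance(data, dict):
            return data
        out = dict(data)
        # Fold the alternate spellings into the canonical field names.
        for key in ("operatorId", "id"):
            if "operator_id" not in out and key in out:
                out["operator_id"] = out.pop(key)
        for key in ("settingsOverride", "settings_override"):
            if "overrides" not in out and key in out:
                out["overrides"] = out.pop(key)
        return out
```

With this shape, `OperatorPipelineStep.model_validate({"operatorId": "x", "settingsOverride": {}})` and the snake_case spelling both produce the same normalized step.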
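For the stop endpoint, a minimal route sketch, assuming FastAPI (which the interface/schema/service layout suggests); the handler body and response shape are assumptions:

```python
# Hypothetical sketch of the stop route described for interface/auto.py.
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()

class StopResponse(BaseModel):
    task_id: str
    stop_requested: bool

@router.post("/operator-tasks/{task_id}/stop", response_model=StopResponse)
async def stop_operator_task(task_id: str) -> StopResponse:
    # In the real service this would set stop_requested on the task row;
    # the worker then halts between pipeline steps.
    return StopResponse(task_id=task_id, stop_requested=True)
```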

### Step 3: Runtime layer changes
- Update the worker execution logic (auto_annotation_worker.py):
  - Implement atomic task claiming via run_token; see the sketch after this list
  - Replace the hardcoded YOLO path with generic pipeline execution
  - Add operator resolution and instantiation
  - Check stop_requested during execution
  - Keep a legacy_yolo mode for backward compatibility
  - Support both operator invocation styles (execute and __call__)
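A minimal sketch of the atomic claim and the dual invocation styles, assuming SQLAlchemy Core over MySQL; the guarded-UPDATE pattern matches the run_token mechanism described above, but the exact columns, status values, and helper names are illustrative, not the worker's actual code:

```python
# Hypothetical run_token claim: only the worker whose UPDATE matched a row
# (rowcount == 1) may run the task, which prevents duplicate execution.
import uuid
from sqlalchemy import text

def try_claim_task(conn, task_id: str) -> str | None:
    run_token = uuid.uuid4().hex
    result = conn.execute(
        text(
            "UPDATE t_dm_auto_annotation_tasks "
            "SET run_token = :tok, status = 'running', started_at = NOW() "
            "WHERE id = :tid AND run_token IS NULL AND status = 'pending'"
        ),
        {"tok": run_token, "tid": task_id},
    )
    return run_token if result.rowcount == 1 else None

def invoke_operator(op, batch):
    # The commit supports both invocation styles: execute() and __call__.
    if hasattr(op, "execute"):
        return op.execute(batch)
    if callable(op):
        return op(batch)
    raise TypeError(f"operator {op!r} is not invocable")
```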

### Step 4: Gradual rollout
- Complete the YOLO operator metadata (metadata.yml):
  - Fill in the raw_id, language, modal, inputs, outputs, and settings fields
- Register the annotation operators (__init__.py):
  - Register the YOLO operator in the OPERATORS registry
  - Ensure the annotation package is loaded correctly
- Add whitelist control (see the sketch after this list):
  - Support the AUTO_ANNOTATION_OPERATOR_WHITELIST environment variable
  - Restrict the set of available operators during gradual rollout
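The whitelist gate might look like the following minimal sketch; the variable name comes from the commit, while the comma-separated format and the empty-means-unrestricted behavior are assumptions:

```python
# Hypothetical whitelist filter over the OPERATORS registry.
# Comma-separated parsing and "empty means unrestricted" are assumptions.
import os

def allowed_operators(registry: dict) -> dict:
    raw = os.environ.get("AUTO_ANNOTATION_OPERATOR_WHITELIST", "").strip()
    if not raw:
        return dict(registry)  # no whitelist set: everything is available
    allowed = {name.strip() for name in raw.split(",") if name.strip()}
    return {name: op for name, op in registry.items() if name in allowed}
```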

## Key Features

### Backward compatibility
- The legacy /auto endpoints are fully preserved
- Legacy request parameters are automatically mapped to a pipeline
- The legacy_yolo mode keeps the old logic running as before

### New capabilities
- Generic pipeline orchestration
- Multi-operator composition
- Task stop control
- Whitelist-based gradual rollout

### Reliability
- Atomic task claiming (prevents duplicate execution)
- Complete error handling and state management
- Detailed audit trail (operator-instance table)

## Deployment

1. Apply the DDL: mysql < scripts/db/data-annotation-operator-pipeline-migration.sql
2. Set the environment variable: AUTO_ANNOTATION_OPERATOR_WHITELIST=ImageObjectDetectionBoundingBox
3. Restart the datamate-runtime and datamate-backend-python services

## Verification

1. Compatibility mode: create a task through the legacy /auto endpoints
2. Generic orchestration: create a pipeline task through the new /operator-tasks endpoint
3. Atomic claim: verify the run_token mechanism
4. Stop control: exercise the stop endpoints (a smoke-test sketch follows this list)
5. Whitelist: verify that non-whitelisted operators are rejected
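Below is a minimal smoke-test sketch for the orchestration and stop checks (items 2 and 4), assuming the service is reachable locally; the base URL, the payload shape, the settings key, and the response field are assumptions, not the documented API:

```python
# Hypothetical smoke test: create a pipeline task, then request a stop.
# BASE, the payload shape, and the "id" response field are assumptions.
import requests

BASE = "http://localhost:30000/api"  # assumed service base URL

resp = requests.post(
    f"{BASE}/operator-tasks",
    json={
        "pipeline": [
            {
                "operatorId": "ImageObjectDetectionBoundingBox",
                "settingsOverride": {"confidence": 0.5},  # assumed setting
            }
        ]
    },
)
resp.raise_for_status()
task_id = resp.json()["id"]  # assumed response field

# Ask the worker to stop; it should observe stop_requested between steps.
requests.post(f"{BASE}/operator-tasks/{task_id}/stop").raise_for_status()
```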

## Related Files

- DDL: scripts/db/data-annotation-operator-pipeline-migration.sql
- API: runtime/datamate-python/app/module/annotation/
- Worker: runtime/python-executor/datamate/auto_annotation_worker.py
- Operators: runtime/ops/annotation/image_object_detection_bounding_box/

# DataMate: All-in-One Data Work Platform


DataMate is an enterprise-level data processing platform for model fine-tuning and RAG retrieval, supporting core functions such as data collection, data management, operator marketplace, data cleaning, data synthesis, data annotation, data evaluation, and knowledge generation.


If you like this project, please give it a Star!

## 🌟 Core Features

  • Core Modules: Data Collection, Data Management, Operator Marketplace, Data Cleaning, Data Synthesis, Data Annotation, Data Evaluation, Knowledge Generation.
  • Visual Orchestration: Drag-and-drop data processing workflow design.
  • Operator Ecosystem: Rich built-in operators and support for custom operators.

## 🚀 Quick Start

### Prerequisites

  • Git (for pulling source code)
  • Make (for building and installing)
  • Docker (for building images and deploying services)
  • Docker-Compose (for service deployment - Docker method)
  • Kubernetes (for service deployment - k8s method)
  • Helm (for service deployment - k8s method)

This project supports two deployment methods: docker-compose and helm. After running the install command, enter the number of the deployment method you want at the prompt, which looks like this:

```
Choose a deployment method:
1. Docker/Docker-Compose
2. Kubernetes/Helm
Enter choice:
```

### Clone the Code

```bash
git clone git@github.com:ModelEngine-Group/DataMate.git
cd DataMate
```

### Deploy the basic services

```bash
make install
```

If your machine does not have make installed, run the following commands to deploy instead:

```bash
# Windows
set REGISTRY=ghcr.io/modelengine-group/
docker compose -f ./deployment/docker/datamate/docker-compose.yml up -d
docker compose -f ./deployment/docker/milvus/docker-compose.yml up -d

# Linux/Mac
export REGISTRY=ghcr.io/modelengine-group/
docker compose -f ./deployment/docker/datamate/docker-compose.yml up -d
docker compose -f ./deployment/docker/milvus/docker-compose.yml up -d
```

Once the containers are running, open http://localhost:30000 in a browser to view the front-end interface.

To list all available Make targets, flags and help text, run:

```bash
make help
```

### Build and deploy Mineru Enhanced PDF Processing

```bash
make build-mineru
make install-mineru
```

### Deploy the DeerFlow service

```bash
make install-deer-flow
```

### Local Development and Deployment

After modifying the local code, run the following commands to build the image and deploy using the local image:

```bash
make build
make install dev=true
```

### Uninstall

```bash
make uninstall
```

When running make uninstall, the installer will prompt once whether to delete volumes; that single choice is applied to all components. The uninstall order is: milvus -> label-studio -> datamate, which ensures the datamate network is removed cleanly after services that use it have stopped.

## 🤝 Contribution Guidelines

Thank you for your interest in this project! We warmly welcome contributions from the community. Whether it's submitting bug reports, suggesting new features, or directly participating in code development, all forms of help make the project better.

- 📮 GitHub Issues: Submit bugs or feature suggestions.
- 🔧 GitHub Pull Requests: Contribute code improvements.

## 📄 License

DataMate is open source under the MIT license. You are free to use, modify, and distribute the code of this project in compliance with the license terms.
