2f49fc4199
feat(annotation): 支持通用算子编排的数据标注功能
...
## 功能概述
将数据标注模块从固定 YOLO 算子改造为支持通用算子编排,实现与数据清洗模块类似的灵活算子组合能力。
## 改动内容
### 第 1 步:数据库改造(DDL)
- 新增 SQL migration 脚本:scripts/db/data-annotation-operator-pipeline-migration.sql
- 修改 t_dm_auto_annotation_tasks 表:
- 新增字段:task_mode, executor_type, pipeline, output_dataset_id, created_by, stop_requested, started_at, heartbeat_at, run_token
- 新增索引:idx_status_created, idx_created_by
- 创建 t_dm_annotation_task_operator_instance 表:用于存储算子实例详情
### 第 2 步:API 层改造
- 扩展请求模型(schema/auto.py):
- 新增 OperatorPipelineStep 模型
- 支持 pipeline 字段,保留旧 YOLO 字段向后兼容
- 实现多写法归一(operatorId/operator_id/id, overrides/settingsOverride/settings_override)
- 修改任务创建服务(service/auto.py):
- 新增 validate_file_ids() 校验方法
- 新增 _to_pipeline() 兼容映射方法
- 写入新字段并集成算子实例表
- 修复 fileIds 去重准确性问题
- 新增 API 路由(interface/auto.py):
- 新增 /operator-tasks 系列接口
- 新增 stop API 接口(/auto/{id}/stop 和 /operator-tasks/{id}/stop)
- 保留旧 /auto 接口向后兼容
- ORM 模型对齐(annotation_management.py):
- AutoAnnotationTask 新增所有 DDL 字段
- 新增 AnnotationTaskOperatorInstance 模型
- 状态定义补充 stopped
### 第 3 步:Runtime 层改造
- 修改 worker 执行逻辑(auto_annotation_worker.py):
- 实现原子任务抢占机制(run_token)
- 从硬编码 YOLO 改为通用 pipeline 执行
- 新增算子解析和实例化能力
- 支持 stop_requested 检查
- 保留 legacy_yolo 模式向后兼容
- 支持多种算子调用方式(execute 和 __call__)
### 第 4 步:灰度发布
- 完善 YOLO 算子元数据(metadata.yml):
- 补齐 raw_id, language, modal, inputs, outputs, settings 字段
- 注册标注算子(__init__.py):
- 将 YOLO 算子注册到 OPERATORS 注册表
- 确保 annotation 包被正确加载
- 新增白名单控制:
- 支持环境变量 AUTO_ANNOTATION_OPERATOR_WHITELIST
- 灰度发布时可限制可用算子
## 关键特性
### 向后兼容
- 旧 /auto 接口完全保留
- 旧请求参数自动映射到 pipeline
- legacy_yolo 模式确保旧逻辑正常运行
### 新功能
- 支持通用 pipeline 编排
- 支持多算子组合
- 支持任务停止控制
- 支持白名单灰度发布
### 可靠性
- 原子任务抢占(防止重复执行)
- 完整的错误处理和状态管理
- 详细的审计追踪(算子实例表)
## 部署说明
1. 执行 DDL:mysql < scripts/db/data-annotation-operator-pipeline-migration.sql
2. 配置环境变量:AUTO_ANNOTATION_OPERATOR_WHITELIST=ImageObjectDetectionBoundingBox
3. 重启服务:datamate-runtime 和 datamate-backend-python
## 验证步骤
1. 兼容模式验证:使用旧 /auto 接口创建任务
2. 通用编排验证:使用新 /operator-tasks 接口创建 pipeline 任务
3. 原子 claim 验证:检查 run_token 机制
4. 停止验证:测试 stop API
5. 白名单验证:测试算子白名单拦截
## 相关文件
- DDL: scripts/db/data-annotation-operator-pipeline-migration.sql
- API: runtime/datamate-python/app/module/annotation/
- Worker: runtime/python-executor/datamate/auto_annotation_worker.py
- 算子: runtime/ops/annotation/image_object_detection_bounding_box/
2026-02-07 22:35:33 +08:00
109069c0da
1
2026-01-19 13:09:48 +08:00
3f36be0f9f
feat(runtime): 实现运行时操作模块的自动导入功能
...
- 添加 importlib 和 os 模块用于动态导入
- 集成 loguru 日志记录器进行错误追踪
- 实现自动遍历并导入所有子模块的逻辑
- 添加异常处理机制捕获模块加载失败的情况
- 确保所有子模块注册的算子能够正确加载
- 修复模块导入顺序以支持注解操作正常工作
2026-01-19 12:37:40 +08:00
b5aaf52bb6
chore(deps): 更新 paddlenlp 依赖版本
...
- 将 paddlenlp 从 3.0b4 版本降级到 2.8.1 版本
- 保持其他依赖包版本不变
- 确保依赖版本兼容性
2026-01-09 17:20:05 +08:00
103cb94a6d
feat(runtime): 添加 PaddleNLP 依赖包
...
- 在 pyproject.toml 中新增 paddlenlp==3.0.0b4 依赖
- 为 OCR 功能扩展提供自然语言处理支持
2026-01-09 15:51:42 +08:00
Kecheng Sha
3f1ad6a872
feat(auto-annotation): integrate YOLO auto-labeling and enhance data management ( #223 )
...
* feat(auto-annotation): initial setup
* chore: remove package-lock.json
* chore: 清理本地测试脚本与 Maven 设置
* chore: change package-lock.json
2026-01-05 14:22:44 +08:00
hhhhsc701
6a1eb85e8e
feat: 支持运行data-juicer算子 ( #215 )
...
* feature: 增加data-juicer算子
* feat: 支持运行data-juicer算子
* feat: 支持data-juicer任务下发
* feat: 支持data-juicer结果数据集归档
* feat: 支持data-juicer结果数据集归档
2025-12-31 09:20:41 +08:00
hhhhsc701
d82bff441a
fix: prevent deletion of predefined operators and improve error handling ( #192 )
...
* fix: prevent deletion of predefined operators and improve error handling
* fix: prevent deletion of predefined operators and improve error handling
2025-12-22 19:30:41 +08:00
hhhhsc701
46f4a8c219
feat: add download functionality for example operator and update Dock… ( #188 )
...
* feat: add download functionality for example operator and update Dockerfile
* feat: enhance download response by exposing content disposition header
* feat: update download function to accept filename parameter for example operator
2025-12-22 15:39:32 +08:00
hhhhsc701
ab4523b556
add export type settings and enhance metadata structure ( #181 )
...
* fix(session): enhance database connection settings with pool pre-ping and recycle options
* feat(metadata): add export type settings and enhance metadata structure
* fix(base_op): improve sample handling by introducing target_type key and consolidating text/data retrieval logic
* feat(metadata): add export type settings and enhance metadata structure
* feat(metadata): add export type settings and enhance metadata structure
2025-12-19 11:54:08 +08:00
hhhhsc701
be875086db
feat: add operator-packages-volume to docker-compose and update Docke… ( #179 )
...
* feat: add operator-packages-volume to docker-compose and update Dockerfile for site-packages path
* feat: add retry
2025-12-18 20:32:42 +08:00
hhhhsc701
761f7f6a51
fix: optimize PDF parsing by implementing concurrent processing with … ( #177 )
...
* fix: optimize PDF parsing by implementing concurrent processing with ThreadPoolExecutor
* Refactor to async processing for file extraction
Refactor the file processing to use asyncio for improved performance and concurrency.
2025-12-18 15:28:30 +08:00
hhhhsc701
924d977d6f
支持mineru npu处理 ( #174 )
...
* feature: unstructured支持简单pdf处理
* feature: update values.yaml to enhance ray-cluster configuration with security context, environment variables, and resource limits
* feature: update deploy.yaml and process.py for mineru server configuration and PDF processing enhancements
* feature: update deploy.yaml and process.py for mineru server configuration and PDF processing enhancements
* feature: improve PDF processing logic and update dependencies in process.py and pyproject.toml
* feature: improve PDF processing logic and update dependencies in process.py and pyproject.toml
* feature: update Dockerfile for improved package source mirrors and add mineru-npu to build targets
2025-12-17 16:31:06 +08:00
hhhhsc701
d8c0b0ed73
补充modal范围 ( #165 )
2025-12-12 13:34:03 +08:00
hhhhsc701
19a04df276
feature: 增加水印去除/高级匿名化算子 ( #151 )
...
* feature: 增加水印去除算子
* feature: clean code
* feature: clean code
* feature: 增加高级匿名化算子
2025-12-10 18:12:47 +08:00
hhhhsc701
d59c167da4
算子将抽取与落盘固定到流程中 ( #134 )
...
* feature: 将抽取动作移到每一个算子中
* feature: 落盘算子改为默认执行
* feature: 优化前端展示
* feature: 使用pyproject管理依赖
2025-12-05 17:26:29 +08:00
hhhhsc701
265e284fb8
feature: 修改算子开发指南 ( #127 )
2025-12-03 17:45:08 +08:00
hhhhsc701
07029d07ff
优化清洗重试机制,优化清洗进度展示,修复模板无法展示参数 ( #113 )
...
* bugfix: 模板无法展示参数
* bugfix: 优化清洗进度展示
* bugfix: 优化清洗重试机制
2025-11-28 15:28:10 +08:00
hhhhsc701
f1bffdcd61
bugfix: 创建清洗任务时修改数据集状态;无法删除已在模板/运行任务的算子
...
* bugfix: 创建清洗任务时修改数据集状态;无法删除已在模板/运行任务的算子
2025-11-27 17:34:53 +08:00
hhhhsc701
5cef9cb273
feature: deer-flow支持从datamate获取外部接入模型 ( #83 )
...
* feature: deer-flow支持从datamate获取外部接入模型
2025-11-13 20:13:16 +08:00
hhhhsc701
f78475e29f
Develop hsc ( #58 )
...
feature: 优化镜像构建/部署
2025-11-06 17:14:54 +08:00
hhhhsc701
05b26a2981
feature: 更新算子名称;增加创建任务、模板校验 ( #57 )
...
* feature: 更新算子名称;增加创建任务、模板校验
* feature: 镜像构建增加缓存
2025-11-05 17:38:03 +08:00
Startalker
a600c1d793
feature: modify UnstructuredFormatter and ExternalPDFFormatter description ( #44 )
...
* feature: add UnstructuredFormatter
* feature: add UnstructuredFormatter in db
* feature: add unstructured[docx]==0.18.15
* feature: support doc
* feature: add mineru
* feature: add external pdf extract operator by using mineru
* feature: mineru docker install bugfix
* feature: add unstructured xlsx/xls/csv/pptx/ppt
* feature: modify UnstructuredFormatter and ExternalPDFFormatter description
---------
Co-authored-by: Startalker <438747480@qq.com >
2025-10-31 10:32:14 +08:00
Startalker
06b05a65a9
feature: add unstructured xlsx/xls/csv/pptx/ppt ( #41 )
...
* feature: add UnstructuredFormatter
* feature: add UnstructuredFormatter in db
* feature: add unstructured[docx]==0.18.15
* feature: support doc
* feature: add mineru
* feature: add external pdf extract operator by using mineru
* feature: mineru docker install bugfix
* feature: add unstructured xlsx/xls/csv/pptx/ppt
---------
Co-authored-by: Startalker <438747480@qq.com >
2025-10-30 20:21:12 +08:00
Startalker
155603b1ca
feature: add external pdf extract operator by using mineru ( #36 )
...
* feature: add UnstructuredFormatter
* feature: add UnstructuredFormatter in db
* feature: add unstructured[docx]==0.18.15
* feature: support doc
* feature: add mineru
* feature: add external pdf extract operator by using mineru
* feature: mineru docker install bugfix
---------
Co-authored-by: Startalker <438747480@qq.com >
2025-10-30 15:55:10 +08:00
Startalker
f86d4fae25
feature: add unstructured formatter operator for doc/docx ( #17 )
...
* feature: add UnstructuredFormatter
* feature: add UnstructuredFormatter in db
* feature: add unstructured[docx]==0.18.15
* feature: support doc
---------
Co-authored-by: Startalker <438747480@qq.com >
2025-10-23 16:49:03 +08:00
hhhhsc701
31ef8bc265
[Feature] Refactor project to use 'datamate' naming convention for services and configurations ( #14 )
...
* Enhance CleaningTaskService to track cleaning process progress and update ExecutorType to DATAMATE
* Refactor project to use 'datamate' naming convention for services and configurations
2025-10-22 17:53:16 +08:00
Dallas98
1c97afed7d
init datamate
2025-10-21 23:00:48 +08:00