Commit Graph

11 Commits

Author SHA1 Message Date
f707ce9dae feat(auto-annotation): add batch progress updates to reduce DB write pressure
Throttle progress updates to reduce database write operations during large dataset processing.

Key features:
- Add PROGRESS_UPDATE_INTERVAL config (default 2.0s, configurable via AUTO_ANNOTATION_PROGRESS_INTERVAL env)
- Conditional progress updates: Only write to DB when (now - last_update) >= interval
- Use time.monotonic() for timing (immune to system clock adjustments)
- Final status updates (completed/stopped/failed) always execute (not throttled)

Implementation:
- Initialize last_progress_update timestamp before as_completed() loop
- Replace unconditional _update_task_status() with conditional call based on time interval
- Update docstring to reflect throttling capability
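The throttling logic above can be sketched as a minimal loop. This is an illustration, not the worker's actual code: `process_with_throttled_progress` and the `update_status` callback are hypothetical names standing in for the real `_update_task_status()` call.

```python
import time

PROGRESS_UPDATE_INTERVAL = 2.0  # default; AUTO_ANNOTATION_PROGRESS_INTERVAL env in the real worker

def process_with_throttled_progress(items, update_status, interval=PROGRESS_UPDATE_INTERVAL):
    """Process items, writing progress to the DB at most once per `interval` seconds."""
    processed = 0
    last_update = time.monotonic()  # monotonic: immune to system clock adjustments
    for item in items:
        # ... per-item work would happen here ...
        processed += 1
        now = time.monotonic()
        if interval <= 0 or (now - last_update) >= interval:
            update_status(processed)   # conditional DB write
            last_update = now
    update_status(processed)           # final status update always executes (not throttled)
    return processed
```

With `interval=0` every item triggers a write, matching the previous per-file behavior.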

Performance impact (T=2s):
- 1,000 files / 100s processing: DB writes reduced from 1,000 to ~50 (95% reduction)
- 10,000 files / 500s processing: DB writes reduced from 10,000 to ~250 (97.5% reduction)
- Small datasets (10 files): Minimal difference

Backward compatibility:
- PROGRESS_UPDATE_INTERVAL=0: Updates every file (identical to previous behavior)
- Heartbeat mechanism unaffected (2s interval << 300s timeout)
- Stop check mechanism independent of progress updates
- Final status updates always execute

Testing:
- 14 unit tests all passed (11 existing + 3 new):
  * Fast processing with throttling
  * PROGRESS_UPDATE_INTERVAL=0 updates every file
  * Slow processing (per-file > T) updates every file
- py_compile syntax check passed

Edge cases handled:
- Single file task: Works normally
- Very slow processing: Degrades to per-file updates
- Concurrent FILE_WORKERS > 1: counters remain accurate (lock-protected); the DB lags the true counts by at most T seconds
2026-02-10 16:49:37 +08:00
9988ff00f5 feat(auto-annotation): add concurrent processing support
Enable parallel processing for auto-annotation tasks with configurable worker count and file-level parallelism.

Key features:
- Multi-worker support: WORKER_COUNT env var (default 1) controls number of worker threads
- Intra-task file parallelism: FILE_WORKERS env var (default 1) controls concurrent file processing within a single task
- Operator chain pooling: Pre-create N independent chain instances to avoid thread-safety issues
- Thread-safe progress tracking: Use threading.Lock to protect shared counters
- Stop signal handling: threading.Event for graceful cancellation during concurrent processing

Implementation details:
- Refactor _process_single_task() to use ThreadPoolExecutor + as_completed()
- Chain pool (queue.Queue): Each worker thread acquires/releases a chain instance
- Protected counters: processed_images, detected_total, file_results with Lock
- Stop check: Periodic check of _is_stop_requested() during concurrent processing
- Refactor start_auto_annotation_worker(): Move recovery logic here, start WORKER_COUNT threads
- Simplify _worker_loop(): Remove recovery call, keep only polling + processing
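The chain-pool pattern above can be illustrated with a minimal sketch, assuming a non-thread-safe `Chain` stub (the class and its `run` method are hypothetical; the real worker pools operator-chain instances the same way):

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

FILE_WORKERS = 4

class Chain:
    """Stand-in for a non-thread-safe operator chain."""
    def run(self, item):
        return item * 2

# Pre-create N independent chain instances so threads never share one.
pool = queue.Queue()
for _ in range(FILE_WORKERS):
    pool.put(Chain())

lock = threading.Lock()
processed = {"count": 0}
results = []

def process_file(item):
    chain = pool.get()            # acquire a chain instance from the pool
    try:
        out = chain.run(item)
    finally:
        pool.put(chain)           # always release it back, even on error
    with lock:                    # lock-protected shared counters
        processed["count"] += 1
        results.append(out)
    return out

with ThreadPoolExecutor(max_workers=FILE_WORKERS) as ex:
    futures = [ex.submit(process_file, i) for i in range(10)]
    for fut in as_completed(futures):
        fut.result()              # surface any worker exception
```

Because each thread holds at most one chain at a time, a pool of `FILE_WORKERS` instances is sufficient and no chain is ever used concurrently.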

Backward compatibility:
- Default config (WORKER_COUNT=1, FILE_WORKERS=1) behaves identically to previous version
- No breaking changes to existing deployments

Testing:
- 11 unit tests all passed:
  * Multi-worker startup
  * Chain pool acquire/release
  * Concurrent file processing
  * Stop signal handling
  * Thread-safe counter updates
  * Backward compatibility (FILE_WORKERS=1)
- py_compile syntax check passed

Performance benefits:
- WORKER_COUNT=3: Process 3 tasks simultaneously
- FILE_WORKERS=4: Process 4 files in parallel within each task
- Combined: Up to 12x throughput improvement (3 workers × 4 files)
2026-02-10 16:36:34 +08:00
2fbfefdb91 feat(auto-annotation): add worker recovery mechanism for stale tasks
Automatically recover running tasks with stale heartbeats on worker startup, preventing tasks from being permanently stuck after container restarts.

Key changes:
- Add HEARTBEAT_TIMEOUT_SECONDS constant (default 300s, configurable via env)
- Add _recover_stale_running_tasks() function:
  * Scans for status='running' tasks with heartbeat timeout
  * No progress (processed=0) → reset to pending (auto-retry)
  * Has progress (processed>0) → mark as failed with Chinese error message
  * Each task recovery is independent (single failure doesn't affect others)
  * Skip recovery if timeout is 0 or negative (disable feature)
- Call recovery function in _worker_loop() before polling loop
- Update file header comments to reflect recovery mechanism

Recovery logic:
- Query: status='running' AND (heartbeat_at IS NULL OR heartbeat_at < NOW() - timeout)
- Decision based on processed_images count
- Clear run_token to allow other workers to claim
- Single transaction per task for atomicity
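The recovery decision described above reduces to a small rule; a sketch (the function name is illustrative, not the worker's API):

```python
def recovery_action(status, processed_images):
    """Decide how to recover a stale 'running' task, per the rules above."""
    if status != "running":
        return None                # only stale running tasks are recovered
    if processed_images == 0:
        return "pending"           # no progress: reset for automatic retry
    return "failed"                # partial progress: mark failed for manual review
```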

Edge cases handled:
- Database unavailable: recovery failure doesn't block worker startup
- Concurrent recovery: UPDATE WHERE status='running' prevents duplicates
- NULL heartbeat: extreme case (crash right after claim) also recovered
- stop_requested tasks: automatically excluded by _fetch_pending_task()

Testing:
- 8 unit tests all passed:
  * No timeout tasks
  * Timeout disabled
  * No progress → pending
  * Has progress → failed
  * NULL heartbeat recovery
  * Multiple tasks mixed processing
  * DB error doesn't crash
  * Negative timeout disables feature
2026-02-10 16:19:22 +08:00
dc490f03be feat(auto-annotation): unify annotation results with Label Studio format
Automatically convert auto-annotation outputs to Label Studio format and write to t_dm_annotation_results table, enabling seamless editing in the annotation editor.

New file:
- runtime/python-executor/datamate/annotation_result_converter.py
  * 4 converters for different annotation types:
    - convert_text_classification → choices type
    - convert_ner → labels (span) type
    - convert_relation_extraction → labels + relation type
    - convert_object_detection → rectanglelabels type
  * convert_annotation() dispatcher (auto-detects task_type)
  * generate_label_config_xml() for dynamic XML generation
  * Pipeline introspection utilities
  * Label Studio ID generation logic

Modified file:
- runtime/python-executor/datamate/auto_annotation_worker.py
  * Preserve file_id through processing loop (line 918)
  * Collect file_results as (file_id, annotations) pairs
  * New _create_labeling_project_with_annotations() function:
    - Creates labeling project linked to source dataset
    - Snapshots all files
    - Converts results to Label Studio format
    - Writes to t_dm_annotation_results in single transaction
  * label_config XML stored in t_dm_labeling_projects.configuration

Key features:
- Supports 4 annotation types: text classification, NER, relation extraction, object detection
- Deterministic region IDs for entity references in relation extraction
- Pixel to percentage conversion for object detection
- XML escaping handled by xml.etree.ElementTree
- Partial results preserved on task stop
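Label Studio's `rectanglelabels` results store coordinates as percentages of the image size; a minimal sketch of the pixel-to-percentage conversion mentioned above (the helper name is illustrative, though the output fields follow Label Studio's result format):

```python
def bbox_to_rectanglelabels(x, y, w, h, img_w, img_h, label):
    """Convert a pixel bounding box to a Label Studio rectanglelabels value dict."""
    return {
        "x": 100.0 * x / img_w,          # all four values are percentages
        "y": 100.0 * y / img_h,
        "width": 100.0 * w / img_w,
        "height": 100.0 * h / img_h,
        "rectanglelabels": [label],
    }
```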

Users can now view and edit auto-annotation results seamlessly in the annotation editor.
2026-02-10 16:06:40 +08:00
49f99527cc feat(auto-annotation): add LLM-based annotation operators
Add three new LLM-powered auto-annotation operators:
- LLMTextClassification: Text classification using LLM
- LLMNamedEntityRecognition: Named entity recognition with type validation
- LLMRelationExtraction: Relation extraction with entity and relation type validation

Key features:
- Load LLM config from t_model_config table via modelId parameter
- Lazy loading of LLM configuration on first execute()
- Result validation with whitelist checking for entity/relation types
- Fault-tolerant: returns empty results on LLM failure instead of throwing
- Fully compatible with existing Worker pipeline
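The whitelist validation and fault-tolerant behavior can be sketched as follows (function and field names are hypothetical, not the operators' actual API):

```python
def validate_entities(raw_entities, allowed_types):
    """Keep only entities whose type is in the whitelist; empty result on bad input."""
    try:
        return [e for e in raw_entities
                if isinstance(e, dict) and e.get("type") in allowed_types]
    except TypeError:
        return []   # fault-tolerant: return empty results instead of raising
```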

Files added:
- runtime/ops/annotation/_llm_utils.py: Shared LLM utilities
- runtime/ops/annotation/llm_text_classification/: Text classification operator
- runtime/ops/annotation/llm_named_entity_recognition/: NER operator
- runtime/ops/annotation/llm_relation_extraction/: Relation extraction operator

Files modified:
- runtime/ops/annotation/__init__.py: Register 3 new operators
- runtime/python-executor/datamate/auto_annotation_worker.py: Add to Worker whitelist
- frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useOperatorOperations.ts: Add to frontend whitelist
2026-02-10 15:22:23 +08:00
8ffa131fad feat(annotation): support non-image dataset types (TEXT/AUDIO/VIDEO) for auto-annotation tasks
Remove the IMAGE-only restriction from the auto-annotation task creation flow so that TEXT, AUDIO, and VIDEO datasets can also be used for auto-annotation tasks.

- New database migration: add a dataset_type column to the t_dm_auto_annotation_tasks table
- Propagate dataset_type through the backend schema/API/service chain
- Worker dynamically builds the sample key (image/text/audio/video) and the output directory
- Frontend removes the dataset-type check; the dropdown shows a dataset-type badge
- Output dataset inherits the source dataset's type instead of being hardcoded to IMAGE
- Backward compatible: the default value is IMAGE, and the worker has metadata and directory fallbacks
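The dynamic sample-key mapping with the IMAGE fallback can be sketched as (the helper name is illustrative):

```python
def build_sample_key(dataset_type):
    """Map a dataset type to the sample key used for operator inputs (sketch)."""
    keys = {"IMAGE": "image", "TEXT": "text", "AUDIO": "audio", "VIDEO": "video"}
    # Backward compatible: missing or unknown types fall back to IMAGE behavior.
    return keys.get((dataset_type or "IMAGE").upper(), "image")
```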

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 23:23:05 +08:00
78624915b7 feat(annotation): add operator-orchestration frontend pages for annotation tasks and a test operator
## Overview
Add a complete frontend UI for generic operator orchestration of annotation tasks, covering task creation, list management, and detail viewing, plus a test operator for functional verification.

## Changes

### Frontend

#### 1. Operator orchestration pages
- New two-step creation flow:
  - Step 1: basic info (dataset selection, task name, etc.)
  - Step 2: operator orchestration (select operators, configure parameters, preview the pipeline)
- Key files:
  - frontend/src/pages/DataAnnotation/OperatorCreate/CreateTask.tsx
  - frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useOperatorOperations.ts
  - frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useDragOperators.ts
  - frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useCreateStepTwo.tsx

#### 2. UI components
- Operator library (OperatorLibrary): lists available operators with category filtering
- Orchestration area (OperatorOrchestration): drag-and-drop operator ordering
- Parameter panel (OperatorConfig): configure operator parameters
- Pipeline preview (PipelinePreview): preview the operator chain
- Key files: frontend/src/pages/DataAnnotation/OperatorCreate/components/

#### 3. Task list management
- Add a task list to the existing tab on the data-annotation home page
- Status filtering (pending/running/completed/failed/stopped)
- Keyword search
- Polling refresh
- Stop task
- Download results
- Key file: frontend/src/pages/DataAnnotation/Home/components/AutoAnnotationTaskList.tsx

#### 4. Task detail drawer
- Click a task name to open the detail drawer
- Shows basic task info (name, status, progress, timestamps, etc.)
- Shows the pipeline configuration (operator chain and parameters)
- Shows the error message (if the task failed)
- Shows the artifact path and a download button
- Key file: frontend/src/pages/DataAnnotation/Home/components/AutoAnnotationTaskDetailDrawer.tsx

#### 5. API integration
- Wrap the auto-annotation task endpoints:
  - list: fetch the task list
  - create: create a task
  - detail: fetch task details
  - delete: delete a task
  - stop: stop a task
  - download: download results
- Key file: frontend/src/pages/DataAnnotation/annotation.api.ts

#### 6. Routing
- New route: /data/annotation/create-auto-task
- Integrated into the data-annotation home page
- Key files:
  - frontend/src/routes/routes.ts
  - frontend/src/pages/DataAnnotation/Home/DataAnnotation.tsx

#### 7. Operator model enhancement
- Add a runtime field for filtering annotation operators
- Key file: frontend/src/pages/OperatorMarket/operator.model.ts

### Backend

#### 1. Test operator (test_annotation_marker)
- Function: draws a test marker on the image and outputs JSON annotations
- Purpose: verify that the annotation flow works end to end
- Implementation files:
  - runtime/ops/annotation/test_annotation_marker/process.py
  - runtime/ops/annotation/test_annotation_marker/metadata.yml
  - runtime/ops/annotation/test_annotation_marker/__init__.py

#### 2. Operator registration
- Register the test operator in the annotation ops package
- Add it to the runtime whitelist
- Key files:
  - runtime/ops/annotation/__init__.py
  - runtime/python-executor/datamate/auto_annotation_worker.py

#### 3. Database initialization
- Add the test operator to the database
- Add the operator-category association
- Key file: scripts/db/data-operator-init.sql

### Bug fixes

#### 1. outputDir default-value override
- Problem: the frontend set an empty-string default, preventing the worker from injecting the real output directory
- Fix: filter out empty/null outputDir values so the worker can inject the real output directory
- Location: frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useOperatorOperations.ts

#### 2. targetClasses default-value type
- Problem: in the YOLO operator metadata, the targetClasses default was the string '[]' instead of a list
- Fix: change it to the list []
- Location: runtime/ops/annotation/image_object_detection_bounding_box/metadata.yml

## Key features

### User experience
- Unified operator-orchestration UI (consistent with data cleaning)
- Intuitive drag-and-drop interaction
- Live pipeline preview
- Full task-management features

### Completeness
- Task creation: clear two-step flow
- Task management: listing, status filtering, search
- Task operations: stop, download
- Task details: complete information display

### Testability
- Test operator provided for functional verification
- Supports quick end-to-end testing of the annotation flow

## Verification

- ESLint check: passed
- Frontend build: passed (10.91s)
- Functional testing: all features work

## Deployment notes

1. Run the database initialization script (for new environments)
2. Restart the frontend service
3. Restart the backend service (if the worker whitelist changed)

## Usage

1. Open the data-annotation page
2. Click "Create auto-annotation task"
3. Select a dataset and files
4. Drag operators from the library into the orchestration area
5. Configure operator parameters
6. Preview the pipeline
7. Submit the task
8. Monitor progress in the task list
9. Click the task name to view details
10. Download the annotation results

## Related files

- Frontend pages: frontend/src/pages/DataAnnotation/OperatorCreate/
- Task management: frontend/src/pages/DataAnnotation/Home/components/
- API integration: frontend/src/pages/DataAnnotation/annotation.api.ts
- Test operator: runtime/ops/annotation/test_annotation_marker/
- DB script: scripts/db/data-operator-init.sql
2026-02-08 08:17:35 +08:00
2f49fc4199 feat(annotation): support generic operator-pipeline orchestration for data annotation
## Overview
Convert the data-annotation module from a fixed YOLO operator to generic operator orchestration, giving it the same flexible operator-composition capability as the data-cleaning module.

## Changes

### Step 1: database changes (DDL)
- New SQL migration script: scripts/db/data-annotation-operator-pipeline-migration.sql
- Modify the t_dm_auto_annotation_tasks table:
  - New columns: task_mode, executor_type, pipeline, output_dataset_id, created_by, stop_requested, started_at, heartbeat_at, run_token
  - New indexes: idx_status_created, idx_created_by
- Create the t_dm_annotation_task_operator_instance table to store operator-instance details

### Step 2: API layer
- Extend the request models (schema/auto.py):
  - New OperatorPipelineStep model
  - Support a pipeline field; keep the legacy YOLO fields for backward compatibility
  - Normalize multiple spellings (operatorId/operator_id/id, overrides/settingsOverride/settings_override)
- Update the task creation service (service/auto.py):
  - New validate_file_ids() validation method
  - New _to_pipeline() compatibility mapping method
  - Write the new columns and integrate the operator-instance table
  - Fix fileIds de-duplication accuracy
- New API routes (interface/auto.py):
  - New /operator-tasks endpoints
  - New stop endpoints (/auto/{id}/stop and /operator-tasks/{id}/stop)
  - Keep the legacy /auto endpoints for backward compatibility
- Align the ORM models (annotation_management.py):
  - Add all new DDL columns to AutoAnnotationTask
  - New AnnotationTaskOperatorInstance model
  - Add stopped to the status definitions
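The multi-spelling normalization can be sketched as a small helper (an illustration only; the real schema layer would typically do this with model aliases or validators):

```python
def normalize_step(step):
    """Normalize one pipeline step that may use camelCase or snake_case keys."""
    operator_id = step.get("operatorId") or step.get("operator_id") or step.get("id")
    overrides = (step.get("overrides")
                 or step.get("settingsOverride")
                 or step.get("settings_override")
                 or {})
    return {"operator_id": operator_id, "overrides": overrides}
```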

### Step 3: runtime layer
- Update the worker execution logic (auto_annotation_worker.py):
  - Implement atomic task claiming (run_token)
  - Replace hardcoded YOLO with generic pipeline execution
  - Add operator resolution and instantiation
  - Support stop_requested checks
  - Keep the legacy_yolo mode for backward compatibility
  - Support multiple operator invocation styles (execute and __call__)
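The atomic claim can be illustrated with a compare-and-set UPDATE. This sketch runs against SQLite's stdlib driver for self-containment (the real deployment targets MySQL); column names follow the migration above, and the one-row table is a stand-in:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_dm_auto_annotation_tasks "
    "(id INTEGER PRIMARY KEY, status TEXT, run_token TEXT)"
)
conn.execute("INSERT INTO t_dm_auto_annotation_tasks VALUES (1, 'pending', NULL)")

def try_claim(conn, task_id):
    """Atomically claim a pending task; the WHERE clause lets only one worker win."""
    token = uuid.uuid4().hex
    cur = conn.execute(
        "UPDATE t_dm_auto_annotation_tasks "
        "SET status = 'running', run_token = ? "
        "WHERE id = ? AND status = 'pending'",
        (token, task_id),
    )
    conn.commit()
    return token if cur.rowcount == 1 else None   # None: another worker claimed it

first = try_claim(conn, 1)    # claims the task
second = try_claim(conn, 1)   # matches zero rows, so the claim fails
```

Because the status check and the transition happen in one UPDATE, two workers racing on the same task cannot both see `rowcount == 1`.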

### Step 4: gradual rollout
- Complete the YOLO operator metadata (metadata.yml):
  - Fill in the raw_id, language, modal, inputs, outputs, settings fields
- Register the annotation operators (__init__.py):
  - Register the YOLO operator in the OPERATORS registry
  - Ensure the annotation package loads correctly
- Add whitelist control:
  - Support the AUTO_ANNOTATION_OPERATOR_WHITELIST environment variable
  - Restrict the available operators during rollout

## Key features

### Backward compatibility
- Legacy /auto endpoints fully preserved
- Legacy request parameters automatically mapped to a pipeline
- legacy_yolo mode keeps the old behavior working

### New capabilities
- Generic pipeline orchestration
- Multi-operator composition
- Task stop control
- Whitelist-based gradual rollout

### Reliability
- Atomic task claiming (prevents duplicate execution)
- Complete error handling and state management
- Detailed audit trail (operator-instance table)

## Deployment notes

1. Run the DDL: mysql < scripts/db/data-annotation-operator-pipeline-migration.sql
2. Set the environment variable: AUTO_ANNOTATION_OPERATOR_WHITELIST=ImageObjectDetectionBoundingBox
3. Restart the services: datamate-runtime and datamate-backend-python

## Verification steps

1. Compatibility mode: create a task via the legacy /auto endpoint
2. Generic orchestration: create a pipeline task via the new /operator-tasks endpoint
3. Atomic claim: verify the run_token mechanism
4. Stop: test the stop API
5. Whitelist: verify operator-whitelist enforcement

## Related files

- DDL: scripts/db/data-annotation-operator-pipeline-migration.sql
- API: runtime/datamate-python/app/module/annotation/
- Worker: runtime/python-executor/datamate/auto_annotation_worker.py
- Operator: runtime/ops/annotation/image_object_detection_bounding_box/
2026-02-07 22:35:33 +08:00
d0972cbc9d feat(data-management): implement dataset file versioning and internal-path protection
- Replace dataset file queries with versions that return only visible files
- Introduce file status management (ACTIVE/ARCHIVED) and an internal directory structure
- Implement a duplicate-file handling policy with a versioning mode instead of overwriting
- Protect internal data directories, blocking access to system directories such as .datamate
- Refactor the file upload flow with a staging directory and post-transaction cleanup
- Implement file-version archiving, preserving historical versions in a dedicated storage location
- Improve file-path normalization and security validation
- Fix file deletion so archived files are not removed by mistake
- Update dataset archive download to exclude internal system files
2026-02-04 23:53:35 +08:00
79371ba078 feat(data-management): add parent-child dataset hierarchy
- Add a parentDatasetId field to the OpenAPI spec for hierarchy filtering
- Implement create/update/delete logic for dataset parent-child relationships
- Add path renaming and file-path prefix updates when a dataset is moved
- Validate child-dataset counts to prevent accidental deletion of parent datasets
- Update the frontend to support selecting a parent dataset and hierarchy navigation
- Improve path handling for auto-annotation tasks in the Python backend
- Add foreign-key constraints to the database schema to ensure data consistency
2026-01-20 13:34:50 +08:00
Kecheng Sha
3f1ad6a872 feat(auto-annotation): integrate YOLO auto-labeling and enhance data management (#223)
* feat(auto-annotation): initial setup

* chore: remove package-lock.json

* chore: clean up local test scripts and Maven settings

* chore: change package-lock.json
2026-01-05 14:22:44 +08:00