DataMate

Author	SHA1	Message	Date
q792602257	f707ce9dae	feat(auto-annotation): add batch progress updates to reduce DB write pressure Throttle progress updates to reduce database write operations during large dataset processing. Key features: - Add PROGRESS_UPDATE_INTERVAL config (default 2.0s, configurable via AUTO_ANNOTATION_PROGRESS_INTERVAL env) - Conditional progress updates: Only write to DB when (now - last_update) >= interval - Use time.monotonic() for timing (immune to system clock adjustments) - Final status updates (completed/stopped/failed) always execute (not throttled) Implementation: - Initialize last_progress_update timestamp before as_completed() loop - Replace unconditional _update_task_status() with conditional call based on time interval - Update docstring to reflect throttling capability Performance impact (T=2s): - 1,000 files / 100s processing: DB writes reduced from 1,000 to ~50 (95% reduction) - 10,000 files / 500s processing: DB writes reduced from 10,000 to ~250 (97.5% reduction) - Small datasets (10 files): Minimal difference Backward compatibility: - PROGRESS_UPDATE_INTERVAL=0: Updates every file (identical to previous behavior) - Heartbeat mechanism unaffected (2s interval << 300s timeout) - Stop check mechanism independent of progress updates - Final status updates always execute Testing: - 14 unit tests all passed (11 existing + 3 new): * Fast processing with throttling * PROGRESS_UPDATE_INTERVAL=0 updates every file * Slow processing (per-file > T) updates every file - py_compile syntax check passed Edge cases handled: - Single file task: Works normally - Very slow processing: Degrades to per-file updates - Concurrent FILE_WORKERS > 1: Counters accurate (lock-protected), DB reflects with max T seconds delay	2026-02-10 16:49:37 +08:00
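The throttling rule in commit f707ce9dae (write only when `now - last_update >= interval`, timed with `time.monotonic()`, final status never throttled) can be sketched roughly as below. This is an illustrative sketch, not the worker's code: `ThrottledProgress`, its `write` callback, and the injectable `clock` are hypothetical names, and the real `_update_task_status()` signature is not shown in the log.

```python
import time


class ThrottledProgress:
    """Forward progress writes at most once per `interval` seconds (hypothetical helper)."""

    def __init__(self, write, interval=2.0, clock=time.monotonic):
        self._write = write          # callback that performs the actual DB write
        self._interval = interval    # 0 means "write on every update"
        self._clock = clock          # monotonic clock, unaffected by wall-clock changes
        self._last = clock()         # initialized before the processing loop starts
        self.writes = 0

    def update(self, processed, total):
        """Write progress only if the interval has elapsed; return True if written."""
        now = self._clock()
        if now - self._last >= self._interval:
            self._write(processed, total)
            self._last = now
            self.writes += 1
            return True
        return False

    def finish(self, processed, total):
        """Final status update: always written, never throttled."""
        self._write(processed, total)
        self.writes += 1
```

With `interval=0` the condition is always true, so every file triggers a write, which matches the backward-compatibility note in the commit message.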
q792602257	9988ff00f5	feat(auto-annotation): add concurrent processing support Enable parallel processing for auto-annotation tasks with configurable worker count and file-level parallelism. Key features: - Multi-worker support: WORKER_COUNT env var (default 1) controls number of worker threads - Intra-task file parallelism: FILE_WORKERS env var (default 1) controls concurrent file processing within a single task - Operator chain pooling: Pre-create N independent chain instances to avoid thread-safety issues - Thread-safe progress tracking: Use threading.Lock to protect shared counters - Stop signal handling: threading.Event for graceful cancellation during concurrent processing Implementation details: - Refactor _process_single_task() to use ThreadPoolExecutor + as_completed() - Chain pool (queue.Queue): Each worker thread acquires/releases a chain instance - Protected counters: processed_images, detected_total, file_results with Lock - Stop check: Periodic check of _is_stop_requested() during concurrent processing - Refactor start_auto_annotation_worker(): Move recovery logic here, start WORKER_COUNT threads - Simplify _worker_loop(): Remove recovery call, keep only polling + processing Backward compatibility: - Default config (WORKER_COUNT=1, FILE_WORKERS=1) behaves identically to previous version - No breaking changes to existing deployments Testing: - 11 unit tests all passed: * Multi-worker startup * Chain pool acquire/release * Concurrent file processing * Stop signal handling * Thread-safe counter updates * Backward compatibility (FILE_WORKERS=1) - py_compile syntax check passed Performance benefits: - WORKER_COUNT=3: Process 3 tasks simultaneously - FILE_WORKERS=4: Process 4 files in parallel within each task - Combined: Up to 12x throughput improvement (3 workers × 4 files)	2026-02-10 16:36:34 +08:00
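The intra-task parallelism described in commit 9988ff00f5 (a `queue.Queue` of pre-created chain instances, `ThreadPoolExecutor` + `as_completed()`, lock-protected counters, and a `threading.Event` stop signal) might look roughly like this. It is a minimal sketch under assumptions: `process_files`, `make_chain`, and the shape of a "chain" (a callable taking one file) are hypothetical and do not mirror `_process_single_task()`.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_files(files, make_chain, file_workers=1, stop_event=None):
    """Process files concurrently, giving each in-flight file its own chain instance."""
    if stop_event is None:
        stop_event = threading.Event()

    # Pre-create one independent chain per worker so no chain is shared across threads.
    pool = queue.Queue()
    for _ in range(file_workers):
        pool.put(make_chain())

    lock = threading.Lock()
    state = {"processed": 0, "results": []}

    def run_one(path):
        if stop_event.is_set():      # skip remaining work once a stop is requested
            return
        chain = pool.get()           # acquire a chain instance
        try:
            output = chain(path)
        finally:
            pool.put(chain)          # always release it back to the pool
        with lock:                   # shared counters are only touched under the lock
            state["processed"] += 1
            state["results"].append((path, output))

    with ThreadPoolExecutor(max_workers=file_workers) as executor:
        futures = [executor.submit(run_one, f) for f in files]
        for future in as_completed(futures):
            future.result()          # surface any exception raised in a worker thread
    return state["processed"], state["results"]
```

With `file_workers=1` the pool holds a single chain and files are processed one at a time, which is how the default configuration stays equivalent to sequential processing.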
q792602257	2fbfefdb91	feat(auto-annotation): add worker recovery mechanism for stale tasks Automatically recover running tasks with stale heartbeats on worker startup, preventing tasks from being permanently stuck after container restarts. Key changes: - Add HEARTBEAT_TIMEOUT_SECONDS constant (default 300s, configurable via env) - Add _recover_stale_running_tasks() function: * Scans for status='running' tasks with heartbeat timeout * No progress (processed=0) → reset to pending (auto-retry) * Has progress (processed>0) → mark as failed with Chinese error message * Each task recovery is independent (single failure doesn't affect others) * Skip recovery if timeout is 0 or negative (disable feature) - Call recovery function in _worker_loop() before polling loop - Update file header comments to reflect recovery mechanism Recovery logic: - Query: status='running' AND (heartbeat_at IS NULL OR heartbeat_at < NOW() - timeout) - Decision based on processed_images count - Clear run_token to allow other workers to claim - Single transaction per task for atomicity Edge cases handled: - Database unavailable: recovery failure doesn't block worker startup - Concurrent recovery: UPDATE WHERE status='running' prevents duplicates - NULL heartbeat: extreme case (crash right after claim) also recovered - stop_requested tasks: automatically excluded by _fetch_pending_task() Testing: - 8 unit tests all passed: * No timeout tasks * Timeout disabled * No progress → pending * Has progress → failed * NULL heartbeat recovery * Multiple tasks mixed processing * DB error doesn't crash * Negative timeout disables feature	2026-02-10 16:19:22 +08:00
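The recovery decision in commit 2fbfefdb91 reduces to two small rules: whether a running task's heartbeat is stale, and which status it should be moved to. The sketch below expresses only those rules as pure functions; `is_stale` and `recovery_action` are hypothetical names, and the actual `_recover_stale_running_tasks()` performs the equivalent check in SQL and also clears `run_token`.

```python
from datetime import datetime, timedelta


def is_stale(heartbeat_at, now, timeout_seconds):
    """Mirror of: heartbeat_at IS NULL OR heartbeat_at < NOW() - timeout."""
    if timeout_seconds <= 0:
        return False                 # zero or negative timeout disables recovery
    if heartbeat_at is None:
        return True                  # crashed right after claim, before any heartbeat
    return heartbeat_at < now - timedelta(seconds=timeout_seconds)


def recovery_action(processed_images):
    """No progress: retry as pending. Some progress: mark as failed."""
    return "pending" if processed_images == 0 else "failed"
```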
q792602257	dc490f03be	feat(auto-annotation): unify annotation results with Label Studio format Automatically convert auto-annotation outputs to Label Studio format and write to t_dm_annotation_results table, enabling seamless editing in the annotation editor. New file: - runtime/python-executor/datamate/annotation_result_converter.py * 4 converters for different annotation types: - convert_text_classification → choices type - convert_ner → labels (span) type - convert_relation_extraction → labels + relation type - convert_object_detection → rectanglelabels type * convert_annotation() dispatcher (auto-detects task_type) * generate_label_config_xml() for dynamic XML generation * Pipeline introspection utilities * Label Studio ID generation logic Modified file: - runtime/python-executor/datamate/auto_annotation_worker.py * Preserve file_id through processing loop (line 918) * Collect file_results as (file_id, annotations) pairs * New _create_labeling_project_with_annotations() function: - Creates labeling project linked to source dataset - Snapshots all files - Converts results to Label Studio format - Writes to t_dm_annotation_results in single transaction * label_config XML stored in t_dm_labeling_projects.configuration Key features: - Supports 4 annotation types: text classification, NER, relation extraction, object detection - Deterministic region IDs for entity references in relation extraction - Pixel to percentage conversion for object detection - XML escaping handled by xml.etree.ElementTree - Partial results preserved on task stop Users can now view and edit auto-annotation results seamlessly in the annotation editor.	2026-02-10 16:06:40 +08:00
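The "pixel to percentage conversion" mentioned in commit dc490f03be follows from Label Studio expressing rectangle coordinates as percentages of the image size. A rough sketch, assuming the detector emits boxes as `[x_min, y_min, x_max, y_max]` in pixels (the log does not state the input layout); `bbox_to_rectanglelabels` and the `from_name`/`to_name` defaults are hypothetical, and region ID generation is omitted.

```python
def bbox_to_rectanglelabels(bbox, label, image_width, image_height,
                            from_name="label", to_name="image"):
    """Convert a pixel-space box into a Label Studio rectanglelabels region."""
    x_min, y_min, x_max, y_max = bbox   # assumed layout: corner coordinates in pixels
    return {
        "type": "rectanglelabels",
        "from_name": from_name,
        "to_name": to_name,
        "original_width": image_width,
        "original_height": image_height,
        "value": {
            # Label Studio stores x, y, width, height as percentages (0-100).
            "x": 100.0 * x_min / image_width,
            "y": 100.0 * y_min / image_height,
            "width": 100.0 * (x_max - x_min) / image_width,
            "height": 100.0 * (y_max - y_min) / image_height,
            "rectanglelabels": [label],
        },
    }
```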
q792602257	49f99527cc	feat(auto-annotation): add LLM-based annotation operators Add three new LLM-powered auto-annotation operators: - LLMTextClassification: Text classification using LLM - LLMNamedEntityRecognition: Named entity recognition with type validation - LLMRelationExtraction: Relation extraction with entity and relation type validation Key features: - Load LLM config from t_model_config table via modelId parameter - Lazy loading of LLM configuration on first execute() - Result validation with whitelist checking for entity/relation types - Fault-tolerant: returns empty results on LLM failure instead of throwing - Fully compatible with existing Worker pipeline Files added: - runtime/ops/annotation/_llm_utils.py: Shared LLM utilities - runtime/ops/annotation/llm_text_classification/: Text classification operator - runtime/ops/annotation/llm_named_entity_recognition/: NER operator - runtime/ops/annotation/llm_relation_extraction/: Relation extraction operator Files modified: - runtime/ops/annotation/__init__.py: Register 3 new operators - runtime/python-executor/datamate/auto_annotation_worker.py: Add to Worker whitelist - frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useOperatorOperations.ts: Add to frontend whitelist	2026-02-10 15:22:23 +08:00
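q792602257	06a7cd9abd	feat(auth): role management CRUD and role-permission binding Add role create/edit/delete endpoints and a role-permission binding endpoint so that administrators can define custom roles and configure permissions flexibly. The frontend adds a role CRUD dialog and a permission configuration panel grouped by module; built-in roles cannot be deleted but can still be edited and have their permissions configured. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 00:09:48 +08:00
q792602257	ea7ca5474e	fix(annotation): restrict child tag options by parent node type in the visual annotation config editor Dynamically filter the "Add child node" and "Add sibling node" dropdown options based on the selected node type: - Annotation controls (e.g. Choices/RectangleLabels) only allow adding their corresponding child tags (Choice/Label) - Controls without children (e.g. TextArea/Rating) and data object tags have "Add child node" disabled - Choice nodes may nest Choice (supports Taxonomy hierarchies) - View containers allow adding all tag types except bare child tags Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 23:34:41 +08:00
q792602257	8ffa131fad	feat(annotation): support non-image dataset types (TEXT/AUDIO/VIDEO) in auto-annotation tasks Remove the IMAGE-only restriction from the auto-annotation task creation flow so that TEXT, AUDIO, and VIDEO datasets can all be used for auto-annotation tasks. - New database migration: add a dataset_type column to t_dm_auto_annotation_tasks - Pass dataset_type through the whole backend schema/API/service chain - Worker dynamically builds the sample key (image/text/audio/video) and output directory - Frontend removes the dataset type validation; the dropdown shows a dataset type tag - Output dataset inherits the source dataset type instead of being hard-coded to IMAGE - Backward compatible: the default value is IMAGE, and the worker has a metadata fallback and a directory fallback Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 23:23:05 +08:00
q792602257	807c2289e2	feat(annotation): support preserving annotations when a file version is updated (position offset + text matching migration) Add the AnnotationMigrator migration algorithm. When a file version is updated in a TEXT dataset, annotations from the old version can optionally be migrated to the new version via difflib position offset mapping and a secondary text match. The frontend version-switch dialog adds a "Preserve annotations" checkbox (shown only for TEXT type), and the backend API adds a preserveAnnotations parameter; fully backward compatible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 19:42:59 +08:00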
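Commit 807c2289e2 names two migration steps for text spans: a difflib position offset mapping, then a secondary text match. One plausible reading is sketched below; `migrate_span` is a hypothetical function, not the AnnotationMigrator implementation, and the real algorithm may handle partially changed spans differently.

```python
import difflib


def migrate_span(old_text, new_text, start, end):
    """Map an annotated span [start, end) from old_text onto new_text, or return None."""
    matcher = difflib.SequenceMatcher(None, old_text, new_text, autojunk=False)

    # Step 1: if the span lies entirely inside an unchanged block, shift it by that
    # block's offset between the old and new text.
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" and i1 <= start and end <= i2:
            offset = j1 - i1
            return start + offset, end + offset

    # Step 2: otherwise fall back to searching for the annotated text itself.
    snippet = old_text[start:end]
    position = new_text.find(snippet) if snippet else -1
    if position != -1:
        return position, position + len(snippet)

    return None  # the annotated text no longer exists; the annotation cannot be migrated
```

Returning `None` rather than guessing a position keeps a migrated annotation from silently pointing at unrelated text.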
q792602257	7d5a809772	fix: rename task-coordination-init.sql to fix database initialization order Rename task-coordination-init.sql to zz-task-coordination-init.sql so that it runs after zz-auth-init.sql, resolving the missing t_auth_permissions table. Fixes: ERROR 1146 (42S02) Table 'datamate.t_auth_permissions' doesn't exist	2026-02-09 19:04:55 +08:00
q792602257	2f8645a011	fix(auth): harden confidential knowledge access checks and sensitivity filtering	2026-02-09 17:09:34 +08:00
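q792602257	71f8f7d1c3	feat: implement task splitting and assignment ## Overview Implement complete task splitting, assignment, and progress tracking, supporting splitting a task into subtasks and assigning them to different users. ## Phase 1: Database layer - Add t_task_meta table (task metadata coordination table) - Add t_task_assignment_log table (assignment log table) - Add 3 permission entries (read/write/assign) - Add SQLAlchemy ORM models ## Phase 2: Backend API (Java) - Add task-coordination-service module (32 files) - Implement 11 API endpoints: - Task queries (list, subtasks, my tasks) - Task splitting (4 strategies supported) - Task assignment (single, batch, reassign, revoke) - Progress management (query, update, aggregate) - Assignment log - Integrate permission control and routing rules ## Phase 3: Frontend UI (React + TypeScript) - Add 10 files (models, API, components, pages) - Implement 5 core components: - SplitTaskDialog - task split dialog - AssignTaskDialog - task assignment dialog - BatchAssignDialog - batch assignment dialog - TaskProgressPanel - progress panel - AssignmentLogDrawer - assignment log - Implement 2 pages: - TaskCoordination - task management home page - MyTasks - my tasks page - Integrate sidebar menu and routes ## Bug fixes - Fix missing pagination parameters in getMyTasks - Fix missing assignee info on subtasks (batch query optimization) - Fix proportion precision calculation (remainder allocation) ## Technical highlights - Zero-intrusion design: implemented through a separate coordination table without modifying existing modules - Batch query optimization: avoids the N+1 query problem - 4 split strategies: by proportion/count/file/manual - Automatic progress aggregation: subtask updates aggregate to the parent task automatically - Fine-grained permission control: three permission levels (read/write/assign) ## Verification - Maven compile: ✅ zero errors - TypeScript compile: ✅ zero errors - Vite production build: ✅ success	2026-02-09 00:42:34 +08:00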
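q792602257	78624915b7	feat(annotation): add operator orchestration frontend pages and a test operator for annotation tasks ## Overview Add a complete frontend for the generic operator orchestration feature of annotation tasks, including task creation, list management, and detail view, and provide a test operator for functional verification. ## Changes ### Frontend #### 1. Operator orchestration page - Add a two-step creation flow: - Step 1: basic information (dataset selection, task name, etc.) - Step 2: operator orchestration (select operators, configure parameters, preview pipeline) - Core files: - frontend/src/pages/DataAnnotation/OperatorCreate/CreateTask.tsx - frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useOperatorOperations.ts - frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useDragOperators.ts - frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useCreateStepTwo.tsx #### 2. UI components - Operator library (OperatorLibrary): shows available operators, supports category filtering - Orchestration area (OperatorOrchestration): drag-and-drop operator ordering - Parameter panel (OperatorConfig): configure operator parameters - Pipeline preview (PipelinePreview): preview the operator chain - Core files: frontend/src/pages/DataAnnotation/OperatorCreate/components/ #### 3. Task list management - Add a task list in the same tab on the data annotation home page - Supports status filtering (pending/running/completed/failed/stopped) - Supports keyword search - Supports polling refresh - Supports stopping tasks - Supports downloading results - Core file: frontend/src/pages/DataAnnotation/Home/components/AutoAnnotationTaskList.tsx #### 4. Task detail drawer - Click a task name to open the detail drawer - Shows basic task information (name, status, progress, time, etc.) - Shows the pipeline configuration (operator chain and parameters) - Shows error information (if the task failed) - Shows the artifact path and a download button - Core file: frontend/src/pages/DataAnnotation/Home/components/AutoAnnotationTaskDetailDrawer.tsx #### 5. API integration - Wrap the auto-annotation task endpoints: - list: get the task list - create: create a task - detail: get task details - delete: delete a task - stop: stop a task - download: download results - Core file: frontend/src/pages/DataAnnotation/annotation.api.ts #### 6. Route configuration - New route: /data/annotation/create-auto-task - Integrated into the data annotation home page - Core files: - frontend/src/routes/routes.ts - frontend/src/pages/DataAnnotation/Home/DataAnnotation.tsx #### 7. Operator model enhancement - Add a runtime field for filtering annotation operators - Core file: frontend/src/pages/OperatorMarket/operator.model.ts ### Backend #### 1. Test operator (test_annotation_marker) - Function: draws a test marker on an image and outputs JSON annotations - Purpose: verify that the annotation feature works - Implementation files: - runtime/ops/annotation/test_annotation_marker/process.py - runtime/ops/annotation/test_annotation_marker/metadata.yml - runtime/ops/annotation/test_annotation_marker/__init__.py #### 2. Operator registration - Register the test operator in the annotation ops package - Add it to the runtime whitelist - Core files: - runtime/ops/annotation/__init__.py - runtime/python-executor/datamate/auto_annotation_worker.py #### 3. Database initialization - Add the test operator to the database - Add operator category associations - Core file: scripts/db/data-operator-init.sql ### Bug fixes #### 1. outputDir default value override - Problem: the frontend set an empty-string default, so the worker could not inject the real output directory - Fix: filter out empty/null outputDir values so the worker can inject the real output directory - Location: frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useOperatorOperations.ts #### 2. targetClasses default value type - Problem: the targetClasses default in the YOLO operator metadata was the string '[]' rather than a list - Fix: change it to the list [] - Location: runtime/ops/annotation/image_object_detection_bounding_box/metadata.yml ## Key features ### User experience - Unified operator orchestration UI (consistent with data cleaning) - Intuitive drag-and-drop - Real-time pipeline preview - Complete task management ### Completeness - Task creation: a clear two-step flow - Task management: list view, status filtering, search - Task operations: stop, download - Task details: complete information display ### Testability - Provides a test operator for functional verification - Supports quick testing of the annotation flow ## Verification results - ESLint: ✅ passed - Frontend build: ✅ passed (10.91s) - Functional tests: ✅ all features working ## Deployment 1. Run the database init script (for new environments) 2. Restart the frontend service 3. Restart the backend service (if the worker whitelist was modified) ## Usage 1. Open the data annotation page 2. Click create auto-annotation task 3. Select the dataset and files 4. Drag operators from the operator library to the orchestration area 5. Configure operator parameters 6. Preview the pipeline 7. Submit the task 8. Check progress in the task list 9. Click the task name to view details 10. Download the annotation results ## Related files - Frontend pages: frontend/src/pages/DataAnnotation/OperatorCreate/ - Task management: frontend/src/pages/DataAnnotation/Home/components/ - API integration: frontend/src/pages/DataAnnotation/annotation.api.ts - Test operator: runtime/ops/annotation/test_annotation_marker/ - Database script: scripts/db/data-operator-init.sql	2026-02-08 08:17:35 +08:00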
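q792602257	2f49fc4199	feat(annotation): support generic operator orchestration for data annotation ## Overview Rework the data annotation module from a fixed YOLO operator to generic operator orchestration, providing flexible operator composition similar to the data cleaning module. ## Changes ### Step 1: Database changes (DDL) - Add SQL migration script: scripts/db/data-annotation-operator-pipeline-migration.sql - Modify the t_dm_auto_annotation_tasks table: - New columns: task_mode, executor_type, pipeline, output_dataset_id, created_by, stop_requested, started_at, heartbeat_at, run_token - New indexes: idx_status_created, idx_created_by - Create the t_dm_annotation_task_operator_instance table to store operator instance details ### Step 2: API layer changes - Extend the request models (schema/auto.py): - Add the OperatorPipelineStep model - Support the pipeline field while keeping the old YOLO fields for backward compatibility - Normalize multiple spellings (operatorId/operator_id/id, overrides/settingsOverride/settings_override) - Modify the task creation service (service/auto.py): - Add the validate_file_ids() validation method - Add the _to_pipeline() compatibility mapping method - Write the new fields and integrate the operator instance table - Fix fileIds deduplication accuracy - Add API routes (interface/auto.py): - Add the /operator-tasks endpoints - Add stop API endpoints (/auto/{id}/stop and /operator-tasks/{id}/stop) - Keep the old /auto endpoints for backward compatibility - Align ORM models (annotation_management.py): - Add all DDL columns to AutoAnnotationTask - Add the AnnotationTaskOperatorInstance model - Add stopped to the status definitions ### Step 3: Runtime layer changes - Modify the worker execution logic (auto_annotation_worker.py): - Implement atomic task claiming (run_token) - Replace hard-coded YOLO with generic pipeline execution - Add operator resolution and instantiation - Support stop_requested checks - Keep the legacy_yolo mode for backward compatibility - Support multiple operator invocation styles (execute and __call__) ### Step 4: Gradual rollout - Complete the YOLO operator metadata (metadata.yml): - Fill in the raw_id, language, modal, inputs, outputs, and settings fields - Register annotation operators (__init__.py): - Register the YOLO operator in the OPERATORS registry - Ensure the annotation package is loaded correctly - Add whitelist control: - Support the AUTO_ANNOTATION_OPERATOR_WHITELIST environment variable - Available operators can be restricted during gradual rollout ## Key features ### Backward compatibility - The old /auto endpoints are fully preserved - Old request parameters are mapped to a pipeline automatically - The legacy_yolo mode keeps the old logic working ### New features - Generic pipeline orchestration - Multi-operator composition - Task stop control - Whitelist-based gradual rollout ### Reliability - Atomic task claiming (prevents duplicate execution) - Complete error handling and state management - Detailed audit trail (operator instance table) ## Deployment 1. Run the DDL: mysql < scripts/db/data-annotation-operator-pipeline-migration.sql 2. Set the environment variable: AUTO_ANNOTATION_OPERATOR_WHITELIST=ImageObjectDetectionBoundingBox 3. Restart the services: datamate-runtime and datamate-backend-python ## Verification steps 1. Compatibility mode: create a task with the old /auto endpoint 2. Generic orchestration: create a pipeline task with the new /operator-tasks endpoint 3. Atomic claim: check the run_token mechanism 4. Stop: test the stop API 5. Whitelist: test operator whitelist blocking ## Related files - DDL: scripts/db/data-annotation-operator-pipeline-migration.sql - API: runtime/datamate-python/app/module/annotation/ - Worker: runtime/python-executor/datamate/auto_annotation_worker.py - Operator: runtime/ops/annotation/image_object_detection_bounding_box/	2026-02-07 22:35:33 +08:00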
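q792602257	9efc07935f	fix(db): update default user passwords in the database init script - Add a comment documenting the default passwords in the init script - Update the password hash for the admin user - Update the password hash for the knowledge_user user - Keep passwords consistent in local development environments	2026-02-07 17:00:19 +08:00
q792602257	7264e111ae	chore(db): remove the Alembic version query from the data annotation init script - Remove the Alembic version query statement at the end of the database init script - Keep the success message for built-in annotation template insertion - Simplify the output of the data annotation init script	2026-02-07 16:24:21 +08:00
q792602257	3dd4035005	feat: improve export format compatibility validation for data annotation - Backend: add validation restricting the YOLO format for TEXT datasets - Backend: unify the COCO/YOLO compatibility rules (only image or object detection datasets are allowed) - Backend: fix datasetType field propagation and add dataset_type to the task list response - Frontend: disable the COCO/YOLO options for TEXT datasets in the export dialog - Frontend: pass the datasetType and labelingType fields - Frontend: align the frontend and backend COCO/YOLO compatibility rules - Frontend: improve the hint text to state clearly which datasets each format applies to Modified files: - runtime/datamate-python/app/module/annotation/service/export.py - runtime/datamate-python/app/module/annotation/service/mapping.py - runtime/datamate-python/app/module/annotation/schema/mapping.py - frontend/src/pages/DataAnnotation/Home/ExportAnnotationDialog.tsx - frontend/src/pages/DataAnnotation/Home/DataAnnotation.tsx - frontend/src/pages/DataAnnotation/annotation.const.tsx	2026-02-07 16:05:57 +08:00
q792602257	36b410ba7b	feat(annotation): add a compatibility check between export format and dataset type - Validate the dataset type before COCO export - The COCO format applies only to image and object detection datasets - Exporting a text dataset as COCO returns an HTTP 400 error - Add a clear error message suggesting other formats New: - Dataset type constants (TEXT, IMAGE, OBJECT_DETECTION) - Set of COCO-compatible types - Type value normalization method - Dataset type lookup method - Template labeling type resolution method - Export format compatibility validation method Related files: - runtime/datamate-python/app/module/annotation/service/export.py (+94, -7) Reviewed-by: Codex AI	2026-02-07 16:05:57 +08:00
q792602257	329382db47	fix(pdf): improve exception handling in the PDF text extraction service - Add dedicated handling for FeignException - Implement detailed logging of Feign exceptions - Add response body parsing and root-cause chain construction methods - Add exception message normalization - Improve the readability of error logs and the completeness of debugging information	2026-02-06 18:52:51 +08:00
q792602257	e862925a06	feat(export): add logical path construction to support file management - Implement the _build_logical_path method in the export service to build relative paths - Update dataset file records to include the logical_path field - Implement the build_logical_path static method in the ratio task service - Add logical path information to dataset file records - Normalize path handling and replace backslashes with forward slashes - Add invalid-path validation to prevent directory traversal	2026-02-06 18:46:44 +08:00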
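The path handling listed in commit e862925a06 (normalize, replace backslashes with forward slashes, reject paths that could escape the dataset directory) can be illustrated as follows. This is a sketch only: the body of the real `_build_logical_path` is not shown in the log, so the checks below are assumptions and are not exhaustive (for example, Windows drive prefixes are not handled).

```python
import posixpath


def build_logical_path(relative_path):
    """Return a normalized forward-slash relative path, or raise ValueError."""
    # Use forward slashes everywhere, then collapse "." and ".." segments.
    normalized = posixpath.normpath(relative_path.replace("\\", "/"))

    # Reject empty paths, absolute paths, and anything that climbs above the root.
    if normalized in ("", "."):
        raise ValueError("logical path is empty")
    if normalized.startswith("/"):
        raise ValueError("logical path must be relative")
    if normalized == ".." or normalized.startswith("../"):
        raise ValueError("logical path escapes the dataset directory")
    return normalized
```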
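q792602257	05752678cc	feat(dataset): add logical path construction to the PDF extraction service - Remove a duplicate csv import statement - Add the _build_logical_path method to build a file's logical path - Add a logical_path parameter to the _create_text_file_record method - Update record creation calls to pass the logical path - Validate that the logical path is not empty and raise an exception otherwise - Store the logical path in the dataset file record	2026-02-06 18:30:44 +08:00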
q792602257	0f1dd9ec8d	Merge remote-tracking branch 'gitea/lsf' into lsf	2026-02-06 18:29:58 +08:00
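q792602257	38add27d84	fix: fix two data consistency issues found by Codex review - [P1] Reorder deletion: delete the database record first and delete derived files only after that succeeds, avoiding the inconsistency where derived files are already gone when source file deletion fails - [P2] Improve the logicalPath blank check: use StringUtils.isBlank() to handle null, empty strings, and whitespace-only strings, preventing accidental deletion of other files Fixes review comments from commits `f9f4ea3`	2026-02-06 18:00:32 +08:00
q792602257	f9f4ea352e	fix: fix two issues found by codex review - [P1] When logicalPath is null, delete the current file directly (compatible with legacy data) - [P2] When database deletion fails, return and skip subsequent cleanup to avoid data inconsistency	2026-02-06 17:42:59 +08:00
q792602257	24d8ee49a1	feat: improve file deletion logic with cascading deletion of versions and derived files - When deleting a file that has multiple versions, delete all versions together - When deleting PDF/doc/docx/xls/xlsx files, also delete their derived txt files - If file deletion fails, log it without affecting the success of the delete operation	2026-02-06 17:23:37 +08:00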
q792602257	38e58ba864	Merge branch 'rbac' into lsf	2026-02-06 15:44:43 +08:00
q792602257	cd5f5ef6da	fix(annotation): fix use_new_version to support files without annotation Problem: use_new_version returned 404 annotation not found for files without annotation, preventing users from switching to new versions. Solution: 1. Query latest file by logical_path 2. Update LabelingProjectFile to point to latest version 3. If annotation exists: clear it and update file_id 4. If no annotation: just update project file snapshot 5. Return new file_id in response	2026-02-06 15:22:57 +08:00
q792602257	1f6c821cbc	fix(annotation): show new version warning even without annotation Change has_new_version logic to compare current file version with latest version, regardless of whether annotation exists. Before: Only show warning if annotation exists and version is outdated After: Show warning if current file is not the latest version This ensures users are informed when viewing an old file version, even if they haven't started annotating yet.	2026-02-06 15:17:51 +08:00
q792602257	44a1f2193f	fix(annotation): fix file version check to compare with latest version by logical path Problem: check_file_version was comparing annotation version with the passed file_id's version, but when files are updated, new file records are created with higher versions and old ones are marked ARCHIVED. Solution: 1. Query the latest ACTIVE file by logical_path 2. Compare annotation version with latest file version 3. Return latestFileId so frontend can switch to new version Changes: - check_file_version now queries latest version by logical_path - Added latest_file_id to FileVersionCheckResponse schema - Updated descriptions to clarify currentFileVersion is latest version Database scenario: - old file: id=6dae9f2f, version=1, status=ARCHIVED - new file: id=3365b4e7, version=3, status=ACTIVE - Both have same logical_path='rufus.ini' - Now correctly detects version 3 > annotation version	2026-02-06 15:11:54 +08:00
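q792602257	6a4c4ae3d7	feat(auth): add resource access control to the data management and RAG services - Inject ResourceAccessService into DatasetApplicationService and add ownership checks - Inject ResourceAccessService into KnowledgeSetApplicationService and add ownership checks - Modify the DatasetRepository interface and implementation to add methods that filter by creator - Modify the KnowledgeSetRepository interface and implementation to add methods that filter by creator - Add knowledge base access permission checks and scope filtering to the RAG indexer service - Update the entity meta object handler to obtain the current user from the request user context - Add user permission management and role permission control to the frontend settings page - Add user context and dataset access permission checks to the Python annotation service	2026-02-06 14:58:46 +08:00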
q792602257	c6dccf5e29	fix(python): remove datetime.UTC usage for Python 3.10 compatibility Replace datetime.datetime.now(datetime.UTC) with datetime.datetime.now() to fix compatibility issues with Python 3.10 and earlier versions. datetime.UTC is only available in Python 3.11+, causing 500 errors in production environment. Files fixed: - app/module/dataset/service/pdf_extract.py - app/module/generation/service/export_service.py	2026-02-06 13:34:27 +08:00
q792602257	fbc83b5610	revert(db): remove Alembic migration system Remove Alembic database migration system in favor of delta scripts: Deleted: - runtime/datamate-python/alembic.ini (config file) - runtime/datamate-python/alembic/env.py (environment config) - runtime/datamate-python/alembic/script.py.mako (migration template) - runtime/datamate-python/alembic/versions/20250205_0001_add_file_version.py (migration) Modified: - scripts/db/data-annotation-init.sql - Removed alembic_version table creation and version insertion - Kept file_version column in t_dm_annotation_results Rationale: - Alembic migration testing failed in production - Delta scripts are simpler and more reliable for this project - SQL init scripts contain complete schema including latest changes	2026-02-06 13:29:44 +08:00
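q792602257	056cee11cc	feat(auth): complete JWT authentication and permission control in the API gateway - Implement a gateway-side JWT utility class and permission rule matcher - Integrate the JWT authentication flow with Bearer Token validation - Add permission control based on path and HTTP method - Configure whitelist route rules to optimize authentication performance - Update the frontend protected route component to enforce permission checks - Add a 403 Forbidden page and permission check logic - Refactor the login page to call the real authentication API - Implement user info retrieval and permission loading - Improve authentication error status codes in the global exception handler - Add the FastJSON2 and JJWT dependencies	2026-02-06 13:21:20 +08:00
q792602257	f8f9faaa06	feat(runtime): add the Pillow image processing library dependency - Add the pillow package and dependency information for all its versions to poetry.lock - Supports multiple OS platforms including macOS, Linux, Windows, and iOS - Supports multiple architectures including x86_64, arm64, aarch64, and win32 - Compatible with Python versions 3.9 through 3.14 - Includes checksum information for multiple file formats such as wheel and tar.gz - Adds optional dependency configuration for docs, tests, and type checking	2026-02-06 13:21:01 +08:00
q792602257	719f54bf2e	feat(annotation): improve file version management and annotation sync - Rename useNewVersionUsingPost to applyNewVersionUsingPost - Add fileVersionCheckSeqRef to avoid race conditions in version checks - Remove the render dependency on the checkingFileVersion state variable - Add an annotationVersionUnknown field to the file version info - Fix the JSX syntax of the frontend file version comparison display - Show a hint when historical annotations lack version information - Configure the Alembic async database migration environment to support aiomysql - Add backend logic for the unknown file version state - Clean up segment annotations when annotations are cleared - Add a knowledge base sync hook to the version update flow	2026-02-05 23:22:49 +08:00
q792602257	5507adeb45	fix(knowledge): improve knowledge item file deletion logic - Add a content type check so that only file-type knowledge items are processed - Change the source type condition to return early for types other than file upload and manual creation - Keep the existing file path resolution and deletion logic - Keep exception handling and logging	2026-02-05 21:24:12 +08:00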
q792602257	48cf49d064	feat(db): update SQL init script and Alembic migration for compatibility Update data-annotation-init.sql and Alembic migration to support both new and old deployments: SQL Initialization Script (data-annotation-init.sql): - Add file_version column to t_dm_annotation_results table - Add Alembic version table creation and version insertion - New deployments using this script will have latest schema and Alembic version marked Alembic Migration (20250205_0001_add_file_version.py): - Add column_exists() helper function to detect if column already exists - Add compatibility check in upgrade(): skip if column exists (new SQL init) - Add informative print messages for deployment clarity - Enhanced docstrings explaining compatibility strategy Deployment Scenarios: 1. New deployment with latest SQL script: Schema created with file_version, Alembic marked as applied 2. Old deployment upgrade: Alembic detects missing column and adds it This ensures backward compatibility while supporting fresh installs with complete schema.	2026-02-05 21:17:17 +08:00
q792602257	f5cb265667	feat(annotation): implement file version management for annotation feature Add support for detecting new file versions and switching to them: Backend Changes: - Add file_version column to AnnotationResult model - Create Alembic migration for database schema update - Implement check_file_version() method to compare annotation and file versions - Implement use_new_version() method to clear annotations and update version - Update upsert_annotation() to record file version when saving - Add new API endpoints: GET /version and POST /use-new-version - Add FileVersionCheckResponse and UseNewVersionResponse schemas Frontend Changes: - Add checkFileVersionUsingGet and useNewVersionUsingPost API calls - Add version warning banner showing current vs latest file version - Add 'Use New Version' button with confirmation dialog - Clear version info state when switching files to avoid stale warnings Bug Fixes: - Fix previousFileVersion returning updated value (save before update) - Handle null file_version for historical data compatibility - Fix segmented annotation clearing (preserve structure, clear results) - Fix files without annotations incorrectly showing new version warnings - Preserve total_segments when clearing segmented annotations Files Modified: - frontend/src/pages/DataAnnotation/Annotate/LabelStudioTextEditor.tsx - frontend/src/pages/DataAnnotation/annotation.api.ts - runtime/datamate-python/app/db/models/annotation_management.py - runtime/datamate-python/app/module/annotation/interface/editor.py - runtime/datamate-python/app/module/annotation/schema/editor.py - runtime/datamate-python/app/module/annotation/service/editor.py New Files: - runtime/datamate-python/alembic.ini - runtime/datamate-python/alembic/env.py - runtime/datamate-python/alembic/script.py.mako - runtime/datamate-python/alembic/versions/20250205_0001_add_file_version.py	2026-02-05 20:12:07 +08:00
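q792602257	4143bc75f9	fix: fix issues found by codex review Issue 1 - row lock held too long: - Use a double-checked locking pattern and move the HTTP call outside the lock scope - Add a _update_knowledge_set_config method dedicated to locked updates Issue 2 - incomplete cleanup: - Add pagination parameters to the _list_knowledge_sets method - Add a _list_all_knowledge_sets method that iterates over all knowledge sets - Cleanup methods use the new full listing method Issue 3 - file deletion logic could delete the wrong files: - Add a strict sourceType check to the deleteKnowledgeItemFile method - Delete the file only when sourceType is FILE_UPLOAD or MANUAL - Avoid accidentally deleting dataset files of type DATASET_FILE Files involved: - knowledge_sync.py - KnowledgeItemApplicationService.java	2026-02-05 04:07:40 +08:00
q792602257	99bd83d312	fix: fix concurrency control, data cleanup, file transaction, and COCO export issues in knowledge base sync Issue 1 - missing concurrency control: - Add a database row lock (with_for_update) in the _ensure_knowledge_set method - Change the _update_project_config method to protect config updates with a row lock Issue 3 - missing data cleanup: - Add the _cleanup_knowledge_set_for_project method to clean up the knowledge set when a project is deleted - Add the _cleanup_knowledge_item_for_file method to clean up knowledge items when a file is deleted - Call the cleanup methods in the delete_mapping endpoint Issue 4 - file operation transaction problems: - Change uploadKnowledgeItems to clean up files after a transaction failure - Change deleteKnowledgeItem to delete associated files before deleting the record - Add the deleteKnowledgeItemFile helper method Issue 5 - COCO export format: - Add the _get_image_dimensions method to read the actual image width and height - Convert percentage coordinates to pixel coordinates - Add a file_path field to AnnotationExportItem Files involved: - knowledge_sync.py - project.py - KnowledgeItemApplicationService.java - export.py - export schema.py	2026-02-05 03:55:01 +08:00
q792602257	c03bdf1a24	refactor(data-management): remove unused database methods and refine query conditions - Remove the unused update and deleteById methods from DatasetFileMapper - Remove the unused deleteById method from DatasetMapper - Import the or_ operator in the Python project for complex queries - Add a status filter to dataset file queries to exclude archived file records	2026-02-05 03:21:06 +08:00
q792602257	9057807ec1	fix(database): fix composite index length overflow on data management tables - Create a prefix index on the logical_path column to avoid exceeding the index length limit - Add a comment explaining that under utf8mb4 the index length limit is counted in bytes - Suggest a follow-up optimization: use a hash generated column instead of a VARCHAR index	2026-02-05 02:17:58 +08:00
q792602257	f15fd044ce	fix(data-management): fix escaping of comparison operators in SQL mappings - Replace the <> operator in the XML files with the &lt;&gt; entity encoding - Ensure SQL queries handle comparison operations correctly in the XML parser - Fix the query logic for dataset file status filtering - Keep the existing business logic unchanged; only the syntax is corrected	2026-02-05 02:07:31 +08:00
q792602257	d0972cbc9d	feat(data-management): implement dataset file version management and internal path protection - Replace dataset file query methods with versions that return only visible files - Introduce file status management (ACTIVE/ARCHIVED) and an internal directory structure - Implement a duplicate-file handling strategy that uses versioning instead of overwriting - Add internal data directory protection to block access to system directories such as .datamate - Refactor the file upload flow with a staging directory and post-transaction cleanup - Implement file version archiving, keeping historical versions in a dedicated storage location - Improve file path normalization and security validation - Fix the file deletion logic so that archived files are not removed by mistake - Update the dataset zip download to exclude internal system files	2026-02-04 23:53:35 +08:00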
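q792602257	473f4e717f	feat(annotation): add segment index display for text annotation - Implement generation of the segment index array - Add a segment index grid UI - Highlight the current segment - Improve the segment navigation experience - Replace the previous segment hint text with a visual index component	2026-02-04 19:16:48 +08:00
q792602257	6b0042cb66	refactor(annotation): simplify task selection logic and remove unused state management - Remove the resolveSegmentSummary call to simplify completion status checks - Delete unused segmentStats references and cache cleanup code - Simplify the state update logic in reset mode	2026-02-04 18:23:49 +08:00
q792602257	fa9e9d9f68	refactor(annotation): simplify segment management in the text annotation editor - Remove segment statistics data structures and cache logic - Remove the segment switch confirmation dialog and the auto-save option - Simplify segment loading and state management - Replace the segment list view with a simple progress display - Update the API to support fetching a single segment's content - Refactor the backend service to implement single-segment content queries	2026-02-04 18:08:14 +08:00
q792602257	707e65b017	refactor(annotation): improve segment handling in the editor service - Initialize the segments list variable when handling segmented annotations - Ensure the segment info list is initialized at the start of the function - Improve readability and the consistency of variable declarations	2026-02-04 17:35:14 +08:00
q792602257	cda22a720c	feat(annotation): improve the text annotation segmentation implementation - Add the getEditorTaskSegmentsUsingGet endpoint for fetching task segment information - Remove the text, start, and end fields from SegmentInfo to slim down the data structure - Add the EditorTaskSegmentsResponse type for the segment summary response - Implement the server-side get_task_segments method for segment info queries - Refactor the frontend component cache to manage segment state with segmentSummaryFileRef - Extract a shared _build_segment_contexts method for segment construction - Adjust segment handling in the backend _build_text_task method - Update the API type definitions to unify the RequestParams and RequestPayload types	2026-02-04 16:59:04 +08:00
q792602257	394e2bda18	feat(data-management): add cancel upload for dataset files - Define the cancel-upload REST endpoint in the OpenAPI spec - Implement the cancel-upload business logic in DatasetFileApplicationService - Add the complete cancel-upload service method to FileService - Create the DatasetUploadController controller to handle cancel-upload requests - Implement cleanup of temporary chunk files and deletion of database records	2026-02-04 16:25:03 +08:00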

559 Commits