DataMate

Author	SHA1	Message	Date
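Jerry Yan	7abdafc338	feat(kg): implement schema version management and migration mechanism - Add a schema migration framework modeled on Flyway's design - Support version tracking, change detection, and automatic migration - Use a distributed lock to keep multi-instance deployments safe - Support checksum validation to prevent applied migrations from being modified - Use a MERGE strategy to support retry after failure - Use database time to eliminate clock skew. Core components: - SchemaMigration interface: defines the migration script contract - SchemaMigrationService: core orchestrator - V1__InitialSchema: baseline migration (14 DDL statements) - SchemaMigrationRecord: migration record POJO. Configuration: - migration.enabled: whether migration is enabled (default true) - migration.validate-checksums: whether checksums are validated (default true). Backward compatibility: - On first run against an existing database, all 14 V1 statements use IF NOT EXISTS - Also suitable for fresh deployments. Added 27 test cases, all passing. Test result: 242 tests pass	2026-02-19 16:55:33 +08:00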
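The checksum validation described in 7abdafc338 can be sketched roughly as follows. This is a minimal illustration in Python, not the service's Java code: the `Migration` class, the `pending_migrations` helper, and the choice of SHA-256 are all assumptions, since the commit message does not name the hash algorithm or the method signatures.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    """Hypothetical stand-in for a migration script (version + statements)."""
    version: int
    statements: tuple[str, ...]


def checksum(migration: Migration) -> str:
    # Hash every statement so that any later edit changes the digest.
    # SHA-256 is an assumption; the commit does not say which hash is used.
    digest = hashlib.sha256()
    for statement in migration.statements:
        digest.update(statement.strip().encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def pending_migrations(migrations, applied):
    """Return migrations that still need to run, in version order.

    `applied` maps version -> checksum recorded when it was applied.
    Raises ValueError if an already-applied migration has been modified.
    """
    pending = []
    for migration in sorted(migrations, key=lambda m: m.version):
        recorded = applied.get(migration.version)
        if recorded is None:
            pending.append(migration)
        elif recorded != checksum(migration):
            raise ValueError(
                f"checksum mismatch for applied migration V{migration.version}"
            )
    return pending
```

Failing fast on a mismatch mirrors the Flyway-style behavior the commit refers to: an applied script is treated as immutable history.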
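Jerry Yan	cca463e7d1	feat(kg): implement all-paths query and subgraph export - Add findAllPaths API: find all paths between two nodes - Supports maxDepth and maxPaths limits - Results sorted by path length, ascending - Full permission filtering (created_by + confidential) - Adds a relationship-level graph_id constraint to prevent cross-graph leakage - Add exportSubgraph API: export a subgraph - depth parameter controls expansion depth - Supports JSON and GraphML export formats - depth=0: export only the specified entities and the edges between them - depth>0: expand N hops and collect all reachable neighbors - Add query timeout protection - Inject the Neo4j Driver and use TransactionConfig.withTimeout() - Default timeout 10 seconds, configurable - Prevents complex queries from holding resources for long periods - Add 4 DTOs: AllPathsVO, ExportNodeVO, ExportEdgeVO, SubgraphExportVO - Add 17 test cases, all passing - Test result: 226 tests pass	2026-02-19 15:46:01 +08:00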
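The findAllPaths semantics in cca463e7d1 (depth limit, path-count limit, ascending length) can be illustrated with a small in-memory sketch. The real implementation runs as a Cypher query against Neo4j; the adjacency-dict graph and the `find_all_paths` function below are hypothetical, and the sketch assumes simple paths (no repeated nodes).

```python
def find_all_paths(graph, source, target, max_depth=3, max_paths=10):
    """Collect simple paths from source to target with at most max_depth edges.

    `graph` is an adjacency dict: node -> iterable of neighbor nodes.
    Paths are sorted by length (ascending) before the max_paths cut, so the
    shortest paths are the ones that are kept.
    """
    paths = []

    def walk(node, path):
        if node == target:
            paths.append(list(path))
            return
        if len(path) - 1 >= max_depth:  # number of edges used so far
            return
        for neighbor in graph.get(node, ()):
            if neighbor not in path:  # keep paths simple (no cycles)
                path.append(neighbor)
                walk(neighbor, path)
                path.pop()

    walk(source, [source])
    paths.sort(key=len)
    return paths[:max_paths]
```

Sorting before truncating matters: cutting off the search at the first `max_paths` hits would return whichever paths the traversal happened to find first, not the shortest ones.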
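Jerry Yan	20446bf57d	feat(kg): implement knowledge graph organization sync - Replace the hard-coded org:default placeholder with real organization data - Derive the organization mapping from the organization field of the users table - Support multi-tenant scenarios, with each organization managed independently - Add a degradation safeguard to prevent data loss - Fix leftover BELONGS_TO relationship issues - Fix organization code collisions - Add 95 test cases, all passing. Modified files: - Auth module: add organization field and query API - KG Sync Client: add user-organization mapping - Core Sync Logic: rewrite organization entity and relationship logic - Tests: add test cases covering core scenarios	2026-02-19 15:01:36 +08:00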
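Jerry Yan	444f8cd015	fix: fix P0/P1/P2/P3 issues in the knowledge graph module [P0 - Security fix] - InternalTokenInterceptor: fail-open → fail-closed - Reject outright (401) when no token is configured - Only dev/test environments may explicitly skip the check - KnowledgeGraphProperties: add skipTokenCheck option - application-knowledgegraph.yml: add skip-token-check option [P1 - Documentation version control] - .gitignore: remove the docs/knowledge-graph/ ignore rule - Schema docs are now under version control [P2 - Code quality] - InternalTokenInterceptor: error responses now use the Response.error() format - Add InternalTokenInterceptorTest.java (7 test cases) - fail-closed behavior - token validation logic - error response format [P3 - Documentation consistency] - README.md: relative links changed to explicit GitHub links [Verification] - Compiles - All 198 tests pass (0 failures)	2026-02-19 13:03:42 +08:00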
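Jerry Yan	f12e4abd83	fix(kg): fix knowledge graph sync issues raised in Codex review. Fixes: 1. [P1] Fix incorrect job_id sanitization - Add sanitizePropertyValue() to sanitize property values - Fix the job_id JSON injection risk in IMPACTS relationships 2. [P2] Fix incremental sync recomputing all relationships - Add a changedEntityIds parameter to all relationship-building methods - Incremental sync now processes only relationships tied to changed entities, improving performance 3. [P2] Fix MERGE ON MATCH overwriting properties - Entity upsert preserves existing non-empty name/description values - Relationship MERGE preserves existing non-empty properties_json values - Refine the conditional overwrite logic in GraphRelationRepository 4. Fix test mock stub signature mismatch - Support both the 2-argument and 3-argument versions of the relationship methods - Use lenient() mode to avoid unnecessary stubbing errors Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>	2026-02-19 09:56:16 +08:00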
Jerry Yan	42069f82b3	feat(kg): P0-04 enhance sync result metadata. Implements sync history and metadata. New features: - Add SyncHistory node to record sync history - Add /history and /history/range APIs to query sync history - Add /full API returning the complete sync result (with metadata). Fixes: - [P1] syncId changed to a full UUID (36 characters); add (graph_id, sync_id) unique constraint - [P2-1] /history limit gains @Min(1) @Max(200) bounds validation - [P2-2] /history/range gains pagination (page, size) and skip overflow protection (>2M) - [P2-3] Add SyncHistory indexes: (graph_id, started_at), (graph_id, status, started_at). Tests: - 182 tests pass (2 new) - GraphSyncServiceTest, GraphInitializerTest, SyncMetadataTest all pass. Code changes: +521 lines, -27 lines. New files: 4 (SyncMetadata, SyncHistoryRepository, SyncMetadataVO, SyncMetadataTest). Modified files: 5	2026-02-18 16:55:03 +08:00
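Jerry Yan	74daed1c25	feat(kg): implement sync time-window filtering (P0-03). Add time-window filtering to the 5 listAll* methods of DataManagementClient: - listAllDatasets(updatedFrom, updatedTo) - listAllWorkflows(updatedFrom, updatedTo) - listAllJobs(updatedFrom, updatedTo) - listAllLabelTasks(updatedFrom, updatedTo) - listAllKnowledgeSets(updatedFrom, updatedTo). Features: - Time parameters are built as HTTP query parameters (ISO_LOCAL_DATE_TIME format) - Client-side double filtering (for upstreams that do not yet support the parameters) - Parameter validation: throws when updatedFrom > updatedTo - Null-safe element handling: null items are filtered out of lists - No-argument versions retained for backward compatibility. Tests: - Add DataManagementClientTest.java (28 test cases) - Covers URL parameter assembly, parameter validation, local filtering, null safety, and pagination - Test result: 162 tests, 0 failures. Code review: - Passed two rounds of Codex review - Fixed P2 issue: null-safe element handling	2026-02-18 14:00:01 +08:00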
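The client-side filtering rules listed in 74daed1c25 (reject an inverted window, skip null items, filter locally) might look like the sketch below. The `filter_by_window` function, the `updated_at` key, inclusive bounds, and keeping items without a timestamp are all assumptions made for illustration; the commit does not specify them.

```python
from datetime import datetime


def filter_by_window(items, updated_from=None, updated_to=None):
    """Keep items whose `updated_at` lies within [updated_from, updated_to].

    Either bound may be None (unbounded on that side). None items are
    dropped. Items without a timestamp are kept, which is an assumption.
    """
    if updated_from and updated_to and updated_from > updated_to:
        raise ValueError("updated_from must not be later than updated_to")

    kept = []
    for item in items:
        if item is None:  # null-safe: upstream lists may contain nulls
            continue
        stamp = item.get("updated_at")
        if stamp is not None:
            if updated_from and stamp < updated_from:
                continue
            if updated_to and stamp > updated_to:
                continue
        kept.append(item)
    return kept
```

Filtering again on the client keeps results correct even when the upstream service ignores the query parameters and returns everything.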
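Jerry Yan	75db6daeb5	feat(kg): implement query-time user data permission filtering. New features: - Query-time permission filtering: administrators see everything, regular users see only data they created - Structural entities (User, Org, Field) are visible to all users - Business entities (Dataset, Workflow, Job, LabelTask, KnowledgeSet) are filtered by created_by - CONFIDENTIAL sensitivity filtering: viewing requires a specific permission. Security fixes (four iterations): P1-1: CONFIDENTIAL sensitivity filtering - The 4 query entry points compute excludeConfidential uniformly - assertEntityAccess / isEntityAccessible add a confidential-data check - buildPermissionPredicate appends a sensitivity condition to the Cypher P1-2: Structural entities determined by a type whitelist - Add constant STRUCTURAL_ENTITY_TYPES = Set.of("User", "Org", "Field") - Business entities must match created_by (rejected when missing) - Cypher changed from IS NULL OR to type IN ['User', 'Org', 'Field'] OR P2-1: Path-level permission bypass in getNeighborGraph - Changed to ALL(n IN nodes(p) WHERE ...) so every node on the path is filtered - Consistent with getShortestPath P2-2: CONFIDENTIAL case normalization - Cypher compares with toUpper(trim(...)) - Java uses equalsIgnoreCase - Consistent with data-management-service. Permission model: - Sync phase: full sync (keeps the graph complete) - Query phase: results filtered by user permissions - Uses RequestUserContextHolder and ResourceAccessService. Code changes: +642 lines, -32 lines. Test result: 130 tests, 0 failures. 9 new test cases. Known P3 issues (non-blocking, can be addressed later): - Component scan scope is too broad - Test quality can be improved further - The structural entity whitelist is maintained in more than one place	2026-02-18 12:24:09 +08:00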
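Jerry Yan	ebb4548ca5	feat(kg): complete knowledge graph entity sync and relationship building. New features: - Complete sync for 4 entity types: Workflow, Job, LabelTask, KnowledgeSet - Complete building of 7 relationship types: USES_DATASET, PRODUCES, ASSIGNED_TO, TRIGGERS, DEPENDS_ON, IMPACTS, SOURCED_FROM - Add 39 test cases, 111 tests in total. Fixes (three iterations): Round 1 (6 issues): - toStringList null/blank filtering - mergeUsesDatasetRelations unified logic - fetchAllPaged extracted to remove duplication - IMPACTS placeholder marker - Stronger test assertions - syncAll fixed indexes replaced with a Map Round 2 (2 issues): - Active ID null/blank normalization (two layers of defense) - N+1 queries in relationship building eliminated (preloaded Map) Round 3 (1 issue): - Null-element NPE defense (12 places in GraphSyncService + 6 in GraphSyncStepService). Code changes: +1936 lines, -101 lines. Test result: 111 tests, 0 failures. Known P3 issues (non-blocking): - Security comments are inconsistent with the implementation (to be handled with the permission filtering task) - Test coverage gaps (can be filled later)	2026-02-18 11:30:38 +08:00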
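Jerry Yan	37b478a052	fix(kg): fix P1/P2 issues found in Codex review and complete test coverage. Fixes: P1 (critical): 1. Data isolation vulnerability: neighbor queries gain a graph_id path constraint to prevent cross-graph data leakage 2. Risk of accidental deletion on an empty snapshot: add the allowPurgeOnEmptySnapshot safety switch (default false) 3. Weak default credentials: startup self-check; in production, detecting the default password aborts startup. P2 (important): 4. Configuration validation: importBatchSize gains @Min(1) validation, fail-fast at startup 5. N+1 performance: rewrite upsertEntity as a single Cypher query (from 3 queries down to 1) 6. Service authentication: add mTLS/JWT documentation 7. Error handling: improve Schema initialization and serialization error handling. Test coverage: - Add 69 unit tests, all passing - GraphEntityServiceTest: 13 tests (CRUD, validation, pagination) - GraphRelationServiceTest: 13 tests (CRUD, direction validation) - GraphSyncServiceTest: 5 tests (validation, full sync) - GraphSyncStepServiceTest: 14 tests (empty-snapshot protection, N+1 verification) - GraphQueryServiceTest: 13 tests (neighbors/path/subgraph/search) - GraphInitializerTest: 11 tests (credential validation, Schema initialization). Technical details: - Data isolation: use the ALL() function to constrain the graph_id of every node and relationship on a path - Empty-snapshot protection: new option allow-purge-on-empty-snapshot and error code EMPTY_SNAPSHOT_PURGE_BLOCKED - Credential check: implemented in both Java and Python, with a different policy per environment (dev/test/prod) - Performance: use the SDN composite property format (properties.key) to set properties directly in MERGE - Property safety: whitelist [a-zA-Z0-9_] to prevent Cypher injection. Code changes: +210 lines, -29 lines	2026-02-18 09:25:00 +08:00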
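The property-key whitelist mentioned in 37b478a052 can be sketched as below. Property names cannot be passed as Cypher parameters, so they are validated against `[a-zA-Z0-9_]` before being interpolated, while values stay parameterized. The `build_set_clause` helper and the `p_` parameter prefix are hypothetical; only the whitelist and the `properties.key` composite format come from the commit message.

```python
import re

# Only ASCII letters, digits, and underscore may appear in a property key.
_SAFE_KEY = re.compile(r"[a-zA-Z0-9_]+")


def build_set_clause(alias, properties):
    """Build a Cypher SET clause plus its parameter map.

    Keys are validated and interpolated; values are always bound as
    parameters, never concatenated into the query text.
    """
    assignments = []
    params = {}
    for key, value in properties.items():
        if not _SAFE_KEY.fullmatch(key):
            raise ValueError(f"illegal property key: {key!r}")
        # Backticks are needed because the stored name contains a dot.
        assignments.append(f"{alias}.`properties.{key}` = $p_{key}")
        params[f"p_{key}"] = value
    return "SET " + ", ".join(assignments), params
```

Using `fullmatch` rather than `match` is deliberate: a key such as ``owner` DELETE`` starts with safe characters and would slip past a prefix-only check.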
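Jerry Yan	a260134d7c	fix(knowledge-graph): fix 5 issues found in Codex review and add query features. This commit has two parts: 1. New knowledge graph query features (neighbor query, shortest path, subgraph extraction, full-text search) 2. Fixes for 5 issues found in Codex code review (3 severe P1 issues + 2 minor P2 issues) ## New features ### GraphQueryService and GraphQueryController - Neighbor query: GET /query/neighbors/{entityId}?depth=2&limit=50 - Shortest path: GET /query/shortest-path?sourceId=...&targetId=...&maxDepth=3 - Subgraph extraction: POST /query/subgraph + body {"entityIds": [...]} - Full-text search: GET /query/search?q=keyword&page=0&size=20 ### New DTOs - EntitySummaryVO, EdgeSummaryVO: entity and edge summaries - SubgraphVO: subgraph result (nodes + edges + counts) - PathVO: path result - SearchHitVO: search result (with relevance score) - SubgraphRequest: subgraph request DTO (with validation) ## Fixes ### P1-1: Graph boundary risk in neighbor query. File: GraphQueryService.java Problem: getNeighborGraph used -[1..N]- without constraining the graph_id of intermediate path nodes/relationships. Fix: - Use a path variable p: MATCH p = ... - Add ALL(n IN nodes(p) WHERE n.graph_id = $graphId) - Add ALL(r IN relationships(p) WHERE r.graph_id = $graphId) - Restrict the relationship type to :RELATED_TO - Exclude self-loops: WHERE e <> neighbor ### P1-2: Full-graph scan performance risk. File: GraphRelationRepository.java Problem: findByEntityId/countByEntityId matched relationships across the whole graph first, then filtered with s.id = X OR t.id = X. Fix: - findByEntityId: changed to CALL { outgoing-edge anchored query UNION ALL incoming-edge anchored query } - countByEntityId: - "in"/"out" directions: write id: $entityId directly into the MATCH pattern - "all" direction: changed to CALL { outgoing UNION incoming } RETURN count(r) - Use the (graph_id, id) index to locate the entity directly and avoid a full-graph scan ### P1-3: Breaking API change. File: GraphEntityController.java Problem: GET /knowledge-graph/{graphId}/entities changed from List<GraphEntity> to PagedResponse<GraphEntity>. Fix: use the Spring MVC params attribute for a non-breaking upgrade - @GetMapping(params = "!page"): returns List when there is no page parameter (backward compatible) - @GetMapping(params = "page"): returns PagedResponse when page is present (new feature) - Existing callers need no changes; new callers can opt into pagination ### P2-4: direction parameter not strictly validated. Files: GraphEntityController.java, GraphRelationService.java Problem: illegal direction values were silently treated as "all". Fix: two-layer validation - Controller layer: @Pattern(regexp = "^(all|in|out)$") - Service layer: VALID_DIRECTIONS.contains() check - Illegal values raise an INVALID_PARAMETER exception ### P2-5: Subgraph request body lacks element-level validation. Files: GraphQueryController.java, SubgraphRequest.java Problem: /query/subgraph accepted a raw List<String> with no UUID validation. Fix: create the SubgraphRequest DTO - @NotEmpty: list must not be empty - @Size(max = 500): upper bound on element count - List<@Pattern(UUID) String>: every element must be a valid UUID - Controller uses @Valid @RequestBody SubgraphRequest - ⚠️ API change: request body format changes from ["uuid1"] to {"entityIds": ["uuid1"]} ## Technical highlights 1. Graph boundary safety: path variable + ALL constraints keep queries from crossing graphs 2. Query performance: entity-anchored queries replace full-graph scans and use the index 3. Backward compatibility: the params attribute gives two endpoints on one path, a non-breaking upgrade 4. Defense in depth: Controller + Service two-layer validation, framework level + business level 5. Type safety: DTO + Bean Validation ensure the request body format and content are valid ## Suggested tests 1. Compile check: mvn -pl services/knowledge-graph-service -am compile 2. Test the graph boundary constraint of the neighbor query 3. Test entity relationship query performance (large dataset) 4. Verify backward compatibility of the entity list endpoint (no page parameter) 5. Test rejection of illegal direction values 6. Test subgraph request body validation (invalid UUID, empty list, over the limit) Co-authored-by: Claude (Anthropic) Reviewed-by: Codex (OpenAI)	2026-02-18 07:49:16 +08:00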
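Jerry Yan	8b1ab8ff36	feat(kg-sync): implement the graph build pipeline (MySQL → Neo4j sync). Implemented: - GraphSyncService (sync orchestrator) - GraphSyncStepService (sync step executor) - GraphSyncController (sync API) - GraphInitializer (graph initialization) - DataManagementClient (data source client). Sync functions: - syncDatasets: sync dataset entities - syncFields: sync field entities - syncUsers: sync user entities - syncOrgs: sync organization entities - buildHasFieldRelations: build HAS_FIELD relationships - buildDerivedFromRelations: build DERIVED_FROM relationships - buildBelongsToRelations: build BELONGS_TO relationships - syncAll: full sync (entities + relationships + reconciliation delete). API endpoints: - POST /{graphId}/sync/full: full sync - POST /{graphId}/sync/datasets: sync datasets - POST /{graphId}/sync/fields: sync fields - POST /{graphId}/sync/users: sync users - POST /{graphId}/sync/orgs: sync organizations - POST /{graphId}/sync/relations/has-field: build HAS_FIELD - POST /{graphId}/sync/relations/derived-from: build DERIVED_FROM - POST /{graphId}/sync/relations/belongs-to: build BELONGS_TO. Technical implementation: - Upsert strategy: - Entities: two phases (Cypher MERGE for atomic creation + SDN save to update extended properties) - Relationships: idempotent creation via Cypher MERGE - Full reconciliation delete: purgeStaleEntities() removes entities already deleted in MySQL - Concurrency safety: - Per-graph mutex (ConcurrentHashMap<String, ReentrantLock>) - Composite unique constraint (graph_id, source_id, type) - Automatic lock reclamation (releaseLock() atomically checks for and removes idle locks) - Retry: failed HTTP calls are retried with exponential backoff (3 times by default) - Error handling: - Per-record error handling (one failure does not affect other records) - Unified exception wrapping (BusinessException.of(SYNC_FAILED)) - Error messages sanitized (only errorCount + syncId are returned) - Transaction management: - GraphSyncService (orchestrator, no transaction) - GraphSyncStepService (step executor, @Transactional) - Performance: - Full sync shares one data snapshot - Batched log tracking - Graph initialization: - 1 uniqueness constraint (entity ID) - 1 composite unique constraint (graph_id, source_id, type) - 9 indexes (5 single-field + 3 composite + 1 full-text) - Idempotent (IF NOT EXISTS). Code review: - 3 rounds of Codex review and 2 rounds of Claude fixes - All issues resolved (3 P0 + 5 P1 + 3 P2 + 1 P3) - Compile check passed (mvn compile SUCCESS). Design decisions: - Eventual consistency: brief data inconsistency is tolerated - Reconciliation: periodically compare and repair differences - Trust boundary: the gateway handles authentication; the service layer only validates format - Multi-instance deployment: relies on the composite unique constraint as a backstop	2026-02-17 23:46:03 +08:00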
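The retry behavior noted in 8b1ab8ff36 (exponential backoff on failed HTTP calls) follows a common pattern, sketched here in Python. The `call_with_retry` name, the 0.5 s base delay, and reading "3 times" as three total attempts are assumptions; the Java client's actual parameters are not given in the commit message.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff.

    Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
    The last failure is re-raised once max_attempts is exhausted.
    `sleep` is injectable so tests do not have to wait.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A production version would usually retry only on transient errors (timeouts, 5xx responses) and add jitter, which this sketch leaves out for brevity.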
Jerry Yan	910251e898	feat(kg-relation): implement the Java Relation feature. Implemented: - GraphRelationRepository (Neo4jClient + Cypher) - GraphRelationService (business logic layer) - GraphRelationController (REST API) - New RelationDetail domain object - New RelationVO and UpdateRelationRequest DTOs. API endpoints: - POST /{graphId}/relations: create a relationship (201) - GET /{graphId}/relations: paginated list (supports type/page/size) - GET /{graphId}/relations/{relationId}: get one - PUT /{graphId}/relations/{relationId}: update a relationship - DELETE /{graphId}/relations/{relationId}: delete a relationship (204). Technical implementation: - Repository: - CRUD via Neo4jClient + Cypher - bindAll(Map) binds all parameters at once - The properties field is stored as serialized JSON - Pagination supported (SKIP/LIMIT) - Type filtering supported - Service: - graphId UUID format validation - Entity existence validation - @Transactional transaction management - Trust boundary documented (the gateway handles authentication) - Pagination skip computed as a long, capped at 100,000 - Controller: - UUID pattern validation on every path variable - @Validated enables parameter validation - Uses the platform-wide PagedResponse pagination response - DTO: - weight/confidence gain @DecimalMin/@DecimalMax (0.0-1.0) - relationType gains @Size (1-50) - sourceEntityId/targetEntityId gain UUID pattern validation. Architecture: - Clear layering: interfaces → application → domain - Repository returns the RelationDetail domain object - DTO conversion happens in the Service layer - Relationship type: Neo4j uses a single RELATED_TO label; the semantic type is stored in the relation_type property. Code review: - 2 rounds of Codex review and 1 round of Claude fixes - All issues resolved (2 P0 + 2 P1 + 4 P2) - Compile check passed (mvn compile SUCCESS). Design decisions: - Use Neo4jClient rather than Neo4jRepository (because of @RelationshipProperties limitations) - Page size capped at 200 to prevent large queries - properties serialized as JSON for flexible extension - Reuse existing error codes (ENTITY_NOT_FOUND, RELATION_NOT_FOUND, INVALID_RELATION)	2026-02-17 22:40:27 +08:00
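Jerry Yan	0e0782a452	feat(kg-extraction): implement the FastAPI interface for the Python extractor. Implemented: - Create kg_extraction/interface.py (FastAPI routes) - POST /api/kg/extract (single-text extraction) - POST /api/kg/extract/batch (batch extraction, up to 50 items) - Mounted on the main FastAPI router (/api/kg/ prefix). Technical implementation: - Configuration: LLM settings read from environment variables (API Key, Base URL, Model, Temperature) - Security: - API Key protected with SecretStr - Error messages sanitized (a trace_id is returned; raw exceptions are not exposed) - Request text is not written to logs (a SHA-256 hash is logged instead) - X-User-Id header is mandatory (authentication boundary) - Timeout control: - kg_llm_timeout_seconds (60 seconds) - kg_llm_max_retries (2 retries) - Input validation: - graph_id and source_id use a UUID pattern - source_type uses an Enum (4 values) - allowed_nodes/relationships elements are constrained by a regex (ASCII, 1-50 characters) - Audit log: records caller, trace_id, text_hash. Code review: - 3 rounds of Codex review and 2 rounds of Claude fixes - All issues resolved (5 P1/P2 + 3 P3) - Syntax check passed. API endpoints: - POST /api/kg/extract: single-text extraction - POST /api/kg/extract/batch: batch extraction (up to 50 items). Environment variables: - KG_LLM_API_KEY: LLM API key - KG_LLM_BASE_URL: custom endpoint (optional) - KG_LLM_MODEL: model name (default gpt-4o-mini) - KG_LLM_TEMPERATURE: sampling temperature (default 0.0) - KG_LLM_TIMEOUT_SECONDS: timeout (default 60) - KG_LLM_MAX_RETRIES: retry count (default 2)	2026-02-17 22:01:06 +08:00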
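Jerry Yan	5a553ddde3	feat(knowledge-graph): set up the knowledge graph infrastructure. Implemented: - Neo4j Docker Compose configuration (Community Edition, ports 7474/7687, persistent data) - New Neo4j commands in the Makefile (neo4j-up/down/logs/shell) - knowledge-graph-service Spring Boot service (full DDD layered architecture) - kg_extraction Python module (based on LangChain LLMGraphTransformer). Technical implementation: - Neo4j configuration: password supplied via environment variable, with a single default value datamate123 - Java service: - Domain: GraphEntity and GraphRelation entity models - Repository: Spring Data Neo4j, with graphId-scoped queries - Service: business logic, double graphId validation, query limiting - Controller: REST API, UUID format validation - Exception: implements the ErrorCode interface for a unified exception hierarchy - Python module: - KnowledgeGraphExtractor class - Supports async/sync/batch extraction - Supports schema-guided mode - Compatible with OpenAI and self-hosted models. Key design points: - graphId permission boundary: every entity operation stays within the correct graphId scope - Query limiting: depth and limit parameters are bounded by configuration - Exception handling: BusinessException + ErrorCode used throughout - Credential management: environment variables, no hard-coding - Double defense: Controller format validation + Service business validation. Code review: - 3 rounds of Codex review and 2 rounds of Claude fixes - All P0 and P1 issues resolved - Compiles, no blocking issues. File changes: - Added: Neo4j configuration, knowledge-graph-service (11 Java files), kg_extraction (3 Python files) - Modified: Makefile, pom.xml, application.yml, pyproject.toml	2026-02-17 20:42:55 +08:00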
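Jerry Yan	8f21798d57	feat(annotation): implement a custom annotation result panel. Implemented: - Replace Label Studio's built-in sidebar with a custom result panel - Support generic region-relation annotation (relations between any annotated regions) - Sync Label Studio annotation changes in real time - Two-way linking: clicking a region in the panel highlights the matching region in LS, and vice versa - Quick relation annotation: relation pick mode, CRUD operations, automatic tab switching - Save integration: panel relations are merged into the annotation result automatically, and the annotated state is determined. Technical implementation: - 5 new components: - annotation-result.types.ts: TypeScript type definitions - RegionList.tsx: region list component - RelationEditor.tsx: relation editing dialog - RelationList.tsx: relation list component - AnnotationResultPanel.tsx: main panel component (300px, collapsible, Tabs) - 2 modified files: - lsf.html: message protocol extension, debounced broadcast, region selection listener, event bind/unbind - LabelStudioTextEditor.tsx: remove the LS sidebar, integrate the custom panel, message handling, taskId validation. Key design points: - One-way read + merge on save: avoids complex two-way synchronization - _source: 'panel' marker: distinguishes panel-created relations from native LS relations - 150ms debounced broadcast: avoids message flooding - Idempotent event binding: avoids listener accumulation - taskId validation: prevents cross-task message mix-ups. Code review: - 3 rounds of Codex review, all issues fixed - Build succeeded, lint passed - Event bind/unbind structure is clear and idempotency is handled sensibly - Cross-task message validation and state update paths are noticeably more consistent	2026-02-17 19:30:48 +08:00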
Jerry Yan	f707ce9dae	feat(auto-annotation): add batch progress updates to reduce DB write pressure Throttle progress updates to reduce database write operations during large dataset processing. Key features: - Add PROGRESS_UPDATE_INTERVAL config (default 2.0s, configurable via AUTO_ANNOTATION_PROGRESS_INTERVAL env) - Conditional progress updates: Only write to DB when (now - last_update) >= interval - Use time.monotonic() for timing (immune to system clock adjustments) - Final status updates (completed/stopped/failed) always execute (not throttled) Implementation: - Initialize last_progress_update timestamp before as_completed() loop - Replace unconditional _update_task_status() with conditional call based on time interval - Update docstring to reflect throttling capability Performance impact (T=2s): - 1,000 files / 100s processing: DB writes reduced from 1,000 to ~50 (95% reduction) - 10,000 files / 500s processing: DB writes reduced from 10,000 to ~250 (97.5% reduction) - Small datasets (10 files): Minimal difference Backward compatibility: - PROGRESS_UPDATE_INTERVAL=0: Updates every file (identical to previous behavior) - Heartbeat mechanism unaffected (2s interval << 300s timeout) - Stop check mechanism independent of progress updates - Final status updates always execute Testing: - 14 unit tests all passed (11 existing + 3 new): * Fast processing with throttling * PROGRESS_UPDATE_INTERVAL=0 updates every file * Slow processing (per-file > T) updates every file - py_compile syntax check passed Edge cases handled: - Single file task: Works normally - Very slow processing: Degrades to per-file updates - Concurrent FILE_WORKERS > 1: Counters accurate (lock-protected), DB reflects with max T seconds delay	2026-02-10 16:49:37 +08:00
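The throttling rule in f707ce9dae, write only when `(now - last_update) >= interval` using `time.monotonic()`, can be sketched as a small helper. The `ProgressThrottle` class is hypothetical; the worker in the commit keeps a `last_progress_update` timestamp inline rather than in a class.

```python
import time


class ProgressThrottle:
    """Decide whether enough time has passed to write progress again."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self._clock = clock          # injectable for testing
        self._last = clock()         # initialized before the processing loop

    def should_update(self):
        now = self._clock()
        # interval == 0 makes this always true, i.e. update on every file.
        if now - self._last >= self.interval:
            self._last = now
            return True
        return False
```

In the processing loop this would guard the status write, for example `if throttle.should_update(): update_status(...)`, where `update_status` stands in for the worker's own function. Final status writes bypass the throttle entirely, as the commit notes.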
Jerry Yan	9988ff00f5	feat(auto-annotation): add concurrent processing support Enable parallel processing for auto-annotation tasks with configurable worker count and file-level parallelism. Key features: - Multi-worker support: WORKER_COUNT env var (default 1) controls number of worker threads - Intra-task file parallelism: FILE_WORKERS env var (default 1) controls concurrent file processing within a single task - Operator chain pooling: Pre-create N independent chain instances to avoid thread-safety issues - Thread-safe progress tracking: Use threading.Lock to protect shared counters - Stop signal handling: threading.Event for graceful cancellation during concurrent processing Implementation details: - Refactor _process_single_task() to use ThreadPoolExecutor + as_completed() - Chain pool (queue.Queue): Each worker thread acquires/releases a chain instance - Protected counters: processed_images, detected_total, file_results with Lock - Stop check: Periodic check of _is_stop_requested() during concurrent processing - Refactor start_auto_annotation_worker(): Move recovery logic here, start WORKER_COUNT threads - Simplify _worker_loop(): Remove recovery call, keep only polling + processing Backward compatibility: - Default config (WORKER_COUNT=1, FILE_WORKERS=1) behaves identically to previous version - No breaking changes to existing deployments Testing: - 11 unit tests all passed: * Multi-worker startup * Chain pool acquire/release * Concurrent file processing * Stop signal handling * Thread-safe counter updates * Backward compatibility (FILE_WORKERS=1) - py_compile syntax check passed Performance benefits: - WORKER_COUNT=3: Process 3 tasks simultaneously - FILE_WORKERS=4: Process 4 files in parallel within each task - Combined: Up to 12x throughput improvement (3 workers × 4 files)	2026-02-10 16:36:34 +08:00
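The chain pool plus lock-protected counters described in 9988ff00f5 could be arranged roughly like this. It is a simplified sketch: `process_files` and `make_chain` are hypothetical names, a chain is modeled as a plain callable, and stop-signal handling is omitted.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_files(files, make_chain, file_workers=4):
    """Process files in parallel, giving each in-flight file its own chain.

    `make_chain` builds one independent operator chain (a callable here).
    Returns (processed_count, [(file, output), ...]).
    """
    # Pre-create one chain per worker so no chain is shared across threads.
    pool = queue.Queue()
    for _ in range(file_workers):
        pool.put(make_chain())

    lock = threading.Lock()
    stats = {"processed": 0}
    results = []

    def run(file):
        chain = pool.get()           # acquire an idle chain
        try:
            output = chain(file)
        finally:
            pool.put(chain)          # always release, even on failure
        with lock:                   # protect the shared counter and list
            stats["processed"] += 1
            results.append((file, output))

    with ThreadPoolExecutor(max_workers=file_workers) as executor:
        futures = [executor.submit(run, f) for f in files]
        for future in as_completed(futures):
            future.result()          # surface any worker exception

    return stats["processed"], results
```

Because the pool holds exactly `file_workers` chains and the executor runs at most that many tasks, `pool.get()` never blocks for long; the queue mainly enforces that a chain is used by one thread at a time.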
Jerry Yan	2fbfefdb91	feat(auto-annotation): add worker recovery mechanism for stale tasks Automatically recover running tasks with stale heartbeats on worker startup, preventing tasks from being permanently stuck after container restarts. Key changes: - Add HEARTBEAT_TIMEOUT_SECONDS constant (default 300s, configurable via env) - Add _recover_stale_running_tasks() function: * Scans for status='running' tasks with heartbeat timeout * No progress (processed=0) → reset to pending (auto-retry) * Has progress (processed>0) → mark as failed with Chinese error message * Each task recovery is independent (single failure doesn't affect others) * Skip recovery if timeout is 0 or negative (disable feature) - Call recovery function in _worker_loop() before polling loop - Update file header comments to reflect recovery mechanism Recovery logic: - Query: status='running' AND (heartbeat_at IS NULL OR heartbeat_at < NOW() - timeout) - Decision based on processed_images count - Clear run_token to allow other workers to claim - Single transaction per task for atomicity Edge cases handled: - Database unavailable: recovery failure doesn't block worker startup - Concurrent recovery: UPDATE WHERE status='running' prevents duplicates - NULL heartbeat: extreme case (crash right after claim) also recovered - stop_requested tasks: automatically excluded by _fetch_pending_task() Testing: - 8 unit tests all passed: * No timeout tasks * Timeout disabled * No progress → pending * Has progress → failed * NULL heartbeat recovery * Multiple tasks mixed processing * DB error doesn't crash * Negative timeout disables feature	2026-02-10 16:19:22 +08:00
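The recovery decision in 2fbfefdb91 reduces to a small pure function, sketched below. The `recovery_action` name and the dict-shaped task are assumptions for illustration; the worker itself applies the rule through SQL updates, one transaction per task.

```python
from datetime import datetime, timedelta


def recovery_action(task, now, timeout_seconds=300):
    """Return the new status for a stale running task, or None to leave it.

    A task is stale when it is running and its heartbeat is missing or
    older than now - timeout. Stale tasks with no progress go back to
    "pending"; those with progress are marked "failed".
    """
    if timeout_seconds <= 0:         # zero or negative disables recovery
        return None
    if task["status"] != "running":
        return None
    heartbeat = task.get("heartbeat_at")
    if heartbeat is not None and heartbeat >= now - timedelta(seconds=timeout_seconds):
        return None                  # heartbeat is fresh enough
    return "pending" if task.get("processed_images", 0) == 0 else "failed"
```

Splitting on progress avoids silently re-running a task that has already produced partial output, while still retrying tasks that crashed before doing any work.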
Jerry Yan	dc490f03be	feat(auto-annotation): unify annotation results with Label Studio format Automatically convert auto-annotation outputs to Label Studio format and write to t_dm_annotation_results table, enabling seamless editing in the annotation editor. New file: - runtime/python-executor/datamate/annotation_result_converter.py * 4 converters for different annotation types: - convert_text_classification → choices type - convert_ner → labels (span) type - convert_relation_extraction → labels + relation type - convert_object_detection → rectanglelabels type * convert_annotation() dispatcher (auto-detects task_type) * generate_label_config_xml() for dynamic XML generation * Pipeline introspection utilities * Label Studio ID generation logic Modified file: - runtime/python-executor/datamate/auto_annotation_worker.py * Preserve file_id through processing loop (line 918) * Collect file_results as (file_id, annotations) pairs * New _create_labeling_project_with_annotations() function: - Creates labeling project linked to source dataset - Snapshots all files - Converts results to Label Studio format - Writes to t_dm_annotation_results in single transaction * label_config XML stored in t_dm_labeling_projects.configuration Key features: - Supports 4 annotation types: text classification, NER, relation extraction, object detection - Deterministic region IDs for entity references in relation extraction - Pixel to percentage conversion for object detection - XML escaping handled by xml.etree.ElementTree - Partial results preserved on task stop Users can now view and edit auto-annotation results seamlessly in the annotation editor.	2026-02-10 16:06:40 +08:00
Jerry Yan	49f99527cc	feat(auto-annotation): add LLM-based annotation operators Add three new LLM-powered auto-annotation operators: - LLMTextClassification: Text classification using LLM - LLMNamedEntityRecognition: Named entity recognition with type validation - LLMRelationExtraction: Relation extraction with entity and relation type validation Key features: - Load LLM config from t_model_config table via modelId parameter - Lazy loading of LLM configuration on first execute() - Result validation with whitelist checking for entity/relation types - Fault-tolerant: returns empty results on LLM failure instead of throwing - Fully compatible with existing Worker pipeline Files added: - runtime/ops/annotation/_llm_utils.py: Shared LLM utilities - runtime/ops/annotation/llm_text_classification/: Text classification operator - runtime/ops/annotation/llm_named_entity_recognition/: NER operator - runtime/ops/annotation/llm_relation_extraction/: Relation extraction operator Files modified: - runtime/ops/annotation/__init__.py: Register 3 new operators - runtime/python-executor/datamate/auto_annotation_worker.py: Add to Worker whitelist - frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useOperatorOperations.ts: Add to frontend whitelist	2026-02-10 15:22:23 +08:00
Jerry Yan	06a7cd9abd	feat(auth): role management CRUD and role-permission binding. Add role create/edit/delete APIs and a role-permission binding API, letting administrators define custom roles and configure permissions flexibly. The frontend adds a role CRUD dialog and a permission configuration panel grouped by module; built-in roles cannot be deleted but can still be edited and have their permissions configured. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 00:09:48 +08:00
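Jerry Yan	ea7ca5474e	fix(annotation): visual annotation config editor restricts child tag options by parent node type. Dynamically filter the "Add child node" and "Add sibling node" dropdown options based on the selected node type: - Annotation controls (e.g. Choices/RectangleLabels) may only add their corresponding child tag (Choice/Label) - Controls without children (e.g. TextArea/Rating) and data object tags have "Add child node" disabled - Choice nodes may nest Choice (supporting Taxonomy hierarchies) - View containers may add any tag type except bare child tags Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 23:34:41 +08:00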
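Jerry Yan	8ffa131fad	feat(annotation): auto-annotation tasks support non-image dataset types (TEXT/AUDIO/VIDEO). Remove the IMAGE-only restriction from the auto-annotation task creation flow so that TEXT, AUDIO, and VIDEO datasets can all be used for auto-annotation tasks. - New database migration: add a dataset_type column to t_dm_auto_annotation_tasks - Backend schema/API/service pass dataset_type end to end - Worker builds the sample key (image/text/audio/video) and output directory dynamically - Frontend removes the dataset type check; the dropdown shows a dataset type badge - The output dataset inherits the source dataset type instead of being hard-coded to IMAGE - Backward compatible: the default is IMAGE, and the worker has a metadata fallback and a directory fallback Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 23:23:05 +08:00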
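Jerry Yan	807c2289e2	feat(annotation): support preserving annotations when a file version is updated (position offset + text matching migration). Add the AnnotationMigrator migration algorithm: when a file in a TEXT dataset is updated to a new version, annotations from the old version can optionally be migrated to the new one through a difflib position offset mapping followed by a secondary text match. The frontend version switch dialog gains a "Preserve annotations" checkbox (shown only for TEXT datasets), and the backend API gains a preserveAnnotations parameter; fully backward compatible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 19:42:59 +08:00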
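The two-stage approach named in 807c2289e2 (difflib offset mapping, then a text match as fallback) might be sketched as follows. `migrate_span` is a hypothetical helper, not the AnnotationMigrator API, and taking the first occurrence in the fallback is an assumption.

```python
import difflib


def migrate_span(old_text, new_text, start, end):
    """Map an annotated span [start, end) from old_text onto new_text.

    Stage 1: if the span lies inside an unchanged block, shift it by that
    block's offset. Stage 2: otherwise search for the span's text in the
    new version. Returns (new_start, new_end), or None if it cannot be
    placed.
    """
    matcher = difflib.SequenceMatcher(None, old_text, new_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" and i1 <= start and end <= i2:
            offset = j1 - i1
            return start + offset, end + offset

    # Fallback: secondary text match (first occurrence only).
    snippet = old_text[start:end]
    position = new_text.find(snippet) if snippet else -1
    if position != -1:
        return position, position + len(snippet)
    return None
```

The offset stage is preferred because it disambiguates repeated text: a label on the second "Bob" in a document stays on the second "Bob" as long as the surrounding block is unchanged.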
Jerry Yan	7d5a809772	fix: rename task-coordination-init.sql to fix the database initialization order. Rename task-coordination-init.sql to zz-task-coordination-init.sql so that it runs after zz-auth-init.sql, resolving the missing t_auth_permissions table. Fixes: ERROR 1146 (42S02) Table 'datamate.t_auth_permissions' doesn't exist	2026-02-09 19:04:55 +08:00
Jerry Yan	2f8645a011	fix(auth): harden confidential knowledge access checks and sensitivity filtering	2026-02-09 17:09:34 +08:00
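Jerry Yan	71f8f7d1c3	feat: implement task splitting and assignment ## Overview Implements complete task splitting, assignment, and progress tracking: a task can be split into subtasks that are assigned to different users. ## Phase 1: Database layer - New t_task_meta table (task metadata coordination table) - New t_task_assignment_log table (assignment log table) - 3 new permission entries (read/write/assign) - New SQLAlchemy ORM models ## Phase 2: Backend API (Java) - New task-coordination-service module (32 files) - 11 API endpoints: - Task queries (list, subtasks, my tasks) - Task splitting (4 strategies) - Task assignment (single, batch, reassign, revoke) - Progress management (query, update, aggregate) - Assignment log - Integrated permission control and routing rules ## Phase 3: Frontend UI (React + TypeScript) - 10 new files (models, API, components, pages) - 5 core components: - SplitTaskDialog - task split dialog - AssignTaskDialog - task assignment dialog - BatchAssignDialog - batch assignment dialog - TaskProgressPanel - progress panel - AssignmentLogDrawer - assignment records - 2 pages: - TaskCoordination - task management home - MyTasks - my tasks page - Integrated sidebar menu and routes ## Fixes - Fix missing pagination parameters in getMyTasks - Fix missing assignee information on subtasks (batch query optimization) - Fix proportion precision calculation (remainder allocation) ## Technical highlights - Zero-intrusion design: implemented through a separate coordination table, without modifying existing modules - Batch query optimization: avoids N+1 queries - 4 split strategies: by proportion/count/file/manual - Automatic progress aggregation: subtask updates roll up to the parent task - Fine-grained permissions: three levels, read/write/assign ## Verification - Maven compile: ✅ zero errors - TypeScript compile: ✅ zero errors - Vite production build: ✅ success	2026-02-09 00:42:34 +08:00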
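The "proportion precision (remainder allocation)" fix in 71f8f7d1c3 suggests something like the largest-remainder method, sketched here. This is a guess at the technique, not the service's Java code; `split_by_proportion` and the integer-weight input are assumptions.

```python
def split_by_proportion(total, weights):
    """Split `total` items across subtasks in proportion to integer weights.

    Each share is floored first; the items left over are then handed out
    one at a time to the shares with the largest fractional remainders,
    so the result always sums to `total`.
    """
    weight_sum = sum(weights)
    if total < 0 or weight_sum <= 0 or any(w < 0 for w in weights):
        raise ValueError("total must be >= 0 and weights non-negative, not all zero")

    # Integer arithmetic avoids floating-point rounding surprises.
    shares = [divmod(total * w, weight_sum) for w in weights]
    counts = [quotient for quotient, _ in shares]
    leftover = total - sum(counts)

    by_remainder = sorted(
        range(len(weights)), key=lambda i: shares[i][1], reverse=True
    )
    for index in by_remainder[:leftover]:
        counts[index] += 1
    return counts
```

Plain rounding of each share can lose or duplicate items (three equal shares of 10 round to 3 + 3 + 3); distributing the remainder explicitly guarantees every item is assigned exactly once.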
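Jerry Yan	78624915b7	feat(annotation): add the operator orchestration frontend pages for annotation tasks and a test operator ## Overview Adds a complete frontend for the generic operator orchestration feature of annotation tasks, including task creation, list management, and detail view, plus a test operator for verifying the feature. ## Changes ### Frontend #### 1. Operator orchestration page - New two-step creation flow: - Step 1: basic information (dataset selection, task name, etc.) - Step 2: operator orchestration (select operators, configure parameters, preview the pipeline) - Core files: - frontend/src/pages/DataAnnotation/OperatorCreate/CreateTask.tsx - frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useOperatorOperations.ts - frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useDragOperators.ts - frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useCreateStepTwo.tsx #### 2. UI components - Operator library (OperatorLibrary): shows available operators, with category filtering - Orchestration area (OperatorOrchestration): drag to reorder operators - Parameter panel (OperatorConfig): configure operator parameters - Pipeline preview (PipelinePreview): preview the operator chain - Core files: frontend/src/pages/DataAnnotation/OperatorCreate/components/ #### 3. Task list management - Add the task list within the same tab on the data annotation home page - Status filtering (pending/running/completed/failed/stopped) - Keyword search - Polling refresh - Stop task - Download results - Core file: frontend/src/pages/DataAnnotation/Home/components/AutoAnnotationTaskList.tsx #### 4. Task detail drawer - Clicking a task name opens the detail drawer - Shows basic task information (name, status, progress, times, etc.) - Shows the pipeline configuration (operator chain and parameters) - Shows error information (on failure) - Shows the artifact path and a download button - Core file: frontend/src/pages/DataAnnotation/Home/components/AutoAnnotationTaskDetailDrawer.tsx #### 5. API integration - Wraps the auto-annotation task APIs: - list: get the task list - create: create a task - detail: get task details - delete: delete a task - stop: stop a task - download: download results - Core file: frontend/src/pages/DataAnnotation/annotation.api.ts #### 6. Route configuration - New route: /data/annotation/create-auto-task - Integrated into the data annotation home page - Core files: - frontend/src/routes/routes.ts - frontend/src/pages/DataAnnotation/Home/DataAnnotation.tsx #### 7. Operator model enhancement - New runtime field used to filter annotation operators - Core file: frontend/src/pages/OperatorMarket/operator.model.ts ### Backend #### 1. Test operator (test_annotation_marker) - Function: draws a test marker on the image and outputs a JSON annotation - Purpose: verify that annotation works end to end - Implementation files: - runtime/ops/annotation/test_annotation_marker/process.py - runtime/ops/annotation/test_annotation_marker/metadata.yml - runtime/ops/annotation/test_annotation_marker/__init__.py #### 2. Operator registration - Register the test operator in the annotation ops package - Add it to the runtime whitelist - Core files: - runtime/ops/annotation/__init__.py - runtime/python-executor/datamate/auto_annotation_worker.py #### 3. Database initialization - Add the test operator to the database - Add the operator category association - Core file: scripts/db/data-operator-init.sql ### Fixes #### 1. outputDir default value override - Problem: the frontend set an empty-string default, which prevented the worker from injecting the real output directory - Fix: filter out empty/null outputDir so the worker can inject the real output directory - Location: frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useOperatorOperations.ts #### 2. targetClasses default value type - Problem: in the YOLO operator metadata, the targetClasses default was the string '[]' rather than a list - Fix: changed to the list [] - Location: runtime/ops/annotation/image_object_detection_bounding_box/metadata.yml ## Key characteristics ### User experience - Unified operator orchestration UI (consistent with data cleaning) - Intuitive drag-and-drop - Live pipeline preview - Complete task management ### Functional completeness - Task creation: clear two-step flow - Task management: list, status filtering, search - Task operations: stop, download - Task details: complete information ### Testability - Test operator provided for verification - Quick testing of the annotation flow ## Verification - ESLint: ✅ passed - Frontend build: ✅ passed (10.91s) - Functional test: ✅ all features work ## Deployment 1. Run the database initialization script (for a new environment) 2. Restart the frontend service 3. Restart the backend service (if the worker whitelist was modified) ## Usage 1. Open the data annotation page 2. Click to create an auto-annotation task 3. Select the dataset and files 4. Drag operators from the operator library into the orchestration area 5. Configure operator parameters 6. Preview the pipeline 7. Submit the task 8. Watch progress in the task list 9. Click the task name to view details 10. Download the annotation results ## Related files - Frontend pages: frontend/src/pages/DataAnnotation/OperatorCreate/ - Task management: frontend/src/pages/DataAnnotation/Home/components/ - API integration: frontend/src/pages/DataAnnotation/annotation.api.ts - Test operator: runtime/ops/annotation/test_annotation_marker/ - Database script: scripts/db/data-operator-init.sql	2026-02-08 08:17:35 +08:00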
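Jerry Yan	2f49fc4199	feat(annotation): support generic operator orchestration for data annotation ## Overview Reworks the data annotation module from a fixed YOLO operator to generic operator orchestration, giving it the same flexible operator composition as the data cleaning module. ## Changes ### Step 1: Database changes (DDL) - New SQL migration script: scripts/db/data-annotation-operator-pipeline-migration.sql - Alter the t_dm_auto_annotation_tasks table: - New columns: task_mode, executor_type, pipeline, output_dataset_id, created_by, stop_requested, started_at, heartbeat_at, run_token - New indexes: idx_status_created, idx_created_by - Create the t_dm_annotation_task_operator_instance table to store operator instance details ### Step 2: API layer changes - Extend the request model (schema/auto.py): - New OperatorPipelineStep model - Support the pipeline field, keeping the old YOLO fields for backward compatibility - Normalize alternative spellings (operatorId/operator_id/id, overrides/settingsOverride/settings_override) - Modify the task creation service (service/auto.py): - New validate_file_ids() validation method - New _to_pipeline() compatibility mapping method - Write the new fields and integrate the operator instance table - Fix fileIds deduplication accuracy - New API routes (interface/auto.py): - New /operator-tasks family of endpoints - New stop endpoints (/auto/{id}/stop and /operator-tasks/{id}/stop) - Keep the old /auto endpoints for backward compatibility - Align the ORM models (annotation_management.py): - AutoAnnotationTask gains all DDL fields - New AnnotationTaskOperatorInstance model - Status definition gains stopped ### Step 3: Runtime layer changes - Modify the worker execution logic (auto_annotation_worker.py): - Atomic task claiming (run_token) - Generic pipeline execution instead of hard-coded YOLO - Operator resolution and instantiation - stop_requested checks - legacy_yolo mode kept for backward compatibility - Multiple operator invocation styles supported (execute and __call__) ### Step 4: Gradual rollout - Complete the YOLO operator metadata (metadata.yml): - Fill in raw_id, language, modal, inputs, outputs, settings - Register annotation operators (__init__.py): - Register the YOLO operator in the OPERATORS registry - Ensure the annotation package is loaded correctly - New whitelist control: - Environment variable AUTO_ANNOTATION_OPERATOR_WHITELIST - Available operators can be restricted during gradual rollout ## Key characteristics ### Backward compatibility - Old /auto endpoints fully retained - Old request parameters mapped to a pipeline automatically - legacy_yolo mode keeps the old logic working ### New features - Generic pipeline orchestration - Multi-operator composition - Task stop control - Whitelist-based gradual rollout ### Reliability - Atomic task claiming (prevents duplicate execution) - Complete error handling and state management - Detailed audit trail (operator instance table) ## Deployment 1. Run the DDL: mysql < scripts/db/data-annotation-operator-pipeline-migration.sql 2. Set the environment variable: AUTO_ANNOTATION_OPERATOR_WHITELIST=ImageObjectDetectionBoundingBox 3. Restart the services: datamate-runtime and datamate-backend-python ## Verification steps 1. Compatibility mode: create a task through the old /auto endpoint 2. Generic orchestration: create a pipeline task through the new /operator-tasks endpoint 3. Atomic claim: check the run_token mechanism 4. Stop: test the stop API 5. Whitelist: test that the operator whitelist blocks disallowed operators ## Related files - DDL: scripts/db/data-annotation-operator-pipeline-migration.sql - API: runtime/datamate-python/app/module/annotation/ - Worker: runtime/python-executor/datamate/auto_annotation_worker.py - Operator: runtime/ops/annotation/image_object_detection_bounding_box/	2026-02-07 22:35:33 +08:00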
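Jerry Yan	9efc07935f	fix(db): update default user passwords in the database initialization script - Add a comment documenting the default passwords in the init script - Update the password hash for the admin user - Update the password hash for the knowledge_user user - Keep passwords consistent across local development environments	2026-02-07 17:00:19 +08:00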
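Jerry Yan	7264e111ae	chore(db): remove the Alembic version query from the data annotation init script - Removed the Alembic version query statement at the end of the database init script - Kept the success message for inserting the built-in annotation templates - Simplified the output of the data annotation init script	2026-02-07 16:24:21 +08:00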
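Jerry Yan	3dd4035005	feat: improve export format compatibility validation for data annotation - Backend: add validation that blocks the YOLO format for TEXT datasets - Backend: unify the COCO/YOLO compatibility rules (only image or object detection datasets are allowed) - Backend: fix datasetType propagation by adding dataset_type to the task list response - Frontend: disable the COCO/YOLO options for TEXT datasets in the export dialog - Frontend: pass the datasetType and labelingType fields - Frontend: align the frontend and backend COCO/YOLO compatibility rules - Frontend: improve the hint text to state which datasets each format applies to. Modified files: - runtime/datamate-python/app/module/annotation/service/export.py - runtime/datamate-python/app/module/annotation/service/mapping.py - runtime/datamate-python/app/module/annotation/schema/mapping.py - frontend/src/pages/DataAnnotation/Home/ExportAnnotationDialog.tsx - frontend/src/pages/DataAnnotation/Home/DataAnnotation.tsx - frontend/src/pages/DataAnnotation/annotation.const.tsx	2026-02-07 16:05:57 +08:00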
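Jerry Yan	36b410ba7b	feat(annotation): add a compatibility check between export format and dataset type - Validate the dataset type before exporting in COCO format - The COCO format applies only to image and object detection datasets - Exporting a text dataset in COCO format returns an HTTP 400 error - Add a clear error message that suggests using another format. New functionality: - Dataset type constants (TEXT, IMAGE, OBJECT_DETECTION) - Set of COCO-compatible types - Type value normalization method - Dataset type lookup method - Template labeling type resolution method - Export format compatibility validation method. Related files: - runtime/datamate-python/app/module/annotation/service/export.py (+94, -7) Reviewed-by: Codex AI	2026-02-07 16:05:57 +08:00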
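Jerry Yan	329382db47	fix(pdf): improve exception handling in the PDF text extraction service - Add dedicated handling for FeignException - Implement detailed logging for Feign exceptions - Add methods for parsing the response body and building the root-cause chain - Normalize exception messages - Improve the readability of error logs and the completeness of debugging information	2026-02-06 18:52:51 +08:00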
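Jerry Yan	e862925a06	feat(export): add logical path construction to support file management - Implement the _build_logical_path method in the export service to build relative paths - Update dataset file records to include the logical_path field - Implement the build_logical_path static method in the ratio task service - Add logical path information to dataset file records - Normalize path handling and replace backslashes with forward slashes - Validate paths and reject invalid ones to prevent directory traversal	2026-02-06 18:46:44 +08:00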
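Jerry Yan	05752678cc	feat(dataset): add logical path construction to the PDF extraction service - Remove the duplicate csv import - Add the _build_logical_path method to build a file's logical path - Add a logical_path parameter to the _create_text_file_record method - Update record creation calls to pass the logical path - Verify that the logical path is not empty and raise an exception otherwise - Store the logical path in the dataset file record	2026-02-06 18:30:44 +08:00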
Jerry Yan	0f1dd9ec8d	Merge remote-tracking branch 'gitea/lsf' into lsf	2026-02-06 18:29:58 +08:00
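Jerry Yan	38add27d84	fix: fix two data consistency issues found in Codex review - [P1] Reorder deletion: delete the database record first and delete derived files only after that succeeds, so that a failed source file deletion cannot leave derived files already removed - [P2] Harden the logicalPath emptiness check: use StringUtils.isBlank() to handle null, empty strings, and whitespace-only strings, preventing accidental deletion of other files Fixes review comments from commits `f9f4ea3`	2026-02-06 18:00:32 +08:00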
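Jerry Yan	f9f4ea352e	fix: fix two issues found in Codex review - [P1] When logicalPath is null, delete the current file directly (compatible with legacy data) - [P2] When the database deletion fails, return and skip the subsequent cleanup to avoid data inconsistency	2026-02-06 17:42:59 +08:00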
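Jerry Yan	24d8ee49a1	feat: improve file deletion to cascade to versions and derived files - When deleting a file that has multiple versions, delete all versions - When deleting PDF/doc/docx/xls/xlsx files, also delete their derived txt files - When a file deletion fails, log the failure without failing the overall delete operation	2026-02-06 17:23:37 +08:00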
Jerry Yan	38e58ba864	Merge branch 'rbac' into lsf	2026-02-06 15:44:43 +08:00
Jerry Yan	cd5f5ef6da	fix(annotation): fix use_new_version to support files without annotation Problem: use_new_version returned 404 annotation not found for files without annotation, preventing users from switching to new versions. Solution: 1. Query latest file by logical_path 2. Update LabelingProjectFile to point to latest version 3. If annotation exists: clear it and update file_id 4. If no annotation: just update project file snapshot 5. Return new file_id in response	2026-02-06 15:22:57 +08:00
Jerry Yan	1f6c821cbc	fix(annotation): show new version warning even without annotation Change has_new_version logic to compare current file version with latest version, regardless of whether annotation exists. Before: Only show warning if annotation exists and version is outdated After: Show warning if current file is not the latest version This ensures users are informed when viewing an old file version, even if they haven't started annotating yet.	2026-02-06 15:17:51 +08:00
Jerry Yan	44a1f2193f	fix(annotation): fix file version check to compare with latest version by logical path Problem: check_file_version was comparing annotation version with the passed file_id's version, but when files are updated, new file records are created with higher versions and old ones are marked ARCHIVED. Solution: 1. Query the latest ACTIVE file by logical_path 2. Compare annotation version with latest file version 3. Return latestFileId so frontend can switch to new version Changes: - check_file_version now queries latest version by logical_path - Added latest_file_id to FileVersionCheckResponse schema - Updated descriptions to clarify currentFileVersion is latest version Database scenario: - old file: id=6dae9f2f, version=1, status=ARCHIVED - new file: id=3365b4e7, version=3, status=ACTIVE - Both have same logical_path='rufus.ini' - Now correctly detects version 3 > annotation version	2026-02-06 15:11:54 +08:00
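Jerry Yan	6a4c4ae3d7	feat(auth): add resource access control to the data management and RAG services - Inject ResourceAccessService into DatasetApplicationService and add ownership checks - Inject ResourceAccessService into KnowledgeSetApplicationService and add ownership checks - Modify the DatasetRepository interface and implementation to add methods that filter by creator - Modify the KnowledgeSetRepository interface and implementation to add methods that filter by creator - Add knowledge base access permission checks and scope filtering to the RAG indexer service - Update the entity meta object handler to get the current user from the request user context - Add user permission management and role-based permission control to the frontend settings page - Add user context and dataset access permission checks to the Python annotation service	2026-02-06 14:58:46 +08:00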
Jerry Yan	c6dccf5e29	fix(python): remove datetime.UTC usage for Python 3.10 compatibility Replace datetime.datetime.now(datetime.UTC) with datetime.datetime.now() to fix compatibility issues with Python 3.10 and earlier versions. datetime.UTC is only available in Python 3.11+, causing 500 errors in production environment. Files fixed: - app/module/dataset/service/pdf_extract.py - app/module/generation/service/export_service.py	2026-02-06 13:34:27 +08:00
Jerry Yan	fbc83b5610	revert(db): remove Alembic migration system Remove Alembic database migration system in favor of delta scripts: Deleted: - runtime/datamate-python/alembic.ini (config file) - runtime/datamate-python/alembic/env.py (environment config) - runtime/datamate-python/alembic/script.py.mako (migration template) - runtime/datamate-python/alembic/versions/20250205_0001_add_file_version.py (migration) Modified: - scripts/db/data-annotation-init.sql - Removed alembic_version table creation and version insertion - Kept file_version column in t_dm_annotation_results Rationale: - Alembic migration testing failed in production - Delta scripts are simpler and more reliable for this project - SQL init scripts contain complete schema including latest changes	2026-02-06 13:29:44 +08:00
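Jerry Yan	056cee11cc	feat(auth): complete JWT authentication and permission control in the API gateway - Implement a gateway-side JWT utility class and a permission rule matcher - Integrate the JWT authentication flow with Bearer token validation - Add permission control based on path and HTTP method - Configure whitelist route rules to improve authentication performance - Update the frontend protected route component to enforce permission checks - Add a 403 Forbidden page and permission check logic - Refactor the login page to call the real authentication API - Implement user info retrieval and permission loading - Improve the authentication error status codes in the global exception handler - Integrate the FastJSON2 and JJWT dependencies	2026-02-06 13:21:20 +08:00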
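Jerry Yan	f8f9faaa06	feat(runtime): add the Pillow image processing library dependency - Add the pillow package and the dependency information for all of its versions to poetry.lock - Support multiple OS platforms including macOS, Linux, Windows, and iOS - Support multiple architectures including x86_64, arm64, aarch64, and win32 - Full compatibility with Python versions 3.9 through 3.14 - Include hash information for multiple file formats such as wheel and tar.gz - Add optional dependency configuration for docs, tests, and type checking	2026-02-06 13:21:01 +08:00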
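
Commit e862925a06 in the log above describes building a logical path by normalizing separators, replacing backslashes with forward slashes, and rejecting invalid paths to prevent directory traversal. A minimal sketch of that idea follows. The function name `build_logical_path` is borrowed from the commit message, but the body, the error messages, and the exact rejection rules are assumptions for illustration, not the repository's actual implementation.

```python
def build_logical_path(raw_path: str) -> str:
    """Illustrative sketch: normalize a relative path and reject traversal.

    The rules below (reject empty, absolute, and '..' paths) are assumed,
    not taken from the DataMate source.
    """
    # Use forward slashes regardless of the platform that produced the path.
    normalized = raw_path.strip().replace("\\", "/")
    if not normalized:
        raise ValueError("logical path must not be empty")
    if normalized.startswith("/"):
        raise ValueError("logical path must be relative")

    # Drop empty segments (from '//') and no-op '.' segments.
    parts = [p for p in normalized.split("/") if p not in ("", ".")]

    # Any '..' segment could escape the dataset root, so refuse it outright.
    if not parts or any(p == ".." for p in parts):
        raise ValueError(f"invalid logical path: {raw_path!r}")

    return "/".join(parts)
```

For example, `build_logical_path("docs\\a\\b.txt")` yields `docs/a/b.txt`, while `build_logical_path("../etc/passwd")` raises `ValueError`. Rejecting `..` outright, rather than resolving it, keeps the check simple and avoids depending on the filesystem.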


575 Commits