DataMate

Author	SHA1	Message	Date
Jerry Yan	e9e4cf3b1c	fix(kg): fix knowledge graph deployment flow issues. Fixes configuration and routing problems across the full flow from a fresh deployment to a running system. ## P0 fixes (functionality broken) ### P0-1: Wrong GraphRAG KG service URL - config.py - GRAPHRAG_KG_SERVICE_URL changed from http://datamate-kg:8080 to http://datamate-backend:8080 (container name corrected) - kg_client.py - fix API paths: /knowledge-graph/... → /api/knowledge-graph/... - kb_access.py - same class of fix: /knowledge-base/... → /api/knowledge-base/... - test_kb_access.py - test assertions updated to match. Root cause: the container name datamate-kg does not exist, and an absolute path in httpx discards the /api path in base_url ### P0-2: Vite dev proxy strips the /api prefix - vite.config.ts - remove the dedicated /api/knowledge-graph proxy rule (stripping /api caused 404s); everything now goes through the ^/api rule ## P1 fixes (functionality degraded) ### P1-1: Gateway missing routes for the KG Python endpoints - ApiGatewayApplication.java - add /api/kg/ route (to the kg-extraction Python service) - ApiGatewayApplication.java - add /api/graphrag/ route (to the GraphRAG service) ### P1-2: DATA_MANAGEMENT_URL default missing /api - KnowledgeGraphProperties.java - dataManagementUrl default http://localhost:8080 → http://localhost:8080/api - KnowledgeGraphProperties.java - annotationServiceUrl default http://localhost:8081 → http://localhost:8080/api (same JVM) - application-knowledgegraph.yml - YAML defaults updated to match ### P1-3: Neo4j k8s install chain fails - Makefile - add neo4j to VALID_K8S_TARGETS - Makefile - add neo4j case to %-k8s-install (explicit skip, with a hint to use Docker or an external instance) - Makefile - add neo4j case to %-k8s-uninstall (explicit skip). Root cause: the install target unconditionally calls neo4j-$(INSTALLER)-install, but in k8s mode neo4j is not in VALID_K8S_TARGETS, which causes the "Unknown k8s target 'neo4j'" error ## P2 fixes (minor) ### P2-1: Add Neo4j to the Docker install flow - Makefile - install target adds neo4j-$(INSTALLER)-install, started before datamate - Makefile - add neo4j to VALID_SERVICE_TARGETS - Makefile - add neo4j case to %-docker-install / %-docker-uninstall ## Verification results - mvn test: 311 tests, 0 failures ✅ - eslint: 0 errors ✅ - tsc --noEmit: passed ✅ - vite build: succeeded (17.71s) ✅ - Python tests: 46 passed ✅ - make -n install INSTALLER=k8s: no longer reports unknown target ✅ - make -n neo4j-k8s-install: correctly shows the skip message ✅	2026-02-23 01:15:31 +08:00
Jerry Yan	9b6ff59a11	feat(kg): implement Phase 3.3 performance optimization. Core features: - Neo4j index optimization (entityType, graphId, properties.name) - Redis cache (Java side, 3 cache regions, configurable TTL) - LRU cache (Python side, KG + Embedding, thread-safe) - Fine-grained cache eviction (graphId prefix matching) - Cache eviction on failure paths (finally block). New files (Java side, 7): - V2__PerformanceIndexes.java - Flyway migration, creates 3 indexes - IndexHealthService.java - index health monitoring - RedisCacheConfig.java - Spring Cache + Redis configuration - GraphCacheService.java - cache eviction manager - CacheableIntegrationTest.java - integration tests (10 tests) - GraphCacheServiceTest.java - unit tests (19 tests) - V2__PerformanceIndexesTest.java, IndexHealthServiceTest.java. New files (Python side, 2): - cache.py - in-memory TTL+LRU cache (cachetools) - test_cache.py - unit tests (20 tests). Modified files (Java side, 9): - GraphEntityService.java - add @Cacheable, cache eviction - GraphQueryService.java - add @Cacheable (including user permission context) - GraphRelationService.java - add cache eviction - GraphSyncService.java - add cache eviction (finally block, failure paths) - KnowledgeGraphProperties.java - add Cache configuration class - application-knowledgegraph.yml - add Redis and cache TTL configuration - GraphEntityServiceTest.java - add verify(cacheService) assertions - GraphRelationServiceTest.java - add verify(cacheService) assertions - GraphSyncServiceTest.java - add failure-path cache eviction tests. Modified files (Python side, 5): - kg_client.py - integrate caching (fulltext_search, get_subgraph) - interface.py - add /cache/stats and /cache/clear endpoints - config.py - add cache configuration fields - pyproject.toml - add cachetools dependency - test_kg_client.py - add _disable_cache fixture. Security fixes (3 iterations): - P0: per-user isolation of cache keys (prevents cross-user data leakage) - P1-1: cache eviction after sync sub-steps (18 methods) - P1-2: search cache eviction after entity creation - P1-3: cache eviction on failure paths (finally block) - P2-1: fine-grained cache eviction (graphId prefix matching, avoids flushing across graphs) - P2-2: add verify(cacheService) assertions to service-layer tests. Test results: - Java: 280 tests pass ✅ (270 → 280, +10 new) - Python: 154 tests pass ✅ (140 → 154, +14 new). Cache configuration: - kg:entities - entity cache, TTL 1h - kg:queries - query result cache, TTL 5min - kg:search - full-text search cache, TTL 3min - KG cache (Python) - 256 entries, 5min TTL - Embedding cache (Python) - 512 entries, 10min TTL	2026-02-20 18:28:33 +08:00
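Jerry Yan	39338df808	feat(kg): implement Phase 2 GraphRAG fusion. Core features: - Three-layer retrieval strategy: vector retrieval (Milvus) + graph retrieval (KG service) + fusion ranking - LLM generation: supports synchronous and streaming (SSE) responses - Knowledge base access control: knowledge_base_id ownership check + collection_name binding validation. New modules (9 files): - models.py: request/response models (GraphRAGQueryRequest, RetrievalStrategy, GraphContext, etc.) - milvus_client.py: Milvus vector retrieval client (OpenAI Embeddings + asyncio.to_thread) - kg_client.py: KG service REST client (full-text search + subgraph export, fail-open) - context_builder.py: triple verbalization (10 relation templates) + context building - generator.py: LLM generation (ChatOpenAI, synchronous and streaming) - retriever.py: retrieval orchestration (parallel retrieval + fusion ranking) - kb_access.py: knowledge base access check (ownership validation + collection binding, fail-close) - interface.py: FastAPI endpoints (/query, /retrieve, /query/stream) - __init__.py: module entry point. Modified files (3): - app/core/config.py: add 13 graphrag_* settings - app/module/__init__.py: register kg_graphrag_router - pyproject.toml: add pymilvus dependency. Test coverage (79 tests): - test_context_builder.py: 13 tests (triple verbalization + context building) - test_kg_client.py: 14 tests (KG response parsing + PagedResponse + edge field mapping) - test_milvus_client.py: 8 tests (vector retrieval + asyncio.to_thread) - test_retriever.py: 11 tests (parallel retrieval + fusion ranking + fail-open) - test_kb_access.py: 18 tests (ownership check + collection binding + cross-user negative cases) - test_interface.py: 15 tests (endpoint-level regression + 403 short-circuit). Key design: - Fail-open: Milvus/KG service failures do not block the pipeline; empty results are returned - Fail-close: access control failures reject the request, preventing authorization bypass - Parallel retrieval: asyncio.gather() runs vector and graph retrieval concurrently - Fusion ranking: min-max normalization + weighted fusion (vector_weight/graph_weight) - Lazy initialization: all clients are initialized on the first request - Config fallback: falls back to kg_llm_* when graphrag_llm_* is empty. Security fixes: - P1-1: KG response parsing (PagedResponse.content) - P1-2: subgraph edge field mapping (sourceEntityId/targetEntityId) - P1-3: collection_name privilege escalation risk (ownership check + binding validation) - P1-4: synchronous Milvus I/O (asyncio.to_thread) - P1-5: test coverage (79 tests, including negative security cases). Test results: 79 tests pass ✅	2026-02-20 09:41:55 +08:00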
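Jerry Yan	0ed7dcbee7	feat(kg): implement entity alignment (aligner.py) - Implement a three-layer alignment strategy: rule layer + vector similarity layer + LLM arbitration layer - Rule layer: name normalization (NFKC, lowercase, strip punctuation/whitespace) + rule-based scoring - Vector layer: OpenAI Embeddings + cosine similarity - LLM layer: called only for borderline samples, with strict JSON schema validation - Use Union-Find for transitive merging - Support in-batch alignment (alignment against the stored graph is pending KG service API support). Core components: - EntityAligner class: align() (async), align_rules_only() (sync) - Settings: kg_alignment_enabled (default false), embedding_model, thresholds - Failure policy: fail-open (an alignment failure does not interrupt the request). Integration: - Integrated into the main extraction path (extract → align → return) - extract() calls async align() - extract_sync() calls sync align_rules_only(). Fixes: - P1-1: use (name, type) as the key to avoid wrongly merging same-named entities of different types - P1-2: increment the LLM call count in a finally block so exceptions are counted too - P1-3: add a note on alignment against the stored graph (to be implemented later). Adds 41 test cases, all passing. Test results: 41 tests pass	2026-02-19 18:26:54 +08:00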
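Jerry Yan	37b478a052	fix(kg): fix P1/P2 issues found in Codex review and fill in tests. Fixes: P1 (critical): 1. Data isolation vulnerability: add a graph_id path constraint to neighbor queries to prevent cross-graph data leakage 2. Risk of accidental deletion on an empty snapshot: add the allowPurgeOnEmptySnapshot guard switch (default false) 3. Weak default credentials: startup self-check; in production, startup is refused when a default password is detected. P2 (important): 4. Config validation: add @Min(1) validation to importBatchSize, fail-fast at startup 5. N+1 performance: rewrite upsertEntity as a single Cypher query (from 3 queries down to 1) 6. Service authentication: add mTLS/JWT documentation 7. Error handling: improve Schema initialization and serialization error handling. Test coverage: - 69 new unit tests, all passing - GraphEntityServiceTest: 13 tests (CRUD, validation, pagination) - GraphRelationServiceTest: 13 tests (CRUD, direction validation) - GraphSyncServiceTest: 5 tests (validation, full sync) - GraphSyncStepServiceTest: 14 tests (empty snapshot protection, N+1 verification) - GraphQueryServiceTest: 13 tests (neighbors/paths/subgraph/search) - GraphInitializerTest: 11 tests (credential validation, Schema initialization). Technical details: - Data isolation: use the ALL() function to constrain graph_id on every node and relationship in the path - Empty snapshot protection: new setting allow-purge-on-empty-snapshot and error code EMPTY_SNAPSHOT_PURGE_BLOCKED - Credential check: implemented on both the Java and Python sides, with a different policy per environment (dev/test/prod) - Performance: use the SDN composite property format (properties.key) to set properties directly in MERGE - Property safety: use the whitelist [a-zA-Z0-9_] to prevent Cypher injection. Code changes: +210 lines, -29 lines	2026-02-18 09:25:00 +08:00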
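Jerry Yan	0e0782a452	feat(kg-extraction): implement the FastAPI interface for the Python extractor. Features: - Create kg_extraction/interface.py (FastAPI routes) - Implement POST /api/kg/extract (single-text extraction) - Implement POST /api/kg/extract/batch (batch extraction, up to 50 items) - Integrate into the main FastAPI router (/api/kg/ prefix). Implementation: - Configuration: read LLM settings from environment variables (API Key, Base URL, Model, Temperature) - Security: - API Key protected with SecretStr - Error messages sanitized (use trace_id, do not expose the raw exception) - Request text is not written to logs (a SHA-256 hash is used) - X-User-Id header is mandatory (authentication boundary) - Timeout control: - kg_llm_timeout_seconds (60 seconds) - kg_llm_max_retries (2 retries) - Input validation: - graph_id and source_id use a UUID pattern - source_type uses an Enum (4 values) - allowed_nodes/relationships elements are constrained by a regex (ASCII, 1-50 characters) - Audit log: records caller, trace_id, text_hash. Code review: - Went through 3 rounds of Codex review and 2 rounds of Claude fixes - All issues resolved (5 P1/P2 + 3 P3) - Syntax check passed. API endpoints: - POST /api/kg/extract: single-text extraction - POST /api/kg/extract/batch: batch extraction (up to 50 items). Environment variables: - KG_LLM_API_KEY: LLM API key - KG_LLM_BASE_URL: custom endpoint (optional) - KG_LLM_MODEL: model name (default gpt-4o-mini) - KG_LLM_TEMPERATURE: generation temperature (default 0.0) - KG_LLM_TIMEOUT_SECONDS: timeout (default 60) - KG_LLM_MAX_RETRIES: retry count (default 2)	2026-02-17 22:01:06 +08:00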
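Jerry Yan	7092c3f955	feat(annotation): adjust the text editor size limit configuration - Change the editor_max_text_bytes default from 2MB to 0, meaning no limit - Update the size check in the text fetch service to enforce the limit only when max_bytes is greater than 0 - Modify how the byte limit is shown in the error message - Streamline the conditional checks on the configuration parameter	2026-02-02 17:53:09 +08:00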
Jerry Yan	d5b75fee0d	LSF	2026-01-07 00:00:16 +08:00
hhhhsc701	d82bff441a	fix: prevent deletion of predefined operators and improve error handling (#192) * fix: prevent deletion of predefined operators and improve error handling * fix: prevent deletion of predefined operators and improve error handling	2025-12-22 19:30:41 +08:00
hefanli	1d19cd3a62	feature: add data-evaluation * feature: add evaluation task management function * feature: add evaluation task detail page * fix: delete duplicate definition for table t_model_config * refactor: rename package synthesis to ratio * refactor: add eval file table and refactor related code * fix: calling large models in parallel during evaluation	2025-12-04 09:23:54 +08:00
Dallas98	8b164cb012	feat: Implement data synthesis task management with database models and API endpoints (#122)	2025-12-02 15:23:58 +08:00
Jason Wang	78f50ea520	feat: File and Annotation 2-way sync implementation (#63 ) * feat: Refactor configuration and sync logic for improved dataset handling and logging * feat: Enhance annotation synchronization and dataset file management - Added new fields `tags_updated_at` to `DatasetFiles` model for tracking the last update time of tags. - Implemented new asynchronous methods in the Label Studio client for fetching, creating, updating, and deleting task annotations. - Introduced bidirectional synchronization for annotations between DataMate and Label Studio, allowing for flexible data management. - Updated sync service to handle annotation conflicts based on timestamps, ensuring data integrity during synchronization. - Enhanced dataset file response model to include tags and their update timestamps. - Modified database initialization script to create a new column for `tags_updated_at` in the dataset files table. - Updated requirements to ensure compatibility with the latest dependencies.	2025-11-07 15:03:07 +08:00
Jason Wang	b5fe787c20	feat: Labeling Frontend adaptations + Backend build and deploy + Logging improvement (#55) * feat: Front-end data annotation page adaptation to the backend API. * feat: Implement labeling configuration editor and enhance annotation task creation form * feat: add python backend build and deployment; add backend configuration for Label Studio integration and improve logging setup * refactor: remove duplicate log configuration	2025-11-05 01:55:53 +08:00
Jason Wang	2f7341dc1f	refactor: Reorganize datamate-python (#34) refactor: Reorganize datamate-python (previously label-studio-adapter) into a DDD style structure.	2025-10-30 01:32:59 +08:00
hhhhsc	41e7e684c3	Merge branch 'main' into develop_deer	2025-10-28 11:03:01 +08:00
Jinglong Wang	7f819563db	Develop labeling module (#25) * refactor: remove db table management from LS adapter (mv to scripts later); change adapter to use the same MySQL DB as other modules. * refactor: Rename LS Adapter module to datamate-python	2025-10-27 16:16:14 +08:00
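
Note on 39338df808: the fusion ranking named in that commit (min-max normalization followed by a weighted sum using vector_weight and graph_weight) could look roughly like the sketch below. This is an illustration only, not the code in retriever.py. The function names, the dict-of-scores input shape, the default weights, and the handling of a constant score set are all assumptions.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]. A constant score set maps to 1.0 (an assumption)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def fuse(
    vector_scores: Dict[str, float],
    graph_scores: Dict[str, float],
    vector_weight: float = 0.6,  # hypothetical default
    graph_weight: float = 0.4,   # hypothetical default
) -> List[Tuple[str, float]]:
    """Combine two score sets into one ranking, best first."""
    v = min_max_normalize(vector_scores)
    g = min_max_normalize(graph_scores)
    # An item missing from one retriever contributes 0 from that side.
    fused = {
        key: vector_weight * v.get(key, 0.0) + graph_weight * g.get(key, 0.0)
        for key in set(v) | set(g)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


ranked = fuse({"a": 0.9, "b": 0.1}, {"b": 5.0, "c": 1.0})
```

Normalizing each side before weighting keeps one retriever's raw score scale (for example, unbounded graph scores) from dominating the other.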

16 Commits
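
Note on 0ed7dcbee7: the rule layer and Union-Find merging described in that commit can be sketched as follows. This is a minimal illustration and not the contents of aligner.py. It assumes entities are plain (name, type) tuples and that two entities match at the rule layer when their normalized names and types are equal. `normalize_name`, `UnionFind`, and `group_entities` are hypothetical names, and only ASCII punctuation is stripped.

```python
import string
import unicodedata
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (name, type), the merge key mentioned in fix P1-1


def normalize_name(name: str) -> str:
    """NFKC-fold, lowercase, then drop ASCII punctuation and whitespace."""
    folded = unicodedata.normalize("NFKC", name).lower()
    return "".join(
        ch for ch in folded if ch not in string.punctuation and not ch.isspace()
    )


class UnionFind:
    """Disjoint-set structure with path halving."""

    def __init__(self) -> None:
        self.parent: Dict[Key, Key] = {}

    def find(self, item: Key) -> Key:
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a: Key, b: Key) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def group_entities(entities: List[Key]) -> List[List[Key]]:
    """Group entities whose normalized name and type both match."""
    uf = UnionFind()
    first_seen: Dict[Key, Key] = {}
    for name, etype in entities:
        entity = (name, etype)
        uf.find(entity)  # register the entity
        rule_key = (normalize_name(name), etype)
        if rule_key in first_seen:
            uf.union(first_seen[rule_key], entity)
        else:
            first_seen[rule_key] = entity
    groups: Dict[Key, List[Key]] = {}
    for entity in entities:
        groups.setdefault(uf.find(entity), []).append(entity)
    return list(groups.values())


groups = group_entities(
    [("Open AI", "Org"), ("openai", "Org"), ("OpenAI", "Product"),
     ("\uff2f\uff50\uff45\uff4e\uff21\uff29.", "Org")]  # full-width "OpenAI."
)
```

Because the type is part of the key, the "Product" entity stays in its own group even though its normalized name matches the "Org" entities.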