DataMate

Author	SHA1	Message	Date
Jerry Yan	9b6ff59a11	feat(kg): implement Phase 3.3 performance optimization Core features: - Neo4j index optimization (entityType, graphId, properties.name) - Redis cache (Java side, 3 cache regions, configurable TTL) - LRU cache (Python side, KG + Embedding, thread-safe) - Fine-grained cache eviction (graphId prefix matching) - Cache eviction on failure paths (finally block) New files (Java side, 7): - V2__PerformanceIndexes.java - Flyway migration, creates 3 indexes - IndexHealthService.java - index health monitoring - RedisCacheConfig.java - Spring Cache + Redis configuration - GraphCacheService.java - cache eviction manager - CacheableIntegrationTest.java - integration tests (10 tests) - GraphCacheServiceTest.java - unit tests (19 tests) - V2__PerformanceIndexesTest.java, IndexHealthServiceTest.java New files (Python side, 2): - cache.py - in-memory TTL+LRU cache (cachetools) - test_cache.py - unit tests (20 tests) Modified files (Java side, 9): - GraphEntityService.java - add @Cacheable, cache eviction - GraphQueryService.java - add @Cacheable (including the user permission context) - GraphRelationService.java - add cache eviction - GraphSyncService.java - add cache eviction (finally block, failure paths) - KnowledgeGraphProperties.java - add the Cache configuration class - application-knowledgegraph.yml - add Redis and cache TTL settings - GraphEntityServiceTest.java - add verify(cacheService) assertions - GraphRelationServiceTest.java - add verify(cacheService) assertions - GraphSyncServiceTest.java - add tests for cache eviction on failure paths Modified files (Python side, 5): - kg_client.py - integrate caching (fulltext_search, get_subgraph) - interface.py - add the /cache/stats and /cache/clear endpoints - config.py - add cache configuration fields - pyproject.toml - add the cachetools dependency - test_kg_client.py - add the _disable_cache fixture Security fixes (3 iterations): - P0: per-user isolation of cache keys (prevents cross-user data leakage) - P1-1: cache eviction after sync sub-steps (18 methods) - P1-2: search cache eviction after entity creation - P1-3: cache eviction on failure paths (finally block) - P2-1: fine-grained cache eviction (graphId prefix matching, avoids flushing other graphs) - P2-2: add verify(cacheService) assertions to service-layer tests Test results: - Java: 280 tests pass ✅ (270 → 280, +10 new) - Python: 154 tests pass ✅ (140 → 154, +14 new) Cache configuration: - kg:entities - entity cache, TTL 1h - kg:queries - query result cache, TTL 5min - kg:search - full-text search cache, TTL 3min - KG cache (Python) - 256 entries, 5min TTL - Embedding cache (Python) - 512 entries, 10min TTL	2026-02-20 18:28:33 +08:00
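Jerry Yan	39338df808	feat(kg): implement Phase 2 GraphRAG fusion Core features: - Three-layer retrieval strategy: vector retrieval (Milvus) + graph retrieval (KG service) + fusion ranking - LLM generation: supports synchronous and streaming (SSE) responses - Knowledge base access control: knowledge_base_id ownership check + collection_name binding validation New modules (9 files): - models.py: request/response models (GraphRAGQueryRequest, RetrievalStrategy, GraphContext, etc.) - milvus_client.py: Milvus vector retrieval client (OpenAI Embeddings + asyncio.to_thread) - kg_client.py: KG service REST client (full-text search + subgraph export, fail-open) - context_builder.py: rendering triples as text (10 relation templates) + context building - generator.py: LLM generation (ChatOpenAI, synchronous and streaming) - retriever.py: retrieval orchestration (parallel retrieval + fusion ranking) - kb_access.py: knowledge base access check (ownership verification + collection binding, fail-close) - interface.py: FastAPI endpoints (/query, /retrieve, /query/stream) - __init__.py: module entry point Modified files (3): - app/core/config.py: add 13 graphrag_* settings - app/module/__init__.py: register kg_graphrag_router - pyproject.toml: add the pymilvus dependency Test coverage (79 tests): - test_context_builder.py: 13 tests (rendering triples as text + context building) - test_kg_client.py: 14 tests (KG response parsing + PagedResponse + edge field mapping) - test_milvus_client.py: 8 tests (vector retrieval + asyncio.to_thread) - test_retriever.py: 11 tests (parallel retrieval + fusion ranking + fail-open) - test_kb_access.py: 18 tests (ownership check + collection binding + cross-user negative cases) - test_interface.py: 15 tests (endpoint-level regression + 403 short-circuit) Key design: - Fail-open: Milvus/KG service failures do not block the pipeline; empty results are returned - Fail-close: access control failures reject the request, preventing authorization bypass - Parallel retrieval: asyncio.gather() runs vector and graph retrieval concurrently - Fusion ranking: min-max normalization + weighted fusion (vector_weight/graph_weight) - Lazy initialization: all clients are initialized on the first request - Config fallback: falls back to kg_llm_* when graphrag_llm_* is empty Security fixes: - P1-1: KG response parsing (PagedResponse.content) - P1-2: subgraph edge field mapping (sourceEntityId/targetEntityId) - P1-3: collection_name unauthorized-access risk (ownership check + binding validation) - P1-4: synchronous Milvus I/O (asyncio.to_thread) - P1-5: test coverage (79 tests, including security negative cases) Test results: 79 tests pass ✅	2026-02-20 09:41:55 +08:00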
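Jerry Yan	0ed7dcbee7	feat(kg): implement entity alignment (aligner.py) - Implement a three-layer alignment strategy: rule layer + vector similarity layer + LLM arbitration layer - Rule layer: name normalization (NFKC, lowercasing, punctuation/whitespace removal) + rule-based scoring - Vector layer: OpenAI Embeddings + cosine similarity - LLM layer: invoked only for borderline samples, with strict JSON schema validation - Transitive merging implemented with Union-Find - Supports in-batch alignment (alignment against entities already in the store awaits KG service API support) Core components: - EntityAligner class: align() (async), align_rules_only() (sync) - Settings: kg_alignment_enabled (default false), embedding_model, thresholds - Failure policy: fail-open (an alignment failure does not interrupt the request) Integration: - Integrated into the main extraction path (extract → align → return) - extract() calls async align() - extract_sync() calls sync align_rules_only() Fixes: - P1-1: use (name, type) as the key to avoid wrongly merging same-name entities of different types - P1-2: the LLM call count is incremented in a finally block, so failed calls are counted too - P1-3: add a note on alignment against stored entities (to be implemented later) Added 41 test cases, all passing Test results: 41 tests pass	2026-02-19 18:26:54 +08:00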
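Jerry Yan	37b478a052	fix(kg): fix the P1/P2 issues found in Codex review and complete the tests Fixes: P1 (critical): 1. Data isolation vulnerability: add a graph_id path constraint to neighbor queries to prevent cross-graph data leakage 2. Risk of accidental deletion on an empty snapshot: add the allowPurgeOnEmptySnapshot guard switch (default false) 3. Weak default credentials: startup self-check; in production, startup is refused outright when a default password is detected P2 (important): 4. Config validation: add @Min(1) validation to importBatchSize, fail-fast at startup 5. N+1 performance: rewrite upsertEntity as a single Cypher query (from 3 queries down to 1) 6. Service authentication: add mTLS/JWT documentation 7. Error handling: improve error handling for Schema initialization and serialization Test coverage: - 69 new unit tests, all passing - GraphEntityServiceTest: 13 tests (CRUD, validation, pagination) - GraphRelationServiceTest: 13 tests (CRUD, direction validation) - GraphSyncServiceTest: 5 tests (validation, full sync) - GraphSyncStepServiceTest: 14 tests (empty snapshot protection, N+1 verification) - GraphQueryServiceTest: 13 tests (neighbors/paths/subgraph/search) - GraphInitializerTest: 11 tests (credential validation, Schema initialization) Technical details: - Data isolation: use the ALL() function to constrain the graph_id of every node and relationship on the path - Empty snapshot protection: new setting allow-purge-on-empty-snapshot and error code EMPTY_SNAPSHOT_PURGE_BLOCKED - Credential check: implemented on both the Java and Python sides, with a different policy per environment (dev/test/prod) - Performance: use the SDN composite property format (properties.key) to set properties directly in MERGE - Property safety: use the whitelist [a-zA-Z0-9_] to prevent Cypher injection Code changes: +210 lines, -29 lines	2026-02-18 09:25:00 +08:00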
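Jerry Yan	0e0782a452	feat(kg-extraction): implement the FastAPI interface for the Python extractor Features: - Create kg_extraction/interface.py (FastAPI routes) - Implement POST /api/kg/extract (single-text extraction) - Implement POST /api/kg/extract/batch (batch extraction, up to 50 items) - Integrate into the main FastAPI router (/api/kg/ prefix) Technical implementation: - Configuration: read the LLM settings from environment variables (API Key, Base URL, Model, Temperature) - Security: - API Key protected with SecretStr - Error messages sanitized (return a trace_id, do not expose the raw exception) - Request text is not written to logs (a SHA-256 hash is logged instead) - The X-User-Id header is mandatory (authentication boundary) - Timeout control: - kg_llm_timeout_seconds (60 seconds) - kg_llm_max_retries (2 retries) - Input validation: - graph_id and source_id use a UUID pattern - source_type uses an Enum (4 values) - allowed_nodes/relationships elements are constrained by a regex (ASCII, 1-50 characters) - Audit logging: records caller, trace_id, text_hash Code review: - Went through 3 rounds of Codex review and 2 rounds of Claude fixes - All issues resolved (5 P1/P2 + 3 P3) - Syntax check passed API endpoints: - POST /api/kg/extract: single-text extraction - POST /api/kg/extract/batch: batch extraction (up to 50 items) Environment variables: - KG_LLM_API_KEY: LLM API key - KG_LLM_BASE_URL: custom endpoint (optional) - KG_LLM_MODEL: model name (default gpt-4o-mini) - KG_LLM_TEMPERATURE: generation temperature (default 0.0) - KG_LLM_TIMEOUT_SECONDS: timeout in seconds (default 60) - KG_LLM_MAX_RETRIES: number of retries (default 2)	2026-02-17 22:01:06 +08:00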
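Jerry Yan	7092c3f955	feat(annotation): adjust the text editor size limit setting - Change the default of editor_max_text_bytes from 2MB to 0, meaning no limit - Update the size check in the text fetch service to enforce the limit only when max_bytes is greater than 0 - Revise how the byte limit is shown in the error message - Streamline the conditional checks on the config parameter	2026-02-02 17:53:09 +08:00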
Jerry Yan	d5b75fee0d	LSF	2026-01-07 00:00:16 +08:00
hhhhsc701	d82bff441a	fix: prevent deletion of predefined operators and improve error handling (#192) * fix: prevent deletion of predefined operators and improve error handling * fix: prevent deletion of predefined operators and improve error handling	2025-12-22 19:30:41 +08:00
Dallas98	8b164cb012	feat: Implement data synthesis task management with database models and API endpoints (#122)	2025-12-02 15:23:58 +08:00
Jason Wang	78f50ea520	feat: File and Annotation 2-way sync implementation (#63) * feat: Refactor configuration and sync logic for improved dataset handling and logging * feat: Enhance annotation synchronization and dataset file management - Added new fields `tags_updated_at` to `DatasetFiles` model for tracking the last update time of tags. - Implemented new asynchronous methods in the Label Studio client for fetching, creating, updating, and deleting task annotations. - Introduced bidirectional synchronization for annotations between DataMate and Label Studio, allowing for flexible data management. - Updated sync service to handle annotation conflicts based on timestamps, ensuring data integrity during synchronization. - Enhanced dataset file response model to include tags and their update timestamps. - Modified database initialization script to create a new column for `tags_updated_at` in the dataset files table. - Updated requirements to ensure compatibility with the latest dependencies.	2025-11-07 15:03:07 +08:00
Jason Wang	b5fe787c20	feat: Labeling Frontend adaptations + Backend build and deploy + Logging improvement (#55) * feat: Front-end data annotation page adaptation to the backend API. * feat: Implement labeling configuration editor and enhance annotation task creation form * feat: add python backend build and deployment; add backend configuration for Label Studio integration and improve logging setup * refactor: remove duplicate log configuration	2025-11-05 01:55:53 +08:00
Jason Wang	2f7341dc1f	refactor: Reorganize datamate-python (#34) refactor: Reorganize datamate-python (previously label-studio-adapter) into a DDD style structure.	2025-10-30 01:32:59 +08:00
hhhhsc	41e7e684c3	Merge branch 'main' into develop_deer	2025-10-28 11:03:01 +08:00
Jinglong Wang	7f819563db	Develop labeling module (#25) * refactor: remove db table management from LS adapter (mv to scripts later); change adapter to use the same MySQL DB as other modules. * refactor: Rename LS Adapter module to datamate-python	2025-10-27 16:16:14 +08:00

14 Commits
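
Commit 39338df808 describes its fusion ranking as min-max normalization followed by weighted fusion with vector_weight and graph_weight. The following is only a minimal sketch of that general technique, not the code in retriever.py: the function names, the default weights (0.6 and 0.4), the treatment of equal scores, and the choice to score a document missing from one retriever as 0.0 are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]. Assumption: if all scores are equal, each maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def fuse_rankings(
    vector_scores: Dict[str, float],
    graph_scores: Dict[str, float],
    vector_weight: float = 0.6,  # hypothetical default, not taken from the commit
    graph_weight: float = 0.4,   # hypothetical default, not taken from the commit
) -> List[Tuple[str, float]]:
    """Combine two score maps into one ranking, highest fused score first."""
    v = min_max_normalize(vector_scores)
    g = min_max_normalize(graph_scores)
    fused = {
        # Assumption: an item absent from one retriever contributes 0.0 from it.
        key: vector_weight * v.get(key, 0.0) + graph_weight * g.get(key, 0.0)
        for key in set(v) | set(g)
    }
    # Sort by descending score, then by id so ties are ordered deterministically.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


ranked = fuse_rankings({"a": 0.9, "b": 0.1}, {"b": 5.0, "c": 1.0})
```

Normalizing each source separately keeps a retriever with a larger raw score range (here the graph scores) from dominating the weighted sum.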