Compare commits

..

47 Commits

Author SHA1 Message Date
473f4e717f feat(annotation): add text segment index display
- Implemented generation of the segment index array
- Added a segment index grid display in the UI
- Highlight the currently active segment
- Improved the segment navigation experience
- Replaced the previous segment hint text with a visual index component
2026-02-04 19:16:48 +08:00
6b0042cb66 refactor(annotation): simplify task selection logic and remove unused state management
- Removed the resolveSegmentSummary call to simplify the completion-status check
- Deleted unused segmentStats references and cache cleanup code
- Simplified state updates in reset mode
2026-02-04 18:23:49 +08:00
fa9e9d9f68 refactor(annotation): simplify segment management in the text annotation editor
- Removed segment-statistics data structures and caching logic
- Deleted the segment-switch confirmation dialog and the auto-save option
- Simplified segment loading and state management
- Replaced the segment list view with a simple progress display
- Updated the API to support fetching a single segment's content
- Reworked the backend service to query single-segment content
2026-02-04 18:08:14 +08:00
707e65b017 refactor(annotation): improve segment handling in the editor service
- Initialize the segments list variable when processing segment annotations
- Ensure the segment info list is initialized at the start of the function
- Improve readability and consistency of variable declarations
2026-02-04 17:35:14 +08:00
cda22a720c feat(annotation): improve the text annotation segmentation implementation
- Added the getEditorTaskSegmentsUsingGet API for fetching task segment info
- Removed the text, start, and end fields from SegmentInfo to slim down the data structure
- Added the EditorTaskSegmentsResponse type for segment summary responses
- Implemented the server-side get_task_segments method for segment queries
- Reworked the frontend component cache to track segment state via segmentSummaryFileRef
- Extracted the shared _build_segment_contexts method from the segment-building logic
- Adjusted segment handling in the backend _build_text_task method
- Updated API type definitions, unifying the RequestParams and RequestPayload types
2026-02-04 16:59:04 +08:00
394e2bda18 feat(data-management): add cancel-upload support for dataset files
- Defined the cancel-upload REST endpoint in the OpenAPI spec
- Implemented the cancel-upload business logic in DatasetFileApplicationService
- Added a complete cancel-upload service method to FileService
- Created the DatasetUploadController to handle cancel-upload requests
- Implemented cleanup of temporary chunk files and deletion of database records
2026-02-04 16:25:03 +08:00
4220284f5a refactor(utils): rework streaming split-and-upload for files
- Split the streamSplitAndUpload function into a standalone processFileLines function
- Simplified line-by-line processing and removed redundant line collection and caching
- Improved concurrent uploads by managing upload tasks with a set of Promises
- Fixed abort-signal handling and error propagation during upload
- Unified the progress callback parameters and improved byte and line tracking
- Improved empty-line skip counting and the upload result return value
2026-02-04 16:11:03 +08:00
8415166949 refactor(upload): rework chunk upload to resolve request IDs on demand
- Removed the up-front batch fetching of reqId in favor of on-demand resolution
- Added a resolveReqId function for fetching request IDs dynamically
- Added an onReqIdResolved callback fired when an ID is resolved
- Improved line-based splitting so that each line is uploaded as its own file
- Improved empty-line skipping and counted the skipped lines
- Fixed the mapping between fileNo and chunkNo
- Updated the streamSplitAndUpload parameter structure
2026-02-04 15:58:58 +08:00
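A sketch of the on-demand reqId resolution described in the commit above (option names follow the commit message and the diff further down; exact signatures are assumptions, not the real file.util.ts API):

```ts
// Instead of batching all pre-upload calls up front, the caller supplies
// resolveReqId and is notified through onReqIdResolved once the ID is known.
interface StreamUploadIdOptions {
  resolveReqId: (info: { totalFileNum: number; totalSize: number }) => Promise<number>;
  onReqIdResolved?: (reqId: number) => void;
}

async function ensureReqId(
  options: StreamUploadIdOptions,
  totalFileNum: number,
  totalSize: number
): Promise<number> {
  // resolve lazily, right before the first chunk of this file is uploaded
  const reqId = await options.resolveReqId({ totalFileNum, totalSize });
  options.onReqIdResolved?.(reqId);
  return reqId;
}
```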
078f303f57 Revert "fix: prevent hasArchive and splitByLine from being enabled at the same time"
This reverts commit 50f2da5503.
2026-02-04 15:48:01 +08:00
50f2da5503 fix: prevent hasArchive and splitByLine from being enabled at the same time
Problem: hasArchive defaults to true and splitByLine could be enabled alongside it,
      so archives were incorrectly split by line, which is contradictory.

Fix:
1. Disable the splitByLine switch while hasArchive=true
2. Add a useEffect that automatically turns splitByLine off when hasArchive becomes true

Changed file: frontend/src/pages/DataManagement/Detail/components/ImportConfiguration.tsx
2026-02-04 15:43:53 +08:00
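A minimal sketch of the two-part fix, assuming hypothetical hook and setter names (the real logic lives in ImportConfiguration.tsx and is not shown in this diff):

```ts
// The splitByLine switch is disabled while hasArchive is true, and this effect
// force-closes splitByLine whenever hasArchive turns on.
import { useEffect } from "react";

export function useArchiveSplitGuard(
  hasArchive: boolean,
  setSplitByLine: (value: boolean) => void
) {
  useEffect(() => {
    if (hasArchive) {
      // archives must not be split by line, so switch it off automatically
      setSplitByLine(false);
    }
  }, [hasArchive, setSplitByLine]);
}

// In the form, the splitByLine switch itself would also carry disabled={hasArchive}.
```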
3af1daf8b6 fix: fix the "pre-upload request does not exist" error in streaming split upload
Problem: handleStreamUpload called preUpload only once for all files and set
      totalFileNum: files.length (the number of original files), while the number of
      files actually uploaded is the total line count after splitting, so the backend
      deleted the pre-upload request too early.

Fix: move the preUpload call inside the file loop so that each original file gets its
      own preUpload call with totalFileNum: 1 and its own reqId. This avoids the
      request being deleted prematurely when splitting by line.

Changed file: frontend/src/hooks/useSliceUpload.tsx
2026-02-04 15:39:05 +08:00
7c7729434b fix: fix three issues in streaming split upload
1. Implement real concurrency control to avoid firing a large number of requests at once
   - Use a task-queue pattern so no more than maxConcurrency tasks run at the same time
   - Start the next task only after one finishes, instead of starting everything up front

2. Fix the API error ("pre-upload request does not exist")
   - All chunks now use the same fileNo=1 (they belong to the same pre-upload request)
   - chunkNo now carries the line number, i.e. which line of data the chunk is
   - Root cause: previously every line was treated as a separate file, but only the first file had a valid pre-upload request

3. Preserve the original file extension
   - Correctly extract and keep the file extension
   - For example: 132.txt → 132_000001.txt (instead of 132_000001)
2026-02-04 15:06:02 +08:00
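A minimal sketch of the task-queue concurrency pattern described in item 1 above: at most maxConcurrency tasks run at once, and a new one starts only when a previous one finishes. This is an illustration, not the code in file.util.ts; uploadLine is a placeholder.

```ts
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  maxConcurrency: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // each worker pulls the next task index until the queue is drained
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workers = Array.from(
    { length: Math.min(maxConcurrency, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}

// Example: upload each line as its own request, with no more than 3 in flight.
declare function uploadLine(line: string): Promise<void>;
async function uploadLines(lines: string[]) {
  await runWithConcurrency(lines.map((line) => () => uploadLine(line)), 3);
}
```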
17a62cd3c2 fix: fix upload cancellation so HTTP requests are actually aborted
- Check signal.aborted in the XMLHttpRequest path
- Fix the cancelFn closure issue in useSliceUpload
- Ensure both streaming and chunked uploads can be cancelled correctly
2026-02-04 14:51:23 +08:00
f381d641ab fix(upload): fix filename handling in streaming upload
- Pass the real total file count to the pre-upload call instead of the fixed value -1
- Remove the file-extension preservation logic from the import-configuration file split
- Delete the fileExtension parameter from the streaming upload options
- Remove the extension-handling code from the streaming upload implementation
- Simplify new filename generation so the extension suffix is no longer appended
2026-02-04 07:47:41 +08:00
c8611d29ff feat(upload): implement streaming split upload to improve large-file uploads
Split and upload in a streaming fashion so large files no longer freeze the frontend by being loaded all at once.

Changes:
1. file.util.ts - core streaming split-and-upload
   - Added streamSplitAndUpload, which uploads while splitting
   - Added shouldStreamUpload to decide whether to use streaming upload
   - Added the StreamUploadOptions and StreamUploadResult interfaces
   - Tuned the chunk size (default 5MB)

2. ImportConfiguration.tsx - smart upload strategy
   - Large files (>5MB) use streaming split upload
   - Small files (≤5MB) keep the traditional split approach
   - The UI is unchanged

3. useSliceUpload.tsx - streaming upload handling
   - Added handleStreamUpload to process streaming upload events
   - Supports concurrent uploads and better progress management

4. TaskUpload.tsx - progress display
   - Registers the streaming upload event listener
   - Shows streaming upload info (uploaded line count, current file, etc.)

5. dataset.model.ts - type definitions
   - Added the StreamUploadInfo interface
   - Added the streamUploadInfo and prefix fields to the TaskItem interface

Implementation notes:
- Streaming reads: Blob.slice reads the file chunk by chunk instead of loading it all at once
- Line detection: split on newlines and upload each complete line immediately
- Memory use: the buffer keeps only the current chunk and the incomplete trailing line, never the full split result
- Concurrency control: up to 3 concurrent uploads
- Progress: uploaded line count and overall progress are shown in real time
- Error handling: a failed file does not affect the others
- Backward compatible: small files still use the original split approach

Benefits:
- Large-file uploads no longer freeze the page
- Memory usage drops significantly (from the whole file to just the current chunk)
- Upload throughput improves (split while uploading, several small files in parallel)

Related files:
- frontend/src/utils/file.util.ts
- frontend/src/pages/DataManagement/Detail/components/ImportConfiguration.tsx
- frontend/src/hooks/useSliceUpload.tsx
- frontend/src/pages/Layout/TaskUpload.tsx
- frontend/src/pages/DataManagement/dataset.model.ts
2026-02-03 13:12:10 +00:00
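A minimal sketch of the streaming split idea described above (read the file with Blob.slice chunk by chunk, emit complete lines, keep only the unfinished tail in memory). Names and the callback shape are illustrative; the real implementation is streamSplitAndUpload in file.util.ts, which is not shown in this diff.

```ts
async function splitFileByLine(
  file: File,
  onLine: (line: string, lineNo: number) => Promise<void>,
  chunkSize = 5 * 1024 * 1024 // default 5MB, per the commit message
): Promise<number> {
  const decoder = new TextDecoder("utf-8");
  let buffer = ""; // holds only the current incomplete line
  let lineNo = 0;

  for (let offset = 0; offset < file.size; offset += chunkSize) {
    const chunk = file.slice(offset, offset + chunkSize);
    buffer += decoder.decode(await chunk.arrayBuffer(), { stream: true });

    // every complete line is handed off (uploaded) immediately
    let newlineIndex: number;
    while ((newlineIndex = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newlineIndex).replace(/\r$/, "");
      buffer = buffer.slice(newlineIndex + 1);
      if (line.trim().length > 0) {
        await onLine(line, ++lineNo);
      }
    }
  }

  // flush the last line if the file does not end with a newline
  if (buffer.trim().length > 0) {
    await onLine(buffer, ++lineNo);
  }
  return lineNo;
}
```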
147beb1ec7 feat(annotation): pre-generate text segmentation
Pre-generate the text segment structure when an annotation task is created, so it does not have to be computed on the fly every time the annotation page is opened.

Changes:
1. Added the precompute_segmentation_for_project method to AnnotationEditorService
   - Pre-computes the segment structure for every text file in the project
   - Uses AnnotationTextSplitter to do the splitting
   - Persists the segment structure to the AnnotationResult table (status IN_PROGRESS)
   - Supports retrying on failure
   - Returns statistics

2. Updated the create_mapping interface
   - After the annotation task is created, trigger segment pre-generation automatically if segmentation is enabled and the dataset is a text dataset
   - Wrap the call in try-except so a segmentation failure does not break project creation

Notes:
- Reuses the existing AnnotationTextSplitter class
- The segment data structure matches the existing segmented-annotation format
- Backward compatible (tasks without pre-generated segments still compute on the fly)
- Performance: avoids recomputation when opening the annotation page

Related files:
- runtime/datamate-python/app/module/annotation/service/editor.py
- runtime/datamate-python/app/module/annotation/interface/project.py
2026-02-03 12:59:29 +00:00
699031dae7 fix: fix the compile error from clearing a dataset's parent dataset on edit
Analysis:
An earlier attempt used the @TableField(updateStrategy = FieldStrategy.IGNORED/ALWAYS)
annotation to force null values to be written, but FieldStrategy.ALWAYS may not exist
in the MyBatis-Plus 3.5.14 version in use, which caused a compile error.

Fix:
1. Remove the @TableField(updateStrategy) annotation from the parentDatasetId field in Dataset.java
2. Remove the no-longer-needed import com.baomidou.mybatisplus.annotation.FieldStrategy
3. In DatasetApplicationService.updateDataset:
   - Add import com.baomidou.mybatisplus.core.conditions.update.LambdaUpdateWrapper
   - Keep the original parentDatasetId value for comparison
   - After handleParentChange, check whether parentDatasetId has changed
   - If it has, update the parentDatasetId column explicitly with a LambdaUpdateWrapper
   - That way the column is written correctly even when the new value is null

Why it works:
MyBatis-Plus updateById only writes non-null fields by default.
Using the set method of LambdaUpdateWrapper assigns the column explicitly, including
null values, so the field is always persisted to the database.
2026-02-03 11:09:15 +00:00
88b1383653 fix: restore sending an empty string from the frontend to support clearing the parent dataset
Notes:
Removed the earlier logic that converted the empty string to undefined;
the form value is now sent as-is, including the empty string.

Works together with the backend change (commit cc6415c):
1. When the user selects "no parent dataset", an empty string "" is sent
2. The backend handleParentChange converts the empty string to null via normalizeParentId
3. Dataset.parentDatasetId carries @TableField(updateStrategy = FieldStrategy.IGNORED)
4. This ensures the value is written to the database even when it is null
2026-02-03 10:57:14 +00:00
cc6415c4d9 fix: fix clearing a dataset's parent dataset on edit
Problem:
When editing a dataset in data management, selecting "no parent dataset" after one had been set did not take effect on save.

Root cause:
MyBatis-Plus updateById uses the FieldStrategy.NOT_NULL strategy by default,
so a column is only written when its value is non-null.
When parentDatasetId goes from a value to null, the change is never persisted.

Fix:
Add @TableField(updateStrategy = FieldStrategy.IGNORED) to the parentDatasetId field in Dataset.java,
so the column is written even when the value is null.

Together with the frontend change (sending the empty string again), the parent dataset can now be cleared:
1. The frontend sends an empty string for "no parent dataset"
2. The backend handleParentChange converts it to null via normalizeParentId
3. dataset.setParentDatasetId(null) sets it to null
4. With the IGNORED strategy, the null value is written to the database
2026-02-03 10:57:08 +00:00
3d036c4cd6 fix: fix clearing a dataset's parent dataset on edit
Problem:
When editing a dataset in data management, selecting "no parent dataset" after one had been set did not take effect on save.

Cause:
The backend updateDataset method guarded the call:
```java
if (updateDatasetRequest.getParentDatasetId() != null) {
    handleParentChange(dataset, updateDatasetRequest.getParentDatasetId());
}
```
When parentDatasetId is null or an empty string the condition is false, handleParentChange never runs, and the parent dataset cannot be cleared.

Fix:
Drop the condition and always call handleParentChange. Internally, normalizeParentId converts both the empty string and null to null, so the method supports both setting a new parent dataset and clearing the association.

Together with the frontend change (commit 2445235), which converts the empty string to undefined (deserialized as null by the backend), the clear operation now executes correctly.
2026-02-03 09:35:09 +00:00
2445235fd2 fix: fix clearing the parent dataset not taking effect when editing a dataset
Problem:
When editing a dataset in data management, selecting "no parent dataset" after one had been set did not take effect on save.

Cause:
- In BasicInformation.tsx, the "no parent dataset" option has the value ""
- When the user chooses not to link a dataset, parentDatasetId is ""
- The backend API treated the empty string as an invalid value and ignored it, instead of interpreting it as "clear the association"

Fix:
- In the handleSubmit function of EditDataset.tsx, convert an empty parentDatasetId to undefined
- Use formValues.parentDatasetId || undefined so the empty string becomes undefined
- The backend API then correctly recognizes the request as clearing the parent dataset
2026-02-03 09:23:13 +00:00
893e0a1580 fix: show the task center immediately when uploading files
Problem:
When uploading files from the dataset detail page, the dialog closes after confirming, but the task center only appears once file processing finishes (especially with split-by-line enabled), which is a poor experience.

Changes:
1. useSliceUpload.tsx: show the task center immediately in createTask, right after the task is created
2. ImportConfiguration.tsx: in handleImportData, dispatch the show:task-popover event to open the task center before the time-consuming file processing (such as splitting) starts

Effect:
- Before: confirm → dialog closes → (wait for file processing) → task center appears
- After: confirm → dialog closes + task center appears immediately → file processing starts
2026-02-03 09:14:40 +00:00
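A sketch of the ordering change described above. The event names show:task-popover and upload:dataset-stream come from this changeset; the handler name and the simplified event detail are assumptions for illustration.

```ts
async function handleImportData(
  files: File[],
  splitFiles: (files: File[]) => Promise<File[]>
) {
  // 1. open the task center right away, before any heavy processing
  window.dispatchEvent(
    new CustomEvent("show:task-popover", { detail: { show: true } })
  );

  // 2. only then run the time-consuming step (e.g. split-by-line)
  const prepared = await splitFiles(files);

  // 3. hand the prepared files to the upload pipeline
  //    (the real event detail also carries dataset, updateEvent, prefix, ...)
  window.dispatchEvent(
    new CustomEvent("upload:dataset-stream", { detail: { files: prepared } })
  );
}
```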
05e6842fc8 refactor(DataManagement): remove unnecessary dataset-type filtering
- Removed the dataset-type filter operation
- Removed the unused textDatasetTypeOptions variable
- Simplified how BasicInformation receives its data
- Less redundant code and slightly better component performance
2026-02-03 13:33:12 +08:00
da5b18e423 feat(scripts): add APT cache pre-installation to fix offline builds
- Added the APT cache directory and the export-cache.sh build script
- Added build-base-images.sh to build base images with APT packages pre-installed
- Added build-offline-final.sh, the final offline build script
- Added new offline build targets to Makefile.offline.mk
- Extended README.md with details on the APT cache problem and its solution
- Added offline Dockerfiles that use the pre-installed base images for several services
- Updated the packaging script to include the APT cache in the final archive
2026-02-03 13:16:17 +08:00
31629ab50b docs(offline): document the classic build path and troubleshooting for offline builds
- Added the classic docker build approach as the recommended option
- Added the offline environment diagnostic command make offline-diagnose
- Extended the troubleshooting chapter with several common problems and solutions
- Added a file inventory and a recommended workflow
- Documented several workarounds for BuildKit builders that cannot use local images
- Updated the build command usage notes and important hints
2026-02-03 13:10:28 +08:00
fb43052ddf feat(build): add a classic Docker build path and diagnostics
- Added --pull=false to build-offline.sh and improved its error handling
- Added --pull=false to each service build task in Makefile.offline.mk
- Added build-offline-classic.sh, a classic build path that does not use BuildKit
- Added build-offline-v2.sh, an enhanced BuildKit offline build
- Added diagnose.sh to check the state of the offline build environment
- Added offline-build-classic and offline-diagnose to the Makefile
2026-02-02 23:53:45 +08:00
c44c75be25 fix(login): fix login page style issues
- Fixed the CSS class name of the description under the title, removing a stray space
- Updated the style class of the footer copyright line
- Simplified the footer description text to keep the branding consistent
2026-02-02 22:49:46 +08:00
05f3efc148 build(docker): switch Docker image sources to the Nanjing University mirrors
- Switched the base images in the frontend Dockerfile from gcr.io to gcr.nju.edu.cn
- Updated the nodejs20-debian12 image source in the offline Dockerfile
- Changed the base image list in export-cache.sh to the Nanjing University mirrors
- Updated the image pull addresses in Makefile.offline.mk to the local mirror
- Tidied the formatting and output of export-cache.sh
- Added warning handling during cache export
2026-02-02 22:48:41 +08:00
16eb5cacf9 feat(data-management): add extended metadata support for knowledge items
- Implemented the metadata update logic in KnowledgeItemApplicationService
- Added the metadata field to CreateKnowledgeItemRequest
- Added the metadata field to UpdateKnowledgeItemRequest
- Extended metadata can now be stored when creating and updating knowledge items
2026-02-02 22:20:05 +08:00
e71116d117 refactor(components): update tag component types and data handling
- Changed the Tag interface so id and color are optional
- Changed the onAddTag callback parameter from an object to a string
- Wrapped data fetching in AddTagPopover with useCallback
- Adjusted tag de-duplication to match on either id or name
- Updated the data types and generic constraints of the DetailHeader component
- Added the parseMetadata helper for parsing metadata
- Added the isAnnotationItem function to detect annotation-type items
- Improved tag handling and type conversion on the knowledge base detail page
2026-02-02 22:15:16 +08:00
cac53d7aac fix(knowledge): rename the knowledge management page title to "Knowledge Sets"
- Changed the page title from "Knowledge Management" to "Knowledge Sets"
2026-02-02 21:49:39 +08:00
43b4a619bc refactor(knowledge): remove the extended metadata field from knowledge base creation
- Removed the extended metadata input area from the form
- Removed the corresponding Form.Item wrapper
- Simplified the create-knowledge-base form structure
2026-02-02 21:48:21 +08:00
9da187d2c6 feat(build): add offline build support
- Added build-offline.sh for building in air-gapped environments
- Added offline Dockerfiles that use local resources instead of network downloads
- Created export-cache.sh to pre-download dependencies in a connected environment
- Integrated Makefile.offline.mk with convenient offline build commands
- Added detailed offline build documentation and a troubleshooting guide
- One-command packaging of base images, BuildKit caches, and external resources
2026-02-02 21:44:44 +08:00
b36fdd2438 feat(annotation): add data-type filtering to the label configuration tree editor
- Introduced the DataType enum
- Filter object label options dynamically by data type
- Watch the data type in the template form
- Improved error handling for better type safety
- Pass the data-type parameter through to the configuration tree editor component
2026-02-02 20:37:38 +08:00
daa63bdd13 feat(knowledge): remove the sensitivity-level feature from knowledge base management
- Commented out the sensitivity-level field in the create-knowledge-set form
- Removed the sensitivity-level item from the knowledge set detail page
- Commented out the related sensitivity-level option constants
- Adjusted the form layout to keep a consistent two-column grid
2026-02-02 19:06:03 +08:00
85433ac071 feat(template): remove template type and version fields and add admin-only controls
- Removed the type and version fields from the template detail page
- Removed the type and version columns from the template list page
- Added an admin check controlled by a localStorage key
- Restricted the edit and delete actions to admins only
- Restricted the create-template button to admins only
2026-02-02 18:59:32 +08:00
fc2e50b415 Revert "refactor(template): remove the type, version, and action columns from the template list"
This reverts commit a5261b33b2.
2026-02-02 18:39:52 +08:00
26e1ae69d7 Revert "refactor(template): remove the create button from the template list page"
This reverts commit b2bdf9e066.
2026-02-02 18:39:48 +08:00
7092c3f955 feat(annotation): adjust the text editor size limit configuration
- Changed the editor_max_text_bytes default from 2MB to 0, meaning no limit
- Updated the size check in the text-fetching service to apply only when max_bytes is greater than 0
- Updated the byte limit shown in the error message
- Cleaned up the conditional handling of the configuration parameter
2026-02-02 17:53:09 +08:00
b2bdf9e066 refactor(template): remove the create button from the template list page
- Deleted the create-template button component in the top-right corner
- Removed the corresponding click handler call
- Adjusted the page layout for the removed button
2026-02-02 16:35:09 +08:00
a5261b33b2 refactor(template): remove the type, version, and action columns from the template list
- Removed the type column (built-in/custom tag display)
- Removed the version column
- Removed the action column (view, edit, delete buttons)
- Kept the creation-time column and its rendering logic
2026-02-02 16:20:50 +08:00
root
52daf30869 a 2026-02-02 16:09:25 +08:00
07a901043a refactor(annotation): remove the text-content fetching logic
- Deleted the fetch_text_content_via_download_api import
- Removed the text-content fetching logic for TEXT datasets
- Deleted the _append_annotation_to_content method
- Simplified content handling in the knowledge sync service
2026-02-02 15:39:06 +08:00
32e3fc97c6 feat(annotation): add project isolation to the knowledge base sync service
- Validate the project ID when looking up a knowledge base, ensuring the base belongs to the project
- Include the project ID in log messages to ease debugging
- Reworked the knowledge base lookup from name-only to name plus project ID
- Added the _metadata_matches_project method to verify project ownership in metadata
- Added the _parse_metadata method to safely parse the metadata JSON string
- Updated the fallback naming logic to keep names unique per project
- All knowledge base operations now validate against both the project name and the project ID
2026-02-02 15:28:33 +08:00
a73571bd73 feat(annotation): improve attribute filling in the template configuration tree editor
- Fill object configuration attributes with defaults only when the name is not already set
- Added a tag-category check for control configurations
- Distinguish the attribute-filling strategy for annotation controls and layout controls
- Annotation controls always get their required attributes; layout controls only when needed
- Fixed the attribute assignment so the name attribute is referenced correctly
2026-02-02 15:26:25 +08:00
00fa1b86eb refactor(DataAnnotation): remove unused state and simplify the selector logic
- Deleted the unused addChildTag and addSiblingTag state variables
- Set the Select component value to null to reset the selection state
- Simplified the handleAddNode call handling
- Removed state-management code that is no longer needed, for better performance
2026-02-02 15:23:01 +08:00
626c0fcd9a fix(data-annotation): fix the progress calculation for data annotation tasks
- Added the toSafeCount helper to handle counts safely
- Support both the totalCount and total_count fields
-
2026-02-01 23:42:06 +08:00
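A plausible shape for the toSafeCount helper mentioned in the commit above; this is an assumption, since the actual implementation does not appear in the diffs below.

```ts
// coerce either naming variant to a non-negative finite number, falling back to 0
function toSafeCount(value: unknown): number {
  const num = Number(value);
  return Number.isFinite(num) && num >= 0 ? num : 0;
}

// usage: accept either field name from the API response
function resolveTotal(task: { totalCount?: unknown; total_count?: unknown }): number {
  return toSafeCount(task.totalCount ?? task.total_count);
}
```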
59 changed files with 4922 additions and 1216 deletions

304
Makefile.offline.mk Normal file
View File

@@ -0,0 +1,304 @@
# ============================================================================
# Makefile 离线构建扩展
# 将此文件内容追加到主 Makefile 末尾,或单独包含使用
# ============================================================================
# 离线构建配置
CACHE_DIR ?= ./build-cache
OFFLINE_VERSION ?= latest
# 创建 buildx 构建器(如果不存在)
.PHONY: ensure-buildx
ensure-buildx:
@if ! docker buildx inspect offline-builder > /dev/null 2>&1; then \
echo "创建 buildx 构建器..."; \
docker buildx create --name offline-builder --driver docker-container --use 2>/dev/null || docker buildx use offline-builder; \
else \
docker buildx use offline-builder 2>/dev/null || true; \
fi
# ========== 离线缓存导出(有网环境) ==========
.PHONY: offline-export
offline-export: ensure-buildx
@echo "======================================"
@echo "导出离线构建缓存..."
@echo "======================================"
@mkdir -p $(CACHE_DIR)/buildkit $(CACHE_DIR)/images $(CACHE_DIR)/resources
@$(MAKE) _offline-export-base-images
@$(MAKE) _offline-export-cache
@$(MAKE) _offline-export-resources
@$(MAKE) _offline-package
.PHONY: _offline-export-base-images
_offline-export-base-images:
@echo ""
@echo "1. 导出基础镜像..."
@bash -c 'images=( \
"maven:3-eclipse-temurin-21" \
"maven:3-eclipse-temurin-8" \
"eclipse-temurin:21-jdk" \
"mysql:8" \
"node:20-alpine" \
"nginx:1.29" \
"ghcr.nju.edu.cn/astral-sh/uv:python3.11-bookworm" \
"ghcr.nju.edu.cn/astral-sh/uv:python3.12-bookworm" \
"ghcr.nju.edu.cn/astral-sh/uv:latest" \
"python:3.12-slim" \
"python:3.11-slim" \
"gcr.nju.edu.cn/distroless/nodejs20-debian12" \
); for img in "$${images[@]}"; do echo " Pulling $$img..."; docker pull "$$img" 2>/dev/null || true; done'
@echo " Saving base images..."
@docker save -o $(CACHE_DIR)/images/base-images.tar \
maven:3-eclipse-temurin-21 \
maven:3-eclipse-temurin-8 \
eclipse-temurin:21-jdk \
mysql:8 \
node:20-alpine \
nginx:1.29 \
ghcr.nju.edu.cn/astral-sh/uv:python3.11-bookworm \
ghcr.nju.edu.cn/astral-sh/uv:python3.12-bookworm \
ghcr.nju.edu.cn/astral-sh/uv:latest \
python:3.12-slim \
python:3.11-slim \
gcr.nju.edu.cn/distroless/nodejs20-debian12 2>/dev/null || echo " Warning: Some images may not exist"
.PHONY: _offline-export-cache
_offline-export-cache:
@echo ""
@echo "2. 导出 BuildKit 缓存..."
@echo " backend..."
@docker buildx build --cache-to type=local,dest=$(CACHE_DIR)/buildkit/backend-cache,mode=max -f scripts/images/backend/Dockerfile -t datamate-backend:cache . 2>/dev/null || echo " Warning: backend cache export failed"
@echo " backend-python..."
@docker buildx build --cache-to type=local,dest=$(CACHE_DIR)/buildkit/backend-python-cache,mode=max -f scripts/images/backend-python/Dockerfile -t datamate-backend-python:cache . 2>/dev/null || echo " Warning: backend-python cache export failed"
@echo " database..."
@docker buildx build --cache-to type=local,dest=$(CACHE_DIR)/buildkit/database-cache,mode=max -f scripts/images/database/Dockerfile -t datamate-database:cache . 2>/dev/null || echo " Warning: database cache export failed"
@echo " frontend..."
@docker buildx build --cache-to type=local,dest=$(CACHE_DIR)/buildkit/frontend-cache,mode=max -f scripts/images/frontend/Dockerfile -t datamate-frontend:cache . 2>/dev/null || echo " Warning: frontend cache export failed"
@echo " gateway..."
@docker buildx build --cache-to type=local,dest=$(CACHE_DIR)/buildkit/gateway-cache,mode=max -f scripts/images/gateway/Dockerfile -t datamate-gateway:cache . 2>/dev/null || echo " Warning: gateway cache export failed"
@echo " runtime..."
@docker buildx build --cache-to type=local,dest=$(CACHE_DIR)/buildkit/runtime-cache,mode=max -f scripts/images/runtime/Dockerfile -t datamate-runtime:cache . 2>/dev/null || echo " Warning: runtime cache export failed"
@echo " deer-flow-backend..."
@docker buildx build --cache-to type=local,dest=$(CACHE_DIR)/buildkit/deer-flow-backend-cache,mode=max -f scripts/images/deer-flow-backend/Dockerfile -t deer-flow-backend:cache . 2>/dev/null || echo " Warning: deer-flow-backend cache export failed"
@echo " deer-flow-frontend..."
@docker buildx build --cache-to type=local,dest=$(CACHE_DIR)/buildkit/deer-flow-frontend-cache,mode=max -f scripts/images/deer-flow-frontend/Dockerfile -t deer-flow-frontend:cache . 2>/dev/null || echo " Warning: deer-flow-frontend cache export failed"
@echo " mineru..."
@docker buildx build --cache-to type=local,dest=$(CACHE_DIR)/buildkit/mineru-cache,mode=max -f scripts/images/mineru/Dockerfile -t datamate-mineru:cache . 2>/dev/null || echo " Warning: mineru cache export failed"
.PHONY: _offline-export-resources
_offline-export-resources:
@echo ""
@echo "3. 预下载外部资源..."
@mkdir -p $(CACHE_DIR)/resources/models
@echo " PaddleOCR model..."
@wget -q -O $(CACHE_DIR)/resources/models/ch_ppocr_mobile_v2.0_cls_infer.tar \
https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar 2>/dev/null || echo " Warning: PaddleOCR model download failed"
@echo " spaCy model..."
@wget -q -O $(CACHE_DIR)/resources/models/zh_core_web_sm-3.8.0-py3-none-any.whl \
https://ghproxy.net/https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-3.8.0/zh_core_web_sm-3.8.0-py3-none-any.whl 2>/dev/null || echo " Warning: spaCy model download failed"
@echo " DataX source..."
@if [ ! -d "$(CACHE_DIR)/resources/DataX" ]; then \
git clone --depth 1 https://gitee.com/alibaba/DataX.git $(CACHE_DIR)/resources/DataX 2>/dev/null || echo " Warning: DataX clone failed"; \
fi
@echo " deer-flow source..."
@if [ ! -d "$(CACHE_DIR)/resources/deer-flow" ]; then \
git clone --depth 1 https://ghproxy.net/https://github.com/ModelEngine-Group/deer-flow.git $(CACHE_DIR)/resources/deer-flow 2>/dev/null || echo " Warning: deer-flow clone failed"; \
fi
.PHONY: _offline-package
_offline-package:
@echo ""
@echo "4. 打包缓存..."
@cd $(CACHE_DIR) && tar -czf "build-cache-$$(date +%Y%m%d).tar.gz" buildkit images resources 2>/dev/null && cd - > /dev/null
@echo ""
@echo "======================================"
@echo "✓ 缓存导出完成!"
@echo "======================================"
@echo "传输文件: $(CACHE_DIR)/build-cache-$$(date +%Y%m%d).tar.gz"
# ========== 离线构建(无网环境) ==========
.PHONY: offline-setup
offline-setup:
@echo "======================================"
@echo "设置离线构建环境..."
@echo "======================================"
@if [ ! -d "$(CACHE_DIR)" ]; then \
echo "查找并解压缓存包..."; \
cache_file=$$(ls -t build-cache-*.tar.gz 2>/dev/null | head -1); \
if [ -z "$$cache_file" ]; then \
echo "错误: 未找到缓存压缩包 (build-cache-*.tar.gz)"; \
exit 1; \
fi; \
echo "解压 $$cache_file..."; \
tar -xzf "$$cache_file"; \
else \
echo "缓存目录已存在: $(CACHE_DIR)"; \
fi
@echo ""
@echo "加载基础镜像..."
@if [ -f "$(CACHE_DIR)/images/base-images.tar" ]; then \
docker load -i $(CACHE_DIR)/images/base-images.tar; \
else \
echo "警告: 基础镜像文件不存在,假设已手动加载"; \
fi
@$(MAKE) ensure-buildx
@echo ""
@echo "✓ 离线环境准备完成"
.PHONY: offline-build
offline-build: offline-setup
@echo ""
@echo "======================================"
@echo "开始离线构建..."
@echo "======================================"
@$(MAKE) _offline-build-services
.PHONY: _offline-build-services
_offline-build-services: ensure-buildx
@echo ""
@echo "构建 datamate-database..."
@docker buildx build \
--cache-from type=local,src=$(CACHE_DIR)/buildkit/database-cache \
--pull=false \
-f scripts/images/database/Dockerfile \
-t datamate-database:$(OFFLINE_VERSION) \
--load . || echo " Failed"
@echo ""
@echo "构建 datamate-gateway..."
@docker buildx build \
--cache-from type=local,src=$(CACHE_DIR)/buildkit/gateway-cache \
--pull=false \
-f scripts/images/gateway/Dockerfile \
-t datamate-gateway:$(OFFLINE_VERSION) \
--load . || echo " Failed"
@echo ""
@echo "构建 datamate-backend..."
@docker buildx build \
--cache-from type=local,src=$(CACHE_DIR)/buildkit/backend-cache \
--pull=false \
-f scripts/images/backend/Dockerfile \
-t datamate-backend:$(OFFLINE_VERSION) \
--load . || echo " Failed"
@echo ""
@echo "构建 datamate-frontend..."
@docker buildx build \
--cache-from type=local,src=$(CACHE_DIR)/buildkit/frontend-cache \
--pull=false \
-f scripts/images/frontend/Dockerfile \
-t datamate-frontend:$(OFFLINE_VERSION) \
--load . || echo " Failed"
@echo ""
@echo "构建 datamate-runtime..."
@docker buildx build \
--cache-from type=local,src=$(CACHE_DIR)/buildkit/runtime-cache \
--pull=false \
--build-arg RESOURCES_DIR=$(CACHE_DIR)/resources \
-f scripts/images/runtime/Dockerfile \
-t datamate-runtime:$(OFFLINE_VERSION) \
--load . || echo " Failed"
@echo ""
@echo "构建 datamate-backend-python..."
@docker buildx build \
--cache-from type=local,src=$(CACHE_DIR)/buildkit/backend-python-cache \
--pull=false \
--build-arg RESOURCES_DIR=$(CACHE_DIR)/resources \
-f scripts/images/backend-python/Dockerfile \
-t datamate-backend-python:$(OFFLINE_VERSION) \
--load . || echo " Failed"
@echo ""
@echo "======================================"
@echo "✓ 离线构建完成"
@echo "======================================"
# 单个服务离线构建 (BuildKit)
.PHONY: %-offline-build
%-offline-build: offline-setup ensure-buildx
@echo "离线构建 $*..."
@if [ ! -d "$(CACHE_DIR)/buildkit/$*-cache" ]; then \
echo "错误: $* 的缓存不存在"; \
exit 1; \
fi
@$(eval IMAGE_NAME := $(if $(filter deer-flow%,$*),$*,datamate-$*))
@docker buildx build \
--cache-from type=local,src=$(CACHE_DIR)/buildkit/$*-cache \
--pull=false \
$(if $(filter runtime backend-python deer-flow%,$*),--build-arg RESOURCES_DIR=$(CACHE_DIR)/resources,) \
-f scripts/images/$*/Dockerfile \
-t $(IMAGE_NAME):$(OFFLINE_VERSION) \
--load .
# 传统 Docker 构建(不使用 BuildKit,更稳定)
.PHONY: offline-build-classic
offline-build-classic: offline-setup
@echo "使用传统 docker build 进行离线构建..."
@bash scripts/offline/build-offline-classic.sh $(CACHE_DIR) $(OFFLINE_VERSION)
# 诊断离线环境
.PHONY: offline-diagnose
offline-diagnose:
@bash scripts/offline/diagnose.sh $(CACHE_DIR)
# 构建 APT 预装基础镜像(有网环境)
.PHONY: offline-build-base-images
offline-build-base-images:
@echo "构建 APT 预装基础镜像..."
@bash scripts/offline/build-base-images.sh $(CACHE_DIR)
# 使用预装基础镜像进行离线构建(推荐)
.PHONY: offline-build-final
offline-build-final: offline-setup
@echo "使用预装 APT 包的基础镜像进行离线构建..."
@bash scripts/offline/build-offline-final.sh $(CACHE_DIR) $(OFFLINE_VERSION)
# 完整离线导出(包含 APT 预装基础镜像)
.PHONY: offline-export-full
offline-export-full:
@echo "======================================"
@echo "完整离线缓存导出(含 APT 预装基础镜像)"
@echo "======================================"
@$(MAKE) offline-build-base-images
@$(MAKE) offline-export
@echo ""
@echo "导出完成!传输时请包含以下文件:"
@echo " - build-cache/images/base-images-with-apt.tar"
@echo " - build-cache-YYYYMMDD.tar.gz"
# ========== 帮助 ==========
.PHONY: help-offline
help-offline:
@echo "离线构建命令:"
@echo ""
@echo "【有网环境】"
@echo " make offline-export [CACHE_DIR=./build-cache] - 导出构建缓存"
@echo " make offline-export-full - 导出完整缓存(含 APT 预装基础镜像)"
@echo " make offline-build-base-images - 构建 APT 预装基础镜像"
@echo ""
@echo "【无网环境】"
@echo " make offline-setup [CACHE_DIR=./build-cache] - 解压并准备离线缓存"
@echo " make offline-build-final - 使用预装基础镜像构建(推荐,解决 APT 问题)"
@echo " make offline-build-classic - 使用传统 docker build"
@echo " make offline-build - 使用 BuildKit 构建"
@echo " make offline-diagnose - 诊断离线构建环境"
@echo " make <service>-offline-build - 离线构建单个服务"
@echo ""
@echo "【完整工作流程(推荐)】"
@echo " # 1. 有网环境导出完整缓存"
@echo " make offline-export-full"
@echo ""
@echo " # 2. 传输到无网环境(需要传输两个文件)"
@echo " scp build-cache/images/base-images-with-apt.tar user@offline-server:/path/"
@echo " scp build-cache-*.tar.gz user@offline-server:/path/"
@echo ""
@echo " # 3. 无网环境构建"
@echo " tar -xzf build-cache-*.tar.gz"
@echo " docker load -i build-cache/images/base-images-with-apt.tar"
@echo " make offline-build-final"

View File

@@ -470,6 +470,23 @@ paths:
'200':
description: 上传成功
/data-management/datasets/upload/cancel-upload/{reqId}:
put:
tags: [ DatasetFile ]
operationId: cancelUpload
summary: 取消上传
description: 取消预上传请求并清理临时分片
parameters:
- name: reqId
in: path
required: true
schema:
type: string
description: 预上传请求ID
responses:
'200':
description: 取消成功
/data-management/dataset-types:
get:
operationId: getDatasetTypes

View File

@@ -1,5 +1,6 @@
package com.datamate.datamanagement.application;
import com.baomidou.mybatisplus.core.conditions.update.LambdaUpdateWrapper;
import com.baomidou.mybatisplus.core.metadata.IPage;
import com.baomidou.mybatisplus.extension.plugins.pagination.Page;
import com.datamate.common.domain.utils.ChunksSaver;
@@ -101,6 +102,7 @@ public class DatasetApplicationService {
public Dataset updateDataset(String datasetId, UpdateDatasetRequest updateDatasetRequest) {
Dataset dataset = datasetRepository.getById(datasetId);
BusinessAssert.notNull(dataset, DataManagementErrorCode.DATASET_NOT_FOUND);
if (StringUtils.hasText(updateDatasetRequest.getName())) {
dataset.setName(updateDatasetRequest.getName());
}
@@ -113,13 +115,31 @@ public class DatasetApplicationService {
if (Objects.nonNull(updateDatasetRequest.getStatus())) {
dataset.setStatus(updateDatasetRequest.getStatus());
}
if (updateDatasetRequest.getParentDatasetId() != null) {
if (updateDatasetRequest.isParentDatasetIdProvided()) {
// 保存原始的 parentDatasetId 值,用于比较是否发生了变化
String originalParentDatasetId = dataset.getParentDatasetId();
// 处理父数据集变更:仅当请求显式包含 parentDatasetId 时处理
// handleParentChange 内部通过 normalizeParentId 方法将空字符串和 null 都转换为 null
// 这样既支持设置新的父数据集,也支持清除关联
handleParentChange(dataset, updateDatasetRequest.getParentDatasetId());
// 检查 parentDatasetId 是否发生了变化
if (!Objects.equals(originalParentDatasetId, dataset.getParentDatasetId())) {
// 使用 LambdaUpdateWrapper 显式地更新 parentDatasetId 字段
// 这样即使值为 null 也能被正确更新到数据库
datasetRepository.update(null, new LambdaUpdateWrapper<Dataset>()
.eq(Dataset::getId, datasetId)
.set(Dataset::getParentDatasetId, dataset.getParentDatasetId()));
}
}
if (StringUtils.hasText(updateDatasetRequest.getDataSource())) {
// 数据源id不为空,使用异步线程进行文件扫盘落库
processDataSourceAsync(dataset.getId(), updateDatasetRequest.getDataSource());
}
// 更新其他字段(不包括 parentDatasetId,因为它已经在上面的代码中更新了)
datasetRepository.updateById(dataset);
return dataset;
}

View File

@@ -505,6 +505,14 @@ public class DatasetFileApplicationService {
saveFileInfoToDb(uploadResult, datasetId);
}
/**
* 取消上传
*/
@Transactional
public void cancelUpload(String reqId) {
fileService.cancelUpload(reqId);
}
private void saveFileInfoToDb(FileUploadResult fileUploadResult, String datasetId) {
if (Objects.isNull(fileUploadResult.getSavedFile())) {
// 文件切片上传没有完成

View File

@@ -178,6 +178,9 @@ public class KnowledgeItemApplicationService {
if (request.getContentType() != null) {
knowledgeItem.setContentType(request.getContentType());
}
if (request.getMetadata() != null) {
knowledgeItem.setMetadata(request.getMetadata());
}
knowledgeItemRepository.updateById(knowledgeItem);
return knowledgeItem;

View File

@@ -34,4 +34,8 @@ public class CreateKnowledgeItemRequest {
* 来源文件ID(用于标注同步等场景)
*/
private String sourceFileId;
/**
* 扩展元数据
*/
private String metadata;
}

View File

@@ -1,8 +1,10 @@
package com.datamate.datamanagement.interfaces.dto;
import com.datamate.datamanagement.common.enums.DatasetStatusType;
import com.fasterxml.jackson.annotation.JsonIgnore;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Size;
import lombok.AccessLevel;
import lombok.Getter;
import lombok.Setter;
@@ -24,9 +26,18 @@ public class UpdateDatasetRequest {
/** 归集任务id */
private String dataSource;
/** 父数据集ID */
@Setter(AccessLevel.NONE)
private String parentDatasetId;
@JsonIgnore
@Setter(AccessLevel.NONE)
private boolean parentDatasetIdProvided;
/** 标签列表 */
private List<String> tags;
/** 数据集状态 */
private DatasetStatusType status;
public void setParentDatasetId(String parentDatasetId) {
this.parentDatasetIdProvided = true;
this.parentDatasetId = parentDatasetId;
}
}

View File

@@ -18,4 +18,8 @@ public class UpdateKnowledgeItemRequest {
* 内容类型
*/
private KnowledgeContentType contentType;
/**
* 扩展元数据
*/
private String metadata;
}

View File

@@ -0,0 +1,33 @@
package com.datamate.datamanagement.interfaces.rest;
import com.datamate.datamanagement.application.DatasetFileApplicationService;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PutMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
/**
* 数据集上传控制器
*/
@Slf4j
@RestController
@RequiredArgsConstructor
@RequestMapping("/data-management/datasets/upload")
public class DatasetUploadController {
private final DatasetFileApplicationService datasetFileApplicationService;
/**
* 取消上传
*
* @param reqId 预上传请求ID
*/
@PutMapping("/cancel-upload/{reqId}")
public ResponseEntity<Void> cancelUpload(@PathVariable("reqId") String reqId) {
datasetFileApplicationService.cancelUpload(reqId);
return ResponseEntity.ok().build();
}
}

View File

@@ -74,6 +74,26 @@ public class FileService {
.build();
}
/**
* 取消上传
*/
@Transactional
public void cancelUpload(String reqId) {
if (reqId == null || reqId.isBlank()) {
throw BusinessException.of(CommonErrorCode.PARAM_ERROR);
}
ChunkUploadPreRequest preRequest = chunkUploadRequestMapper.findById(reqId);
if (preRequest == null) {
return;
}
String uploadPath = preRequest.getUploadPath();
if (uploadPath != null && !uploadPath.isBlank()) {
File tempDir = new File(uploadPath, String.format(ChunksSaver.TEMP_DIR_NAME_FORMAT, preRequest.getId()));
ChunksSaver.deleteFolder(tempDir.getPath());
}
chunkUploadRequestMapper.deleteById(reqId);
}
private File uploadFile(ChunkUploadRequest fileUploadRequest, ChunkUploadPreRequest preRequest) {
File savedFile = ChunksSaver.saveFile(fileUploadRequest, preRequest);
preRequest.setTimeout(LocalDateTime.now().plusSeconds(DEFAULT_TIMEOUT));

View File

@@ -5,7 +5,7 @@ server {
access_log /var/log/datamate/frontend/access.log main;
error_log /var/log/datamate/frontend/error.log notice;
client_max_body_size 1024M;
client_max_body_size 0;
add_header Set-Cookie "NEXT_LOCALE=zh";

View File

@@ -1,17 +1,17 @@
import { Button, Input, Popover, theme, Tag, Empty } from "antd";
import { PlusOutlined } from "@ant-design/icons";
import { useEffect, useMemo, useState } from "react";
import { useCallback, useEffect, useMemo, useState } from "react";
interface Tag {
id: number;
id?: string | number;
name: string;
color: string;
color?: string;
}
interface AddTagPopoverProps {
tags: Tag[];
onFetchTags?: () => Promise<Tag[]>;
onAddTag?: (tag: Tag) => void;
onAddTag?: (tagName: string) => void;
onCreateAndTag?: (tagName: string) => void;
}
@@ -27,20 +27,23 @@ export default function AddTagPopover({
const [newTag, setNewTag] = useState("");
const [allTags, setAllTags] = useState<Tag[]>([]);
const tagsSet = useMemo(() => new Set(tags.map((tag) => tag.id)), [tags]);
const tagsSet = useMemo(
() => new Set(tags.map((tag) => (tag.id ?? tag.name))),
[tags]
);
const fetchTags = async () => {
const fetchTags = useCallback(async () => {
if (onFetchTags && showPopover) {
const data = await onFetchTags?.();
setAllTags(data || []);
}
};
}, [onFetchTags, showPopover]);
useEffect(() => {
fetchTags();
}, [showPopover]);
}, [fetchTags]);
const availableTags = useMemo(() => {
return allTags.filter((tag) => !tagsSet.has(tag.id));
return allTags.filter((tag) => !tagsSet.has(tag.id ?? tag.name));
}, [allTags, tagsSet]);
const handleCreateAndAddTag = () => {

View File

@@ -24,21 +24,28 @@ interface OperationItem {
interface TagConfig {
showAdd: boolean;
tags: { id: number; name: string; color: string }[];
onFetchTags?: () => Promise<{
data: { id: number; name: string; color: string }[];
}>;
onAddTag?: (tag: { id: number; name: string; color: string }) => void;
tags: { id?: string | number; name: string; color?: string }[];
onFetchTags?: () => Promise<{ id?: string | number; name: string; color?: string }[]>;
onAddTag?: (tagName: string) => void;
onCreateAndTag?: (tagName: string) => void;
}
interface DetailHeaderProps<T> {
interface DetailHeaderData {
name?: string;
description?: string;
status?: { color?: string; icon?: React.ReactNode; label?: string };
tags?: { id?: string | number; name?: string }[];
icon?: React.ReactNode;
iconColor?: string;
}
interface DetailHeaderProps<T extends DetailHeaderData> {
data: T;
statistics: StatisticItem[];
operations: OperationItem[];
tagConfig?: TagConfig;
}
function DetailHeader<T>({
function DetailHeader<T extends DetailHeaderData>({
data = {} as T,
statistics,
operations,
@@ -50,13 +57,13 @@ function DetailHeader<T>({
<div className="flex items-start gap-4 flex-1">
<div
className={`w-16 h-16 text-white rounded-lg flex-center shadow-lg ${
(data as any)?.iconColor
data?.iconColor
? ""
: "bg-gradient-to-br from-sky-300 to-blue-500 text-white"
}`}
style={(data as any)?.iconColor ? { backgroundColor: (data as any).iconColor } : undefined}
style={data?.iconColor ? { backgroundColor: data.iconColor } : undefined}
>
{<div className="w-[2.8rem] h-[2.8rem] text-gray-50">{(data as any)?.icon}</div> || (
{<div className="w-[2.8rem] h-[2.8rem] text-gray-50">{data?.icon}</div> || (
<Database className="w-8 h-8 text-white" />
)}
</div>

View File

@@ -1,5 +1,5 @@
import { TaskItem } from "@/pages/DataManagement/dataset.model";
import { calculateSHA256, checkIsFilesExist } from "@/utils/file.util";
import { calculateSHA256, checkIsFilesExist, streamSplitAndUpload, StreamUploadResult } from "@/utils/file.util";
import { App } from "antd";
import { useRef, useState } from "react";
@@ -9,17 +9,18 @@ export function useFileSliceUpload(
uploadChunk,
cancelUpload,
}: {
preUpload: (id: string, params: any) => Promise<{ data: number }>;
uploadChunk: (id: string, formData: FormData, config: any) => Promise<any>;
cancelUpload: ((reqId: number) => Promise<any>) | null;
preUpload: (id: string, params: Record<string, unknown>) => Promise<{ data: number }>;
uploadChunk: (id: string, formData: FormData, config: Record<string, unknown>) => Promise<unknown>;
cancelUpload: ((reqId: number) => Promise<unknown>) | null;
},
showTaskCenter = true // 上传时是否显示任务中心
showTaskCenter = true, // 上传时是否显示任务中心
enableStreamUpload = true // 是否启用流式分割上传
) {
const { message } = App.useApp();
const [taskList, setTaskList] = useState<TaskItem[]>([]);
const taskListRef = useRef<TaskItem[]>([]); // 用于固定任务顺序
const createTask = (detail: any = {}) => {
const createTask = (detail: Record<string, unknown> = {}) => {
const { dataset } = detail;
const title = `上传数据集: ${dataset.name} `;
const controller = new AbortController();
@@ -37,6 +38,14 @@ export function useFileSliceUpload(
taskListRef.current = [task, ...taskListRef.current];
setTaskList(taskListRef.current);
// 立即显示任务中心,让用户感知上传已开始
if (showTaskCenter) {
window.dispatchEvent(
new CustomEvent("show:task-popover", { detail: { show: true } })
);
}
return task;
};
@@ -60,7 +69,7 @@ export function useFileSliceUpload(
// 携带前缀信息,便于刷新后仍停留在当前目录
window.dispatchEvent(
new CustomEvent(task.updateEvent, {
detail: { prefix: (task as any).prefix },
detail: { prefix: task.prefix },
})
);
}
@@ -71,7 +80,7 @@ export function useFileSliceUpload(
}
};
async function buildFormData({ file, reqId, i, j }) {
async function buildFormData({ file, reqId, i, j }: { file: { slices: Blob[]; name: string; size: number }; reqId: number; i: number; j: number }) {
const formData = new FormData();
const { slices, name, size } = file;
const checkSum = await calculateSHA256(slices[j]);
@@ -86,12 +95,18 @@ export function useFileSliceUpload(
return formData;
}
async function uploadSlice(task: TaskItem, fileInfo) {
async function uploadSlice(task: TaskItem, fileInfo: { loaded: number; i: number; j: number; files: { slices: Blob[]; name: string; size: number }[]; totalSize: number }) {
if (!task) {
return;
}
const { reqId, key } = task;
const { reqId, key, controller } = task;
const { loaded, i, j, files, totalSize } = fileInfo;
// 检查是否已取消
if (controller.signal.aborted) {
throw new Error("Upload cancelled");
}
const formData = await buildFormData({
file: files[i],
i,
@@ -101,6 +116,7 @@ export function useFileSliceUpload(
let newTask = { ...task };
await uploadChunk(key, formData, {
signal: controller.signal,
onUploadProgress: (e) => {
const loadedSize = loaded + e.loaded;
const curPercent = Number((loadedSize / totalSize) * 100).toFixed(2);
@@ -116,7 +132,7 @@ export function useFileSliceUpload(
});
}
async function uploadFile({ task, files, totalSize }) {
async function uploadFile({ task, files, totalSize }: { task: TaskItem; files: { slices: Blob[]; name: string; size: number; originFile: Blob }[]; totalSize: number }) {
console.log('[useSliceUpload] Calling preUpload with prefix:', task.prefix);
const { data: reqId } = await preUpload(task.key, {
totalFileNum: files.length,
@@ -132,24 +148,29 @@ export function useFileSliceUpload(
reqId,
isCancel: false,
cancelFn: () => {
task.controller.abort();
// 使用 newTask 的 controller 确保一致性
newTask.controller.abort();
cancelUpload?.(reqId);
if (task.updateEvent) window.dispatchEvent(new Event(task.updateEvent));
if (newTask.updateEvent) window.dispatchEvent(new Event(newTask.updateEvent));
},
};
updateTaskList(newTask);
if (showTaskCenter) {
window.dispatchEvent(
new CustomEvent("show:task-popover", { detail: { show: true } })
);
}
// 注意:show:task-popover 事件已在 createTask 中触发,此处不再重复触发
// // 更新数据状态
if (task.updateEvent) window.dispatchEvent(new Event(task.updateEvent));
let loaded = 0;
for (let i = 0; i < files.length; i++) {
// 检查是否已取消
if (newTask.controller.signal.aborted) {
throw new Error("Upload cancelled");
}
const { slices } = files[i];
for (let j = 0; j < slices.length; j++) {
// 检查是否已取消
if (newTask.controller.signal.aborted) {
throw new Error("Upload cancelled");
}
await uploadSlice(newTask, {
loaded,
i,
@@ -163,7 +184,7 @@ export function useFileSliceUpload(
removeTask(newTask);
}
const handleUpload = async ({ task, files }) => {
const handleUpload = async ({ task, files }: { task: TaskItem; files: { slices: Blob[]; name: string; size: number; originFile: Blob }[] }) => {
const isErrorFile = await checkIsFilesExist(files);
if (isErrorFile) {
message.error("文件被修改或删除,请重新选择文件上传");
@@ -189,10 +210,174 @@ export function useFileSliceUpload(
}
};
/**
* 流式分割上传处理
* 用于大文件按行分割并立即上传的场景
*/
const handleStreamUpload = async ({ task, files }: { task: TaskItem; files: File[] }) => {
try {
console.log('[useSliceUpload] Starting stream upload for', files.length, 'files');
const totalSize = files.reduce((acc, file) => acc + file.size, 0);
// 存储所有文件的 reqId,用于取消上传
const reqIds: number[] = [];
const newTask: TaskItem = {
...task,
reqId: -1,
isCancel: false,
cancelFn: () => {
// 使用 newTask 的 controller 确保一致性
newTask.controller.abort();
// 取消所有文件的预上传请求
reqIds.forEach(id => cancelUpload?.(id));
if (newTask.updateEvent) window.dispatchEvent(new Event(newTask.updateEvent));
},
};
updateTaskList(newTask);
let totalUploadedLines = 0;
let totalProcessedBytes = 0;
const results: StreamUploadResult[] = [];
// 逐个处理文件,每个文件单独调用 preUpload
for (let i = 0; i < files.length; i++) {
// 检查是否已取消
if (newTask.controller.signal.aborted) {
throw new Error("Upload cancelled");
}
const file = files[i];
console.log(`[useSliceUpload] Processing file ${i + 1}/${files.length}: ${file.name}`);
const result = await streamSplitAndUpload(
file,
(formData, config) => uploadChunk(task.key, formData, {
...config,
signal: newTask.controller.signal,
}),
(currentBytes, totalBytes, uploadedLines) => {
// 检查是否已取消
if (newTask.controller.signal.aborted) {
return;
}
// 更新进度
const overallBytes = totalProcessedBytes + currentBytes;
const curPercent = Number((overallBytes / totalSize) * 100).toFixed(2);
const updatedTask: TaskItem = {
...newTask,
...taskListRef.current.find((item) => item.key === task.key),
size: overallBytes,
percent: curPercent >= 100 ? 99.99 : curPercent,
streamUploadInfo: {
currentFile: file.name,
fileIndex: i + 1,
totalFiles: files.length,
uploadedLines: totalUploadedLines + uploadedLines,
},
};
updateTaskList(updatedTask);
},
1024 * 1024, // 1MB chunk size
{
resolveReqId: async ({ totalFileNum, totalSize }) => {
const { data: reqId } = await preUpload(task.key, {
totalFileNum,
totalSize,
datasetId: task.key,
hasArchive: task.hasArchive,
prefix: task.prefix,
});
console.log(`[useSliceUpload] File ${file.name} preUpload response reqId:`, reqId);
reqIds.push(reqId);
return reqId;
},
hasArchive: newTask.hasArchive,
prefix: newTask.prefix,
signal: newTask.controller.signal,
maxConcurrency: 3,
}
);
results.push(result);
totalUploadedLines += result.uploadedCount;
totalProcessedBytes += file.size;
console.log(`[useSliceUpload] File ${file.name} processed, uploaded ${result.uploadedCount} lines`);
}
console.log('[useSliceUpload] Stream upload completed, total lines:', totalUploadedLines);
removeTask(newTask);
message.success(`成功上传 ${totalUploadedLines} 个文件(按行分割)`);
} catch (err) {
console.error('[useSliceUpload] Stream upload error:', err);
if (err.message === "Upload cancelled") {
message.info("上传已取消");
} else {
message.error("文件上传失败,请稍后重试");
}
removeTask({
...task,
isCancel: true,
...taskListRef.current.find((item) => item.key === task.key),
});
}
};
/**
* 注册流式上传事件监听
* 返回注销函数
*/
const registerStreamUploadListener = () => {
if (!enableStreamUpload) return () => {};
const streamUploadHandler = async (e: Event) => {
const customEvent = e as CustomEvent;
const { dataset, files, updateEvent, hasArchive, prefix } = customEvent.detail;
const controller = new AbortController();
const task: TaskItem = {
key: dataset.id,
title: `上传数据集: ${dataset.name} (按行分割)`,
percent: 0,
reqId: -1,
controller,
size: 0,
updateEvent,
hasArchive,
prefix,
};
taskListRef.current = [task, ...taskListRef.current];
setTaskList(taskListRef.current);
// 显示任务中心
if (showTaskCenter) {
window.dispatchEvent(
new CustomEvent("show:task-popover", { detail: { show: true } })
);
}
await handleStreamUpload({ task, files });
};
window.addEventListener("upload:dataset-stream", streamUploadHandler);
return () => {
window.removeEventListener("upload:dataset-stream", streamUploadHandler);
};
};
return {
taskList,
createTask,
removeTask,
handleUpload,
handleStreamUpload,
registerStreamUploadListener,
};
}

View File

@@ -3,7 +3,9 @@
* 通过 iframe 加载外部页面
*/
export default function ContentGenerationPage() {
const iframeUrl = "http://192.168.0.8:3000";
const iframeUrl = "/api#/meeting";
window.localStorage.setItem("geeker-user", '{"token":"123","userInfo":{"name":"xteam"},"loginFrom":null,"loginData":null}');
return (
<div className="h-full w-full flex flex-col">
@@ -16,6 +18,11 @@ export default function ContentGenerationPage() {
className="w-full h-full border-0"
title="内容生成"
sandbox="allow-same-origin allow-scripts allow-popups allow-forms allow-downloads"
style={{marginLeft: "-220px",
marginTop: "-66px",
width: "calc(100% + 233px)",
height: "calc(100% + 108px)"
}}
/>
</div>
</div>

View File

@@ -1,6 +1,6 @@
import { useCallback, useEffect, useMemo, useRef, useState } from "react";
import { App, Button, Card, List, Spin, Typography, Tag, Switch, Tree, Empty } from "antd";
import { LeftOutlined, ReloadOutlined, SaveOutlined, MenuFoldOutlined, MenuUnfoldOutlined, CheckOutlined } from "@ant-design/icons";
import { App, Button, Card, List, Spin, Typography, Tag, Empty } from "antd";
import { LeftOutlined, ReloadOutlined, SaveOutlined, MenuFoldOutlined, MenuUnfoldOutlined } from "@ant-design/icons";
import { useNavigate, useParams } from "react-router";
import {
@@ -28,7 +28,6 @@ type EditorTaskListItem = {
hasAnnotation: boolean;
annotationUpdatedAt?: string | null;
annotationStatus?: AnnotationResultStatus | null;
segmentStats?: SegmentStats;
};
type LsfMessage = {
@@ -36,21 +35,6 @@ type LsfMessage = {
payload?: unknown;
};
type SegmentInfo = {
idx: number;
text: string;
start: number;
end: number;
hasAnnotation: boolean;
lineIndex: number;
chunkIndex: number;
};
type SegmentStats = {
done: number;
total: number;
};
type ApiResponse<T> = {
code?: number;
message?: string;
@@ -66,10 +50,11 @@ type EditorTaskPayload = {
type EditorTaskResponse = {
task?: EditorTaskPayload;
segmented?: boolean;
segments?: SegmentInfo[];
totalSegments?: number;
currentSegmentIndex?: number;
};
type EditorTaskListResponse = {
content?: EditorTaskListItem[];
totalElements?: number;
@@ -91,8 +76,6 @@ type ExportPayload = {
requestId?: string | null;
};
type SwitchDecision = "save" | "discard" | "cancel";
const LSF_IFRAME_SRC = "/lsf/lsf.html";
const TASK_PAGE_START = 0;
const TASK_PAGE_SIZE = 200;
@@ -154,16 +137,6 @@ const isAnnotationResultEmpty = (annotation?: Record<string, unknown>) => {
};
const resolveTaskStatusMeta = (item: EditorTaskListItem) => {
const segmentSummary = resolveSegmentSummary(item);
if (segmentSummary) {
if (segmentSummary.done >= segmentSummary.total) {
return { text: "已标注", type: "success" as const };
}
if (segmentSummary.done > 0) {
return { text: "标注中", type: "warning" as const };
}
return { text: "未标注", type: "secondary" as const };
}
if (!item.hasAnnotation) {
return { text: "未标注", type: "secondary" as const };
}
@@ -216,25 +189,6 @@ const buildAnnotationSnapshot = (annotation?: Record<string, unknown>) => {
const buildSnapshotKey = (fileId: string, segmentIndex?: number) =>
`${fileId}::${segmentIndex ?? "full"}`;
const buildSegmentStats = (segmentList?: SegmentInfo[] | null): SegmentStats | null => {
if (!Array.isArray(segmentList) || segmentList.length === 0) return null;
const total = segmentList.length;
const done = segmentList.reduce((count, seg) => count + (seg.hasAnnotation ? 1 : 0), 0);
return { done, total };
};
const normalizeSegmentStats = (stats?: SegmentStats | null): SegmentStats | null => {
if (!stats) return null;
const total = Number(stats.total);
const done = Number(stats.done);
if (!Number.isFinite(total) || total <= 0) return null;
const safeDone = Math.min(Math.max(done, 0), total);
return { done: safeDone, total };
};
const resolveSegmentSummary = (item: EditorTaskListItem) =>
normalizeSegmentStats(item.segmentStats);
const mergeTaskItems = (base: EditorTaskListItem[], next: EditorTaskListItem[]) => {
if (next.length === 0) return base;
const seen = new Set(base.map((item) => item.fileId));
@@ -282,18 +236,13 @@ export default function LabelStudioTextEditor() {
resolve: (payload?: ExportPayload) => void;
timer?: number;
} | null>(null);
const exportCheckSeqRef = useRef(0);
const savedSnapshotsRef = useRef<Record<string, string>>({});
const pendingAutoAdvanceRef = useRef(false);
const segmentStatsCacheRef = useRef<Record<string, SegmentStats>>({});
const segmentStatsSeqRef = useRef(0);
const segmentStatsLoadingRef = useRef<Set<string>>(new Set());
const [loadingProject, setLoadingProject] = useState(true);
const [loadingTasks, setLoadingTasks] = useState(false);
const [loadingTaskDetail, setLoadingTaskDetail] = useState(false);
const [saving, setSaving] = useState(false);
const [segmentSwitching, setSegmentSwitching] = useState(false);
const [iframeReady, setIframeReady] = useState(false);
const [lsReady, setLsReady] = useState(false);
@@ -306,16 +255,19 @@ export default function LabelStudioTextEditor() {
const [prefetching, setPrefetching] = useState(false);
const [selectedFileId, setSelectedFileId] = useState<string>("");
const [sidebarCollapsed, setSidebarCollapsed] = useState(false);
const [autoSaveOnSwitch, setAutoSaveOnSwitch] = useState(false);
// 分段相关状态
const [segmented, setSegmented] = useState(false);
const [segments, setSegments] = useState<SegmentInfo[]>([]);
const [currentSegmentIndex, setCurrentSegmentIndex] = useState(0);
const [segmentTotal, setSegmentTotal] = useState(0);
const isTextProject = useMemo(
() => (project?.datasetType || "").toUpperCase() === "TEXT",
[project?.datasetType],
);
const segmentIndices = useMemo(() => {
if (segmentTotal <= 0) return [] as number[];
return Array.from({ length: segmentTotal }, (_, index) => index);
}, [segmentTotal]);
const focusIframe = useCallback(() => {
const iframe = iframeRef.current;
@@ -330,70 +282,6 @@ export default function LabelStudioTextEditor() {
win.postMessage({ type, payload }, origin);
}, [origin]);
const applySegmentStats = useCallback((fileId: string, stats: SegmentStats | null) => {
if (!fileId) return;
const normalized = normalizeSegmentStats(stats);
setTasks((prev) =>
prev.map((item) =>
item.fileId === fileId
? { ...item, segmentStats: normalized || undefined }
: item
)
);
}, []);
const updateSegmentStatsCache = useCallback((fileId: string, stats: SegmentStats | null) => {
if (!fileId) return;
const normalized = normalizeSegmentStats(stats);
if (normalized) {
segmentStatsCacheRef.current[fileId] = normalized;
} else {
delete segmentStatsCacheRef.current[fileId];
}
applySegmentStats(fileId, normalized);
}, [applySegmentStats]);
const fetchSegmentStatsForFile = useCallback(async (fileId: string, seq: number) => {
if (!projectId || !fileId) return;
if (segmentStatsCacheRef.current[fileId] || segmentStatsLoadingRef.current.has(fileId)) return;
segmentStatsLoadingRef.current.add(fileId);
try {
const resp = (await getEditorTaskUsingGet(projectId, fileId, {
segmentIndex: 0,
})) as ApiResponse<EditorTaskResponse>;
if (segmentStatsSeqRef.current !== seq) return;
const data = resp?.data;
if (!data?.segmented) return;
const stats = buildSegmentStats(data.segments);
if (!stats) return;
segmentStatsCacheRef.current[fileId] = stats;
applySegmentStats(fileId, stats);
} catch (e) {
console.error(e);
} finally {
segmentStatsLoadingRef.current.delete(fileId);
}
}, [applySegmentStats, projectId]);
const prefetchSegmentStats = useCallback((items: EditorTaskListItem[]) => {
if (!projectId) return;
const fileIds = items
.map((item) => item.fileId)
.filter((fileId) => fileId && !segmentStatsCacheRef.current[fileId]);
if (fileIds.length === 0) return;
const seq = segmentStatsSeqRef.current;
let cursor = 0;
const workerCount = Math.min(3, fileIds.length);
const runWorker = async () => {
while (cursor < fileIds.length && segmentStatsSeqRef.current === seq) {
const fileId = fileIds[cursor];
cursor += 1;
await fetchSegmentStatsForFile(fileId, seq);
}
};
void Promise.all(Array.from({ length: workerCount }, () => runWorker()));
}, [fetchSegmentStatsForFile, projectId]);
const confirmEmptyAnnotationStatus = useCallback(() => {
return new Promise<AnnotationResultStatus | null>((resolve) => {
let resolved = false;
@@ -446,8 +334,6 @@ export default function LabelStudioTextEditor() {
const updateTaskSelection = useCallback((items: EditorTaskListItem[]) => {
const isCompleted = (item: EditorTaskListItem) => {
const summary = resolveSegmentSummary(item);
if (summary) return summary.done >= summary.total;
return item.hasAnnotation;
};
const defaultFileId =
@@ -508,9 +394,6 @@ export default function LabelStudioTextEditor() {
if (mode === "reset") {
prefetchSeqRef.current += 1;
setPrefetching(false);
segmentStatsSeqRef.current += 1;
segmentStatsCacheRef.current = {};
segmentStatsLoadingRef.current = new Set();
}
if (mode === "append") {
setLoadingMore(true);
@@ -591,20 +474,19 @@ export default function LabelStudioTextEditor() {
if (seq !== initSeqRef.current) return;
// 更新分段状态
const segmentIndex = data?.segmented
const isSegmented = !!data?.segmented;
const segmentIndex = isSegmented
? resolveSegmentIndex(data.currentSegmentIndex) ?? 0
: undefined;
if (data?.segmented) {
const stats = buildSegmentStats(data.segments);
if (isSegmented) {
setSegmented(true);
setSegments(data.segments || []);
setCurrentSegmentIndex(segmentIndex ?? 0);
updateSegmentStatsCache(fileId, stats);
const totalSegments = Number(data?.totalSegments ?? 0);
setSegmentTotal(Number.isFinite(totalSegments) && totalSegments > 0 ? totalSegments : 0);
} else {
setSegmented(false);
setSegments([]);
setCurrentSegmentIndex(0);
updateSegmentStatsCache(fileId, null);
setSegmentTotal(0);
}
const taskData = {
@@ -664,19 +546,14 @@ export default function LabelStudioTextEditor() {
} finally {
if (seq === initSeqRef.current) setLoadingTaskDetail(false);
}
}, [iframeReady, message, postToIframe, project, projectId, updateSegmentStatsCache]);
}, [iframeReady, message, postToIframe, project, projectId]);
const advanceAfterSave = useCallback(async (fileId: string, segmentIndex?: number) => {
if (!fileId) return;
if (segmented && segments.length > 0) {
const sortedSegmentIndices = segments
.map((seg) => seg.idx)
.sort((a, b) => a - b);
const baseIndex = segmentIndex ?? currentSegmentIndex;
const currentPos = sortedSegmentIndices.indexOf(baseIndex);
const nextSegmentIndex =
currentPos >= 0 ? sortedSegmentIndices[currentPos + 1] : sortedSegmentIndices[0];
if (nextSegmentIndex !== undefined) {
if (segmented && segmentTotal > 0) {
const baseIndex = Math.max(segmentIndex ?? currentSegmentIndex, 0);
const nextSegmentIndex = baseIndex + 1;
if (nextSegmentIndex < segmentTotal) {
await initEditorForFile(fileId, nextSegmentIndex);
return;
}
@@ -698,7 +575,7 @@ export default function LabelStudioTextEditor() {
initEditorForFile,
message,
segmented,
segments,
segmentTotal,
tasks,
]);
@@ -772,16 +649,6 @@ export default function LabelStudioTextEditor() {
const snapshot = buildAnnotationSnapshot(isRecord(annotation) ? annotation : undefined);
savedSnapshotsRef.current[snapshotKey] = snapshot;
// 分段模式下更新当前段落的标注状态
if (segmented && segmentIndex !== undefined) {
const nextSegments = segments.map((seg) =>
seg.idx === segmentIndex
? { ...seg, hasAnnotation: true }
: seg
);
setSegments(nextSegments);
updateSegmentStatsCache(String(fileId), buildSegmentStats(nextSegments));
}
if (options?.autoAdvance) {
await advanceAfterSave(String(fileId), segmentIndex);
}
@@ -800,69 +667,10 @@ export default function LabelStudioTextEditor() {
message,
projectId,
segmented,
segments,
selectedFileId,
tasks,
updateSegmentStatsCache,
]);
const requestExportForCheck = useCallback(() => {
if (!iframeReady || !lsReady) return Promise.resolve(undefined);
if (exportCheckRef.current) {
if (exportCheckRef.current.timer) {
window.clearTimeout(exportCheckRef.current.timer);
}
exportCheckRef.current.resolve(undefined);
exportCheckRef.current = null;
}
const requestId = `check_${Date.now()}_${++exportCheckSeqRef.current}`;
return new Promise<ExportPayload | undefined>((resolve) => {
const timer = window.setTimeout(() => {
if (exportCheckRef.current?.requestId === requestId) {
exportCheckRef.current = null;
}
resolve(undefined);
}, 3000);
exportCheckRef.current = {
requestId,
resolve,
timer,
};
postToIframe("LS_EXPORT_CHECK", { requestId });
});
}, [iframeReady, lsReady, postToIframe]);
const confirmSaveBeforeSwitch = useCallback(() => {
return new Promise<SwitchDecision>((resolve) => {
let resolved = false;
let modalInstance: { destroy: () => void } | null = null;
const settle = (decision: SwitchDecision) => {
if (resolved) return;
resolved = true;
resolve(decision);
};
const handleDiscard = () => {
if (modalInstance) modalInstance.destroy();
settle("discard");
};
modalInstance = modal.confirm({
title: "当前段落有未保存标注",
content: (
<div className="flex flex-col gap-2">
<Typography.Text></Typography.Text>
<Button type="link" danger style={{ padding: 0, height: "auto" }} onClick={handleDiscard}>
</Button>
</div>
),
okText: "保存并切换",
cancelText: "取消",
onOk: () => settle("save"),
onCancel: () => settle("cancel"),
});
});
}, [modal]);
const requestExport = useCallback((autoAdvance: boolean) => {
if (!selectedFileId) {
message.warning("请先选择文件");
@@ -875,7 +683,7 @@ export default function LabelStudioTextEditor() {
useEffect(() => {
const handleSaveShortcut = (event: KeyboardEvent) => {
if (!isSaveShortcut(event) || event.repeat) return;
if (saving || loadingTaskDetail || segmentSwitching) return;
if (saving || loadingTaskDetail) return;
if (!iframeReady || !lsReady) return;
event.preventDefault();
event.stopPropagation();
@@ -883,83 +691,7 @@ export default function LabelStudioTextEditor() {
};
window.addEventListener("keydown", handleSaveShortcut);
return () => window.removeEventListener("keydown", handleSaveShortcut);
}, [iframeReady, loadingTaskDetail, lsReady, requestExport, saving, segmentSwitching]);
// 段落切换处理
const handleSegmentChange = useCallback(async (newIndex: number) => {
if (newIndex === currentSegmentIndex) return;
if (segmentSwitching || saving || loadingTaskDetail) return;
if (!iframeReady || !lsReady) {
message.warning("编辑器未就绪,无法切换段落");
return;
}
setSegmentSwitching(true);
try {
const payload = await requestExportForCheck();
if (!payload) {
message.warning("无法读取当前标注,已取消切换");
return;
}
const payloadTaskId = payload.taskId;
if (expectedTaskIdRef.current && payloadTaskId) {
if (Number(payloadTaskId) !== expectedTaskIdRef.current) {
message.warning("已忽略过期的标注数据");
return;
}
}
const payloadFileId = payload.fileId || selectedFileId;
const payloadSegmentIndex = resolveSegmentIndex(payload.segmentIndex);
const resolvedSegmentIndex =
payloadSegmentIndex !== undefined
? payloadSegmentIndex
: segmented
? currentSegmentIndex
: undefined;
const annotation = isRecord(payload.annotation) ? payload.annotation : undefined;
const snapshotKey = payloadFileId
? buildSnapshotKey(String(payloadFileId), resolvedSegmentIndex)
: undefined;
const latestSnapshot = buildAnnotationSnapshot(annotation);
const lastSnapshot = snapshotKey ? savedSnapshotsRef.current[snapshotKey] : undefined;
const hasUnsavedChange = snapshotKey !== undefined && lastSnapshot !== undefined && latestSnapshot !== lastSnapshot;
if (hasUnsavedChange) {
if (autoSaveOnSwitch) {
const saved = await saveFromExport(payload);
if (!saved) return;
} else {
const decision = await confirmSaveBeforeSwitch();
if (decision === "cancel") return;
if (decision === "save") {
const saved = await saveFromExport(payload);
if (!saved) return;
}
}
}
await initEditorForFile(selectedFileId, newIndex);
} finally {
setSegmentSwitching(false);
}
}, [
autoSaveOnSwitch,
confirmSaveBeforeSwitch,
currentSegmentIndex,
iframeReady,
initEditorForFile,
loadingTaskDetail,
lsReady,
message,
requestExportForCheck,
saveFromExport,
segmented,
selectedFileId,
segmentSwitching,
saving,
]);
}, [iframeReady, loadingTaskDetail, lsReady, requestExport, saving]);
useEffect(() => {
setIframeReady(false);
@@ -977,12 +709,9 @@ export default function LabelStudioTextEditor() {
expectedTaskIdRef.current = null;
// 重置分段状态
setSegmented(false);
setSegments([]);
setCurrentSegmentIndex(0);
setSegmentTotal(0);
savedSnapshotsRef.current = {};
segmentStatsSeqRef.current += 1;
segmentStatsCacheRef.current = {};
segmentStatsLoadingRef.current = new Set();
if (exportCheckRef.current?.timer) {
window.clearTimeout(exportCheckRef.current.timer);
}
@@ -996,12 +725,6 @@ export default function LabelStudioTextEditor() {
loadTasks({ mode: "reset" });
}, [project?.supported, loadTasks]);
useEffect(() => {
if (!segmented) return;
if (tasks.length === 0) return;
prefetchSegmentStats(tasks);
}, [prefetchSegmentStats, segmented, tasks]);
useEffect(() => {
if (!selectedFileId) return;
initEditorForFile(selectedFileId);
@@ -1026,60 +749,6 @@ export default function LabelStudioTextEditor() {
return () => window.removeEventListener("focus", handleWindowFocus);
}, [focusIframe, lsReady]);
const segmentTreeData = useMemo(() => {
if (!segmented || segments.length === 0) return [];
const lineMap = new Map<number, SegmentInfo[]>();
segments.forEach((seg) => {
const list = lineMap.get(seg.lineIndex) || [];
list.push(seg);
lineMap.set(seg.lineIndex, list);
});
return Array.from(lineMap.entries())
.sort((a, b) => a[0] - b[0])
.map(([lineIndex, lineSegments]) => ({
key: `line-${lineIndex}`,
title: `${lineIndex + 1}`,
selectable: false,
children: lineSegments
.sort((a, b) => a.chunkIndex - b.chunkIndex)
.map((seg) => ({
key: `seg-${seg.idx}`,
title: (
<span className="flex items-center gap-1">
<span>{`${seg.chunkIndex + 1}`}</span>
{seg.hasAnnotation && (
<CheckOutlined style={{ fontSize: 10, color: "#52c41a" }} />
)}
</span>
),
})),
}));
}, [segmented, segments]);
const segmentLineKeys = useMemo(
() => segmentTreeData.map((item) => String(item.key)),
[segmentTreeData]
);
const inProgressSegmentedCount = useMemo(() => {
if (tasks.length === 0) return 0;
return tasks.reduce((count, item) => {
const summary = resolveSegmentSummary(item);
if (!summary) return count;
return summary.done < summary.total ? count + 1 : count;
}, 0);
}, [tasks]);
const handleSegmentSelect = useCallback((keys: Array<string | number>) => {
const [first] = keys;
if (first === undefined || first === null) return;
const key = String(first);
if (!key.startsWith("seg-")) return;
const nextIndex = Number(key.replace("seg-", ""));
if (!Number.isFinite(nextIndex)) return;
handleSegmentChange(nextIndex);
}, [handleSegmentChange]);
useEffect(() => {
const handler = (event: MessageEvent<LsfMessage>) => {
if (event.origin !== origin) return;
@@ -1148,7 +817,7 @@ export default function LabelStudioTextEditor() {
const canLoadMore = taskTotalPages > 0 && taskPage + 1 < taskTotalPages;
const saveDisabled =
!iframeReady || !selectedFileId || saving || segmentSwitching || loadingTaskDetail;
!iframeReady || !selectedFileId || saving || loadingTaskDetail;
const loadMoreNode = canLoadMore ? (
<div className="p-2 text-center">
<Button
@@ -1265,11 +934,6 @@ export default function LabelStudioTextEditor() {
>
<div className="px-3 py-2 border-b border-gray-200 bg-white font-medium text-sm flex items-center justify-between gap-2">
<span></span>
{segmented && (
<Tag color="orange" style={{ margin: 0 }}>
{inProgressSegmentedCount}
</Tag>
)}
</div>
<div className="flex-1 min-h-0 overflow-auto">
<List
@@ -1278,7 +942,6 @@ export default function LabelStudioTextEditor() {
dataSource={tasks}
loadMore={loadMoreNode}
renderItem={(item) => {
const segmentSummary = resolveSegmentSummary(item);
const statusMeta = resolveTaskStatusMeta(item);
return (
<List.Item
@@ -1300,11 +963,6 @@ export default function LabelStudioTextEditor() {
<Typography.Text type={statusMeta.type} style={{ fontSize: 11 }}>
{statusMeta.text}
</Typography.Text>
{segmentSummary && (
<Typography.Text type="secondary" style={{ fontSize: 10 }}>
{segmentSummary.done}/{segmentSummary.total}
</Typography.Text>
)}
</div>
{item.annotationUpdatedAt && (
<Typography.Text type="secondary" style={{ fontSize: 10 }}>
@@ -1323,21 +981,28 @@ export default function LabelStudioTextEditor() {
<div className="px-3 py-2 border-b border-gray-200 bg-gray-50 font-medium text-sm flex items-center justify-between">
<span>/</span>
<Tag color="blue" style={{ margin: 0 }}>
{currentSegmentIndex + 1} / {segments.length}
{segmentTotal > 0 ? currentSegmentIndex + 1 : 0} / {segmentTotal}
</Tag>
</div>
<div className="flex-1 min-h-0 overflow-auto px-2 py-2">
{segments.length > 0 ? (
<Tree
showLine
blockNode
selectedKeys={
segmented ? [`seg-${currentSegmentIndex}`] : []
}
expandedKeys={segmentLineKeys}
onSelect={handleSegmentSelect}
treeData={segmentTreeData}
/>
{segmentTotal > 0 ? (
<div className="grid grid-cols-[repeat(auto-fill,minmax(44px,1fr))] gap-1">
{segmentIndices.map((segmentIndex) => {
const isCurrent = segmentIndex === currentSegmentIndex;
return (
<div
key={segmentIndex}
className={
isCurrent
? "h-7 leading-7 rounded bg-blue-500 text-white text-center text-xs font-medium"
: "h-7 leading-7 rounded bg-gray-100 text-gray-700 text-center text-xs"
}
>
{segmentIndex + 1}
</div>
);
})}
</div>
) : (
<div className="py-6">
<Empty
@@ -1347,17 +1012,6 @@ export default function LabelStudioTextEditor() {
</div>
)}
</div>
<div className="px-3 py-2 border-t border-gray-200 flex items-center justify-between">
<Typography.Text style={{ fontSize: 12 }}>
</Typography.Text>
<Switch
size="small"
checked={autoSaveOnSwitch}
onChange={(checked) => setAutoSaveOnSwitch(checked)}
disabled={segmentSwitching || saving || loadingTaskDetail || !lsReady}
/>
</div>
</div>
)}
</div>

View File

@@ -57,6 +57,9 @@ export default function DataAnnotation() {
const [selectedRowKeys, setSelectedRowKeys] = useState<AnnotationTaskRowKey[]>([]);
const [selectedRows, setSelectedRows] = useState<AnnotationTaskListItem[]>([]);
const toSafeCount = (value: unknown) =>
typeof value === "number" && Number.isFinite(value) ? value : 0;
const handleAnnotate = (task: AnnotationTaskListItem) => {
const projectId = task.id;
if (!projectId) {
@@ -207,8 +210,20 @@ export default function DataAnnotation() {
width: 100,
align: "center" as const,
render: (value: number, record: AnnotationTaskListItem) => {
const total = record.totalCount || 0;
const annotated = value || 0;
const total = toSafeCount(record.totalCount ?? record.total_count);
const annotatedRaw = toSafeCount(
value ?? record.annotatedCount ?? record.annotated_count
);
const segmentationEnabled =
record.segmentationEnabled ?? record.segmentation_enabled;
const inProgressRaw = segmentationEnabled
? toSafeCount(record.inProgressCount ?? record.in_progress_count)
: 0;
const shouldExcludeInProgress =
total > 0 && annotatedRaw + inProgressRaw > total;
const annotated = shouldExcludeInProgress
? Math.max(annotatedRaw - inProgressRaw, 0)
: annotatedRaw;
const percent = total > 0 ? Math.round((annotated / total) * 100) : 0;
return (
<span title={`${annotated}/${total} (${percent}%)`}>

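The same exclusion arithmetic, shown standalone with hypothetical numbers (10 files total, 7 reported annotated, 4 still in progress under segmentation):

// 7 + 4 > 10, so the in-progress files are treated as double-counted and excluded.
const total = 10;
const annotatedRaw = 7;
const inProgressRaw = 4;
const annotated = annotatedRaw + inProgressRaw > total
  ? Math.max(annotatedRaw - inProgressRaw, 0)  // 3
  : annotatedRaw;
const percent = total > 0 ? Math.round((annotated / total) * 100) : 0;  // 30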
View File

@@ -43,14 +43,6 @@ const TemplateDetail: React.FC<TemplateDetailProps> = ({
<Descriptions.Item label="样式">
{template.style}
</Descriptions.Item>
<Descriptions.Item label="类型">
<Tag color={template.builtIn ? "gold" : "default"}>
{template.builtIn ? "系统内置" : "自定义"}
</Tag>
</Descriptions.Item>
<Descriptions.Item label="版本">
{template.version}
</Descriptions.Item>
<Descriptions.Item label="创建时间" span={2}>
{new Date(template.createdAt).toLocaleString()}
</Descriptions.Item>

View File

@@ -36,6 +36,7 @@ const TemplateForm: React.FC<TemplateFormProps> = ({
const [form] = Form.useForm();
const [loading, setLoading] = useState(false);
const [labelConfig, setLabelConfig] = useState("");
const selectedDataType = Form.useWatch("dataType", form);
useEffect(() => {
if (visible && template && mode === "edit") {
@@ -96,8 +97,12 @@ const TemplateForm: React.FC<TemplateFormProps> = ({
} else {
message.error(response.message || `模板${mode === "create" ? "创建" : "更新"}失败`);
}
} catch (error: any) {
if (error.errorFields) {
} catch (error: unknown) {
const hasErrorFields =
typeof error === "object" &&
error !== null &&
"errorFields" in error;
if (hasErrorFields) {
message.error("请填写所有必填字段");
} else {
message.error(`模板${mode === "create" ? "创建" : "更新"}失败`);
@@ -195,6 +200,7 @@ const TemplateForm: React.FC<TemplateFormProps> = ({
value={labelConfig}
onChange={setLabelConfig}
height={420}
dataType={selectedDataType}
/>
</div>
</Form>

View File

@@ -1,4 +1,4 @@
import React, { useState } from "react";
import React, { useState, useEffect } from "react";
import {
Button,
Table,
@@ -32,7 +32,16 @@ import {
TemplateTypeMap
} from "@/pages/DataAnnotation/annotation.const.tsx";
const TEMPLATE_ADMIN_KEY = "datamate_template_admin";
const TemplateList: React.FC = () => {
const [isAdmin, setIsAdmin] = useState(false);
useEffect(() => {
// 检查 localStorage 中是否存在特殊键
const hasAdminKey = localStorage.getItem(TEMPLATE_ADMIN_KEY) !== null;
setIsAdmin(hasAdminKey);
}, []);
const filterOptions = [
{
key: "category",
@@ -225,23 +234,7 @@ const TemplateList: React.FC = () => {
<Tag color={getCategoryColor(category)}>{ClassificationMap[category as keyof typeof ClassificationMap]?.label || category}</Tag>
),
},
{
title: "类型",
dataIndex: "builtIn",
key: "builtIn",
width: 100,
render: (builtIn: boolean) => (
<Tag color={builtIn ? "gold" : "default"}>
{builtIn ? "系统内置" : "自定义"}
</Tag>
),
},
{
title: "版本",
dataIndex: "version",
key: "version",
width: 80,
},
{
title: "创建时间",
dataIndex: "createdAt",
@@ -263,29 +256,31 @@ const TemplateList: React.FC = () => {
onClick={() => handleView(record)}
/>
</Tooltip>
<>
<Tooltip title="编辑">
<Button
type="link"
icon={<EditOutlined />}
onClick={() => handleEdit(record)}
/>
</Tooltip>
<Popconfirm
title="确定要删除这个模板吗?"
onConfirm={() => handleDelete(record.id)}
okText="确定"
cancelText="取消"
>
<Tooltip title="删除">
{isAdmin && (
<>
<Tooltip title="编辑">
<Button
type="link"
danger
icon={<DeleteOutlined />}
icon={<EditOutlined />}
onClick={() => handleEdit(record)}
/>
</Tooltip>
</Popconfirm>
</>
<Popconfirm
title="确定要删除这个模板吗?"
onConfirm={() => handleDelete(record.id)}
okText="确定"
cancelText="取消"
>
<Tooltip title="删除">
<Button
type="link"
danger
icon={<DeleteOutlined />}
/>
</Tooltip>
</Popconfirm>
</>
)}
</Space>
),
},
@@ -310,11 +305,13 @@ const TemplateList: React.FC = () => {
</div>
{/* Right side: Create button */}
<div className="flex items-center gap-2">
<Button type="primary" icon={<PlusOutlined />} onClick={handleCreate}>
</Button>
</div>
{isAdmin && (
<div className="flex items-center gap-2">
<Button type="primary" icon={<PlusOutlined />} onClick={handleCreate}>
</Button>
</div>
)}
</div>
<Card>

View File

@@ -3,16 +3,19 @@ import { get, post, put, del, download } from "@/utils/request";
// 导出格式类型
export type ExportFormat = "json" | "jsonl" | "csv" | "coco" | "yolo";
type RequestParams = Record<string, unknown>;
type RequestPayload = Record<string, unknown>;
// 标注任务管理相关接口
export function queryAnnotationTasksUsingGet(params?: any) {
export function queryAnnotationTasksUsingGet(params?: RequestParams) {
return get("/api/annotation/project", params);
}
export function createAnnotationTaskUsingPost(data: any) {
export function createAnnotationTaskUsingPost(data: RequestPayload) {
return post("/api/annotation/project", data);
}
export function syncAnnotationTaskUsingPost(data: any) {
export function syncAnnotationTaskUsingPost(data: RequestPayload) {
return post(`/api/annotation/task/sync`, data);
}
@@ -25,7 +28,7 @@ export function getAnnotationTaskByIdUsingGet(taskId: string) {
return get(`/api/annotation/project/${taskId}`);
}
export function updateAnnotationTaskByIdUsingPut(taskId: string, data: any) {
export function updateAnnotationTaskByIdUsingPut(taskId: string, data: RequestPayload) {
return put(`/api/annotation/project/${taskId}`, data);
}
@@ -35,17 +38,17 @@ export function getTagConfigUsingGet() {
}
// 标注模板管理
export function queryAnnotationTemplatesUsingGet(params?: any) {
export function queryAnnotationTemplatesUsingGet(params?: RequestParams) {
return get("/api/annotation/template", params);
}
export function createAnnotationTemplateUsingPost(data: any) {
export function createAnnotationTemplateUsingPost(data: RequestPayload) {
return post("/api/annotation/template", data);
}
export function updateAnnotationTemplateByIdUsingPut(
templateId: string | number,
data: any
data: RequestPayload
) {
return put(`/api/annotation/template/${templateId}`, data);
}
@@ -65,7 +68,7 @@ export function getEditorProjectInfoUsingGet(projectId: string) {
return get(`/api/annotation/editor/projects/${projectId}`);
}
export function listEditorTasksUsingGet(projectId: string, params?: any) {
export function listEditorTasksUsingGet(projectId: string, params?: RequestParams) {
return get(`/api/annotation/editor/projects/${projectId}/tasks`, params);
}
@@ -77,11 +80,19 @@ export function getEditorTaskUsingGet(
return get(`/api/annotation/editor/projects/${projectId}/tasks/${fileId}`, params);
}
export function getEditorTaskSegmentUsingGet(
projectId: string,
fileId: string,
params: { segmentIndex: number }
) {
return get(`/api/annotation/editor/projects/${projectId}/tasks/${fileId}/segments`, params);
}
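A minimal usage sketch, not part of this change: fetch a single segment on demand when the user picks an index in the segment grid. The field names follow EditorTaskSegmentResponse from the backend schema further down; the { data } envelope is an assumption based on the other editor endpoints.

async function loadSegmentText(projectId: string, fileId: string, segmentIndex: number) {
  const res = (await getEditorTaskSegmentUsingGet(projectId, fileId, { segmentIndex })) as {
    data?: {
      segmented: boolean;
      segment?: { idx: number; text: string; hasAnnotation: boolean };
      totalSegments: number;
      currentSegmentIndex: number;
    };
  };
  if (!res.data?.segmented || !res.data.segment) return null;
  return res.data.segment.text; // feed into the Label Studio task for this segment
}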
export function upsertEditorAnnotationUsingPut(
projectId: string,
fileId: string,
data: {
annotation: any;
annotation: Record<string, unknown>;
expectedUpdatedAt?: string;
segmentIndex?: number;
}

View File

@@ -22,6 +22,7 @@ import {
getObjectDisplayName,
type LabelStudioTagConfig,
} from "../annotation.tagconfig";
import { DataType } from "../annotation.model";
const { Text, Title } = Typography;
@@ -44,10 +45,22 @@ interface TemplateConfigurationTreeEditorProps {
readOnly?: boolean;
readOnlyStructure?: boolean;
height?: number | string;
dataType?: DataType;
}
const DEFAULT_ROOT_TAG = "View";
const CHILD_TAGS = ["Label", "Choice", "Relation", "Item", "Path", "Channel"];
const OBJECT_TAGS_BY_DATA_TYPE: Record<DataType, string[]> = {
[DataType.TEXT]: ["Text", "Paragraphs", "Markdown"],
[DataType.IMAGE]: ["Image", "Bitmask"],
[DataType.AUDIO]: ["Audio", "AudioPlus"],
[DataType.VIDEO]: ["Video"],
[DataType.PDF]: ["PDF"],
[DataType.TIMESERIES]: ["Timeseries", "TimeSeries", "Vector"],
[DataType.CHAT]: ["Chat"],
[DataType.HTML]: ["HyperText", "Markdown"],
[DataType.TABLE]: ["Table", "Vector"],
};
const createId = () =>
`node_${Date.now().toString(36)}_${Math.random().toString(36).slice(2, 8)}`;
@@ -247,18 +260,34 @@ const createNode = (
attrs[attr] = "";
});
if (objectConfig && attrs.name !== undefined) {
if (objectConfig) {
const name = getDefaultName(tag);
attrs.name = name;
if (attrs.value !== undefined) {
attrs.value = `$${name}`;
if (!attrs.name) {
attrs.name = name;
}
if (!attrs.value) {
attrs.value = `$${attrs.name}`;
}
}
if (controlConfig && attrs.name !== undefined) {
attrs.name = getDefaultName(tag);
if (attrs.toName !== undefined) {
attrs.toName = objectNames[0] || "";
if (controlConfig) {
const isLabeling = controlConfig.category === "labeling";
if (isLabeling) {
if (!attrs.name) {
attrs.name = getDefaultName(tag);
}
if (!attrs.toName) {
attrs.toName = objectNames[0] || "";
}
} else {
// For layout controls, only fill if required
if (attrs.name !== undefined && !attrs.name) {
attrs.name = getDefaultName(tag);
}
if (attrs.toName !== undefined && !attrs.toName) {
attrs.toName = objectNames[0] || "";
}
}
}
@@ -420,14 +449,13 @@ const TemplateConfigurationTreeEditor = ({
readOnly = false,
readOnlyStructure = false,
height = 420,
dataType,
}: TemplateConfigurationTreeEditorProps) => {
const { config } = useTagConfig(false);
const [tree, setTree] = useState<XmlNode>(() => createEmptyTree());
const [selectedId, setSelectedId] = useState<string>(tree.id);
const [parseError, setParseError] = useState<string | null>(null);
const lastSerialized = useRef<string>("");
const [addChildTag, setAddChildTag] = useState<string | undefined>();
const [addSiblingTag, setAddSiblingTag] = useState<string | undefined>();
useEffect(() => {
if (!value) {
@@ -498,11 +526,17 @@ const TemplateConfigurationTreeEditor = ({
const objectOptions = useMemo(() => {
if (!config?.objects) return [];
return Object.keys(config.objects).map((tag) => ({
const options = Object.keys(config.objects).map((tag) => ({
value: tag,
label: getObjectDisplayName(tag),
}));
}, [config]);
if (!dataType) return options;
const allowedTags = OBJECT_TAGS_BY_DATA_TYPE[dataType];
if (!allowedTags) return options;
const allowedSet = new Set(allowedTags);
const filtered = options.filter((option) => allowedSet.has(option.value));
return filtered.length > 0 ? filtered : options;
}, [config, dataType]);
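Illustrative only, with a hypothetical tag config: DataType.TEXT allows Text / Paragraphs / Markdown, so other object tags are dropped; if nothing matched, the unfiltered options would be returned.

const availableTags = ["Image", "Text", "Audio", "Markdown"]; // hypothetical config.objects keys
const allowedForText = new Set(OBJECT_TAGS_BY_DATA_TYPE[DataType.TEXT]);
console.log(availableTags.filter((tag) => allowedForText.has(tag))); // ["Text", "Markdown"]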
const tagOptions = useMemo(() => {
const options = [] as {
@@ -763,9 +797,8 @@ const TemplateConfigurationTreeEditor = ({
<Select
placeholder="添加子节点"
options={tagOptions}
value={addChildTag}
value={null}
onChange={(value) => {
setAddChildTag(undefined);
handleAddNode(value, "child");
}}
disabled={isStructureLocked}
@@ -773,9 +806,8 @@ const TemplateConfigurationTreeEditor = ({
<Select
placeholder="添加同级节点"
options={tagOptions}
value={addSiblingTag}
value={null}
onChange={(value) => {
setAddSiblingTag(undefined);
handleAddNode(value, "sibling");
}}
disabled={isStructureLocked || selectedNode.id === tree.id}

View File

@@ -4,14 +4,9 @@ import { ArrowLeft } from "lucide-react";
import { Button, Form, App } from "antd";
import { Link, useLocation, useNavigate } from "react-router";
import { createDatasetUsingPost } from "../dataset.api";
import { datasetTypes } from "../dataset.const";
import { DatasetType } from "../dataset.model";
import BasicInformation from "./components/BasicInformation";
const textDatasetTypeOptions = datasetTypes.filter(
(type) => type.value === DatasetType.TEXT
);
export default function DatasetCreate() {
const navigate = useNavigate();
const location = useLocation();
@@ -87,7 +82,6 @@ export default function DatasetCreate() {
data={newDataset}
setData={setNewDataset}
hidden={["dataSource"]}
datasetTypeOptions={textDatasetTypeOptions}
/>
</Form>
</div>

View File

@@ -5,7 +5,7 @@ import { Dataset, DatasetType, DataSource } from "../../dataset.model";
import { useCallback, useEffect, useMemo, useState } from "react";
import { queryTasksUsingGet } from "@/pages/DataCollection/collection.apis";
import { updateDatasetByIdUsingPut } from "../../dataset.api";
import { sliceFile } from "@/utils/file.util";
import { sliceFile, shouldStreamUpload } from "@/utils/file.util";
import Dragger from "antd/es/upload/Dragger";
const TEXT_FILE_MIME_PREFIX = "text/";
@@ -90,14 +90,16 @@ async function splitFileByLines(file: UploadFile): Promise<UploadFile[]> {
const lines = text.split(/\r?\n/).filter((line: string) => line.trim() !== "");
if (lines.length === 0) return [];
// 生成文件名:原文件名_序号.扩展名
// 生成文件名:原文件名_序号(不保留后缀)
const nameParts = file.name.split(".");
const ext = nameParts.length > 1 ? "." + nameParts.pop() : "";
if (nameParts.length > 1) {
nameParts.pop();
}
const baseName = nameParts.join(".");
const padLength = String(lines.length).length;
return lines.map((line: string, index: number) => {
const newFileName = `${baseName}_${String(index + 1).padStart(padLength, "0")}${ext}`;
const newFileName = `${baseName}_${String(index + 1).padStart(padLength, "0")}`;
const blob = new Blob([line], { type: "text/plain" });
const newFile = new File([blob], newFileName, { type: "text/plain" });
return {
@@ -164,17 +166,75 @@ export default function ImportConfiguration({
// 本地上传文件相关逻辑
const handleUpload = async (dataset: Dataset) => {
let filesToUpload =
const filesToUpload =
(form.getFieldValue("files") as UploadFile[] | undefined) || [];
// 如果启用分行分割,处理文件
// 如果启用分行分割,对大文件使用流式处理
if (importConfig.splitByLine && !hasNonTextFile) {
const splitResults = await Promise.all(
filesToUpload.map((file) => splitFileByLines(file))
);
filesToUpload = splitResults.flat();
// 检查是否有大文件需要流式分割上传
const filesForStreamUpload: File[] = [];
const filesForNormalUpload: UploadFile[] = [];
for (const file of filesToUpload) {
const originFile = file.originFileObj ?? file;
if (originFile instanceof File && shouldStreamUpload(originFile)) {
filesForStreamUpload.push(originFile);
} else {
filesForNormalUpload.push(file);
}
}
// 大文件使用流式分割上传
if (filesForStreamUpload.length > 0) {
window.dispatchEvent(
new CustomEvent("upload:dataset-stream", {
detail: {
dataset,
files: filesForStreamUpload,
updateEvent,
hasArchive: importConfig.hasArchive,
prefix: currentPrefix,
},
})
);
}
// 小文件使用传统分割方式
if (filesForNormalUpload.length > 0) {
const splitResults = await Promise.all(
filesForNormalUpload.map((file) => splitFileByLines(file))
);
const smallFilesToUpload = splitResults.flat();
// 计算分片列表
const sliceList = smallFilesToUpload.map((file) => {
const originFile = (file.originFileObj ?? file) as Blob;
const slices = sliceFile(originFile);
return {
originFile: originFile,
slices,
name: file.name,
size: originFile.size || 0,
};
});
console.log("[ImportConfiguration] Uploading small files with currentPrefix:", currentPrefix);
window.dispatchEvent(
new CustomEvent("upload:dataset", {
detail: {
dataset,
files: sliceList,
updateEvent,
hasArchive: importConfig.hasArchive,
prefix: currentPrefix,
},
})
);
}
return;
}
// 未启用分行分割,使用普通上传
// 计算分片列表
const sliceList = filesToUpload.map((file) => {
const originFile = (file.originFileObj ?? file) as Blob;
@@ -234,6 +294,10 @@ export default function ImportConfiguration({
if (!data) return;
console.log('[ImportConfiguration] handleImportData called, currentPrefix:', currentPrefix);
if (importConfig.source === DataSource.UPLOAD) {
// 立即显示任务中心,让用户感知上传已开始(在文件分割等耗时操作之前)
window.dispatchEvent(
new CustomEvent("show:task-popover", { detail: { show: true } })
);
await handleUpload(data);
} else if (importConfig.source === DataSource.COLLECTION) {
await updateDatasetByIdUsingPut(data.id, {

View File

@@ -102,6 +102,13 @@ export interface DatasetTask {
executionHistory?: { time: string; status: string }[];
}
export interface StreamUploadInfo {
currentFile: string;
fileIndex: number;
totalFiles: number;
uploadedLines: number;
}
export interface TaskItem {
key: string;
title: string;
@@ -113,4 +120,6 @@ export interface TaskItem {
updateEvent?: string;
size?: number;
hasArchive?: boolean;
prefix?: string;
streamUploadInfo?: StreamUploadInfo;
}

View File

@@ -75,6 +75,30 @@ const OFFICE_PREVIEW_POLL_MAX_TIMES = 60;
type OfficePreviewStatus = "UNSET" | "PENDING" | "PROCESSING" | "READY" | "FAILED";
const parseMetadata = (value?: string | Record<string, unknown>) => {
if (!value) {
return null;
}
if (typeof value === "object") {
return value as Record<string, unknown>;
}
if (typeof value !== "string") {
return null;
}
try {
const parsed = JSON.parse(value);
return parsed && typeof parsed === "object" ? (parsed as Record<string, unknown>) : null;
} catch {
return null;
}
};
const isAnnotationItem = (record: KnowledgeItemView) => {
const metadata = parseMetadata(record.metadata);
const source = metadata && typeof metadata === "object" ? (metadata as { source?: { type?: string } }).source : null;
return source?.type === "annotation";
};
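A quick sketch with a hypothetical record: items whose metadata marks source.type as "annotation" are previewed from their stored content instead of the source dataset file.

const demoItem = {
  id: "k-demo",
  content: "annotated text…",
  metadata: JSON.stringify({ source: { type: "annotation" } }),
} as unknown as KnowledgeItemView;
console.log(isAnnotationItem(demoItem)); // true — the read dialog shows record.content directly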
const isOfficeFileName = (fileName?: string) => {
const lowerName = (fileName || "").toLowerCase();
return OFFICE_FILE_EXTENSIONS.some((ext) => lowerName.endsWith(ext));
@@ -488,7 +512,7 @@ const KnowledgeSetDetail = () => {
setReadItemId(record.id);
setReadTitle("知识条目");
if (!record.sourceDatasetId || !record.sourceFileId) {
if (!record.sourceDatasetId || !record.sourceFileId || isAnnotationItem(record)) {
const content = record.content || "";
setReadContent(truncatePreviewText(content, PREVIEW_TEXT_MAX_LENGTH));
setReadModalOpen(true);
@@ -921,7 +945,7 @@ const KnowledgeSetDetail = () => {
]}
tagConfig={{
showAdd: true,
tags: (knowledgeSet?.tags || []) as any,
tags: knowledgeSet?.tags || [],
onFetchTags: async () => {
const res = await queryDatasetTagsUsingGet({
page: 0,
@@ -950,10 +974,10 @@ const KnowledgeSetDetail = () => {
fetchKnowledgeSet();
}
},
onAddTag: async (tag: any) => {
onAddTag: async (tagName: string) => {
if (knowledgeSet) {
const currentTags = knowledgeSet.tags || [];
const newTagName = typeof tag === "string" ? tag : tag?.name;
const newTagName = tagName?.trim();
if (!newTagName) return;
await updateKnowledgeSetByIdUsingPut(knowledgeSet.id, {
name: knowledgeSet.name,
@@ -991,7 +1015,7 @@ const KnowledgeSetDetail = () => {
<Descriptions.Item label="领域">{knowledgeSet?.domain || "-"}</Descriptions.Item>
<Descriptions.Item label="业务线">{knowledgeSet?.businessLine || "-"}</Descriptions.Item>
<Descriptions.Item label="负责人">{knowledgeSet?.owner || "-"}</Descriptions.Item>
<Descriptions.Item label="敏感级别">{knowledgeSet?.sensitivity || "-"}</Descriptions.Item>
{/* <Descriptions.Item label="敏感级别">{knowledgeSet?.sensitivity || "-"}</Descriptions.Item> */}
<Descriptions.Item label="有效期">
{knowledgeSet?.validFrom || "-"} ~ {knowledgeSet?.validTo || "-"}
</Descriptions.Item>

View File

@@ -257,7 +257,7 @@ export default function KnowledgeManagementPage() {
return (
<div className="h-full flex flex-col gap-4">
<div className="flex items-center justify-between">
<h1 className="text-xl font-bold"></h1>
<h1 className="text-xl font-bold"></h1>
<div className="flex gap-2 items-center">
<Button onClick={() => navigate("/data/knowledge-management/search")}>

View File

@@ -9,6 +9,7 @@ import {
import {
knowledgeSourceTypeOptions,
knowledgeStatusOptions,
// sensitivityOptions,
} from "../knowledge-management.const";
import {
KnowledgeSet,
@@ -169,9 +170,9 @@ export default function CreateKnowledgeSet({
<Form.Item label="负责人" name="owner">
<Input placeholder="请输入负责人" />
</Form.Item>
<Form.Item label="敏感级别" name="sensitivity">
<Input placeholder="请输入敏感级别" />
</Form.Item>
{/* <Form.Item label="敏感级别" name="sensitivity">
<Select options={sensitivityOptions} placeholder="请选择敏感级别" />
</Form.Item> */}
</div>
<div className="grid grid-cols-2 gap-4">
<Form.Item label="有效期开始" name="validFrom">
@@ -191,9 +192,6 @@ export default function CreateKnowledgeSet({
placeholder="请选择或输入标签"
/>
</Form.Item>
<Form.Item label="扩展元数据" name="metadata">
<Input.TextArea placeholder="请输入元数据(JSON)" rows={3} />
</Form.Item>
</Form>
</Modal>
</>

View File

@@ -66,6 +66,11 @@ export const knowledgeSourceTypeOptions = [
{ label: "文件上传", value: KnowledgeSourceType.FILE_UPLOAD },
];
// export const sensitivityOptions = [
// { label: "敏感", value: "敏感" },
// { label: "不敏感", value: "不敏感" },
// ];
export type KnowledgeSetView = {
id: string;
name: string;

View File

@@ -3,25 +3,28 @@ import {
preUploadUsingPost,
uploadFileChunkUsingPost,
} from "@/pages/DataManagement/dataset.api";
import { Button, Empty, Progress } from "antd";
import { DeleteOutlined } from "@ant-design/icons";
import { Button, Empty, Progress, Tag } from "antd";
import { DeleteOutlined, FileTextOutlined } from "@ant-design/icons";
import { useEffect } from "react";
import { useFileSliceUpload } from "@/hooks/useSliceUpload";
export default function TaskUpload() {
const { createTask, taskList, removeTask, handleUpload } = useFileSliceUpload(
const { createTask, taskList, removeTask, handleUpload, registerStreamUploadListener } = useFileSliceUpload(
{
preUpload: preUploadUsingPost,
uploadChunk: uploadFileChunkUsingPost,
cancelUpload: cancelUploadUsingPut,
}
},
true, // showTaskCenter
true // enableStreamUpload
);
useEffect(() => {
const uploadHandler = (e: any) => {
console.log('[TaskUpload] Received upload event detail:', e.detail);
const { files } = e.detail;
const task = createTask(e.detail);
const uploadHandler = (e: Event) => {
const customEvent = e as CustomEvent;
console.log('[TaskUpload] Received upload event detail:', customEvent.detail);
const { files } = customEvent.detail;
const task = createTask(customEvent.detail);
console.log('[TaskUpload] Created task with prefix:', task.prefix);
handleUpload({ task, files });
};
@@ -29,7 +32,13 @@ export default function TaskUpload() {
return () => {
window.removeEventListener("upload:dataset", uploadHandler);
};
}, []);
}, [createTask, handleUpload]);
// 注册流式上传监听器
useEffect(() => {
const unregister = registerStreamUploadListener();
return unregister;
}, [registerStreamUploadListener]);
return (
<div
@@ -55,7 +64,22 @@ export default function TaskUpload() {
></Button>
</div>
<Progress size="small" percent={task.percent} />
<Progress size="small" percent={Number(task.percent)} />
{task.streamUploadInfo && (
<div className="flex items-center gap-2 text-xs text-gray-500 mt-1">
<Tag icon={<FileTextOutlined />} size="small">
</Tag>
<span>
: {task.streamUploadInfo.uploadedLines}
</span>
{task.streamUploadInfo.totalFiles > 1 && (
<span>
({task.streamUploadInfo.fileIndex}/{task.streamUploadInfo.totalFiles} )
</span>
)}
</div>
)}
</div>
))}
{taskList.length === 0 && (

View File

@@ -54,7 +54,7 @@ const LoginPage: React.FC = () => {
<Title level={2} className="!text-white !mb-2 tracking-wide font-bold">
DataBuilder
</Title>
<Text className="text-gray-400 text-sm tracking-wider">
<Text className="text-gray-400! text-sm tracking-wider">
</Text>
</div>
@@ -100,8 +100,8 @@ const LoginPage: React.FC = () => {
</Form.Item>
<div className="text-center mt-4">
<Text className="text-gray-600 text-xs">
·
<Text className="text-gray-600! text-xs">
·
</Text>
</div>
</Form>

View File

@@ -1,79 +1,657 @@
import { UploadFile } from "antd";
import jsSHA from "jssha";
const CHUNK_SIZE = 1024 * 1024 * 60;
// 默认分片大小:5MB(适合大多数网络环境)
export const DEFAULT_CHUNK_SIZE = 1024 * 1024 * 5;
// 大文件阈值:10MB
export const LARGE_FILE_THRESHOLD = 1024 * 1024 * 10;
// 最大并发上传数
export const MAX_CONCURRENT_UPLOADS = 3;
// 文本文件读取块大小:20MB(用于计算 SHA256)
const BUFFER_CHUNK_SIZE = 1024 * 1024 * 20;
export function sliceFile(file, chunkSize = CHUNK_SIZE): Blob[] {
/**
* 将文件分割为多个分片
* @param file 文件对象
* @param chunkSize 分片大小(字节),默认 5MB
* @returns 分片数组(Blob 列表)
*/
export function sliceFile(file: Blob, chunkSize = DEFAULT_CHUNK_SIZE): Blob[] {
const totalSize = file.size;
const chunks: Blob[] = [];
// 小文件不需要分片
if (totalSize <= chunkSize) {
return [file];
}
let start = 0;
let end = start + chunkSize;
const chunks = [];
while (start < totalSize) {
const end = Math.min(start + chunkSize, totalSize);
const blob = file.slice(start, end);
chunks.push(blob);
start = end;
end = start + chunkSize;
}
return chunks;
}
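Worked example: with the 5 MB default, a 12 MB blob yields three slices (5 MB, 5 MB, 2 MB); anything at or below the chunk size comes back as a single slice.

const twelveMb = new Blob([new Uint8Array(12 * 1024 * 1024)]);
const parts = sliceFile(twelveMb);
console.log(parts.length);             // 3
console.log(parts.map((p) => p.size)); // [5242880, 5242880, 2097152]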
export function calculateSHA256(file: Blob): Promise<string> {
let count = 0;
const hash = new jsSHA("SHA-256", "ARRAYBUFFER", { encoding: "UTF8" });
/**
* 计算文件的 SHA256 哈希值
* @param file 文件 Blob
* @param onProgress 进度回调(可选)
* @returns SHA256 哈希字符串
*/
export function calculateSHA256(
file: Blob,
onProgress?: (percent: number) => void
): Promise<string> {
return new Promise((resolve, reject) => {
const hash = new jsSHA("SHA-256", "ARRAYBUFFER", { encoding: "UTF8" });
const reader = new FileReader();
let processedSize = 0;
function readChunk(start: number, end: number) {
const slice = file.slice(start, end);
reader.readAsArrayBuffer(slice);
}
const bufferChunkSize = 1024 * 1024 * 20;
function processChunk(offset: number) {
const start = offset;
const end = Math.min(start + bufferChunkSize, file.size);
count = end;
const end = Math.min(start + BUFFER_CHUNK_SIZE, file.size);
readChunk(start, end);
}
reader.onloadend = function () {
const arraybuffer = reader.result;
reader.onloadend = function (e) {
const arraybuffer = reader.result as ArrayBuffer;
if (!arraybuffer) {
reject(new Error("Failed to read file"));
return;
}
hash.update(arraybuffer);
if (count < file.size) {
processChunk(count);
processedSize += arraybuffer.byteLength;
if (onProgress) {
const percent = Math.min(100, Math.round((processedSize / file.size) * 100));
onProgress(percent);
}
if (processedSize < file.size) {
processChunk(processedSize);
} else {
resolve(hash.getHash("HEX", { outputLen: 256 }));
}
};
reader.onerror = () => reject(new Error("File reading failed"));
processChunk(0);
});
}
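Usage sketch (the input element id is hypothetical): hash a picked file before upload and surface read progress.

const picker = document.querySelector<HTMLInputElement>("#dataset-file");
const picked = picker?.files?.[0];
if (picked) {
  calculateSHA256(picked, (percent) => console.log(`hashing ${percent}%`))
    .then((hex) => console.log("sha256:", hex));
}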
/**
* 批量计算多个文件的 SHA256
* @param files 文件列表
* @param onFileProgress 单个文件进度回调(可选)
* @returns 哈希值数组
*/
export async function calculateSHA256Batch(
files: Blob[],
onFileProgress?: (index: number, percent: number) => void
): Promise<string[]> {
const results: string[] = [];
for (let i = 0; i < files.length; i++) {
const hash = await calculateSHA256(files[i], (percent) => {
onFileProgress?.(i, percent);
});
results.push(hash);
}
return results;
}
/**
* 检查文件是否存在(未被修改或删除)
* @param fileList 文件列表
* @returns 返回第一个不存在的文件,或 null(如果都存在)
*/
export function checkIsFilesExist(
fileList: UploadFile[]
): Promise<UploadFile | null> {
fileList: Array<{ originFile?: Blob }>
): Promise<{ originFile?: Blob } | null> {
return new Promise((resolve) => {
const loadEndFn = (file: UploadFile, reachEnd: boolean, e) => {
const fileNotExist = !e.target.result;
if (!fileList.length) {
resolve(null);
return;
}
let checkedCount = 0;
const totalCount = fileList.length;
const loadEndFn = (file: { originFile?: Blob }, e: ProgressEvent<FileReader>) => {
checkedCount++;
const fileNotExist = !e.target?.result;
if (fileNotExist) {
resolve(file);
return;
}
if (reachEnd) {
if (checkedCount >= totalCount) {
resolve(null);
}
};
for (let i = 0; i < fileList.length; i++) {
const { originFile: file } = fileList[i];
for (const file of fileList) {
const fileReader = new FileReader();
fileReader.readAsArrayBuffer(file);
fileReader.onloadend = (e) =>
loadEndFn(fileList[i], i === fileList.length - 1, e);
const actualFile = file.originFile;
if (!actualFile) {
checkedCount++;
if (checkedCount >= totalCount) {
resolve(null);
}
continue;
}
fileReader.readAsArrayBuffer(actualFile.slice(0, 1));
fileReader.onloadend = (e) => loadEndFn(file, e);
fileReader.onerror = () => {
checkedCount++;
resolve(file);
};
}
});
}
/**
* 判断文件是否为大文件
* @param size 文件大小(字节)
* @param threshold 阈值(字节),默认 10MB
*/
export function isLargeFile(size: number, threshold = LARGE_FILE_THRESHOLD): boolean {
return size > threshold;
}
/**
* 格式化文件大小为人类可读格式
* @param bytes 字节数
* @param decimals 小数位数
*/
export function formatFileSize(bytes: number, decimals = 2): string {
if (bytes === 0) return "0 B";
const k = 1024;
const sizes = ["B", "KB", "MB", "GB", "TB", "PB"];
const i = Math.floor(Math.log(bytes) / Math.log(k));
return `${parseFloat((bytes / Math.pow(k, i)).toFixed(decimals))} ${sizes[i]}`;
}
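A few worked values:

formatFileSize(0);                 // "0 B"
formatFileSize(1536);              // "1.5 KB"
formatFileSize(10 * 1024 * 1024);  // "10 MB"
formatFileSize(1234567890, 1);     // "1.1 GB"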
/**
* 并发执行异步任务
* @param tasks 任务函数数组
* @param maxConcurrency 最大并发数
* @param onTaskComplete 单个任务完成回调(可选)
*/
export async function runConcurrentTasks<T>(
tasks: (() => Promise<T>)[],
maxConcurrency: number,
onTaskComplete?: (index: number, result: T) => void
): Promise<T[]> {
const results: T[] = new Array(tasks.length);
let index = 0;
async function runNext(): Promise<void> {
const currentIndex = index++;
if (currentIndex >= tasks.length) return;
const result = await tasks[currentIndex]();
results[currentIndex] = result;
onTaskComplete?.(currentIndex, result);
await runNext();
}
const workers = Array(Math.min(maxConcurrency, tasks.length))
.fill(null)
.map(() => runNext());
await Promise.all(workers);
return results;
}
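Usage sketch with hypothetical jobs: five tasks, at most two in flight, results returned in task order.

const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
const jobs = [1, 2, 3, 4, 5].map((n) => async () => {
  await delay(50 * n); // simulated work
  return n * n;
});
(async () => {
  const squares = await runConcurrentTasks(jobs, 2, (i, value) =>
    console.log(`job ${i} finished with ${value}`)
  );
  console.log(squares); // [1, 4, 9, 16, 25]
})();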
/**
* 按行分割文本文件内容
* @param text 文本内容
* @param skipEmptyLines 是否跳过空行,默认 true
* @returns 行数组
*/
export function splitTextByLines(text: string, skipEmptyLines = true): string[] {
const lines = text.split(/\r?\n/);
if (skipEmptyLines) {
return lines.filter((line) => line.trim() !== "");
}
return lines;
}
/**
* 创建分片信息对象
* @param file 原始文件
* @param chunkSize 分片大小
*/
export function createFileSliceInfo(
file: File | Blob,
chunkSize = DEFAULT_CHUNK_SIZE
): {
originFile: Blob;
slices: Blob[];
name: string;
size: number;
totalChunks: number;
} {
const slices = sliceFile(file, chunkSize);
return {
originFile: file,
slices,
name: (file as File).name || "unnamed",
size: file.size,
totalChunks: slices.length,
};
}
/**
* 支持的文本文件 MIME 类型前缀
*/
export const TEXT_FILE_MIME_PREFIX = "text/";
/**
* 支持的文本文件 MIME 类型集合
*/
export const TEXT_FILE_MIME_TYPES = new Set([
"application/json",
"application/xml",
"application/csv",
"application/ndjson",
"application/x-ndjson",
"application/x-yaml",
"application/yaml",
"application/javascript",
"application/x-javascript",
"application/sql",
"application/rtf",
"application/xhtml+xml",
"application/svg+xml",
]);
/**
* 支持的文本文件扩展名集合
*/
export const TEXT_FILE_EXTENSIONS = new Set([
".txt",
".md",
".markdown",
".csv",
".tsv",
".json",
".jsonl",
".ndjson",
".log",
".xml",
".yaml",
".yml",
".sql",
".js",
".ts",
".jsx",
".tsx",
".html",
".htm",
".css",
".scss",
".less",
".py",
".java",
".c",
".cpp",
".h",
".hpp",
".go",
".rs",
".rb",
".php",
".sh",
".bash",
".zsh",
".ps1",
".bat",
".cmd",
".svg",
".rtf",
]);
/**
* 判断文件是否为文本文件(支持 UploadFile 类型)
* @param file UploadFile 对象
*/
export function isTextUploadFile(file: UploadFile): boolean {
const mimeType = (file.type || "").toLowerCase();
if (mimeType) {
if (mimeType.startsWith(TEXT_FILE_MIME_PREFIX)) return true;
if (TEXT_FILE_MIME_TYPES.has(mimeType)) return true;
}
const fileName = file.name || "";
const dotIndex = fileName.lastIndexOf(".");
if (dotIndex < 0) return false;
const ext = fileName.slice(dotIndex).toLowerCase();
return TEXT_FILE_EXTENSIONS.has(ext);
}
/**
* 判断文件名是否为文本文件
* @param fileName 文件名
*/
export function isTextFileByName(fileName: string): boolean {
const lowerName = fileName.toLowerCase();
// 仅有文件名可用,无法读取 MIME 类型,这里直接通过扩展名判断
const dotIndex = lowerName.lastIndexOf(".");
if (dotIndex < 0) return false;
const ext = lowerName.slice(dotIndex);
return TEXT_FILE_EXTENSIONS.has(ext);
}
/**
* 获取文件扩展名
* @param fileName 文件名
*/
export function getFileExtension(fileName: string): string {
const dotIndex = fileName.lastIndexOf(".");
if (dotIndex < 0) return "";
return fileName.slice(dotIndex).toLowerCase();
}
/**
* 安全地读取文件为文本
* @param file 文件对象
* @param encoding 编码,默认 UTF-8
*/
export function readFileAsText(
file: File | Blob,
encoding = "UTF-8"
): Promise<string> {
return new Promise((resolve, reject) => {
const reader = new FileReader();
reader.onload = (e) => resolve(e.target?.result as string);
reader.onerror = () => reject(new Error("Failed to read file"));
reader.readAsText(file, encoding);
});
}
/**
* 流式分割文件并逐行上传
* 使用 Blob.slice 逐块读取,避免一次性加载大文件到内存
* @param file 文件对象
* @param datasetId 数据集ID
* @param uploadFn 上传函数,接收 FormData 和配置,返回 Promise
* @param onProgress 进度回调 (currentBytes, totalBytes, uploadedLines)
* @param chunkSize 每次读取的块大小,默认 1MB
* @param options 其他选项
* @returns 上传结果统计
*/
export interface StreamUploadOptions {
reqId?: number;
resolveReqId?: (params: { totalFileNum: number; totalSize: number }) => Promise<number>;
onReqIdResolved?: (reqId: number) => void;
fileNamePrefix?: string;
hasArchive?: boolean;
prefix?: string;
signal?: AbortSignal;
maxConcurrency?: number;
}
export interface StreamUploadResult {
uploadedCount: number;
totalBytes: number;
skippedEmptyCount: number;
}
async function processFileLines(
file: File,
chunkSize: number,
signal: AbortSignal | undefined,
onLine?: (line: string, index: number) => Promise<void> | void,
onProgress?: (currentBytes: number, totalBytes: number, processedLines: number) => void
): Promise<{ lineCount: number; skippedEmptyCount: number }> {
const fileSize = file.size;
let offset = 0;
let buffer = "";
let skippedEmptyCount = 0;
let lineIndex = 0;
while (offset < fileSize) {
if (signal?.aborted) {
throw new Error("Upload cancelled");
}
const end = Math.min(offset + chunkSize, fileSize);
const chunk = file.slice(offset, end);
const text = await readFileAsText(chunk);
const combined = buffer + text;
const lines = combined.split(/\r?\n/);
buffer = lines.pop() || "";
for (const line of lines) {
if (signal?.aborted) {
throw new Error("Upload cancelled");
}
if (!line.trim()) {
skippedEmptyCount++;
continue;
}
const currentIndex = lineIndex;
lineIndex += 1;
if (onLine) {
await onLine(line, currentIndex);
}
}
offset = end;
onProgress?.(offset, fileSize, lineIndex);
}
if (buffer.trim()) {
const currentIndex = lineIndex;
lineIndex += 1;
if (onLine) {
await onLine(buffer, currentIndex);
}
} else if (buffer.length > 0) {
skippedEmptyCount++;
}
return { lineCount: lineIndex, skippedEmptyCount };
}
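Sketch of using the line scanner on its own (processFileLines is module-private in this change, so assume same-module use): count non-empty lines of a large text file without loading it whole; the trailing partial line of each chunk is carried over in `buffer` until the next read completes it.

async function countNonEmptyLines(file: File): Promise<number> {
  const { lineCount, skippedEmptyCount } = await processFileLines(
    file,
    1024 * 1024,  // read 1 MB per slice
    undefined,    // no AbortSignal
    undefined,    // no per-line handler — just counting
    (bytes, total) => console.log(`${Math.round((bytes / total) * 100)}% scanned`)
  );
  console.log(`skipped ${skippedEmptyCount} empty lines`);
  return lineCount;
}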
export async function streamSplitAndUpload(
file: File,
uploadFn: (formData: FormData, config?: { onUploadProgress?: (e: { loaded: number; total: number }) => void }) => Promise<unknown>,
onProgress?: (currentBytes: number, totalBytes: number, uploadedLines: number) => void,
chunkSize: number = 1024 * 1024, // 1MB
options: StreamUploadOptions
): Promise<StreamUploadResult> {
const {
reqId: initialReqId,
resolveReqId,
onReqIdResolved,
fileNamePrefix,
prefix,
signal,
maxConcurrency = 3,
} = options;
const fileSize = file.size;
let uploadedCount = 0;
let skippedEmptyCount = 0;
// 获取文件名基础部分和扩展名
const originalFileName = fileNamePrefix || file.name;
const lastDotIndex = originalFileName.lastIndexOf(".");
const baseName = lastDotIndex > 0 ? originalFileName.slice(0, lastDotIndex) : originalFileName;
const fileExtension = lastDotIndex > 0 ? originalFileName.slice(lastDotIndex) : "";
let resolvedReqId = initialReqId;
if (!resolvedReqId) {
const scanResult = await processFileLines(file, chunkSize, signal);
const totalFileNum = scanResult.lineCount;
skippedEmptyCount = scanResult.skippedEmptyCount;
if (totalFileNum === 0) {
return {
uploadedCount: 0,
totalBytes: fileSize,
skippedEmptyCount,
};
}
if (signal?.aborted) {
throw new Error("Upload cancelled");
}
if (!resolveReqId) {
throw new Error("Missing pre-upload request id");
}
resolvedReqId = await resolveReqId({ totalFileNum, totalSize: fileSize });
if (!resolvedReqId) {
throw new Error("Failed to resolve pre-upload request id");
}
onReqIdResolved?.(resolvedReqId);
}
if (!resolvedReqId) {
throw new Error("Missing pre-upload request id");
}
/**
* 上传单行内容
* 每行作为独立文件上传,fileNo 对应行序号,chunkNo 固定为 1
*/
async function uploadLine(line: string, index: number): Promise<void> {
// 检查是否已取消
if (signal?.aborted) {
throw new Error("Upload cancelled");
}
if (!line.trim()) {
skippedEmptyCount++;
return;
}
// 保留原始文件扩展名
const fileIndex = index + 1;
const newFileName = `${baseName}_${String(fileIndex).padStart(6, "0")}${fileExtension}`;
const blob = new Blob([line], { type: "text/plain" });
const lineFile = new File([blob], newFileName, { type: "text/plain" });
// 计算分片(小文件通常只需要一个分片)
const slices = sliceFile(lineFile, DEFAULT_CHUNK_SIZE);
const checkSum = await calculateSHA256(slices[0]);
// 检查是否已取消(计算哈希后)
if (signal?.aborted) {
throw new Error("Upload cancelled");
}
const formData = new FormData();
formData.append("file", slices[0]);
formData.append("reqId", resolvedReqId.toString());
// 每行作为独立文件上传
formData.append("fileNo", fileIndex.toString());
formData.append("chunkNo", "1");
formData.append("fileName", newFileName);
formData.append("fileSize", lineFile.size.toString());
formData.append("totalChunkNum", "1");
formData.append("checkSumHex", checkSum);
if (prefix !== undefined) {
formData.append("prefix", prefix);
}
await uploadFn(formData, {
onUploadProgress: () => {
// 单行文件很小,进度主要用于追踪上传状态
},
});
}
const inFlight = new Set<Promise<void>>();
let uploadError: unknown = null;
const enqueueUpload = async (line: string, index: number) => {
if (signal?.aborted) {
throw new Error("Upload cancelled");
}
if (uploadError) {
throw uploadError;
}
const uploadPromise = uploadLine(line, index)
.then(() => {
uploadedCount++;
})
.catch((err) => {
uploadError = err;
});
inFlight.add(uploadPromise);
uploadPromise.finally(() => inFlight.delete(uploadPromise));
if (inFlight.size >= maxConcurrency) {
await Promise.race(inFlight);
if (uploadError) {
throw uploadError;
}
}
};
let uploadResult: { lineCount: number; skippedEmptyCount: number } | null = null;
try {
uploadResult = await processFileLines(
file,
chunkSize,
signal,
enqueueUpload,
(currentBytes, totalBytes) => {
onProgress?.(currentBytes, totalBytes, uploadedCount);
}
);
if (uploadError) {
throw uploadError;
}
} finally {
if (inFlight.size > 0) {
await Promise.allSettled(inFlight);
}
}
if (!uploadResult || (initialReqId && uploadResult.lineCount === 0)) {
return {
uploadedCount: 0,
totalBytes: fileSize,
skippedEmptyCount: uploadResult?.skippedEmptyCount ?? 0,
};
}
if (!initialReqId) {
skippedEmptyCount = skippedEmptyCount || uploadResult.skippedEmptyCount;
} else {
skippedEmptyCount = uploadResult.skippedEmptyCount;
}
return {
uploadedCount,
totalBytes: fileSize,
skippedEmptyCount,
};
}
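End-to-end sketch of the intended wiring. preUploadUsingPost / uploadFileChunkUsingPost are the dataset.api calls used by the task-center hook and would need importing here; their payload/response shapes (totalFileNum, totalSize, data.reqId) and the progress-config second argument are assumptions, not confirmed signatures.

async function streamUploadDemo(bigTextFile: File) {
  const controller = new AbortController();
  const result = await streamSplitAndUpload(
    bigTextFile,
    (formData, config) => uploadFileChunkUsingPost(formData, config),
    (bytes, total, lines) => console.log(`${bytes}/${total} bytes, ${lines} lines uploaded`),
    1024 * 1024,
    {
      resolveReqId: async ({ totalFileNum, totalSize }) => {
        const res = await preUploadUsingPost({ totalFileNum, totalSize }); // assumed payload
        return Number((res as { data?: { reqId?: number } })?.data?.reqId);
      },
      onReqIdResolved: (reqId) => console.log("pre-upload reqId:", reqId),
      prefix: "datasets/raw/", // hypothetical object-store prefix
      signal: controller.signal,
      maxConcurrency: 3,
    }
  );
  console.log(result); // { uploadedCount, totalBytes, skippedEmptyCount }
}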
/**
* 判断文件是否需要流式分割上传
* @param file 文件对象
* @param threshold 阈值,默认 5MB
*/
export function shouldStreamUpload(file: File, threshold: number = 5 * 1024 * 1024): boolean {
return file.size > threshold;
}

View File

@@ -92,6 +92,14 @@ class Request {
});
}
// 监听 AbortSignal 来中止请求
if (config.signal) {
config.signal.addEventListener("abort", () => {
xhr.abort();
reject(new Error("上传已取消"));
});
}
// 监听上传进度
xhr.upload.addEventListener("progress", function (event) {
if (event.lengthComputable) {

View File

@@ -66,7 +66,7 @@ class Settings(BaseSettings):
datamate_backend_base_url: str = "http://datamate-backend:8080/api"
# 标注编辑器(Label Studio Editor)相关
editor_max_text_bytes: int = 2 * 1024 * 1024 # 2MB,避免一次加载超大文本卡死前端
editor_max_text_bytes: int = 0 # <=0 表示不限制,正数为最大字节数
# 全局设置实例
settings = Settings()

View File

@@ -19,6 +19,7 @@ from app.db.session import get_db
from app.module.annotation.schema.editor import (
EditorProjectInfo,
EditorTaskListResponse,
EditorTaskSegmentResponse,
EditorTaskResponse,
UpsertAnnotationRequest,
UpsertAnnotationResponse,
@@ -87,6 +88,21 @@ async def get_editor_task(
return StandardResponse(code=200, message="success", data=task)
@router.get(
"/projects/{project_id}/tasks/{file_id}/segments",
response_model=StandardResponse[EditorTaskSegmentResponse],
)
async def get_editor_task_segment(
project_id: str = Path(..., description="标注项目ID(t_dm_labeling_projects.id)"),
file_id: str = Path(..., description="文件ID(t_dm_dataset_files.id)"),
segment_index: int = Query(..., ge=0, alias="segmentIndex", description="段落索引(从0开始)"),
db: AsyncSession = Depends(get_db),
):
service = AnnotationEditorService(db)
result = await service.get_task_segment(project_id, file_id, segment_index)
return StandardResponse(code=200, message="success", data=result)
@router.put(
"/projects/{project_id}/tasks/{file_id}/annotation",
response_model=StandardResponse[UpsertAnnotationResponse],

View File

@@ -150,6 +150,18 @@ async def create_mapping(
labeling_project, snapshot_file_ids
)
# 如果启用了分段且为文本数据集,预生成切片结构
if dataset_type == TEXT_DATASET_TYPE and request.segmentation_enabled:
try:
from ..service.editor import AnnotationEditorService
editor_service = AnnotationEditorService(db)
# 异步预计算切片(不阻塞创建响应)
segmentation_result = await editor_service.precompute_segmentation_for_project(labeling_project.id)
logger.info(f"Precomputed segmentation for project {labeling_project.id}: {segmentation_result}")
except Exception as e:
logger.warning(f"Failed to precompute segmentation for project {labeling_project.id}: {e}")
# 不影响项目创建,只记录警告
response_data = DatasetMappingCreateResponse(
id=mapping.id,
labeling_project_id=str(mapping.labeling_project_id),

View File

@@ -79,12 +79,9 @@ class EditorTaskListResponse(BaseModel):
class SegmentInfo(BaseModel):
"""段落信息(用于文本分段标注)"""
"""段落摘要(用于文本分段标注)"""
idx: int = Field(..., description="段落索引")
text: str = Field(..., description="段落文本")
start: int = Field(..., description="在原文中的起始位置")
end: int = Field(..., description="在原文中的结束位置")
has_annotation: bool = Field(False, alias="hasAnnotation", description="该段落是否已有标注")
line_index: int = Field(0, alias="lineIndex", description="JSONL 行索引(从0开始)")
chunk_index: int = Field(0, alias="chunkIndex", description="行内分片索引(从0开始)")
@@ -100,7 +97,29 @@ class EditorTaskResponse(BaseModel):
# 分段相关字段
segmented: bool = Field(False, description="是否启用分段模式")
segments: Optional[List[SegmentInfo]] = Field(None, description="段落列表")
total_segments: int = Field(0, alias="totalSegments", description="段落")
current_segment_index: int = Field(0, alias="currentSegmentIndex", description="当前段落索引")
model_config = ConfigDict(populate_by_name=True)
class SegmentDetail(BaseModel):
"""段落内容"""
idx: int = Field(..., description="段落索引")
text: str = Field(..., description="段落文本")
has_annotation: bool = Field(False, alias="hasAnnotation", description="该段落是否已有标注")
line_index: int = Field(0, alias="lineIndex", description="JSONL 行索引(从0开始)")
chunk_index: int = Field(0, alias="chunkIndex", description="行内分片索引(从0开始)")
model_config = ConfigDict(populate_by_name=True)
class EditorTaskSegmentResponse(BaseModel):
"""编辑器单段内容响应"""
segmented: bool = Field(False, description="是否启用分段模式")
segment: Optional[SegmentDetail] = Field(None, description="段落内容")
total_segments: int = Field(0, alias="totalSegments", description="总段落数")
current_segment_index: int = Field(0, alias="currentSegmentIndex", description="当前段落索引")

View File

@@ -36,7 +36,9 @@ from app.module.annotation.schema.editor import (
EditorProjectInfo,
EditorTaskListItem,
EditorTaskListResponse,
EditorTaskSegmentResponse,
EditorTaskResponse,
SegmentDetail,
SegmentInfo,
UpsertAnnotationRequest,
UpsertAnnotationResponse,
@@ -538,6 +540,50 @@ class AnnotationEditorService:
return value
return raw_text
def _build_segment_contexts(
self,
records: List[Tuple[Optional[Dict[str, Any]], str]],
record_texts: List[str],
segment_annotation_keys: set[str],
) -> Tuple[List[SegmentInfo], List[Tuple[Optional[Dict[str, Any]], str, str, int, int]]]:
splitter = AnnotationTextSplitter(max_chars=self.SEGMENT_THRESHOLD)
segments: List[SegmentInfo] = []
segment_contexts: List[Tuple[Optional[Dict[str, Any]], str, str, int, int]] = []
segment_cursor = 0
for record_index, ((payload, raw_text), record_text) in enumerate(zip(records, record_texts)):
normalized_text = record_text or ""
if len(normalized_text) > self.SEGMENT_THRESHOLD:
raw_segments = splitter.split(normalized_text)
for chunk_index, seg in enumerate(raw_segments):
segments.append(
SegmentInfo(
idx=segment_cursor,
hasAnnotation=str(segment_cursor) in segment_annotation_keys,
lineIndex=record_index,
chunkIndex=chunk_index,
)
)
segment_contexts.append((payload, raw_text, seg["text"], record_index, chunk_index))
segment_cursor += 1
else:
segments.append(
SegmentInfo(
idx=segment_cursor,
hasAnnotation=str(segment_cursor) in segment_annotation_keys,
lineIndex=record_index,
chunkIndex=0,
)
)
segment_contexts.append((payload, raw_text, normalized_text, record_index, 0))
segment_cursor += 1
if not segments:
segments = [SegmentInfo(idx=0, hasAnnotation=False, lineIndex=0, chunkIndex=0)]
segment_contexts = [(None, "", "", 0, 0)]
return segments, segment_contexts
async def get_project_info(self, project_id: str) -> EditorProjectInfo:
project = await self._get_project_or_404(project_id)
@@ -668,6 +714,124 @@ class AnnotationEditorService:
return await self._build_text_task(project, file_record, file_id, segment_index)
async def get_task_segment(
self,
project_id: str,
file_id: str,
segment_index: int,
) -> EditorTaskSegmentResponse:
project = await self._get_project_or_404(project_id)
dataset_type = self._normalize_dataset_type(await self._get_dataset_type(project.dataset_id))
if dataset_type != DATASET_TYPE_TEXT:
raise HTTPException(
status_code=400,
detail="当前仅支持 TEXT 项目的段落内容",
)
file_result = await self.db.execute(
select(DatasetFiles).where(
DatasetFiles.id == file_id,
DatasetFiles.dataset_id == project.dataset_id,
)
)
file_record = file_result.scalar_one_or_none()
if not file_record:
raise HTTPException(status_code=404, detail=f"文件不存在或不属于该项目: {file_id}")
if not self._resolve_segmentation_enabled(project):
return EditorTaskSegmentResponse(
segmented=False,
segment=None,
totalSegments=0,
currentSegmentIndex=0,
)
text_content = await self._fetch_text_content_via_download_api(project.dataset_id, file_id)
assert isinstance(text_content, str)
label_config = await self._resolve_project_label_config(project)
primary_text_key = self._resolve_primary_text_key(label_config)
file_name = str(getattr(file_record, "file_name", "")).lower()
records: List[Tuple[Optional[Dict[str, Any]], str]] = []
if file_name.endswith(JSONL_EXTENSION):
records = self._parse_jsonl_records(text_content)
else:
parsed_payload = self._try_parse_json_payload(text_content)
if parsed_payload:
records = [(parsed_payload, text_content)]
if not records:
records = [(None, text_content)]
record_texts = [
self._resolve_primary_text_value(payload, raw_text, primary_text_key)
for payload, raw_text in records
]
if not record_texts:
record_texts = [text_content]
needs_segmentation = len(records) > 1 or any(
len(text or "") > self.SEGMENT_THRESHOLD for text in record_texts
)
if not needs_segmentation:
return EditorTaskSegmentResponse(
segmented=False,
segment=None,
totalSegments=0,
currentSegmentIndex=0,
)
ann_result = await self.db.execute(
select(AnnotationResult).where(
AnnotationResult.project_id == project.id,
AnnotationResult.file_id == file_id,
)
)
ann = ann_result.scalar_one_or_none()
segment_annotations: Dict[str, Dict[str, Any]] = {}
if ann and isinstance(ann.annotation, dict):
segment_annotations = self._extract_segment_annotations(ann.annotation)
segment_annotation_keys = set(segment_annotations.keys())
segments, segment_contexts = self._build_segment_contexts(
records,
record_texts,
segment_annotation_keys,
)
total_segments = len(segment_contexts)
if total_segments == 0:
return EditorTaskSegmentResponse(
segmented=False,
segment=None,
totalSegments=0,
currentSegmentIndex=0,
)
if segment_index < 0 or segment_index >= total_segments:
raise HTTPException(
status_code=400,
detail=f"segmentIndex 超出范围: {segment_index}",
)
segment_info = segments[segment_index]
_, _, segment_text, line_index, chunk_index = segment_contexts[segment_index]
segment_detail = SegmentDetail(
idx=segment_info.idx,
text=segment_text,
hasAnnotation=segment_info.has_annotation,
lineIndex=line_index,
chunkIndex=chunk_index,
)
return EditorTaskSegmentResponse(
segmented=True,
segment=segment_detail,
totalSegments=total_segments,
currentSegmentIndex=segment_index,
)
async def _build_text_task(
self,
project: LabelingProject,
@@ -723,7 +887,8 @@ class AnnotationEditorService:
needs_segmentation = segmentation_enabled and (
len(records) > 1 or any(len(text or "") > self.SEGMENT_THRESHOLD for text in record_texts)
)
segments: Optional[List[SegmentInfo]] = None
segments: List[SegmentInfo] = []
segment_contexts: List[Tuple[Optional[Dict[str, Any]], str, str, int, int]] = []
current_segment_index = 0
display_text = record_texts[0] if record_texts else text_content
selected_payload = records[0][0] if records else None
@@ -732,46 +897,13 @@ class AnnotationEditorService:
display_text = "\n".join(record_texts) if record_texts else text_content
if needs_segmentation:
splitter = AnnotationTextSplitter(max_chars=self.SEGMENT_THRESHOLD)
segment_contexts: List[Tuple[Optional[Dict[str, Any]], str, str, int, int]] = []
segments = []
segment_cursor = 0
for record_index, ((payload, raw_text), record_text) in enumerate(zip(records, record_texts)):
normalized_text = record_text or ""
if len(normalized_text) > self.SEGMENT_THRESHOLD:
raw_segments = splitter.split(normalized_text)
for chunk_index, seg in enumerate(raw_segments):
segments.append(SegmentInfo(
idx=segment_cursor,
text=seg["text"],
start=seg["start"],
end=seg["end"],
hasAnnotation=str(segment_cursor) in segment_annotation_keys,
lineIndex=record_index,
chunkIndex=chunk_index,
))
segment_contexts.append((payload, raw_text, seg["text"], record_index, chunk_index))
segment_cursor += 1
else:
segments.append(SegmentInfo(
idx=segment_cursor,
text=normalized_text,
start=0,
end=len(normalized_text),
hasAnnotation=str(segment_cursor) in segment_annotation_keys,
lineIndex=record_index,
chunkIndex=0,
))
segment_contexts.append((payload, raw_text, normalized_text, record_index, 0))
segment_cursor += 1
if not segments:
segments = [SegmentInfo(idx=0, text="", start=0, end=0, hasAnnotation=False, lineIndex=0, chunkIndex=0)]
segment_contexts = [(None, "", "", 0, 0)]
_, segment_contexts = self._build_segment_contexts(
records,
record_texts,
segment_annotation_keys,
)
current_segment_index = segment_index if segment_index is not None else 0
if current_segment_index < 0 or current_segment_index >= len(segments):
if current_segment_index < 0 or current_segment_index >= len(segment_contexts):
current_segment_index = 0
selected_payload, _, display_text, _, _ = segment_contexts[current_segment_index]
@@ -849,8 +981,7 @@ class AnnotationEditorService:
task=task,
annotationUpdatedAt=annotation_updated_at,
segmented=needs_segmentation,
segments=segments,
totalSegments=len(segments) if segments else 1,
totalSegments=len(segment_contexts) if needs_segmentation else 1,
currentSegmentIndex=current_segment_index,
)
@@ -1185,3 +1316,195 @@ class AnnotationEditorService:
except Exception as exc:
logger.warning("标注同步知识管理失败:%s", exc)
async def precompute_segmentation_for_project(
self,
project_id: str,
max_retries: int = 3
) -> Dict[str, Any]:
"""
为指定项目的所有文本文件预计算切片结构并持久化到数据库
Args:
project_id: 标注项目ID
max_retries: 失败重试次数
Returns:
统计信息:{total_files, succeeded, failed}
"""
project = await self._get_project_or_404(project_id)
dataset_type = self._normalize_dataset_type(await self._get_dataset_type(project.dataset_id))
# 只处理文本数据集
if dataset_type != DATASET_TYPE_TEXT:
logger.info(f"项目 {project_id} 不是文本数据集,跳过切片预生成")
return {"total_files": 0, "succeeded": 0, "failed": 0}
# 检查是否启用分段
if not self._resolve_segmentation_enabled(project):
logger.info(f"项目 {project_id} 未启用分段,跳过切片预生成")
return {"total_files": 0, "succeeded": 0, "failed": 0}
# 获取项目的所有文本文件(排除源文档)
files_result = await self.db.execute(
select(DatasetFiles)
.join(LabelingProjectFile, LabelingProjectFile.file_id == DatasetFiles.id)
.where(
LabelingProjectFile.project_id == project_id,
DatasetFiles.dataset_id == project.dataset_id,
)
)
file_records = files_result.scalars().all()
if not file_records:
logger.info(f"项目 {project_id} 没有文件,跳过切片预生成")
return {"total_files": 0, "succeeded": 0, "failed": 0}
# 过滤源文档文件
valid_files = []
for file_record in file_records:
file_type = str(getattr(file_record, "file_type", "") or "").lower()
file_name = str(getattr(file_record, "file_name", "")).lower()
is_source_document = (
file_type in SOURCE_DOCUMENT_TYPES or
any(file_name.endswith(ext) for ext in SOURCE_DOCUMENT_EXTENSIONS)
)
if not is_source_document:
valid_files.append(file_record)
total_files = len(valid_files)
succeeded = 0
failed = 0
label_config = await self._resolve_project_label_config(project)
primary_text_key = self._resolve_primary_text_key(label_config)
for file_record in valid_files:
file_id = str(file_record.id) # type: ignore
file_name = str(getattr(file_record, "file_name", ""))
for retry in range(max_retries):
try:
# 读取文本内容
text_content = await self._fetch_text_content_via_download_api(project.dataset_id, file_id)
if not isinstance(text_content, str):
logger.warning(f"文件 {file_id} 内容不是字符串,跳过切片")
failed += 1
break
# 解析文本记录
records: List[Tuple[Optional[Dict[str, Any]], str]] = []
if file_name.lower().endswith(JSONL_EXTENSION):
records = self._parse_jsonl_records(text_content)
else:
parsed_payload = self._try_parse_json_payload(text_content)
if parsed_payload:
records = [(parsed_payload, text_content)]
if not records:
records = [(None, text_content)]
record_texts = [
self._resolve_primary_text_value(payload, raw_text, primary_text_key)
for payload, raw_text in records
]
if not record_texts:
record_texts = [text_content]
# 判断是否需要分段
needs_segmentation = len(records) > 1 or any(
len(text or "") > self.SEGMENT_THRESHOLD for text in record_texts
)
if not needs_segmentation:
# 不需要分段的文件,跳过
succeeded += 1
break
# 执行切片
splitter = AnnotationTextSplitter(max_chars=self.SEGMENT_THRESHOLD)
segment_cursor = 0
segments = {}
for record_index, ((payload, raw_text), record_text) in enumerate(zip(records, record_texts)):
normalized_text = record_text or ""
if len(normalized_text) > self.SEGMENT_THRESHOLD:
raw_segments = splitter.split(normalized_text)
for chunk_index, seg in enumerate(raw_segments):
segments[str(segment_cursor)] = {
SEGMENT_RESULT_KEY: [],
SEGMENT_CREATED_AT_KEY: datetime.utcnow().isoformat() + "Z",
SEGMENT_UPDATED_AT_KEY: datetime.utcnow().isoformat() + "Z",
}
segment_cursor += 1
else:
segments[str(segment_cursor)] = {
SEGMENT_RESULT_KEY: [],
SEGMENT_CREATED_AT_KEY: datetime.utcnow().isoformat() + "Z",
SEGMENT_UPDATED_AT_KEY: datetime.utcnow().isoformat() + "Z",
}
segment_cursor += 1
if not segments:
succeeded += 1
break
# 构造分段标注结构
final_payload = {
SEGMENTED_KEY: True,
"version": 1,
SEGMENTS_KEY: segments,
SEGMENT_TOTAL_KEY: segment_cursor,
}
# 检查是否已存在标注
existing_result = await self.db.execute(
select(AnnotationResult).where(
AnnotationResult.project_id == project_id,
AnnotationResult.file_id == file_id,
)
)
existing = existing_result.scalar_one_or_none()
now = datetime.utcnow()
if existing:
# 更新现有标注
existing.annotation = final_payload # type: ignore[assignment]
existing.annotation_status = ANNOTATION_STATUS_IN_PROGRESS # type: ignore[assignment]
existing.updated_at = now # type: ignore[assignment]
else:
# 创建新标注记录
record = AnnotationResult(
id=str(uuid.uuid4()),
project_id=project_id,
file_id=file_id,
annotation=final_payload,
annotation_status=ANNOTATION_STATUS_IN_PROGRESS,
created_at=now,
updated_at=now,
)
self.db.add(record)
await self.db.commit()
succeeded += 1
logger.info(f"成功为文件 {file_id} 预生成 {segment_cursor} 个切片")
break
except Exception as e:
logger.warning(
f"为文件 {file_id} 预生成切片失败 (重试 {retry + 1}/{max_retries}): {e}"
)
if retry == max_retries - 1:
failed += 1
await self.db.rollback()
logger.info(
f"项目 {project_id} 切片预生成完成: 总计 {total_files}, 成功 {succeeded}, 失败 {failed}"
)
return {
"total_files": total_files,
"succeeded": succeeded,
"failed": failed,
}


@@ -11,7 +11,6 @@ from sqlalchemy.ext.asyncio import AsyncSession
from app.core.config import settings
from app.core.logging import get_logger
from app.db.models import Dataset, DatasetFiles, LabelingProject
from app.module.annotation.service.text_fetcher import fetch_text_content_via_download_api
logger = get_logger(__name__)
@@ -77,15 +76,18 @@ class KnowledgeSyncService:
if set_id:
exists = await self._get_knowledge_set(set_id)
if exists:
if exists and self._metadata_matches_project(exists.get("metadata"), project.id):
return set_id
logger.warning("知识集不存在,准备重建:set_id=%s", set_id)
logger.warning(
"知识集不存在或归属不匹配,准备重建:set_id=%s project_id=%s",
set_id,
project.id,
)
dataset_name = project.name or "annotation-project"
base_name = dataset_name.strip() or "annotation-project"
project_name = (project.name or "annotation-project").strip() or "annotation-project"
metadata = self._build_set_metadata(project)
existing = await self._find_knowledge_set_by_name(base_name)
existing = await self._find_knowledge_set_by_name_and_project(project_name, project.id)
if existing:
await self._update_project_config(
project,
@@ -96,19 +98,19 @@ class KnowledgeSyncService:
)
return existing.get("id")
created = await self._create_knowledge_set(base_name, metadata)
created = await self._create_knowledge_set(project_name, metadata)
if not created:
created = await self._find_knowledge_set_by_name(base_name)
created = await self._find_knowledge_set_by_name_and_project(project_name, project.id)
if not created:
fallback_name = self._build_fallback_set_name(base_name, project.id)
existing = await self._find_knowledge_set_by_name(fallback_name)
fallback_name = self._build_fallback_set_name(project_name, project.id)
existing = await self._find_knowledge_set_by_name_and_project(fallback_name, project.id)
if existing:
created = existing
else:
created = await self._create_knowledge_set(fallback_name, metadata)
if not created:
created = await self._find_knowledge_set_by_name(fallback_name)
created = await self._find_knowledge_set_by_name_and_project(fallback_name, project.id)
if not created:
return None
@@ -153,16 +155,18 @@ class KnowledgeSyncService:
return []
return [item for item in content if isinstance(item, dict)]
async def _find_knowledge_set_by_name(self, name: str) -> Optional[Dict[str, Any]]:
async def _find_knowledge_set_by_name_and_project(self, name: str, project_id: str) -> Optional[Dict[str, Any]]:
if not name:
return None
items = await self._list_knowledge_sets(name)
if not items:
return None
exact_matches = [item for item in items if item.get("name") == name]
if not exact_matches:
return None
return exact_matches[0]
for item in items:
if item.get("name") != name:
continue
if self._metadata_matches_project(item.get("metadata"), project_id):
return item
return None
async def _create_knowledge_set(self, name: str, metadata: str) -> Optional[Dict[str, Any]]:
payload = {
@@ -249,16 +253,6 @@ class KnowledgeSyncService:
content_type = "MARKDOWN"
content = annotation_json
if dataset_type == "TEXT":
try:
content = await fetch_text_content_via_download_api(
project.dataset_id,
str(file_record.id),
)
content = self._append_annotation_to_content(content, annotation_json, content_type)
except Exception as exc:
logger.warning("读取文本失败,改为仅存标注JSON:%s", exc)
content = annotation_json
payload: Dict[str, Any] = {
"title": title,
@@ -289,13 +283,6 @@ class KnowledgeSyncService:
extension = file_type
return extension.lower() in {"md", "markdown"}
def _append_annotation_to_content(self, content: str, annotation_json: str, content_type: str) -> str:
if content_type == "MARKDOWN":
return (
f"{content}\n\n---\n\n## 标注结果\n\n```json\n"
f"{annotation_json}\n```")
return f"{content}\n\n---\n\n标注结果(JSON):\n{annotation_json}"
def _strip_extension(self, file_name: str) -> str:
if not file_name:
return ""
@@ -359,6 +346,27 @@ class KnowledgeSyncService:
except Exception:
return json.dumps({"error": "failed to serialize"}, ensure_ascii=False)
def _metadata_matches_project(self, metadata: Any, project_id: str) -> bool:
if not project_id:
return False
parsed = self._parse_metadata(metadata)
if not parsed:
return False
return str(parsed.get("project_id") or "").strip() == project_id
def _parse_metadata(self, metadata: Any) -> Optional[Dict[str, Any]]:
if metadata is None:
return None
if isinstance(metadata, dict):
return metadata
if isinstance(metadata, str):
try:
payload = json.loads(metadata)
except Exception:
return None
return payload if isinstance(payload, dict) else None
return None
def _safe_response_text(self, response: httpx.Response) -> str:
try:
return response.text


@@ -19,23 +19,24 @@ async def fetch_text_content_via_download_api(dataset_id: str, file_id: str) ->
resp = await client.get(url)
resp.raise_for_status()
max_bytes = settings.editor_max_text_bytes
content_length = resp.headers.get("content-length")
if content_length:
if max_bytes > 0 and content_length:
try:
if int(content_length) > settings.editor_max_text_bytes:
if int(content_length) > max_bytes:
raise HTTPException(
status_code=413,
detail=f"文本文件过大,限制 {settings.editor_max_text_bytes} 字节",
detail=f"文本文件过大,限制 {max_bytes} 字节",
)
except ValueError:
# content-length 非法则忽略,走实际长度判断
pass
data = resp.content
if len(data) > settings.editor_max_text_bytes:
if max_bytes > 0 and len(data) > max_bytes:
raise HTTPException(
status_code=413,
detail=f"文本文件过大,限制 {settings.editor_max_text_bytes} 字节",
detail=f"文本文件过大,限制 {max_bytes} 字节",
)
# TEXT POC:默认按 UTF-8 解码,不可解码字符用替换符处理


@@ -45,7 +45,7 @@ RUN npm config set registry https://registry.npmmirror.com && \
##### RUNNER
FROM gcr.io/distroless/nodejs20-debian12 AS runner
FROM gcr.nju.edu.cn/distroless/nodejs20-debian12 AS runner
WORKDIR /app
ENV NODE_ENV=production


@@ -0,0 +1,93 @@
# backend-python Dockerfile 离线版本
# 修改点: 使用本地 DataX 源码替代 git clone
FROM maven:3-eclipse-temurin-8 AS datax-builder
# 配置 Maven 阿里云镜像
RUN mkdir -p /root/.m2 && \
echo '<?xml version="1.0" encoding="UTF-8"?>\n\
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"\n\
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n\
xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">\n\
<mirrors>\n\
<mirror>\n\
<id>aliyunmaven</id>\n\
<mirrorOf>*</mirrorOf>\n\
<name>阿里云公共仓库</name>\n\
<url>https://maven.aliyun.com/repository/public</url>\n\
</mirror>\n\
</mirrors>\n\
</settings>' > /root/.m2/settings.xml
# 离线模式: 从构建参数获取本地 DataX 路径
ARG DATAX_LOCAL_PATH=./build-cache/resources/DataX
# 复制本地 DataX 源码(离线环境预先下载)
COPY ${DATAX_LOCAL_PATH} /DataX
COPY runtime/datax/ DataX/
RUN cd DataX && \
sed -i "s/com.mysql.jdbc.Driver/com.mysql.cj.jdbc.Driver/g" \
plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DataBaseType.java && \
mvn -U clean package assembly:assembly -Dmaven.test.skip=true
FROM python:3.12-slim
# 配置 apt 阿里云镜像源
RUN if [ -f /etc/apt/sources.list.d/debian.sources ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources; \
elif [ -f /etc/apt/sources.list ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list; \
fi && \
apt-get update && \
apt-get install -y --no-install-recommends vim openjdk-21-jre nfs-common glusterfs-client rsync && \
rm -rf /var/lib/apt/lists/*
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
POETRY_VERSION=2.2.1 \
POETRY_NO_INTERACTION=1 \
POETRY_VIRTUALENVS_CREATE=false \
POETRY_CACHE_DIR=/tmp/poetry_cache
ENV JAVA_HOME=/usr/lib/jvm/java-21-openjdk
ENV PATH="/root/.local/bin:$JAVA_HOME/bin:$PATH"
WORKDIR /app
# 配置 pip 阿里云镜像并安装 Poetry
RUN --mount=type=cache,target=/root/.cache/pip \
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/ && \
pip config set global.trusted-host mirrors.aliyun.com && \
pip install --upgrade --root-user-action=ignore pip \
&& pip install --root-user-action=ignore pipx \
&& pipx install "poetry==$POETRY_VERSION"
COPY --from=datax-builder /DataX/target/datax/datax /opt/datax
RUN cp /opt/datax/plugin/reader/mysqlreader/libs/mysql* /opt/datax/plugin/reader/starrocksreader/libs/
# Copy only dependency files first
COPY runtime/datamate-python/pyproject.toml runtime/datamate-python/poetry.lock* /app/
# Install dependencies
RUN --mount=type=cache,target=$POETRY_CACHE_DIR \
poetry install --no-root --only main
# 离线模式: 使用本地 NLTK 数据
ARG NLTK_DATA_LOCAL_PATH=./build-cache/resources/nltk_data
COPY ${NLTK_DATA_LOCAL_PATH} /usr/local/nltk_data
ENV NLTK_DATA=/usr/local/nltk_data
# Copy the rest of the application
COPY runtime/datamate-python /app
COPY runtime/datamate-python/deploy/docker-entrypoint.sh /docker-entrypoint.sh
RUN chmod +x /docker-entrypoint.sh || true
# Expose the application port
EXPOSE 18000
ENTRYPOINT ["/docker-entrypoint.sh"]


@@ -0,0 +1,82 @@
# backend-python Dockerfile 离线版本 v2
FROM maven:3-eclipse-temurin-8 AS datax-builder
# 配置 Maven 阿里云镜像
RUN mkdir -p /root/.m2 && \
echo '<?xml version="1.0" encoding="UTF-8"?>\n\
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"\n\
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n\
xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">\n\
<mirrors>\n\
<mirror>\n\
<id>aliyunmaven</id>\n\
<mirrorOf>*</mirrorOf>\n\
<name>阿里云公共仓库</name>\n\
<url>https://maven.aliyun.com/repository/public</url>\n\
</mirror>\n\
</mirrors>\n\
</settings>' > /root/.m2/settings.xml
# 离线模式: 从构建参数获取本地 DataX 路径
ARG RESOURCES_DIR=./build-cache/resources
ARG DATAX_LOCAL_PATH=${RESOURCES_DIR}/DataX
# 复制本地 DataX 源码
COPY ${DATAX_LOCAL_PATH} /DataX
COPY runtime/datax/ DataX/
RUN cd DataX && \
sed -i "s/com.mysql.jdbc.Driver/com.mysql.cj.jdbc.Driver/g" \
plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DataBaseType.java && \
mvn -U clean package assembly:assembly -Dmaven.test.skip=true
# 使用预装 APT 包的基础镜像
FROM datamate-python-base:latest
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
POETRY_VERSION=2.2.1 \
POETRY_NO_INTERACTION=1 \
POETRY_VIRTUALENVS_CREATE=false \
POETRY_CACHE_DIR=/tmp/poetry_cache
ENV JAVA_HOME=/usr/lib/jvm/java-21-openjdk
ENV PATH="/root/.local/bin:$JAVA_HOME/bin:$PATH"
WORKDIR /app
# 配置 pip 阿里云镜像并安装 Poetry
RUN --mount=type=cache,target=/root/.cache/pip \
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/ && \
pip config set global.trusted-host mirrors.aliyun.com && \
pip install --upgrade --root-user-action=ignore pip \
&& pip install --root-user-action=ignore pipx \
&& pipx install "poetry==$POETRY_VERSION"
COPY --from=datax-builder /DataX/target/datax/datax /opt/datax
RUN cp /opt/datax/plugin/reader/mysqlreader/libs/mysql* /opt/datax/plugin/reader/starrocksreader/libs/
# Copy only dependency files first
COPY runtime/datamate-python/pyproject.toml runtime/datamate-python/poetry.lock* /app/
# Install dependencies
RUN --mount=type=cache,target=$POETRY_CACHE_DIR \
poetry install --no-root --only main
# 离线模式: 使用本地 NLTK 数据
ARG RESOURCES_DIR=./build-cache/resources
ARG NLTK_DATA_LOCAL_PATH=${RESOURCES_DIR}/nltk_data
COPY ${NLTK_DATA_LOCAL_PATH} /usr/local/nltk_data
ENV NLTK_DATA=/usr/local/nltk_data
# Copy the rest of the application
COPY runtime/datamate-python /app
COPY runtime/datamate-python/deploy/docker-entrypoint.sh /docker-entrypoint.sh
RUN chmod +x /docker-entrypoint.sh || true
EXPOSE 18000
ENTRYPOINT ["/docker-entrypoint.sh"]


@@ -0,0 +1,71 @@
# backend Dockerfile 离线版本
# 使用预装 APT 包的基础镜像
FROM maven:3-eclipse-temurin-21 AS builder
# 配置 Maven 阿里云镜像
RUN mkdir -p /root/.m2 && \
echo '<?xml version="1.0" encoding="UTF-8"?>\n\
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"\n\
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n\
xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">\n\
<mirrors>\n\
<mirror>\n\
<id>aliyunmaven</id>\n\
<mirrorOf>*</mirrorOf>\n\
<name>阿里云公共仓库</name>\n\
<url>https://maven.aliyun.com/repository/public</url>\n\
</mirror>\n\
</mirrors>\n\
</settings>' > /root/.m2/settings.xml
WORKDIR /opt/backend
# 先复制所有 pom.xml 文件
COPY backend/pom.xml ./
COPY backend/services/pom.xml ./services/
COPY backend/shared/domain-common/pom.xml ./shared/domain-common/
COPY backend/shared/security-common/pom.xml ./shared/security-common/
COPY backend/services/data-annotation-service/pom.xml ./services/data-annotation-service/
COPY backend/services/data-cleaning-service/pom.xml ./services/data-cleaning-service/
COPY backend/services/data-evaluation-service/pom.xml ./services/data-evaluation-service/
COPY backend/services/data-management-service/pom.xml ./services/data-management-service/
COPY backend/services/data-synthesis-service/pom.xml ./services/data-synthesis-service/
COPY backend/services/execution-engine-service/pom.xml ./services/execution-engine-service/
COPY backend/services/main-application/pom.xml ./services/main-application/
COPY backend/services/operator-market-service/pom.xml ./services/operator-market-service/
COPY backend/services/pipeline-orchestration-service/pom.xml ./services/pipeline-orchestration-service/
COPY backend/services/rag-indexer-service/pom.xml ./services/rag-indexer-service/
COPY backend/services/rag-query-service/pom.xml ./services/rag-query-service/
# 使用缓存卷下载依赖
RUN --mount=type=cache,target=/root/.m2/repository \
cd /opt/backend/services && \
mvn dependency:go-offline -Dmaven.test.skip=true || true
# 复制所有源代码
COPY backend/ /opt/backend
# 编译打包
RUN --mount=type=cache,target=/root/.m2/repository \
cd /opt/backend/services && \
mvn clean package -Dmaven.test.skip=true
# 使用预装 APT 包的基础镜像
FROM datamate-java-base:latest
# 不再执行 apt-get update,因为基础镜像已经预装了所有需要的包
# 如果需要添加额外的包,可以在这里添加,但离线环境下会失败
COPY --from=builder /opt/backend/services/main-application/target/datamate.jar /opt/backend/datamate.jar
COPY scripts/images/backend/start.sh /opt/backend/start.sh
COPY runtime/ops/examples/test_operator/test_operator.tar /opt/backend/test_operator.tar
RUN dos2unix /opt/backend/start.sh \
&& chmod +x /opt/backend/start.sh \
&& ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
EXPOSE 8080
ENTRYPOINT ["/opt/backend/start.sh"]
CMD ["java", "-Duser.timezone=Asia/Shanghai", "-jar", "/opt/backend/datamate.jar"]


@@ -0,0 +1,62 @@
# 预安装 APT 包的基础镜像
# 在有网环境构建这些镜像,在无网环境作为基础镜像使用
# ==================== backend / gateway 基础镜像 ====================
FROM eclipse-temurin:21-jdk AS datamate-java-base
# 配置 apt 阿里云镜像源
RUN if [ -f /etc/apt/sources.list.d/debian.sources ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources; \
elif [ -f /etc/apt/sources.list.d/ubuntu.sources ]; then \
sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g; s/security.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list.d/ubuntu.sources; \
elif [ -f /etc/apt/sources.list ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g; s/archive.ubuntu.com/mirrors.aliyun.com/g; s/security.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list; \
fi && \
apt-get update && \
apt-get install -y vim wget curl rsync python3 python3-pip python-is-python3 dos2unix libreoffice fonts-noto-cjk && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# ==================== backend-python 基础镜像 ====================
FROM python:3.12-slim AS datamate-python-base
RUN if [ -f /etc/apt/sources.list.d/debian.sources ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources; \
elif [ -f /etc/apt/sources.list ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list; \
fi && \
apt-get update && \
apt-get install -y --no-install-recommends vim openjdk-21-jre nfs-common glusterfs-client rsync && \
rm -rf /var/lib/apt/lists/*
# ==================== runtime 基础镜像 ====================
FROM ghcr.nju.edu.cn/astral-sh/uv:python3.11-bookworm AS datamate-runtime-base
RUN if [ -f /etc/apt/sources.list.d/debian.sources ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources; \
elif [ -f /etc/apt/sources.list ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list; \
fi && \
apt update && \
apt install -y libgl1 libglib2.0-0 vim libmagic1 libreoffice dos2unix swig poppler-utils tesseract-ocr && \
rm -rf /var/lib/apt/lists/*
# ==================== deer-flow-backend 基础镜像 ====================
FROM ghcr.nju.edu.cn/astral-sh/uv:python3.12-bookworm AS deer-flow-backend-base
RUN if [ -f /etc/apt/sources.list.d/debian.sources ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources; \
elif [ -f /etc/apt/sources.list ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list; \
fi && \
apt-get update && apt-get install -y libpq-dev git && \
rm -rf /var/lib/apt/lists/*
# ==================== mineru 基础镜像 ====================
FROM python:3.11-slim AS mineru-base
RUN sed -i 's/deb.debian.org/mirrors.huaweicloud.com/g' /etc/apt/sources.list.d/debian.sources && \
apt-get update && \
apt-get install -y curl vim libgl1 libglx0 libopengl0 libglib2.0-0 procps && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*


@@ -0,0 +1,44 @@
# deer-flow-backend Dockerfile 离线版本
# 修改点: 使用本地 deer-flow 源码替代 git clone
FROM ghcr.nju.edu.cn/astral-sh/uv:python3.12-bookworm
# Install uv.
COPY --from=ghcr.nju.edu.cn/astral-sh/uv:latest /uv /bin/uv
# 配置 apt 阿里云镜像源并安装系统依赖
RUN if [ -f /etc/apt/sources.list.d/debian.sources ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources; \
elif [ -f /etc/apt/sources.list ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list; \
fi && \
apt-get update && apt-get install -y \
libpq-dev git \
&& rm -rf /var/lib/apt/lists/*
# 配置 uv 使用阿里云 PyPI 镜像
ENV UV_INDEX_URL="https://mirrors.aliyun.com/pypi/simple/"
WORKDIR /app
# 离线模式: 本地 deer-flow 路径
ARG RESOURCES_DIR=./build-cache/resources
ARG DEERFLOW_DIR=${RESOURCES_DIR}/deer-flow
# 复制本地 deer-flow 源码(离线环境预先下载)
COPY ${DEERFLOW_DIR} /app
COPY runtime/deer-flow/.env /app/.env
COPY runtime/deer-flow/conf.yaml /app/conf.yaml
# Pre-cache the application dependencies.
RUN --mount=type=cache,target=/root/.cache/uv \
uv sync --locked --no-install-project
# Install the application dependencies.
RUN --mount=type=cache,target=/root/.cache/uv \
uv sync --locked
EXPOSE 8000
# Run the application.
CMD ["uv", "run", "--no-sync", "python", "server.py", "--host", "0.0.0.0", "--port", "8000"]


@@ -0,0 +1,75 @@
# deer-flow-frontend Dockerfile 离线版本
# 修改点: 使用本地 deer-flow 源码替代 git clone
##### DEPENDENCIES
FROM node:20-alpine AS deps
RUN apk add --no-cache libc6-compat openssl
WORKDIR /app
# 离线模式: 本地 deer-flow 路径
ARG RESOURCES_DIR=./build-cache/resources
ARG DEERFLOW_DIR=${RESOURCES_DIR}/deer-flow
# 复制本地 deer-flow 源码
COPY ${DEERFLOW_DIR}/web /app
# 配置 npm 淘宝镜像并安装依赖
RUN npm config set registry https://registry.npmmirror.com && \
if [ -f yarn.lock ]; then yarn config set registry https://registry.npmmirror.com && yarn --frozen-lockfile; \
elif [ -f package-lock.json ]; then npm ci; \
elif [ -f pnpm-lock.yaml ]; then npm install -g pnpm && pnpm config set registry https://registry.npmmirror.com && pnpm i; \
else echo "Lockfile not found." && exit 1; \
fi
##### BUILDER
FROM node:20-alpine AS builder
RUN apk add --no-cache git
WORKDIR /app
ARG NEXT_PUBLIC_API_URL="/deer-flow-backend"
# 离线模式: 复制本地源码
ARG RESOURCES_DIR=./build-cache/resources
ARG DEERFLOW_DIR=${RESOURCES_DIR}/deer-flow
COPY ${DEERFLOW_DIR} /deer-flow
RUN cd /deer-flow \
&& mv /deer-flow/web/* /app \
&& rm -rf /deer-flow
COPY --from=deps /app/node_modules ./node_modules
ENV NEXT_TELEMETRY_DISABLED=1
# 配置 npm 淘宝镜像
RUN npm config set registry https://registry.npmmirror.com && \
if [ -f yarn.lock ]; then yarn config set registry https://registry.npmmirror.com && SKIP_ENV_VALIDATION=1 yarn build; \
elif [ -f package-lock.json ]; then SKIP_ENV_VALIDATION=1 npm run build; \
elif [ -f pnpm-lock.yaml ]; then npm install -g pnpm && pnpm config set registry https://registry.npmmirror.com && SKIP_ENV_VALIDATION=1 pnpm run build; \
else echo "Lockfile not found." && exit 1; \
fi
##### RUNNER
FROM gcr.nju.edu.cn/distroless/nodejs20-debian12 AS runner
WORKDIR /app
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1
COPY --from=builder /app/next.config.js ./
COPY --from=builder /app/public ./public
COPY --from=builder /app/package.json ./package.json
COPY --from=builder /app/.next/standalone ./
COPY --from=builder /app/.next/static ./.next/static
EXPOSE 3000
ENV PORT=3000
CMD ["server.js"]


@@ -0,0 +1,47 @@
# gateway Dockerfile 离线版本
FROM maven:3-eclipse-temurin-21 AS builder
# 配置 Maven 阿里云镜像
RUN mkdir -p /root/.m2 && \
echo '<?xml version="1.0" encoding="UTF-8"?>\n\
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"\n\
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n\
xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">\n\
<mirrors>\n\
<mirror>\n\
<id>aliyunmaven</id>\n\
<mirrorOf>*</mirrorOf>\n\
<name>阿里云公共仓库</name>\n\
<url>https://maven.aliyun.com/repository/public</url>\n\
</mirror>\n\
</mirrors>\n\
</settings>' > /root/.m2/settings.xml
WORKDIR /opt/gateway
COPY backend/pom.xml ./
COPY backend/api-gateway/pom.xml ./api-gateway/
RUN --mount=type=cache,target=/root/.m2/repository \
cd /opt/gateway/api-gateway && \
mvn dependency:go-offline -Dmaven.test.skip=true || true
COPY backend/api-gateway /opt/gateway/api-gateway
RUN --mount=type=cache,target=/root/.m2/repository \
cd /opt/gateway/api-gateway && \
mvn clean package -Dmaven.test.skip=true
FROM datamate-java-base:latest
COPY --from=builder /opt/gateway/api-gateway/target/gateway.jar /opt/gateway/gateway.jar
COPY scripts/images/gateway/start.sh /opt/gateway/start.sh
RUN dos2unix /opt/gateway/start.sh \
&& chmod +x /opt/gateway/start.sh \
&& ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
EXPOSE 8080
ENTRYPOINT ["/opt/gateway/start.sh"]
CMD ["java", "-Duser.timezone=Asia/Shanghai", "-jar", "/opt/gateway/gateway.jar"]


@@ -0,0 +1,54 @@
# runtime Dockerfile 离线版本
# 修改点: 使用本地模型文件替代 wget 下载
FROM ghcr.nju.edu.cn/astral-sh/uv:python3.11-bookworm
# 配置 apt 阿里云镜像源
RUN --mount=type=cache,target=/var/cache/apt \
--mount=type=cache,target=/var/lib/apt \
if [ -f /etc/apt/sources.list.d/debian.sources ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources; \
elif [ -f /etc/apt/sources.list ]; then \
sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list; \
fi \
&& apt update \
&& apt install -y libgl1 libglib2.0-0 vim libmagic1 libreoffice dos2unix swig poppler-utils tesseract-ocr
# 离线模式: 本地模型文件路径
ARG RESOURCES_DIR=./build-cache/resources
ARG MODELS_DIR=${RESOURCES_DIR}/models
# 复制本地 PaddleOCR 模型(离线环境预先下载)
RUN mkdir -p /home/models
COPY ${MODELS_DIR}/ch_ppocr_mobile_v2.0_cls_infer.tar /home/models/
RUN tar -xf /home/models/ch_ppocr_mobile_v2.0_cls_infer.tar -C /home/models
COPY runtime/python-executor /opt/runtime
COPY runtime/ops /opt/runtime/datamate/ops
COPY runtime/ops/user /opt/runtime/user
COPY scripts/images/runtime/start.sh /opt/runtime/start.sh
ENV PYTHONPATH=/opt/runtime/datamate/
ENV UV_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
ENV UV_INDEX_STRATEGY=unsafe-best-match
# 配置 uv 使用阿里云 PyPI 镜像
ENV UV_INDEX_URL="https://mirrors.aliyun.com/pypi/simple/"
WORKDIR /opt/runtime
# 复制本地 spaCy 模型(离线环境预先下载)
COPY ${MODELS_DIR}/zh_core_web_sm-3.8.0-py3-none-any.whl /tmp/
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install -e .[all] --system \
&& uv pip install -r /opt/runtime/datamate/ops/pyproject.toml --system \
&& uv pip install /tmp/zh_core_web_sm-3.8.0-py3-none-any.whl --system \
&& echo "/usr/local/lib/ops/site-packages" > /usr/local/lib/python3.11/site-packages/ops.pth
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
&& chmod +x /opt/runtime/start.sh \
&& dos2unix /opt/runtime/start.sh
EXPOSE 8081
ENTRYPOINT ["/opt/runtime/start.sh"]


@@ -0,0 +1,42 @@
# runtime Dockerfile 离线版本 v2
# 使用预装 APT 包的基础镜像
FROM datamate-runtime-base:latest
# 离线模式: 本地模型文件路径
ARG RESOURCES_DIR=./build-cache/resources
ARG MODELS_DIR=${RESOURCES_DIR}/models
# 复制本地 PaddleOCR 模型
RUN mkdir -p /home/models
COPY ${MODELS_DIR}/ch_ppocr_mobile_v2.0_cls_infer.tar /home/models/
RUN tar -xf /home/models/ch_ppocr_mobile_v2.0_cls_infer.tar -C /home/models
COPY runtime/python-executor /opt/runtime
COPY runtime/ops /opt/runtime/datamate/ops
COPY runtime/ops/user /opt/runtime/user
COPY scripts/images/runtime/start.sh /opt/runtime/start.sh
ENV PYTHONPATH=/opt/runtime/datamate/
ENV UV_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
ENV UV_INDEX_STRATEGY=unsafe-best-match
ENV UV_INDEX_URL="https://mirrors.aliyun.com/pypi/simple/"
WORKDIR /opt/runtime
# 复制本地 spaCy 模型
COPY ${MODELS_DIR}/zh_core_web_sm-3.8.0-py3-none-any.whl /tmp/
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install -e .[all] --system \
&& uv pip install -r /opt/runtime/datamate/ops/pyproject.toml --system \
&& uv pip install /tmp/zh_core_web_sm-3.8.0-py3-none-any.whl --system \
&& echo "/usr/local/lib/ops/site-packages" > /usr/local/lib/python3.11/site-packages/ops.pth
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
&& chmod +x /opt/runtime/start.sh \
&& dos2unix /opt/runtime/start.sh
EXPOSE 8081
ENTRYPOINT ["/opt/runtime/start.sh"]


@@ -0,0 +1,76 @@
# Makefile 离线构建扩展
# 将此内容追加到主 Makefile 或单独使用
# 使用方法: make -f Makefile.offline <target>
# 离线构建配置
CACHE_DIR ?= ./build-cache
VERSION ?= latest
# ========== 离线构建目标 ==========
.PHONY: offline-export
offline-export:
@echo "导出离线构建缓存..."
@bash scripts/offline/export-cache.sh $(CACHE_DIR)
.PHONY: offline-build
offline-build:
@echo "使用缓存进行离线构建..."
@bash scripts/offline/build-offline.sh $(CACHE_DIR) $(VERSION)
.PHONY: offline-setup
offline-setup:
@echo "解压并设置离线缓存..."
@if [ ! -d "$(CACHE_DIR)" ]; then \
echo "查找缓存压缩包..."; \
cache_file=$$(ls -t build-cache-*.tar.gz 2>/dev/null | head -1); \
if [ -z "$$cache_file" ]; then \
echo "错误: 未找到缓存压缩包 (build-cache-*.tar.gz)"; \
exit 1; \
fi; \
echo "解压 $$cache_file..."; \
tar -xzf "$$cache_file"; \
fi
@echo "✓ 离线缓存准备完成"
# 单个服务的离线构建
.PHONY: %-offline-build
%-offline-build:
@echo "离线构建 $*..."
@$(eval CACHE_FILE := $(CACHE_DIR)/buildkit/$*-cache)
@$(eval IMAGE_NAME := $(if $(filter deer-flow%,$*),$*,datamate-$*))
@if [ ! -d "$(CACHE_FILE)" ]; then \
echo "错误: $* 的缓存不存在于 $(CACHE_FILE)"; \
exit 1; \
fi
@docker buildx build \
--cache-from type=local,src=$(CACHE_FILE) \
--network=none \
-f scripts/images/$*/Dockerfile \
-t $(IMAGE_NAME):$(VERSION) \
--load \
. || echo "警告: $* 离线构建失败"
# 兼容原 Makefile 的构建目标(离线模式)
.PHONY: build-offline
build-offline: offline-setup
@$(MAKE) offline-build
.PHONY: help-offline
help-offline:
@echo "离线构建命令:"
@echo " make offline-export - 在有网环境导出构建缓存"
@echo " make offline-setup - 解压并准备离线缓存"
@echo " make offline-build - 在无网环境使用缓存构建"
@echo " make <service>-offline-build - 离线构建单个服务"
@echo ""
@echo "示例:"
@echo " # 有网环境导出缓存"
@echo " make offline-export"
@echo ""
@echo " # 传输 build-cache-*.tar.gz 到无网环境"
@echo " scp build-cache-20250202.tar.gz user@offline-server:/path/"
@echo ""
@echo " # 无网环境构建"
@echo " make offline-setup"
@echo " make offline-build"

scripts/offline/README.md

@@ -0,0 +1,489 @@
# BuildKit 离线构建方案
本方案使用 Docker BuildKit 的缓存机制,实现在弱网/无网环境下的镜像构建。
## 方案概述
```
┌─────────────────────────────────────────────────────────────────┐
│ 有网环境 (Build Machine) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ 基础镜像 │ │ BuildKit │ │ 外部资源 │ │
│ │ docker pull │ + │ 缓存导出 │ + │ (模型/源码) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
│ │ │ │ │
│ └──────────────────┼──────────────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ build-cache.tar.gz│ │
│ └────────┬─────────┘ │
└─────────────────────────────┼───────────────────────────────────┘
│ 传输到无网环境
┌─────────────────────────────────────────────────────────────────┐
│ 无网环境 (Offline Machine) │
│ ┌──────────────────┐ │
│ │ build-cache.tar.gz│ │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ docker load │ │ BuildKit │ │ 本地资源挂载 │ │
│ │ 基础镜像 │ + │ 缓存导入 │ + │ (模型/源码) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
│ │ │ │ │
│ └──────────────────┼──────────────────┘ │
│ ▼ │
│ 构建成功! │
└─────────────────────────────────────────────────────────────────┘
```
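下面以 backend 为例,给出上述机制的一个最小命令示意(非完整流程,Dockerfile 路径与镜像名以仓库实际配置为准):

```bash
# 有网环境:正常构建,同时把各层构建缓存导出到本地目录
docker buildx build \
  --cache-to type=local,dest=./build-cache/buildkit/backend-cache,mode=max \
  -f scripts/images/backend/Dockerfile \
  -t datamate-backend:latest .

# 无网环境:从本地缓存目录导入各层,命中缓存的步骤不再访问网络
docker buildx build \
  --cache-from type=local,src=./build-cache/buildkit/backend-cache \
  --pull=false \
  -f scripts/images/backend/Dockerfile \
  -t datamate-backend:latest .
```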
## 快速开始
### 方法一:使用 Makefile 扩展(推荐)
#### 1. 合并 Makefile
将 `Makefile.offline.mk` 追加到主 Makefile:
```bash
# Linux/Mac
cat Makefile.offline.mk >> Makefile
# Windows (PowerShell)
Get-Content Makefile.offline.mk | Add-Content Makefile
```
#### 2. 有网环境导出缓存
```bash
# 导出所有缓存(包括基础镜像、BuildKit 缓存、外部资源)
make offline-export
# 或者指定输出目录
make offline-export CACHE_DIR=/path/to/cache
```
执行完成后,会生成压缩包:`build-cache-YYYYMMDD.tar.gz`
#### 3. 传输到无网环境
```bash
# 使用 scp 或其他方式传输
scp build-cache-20250202.tar.gz user@offline-server:/opt/datamate/
# 或者使用 U 盘等物理介质
```
#### 4. 无网环境构建(推荐使用传统方式)
```bash
# 解压缓存
tar -xzf build-cache-20250202.tar.gz
# 诊断环境(检查基础镜像等)
make offline-diagnose
# 方法 A:传统 docker build(推荐,更稳定)
make offline-setup
make offline-build-classic
# 方法 B:BuildKit 构建(如果方法 A 失败)
make offline-setup
make offline-build
# 或者指定版本号
make offline-build-classic OFFLINE_VERSION=v1.0.0
```
**⚠️ 重要提示**:如果遇到镜像拉取问题,请使用 `make offline-build-classic` 而不是 `make offline-build`
### 方法二:使用独立脚本
#### 导出缓存
```bash
cd scripts/offline
./export-cache.sh /path/to/output
```
#### 离线构建
```bash
cd scripts/offline
./build-offline.sh /path/to/cache [version]
```
## 详细说明
### 缓存内容
缓存目录结构:
```
build-cache/
├── buildkit/ # BuildKit 缓存
│ ├── backend-cache/
│ ├── backend-python-cache/
│ ├── database-cache/
│ ├── frontend-cache/
│ ├── gateway-cache/
│ ├── runtime-cache/
│ ├── deer-flow-backend-cache/
│ ├── deer-flow-frontend-cache/
│ └── mineru-cache/
├── images/
│ └── base-images.tar # 基础镜像集合
└── resources/ # 外部资源
├── models/
│ ├── ch_ppocr_mobile_v2.0_cls_infer.tar # PaddleOCR 模型
│ └── zh_core_web_sm-3.8.0-py3-none-any.whl # spaCy 模型
├── DataX/ # DataX 源码
└── deer-flow/ # deer-flow 源码
```
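传输并解压后,可以先用类似下面的命令粗略校验缓存是否完整(示意脚本,目录名以上面的结构为准):

```bash
# 检查关键目录是否齐全,缺失项在构建前即可发现
for d in buildkit images resources/models resources/DataX resources/deer-flow; do
  [ -e "build-cache/$d" ] && echo "ok: $d" || echo "missing: $d"
done
ls build-cache/buildkit    # 应包含各服务的 *-cache 目录
du -sh build-cache/*       # 粗略查看各部分体积
```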
### 单个服务构建
```bash
# 仅构建 backend
make backend-offline-build
# 仅构建 runtime
make runtime-offline-build
# 仅构建 deer-flow-backend
make deer-flow-backend-offline-build
```
### 增量更新
如果只有部分服务代码变更,可以只导出该服务的缓存:
```bash
# 重新导出 backend 缓存
docker buildx build \
--cache-to type=local,dest=./build-cache/buildkit/backend-cache,mode=max \
-f scripts/images/backend/Dockerfile \
-t datamate-backend:cache .
# 传输并重新构建
tar -czf build-cache-partial.tar.gz build-cache/buildkit/backend-cache
# ... 传输到无网环境 ...
make backend-offline-build
```
## APT 缓存问题详解
### 问题描述
即使使用了 `--mount=type=cache,target=/var/cache/apt`,Dockerfile 中的 `apt-get update` 仍会尝试从网络获取包列表(list 数据),导致无网环境下构建失败:
```
Err:1 http://mirrors.aliyun.com/debian bookworm InRelease
Could not resolve 'mirrors.aliyun.com'
Reading package lists...
E: Failed to fetch http://mirrors.aliyun.com/debian/dists/bookworm/InRelease
```
### 根本原因
- `--mount=type=cache,target=/var/cache/apt` 只缓存下载的 `.deb`
- `apt-get update` 会尝试从配置的源获取最新的包索引(InRelease/Packages 文件)
- `/var/lib/apt/lists/` 目录存储包索引,但通常不在缓存范围内
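可以用下面的命令直观验证这一点(示意,以仓库用到的 eclipse-temurin:21-jdk 为例,需在有网环境执行):

```bash
# apt-get update 只更新 /var/lib/apt/lists 下的包索引;
# 只有实际安装时下载的 .deb 才会进入 /var/cache/apt/archives,两者不在同一目录
docker run --rm eclipse-temurin:21-jdk bash -c \
  'apt-get update >/dev/null && du -sh /var/lib/apt/lists /var/cache/apt/archives'
```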
### 解决方案
#### 方案 1: 使用预装 APT 包的基础镜像(推荐)
这是最有效的方法:
**步骤 1**: 在有网环境构建预装所有依赖的基础镜像
```bash
# 构建并保存带 APT 预装包的基础镜像
./scripts/offline/build-base-images.sh
```
这会创建以下预装基础镜像:
- `datamate-java-base` - 用于 backend、gateway(预装 vim、python3、libreoffice 等)
- `datamate-python-base` - 用于 backend-python(预装 openjdk、nfs-common 等)
- `datamate-runtime-base` - 用于 runtime(预装 libgl1、tesseract-ocr 等)
- `deer-flow-backend-base` - 用于 deer-flow-backend
- `mineru-base` - 用于 mineru
**步骤 2**: 在无网环境使用这些基础镜像构建
```bash
# 加载包含预装基础镜像的 tar 包
docker load -i build-cache/images/base-images-with-apt.tar
# 使用最终版构建脚本
./scripts/offline/build-offline-final.sh
```
#### 方案 2: 修改 Dockerfile 跳过 apt update
如果确定不需要安装新包,可以修改 Dockerfile:
```dockerfile
# 原代码
RUN apt-get update && apt-get install -y xxx
# 修改为(离线环境)
# RUN apt-get update && \
RUN apt-get install -y xxx || true
```
#### 方案 3: 挂载 apt lists 缓存
在有网环境预先下载并保存 apt lists:
```bash
# 有网环境:保存 apt lists
docker run --rm \
-v "$(pwd)/apt-lists:/var/lib/apt/lists" \
eclipse-temurin:21-jdk \
apt-get update
# 无网环境:挂载保存的 lists
docker build \
--mount=type=bind,source=$(pwd)/apt-lists,target=/var/lib/apt/lists,ro \
-f Dockerfile .
```
**注意**: BuildKit 的 `--mount=type=bind` 在 `docker build` 中不直接支持,需要在 Dockerfile 中使用。
---
## 故障排查
### 问题 1: 构建时仍然尝试拉取镜像(最常见)
**现象**:
```
ERROR: failed to solve: pulling from host ...
ERROR: pull access denied, repository does not exist or may require authorization
```
**原因**:
- 基础镜像未正确加载
- BuildKit 尝试验证远程镜像
**解决方案**:
1. **使用传统构建方式(推荐)**:
```bash
make offline-build-classic
```
2. **手动加载基础镜像**:
```bash
# 加载基础镜像
docker load -i build-cache/images/base-images.tar
# 验证镜像存在
docker images | grep -E "(maven|eclipse-temurin|mysql|node|nginx)"
```
3. **使用 Docker 守护进程的离线模式**:
```bash
# 编辑 /etc/docker/daemon.json
{
"registry-mirrors": [],
"insecure-registries": []
}
# 重启 Docker
sudo systemctl restart docker
```
### 问题 2: 缓存导入失败
```
ERROR: failed to solve: failed to read cache metadata
```
**解决**: 缓存目录可能损坏,重新在有网环境导出。
### 问题 3: 基础镜像不存在
```
ERROR: pull access denied
```
**解决**:
1. 先执行 `make offline-setup` 加载基础镜像
2. 运行 `make offline-diagnose` 检查缺失的镜像
3. 重新导出缓存时确保包含所有基础镜像
### 问题 4: 网络连接错误(无网环境)
```
ERROR: failed to do request: dial tcp: lookup ...
```
**解决**: 检查 Dockerfile 中是否还有网络依赖(如 `git clone`、`wget`、`pip install` 等),可能需要修改 Dockerfile 使用本地资源。
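排查时可以先用类似下面的命令扫描各 Dockerfile 中残留的联网步骤(示意,匹配规则可按需增减):

```bash
# 粗略列出可能需要联网的指令,逐项确认是否已改为本地资源或可离线跳过
grep -RnE 'git clone|wget |curl |pip install|apt-get update|npm (ci|install)' \
  scripts/images/*/Dockerfile scripts/offline/Dockerfile.* 2>/dev/null
```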
### 问题 5: 内存不足
BuildKit 缓存可能占用大量内存,可以设置资源限制:
```bash
# 创建带资源限制的 buildx 构建器
docker buildx create --name offline-builder \
--driver docker-container \
--driver-opt memory=8g \
--use
```
### 问题 6: BuildKit 构建器无法使用本地镜像
**现象**: 即使镜像已加载,BuildKit 仍提示找不到镜像
**解决**: BuildKit 的 `docker-container` 驱动无法直接访问本地镜像。使用以下方法之一:
**方法 A**: 使用传统 Docker 构建(推荐)
```bash
make offline-build-classic
```
**方法 B**: 将镜像推送到本地 registry
```bash
# 启动本地 registry
docker run -d -p 5000:5000 --name registry registry:2
# 标记并推送镜像到本地 registry
docker tag maven:3-eclipse-temurin-21 localhost:5000/maven:3-eclipse-temurin-21
docker push localhost:5000/maven:3-eclipse-temurin-21
# 修改 Dockerfile 使用本地 registry
# FROM localhost:5000/maven:3-eclipse-temurin-21
```
**方法 C**: 使用 `docker` 驱动的 buildx 构建器(不需要推送镜像,但有其他限制)
```bash
# 创建使用 docker 驱动的构建器
docker buildx create --name offline-builder --driver docker --use
# 但这种方式无法使用 --cache-from type=local
# 仅适用于简单的离线构建场景
```
## 限制说明
1. **镜像版本**: 基础镜像版本必须与缓存导出时一致
2. **Dockerfile 变更**: 如果 Dockerfile 发生较大变更,可能需要重新导出缓存
3. **资源文件**: mineru 镜像中的模型下载(`mineru-models-download`)仍需要网络,如果需要在完全无网环境使用,需要预先将模型文件挂载到镜像中
## 高级用法
### 自定义缓存位置
```bash
make offline-export CACHE_DIR=/mnt/nas/build-cache
make offline-build CACHE_DIR=/mnt/nas/build-cache
```
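### 离线预演构建

在有网机器上也可以用 `--network=none` 预演离线构建,提前暴露 Dockerfile 中残留的联网步骤(示意,以 backend 为例,参数与 Makefile 扩展中的 `%-offline-build` 目标一致):

```bash
docker buildx build \
  --cache-from type=local,src=./build-cache/buildkit/backend-cache \
  --network=none \
  --pull=false \
  --load \
  -f scripts/images/backend/Dockerfile \
  -t datamate-backend:offline-test .
```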
### 导出特定平台缓存
```bash
# 导出 ARM64 平台的缓存
docker buildx build \
--platform linux/arm64 \
--cache-to type=local,dest=./build-cache/buildkit/backend-cache,mode=max \
-f scripts/images/backend/Dockerfile .
```
### 使用远程缓存(有网环境)
```bash
# 导出到 S3/MinIO
docker buildx build \
--cache-to type=s3,region=us-east-1,bucket=mybucket,name=backend-cache \
-f scripts/images/backend/Dockerfile .
# 从 S3 导入
docker buildx build \
--cache-from type=s3,region=us-east-1,bucket=mybucket,name=backend-cache \
-f scripts/images/backend/Dockerfile .
```
## 文件清单
```
scripts/offline/
├── export-cache.sh # 有网环境导出缓存脚本
├── build-base-images.sh # 构建 APT 预装基础镜像
├── build-offline.sh # 基础离线构建脚本(BuildKit)
├── build-offline-v2.sh # 增强版离线构建脚本
├── build-offline-classic.sh # 传统 docker build 脚本
├── build-offline-final.sh # 最终版(使用预装基础镜像,推荐)
├── diagnose.sh # 环境诊断脚本
├── Dockerfile.base-images # 预装 APT 包的基础镜像定义
├── Dockerfile.backend.offline # backend 离线 Dockerfile(使用预装基础镜像)
├── Dockerfile.gateway.offline # gateway 离线 Dockerfile(使用预装基础镜像)
├── Dockerfile.backend-python.offline # backend-python 离线 Dockerfile
├── Dockerfile.backend-python.offline-v2 # backend-python 离线 Dockerfile v2(使用预装基础镜像)
├── Dockerfile.runtime.offline # runtime 离线 Dockerfile
├── Dockerfile.runtime.offline-v2 # runtime 离线 Dockerfile v2(使用预装基础镜像)
├── Dockerfile.deer-flow-backend.offline # deer-flow-backend 离线 Dockerfile
├── Dockerfile.deer-flow-frontend.offline # deer-flow-frontend 离线 Dockerfile
├── Makefile.offline # 独立离线构建 Makefile
└── README.md # 本文档
Makefile.offline.mk # Makefile 扩展(追加到主 Makefile)
```
## 推荐工作流(解决 APT 问题版)
### 工作流 A: 使用预装 APT 包的基础镜像(彻底解决 APT 问题)
```bash
# ========== 有网环境 ==========
# 1. 构建并保存带 APT 预装包的基础镜像
./scripts/offline/build-base-images.sh
# 输出: build-cache/images/base-images-with-apt.tar
# 2. 导出其他缓存(BuildKit 缓存、外部资源)
./scripts/offline/export-cache.sh
# 3. 打包传输
scp build-cache/images/base-images-with-apt.tar user@offline-server:/opt/datamate/build-cache/images/
scp build-cache-*.tar.gz user@offline-server:/opt/datamate/
# ========== 无网环境 ==========
cd /opt/datamate
# 4. 解压
tar -xzf build-cache-*.tar.gz
# 5. 加载预装基础镜像(关键!)
docker load -i build-cache/images/base-images-with-apt.tar
# 6. 使用最终版脚本构建
./scripts/offline/build-offline-final.sh
```
### 工作流 B: 简单场景(使用传统构建)
如果 APT 包需求简单,可以直接使用传统构建:
```bash
# 有网环境
make offline-export
# 传输到无网环境
scp build-cache-*.tar.gz offline-server:/path/
# 无网环境
tar -xzf build-cache-*.tar.gz
make offline-diagnose # 检查环境
make offline-build-classic # 使用传统构建
```
## 参考
- [Docker BuildKit Documentation](https://docs.docker.com/build/buildkit/)
- [Cache Storage Backends](https://docs.docker.com/build/cache/backends/)


@@ -0,0 +1,87 @@
#!/bin/bash
# 构建带有预装 APT 包的基础镜像
# Usage: ./build-base-images.sh [output-dir]
set -e
OUTPUT_DIR="${1:-./build-cache}"
IMAGES_DIR="$OUTPUT_DIR/images"
mkdir -p "$IMAGES_DIR"
echo "======================================"
echo "构建预装 APT 包的基础镜像"
echo "======================================"
# 构建各个基础镜像
echo ""
echo "1. 构建 datamate-java-base (用于 backend, gateway)..."
docker build \
-t datamate-java-base:latest \
--target datamate-java-base \
-f scripts/offline/Dockerfile.base-images \
. || echo "Warning: datamate-java-base 构建失败"
echo ""
echo "2. 构建 datamate-python-base (用于 backend-python)..."
docker build \
-t datamate-python-base:latest \
--target datamate-python-base \
-f scripts/offline/Dockerfile.base-images \
. || echo "Warning: datamate-python-base 构建失败"
echo ""
echo "3. 构建 datamate-runtime-base (用于 runtime)..."
docker build \
-t datamate-runtime-base:latest \
--target datamate-runtime-base \
-f scripts/offline/Dockerfile.base-images \
. || echo "Warning: datamate-runtime-base 构建失败"
echo ""
echo "4. 构建 deer-flow-backend-base (用于 deer-flow-backend)..."
docker build \
-t deer-flow-backend-base:latest \
--target deer-flow-backend-base \
-f scripts/offline/Dockerfile.base-images \
. || echo "Warning: deer-flow-backend-base 构建失败"
echo ""
echo "5. 构建 mineru-base (用于 mineru)..."
docker build \
-t mineru-base:latest \
--target mineru-base \
-f scripts/offline/Dockerfile.base-images \
. || echo "Warning: mineru-base 构建失败"
echo ""
echo "======================================"
echo "保存基础镜像集合"
echo "======================================"
docker save -o "$IMAGES_DIR/base-images-with-apt.tar" \
maven:3-eclipse-temurin-21 \
maven:3-eclipse-temurin-8 \
eclipse-temurin:21-jdk \
mysql:8 \
node:20-alpine \
nginx:1.29 \
ghcr.nju.edu.cn/astral-sh/uv:python3.11-bookworm \
ghcr.nju.edu.cn/astral-sh/uv:python3.12-bookworm \
ghcr.nju.edu.cn/astral-sh/uv:latest \
python:3.12-slim \
python:3.11-slim \
gcr.io/distroless/nodejs20-debian12 \
datamate-java-base:latest \
datamate-python-base:latest \
datamate-runtime-base:latest \
deer-flow-backend-base:latest \
mineru-base:latest \
2>/dev/null || echo "Warning: 部分镜像保存失败"
echo ""
echo "======================================"
echo "✓ 基础镜像构建完成"
echo "======================================"
echo "镜像列表:"
docker images | grep -E "(datamate-|deer-flow-|mineru-)base" || true


@@ -0,0 +1,206 @@
#!/bin/bash
# 传统 docker build 离线构建脚本(不使用 buildx)
# 这种方式更稳定,兼容性更好
# Usage: ./build-offline-classic.sh [cache-dir] [version]
set -e
CACHE_DIR="${1:-./build-cache}"
VERSION="${2:-latest}"
IMAGES_DIR="$CACHE_DIR/images"
RESOURCES_DIR="$CACHE_DIR/resources"
# 颜色输出
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
log_info() { echo -e "${GREEN}[INFO]${NC} $1"; }
log_warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
log_error() { echo -e "${RED}[ERROR]${NC} $1"; }
log_debug() { echo -e "${BLUE}[DEBUG]${NC} $1"; }
# 检查缓存目录
if [ ! -d "$CACHE_DIR" ]; then
log_error "缓存目录 $CACHE_DIR 不存在"
exit 1
fi
# 加载基础镜像
load_base_images() {
log_info "加载基础镜像..."
if [ ! -f "$IMAGES_DIR/base-images.tar" ]; then
log_warn "基础镜像 tar 包不存在,检查本地镜像..."
return
fi
log_info "$IMAGES_DIR/base-images.tar 加载..."
docker load -i "$IMAGES_DIR/base-images.tar"
log_info "✓ 基础镜像加载完成"
}
# 检查镜像是否存在
check_image() {
docker inspect "$1" > /dev/null 2>&1
}
# 构建函数
build_service() {
local service_name=$1
local image_name=$2
local dockerfile=$3
log_info "----------------------------------------"
log_info "构建 $service_name"
log_info "----------------------------------------"
# 检查 Dockerfile 是否存在
if [ ! -f "$dockerfile" ]; then
log_error "Dockerfile 不存在: $dockerfile"
return 1
fi
# 获取所需的基础镜像
local from_images
from_images=$(grep -E '^FROM' "$dockerfile" | sed 's/FROM //' | sed 's/ AS .*//' | sed 's/ as .*//' | awk '{print $1}' | sort -u)
log_info "检查基础镜像..."
local all_exist=true
for img in $from_images; do
# 跳过多阶段构建的中间阶段引用
if [[ "$img" == --from=* ]]; then
continue
fi
if check_image "$img"; then
log_info "$img"
else
log_error "$img (缺失)"
all_exist=false
fi
done
if [ "$all_exist" = false ]; then
log_error "缺少必要的基础镜像,无法构建 $service_name"
return 1
fi
# 准备构建参数
local build_args=()
# 根据服务类型添加特殊处理
case "$service_name" in
runtime)
# runtime 需要模型文件
if [ -d "$RESOURCES_DIR/models" ]; then
log_info "使用本地模型文件"
build_args+=("--build-arg" "RESOURCES_DIR=$RESOURCES_DIR")
fi
;;
backend-python)
if [ -d "$RESOURCES_DIR/DataX" ]; then
log_info "使用本地 DataX 源码"
build_args+=("--build-arg" "RESOURCES_DIR=$RESOURCES_DIR")
build_args+=("--build-arg" "DATAX_LOCAL_PATH=$RESOURCES_DIR/DataX")
fi
;;
deer-flow-backend|deer-flow-frontend)
if [ -d "$RESOURCES_DIR/deer-flow" ]; then
log_info "使用本地 deer-flow 源码"
build_args+=("--build-arg" "RESOURCES_DIR=$RESOURCES_DIR")
fi
;;
esac
# 执行构建
log_info "开始构建..."
if docker build \
--pull=false \
"${build_args[@]}" \
-f "$dockerfile" \
-t "$image_name:$VERSION" \
. 2>&1; then
log_info "$service_name 构建成功"
return 0
else
log_error "$service_name 构建失败"
return 1
fi
}
# 主流程
main() {
log_info "======================================"
log_info "传统 Docker 离线构建"
log_info "======================================"
# 加载基础镜像
load_base_images
# 定义要构建的服务
declare -A SERVICES=(
["database"]="datamate-database:scripts/images/database/Dockerfile"
["gateway"]="datamate-gateway:scripts/images/gateway/Dockerfile"
["backend"]="datamate-backend:scripts/images/backend/Dockerfile"
["frontend"]="datamate-frontend:scripts/images/frontend/Dockerfile"
["runtime"]="datamate-runtime:scripts/images/runtime/Dockerfile"
["backend-python"]="datamate-backend-python:scripts/images/backend-python/Dockerfile"
)
# deer-flow 和 mineru 是可选的
OPTIONAL_SERVICES=(
"deer-flow-backend:deer-flow-backend:scripts/images/deer-flow-backend/Dockerfile"
"deer-flow-frontend:deer-flow-frontend:scripts/images/deer-flow-frontend/Dockerfile"
"mineru:datamate-mineru:scripts/images/mineru/Dockerfile"
)
log_info ""
log_info "构建核心服务..."
local failed=()
local succeeded=()
for service_name in "${!SERVICES[@]}"; do
IFS=':' read -r image_name dockerfile <<< "${SERVICES[$service_name]}"
if build_service "$service_name" "$image_name" "$dockerfile"; then
succeeded+=("$service_name")
else
failed+=("$service_name")
fi
echo ""
done
# 尝试构建可选服务
log_info "构建可选服务..."
for service_config in "${OPTIONAL_SERVICES[@]}"; do
IFS=':' read -r service_name image_name dockerfile <<< "$service_config"
if build_service "$service_name" "$image_name" "$dockerfile"; then
succeeded+=("$service_name")
else
log_warn "$service_name 构建失败(可选服务,继续)"
fi
echo ""
done
# 汇总
log_info "======================================"
log_info "构建结果"
log_info "======================================"
if [ ${#succeeded[@]} -gt 0 ]; then
log_info "成功 (${#succeeded[@]}): ${succeeded[*]}"
fi
if [ ${#failed[@]} -gt 0 ]; then
log_error "失败 (${#failed[@]}): ${failed[*]}"
exit 1
else
log_info "✓ 所有核心服务构建成功!"
echo ""
docker images --format "table {{.Repository}}:{{.Tag}}\t{{.Size}}" | grep -E "(datamate-|deer-flow-)" || true
fi
}
main "$@"


@@ -0,0 +1,181 @@
#!/bin/bash
# 最终版离线构建脚本 - 使用预装 APT 包的基础镜像
# Usage: ./build-offline-final.sh [cache-dir] [version]
set -e
CACHE_DIR="${1:-./build-cache}"
VERSION="${2:-latest}"
IMAGES_DIR="$CACHE_DIR/images"
RESOURCES_DIR="$CACHE_DIR/resources"
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
log_info() { echo -e "${GREEN}[INFO]${NC} $1"; }
log_warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
log_error() { echo -e "${RED}[ERROR]${NC} $1"; }
# 检查缓存目录
if [ ! -d "$CACHE_DIR" ]; then
log_error "缓存目录 $CACHE_DIR 不存在"
exit 1
fi
# 加载基础镜像
load_images() {
log_info "加载基础镜像..."
# 优先加载带 APT 预装包的镜像集合
if [ -f "$IMAGES_DIR/base-images-with-apt.tar" ]; then
log_info "加载带 APT 预装包的基础镜像..."
docker load -i "$IMAGES_DIR/base-images-with-apt.tar"
elif [ -f "$IMAGES_DIR/base-images.tar" ]; then
log_warn "加载普通基础镜像(不含 APT 预装包)..."
docker load -i "$IMAGES_DIR/base-images.tar"
else
log_warn "基础镜像 tar 包不存在,检查本地镜像..."
fi
log_info "✓ 镜像加载完成"
}
# 验证镜像是否存在
verify_image() {
docker inspect "$1" > /dev/null 2>&1
}
# 构建函数
build_service() {
local service_name=$1
local image_name=$2
local dockerfile=$3
local base_image=$4 # 必需的基础镜像
log_info "----------------------------------------"
log_info "构建 $service_name"
log_info "----------------------------------------"
if [ ! -f "$dockerfile" ]; then
log_error "Dockerfile 不存在: $dockerfile"
return 1
fi
# 检查必需的基础镜像
if [ -n "$base_image" ]; then
if verify_image "$base_image"; then
log_info "✓ 基础镜像存在: $base_image"
else
log_error "✗ 缺少基础镜像: $base_image"
log_info "请确保已加载正确的 base-images-with-apt.tar"
return 1
fi
fi
# 准备构建参数
local build_args=()
# 添加资源目录参数
if [ -d "$RESOURCES_DIR" ]; then
build_args+=("--build-arg" "RESOURCES_DIR=$RESOURCES_DIR")
fi
# 执行构建
log_info "开始构建..."
if docker build \
--pull=false \
"${build_args[@]}" \
-f "$dockerfile" \
-t "$image_name:$VERSION" \
. 2>&1; then
log_info "$service_name 构建成功"
return 0
else
log_error "$service_name 构建失败"
return 1
fi
}
# 主流程
main() {
log_info "======================================"
log_info "最终版离线构建 (使用 APT 预装基础镜像)"
log_info "======================================"
# 加载基础镜像
load_images
# 验证关键基础镜像
log_info ""
log_info "验证预装基础镜像..."
REQUIRED_BASE_IMAGES=(
"datamate-java-base:latest"
"datamate-python-base:latest"
"datamate-runtime-base:latest"
)
for img in "${REQUIRED_BASE_IMAGES[@]}"; do
if verify_image "$img"; then
log_info "$img"
else
log_warn "$img (缺失)"
fi
done
# 定义服务配置
declare -A SERVICES=(
["database"]="datamate-database:scripts/images/database/Dockerfile:"
["gateway"]="datamate-gateway:scripts/offline/Dockerfile.gateway.offline:datamate-java-base:latest"
["backend"]="datamate-backend:scripts/offline/Dockerfile.backend.offline:datamate-java-base:latest"
["frontend"]="datamate-frontend:scripts/images/frontend/Dockerfile:"
["runtime"]="datamate-runtime:scripts/offline/Dockerfile.runtime.offline-v2:datamate-runtime-base:latest"
["backend-python"]="datamate-backend-python:scripts/offline/Dockerfile.backend-python.offline-v2:datamate-python-base:latest"
)
log_info ""
log_info "======================================"
log_info "开始构建服务"
log_info "======================================"
local failed=()
local succeeded=()
for service_name in "${!SERVICES[@]}"; do
IFS=':' read -r image_name dockerfile base_image <<< "${SERVICES[$service_name]}"
if build_service "$service_name" "$image_name" "$dockerfile" "$base_image"; then
succeeded+=("$service_name")
else
failed+=("$service_name")
fi
echo ""
done
# 汇总
log_info "======================================"
log_info "构建结果"
log_info "======================================"
if [ ${#succeeded[@]} -gt 0 ]; then
log_info "成功 (${#succeeded[@]}): ${succeeded[*]}"
fi
if [ ${#failed[@]} -gt 0 ]; then
log_error "失败 (${#failed[@]}): ${failed[*]}"
log_info ""
log_info "提示: 如果失败是因为缺少预装基础镜像,请确保:"
log_info " 1. 在有网环境执行: ./scripts/offline/build-base-images.sh"
log_info " 2. 将生成的 base-images-with-apt.tar 传输到无网环境"
log_info " 3. 在无网环境加载: docker load -i base-images-with-apt.tar"
exit 1
else
log_info "✓ 所有服务构建成功!"
echo ""
docker images --format "table {{.Repository}}:{{.Tag}}\t{{.Size}}" | grep -E "(datamate-|deer-flow-)" || true
fi
}
main "$@"


@@ -0,0 +1,249 @@
#!/bin/bash
# BuildKit 离线构建脚本 v2 - 增强版
# Usage: ./build-offline-v2.sh [cache-dir] [version]
set -e
CACHE_DIR="${1:-./build-cache}"
VERSION="${2:-latest}"
BUILDKIT_CACHE_DIR="$CACHE_DIR/buildkit"
IMAGES_DIR="$CACHE_DIR/images"
RESOURCES_DIR="$CACHE_DIR/resources"
# 颜色输出
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
log_info() {
echo -e "${GREEN}[INFO]${NC} $1"
}
log_warn() {
echo -e "${YELLOW}[WARN]${NC} $1"
}
log_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
# 检查缓存目录
if [ ! -d "$CACHE_DIR" ]; then
log_error "缓存目录 $CACHE_DIR 不存在"
log_info "请先解压缓存包: tar -xzf build-cache-*.tar.gz"
exit 1
fi
# 确保 buildx 构建器存在(使用 docker-container 驱动)
setup_buildx() {
log_info "设置 BuildKit 构建器..."
# 删除旧的构建器
if docker buildx inspect offline-builder > /dev/null 2>&1; then
docker buildx rm offline-builder 2>/dev/null || true
fi
# 创建新的构建器,使用 docker-container 驱动(注意:该驱动无法直接使用本地已加载的镜像,详见 README 问题 6)
docker buildx create --name offline-builder \
--driver docker-container \
--driver-opt image=moby/buildkit:buildx-stable-1 \
--use
log_info "BuildKit 构建器创建完成"
}
# 加载基础镜像
load_base_images() {
log_info "加载基础镜像..."
if [ ! -f "$IMAGES_DIR/base-images.tar" ]; then
log_warn "基础镜像文件不存在: $IMAGES_DIR/base-images.tar"
log_info "检查本地是否存在所需镜像..."
# 检查关键镜像是否存在
required_images=(
"maven:3-eclipse-temurin-21"
"eclipse-temurin:21-jdk"
"mysql:8"
"node:20-alpine"
"nginx:1.29"
)
for img in "${required_images[@]}"; do
if ! docker inspect "$img" > /dev/null 2>&1; then
log_error "缺少基础镜像: $img"
log_info "请确保基础镜像已加载: docker load -i base-images.tar"
exit 1
fi
done
log_info "本地基础镜像检查通过"
return
fi
log_info "$IMAGES_DIR/base-images.tar 加载基础镜像..."
docker load -i "$IMAGES_DIR/base-images.tar"
log_info "基础镜像加载完成"
}
# 验证镜像是否存在
verify_image() {
local image_name=$1
if docker inspect "$image_name" > /dev/null 2>&1; then
return 0
else
return 1
fi
}
# 离线构建函数
offline_build() {
local service_name=$1
local image_name=$2
local dockerfile=$3
local cache_file="$BUILDKIT_CACHE_DIR/${service_name}-cache"
log_info "----------------------------------------"
log_info "构建 [$service_name] -> $image_name:$VERSION"
log_info "----------------------------------------"
if [ ! -d "$cache_file" ]; then
log_warn "$service_name 的缓存不存在,跳过..."
return 0
fi
# 获取 Dockerfile 中的基础镜像
local base_images
base_images=$(grep -E '^FROM' "$dockerfile" | awk '{print $2}' | sort -u)
log_info "检查基础镜像..."
for base_img in $base_images; do
# 跳过多阶段构建中的 AS 别名
base_img=$(echo "$base_img" | cut -d':' -f1-2 | sed 's/AS.*//i' | tr -d ' ')
if [ -z "$base_img" ] || [[ "$base_img" == *"AS"* ]]; then
continue
fi
if verify_image "$base_img"; then
log_info "$base_img"
else
log_warn "$base_img (未找到)"
# 尝试从 base-images.tar 中加载
if [ -f "$IMAGES_DIR/base-images.tar" ]; then
log_info " 尝试从 tar 包加载..."
docker load -i "$IMAGES_DIR/base-images.tar" 2>/dev/null || true
fi
fi
done
# 执行离线构建
log_info "开始构建..."
# 构建参数
local build_args=()
# 为需要外部资源的服务添加 build-arg
case "$service_name" in
runtime|deer-flow-backend|deer-flow-frontend)
if [ -d "$RESOURCES_DIR" ]; then
build_args+=("--build-arg" "RESOURCES_DIR=$RESOURCES_DIR")
fi
;;
backend-python)
if [ -d "$RESOURCES_DIR" ]; then
build_args+=("--build-arg" "RESOURCES_DIR=$RESOURCES_DIR")
build_args+=("--build-arg" "DATAX_LOCAL_PATH=$RESOURCES_DIR/DataX")
fi
;;
esac
# 执行构建
if docker buildx build \
--builder offline-builder \
--cache-from "type=local,src=$cache_file" \
--pull=false \
--output "type=docker" \
"${build_args[@]}" \
-f "$dockerfile" \
-t "$image_name:$VERSION" \
. 2>&1; then
log_info "$service_name 构建成功"
return 0
else
log_error "$service_name 构建失败"
return 1
fi
}
# 主流程
main() {
log_info "======================================"
log_info "BuildKit 离线构建"
log_info "======================================"
log_info "缓存目录: $CACHE_DIR"
log_info "版本: $VERSION"
# 步骤 1: 设置构建器
setup_buildx
# 步骤 2: 加载基础镜像
load_base_images
# 步骤 3: 定义服务列表
declare -A SERVICES=(
["database"]="datamate-database:scripts/images/database/Dockerfile"
["gateway"]="datamate-gateway:scripts/images/gateway/Dockerfile"
["backend"]="datamate-backend:scripts/images/backend/Dockerfile"
["frontend"]="datamate-frontend:scripts/images/frontend/Dockerfile"
["runtime"]="datamate-runtime:scripts/images/runtime/Dockerfile"
["backend-python"]="datamate-backend-python:scripts/images/backend-python/Dockerfile"
["deer-flow-backend"]="deer-flow-backend:scripts/images/deer-flow-backend/Dockerfile"
["deer-flow-frontend"]="deer-flow-frontend:scripts/images/deer-flow-frontend/Dockerfile"
["mineru"]="datamate-mineru:scripts/images/mineru/Dockerfile"
)
# 步骤 4: 批量构建
log_info ""
log_info "======================================"
log_info "开始批量构建"
log_info "======================================"
local failed=()
local succeeded=()
for service_name in "${!SERVICES[@]}"; do
IFS=':' read -r image_name dockerfile <<< "${SERVICES[$service_name]}"
if offline_build "$service_name" "$image_name" "$dockerfile"; then
succeeded+=("$service_name")
else
failed+=("$service_name")
fi
echo ""
done
# 步骤 5: 汇总结果
log_info "======================================"
log_info "构建完成"
log_info "======================================"
if [ ${#succeeded[@]} -gt 0 ]; then
log_info "成功 (${#succeeded[@]}): ${succeeded[*]}"
fi
if [ ${#failed[@]} -gt 0 ]; then
log_error "失败 (${#failed[@]}): ${failed[*]}"
exit 1
else
log_info "✓ 所有服务构建成功!"
echo ""
log_info "镜像列表:"
docker images --format "table {{.Repository}}:{{.Tag}}\t{{.Size}}" | grep -E "(datamate-|deer-flow-)" || true
fi
}
# 执行主流程
main "$@"

View File

@@ -0,0 +1,109 @@
#!/bin/bash
# BuildKit 离线构建脚本 - 在无网环境执行
# Usage: ./build-offline.sh [cache-dir] [version]
set -e
CACHE_DIR="${1:-./build-cache}"
VERSION="${2:-latest}"
BUILDKIT_CACHE_DIR="$CACHE_DIR/buildkit"
IMAGES_DIR="$CACHE_DIR/images"
RESOURCES_DIR="$CACHE_DIR/resources"
# 检查缓存目录
if [ ! -d "$CACHE_DIR" ]; then
echo "错误: 缓存目录 $CACHE_DIR 不存在"
echo "请先解压缓存包: tar -xzf build-cache-*.tar.gz"
exit 1
fi
# 确保 buildx 构建器存在
if ! docker buildx inspect offline-builder > /dev/null 2>&1; then
echo "创建 buildx 构建器..."
docker buildx create --name offline-builder --driver docker-container --use
else
docker buildx use offline-builder
fi
echo "======================================"
echo "1. 加载基础镜像"
echo "======================================"
if [ -f "$IMAGES_DIR/base-images.tar" ]; then
echo "$IMAGES_DIR/base-images.tar 加载基础镜像..."
docker load -i "$IMAGES_DIR/base-images.tar"
echo "✓ 基础镜像加载完成"
else
echo "警告: 基础镜像文件不存在,假设镜像已存在"
fi
echo ""
echo "======================================"
echo "2. 离线构建服务"
echo "======================================"
# 定义服务配置(与 export-cache.sh 保持一致)
SERVICES=(
"backend:datamate-backend:scripts/images/backend/Dockerfile"
"backend-python:datamate-backend-python:scripts/images/backend-python/Dockerfile"
"database:datamate-database:scripts/images/database/Dockerfile"
"frontend:datamate-frontend:scripts/images/frontend/Dockerfile"
"gateway:datamate-gateway:scripts/images/gateway/Dockerfile"
"runtime:datamate-runtime:scripts/images/runtime/Dockerfile"
"deer-flow-backend:deer-flow-backend:scripts/images/deer-flow-backend/Dockerfile"
"deer-flow-frontend:deer-flow-frontend:scripts/images/deer-flow-frontend/Dockerfile"
"mineru:datamate-mineru:scripts/images/mineru/Dockerfile"
)
# 检查是否有资源目录需要通过 build-arg 传入
BUILD_ARGS=()
if [ -d "$RESOURCES_DIR" ]; then
echo "检测到资源目录,将通过 --build-arg RESOURCES_DIR 传入构建"
BUILD_ARGS+=(--build-arg "RESOURCES_DIR=$RESOURCES_DIR")
fi
for service_config in "${SERVICES[@]}"; do
IFS=':' read -r service_name image_name dockerfile <<< "$service_config"
cache_file="$BUILDKIT_CACHE_DIR/$service_name-cache"
echo ""
echo "--------------------------------------"
echo "构建 [$service_name] -> $image_name:$VERSION"
echo "--------------------------------------"
if [ ! -d "$cache_file" ]; then
echo "警告: $service_name 的缓存不存在,跳过..."
continue
fi
# 使用缓存进行离线构建
# --pull=false: 不尝试拉取镜像
# --network=none: 禁用网络访问
docker buildx build \
--cache-from "type=local,src=$cache_file" \
--pull=false \
--network=none \
"${BUILD_ARGS[@]}" \
-f "$dockerfile" \
-t "$image_name:$VERSION" \
--load \
. 2>&1 || {
echo "警告: $service_name 离线构建遇到问题,尝试仅使用缓存..."
docker buildx build \
--cache-from "type=local,src=$cache_file" \
--pull=false \
"${BUILD_ARGS[@]}" \
-f "$dockerfile" \
-t "$image_name:$VERSION" \
--load \
. 2>&1
}
echo "$service_name 构建完成"
done
echo ""
echo "======================================"
echo "✓ 离线构建完成!"
echo "======================================"
echo ""
echo "构建的镜像列表:"
docker images | grep -E "(datamate-|deer-flow-)" || true
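
For the simpler v1 script above, a corresponding invocation sketch under the same assumptions (cache bundle already unpacked into ./build-cache, scripts/offline/ location assumed):

./scripts/offline/build-offline.sh ./build-cache latest
# A service whose BuildKit cache is absent is skipped with a warning; regenerate its
# buildkit/<service>-cache directory with export-cache.sh on the networked host and copy it over.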

scripts/offline/diagnose.sh Normal file
View File

@@ -0,0 +1,151 @@
#!/bin/bash
# 离线构建诊断脚本
# Usage: ./diagnose.sh [cache-dir]
set -e
CACHE_DIR="${1:-./build-cache}"
echo "======================================"
echo "离线构建环境诊断"
echo "======================================"
echo ""
# 1. 检查 Docker
echo "1. Docker 版本:"
docker version --format '{{.Server.Version}}' 2>/dev/null || echo " 无法获取版本"
echo ""
# 2. 检查 BuildKit
echo "2. BuildKit 状态:"
if docker buildx version > /dev/null 2>&1; then
docker buildx version
echo ""
echo "可用的构建器:"
docker buildx ls
else
echo " BuildKit 不可用"
fi
echo ""
# 3. 检查缓存目录
echo "3. 缓存目录检查 ($CACHE_DIR):"
if [ -d "$CACHE_DIR" ]; then
echo " ✓ 缓存目录存在"
# 检查子目录
for subdir in buildkit images resources; do
if [ -d "$CACHE_DIR/$subdir" ]; then
echo "$subdir/ 存在"
count=$(find "$CACHE_DIR/$subdir" -mindepth 1 | wc -l)
echo " 条目数量: $count"
else
echo "$subdir/ 不存在"
fi
done
else
echo " ✗ 缓存目录不存在"
fi
echo ""
# 4. 检查基础镜像
echo "4. 基础镜像检查:"
required_images=(
"maven:3-eclipse-temurin-21"
"maven:3-eclipse-temurin-8"
"eclipse-temurin:21-jdk"
"mysql:8"
"node:20-alpine"
"nginx:1.29"
"ghcr.nju.edu.cn/astral-sh/uv:python3.11-bookworm"
"python:3.12-slim"
"gcr.io/distroless/nodejs20-debian12"
)
missing_images=()
for img in "${required_images[@]}"; do
if docker inspect "$img" > /dev/null 2>&1; then
size=$(docker images --format "{{.Size}}" "$img" | head -1)
echo "$img ($size)"
else
echo "$img (缺失)"
missing_images+=("$img")
fi
done
echo ""
# 5. 检查 BuildKit 缓存
echo "5. BuildKit 缓存检查:"
if [ -d "$CACHE_DIR/buildkit" ]; then
for cache_dir in "$CACHE_DIR/buildkit"/*-cache; do
if [ -d "$cache_dir" ]; then
name=$(basename "$cache_dir")
size=$(du -sh "$cache_dir" 2>/dev/null | cut -f1)
echo "$name ($size)"
fi
done
else
echo " ✗ 缓存目录不存在"
fi
echo ""
# 6. 检查资源文件
echo "6. 外部资源检查:"
if [ -d "$CACHE_DIR/resources" ]; then
if [ -f "$CACHE_DIR/resources/models/ch_ppocr_mobile_v2.0_cls_infer.tar" ]; then
size=$(du -sh "$CACHE_DIR/resources/models/ch_ppocr_mobile_v2.0_cls_infer.tar" | cut -f1)
echo " ✓ PaddleOCR 模型 ($size)"
else
echo " ✗ PaddleOCR 模型缺失"
fi
if [ -f "$CACHE_DIR/resources/models/zh_core_web_sm-3.8.0-py3-none-any.whl" ]; then
size=$(du -sh "$CACHE_DIR/resources/models/zh_core_web_sm-3.8.0-py3-none-any.whl" | cut -f1)
echo " ✓ spaCy 模型 ($size)"
else
echo " ✗ spaCy 模型缺失"
fi
if [ -d "$CACHE_DIR/resources/DataX" ]; then
echo " ✓ DataX 源码"
else
echo " ✗ DataX 源码缺失"
fi
if [ -d "$CACHE_DIR/resources/deer-flow" ]; then
echo " ✓ deer-flow 源码"
else
echo " ✗ deer-flow 源码缺失"
fi
else
echo " ✗ 资源目录不存在"
fi
echo ""
# 7. 网络检查
echo "7. 网络检查:"
if ping -c 1 8.8.8.8 > /dev/null 2>&1; then
echo " ⚠ 网络可用(离线构建环境通常不需要)"
else
echo " ✓ 网络不可达(符合离线环境)"
fi
echo ""
# 8. 总结
echo "======================================"
echo "诊断总结"
echo "======================================"
if [ ${#missing_images[@]} -eq 0 ]; then
echo "✓ 所有基础镜像已就绪"
else
echo "✗ 缺少 ${#missing_images[@]} 个基础镜像:"
printf ' - %s\n' "${missing_images[@]}"
echo ""
echo "修复方法:"
if [ -f "$CACHE_DIR/images/base-images.tar" ]; then
echo " docker load -i $CACHE_DIR/images/base-images.tar"
else
echo " 请确保有网环境导出时包含所有基础镜像"
fi
fi
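
diagnose.sh reports missing items but does not exit non-zero for them, so one way to use it as a pre-flight gate is to match the marker strings it prints. A rough sketch (the build-offline-v2.sh path is an assumption, as above):

# Abort before building if the diagnosis reports anything missing
# (coarse match on the "缺失" / "不存在" markers printed by diagnose.sh)
if ./scripts/offline/diagnose.sh ./build-cache | grep -Eq '缺失|不存在'; then
    echo "environment not ready - load the missing base images / cache first" >&2
    exit 1
fi
./scripts/offline/build-offline-v2.sh ./build-cache latest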

View File

@@ -0,0 +1,172 @@
#!/bin/bash
# BuildKit 缓存导出脚本 - 在有网环境执行
# Usage: ./export-cache.sh [output-dir]
set -e
OUTPUT_DIR="${1:-./build-cache}"
BUILDKIT_CACHE_DIR="$OUTPUT_DIR/buildkit"
IMAGES_DIR="$OUTPUT_DIR/images"
RESOURCES_DIR="$OUTPUT_DIR/resources"
APT_CACHE_DIR="$OUTPUT_DIR/apt-cache"
# 确保 buildx 构建器存在
if ! docker buildx inspect offline-builder > /dev/null 2>&1; then
echo "创建 buildx 构建器..."
docker buildx create --name offline-builder --driver docker-container --use
else
docker buildx use offline-builder
fi
mkdir -p "$BUILDKIT_CACHE_DIR" "$IMAGES_DIR" "$RESOURCES_DIR" "$APT_CACHE_DIR"
echo "======================================"
echo "1. 导出基础镜像"
echo "======================================"
BASE_IMAGES=(
"maven:3-eclipse-temurin-21"
"maven:3-eclipse-temurin-8"
"eclipse-temurin:21-jdk"
"mysql:8"
"node:20-alpine"
"nginx:1.29"
"ghcr.nju.edu.cn/astral-sh/uv:python3.11-bookworm"
"ghcr.nju.edu.cn/astral-sh/uv:python3.12-bookworm"
"ghcr.nju.edu.cn/astral-sh/uv:latest"
"python:3.12-slim"
"python:3.11-slim"
"gcr.nju.edu.cn/distroless/nodejs20-debian12"
)
for img in "${BASE_IMAGES[@]}"; do
echo "拉取: $img"
docker pull "$img" || echo "警告: $img 拉取失败,可能已存在"
done
echo ""
echo "保存基础镜像到 $IMAGES_DIR/base-images.tar..."
docker save -o "$IMAGES_DIR/base-images.tar" "${BASE_IMAGES[@]}"
echo "✓ 基础镜像保存完成"
echo ""
echo "======================================"
echo "2. 导出 BuildKit 构建缓存"
echo "======================================"
# 定义服务配置
SERVICES=(
"backend:datamate-backend:scripts/images/backend/Dockerfile"
"backend-python:datamate-backend-python:scripts/images/backend-python/Dockerfile"
"database:datamate-database:scripts/images/database/Dockerfile"
"frontend:datamate-frontend:scripts/images/frontend/Dockerfile"
"gateway:datamate-gateway:scripts/images/gateway/Dockerfile"
"runtime:datamate-runtime:scripts/images/runtime/Dockerfile"
"deer-flow-backend:deer-flow-backend:scripts/images/deer-flow-backend/Dockerfile"
"deer-flow-frontend:deer-flow-frontend:scripts/images/deer-flow-frontend/Dockerfile"
"mineru:datamate-mineru:scripts/images/mineru/Dockerfile"
)
for service_config in "${SERVICES[@]}"; do
IFS=':' read -r service_name image_name dockerfile <<< "$service_config"
cache_file="$BUILDKIT_CACHE_DIR/$service_name-cache"
echo ""
echo "导出 [$service_name] 缓存到 $cache_file..."
# 先正常构建以填充缓存
docker buildx build \
--cache-to "type=local,dest=$cache_file,mode=max" \
-f "$dockerfile" \
-t "$image_name:cache" \
. || echo "警告: $service_name 缓存导出失败"
echo "$service_name 缓存导出完成"
done
echo ""
echo "======================================"
echo "3. 预下载外部资源"
echo "======================================"
# PaddleOCR 模型
mkdir -p "$RESOURCES_DIR/models"
if [ ! -f "$RESOURCES_DIR/models/ch_ppocr_mobile_v2.0_cls_infer.tar" ]; then
echo "下载 PaddleOCR 模型..."
wget -O "$RESOURCES_DIR/models/ch_ppocr_mobile_v2.0_cls_infer.tar" \
"https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar" || true
fi
# spaCy 模型
if [ ! -f "$RESOURCES_DIR/models/zh_core_web_sm-3.8.0-py3-none-any.whl" ]; then
echo "下载 spaCy 模型..."
wget -O "$RESOURCES_DIR/models/zh_core_web_sm-3.8.0-py3-none-any.whl" \
"https://ghproxy.net/https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-3.8.0/zh_core_web_sm-3.8.0-py3-none-any.whl" || true
fi
# DataX 源码
if [ ! -d "$RESOURCES_DIR/DataX" ]; then
echo "克隆 DataX 源码..."
git clone --depth 1 "https://gitee.com/alibaba/DataX.git" "$RESOURCES_DIR/DataX" || true
fi
# deer-flow 源码(用于 deer-flow 构建)
if [ ! -d "$RESOURCES_DIR/deer-flow" ]; then
echo "克隆 deer-flow 源码..."
git clone --depth 1 "https://ghproxy.net/https://github.com/ModelEngine-Group/deer-flow.git" "$RESOURCES_DIR/deer-flow" || true
fi
echo ""
echo "======================================"
echo "4. 导出 APT 缓存"
echo "======================================"
# 为需要 apt 的镜像预生成 apt 缓存
# 注意: docker -v 绑定挂载要求绝对路径,这里先把缓存目录解析成绝对路径
APT_CACHE_ABS="$(cd "$APT_CACHE_DIR" && pwd)"
echo "生成 APT 缓存..."
# eclipse-temurin:21-jdk 的 apt 缓存
docker run --rm \
-v "$APT_CACHE_ABS/eclipse-temurin:/var/cache/apt/archives" \
-v "$APT_CACHE_ABS/eclipse-temurin-lists:/var/lib/apt/lists" \
eclipse-temurin:21-jdk \
bash -c "apt-get update && apt-get install -y --download-only vim wget curl rsync python3 python3-pip python-is-python3 dos2unix libreoffice fonts-noto-cjk 2>/dev/null || true" 2>/dev/null || echo " Warning: eclipse-temurin apt 缓存导出失败"
# python:3.12-slim 的 apt 缓存
docker run --rm \
-v "$APT_CACHE_ABS/python312:/var/cache/apt/archives" \
-v "$APT_CACHE_ABS/python312-lists:/var/lib/apt/lists" \
python:3.12-slim \
bash -c "apt-get update && apt-get install -y --download-only vim openjdk-21-jre nfs-common glusterfs-client rsync 2>/dev/null || true" 2>/dev/null || echo " Warning: python3.12 apt 缓存导出失败"
# python:3.11-slim 的 apt 缓存
docker run --rm \
-v "$APT_CACHE_ABS/python311:/var/cache/apt/archives" \
-v "$APT_CACHE_ABS/python311-lists:/var/lib/apt/lists" \
python:3.11-slim \
bash -c "apt-get update && apt-get install -y --download-only curl vim libgl1 libglx0 libopengl0 libglib2.0-0 procps 2>/dev/null || true" 2>/dev/null || echo " Warning: python3.11 apt 缓存导出失败"
echo "✓ APT 缓存导出完成"
echo ""
echo "======================================"
echo "5. 打包缓存"
echo "======================================"
cd "$OUTPUT_DIR"
tar -czf "build-cache-$(date +%Y%m%d).tar.gz" buildkit images resources apt-cache
cd - > /dev/null
echo ""
echo "======================================"
echo "✓ 缓存导出完成!"
echo "======================================"
echo "缓存位置: $OUTPUT_DIR"
echo "传输文件: $OUTPUT_DIR/build-cache-$(date +%Y%m%d).tar.gz"
echo ""
echo "包含内容:"
echo " - 基础镜像 (images/)"
echo " - BuildKit 缓存 (buildkit/)"
echo " - 外部资源 (resources/)"
echo " - APT 缓存 (apt-cache/)"
echo ""
echo "请将此压缩包传输到无网环境后解压使用"