Compare commits


23 Commits

Author SHA1 Message Date
d0972cbc9d feat(data-management): implement dataset file versioning and internal path protection
- Replace dataset file queries with variants that only return visible files
- Introduce file status management (ACTIVE/ARCHIVED) and an internal directory structure
- Implement a duplicate-file handling strategy with a versioning mode instead of overwriting
- Protect internal data directories, preventing access to system directories such as .datamate
- Refactor the file upload flow, introducing a staging directory and post-commit cleanup
- Implement file version archiving, keeping historical versions in a dedicated storage location
- Improve file path normalization and security validation
- Fix file deletion logic so archived files are not removed by mistake
- Update dataset zip download to exclude internal system files
2026-02-04 23:53:35 +08:00
473f4e717f feat(annotation): add text segment index display
- Implement generation of the segment index array
- Add a grid UI for displaying segment indexes
- Support highlighting the current segment
- Improve the segment navigation user experience
- Replace the previous segment hint text with a visual index component
2026-02-04 19:16:48 +08:00
6b0042cb66 refactor(annotation): simplify task selection logic and remove unused state management
- Remove the resolveSegmentSummary call to simplify completion-status checks
- Delete unused segmentStats references and cache cleanup code
- Simplify state updates in reset mode
2026-02-04 18:23:49 +08:00
fa9e9d9f68 refactor(annotation): simplify segment management in the text annotation editor
- Remove segment statistics data structures and caching logic
- Delete the segment-switch confirmation dialog and the auto-save option
- Simplify segment loading and state management
- Replace the segment list view with a simple progress display
- Update the API to support fetching a single segment's content
- Refactor the backend service to implement single-segment content queries
2026-02-04 18:08:14 +08:00
707e65b017 refactor(annotation): improve segment handling in the editor service
- Initialize the segments list variable when processing segment annotations
- Ensure the segment info list is properly initialized at the start of the function
- Improve code readability and consistency of variable declarations
2026-02-04 17:35:14 +08:00
cda22a720c feat(annotation): improve the text annotation segmentation implementation
- Add the getEditorTaskSegmentsUsingGet endpoint for fetching task segment information
- Remove the text, start, and end fields from SegmentInfo to slim down the data structure
- Add the EditorTaskSegmentsResponse type definition for segment summary responses
- Implement the server-side get_task_segments method to support segment info queries
- Refactor the frontend component cache, using segmentSummaryFileRef to manage segment state
- Improve segment construction by extracting the shared _build_segment_contexts method
- Adjust the segment handling flow in the backend _build_text_task method
- Update API type definitions, unifying the RequestParams and RequestPayload types
2026-02-04 16:59:04 +08:00
394e2bda18 feat(data-management): add cancel-upload support for dataset files
- Define the cancel-upload REST endpoint in the OpenAPI specification
- Implement the cancel-upload business logic in DatasetFileApplicationService
- Add a complete cancel-upload service method to FileService
- Create the DatasetUploadController to handle cancel-upload requests
- Implement cleanup of temporary chunk files and deletion of database records
2026-02-04 16:25:03 +08:00
4220284f5a refactor(utils): refactor the streaming split-and-upload utility
- Extract processFileLines from streamSplitAndUpload as a standalone function
- Simplify line-by-line processing, removing redundant line collection and caching
- Improve concurrent uploads, using a set of Promises to manage upload tasks
- Fix abort-signal handling and error propagation during uploads
- Unify the progress callback parameters and improve byte and line tracking
- Improve empty-line skip counting and the upload result return values
2026-02-04 16:11:03 +08:00
8415166949 refactor(upload): refactor chunked upload to resolve request IDs on demand
- Remove the up-front batch fetching of reqId in favor of on-demand resolution
- Add a resolveReqId function for dynamically obtaining the request ID (see the sketch after this entry)
- Add an onReqIdResolved callback fired when ID resolution completes
- Improve line-based splitting so each line is uploaded as an independent file
- Improve empty-line skipping and count the skipped empty lines
- Fix the mapping between fileNo and chunkNo
- Update the streamSplitAndUpload parameter structure
2026-02-04 15:58:58 +08:00
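
A minimal TypeScript sketch of the on-demand resolution pattern described above. The option names (resolveReqId, onReqIdResolved) follow the commit message, but the option shape and the upload primitive are assumptions, not the project's actual API.

```typescript
// Hypothetical option shape; only the two names come from the commit message.
interface ResolveReqIdOptions {
  resolveReqId: () => Promise<string>;        // fetches a reqId only when it is needed
  onReqIdResolved?: (reqId: string) => void;  // fired once resolution completes
}

async function uploadWithResolvedReqId(
  chunk: Blob,
  options: ResolveReqIdOptions,
  upload: (reqId: string, chunk: Blob) => Promise<void>, // assumed upload primitive
): Promise<void> {
  // Resolve the request ID lazily instead of pre-fetching a batch up front.
  const reqId = await options.resolveReqId();
  options.onReqIdResolved?.(reqId);
  await upload(reqId, chunk);
}
```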
078f303f57 Revert "fix: resolve the conflict between hasArchive and splitByLine"
This reverts commit 50f2da5503.
2026-02-04 15:48:01 +08:00
50f2da5503 fix: resolve the conflict between hasArchive and splitByLine
Problem: hasArchive defaults to true, and splitByLine could be enabled at the same time,
      causing archives to be incorrectly split by line, which is a logical contradiction.

Fix:
1. Disable the splitByLine switch when hasArchive=true
2. Add a useEffect that automatically turns splitByLine off when hasArchive becomes true (see the sketch after this entry)

Changed file: frontend/src/pages/DataManagement/Detail/components/ImportConfiguration.tsx
2026-02-04 15:43:53 +08:00
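
A minimal React + TypeScript sketch of the described guard. Plain checkboxes stand in for the real switches, and the surrounding form wiring is assumed; only the hasArchive/splitByLine interaction comes from the commit.

```tsx
import { useEffect, useState } from "react";

// Sketch only: the real component lives in ImportConfiguration.tsx with different markup.
export function SplitOptions() {
  const [hasArchive, setHasArchive] = useState(true);     // hasArchive defaults to true
  const [splitByLine, setSplitByLine] = useState(false);

  // When hasArchive turns on, force splitByLine off so an archive is never split by line.
  useEffect(() => {
    if (hasArchive) {
      setSplitByLine(false);
    }
  }, [hasArchive]);

  return (
    <>
      <label>
        hasArchive
        <input
          type="checkbox"
          checked={hasArchive}
          onChange={(e) => setHasArchive(e.target.checked)}
        />
      </label>
      <label>
        splitByLine (disabled while hasArchive is on)
        <input
          type="checkbox"
          checked={splitByLine}
          disabled={hasArchive}
          onChange={(e) => setSplitByLine(e.target.checked)}
        />
      </label>
    </>
  );
}
```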
3af1daf8b6 fix: fix the "pre-upload request not found" error in streaming split upload
Problem: handleStreamUpload called preUpload only once for all files and set
      totalFileNum: files.length (the number of original files), but the number of files
      actually uploaded is the total line count after splitting, so the backend deleted
      the pre-upload request too early.

Fix: move the preUpload call inside the per-file loop, calling preUpload once per original
      file with totalFileNum: 1, so each file gets its own reqId.
      This avoids the request being deleted early because of line-based splitting.

Changed file: frontend/src/hooks/useSliceUpload.tsx
2026-02-04 15:39:05 +08:00
7c7729434b fix: fix three issues in streaming split upload
1. Implement real concurrency control to avoid firing a large number of requests at once (see the sketch after this entry)
   - Use a task-queue pattern so no more than maxConcurrency tasks run at the same time
   - Start the next task only after one finishes, instead of launching all tasks at once

2. Fix the API error (pre-upload request not found)
   - All chunks use the same fileNo=1 (they belong to the same pre-upload request)
   - chunkNo is now the line number, indicating which line of data it carries
   - Root cause: previously each line was treated as a different file, but only the first file had a valid pre-upload request

3. Keep the original file extension
   - Correctly extract and preserve the file extension
   - For example: 132.txt → 132_000001.txt (instead of 132_000001)
2026-02-04 15:06:02 +08:00
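
A minimal TypeScript sketch of the task-queue concurrency pattern from item 1. The names (runWithConcurrency, maxConcurrency) are illustrative, not the project's real API.

```typescript
// Runs tasks with at most `maxConcurrency` in flight at any time.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  maxConcurrency: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each worker pulls the next task only after the one it just ran has finished.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workers = Array.from(
    { length: Math.min(maxConcurrency, tasks.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```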
17a62cd3c2 fix: fix upload cancellation so HTTP requests are actually aborted
- Add a signal.aborted check in the XMLHttpRequest wrapper (see the sketch after this entry)
- Fix the cancelFn closure issue in useSliceUpload
- Ensure both streaming and chunked uploads can be cancelled correctly
2026-02-04 14:51:23 +08:00
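
A minimal TypeScript sketch of wiring an AbortSignal into an XMLHttpRequest-based upload. The PUT method, FormData body, and error handling are assumptions; the real hook's request shape differs.

```typescript
function uploadChunk(url: string, body: FormData, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    // Bail out early if cancellation already happened before the request started.
    if (signal?.aborted) {
      reject(new DOMException("Upload cancelled", "AbortError"));
      return;
    }
    const xhr = new XMLHttpRequest();
    xhr.open("PUT", url);

    // Abort the in-flight request as soon as the signal fires.
    const onAbort = () => xhr.abort();
    signal?.addEventListener("abort", onAbort, { once: true });

    xhr.onload = () => {
      signal?.removeEventListener("abort", onAbort);
      xhr.status >= 200 && xhr.status < 300
        ? resolve()
        : reject(new Error(`Upload failed with status ${xhr.status}`));
    };
    xhr.onerror = () => reject(new Error("Network error"));
    xhr.onabort = () => reject(new DOMException("Upload cancelled", "AbortError"));
    xhr.send(body);
  });
}
```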
f381d641ab fix(upload): fix file-name handling in streaming upload
- Pass the actual total file count to the pre-upload API instead of the fixed value -1
- Remove the file-extension preservation logic from file splitting in the import configuration
- Remove the fileExtension parameter from the streaming upload options
- Remove the file-extension handling code from the streaming upload implementation
- Simplify new-file-name generation, no longer appending an extension suffix
2026-02-04 07:47:41 +08:00
c8611d29ff feat(upload): implement streaming split upload to improve large-file uploads
Split and upload on the fly so that large files no longer freeze the frontend by being loaded all at once.

Changes:
1. file.util.ts - core streaming split-upload functionality
   - Add the streamSplitAndUpload function, which splits and uploads simultaneously
   - Add the shouldStreamUpload function to decide whether streaming upload should be used
   - Add the StreamUploadOptions and StreamUploadResult interfaces
   - Tune the chunk size (default 5MB)

2. ImportConfiguration.tsx - smart upload strategy
   - Large files (>5MB) use streaming split upload
   - Small files (≤5MB) use the traditional splitting approach
   - The UI stays unchanged

3. useSliceUpload.tsx - streaming upload handling
   - Add handleStreamUpload to handle streaming upload events
   - Support concurrent uploads and better progress management

4. TaskUpload.tsx - progress display improvements
   - Register streaming upload event listeners
   - Display streaming upload information (uploaded line count, current file, etc.)

5. dataset.model.ts - type definition extensions
   - Add the StreamUploadInfo interface
   - Add streamUploadInfo and prefix fields to the TaskItem interface

Implementation highlights:
- Streaming reads: use Blob.slice to read the file block by block instead of loading it all at once (see the sketch after this entry)
- Line detection: split on newlines and upload each complete line as soon as it is formed
- Memory optimization: the buffer only holds the current block and any incomplete line, never the full set of split results
- Concurrency control: up to 3 concurrent uploads for better throughput
- Visible progress: show uploaded line counts and overall progress in real time
- Error handling: a failure on one file does not affect the others
- Backward compatibility: small files still use the original splitting approach

Benefits:
- Large-file uploads no longer freeze the page, greatly improving the user experience
- Memory usage drops significantly (from loading the whole file to holding only the current block)
- Upload throughput improves (splitting while uploading, with several small files uploaded concurrently)

Related files:
- frontend/src/utils/file.util.ts
- frontend/src/pages/DataManagement/Detail/components/ImportConfiguration.tsx
- frontend/src/hooks/useSliceUpload.tsx
- frontend/src/pages/Layout/TaskUpload.tsx
- frontend/src/pages/DataManagement/dataset.model.ts
2026-02-03 13:12:10 +00:00
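
A minimal TypeScript sketch of Blob.slice-based streaming line splitting, assuming a hypothetical uploadLine callback; the real streamSplitAndUpload signature and options are richer than this.

```typescript
async function splitLinesStreaming(
  file: File,
  uploadLine: (line: string, lineNo: number) => Promise<void>, // assumed upload callback
  chunkSize = 5 * 1024 * 1024, // read 5MB at a time
): Promise<number> {
  const decoder = new TextDecoder();
  let buffer = ""; // holds only the current, possibly incomplete, trailing line
  let lineNo = 0;

  for (let offset = 0; offset < file.size; offset += chunkSize) {
    // Read one block; Blob.slice avoids loading the whole file into memory.
    const chunk = file.slice(offset, offset + chunkSize);
    buffer += decoder.decode(await chunk.arrayBuffer(), { stream: true });

    // Every complete line is uploaded immediately and dropped from the buffer.
    let newlineIndex: number;
    while ((newlineIndex = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newlineIndex).replace(/\r$/, "");
      buffer = buffer.slice(newlineIndex + 1);
      if (line.trim().length > 0) {
        await uploadLine(line, ++lineNo);
      }
    }
  }

  // Flush the final line if the file does not end with a newline.
  if (buffer.trim().length > 0) {
    await uploadLine(buffer, ++lineNo);
  }
  return lineNo;
}
```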
147beb1ec7 feat(annotation): implement text segment pre-generation
Pre-generate the text segment structure when an annotation task is created, avoiding on-the-fly computation every time the annotation page is opened.

Changes:
1. Add a precompute_segmentation_for_project method to AnnotationEditorService
   - Pre-compute the segment structure for every text file in the project
   - Use AnnotationTextSplitter to perform the segmentation
   - Persist the segment structure to the AnnotationResult table (with status IN_PROGRESS)
   - Support retrying on failure
   - Return statistics

2. Modify the create_mapping endpoint
   - After the annotation task is created, automatically trigger segment pre-generation if segmentation is enabled and the dataset is a text dataset
   - Wrap the call in try-except so a segmentation failure does not block project creation

Highlights:
- Uses the existing AnnotationTextSplitter class
- The segment data structure matches the existing segmented-annotation format
- Backward compatible (tasks without pre-generated segments still fall back to on-the-fly computation)
- Performance optimization: avoids recomputation when entering the annotation page

Related files:
- runtime/datamate-python/app/module/annotation/service/editor.py
- runtime/datamate-python/app/module/annotation/interface/project.py
2026-02-03 12:59:29 +00:00
699031dae7 fix: fix the compilation error when clearing the linked dataset while editing a dataset
Problem analysis:
A previous attempt used the @TableField(updateStrategy = FieldStrategy.IGNORED/ALWAYS) annotation
to force null values to be updated, but FieldStrategy.ALWAYS may not exist in the current
MyBatis-Plus 3.5.14 version, causing a compilation error.

Fix:
1. Remove the @TableField(updateStrategy) annotation from the parentDatasetId field in Dataset.java
2. Remove the now-unneeded import com.baomidou.mybatisplus.annotation.FieldStrategy
3. In the DatasetApplicationService.updateDataset method:
   - Add import com.baomidou.mybatisplus.core.conditions.update.LambdaUpdateWrapper
   - Save the original parentDatasetId value for comparison
   - After handleParentChange, check whether parentDatasetId has changed
   - If it has changed, update the parentDatasetId field explicitly with a LambdaUpdateWrapper
   - This way the field is written to the database correctly even when the value is null

Rationale:
MyBatis-Plus's updateById method only updates non-null fields by default.
Using the set method of LambdaUpdateWrapper sets the field value explicitly,
including null, so the field is written to the database correctly.
2026-02-03 11:09:15 +00:00
88b1383653 fix: restore sending an empty string from the frontend to support clearing the linked dataset
Notes:
The previous logic that converted the empty string to undefined has been removed;
the form value, including the empty string, is now sent as-is.

Works together with the backend change (commit cc6415c):
1. When the user selects "no linked dataset", an empty string "" is sent
2. The backend handleParentChange method converts the empty string to null via normalizeParentId
3. The Dataset.parentDatasetId field carries the @TableField(updateStrategy = FieldStrategy.IGNORED) annotation
4. This ensures the field is updated in the database even when the value is null
2026-02-03 10:57:14 +00:00
cc6415c4d9 fix: fix the inability to clear the linked dataset when editing a dataset
Problem description:
In dataset editing under data management, if a linked dataset was previously set, choosing "no linked dataset" and saving has no effect.

Root cause:
MyBatis-Plus's updateById method uses the FieldStrategy.NOT_NULL strategy by default,
so a field is only written to the database when its value is non-null.
When parentDatasetId changes from a value to null, it is therefore not updated in the database.

Fix:
Add the @TableField(updateStrategy = FieldStrategy.IGNORED) annotation to the parentDatasetId field in Dataset.java,
meaning the field should be updated in the database even when its value is null.

Combined with the frontend change (restoring the empty string), the linked dataset can now be cleared correctly:
1. The frontend sends an empty string to mean "no linked dataset"
2. The backend handleParentChange converts the empty string to null via normalizeParentId
3. dataset.setParentDatasetId(null) sets the value to null
4. With the IGNORED strategy, the null value is still written to the database
2026-02-03 10:57:08 +00:00
3d036c4cd6 fix: fix the inability to clear the linked dataset when editing a dataset
Problem description:
In dataset editing under data management, if a linked dataset was previously set, choosing "no linked dataset" and saving has no effect.

Cause:
The condition in the backend updateDataset method:
```java
if (updateDatasetRequest.getParentDatasetId() != null) {
    handleParentChange(dataset, updateDatasetRequest.getParentDatasetId());
}
```
When parentDatasetId is null or an empty string, the condition is false, handleParentChange is never called, and the link cannot be cleared.

Fix:
Remove the condition and always call handleParentChange. Internally, handleParentChange converts both the empty string and null to null via normalizeParentId, so it supports both setting a new parent dataset and clearing the link.

Combined with the frontend change (commit 2445235), which converts the empty string to undefined (deserialized as null by the backend), the clear operation is executed correctly.
2026-02-03 09:35:09 +00:00
2445235fd2 fix: fix clearing the linked dataset not taking effect when editing a dataset
Problem description:
In dataset editing under data management, if a linked dataset was previously set, choosing "no linked dataset" and saving has no effect.

Cause:
- In BasicInformation.tsx, the value of the "no linked dataset" option is the empty string ""
- When the user chooses not to link a dataset, parentDatasetId is ""
- The backend API treats the empty string as an invalid value and ignores it instead of recognizing it as a "clear link" operation

Fix:
- In the handleSubmit function of EditDataset.tsx, convert an empty-string parentDatasetId to undefined
- Use formValues.parentDatasetId || undefined to ensure the empty string becomes undefined
- The backend API then correctly recognizes the intent to clear the linked dataset
2026-02-03 09:23:13 +00:00
893e0a1580 fix: show the task center immediately when uploading files
Problem description:
When uploading files from the dataset detail page in data management, the dialog closes on confirm, but the task center only appears after the files have been processed (especially with split-by-line enabled), which is a poor user experience.

Changes:
1. useSliceUpload.tsx: show the task center immediately in the createTask function, so it appears as soon as the task is created
2. ImportConfiguration.tsx: in handleImportData, fire the show:task-popover event to display the task center before the time-consuming file processing (such as file splitting) starts; see the sketch after this entry

Result:
- Before: click confirm → dialog closes → (wait for file processing) → task center appears
- After: click confirm → dialog closes + task center appears immediately → file processing starts
2026-02-03 09:14:40 +00:00
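
A minimal TypeScript sketch of surfacing the task center before heavy processing. The event name show:task-popover comes from the commit message; processFile and the listener wiring are assumptions.

```typescript
async function handleImportData(
  files: File[],
  processFile: (file: File) => Promise<void>, // e.g. split-by-line, assumed here
): Promise<void> {
  // Surface the task center immediately, before any time-consuming work begins.
  window.dispatchEvent(new CustomEvent("show:task-popover"));

  for (const file of files) {
    await processFile(file);
  }
}

// Elsewhere (TaskUpload.tsx in the commit), the popover listens for the same event:
window.addEventListener("show:task-popover", () => {
  // open the task-center popover here
});
```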
32 changed files with 3195 additions and 1526 deletions

View File

@@ -470,6 +470,23 @@ paths:
'200':
description: 上传成功
/data-management/datasets/upload/cancel-upload/{reqId}:
put:
tags: [ DatasetFile ]
operationId: cancelUpload
summary: 取消上传
description: 取消预上传请求并清理临时分片
parameters:
- name: reqId
in: path
required: true
schema:
type: string
description: 预上传请求ID
responses:
'200':
description: 取消成功
/data-management/dataset-types:
get:
operationId: getDatasetTypes

View File

@@ -1,5 +1,6 @@
package com.datamate.datamanagement.application;
import com.baomidou.mybatisplus.core.conditions.update.LambdaUpdateWrapper;
import com.baomidou.mybatisplus.core.metadata.IPage;
import com.baomidou.mybatisplus.extension.plugins.pagination.Page;
import com.datamate.common.domain.utils.ChunksSaver;
@@ -101,6 +102,7 @@ public class DatasetApplicationService {
public Dataset updateDataset(String datasetId, UpdateDatasetRequest updateDatasetRequest) {
Dataset dataset = datasetRepository.getById(datasetId);
BusinessAssert.notNull(dataset, DataManagementErrorCode.DATASET_NOT_FOUND);
if (StringUtils.hasText(updateDatasetRequest.getName())) {
dataset.setName(updateDatasetRequest.getName());
}
@@ -113,13 +115,31 @@ public class DatasetApplicationService {
if (Objects.nonNull(updateDatasetRequest.getStatus())) {
dataset.setStatus(updateDatasetRequest.getStatus());
}
- if (updateDatasetRequest.getParentDatasetId() != null) {
+ if (updateDatasetRequest.isParentDatasetIdProvided()) {
// 保存原始的 parentDatasetId 值,用于比较是否发生了变化
String originalParentDatasetId = dataset.getParentDatasetId();
// 处理父数据集变更:仅当请求显式包含 parentDatasetId 时处理
// handleParentChange 内部通过 normalizeParentId 方法将空字符串和 null 都转换为 null
// 这样既支持设置新的父数据集,也支持清除关联
handleParentChange(dataset, updateDatasetRequest.getParentDatasetId());
// 检查 parentDatasetId 是否发生了变化
if (!Objects.equals(originalParentDatasetId, dataset.getParentDatasetId())) {
// 使用 LambdaUpdateWrapper 显式地更新 parentDatasetId 字段
// 这样即使值为 null 也能被正确更新到数据库
datasetRepository.update(null, new LambdaUpdateWrapper<Dataset>()
.eq(Dataset::getId, datasetId)
.set(Dataset::getParentDatasetId, dataset.getParentDatasetId()));
}
}
if (StringUtils.hasText(updateDatasetRequest.getDataSource())) {
// 数据源id不为空,使用异步线程进行文件扫盘落库
processDataSourceAsync(dataset.getId(), updateDatasetRequest.getDataSource());
}
// 更新其他字段(不包括 parentDatasetId,因为它已经在上面的代码中更新了)
datasetRepository.updateById(dataset);
return dataset;
}
@@ -144,7 +164,7 @@ public class DatasetApplicationService {
public Dataset getDataset(String datasetId) {
Dataset dataset = datasetRepository.getById(datasetId);
BusinessAssert.notNull(dataset, DataManagementErrorCode.DATASET_NOT_FOUND);
- List<DatasetFile> datasetFiles = datasetFileRepository.findAllByDatasetId(datasetId);
+ List<DatasetFile> datasetFiles = datasetFileRepository.findAllVisibleByDatasetId(datasetId);
dataset.setFiles(datasetFiles);
applyVisibleFileCounts(Collections.singletonList(dataset));
return dataset;
@@ -419,7 +439,7 @@ public class DatasetApplicationService {
Map<String, Object> statistics = new HashMap<>();
- List<DatasetFile> allFiles = datasetFileRepository.findAllByDatasetId(datasetId);
+ List<DatasetFile> allFiles = datasetFileRepository.findAllVisibleByDatasetId(datasetId);
List<DatasetFile> visibleFiles = filterVisibleFiles(allFiles);
long totalFiles = visibleFiles.size();
long completedFiles = visibleFiles.stream()

View File

@@ -58,7 +58,6 @@ import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.*;
- import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;
@@ -83,6 +82,11 @@ public class DatasetFileApplicationService {
XLSX_FILE_TYPE
);
private static final String DERIVED_METADATA_KEY = "derived_from_file_id";
private static final String FILE_STATUS_ACTIVE = "ACTIVE";
private static final String FILE_STATUS_ARCHIVED = "ARCHIVED";
private static final String INTERNAL_DIR_NAME = ".datamate";
private static final String INTERNAL_UPLOAD_DIR_NAME = "uploading";
private static final String INTERNAL_VERSIONS_DIR_NAME = "versions";
private final DatasetFileRepository datasetFileRepository;
private final DatasetRepository datasetRepository;
@@ -93,7 +97,7 @@ public class DatasetFileApplicationService {
@Value("${datamate.data-management.base-path:/dataset}")
private String datasetBasePath;
- @Value("${datamate.data-management.file.duplicate:COVER}")
+ @Value("${datamate.data-management.file.duplicate:VERSION}")
private DuplicateMethod duplicateMethod;
@Autowired
@@ -162,9 +166,19 @@ public class DatasetFileApplicationService {
if (dataset == null) {
return PagedResponse.of(new Page<>(page, size));
}
- String datasetPath = dataset.getPath();
- Path queryPath = Path.of(dataset.getPath() + File.separator + prefix);
- Map<String, DatasetFile> datasetFilesMap = datasetFileRepository.findAllByDatasetId(datasetId)
+ Path datasetRoot = Paths.get(dataset.getPath()).toAbsolutePath().normalize();
+ prefix = Optional.ofNullable(prefix).orElse("").trim().replace("\\", "/");
+ while (prefix.startsWith("/")) {
+ prefix = prefix.substring(1);
+ }
+ if (prefix.equals(INTERNAL_DIR_NAME) || prefix.startsWith(INTERNAL_DIR_NAME + "/")) {
+ return new PagedResponse<>(page, size, 0, 0, Collections.emptyList());
+ }
+ Path queryPath = datasetRoot.resolve(prefix.replace("/", File.separator)).normalize();
+ if (!queryPath.startsWith(datasetRoot)) {
+ return new PagedResponse<>(page, size, 0, 0, Collections.emptyList());
+ }
+ Map<String, DatasetFile> datasetFilesMap = datasetFileRepository.findAllVisibleByDatasetId(datasetId)
.stream()
.filter(file -> file.getFilePath() != null)
.collect(Collectors.toMap(
@@ -186,7 +200,8 @@ public class DatasetFileApplicationService {
}
try (Stream<Path> pathStream = Files.list(queryPath)) {
List<Path> allFiles = pathStream
- .filter(path -> path.toString().startsWith(datasetPath))
+ .filter(path -> path.toAbsolutePath().normalize().startsWith(datasetRoot))
+ .filter(path -> !isInternalDatasetPath(datasetRoot, path))
.filter(path -> !excludeDerivedFiles
|| Files.isDirectory(path)
|| !derivedFilePaths.contains(normalizeFilePath(path.toString())))
@@ -298,6 +313,86 @@ public class DatasetFileApplicationService {
}
}
private boolean isSameNormalizedPath(String left, String right) {
String normalizedLeft = normalizeFilePath(left);
String normalizedRight = normalizeFilePath(right);
if (normalizedLeft == null || normalizedRight == null) {
return false;
}
return normalizedLeft.equals(normalizedRight);
}
private boolean isInternalDatasetPath(Path datasetRoot, Path path) {
if (datasetRoot == null || path == null) {
return false;
}
try {
Path normalizedRoot = datasetRoot.toAbsolutePath().normalize();
Path normalizedPath = path.toAbsolutePath().normalize();
if (!normalizedPath.startsWith(normalizedRoot)) {
return false;
}
Path relative = normalizedRoot.relativize(normalizedPath);
if (relative.getNameCount() == 0) {
return false;
}
return INTERNAL_DIR_NAME.equals(relative.getName(0).toString());
} catch (Exception e) {
return false;
}
}
private String normalizeLogicalPrefix(String prefix) {
if (prefix == null) {
return "";
}
String normalized = prefix.trim().replace("\\", "/");
while (normalized.startsWith("/")) {
normalized = normalized.substring(1);
}
while (normalized.endsWith("/")) {
normalized = normalized.substring(0, normalized.length() - 1);
}
while (normalized.contains("//")) {
normalized = normalized.replace("//", "/");
}
return normalized;
}
private String normalizeLogicalPath(String logicalPath) {
return normalizeLogicalPrefix(logicalPath);
}
private String joinLogicalPath(String prefix, String relativePath) {
String normalizedPrefix = normalizeLogicalPrefix(prefix);
String normalizedRelative = normalizeLogicalPath(relativePath);
if (normalizedPrefix.isEmpty()) {
return normalizedRelative;
}
if (normalizedRelative.isEmpty()) {
return normalizedPrefix;
}
return normalizeLogicalPath(normalizedPrefix + "/" + normalizedRelative);
}
private void assertNotInternalPrefix(String prefix) {
if (prefix == null || prefix.isBlank()) {
return;
}
String normalized = normalizeLogicalPrefix(prefix);
if (normalized.equals(INTERNAL_DIR_NAME) || normalized.startsWith(INTERNAL_DIR_NAME + "/")) {
throw BusinessException.of(CommonErrorCode.PARAM_ERROR);
}
}
private boolean isArchivedStatus(DatasetFile datasetFile) {
if (datasetFile == null) {
return false;
}
String status = datasetFile.getStatus();
return status != null && FILE_STATUS_ARCHIVED.equalsIgnoreCase(status);
}
private boolean isSourceDocument(DatasetFile datasetFile) {
if (datasetFile == null) {
return false;
@@ -327,6 +422,144 @@ public class DatasetFileApplicationService {
}
}
private Path resolveDatasetRootPath(Dataset dataset, String datasetId) {
String datasetPath = dataset == null ? null : dataset.getPath();
if (datasetPath == null || datasetPath.isBlank()) {
datasetPath = datasetBasePath + File.separator + datasetId;
if (dataset != null) {
dataset.setPath(datasetPath);
datasetRepository.updateById(dataset);
}
}
Path datasetRoot = Paths.get(datasetPath).toAbsolutePath().normalize();
try {
Files.createDirectories(datasetRoot);
} catch (IOException e) {
log.error("Failed to create dataset root dir: {}", datasetRoot, e);
throw BusinessException.of(SystemErrorCode.FILE_SYSTEM_ERROR);
}
return datasetRoot;
}
private Path resolveStagingRootPath(Path datasetRoot,
DatasetFileUploadCheckInfo checkInfo,
List<FileUploadResult> uploadedFiles) {
if (datasetRoot == null) {
return null;
}
String stagingPath = checkInfo == null ? null : checkInfo.getStagingPath();
if (stagingPath != null && !stagingPath.isBlank()) {
try {
Path stagingRoot = Paths.get(stagingPath).toAbsolutePath().normalize();
if (!stagingRoot.startsWith(datasetRoot)) {
log.warn("Staging root out of dataset root, datasetId={}, stagingRoot={}, datasetRoot={}",
checkInfo == null ? null : checkInfo.getDatasetId(), stagingRoot, datasetRoot);
return null;
}
Path relative = datasetRoot.relativize(stagingRoot);
if (relative.getNameCount() < 3) {
return null;
}
if (!INTERNAL_DIR_NAME.equals(relative.getName(0).toString())
|| !INTERNAL_UPLOAD_DIR_NAME.equals(relative.getName(1).toString())) {
return null;
}
return stagingRoot;
} catch (Exception e) {
log.warn("Invalid staging path: {}", stagingPath, e);
return null;
}
}
if (uploadedFiles == null || uploadedFiles.isEmpty()) {
return null;
}
FileUploadResult firstResult = uploadedFiles.get(0);
File firstFile = firstResult == null ? null : firstResult.getSavedFile();
if (firstFile == null) {
return null;
}
try {
return Paths.get(firstFile.getParent()).toAbsolutePath().normalize();
} catch (Exception e) {
return null;
}
}
private void scheduleCleanupStagingDirAfterCommit(Path stagingRoot) {
if (stagingRoot == null) {
return;
}
Runnable cleanup = () -> deleteDirectoryRecursivelyQuietly(stagingRoot);
if (TransactionSynchronizationManager.isSynchronizationActive()) {
TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {
@Override
public void afterCommit() {
cleanup.run();
}
});
return;
}
cleanup.run();
}
private void deleteDirectoryRecursivelyQuietly(Path directory) {
if (directory == null) {
return;
}
if (!Files.exists(directory)) {
return;
}
try (Stream<Path> paths = Files.walk(directory)) {
paths.sorted(Comparator.reverseOrder()).forEach(path -> {
try {
Files.deleteIfExists(path);
} catch (IOException e) {
log.debug("Failed to delete: {}", path, e);
}
});
} catch (IOException e) {
log.debug("Failed to cleanup staging dir: {}", directory, e);
}
}
private String sanitizeArchiveFileName(String fileName) {
String input = fileName == null ? "" : fileName.trim();
if (input.isBlank()) {
return "file";
}
StringBuilder builder = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++) {
char c = input.charAt(i);
if (c <= 31 || c == 127) {
builder.append('_');
continue;
}
if (c == '/' || c == '\\' || c == ':' || c == '*' || c == '?' || c == '\"'
|| c == '<' || c == '>' || c == '|') {
builder.append('_');
continue;
}
builder.append(c);
}
String sanitized = builder.toString().trim();
return sanitized.isEmpty() ? "file" : sanitized;
}
private String sha256Hex(String value) {
String input = value == null ? "" : value;
try {
java.security.MessageDigest digest = java.security.MessageDigest.getInstance("SHA-256");
byte[] hashed = digest.digest(input.getBytes(java.nio.charset.StandardCharsets.UTF_8));
StringBuilder builder = new StringBuilder(hashed.length * 2);
for (byte b : hashed) {
builder.append(String.format("%02x", b));
}
return builder.toString();
} catch (Exception e) {
return Integer.toHexString(input.hashCode());
}
}
/**
* 获取文件详情
*/
@@ -349,10 +582,12 @@ public class DatasetFileApplicationService {
public void deleteDatasetFile(String datasetId, String fileId) {
DatasetFile file = getDatasetFile(datasetId, fileId);
Dataset dataset = datasetRepository.getById(datasetId);
- dataset.setFiles(new ArrayList<>(Collections.singleton(file)));
datasetFileRepository.removeById(fileId);
+ if (!isArchivedStatus(file)) {
+ dataset.setFiles(new ArrayList<>(Collections.singleton(file)));
dataset.removeFile(file);
datasetRepository.updateById(dataset);
+ }
datasetFilePreviewService.deletePreviewFileQuietly(datasetId, fileId);
// 删除文件时,上传到数据集中的文件会同时删除数据库中的记录和文件系统中的文件,归集过来的文件仅删除数据库中的记录
if (file.getFilePath().startsWith(dataset.getPath())) {
@@ -393,18 +628,26 @@ public class DatasetFileApplicationService {
if (Objects.isNull(dataset)) {
throw BusinessException.of(DataManagementErrorCode.DATASET_NOT_FOUND);
}
- List<DatasetFile> allByDatasetId = datasetFileRepository.findAllByDatasetId(datasetId);
- Set<String> filePaths = allByDatasetId.stream().map(DatasetFile::getFilePath).collect(Collectors.toSet());
- String datasetPath = dataset.getPath();
- Path downloadPath = Path.of(datasetPath);
+ Path datasetRoot = Paths.get(dataset.getPath()).toAbsolutePath().normalize();
+ Set<Path> filePaths = datasetFileRepository.findAllVisibleByDatasetId(datasetId).stream()
+ .map(DatasetFile::getFilePath)
+ .filter(Objects::nonNull)
+ .map(path -> Paths.get(path).toAbsolutePath().normalize())
+ .filter(path -> path.startsWith(datasetRoot))
+ .filter(path -> !isInternalDatasetPath(datasetRoot, path))
+ .collect(Collectors.toSet());
+ Path downloadPath = datasetRoot;
response.setContentType("application/zip");
String zipName = String.format("dataset_%s.zip",
LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss")));
response.setHeader(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=" + zipName);
try (ZipArchiveOutputStream zos = new ZipArchiveOutputStream(response.getOutputStream())) {
try (Stream<Path> pathStream = Files.walk(downloadPath)) {
- List<Path> allPaths = pathStream.filter(path -> path.toString().startsWith(datasetPath))
- .filter(path -> filePaths.stream().anyMatch(filePath -> filePath.startsWith(path.toString())))
+ List<Path> allPaths = pathStream
+ .map(path -> path.toAbsolutePath().normalize())
+ .filter(path -> path.startsWith(datasetRoot))
+ .filter(path -> !isInternalDatasetPath(datasetRoot, path))
+ .filter(path -> filePaths.stream().anyMatch(filePath -> filePath.startsWith(path)))
.toList();
for (Path path : allPaths) {
addToZipFile(path, downloadPath, zos);
@@ -461,29 +704,33 @@ public class DatasetFileApplicationService {
throw BusinessException.of(DataManagementErrorCode.DATASET_NOT_FOUND);
}
- // 构建上传路径,如果有 prefix 则追加到路径中
- String prefix = Optional.ofNullable(chunkUploadRequest.getPrefix()).orElse("").trim();
- prefix = prefix.replace("\\", "/");
- while (prefix.startsWith("/")) {
- prefix = prefix.substring(1);
- }
- String uploadPath = dataset.getPath();
- if (uploadPath == null || uploadPath.isBlank()) {
- uploadPath = datasetBasePath + File.separator + datasetId;
- }
- if (!prefix.isEmpty()) {
- uploadPath = uploadPath + File.separator + prefix.replace("/", File.separator);
- }
+ String prefix = normalizeLogicalPrefix(chunkUploadRequest == null ? null : chunkUploadRequest.getPrefix());
+ assertNotInternalPrefix(prefix);
+ Path datasetRoot = resolveDatasetRootPath(dataset, datasetId);
+ Path stagingRoot = datasetRoot
+ .resolve(INTERNAL_DIR_NAME)
+ .resolve(INTERNAL_UPLOAD_DIR_NAME)
+ .resolve(UUID.randomUUID().toString())
+ .toAbsolutePath()
+ .normalize();
+ BusinessAssert.isTrue(stagingRoot.startsWith(datasetRoot), CommonErrorCode.PARAM_ERROR);
+ try {
+ Files.createDirectories(stagingRoot);
+ } catch (IOException e) {
+ log.error("Failed to create staging dir: {}", stagingRoot, e);
+ throw BusinessException.of(SystemErrorCode.FILE_SYSTEM_ERROR);
+ }
ChunkUploadPreRequest request = ChunkUploadPreRequest.builder().build();
- request.setUploadPath(uploadPath);
+ request.setUploadPath(stagingRoot.toString());
request.setTotalFileNum(chunkUploadRequest.getTotalFileNum());
request.setServiceId(DatasetConstant.SERVICE_ID);
DatasetFileUploadCheckInfo checkInfo = new DatasetFileUploadCheckInfo();
checkInfo.setDatasetId(datasetId);
checkInfo.setHasArchive(chunkUploadRequest.isHasArchive());
checkInfo.setPrefix(prefix);
+ checkInfo.setStagingPath(stagingRoot.toString());
try {
ObjectMapper objectMapper = new ObjectMapper();
String checkInfoJson = objectMapper.writeValueAsString(checkInfo);
@@ -505,6 +752,14 @@ public class DatasetFileApplicationService {
saveFileInfoToDb(uploadResult, datasetId);
}
/**
* 取消上传
*/
@Transactional
public void cancelUpload(String reqId) {
fileService.cancelUpload(reqId);
}
private void saveFileInfoToDb(FileUploadResult fileUploadResult, String datasetId) {
if (Objects.isNull(fileUploadResult.getSavedFile())) {
// 文件切片上传没有完成
@@ -527,32 +782,251 @@ public class DatasetFileApplicationService {
} else {
files = Collections.singletonList(fileUploadResult);
}
- addFileToDataset(datasetId, files);
+ commitUploadedFiles(datasetId, checkInfo, files, fileUploadResult.isAllFilesUploaded());
}
- private void addFileToDataset(String datasetId, List<FileUploadResult> unpacked) {
+ private void commitUploadedFiles(String datasetId,
+ DatasetFileUploadCheckInfo checkInfo,
+ List<FileUploadResult> uploadedFiles,
+ boolean cleanupStagingAfterCommit) {
Dataset dataset = datasetRepository.getById(datasetId);
- dataset.setFiles(datasetFileRepository.findAllByDatasetId(datasetId));
- for (FileUploadResult file : unpacked) {
- File savedFile = file.getSavedFile();
+ BusinessAssert.notNull(dataset, DataManagementErrorCode.DATASET_NOT_FOUND);
+ Path datasetRoot = resolveDatasetRootPath(dataset, datasetId);
String prefix = checkInfo == null ? "" : normalizeLogicalPrefix(checkInfo.getPrefix());
assertNotInternalPrefix(prefix);
Path stagingRoot = resolveStagingRootPath(datasetRoot, checkInfo, uploadedFiles);
BusinessAssert.notNull(stagingRoot, CommonErrorCode.PARAM_ERROR);
dataset.setFiles(new ArrayList<>(datasetFileRepository.findAllVisibleByDatasetId(datasetId)));
for (FileUploadResult fileResult : uploadedFiles) {
commitSingleUploadedFile(dataset, datasetRoot, stagingRoot, prefix, fileResult);
}
dataset.active();
datasetRepository.updateById(dataset);
if (cleanupStagingAfterCommit) {
scheduleCleanupStagingDirAfterCommit(stagingRoot);
}
}
private void commitSingleUploadedFile(Dataset dataset,
Path datasetRoot,
Path stagingRoot,
String prefix,
FileUploadResult fileResult) {
if (dataset == null || fileResult == null || fileResult.getSavedFile() == null) {
return;
}
Path incomingPath = Paths.get(fileResult.getSavedFile().getPath()).toAbsolutePath().normalize();
BusinessAssert.isTrue(incomingPath.startsWith(stagingRoot), CommonErrorCode.PARAM_ERROR);
String relativePath = stagingRoot.relativize(incomingPath).toString().replace(File.separator, "/");
String logicalPath = joinLogicalPath(prefix, relativePath);
assertNotInternalPrefix(logicalPath);
commitNewFileVersion(dataset, datasetRoot, logicalPath, incomingPath, true);
}
private DatasetFile commitNewFileVersion(Dataset dataset,
Path datasetRoot,
String logicalPath,
Path incomingFilePath,
boolean moveIncoming) {
BusinessAssert.notNull(dataset, CommonErrorCode.PARAM_ERROR);
BusinessAssert.isTrue(datasetRoot != null && Files.exists(datasetRoot), CommonErrorCode.PARAM_ERROR);
String normalizedLogicalPath = normalizeLogicalPath(logicalPath);
BusinessAssert.isTrue(!normalizedLogicalPath.isEmpty(), CommonErrorCode.PARAM_ERROR);
assertNotInternalPrefix(normalizedLogicalPath);
Path targetFilePath = datasetRoot.resolve(normalizedLogicalPath.replace("/", File.separator))
.toAbsolutePath()
.normalize();
BusinessAssert.isTrue(targetFilePath.startsWith(datasetRoot), CommonErrorCode.PARAM_ERROR);
DuplicateMethod effectiveDuplicateMethod = resolveEffectiveDuplicateMethod();
DatasetFile latest = datasetFileRepository.findLatestByDatasetIdAndLogicalPath(dataset.getId(), normalizedLogicalPath);
if (latest == null && dataset.getFiles() != null) {
latest = dataset.getFiles().stream()
.filter(existing -> isSameNormalizedPath(existing == null ? null : existing.getFilePath(), targetFilePath.toString()))
.findFirst()
.orElse(null);
}
if (latest != null && effectiveDuplicateMethod == DuplicateMethod.ERROR) {
throw BusinessException.of(DataManagementErrorCode.DATASET_FILE_ALREADY_EXISTS);
}
long nextVersion = 1L;
if (latest != null) {
long latestVersion = Optional.ofNullable(latest.getVersion()).orElse(1L);
if (latest.getVersion() == null) {
latest.setVersion(latestVersion);
}
if (latest.getLogicalPath() == null || latest.getLogicalPath().isBlank()) {
latest.setLogicalPath(normalizedLogicalPath);
}
nextVersion = latestVersion + 1L;
}
if (latest != null && effectiveDuplicateMethod == DuplicateMethod.VERSION) {
Path archivedPath = archiveDatasetFileVersion(datasetRoot, normalizedLogicalPath, latest);
if (archivedPath != null) {
latest.setFilePath(archivedPath.toString());
} else if (Files.exists(targetFilePath)) {
log.error("Failed to archive latest file, refuse to overwrite. datasetId={}, fileId={}, logicalPath={}, targetPath={}",
dataset.getId(), latest.getId(), normalizedLogicalPath, targetFilePath);
throw BusinessException.of(SystemErrorCode.FILE_SYSTEM_ERROR);
}
latest.setStatus(FILE_STATUS_ARCHIVED);
datasetFileRepository.updateById(latest);
dataset.removeFile(latest);
} else if (latest == null && Files.exists(targetFilePath)) {
archiveOrphanTargetFile(datasetRoot, normalizedLogicalPath, targetFilePath);
}
try {
Files.createDirectories(targetFilePath.getParent());
if (moveIncoming) {
Files.move(incomingFilePath, targetFilePath, java.nio.file.StandardCopyOption.REPLACE_EXISTING);
} else {
Files.copy(incomingFilePath, targetFilePath, java.nio.file.StandardCopyOption.REPLACE_EXISTING);
}
} catch (IOException e) {
log.error("Failed to write dataset file, datasetId={}, logicalPath={}, targetPath={}",
dataset.getId(), normalizedLogicalPath, targetFilePath, e);
throw BusinessException.of(SystemErrorCode.FILE_SYSTEM_ERROR);
}
LocalDateTime currentTime = LocalDateTime.now();
String fileName = targetFilePath.getFileName().toString();
long fileSize;
try {
fileSize = Files.size(targetFilePath);
} catch (IOException e) {
fileSize = 0L;
}
DatasetFile datasetFile = DatasetFile.builder()
.id(UUID.randomUUID().toString())
- .datasetId(datasetId)
- .fileSize(savedFile.length())
+ .datasetId(dataset.getId())
+ .fileName(fileName)
+ .fileType(AnalyzerUtils.getExtension(fileName))
+ .fileSize(fileSize)
+ .filePath(targetFilePath.toString())
+ .logicalPath(normalizedLogicalPath)
+ .version(nextVersion)
+ .status(FILE_STATUS_ACTIVE)
.uploadTime(currentTime)
.lastAccessTime(currentTime)
- .fileName(file.getFileName())
- .filePath(savedFile.getPath())
- .fileType(AnalyzerUtils.getExtension(file.getFileName()))
.build();
- setDatasetFileId(datasetFile, dataset);
datasetFileRepository.saveOrUpdate(datasetFile);
dataset.addFile(datasetFile);
triggerPdfTextExtraction(dataset, datasetFile);
return datasetFile;
}
private DuplicateMethod resolveEffectiveDuplicateMethod() {
if (duplicateMethod == null) {
return DuplicateMethod.VERSION;
}
if (duplicateMethod == DuplicateMethod.COVER) {
log.warn("duplicateMethod=COVER 会导致标注引用的 fileId 对应内容被覆盖,已强制按 VERSION 处理。");
return DuplicateMethod.VERSION;
}
return duplicateMethod;
}
private Path archiveDatasetFileVersion(Path datasetRoot, String logicalPath, DatasetFile latest) {
if (latest == null || latest.getId() == null || latest.getId().isBlank()) {
return null;
}
Path currentPath;
try {
currentPath = Paths.get(latest.getFilePath()).toAbsolutePath().normalize();
} catch (Exception e) {
log.warn("Invalid latest file path, skip archiving. datasetId={}, fileId={}, filePath={}",
latest.getDatasetId(), latest.getId(), latest.getFilePath());
return null;
}
if (!Files.exists(currentPath) || !Files.isRegularFile(currentPath)) {
log.warn("Latest file not found on disk, skip archiving. datasetId={}, fileId={}, filePath={}",
latest.getDatasetId(), latest.getId(), currentPath);
return null;
}
if (!currentPath.startsWith(datasetRoot)) {
log.warn("Latest file path out of dataset root, skip archiving. datasetId={}, fileId={}, filePath={}",
latest.getDatasetId(), latest.getId(), currentPath);
return null;
}
long latestVersion = Optional.ofNullable(latest.getVersion()).orElse(1L);
String logicalPathHash = sha256Hex(logicalPath);
Path archiveDir = datasetRoot
.resolve(INTERNAL_DIR_NAME)
.resolve(INTERNAL_VERSIONS_DIR_NAME)
.resolve(logicalPathHash)
.resolve("v" + latestVersion)
.toAbsolutePath()
.normalize();
BusinessAssert.isTrue(archiveDir.startsWith(datasetRoot), CommonErrorCode.PARAM_ERROR);
try {
Files.createDirectories(archiveDir);
} catch (IOException e) {
log.error("Failed to create archive dir: {}", archiveDir, e);
throw BusinessException.of(SystemErrorCode.FILE_SYSTEM_ERROR);
}
String fileName = sanitizeArchiveFileName(Optional.ofNullable(latest.getFileName()).orElse(currentPath.getFileName().toString()));
Path archivedPath = archiveDir.resolve(latest.getId() + "__" + fileName).toAbsolutePath().normalize();
BusinessAssert.isTrue(archivedPath.startsWith(archiveDir), CommonErrorCode.PARAM_ERROR);
try {
Files.move(currentPath, archivedPath, java.nio.file.StandardCopyOption.REPLACE_EXISTING);
return archivedPath;
} catch (IOException e) {
log.error("Failed to archive latest file, datasetId={}, fileId={}, from={}, to={}",
latest.getDatasetId(), latest.getId(), currentPath, archivedPath, e);
throw BusinessException.of(SystemErrorCode.FILE_SYSTEM_ERROR);
}
}
private void archiveOrphanTargetFile(Path datasetRoot, String logicalPath, Path targetFilePath) {
if (datasetRoot == null || targetFilePath == null) {
return;
}
if (!Files.exists(targetFilePath) || !Files.isRegularFile(targetFilePath)) {
return;
}
String logicalPathHash = sha256Hex(logicalPath);
Path orphanDir = datasetRoot
.resolve(INTERNAL_DIR_NAME)
.resolve(INTERNAL_VERSIONS_DIR_NAME)
.resolve(logicalPathHash)
.resolve("orphan")
.toAbsolutePath()
.normalize();
if (!orphanDir.startsWith(datasetRoot)) {
return;
}
try {
Files.createDirectories(orphanDir);
String safeName = sanitizeArchiveFileName(targetFilePath.getFileName().toString());
Path orphanPath = orphanDir.resolve("orphan_" + System.currentTimeMillis() + "__" + safeName)
.toAbsolutePath()
.normalize();
if (!orphanPath.startsWith(orphanDir)) {
return;
}
Files.move(targetFilePath, orphanPath, java.nio.file.StandardCopyOption.REPLACE_EXISTING);
} catch (Exception e) {
log.warn("Failed to archive orphan target file, logicalPath={}, targetPath={}", logicalPath, targetFilePath, e);
}
- }
- dataset.active();
- datasetRepository.updateById(dataset);
- }
}
/**
@@ -570,11 +1044,16 @@ public class DatasetFileApplicationService {
while (parentPrefix.startsWith("/")) {
parentPrefix = parentPrefix.substring(1);
}
parentPrefix = normalizeLogicalPrefix(parentPrefix);
assertNotInternalPrefix(parentPrefix);
String directoryName = Optional.ofNullable(req.getDirectoryName()).orElse("").trim();
if (directoryName.isEmpty()) {
throw BusinessException.of(CommonErrorCode.PARAM_ERROR);
}
if (INTERNAL_DIR_NAME.equals(directoryName)) {
throw BusinessException.of(CommonErrorCode.PARAM_ERROR);
}
if (directoryName.contains("..") || directoryName.contains("/") || directoryName.contains("\\")) {
throw BusinessException.of(CommonErrorCode.PARAM_ERROR);
}
@@ -616,6 +1095,9 @@ public class DatasetFileApplicationService {
while (prefix.endsWith("/")) {
prefix = prefix.substring(0, prefix.length() - 1);
}
if (prefix.equals(INTERNAL_DIR_NAME) || prefix.startsWith(INTERNAL_DIR_NAME + "/")) {
throw BusinessException.of(CommonErrorCode.PARAM_ERROR);
}
Path basePath = Paths.get(datasetPath);
Path targetPath = prefix.isEmpty() ? basePath : basePath.resolve(prefix);
@@ -652,6 +1134,7 @@ public class DatasetFileApplicationService {
private void zipDirectory(Path sourceDir, Path basePath, ZipArchiveOutputStream zipOut) throws IOException {
try (Stream<Path> paths = Files.walk(sourceDir)) {
paths.filter(path -> !Files.isDirectory(path))
.filter(path -> !isInternalDatasetPath(basePath.toAbsolutePath().normalize(), path))
.forEach(path -> {
try {
Path relativePath = basePath.relativize(path);
@@ -690,6 +1173,9 @@ public class DatasetFileApplicationService {
if (prefix.isEmpty()) {
throw BusinessException.of(CommonErrorCode.PARAM_ERROR);
}
if (prefix.equals(INTERNAL_DIR_NAME) || prefix.startsWith(INTERNAL_DIR_NAME + "/")) {
throw BusinessException.of(CommonErrorCode.PARAM_ERROR);
}
String datasetPath = dataset.getPath();
Path basePath = Paths.get(datasetPath);
@@ -761,28 +1247,6 @@ public class DatasetFileApplicationService {
}
}
/**
* 为数据集文件设置文件id
*
* @param datasetFile 要设置id的文件
* @param dataset 数据集(包含文件列表)
*/
private void setDatasetFileId(DatasetFile datasetFile, Dataset dataset) {
Map<String, DatasetFile> existDatasetFilMap = dataset.getFiles().stream().collect(Collectors.toMap(DatasetFile::getFilePath, Function.identity()));
DatasetFile existDatasetFile = existDatasetFilMap.get(datasetFile.getFilePath());
if (Objects.isNull(existDatasetFile)) {
return;
}
if (duplicateMethod == DuplicateMethod.ERROR) {
log.error("file {} already exists in dataset {}", datasetFile.getFileName(), datasetFile.getDatasetId());
throw BusinessException.of(DataManagementErrorCode.DATASET_FILE_ALREADY_EXISTS);
}
if (duplicateMethod == DuplicateMethod.COVER) {
dataset.removeFile(existDatasetFile);
datasetFile.setId(existDatasetFile.getId());
}
}
/**
* 复制文件到数据集目录
*
@@ -794,36 +1258,21 @@ public class DatasetFileApplicationService {
public List<DatasetFile> copyFilesToDatasetDir(String datasetId, CopyFilesRequest req) {
Dataset dataset = datasetRepository.getById(datasetId);
BusinessAssert.notNull(dataset, SystemErrorCode.RESOURCE_NOT_FOUND);
+ Path datasetRoot = resolveDatasetRootPath(dataset, datasetId);
+ dataset.setFiles(new ArrayList<>(datasetFileRepository.findAllVisibleByDatasetId(datasetId)));
List<DatasetFile> copiedFiles = new ArrayList<>();
- List<DatasetFile> existDatasetFiles = datasetFileRepository.findAllByDatasetId(datasetId);
- dataset.setFiles(existDatasetFiles);
for (String sourceFilePath : req.sourcePaths()) {
- Path sourcePath = Paths.get(sourceFilePath);
+ Path sourcePath = Paths.get(sourceFilePath).toAbsolutePath().normalize();
if (!Files.exists(sourcePath) || !Files.isRegularFile(sourcePath)) {
log.warn("Source file does not exist or is not a regular file: {}", sourceFilePath);
continue;
}
- String fileName = sourcePath.getFileName().toString();
- File sourceFile = sourcePath.toFile();
- LocalDateTime currentTime = LocalDateTime.now();
- DatasetFile datasetFile = DatasetFile.builder()
- .id(UUID.randomUUID().toString())
- .datasetId(datasetId)
- .fileName(fileName)
- .fileType(AnalyzerUtils.getExtension(fileName))
- .fileSize(sourceFile.length())
- .filePath(Paths.get(dataset.getPath(), fileName).toString())
- .uploadTime(currentTime)
- .lastAccessTime(currentTime)
- .build();
- setDatasetFileId(datasetFile, dataset);
- dataset.addFile(datasetFile);
+ String logicalPath = sourcePath.getFileName().toString();
+ DatasetFile datasetFile = commitNewFileVersion(dataset, datasetRoot, logicalPath, sourcePath, false);
copiedFiles.add(datasetFile);
}
- datasetFileRepository.saveOrUpdateBatch(copiedFiles, 100);
dataset.active();
datasetRepository.updateById(dataset);
- CompletableFuture.runAsync(() -> copyFilesToDatasetDir(req.sourcePaths(), dataset));
return copiedFiles;
}
@@ -839,13 +1288,11 @@ public class DatasetFileApplicationService {
public List<DatasetFile> copyFilesToDatasetDirWithSourceRoot(String datasetId, Path sourceRoot, List<String> sourcePaths) {
Dataset dataset = datasetRepository.getById(datasetId);
BusinessAssert.notNull(dataset, SystemErrorCode.RESOURCE_NOT_FOUND);
+ Path datasetRoot = resolveDatasetRootPath(dataset, datasetId);
Path normalizedRoot = sourceRoot.toAbsolutePath().normalize();
- List<DatasetFile> copiedFiles = new ArrayList<>();
- List<DatasetFile> existDatasetFiles = datasetFileRepository.findAllByDatasetId(datasetId);
- dataset.setFiles(existDatasetFiles);
- Map<String, DatasetFile> copyTargets = new LinkedHashMap<>();
+ dataset.setFiles(new ArrayList<>(datasetFileRepository.findAllVisibleByDatasetId(datasetId)));
+ List<DatasetFile> copiedFiles = new ArrayList<>();
for (String sourceFilePath : sourcePaths) {
if (sourceFilePath == null || sourceFilePath.isBlank()) {
continue;
@@ -859,86 +1306,16 @@ public class DatasetFileApplicationService {
log.warn("Source file does not exist or is not a regular file: {}", sourceFilePath);
continue;
}
Path relativePath = normalizedRoot.relativize(sourcePath);
- String fileName = sourcePath.getFileName().toString();
- File sourceFile = sourcePath.toFile();
- LocalDateTime currentTime = LocalDateTime.now();
- Path targetPath = Paths.get(dataset.getPath(), relativePath.toString());
- DatasetFile datasetFile = DatasetFile.builder()
- .id(UUID.randomUUID().toString())
- .datasetId(datasetId)
- .fileName(fileName)
- .fileType(AnalyzerUtils.getExtension(fileName))
- .fileSize(sourceFile.length())
- .filePath(targetPath.toString())
- .uploadTime(currentTime)
- .lastAccessTime(currentTime)
- .build();
- setDatasetFileId(datasetFile, dataset);
- dataset.addFile(datasetFile);
+ String logicalPath = relativePath.toString().replace("\\", "/");
+ DatasetFile datasetFile = commitNewFileVersion(dataset, datasetRoot, logicalPath, sourcePath, false);
copiedFiles.add(datasetFile);
- copyTargets.put(sourceFilePath, datasetFile);
}
- if (copiedFiles.isEmpty()) {
- return copiedFiles;
- }
- datasetFileRepository.saveOrUpdateBatch(copiedFiles, 100);
dataset.active();
datasetRepository.updateById(dataset);
- CompletableFuture.runAsync(() -> copyFilesToDatasetDirWithRelativePath(copyTargets, dataset, normalizedRoot));
return copiedFiles;
}
private void copyFilesToDatasetDir(List<String> sourcePaths, Dataset dataset) {
for (String sourcePath : sourcePaths) {
Path sourceFilePath = Paths.get(sourcePath);
Path targetFilePath = Paths.get(dataset.getPath(), sourceFilePath.getFileName().toString());
try {
Files.createDirectories(Path.of(dataset.getPath()));
Files.copy(sourceFilePath, targetFilePath);
DatasetFile datasetFile = datasetFileRepository.findByDatasetIdAndFileName(
dataset.getId(),
sourceFilePath.getFileName().toString()
);
triggerPdfTextExtraction(dataset, datasetFile);
} catch (IOException e) {
log.error("Failed to copy file from {} to {}", sourcePath, targetFilePath, e);
}
}
}
private void copyFilesToDatasetDirWithRelativePath(
Map<String, DatasetFile> copyTargets,
Dataset dataset,
Path sourceRoot
) {
Path datasetRoot = Paths.get(dataset.getPath()).toAbsolutePath().normalize();
Path normalizedRoot = sourceRoot.toAbsolutePath().normalize();
for (Map.Entry<String, DatasetFile> entry : copyTargets.entrySet()) {
Path sourcePath = Paths.get(entry.getKey()).toAbsolutePath().normalize();
if (!sourcePath.startsWith(normalizedRoot)) {
log.warn("Source file path is out of root: {}", sourcePath);
continue;
}
Path relativePath = normalizedRoot.relativize(sourcePath);
Path targetFilePath = datasetRoot.resolve(relativePath).normalize();
if (!targetFilePath.startsWith(datasetRoot)) {
log.warn("Target file path is out of dataset path: {}", targetFilePath);
continue;
}
try {
Files.createDirectories(targetFilePath.getParent());
Files.copy(sourcePath, targetFilePath);
triggerPdfTextExtraction(dataset, entry.getValue());
} catch (IOException e) {
log.error("Failed to copy file from {} to {}", sourcePath, targetFilePath, e);
}
}
}
/**
* 添加文件到数据集(仅创建数据库记录,不执行文件系统操作)
*
@@ -951,8 +1328,7 @@ public class DatasetFileApplicationService {
Dataset dataset = datasetRepository.getById(datasetId);
BusinessAssert.notNull(dataset, SystemErrorCode.RESOURCE_NOT_FOUND);
List<DatasetFile> addedFiles = new ArrayList<>();
- List<DatasetFile> existDatasetFiles = datasetFileRepository.findAllByDatasetId(datasetId);
- dataset.setFiles(existDatasetFiles);
+ dataset.setFiles(new ArrayList<>(datasetFileRepository.findAllVisibleByDatasetId(datasetId)));
boolean softAdd = req.softAdd();
String metadata;
@@ -969,8 +1345,43 @@ public class DatasetFileApplicationService {
Path sourcePath = Paths.get(sourceFilePath);
String fileName = sourcePath.getFileName().toString();
File sourceFile = sourcePath.toFile();
- LocalDateTime currentTime = LocalDateTime.now();
+ String logicalPath = normalizeLogicalPath(fileName);
assertNotInternalPrefix(logicalPath);
DatasetFile latest = datasetFileRepository.findLatestByDatasetIdAndLogicalPath(datasetId, logicalPath);
if (latest == null && dataset.getFiles() != null) {
latest = dataset.getFiles().stream()
.filter(existing -> existing != null
&& !isArchivedStatus(existing)
&& Objects.equals(existing.getFileName(), fileName))
.findFirst()
.orElse(null);
}
DuplicateMethod effectiveDuplicateMethod = resolveEffectiveDuplicateMethod();
if (latest != null && effectiveDuplicateMethod == DuplicateMethod.ERROR) {
throw BusinessException.of(DataManagementErrorCode.DATASET_FILE_ALREADY_EXISTS);
}
long nextVersion = 1L;
if (latest != null) {
long latestVersion = Optional.ofNullable(latest.getVersion()).orElse(1L);
if (latest.getVersion() == null) {
latest.setVersion(latestVersion);
}
if (latest.getLogicalPath() == null || latest.getLogicalPath().isBlank()) {
latest.setLogicalPath(logicalPath);
}
nextVersion = latestVersion + 1L;
}
if (latest != null && effectiveDuplicateMethod == DuplicateMethod.VERSION) {
latest.setStatus(FILE_STATUS_ARCHIVED);
datasetFileRepository.updateById(latest);
dataset.removeFile(latest);
}
LocalDateTime currentTime = LocalDateTime.now();
DatasetFile datasetFile = DatasetFile.builder()
.id(UUID.randomUUID().toString())
.datasetId(datasetId)
@@ -978,16 +1389,19 @@ public class DatasetFileApplicationService {
.fileType(AnalyzerUtils.getExtension(fileName))
.fileSize(sourceFile.length())
.filePath(sourceFilePath)
.logicalPath(logicalPath)
.version(nextVersion)
.status(FILE_STATUS_ACTIVE)
.uploadTime(currentTime)
.lastAccessTime(currentTime)
.metadata(metadata)
.build();
- setDatasetFileId(datasetFile, dataset);
- datasetFileRepository.saveOrUpdate(datasetFile);
dataset.addFile(datasetFile);
addedFiles.add(datasetFile);
triggerPdfTextExtraction(dataset, datasetFile);
}
datasetFileRepository.saveOrUpdateBatch(addedFiles, 100);
dataset.active();
datasetRepository.updateById(dataset);
// Note: addFilesToDataset only creates DB records, no file system operations

View File

@@ -7,5 +7,6 @@ package com.datamate.datamanagement.common.enums;
*/
public enum DuplicateMethod {
ERROR,
- COVER
+ COVER,
+ VERSION
}

View File

@@ -152,12 +152,20 @@ public class Dataset extends BaseEntity<String> {
}
public void removeFile(DatasetFile file) {
- if (this.files.remove(file)) {
+ if (file == null) {
+ return;
+ }
+ boolean removed = this.files.remove(file);
+ if (!removed && file.getId() != null) {
+ removed = this.files.removeIf(existing -> Objects.equals(existing.getId(), file.getId()));
+ }
+ if (!removed) {
+ return;
+ }
this.fileCount = Math.max(0, this.fileCount - 1);
this.sizeBytes = Math.max(0, this.sizeBytes - (file.getFileSize() != null ? file.getFileSize() : 0L));
this.updatedAt = LocalDateTime.now();
- }
}
public void active() {
if (this.status == DatasetStatusType.DRAFT) {

View File

@@ -28,12 +28,16 @@ public class DatasetFile {
private String datasetId; // UUID private String datasetId; // UUID
private String fileName; private String fileName;
private String filePath; private String filePath;
/** 文件逻辑路径(相对数据集根目录,包含子目录) */
private String logicalPath;
/** 文件版本号(同一个 logicalPath 下递增) */
private Long version;
private String fileType; // JPG/PNG/DCM/TXT private String fileType; // JPG/PNG/DCM/TXT
private Long fileSize; // bytes private Long fileSize; // bytes
private String checkSum; private String checkSum;
private String tags; private String tags;
private String metadata; private String metadata;
private String status; // UPLOADED, PROCESSING, COMPLETED, ERROR
private String status; // ACTIVE/ARCHIVED/DELETED/PROCESSING...
private LocalDateTime uploadTime; private LocalDateTime uploadTime;
private LocalDateTime lastAccessTime; private LocalDateTime lastAccessTime;
private LocalDateTime createdAt; private LocalDateTime createdAt;

View File

@@ -21,4 +21,7 @@ public class DatasetFileUploadCheckInfo {
/** 目标子目录前缀,例如 "images/",为空表示数据集根目录 */ /** 目标子目录前缀,例如 "images/",为空表示数据集根目录 */
private String prefix; private String prefix;
/** 上传临时落盘目录(仅服务端使用,不对外暴露) */
private String stagingPath;
} }

View File

@@ -24,8 +24,19 @@ public interface DatasetFileRepository extends IRepository<DatasetFile> {
List<DatasetFile> findAllByDatasetId(String datasetId); List<DatasetFile> findAllByDatasetId(String datasetId);
/**
* 查询数据集内“可见文件”(默认不包含历史归档版本)。
* 约定:status 为 NULL 视为可见;status = ARCHIVED 视为历史版本。
*/
List<DatasetFile> findAllVisibleByDatasetId(String datasetId);
DatasetFile findByDatasetIdAndFileName(String datasetId, String fileName); DatasetFile findByDatasetIdAndFileName(String datasetId, String fileName);
/**
* 查询指定逻辑路径的最新版本(ACTIVE/NULL)。
*/
DatasetFile findLatestByDatasetIdAndLogicalPath(String datasetId, String logicalPath);
IPage<DatasetFile> findByCriteria(String datasetId, String fileType, String status, String name, IPage<DatasetFile> findByCriteria(String datasetId, String fileType, String status, String name,
Boolean hasAnnotation, IPage<DatasetFile> page); Boolean hasAnnotation, IPage<DatasetFile> page);

View File

@@ -25,6 +25,8 @@ public class DatasetFileRepositoryImpl extends CrudRepository<DatasetFileMapper,
private final DatasetFileMapper datasetFileMapper; private final DatasetFileMapper datasetFileMapper;
private static final String ANNOTATION_EXISTS_SQL = private static final String ANNOTATION_EXISTS_SQL =
"SELECT 1 FROM t_dm_annotation_results ar WHERE ar.file_id = t_dm_dataset_files.id"; "SELECT 1 FROM t_dm_annotation_results ar WHERE ar.file_id = t_dm_dataset_files.id";
private static final String FILE_STATUS_ARCHIVED = "ARCHIVED";
private static final String FILE_STATUS_ACTIVE = "ACTIVE";
@Override @Override
public Long countByDatasetId(String datasetId) { public Long countByDatasetId(String datasetId) {
@@ -51,19 +53,54 @@ public class DatasetFileRepositoryImpl extends CrudRepository<DatasetFileMapper,
return datasetFileMapper.findAllByDatasetId(datasetId); return datasetFileMapper.findAllByDatasetId(datasetId);
} }
@Override
public List<DatasetFile> findAllVisibleByDatasetId(String datasetId) {
return datasetFileMapper.selectList(new LambdaQueryWrapper<DatasetFile>()
.eq(DatasetFile::getDatasetId, datasetId)
.and(wrapper -> wrapper.isNull(DatasetFile::getStatus)
.or()
.ne(DatasetFile::getStatus, FILE_STATUS_ARCHIVED))
.orderByDesc(DatasetFile::getUploadTime));
}
@Override @Override
public DatasetFile findByDatasetIdAndFileName(String datasetId, String fileName) { public DatasetFile findByDatasetIdAndFileName(String datasetId, String fileName) {
return datasetFileMapper.findByDatasetIdAndFileName(datasetId, fileName); return datasetFileMapper.findByDatasetIdAndFileName(datasetId, fileName);
} }
@Override
public DatasetFile findLatestByDatasetIdAndLogicalPath(String datasetId, String logicalPath) {
if (!StringUtils.hasText(datasetId) || !StringUtils.hasText(logicalPath)) {
return null;
}
return datasetFileMapper.selectOne(new LambdaQueryWrapper<DatasetFile>()
.eq(DatasetFile::getDatasetId, datasetId)
.eq(DatasetFile::getLogicalPath, logicalPath)
.and(wrapper -> wrapper.isNull(DatasetFile::getStatus)
.or()
.eq(DatasetFile::getStatus, FILE_STATUS_ACTIVE))
.orderByDesc(DatasetFile::getVersion)
.orderByDesc(DatasetFile::getUploadTime)
.last("LIMIT 1"));
}
public IPage<DatasetFile> findByCriteria(String datasetId, String fileType, String status, String name, public IPage<DatasetFile> findByCriteria(String datasetId, String fileType, String status, String name,
Boolean hasAnnotation, IPage<DatasetFile> page) { Boolean hasAnnotation, IPage<DatasetFile> page) {
return datasetFileMapper.selectPage(page, new LambdaQueryWrapper<DatasetFile>()
LambdaQueryWrapper<DatasetFile> wrapper = new LambdaQueryWrapper<DatasetFile>()
.eq(DatasetFile::getDatasetId, datasetId) .eq(DatasetFile::getDatasetId, datasetId)
.eq(StringUtils.hasText(fileType), DatasetFile::getFileType, fileType) .eq(StringUtils.hasText(fileType), DatasetFile::getFileType, fileType)
.eq(StringUtils.hasText(status), DatasetFile::getStatus, status)
.like(StringUtils.hasText(name), DatasetFile::getFileName, name) .like(StringUtils.hasText(name), DatasetFile::getFileName, name)
.exists(Boolean.TRUE.equals(hasAnnotation), ANNOTATION_EXISTS_SQL));
.exists(Boolean.TRUE.equals(hasAnnotation), ANNOTATION_EXISTS_SQL);
if (StringUtils.hasText(status)) {
wrapper.eq(DatasetFile::getStatus, status);
} else {
wrapper.and(visibility -> visibility.isNull(DatasetFile::getStatus)
.or()
.ne(DatasetFile::getStatus, FILE_STATUS_ARCHIVED));
}
return datasetFileMapper.selectPage(page, wrapper);
} }
@Override @Override
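Both queries above encode the visibility convention documented on the repository interface: a NULL status counts as visible, and ARCHIVED rows are historical versions that are excluded unless the caller filters on status explicitly. A small illustrative sketch of that predicate, not production code:

import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class VisibilityRuleSketch {
    // visible <=> status is NULL or anything other than ARCHIVED
    static final Predicate<String> VISIBLE =
            status -> status == null || !"ARCHIVED".equals(status);

    public static void main(String[] args) {
        List<String> statuses = Arrays.asList(null, "ACTIVE", "ARCHIVED");
        // prints: null -> true, ACTIVE -> true, ARCHIVED -> false
        statuses.forEach(s -> System.out.println(s + " -> " + VISIBLE.test(s)));
    }
}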

View File

@@ -1,8 +1,10 @@
package com.datamate.datamanagement.interfaces.dto; package com.datamate.datamanagement.interfaces.dto;
import com.datamate.datamanagement.common.enums.DatasetStatusType; import com.datamate.datamanagement.common.enums.DatasetStatusType;
import com.fasterxml.jackson.annotation.JsonIgnore;
import jakarta.validation.constraints.NotBlank; import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Size; import jakarta.validation.constraints.Size;
import lombok.AccessLevel;
import lombok.Getter; import lombok.Getter;
import lombok.Setter; import lombok.Setter;
@@ -24,9 +26,18 @@ public class UpdateDatasetRequest {
/** 归集任务id */ /** 归集任务id */
private String dataSource; private String dataSource;
/** 父数据集ID */ /** 父数据集ID */
@Setter(AccessLevel.NONE)
private String parentDatasetId; private String parentDatasetId;
@JsonIgnore
@Setter(AccessLevel.NONE)
private boolean parentDatasetIdProvided;
/** 标签列表 */ /** 标签列表 */
private List<String> tags; private List<String> tags;
/** 数据集状态 */ /** 数据集状态 */
private DatasetStatusType status; private DatasetStatusType status;
public void setParentDatasetId(String parentDatasetId) {
this.parentDatasetIdProvided = true;
this.parentDatasetId = parentDatasetId;
}
} }
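The custom setter plus the @JsonIgnore flag is a common way to tell "property omitted from the JSON body" apart from "property explicitly set to null", because Jackson only invokes the setter when the key is present. A hypothetical consumer (not shown in this commit) could then decide whether to keep or replace the parent:

// Sketch only: resolve the parent id an update should end up with.
static String resolveParentForUpdate(UpdateDatasetRequest request, String currentParentId) {
    // key absent from the request body -> keep the existing parent;
    // key present (even as null) -> take the submitted value, allowing "detach parent"
    return request.isParentDatasetIdProvided() ? request.getParentDatasetId() : currentParentId;
}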

View File

@@ -0,0 +1,33 @@
package com.datamate.datamanagement.interfaces.rest;
import com.datamate.datamanagement.application.DatasetFileApplicationService;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PutMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
/**
* 数据集上传控制器
*/
@Slf4j
@RestController
@RequiredArgsConstructor
@RequestMapping("/data-management/datasets/upload")
public class DatasetUploadController {
private final DatasetFileApplicationService datasetFileApplicationService;
/**
* 取消上传
*
* @param reqId 预上传请求ID
*/
@PutMapping("/cancel-upload/{reqId}")
public ResponseEntity<Void> cancelUpload(@PathVariable("reqId") String reqId) {
datasetFileApplicationService.cancelUpload(reqId);
return ResponseEntity.ok().build();
}
}
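A minimal client-side sketch for exercising the new endpoint, assuming the service is reachable at http://localhost:8080 with no additional context path and using 12345 as a placeholder request id:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CancelUploadExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/data-management/datasets/upload/cancel-upload/12345"))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
        System.out.println("cancel-upload status: " + response.statusCode()); // expect 200 on success
    }
}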

View File

@@ -3,7 +3,7 @@
"http://mybatis.org/dtd/mybatis-3-mapper.dtd"> "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.datamate.datamanagement.infrastructure.persistence.mapper.DatasetFileMapper"> <mapper namespace="com.datamate.datamanagement.infrastructure.persistence.mapper.DatasetFileMapper">
<sql id="Base_Column_List"> <sql id="Base_Column_List">
id, dataset_id, file_name, file_path, file_type, file_size, check_sum, tags, metadata, status,
id, dataset_id, file_name, file_path, logical_path, version, file_type, file_size, check_sum, tags, metadata, status,
upload_time, last_access_time, created_at, updated_at upload_time, last_access_time, created_at, updated_at
</sql> </sql>
@@ -39,13 +39,17 @@
</select> </select>
<select id="countByDatasetId" parameterType="string" resultType="long"> <select id="countByDatasetId" parameterType="string" resultType="long">
SELECT COUNT(*) FROM t_dm_dataset_files WHERE dataset_id = #{datasetId}
SELECT COUNT(*)
FROM t_dm_dataset_files
WHERE dataset_id = #{datasetId}
AND (status IS NULL OR status <> 'ARCHIVED')
</select> </select>
<select id="countNonDerivedByDatasetId" parameterType="string" resultType="long"> <select id="countNonDerivedByDatasetId" parameterType="string" resultType="long">
SELECT COUNT(*) SELECT COUNT(*)
FROM t_dm_dataset_files FROM t_dm_dataset_files
WHERE dataset_id = #{datasetId} WHERE dataset_id = #{datasetId}
AND (status IS NULL OR status <> 'ARCHIVED')
AND (metadata IS NULL OR JSON_EXTRACT(metadata, '$.derived_from_file_id') IS NULL) AND (metadata IS NULL OR JSON_EXTRACT(metadata, '$.derived_from_file_id') IS NULL)
</select> </select>
@@ -54,13 +58,19 @@
</select> </select>
<select id="sumSizeByDatasetId" parameterType="string" resultType="long"> <select id="sumSizeByDatasetId" parameterType="string" resultType="long">
SELECT COALESCE(SUM(file_size), 0) FROM t_dm_dataset_files WHERE dataset_id = #{datasetId}
SELECT COALESCE(SUM(file_size), 0)
FROM t_dm_dataset_files
WHERE dataset_id = #{datasetId}
AND (status IS NULL OR status <> 'ARCHIVED')
</select> </select>
<select id="findByDatasetIdAndFileName" resultType="com.datamate.datamanagement.domain.model.dataset.DatasetFile"> <select id="findByDatasetIdAndFileName" resultType="com.datamate.datamanagement.domain.model.dataset.DatasetFile">
SELECT <include refid="Base_Column_List"/> SELECT <include refid="Base_Column_List"/>
FROM t_dm_dataset_files FROM t_dm_dataset_files
WHERE dataset_id = #{datasetId} AND file_name = #{fileName}
WHERE dataset_id = #{datasetId}
AND file_name = #{fileName}
AND (status IS NULL OR status <> 'ARCHIVED')
ORDER BY version DESC, upload_time DESC
LIMIT 1 LIMIT 1
</select> </select>
@@ -91,6 +101,8 @@
UPDATE t_dm_dataset_files UPDATE t_dm_dataset_files
SET file_name = #{fileName}, SET file_name = #{fileName},
file_path = #{filePath}, file_path = #{filePath},
logical_path = #{logicalPath},
version = #{version},
file_type = #{fileType}, file_type = #{fileType},
file_size = #{fileSize}, file_size = #{fileSize},
upload_time = #{uploadTime}, upload_time = #{uploadTime},
@@ -126,6 +138,7 @@
<foreach collection="datasetIds" item="datasetId" open="(" separator="," close=")"> <foreach collection="datasetIds" item="datasetId" open="(" separator="," close=")">
#{datasetId} #{datasetId}
</foreach> </foreach>
AND (status IS NULL OR status <> 'ARCHIVED')
AND (metadata IS NULL OR JSON_EXTRACT(metadata, '$.derived_from_file_id') IS NULL) AND (metadata IS NULL OR JSON_EXTRACT(metadata, '$.derived_from_file_id') IS NULL)
GROUP BY dataset_id GROUP BY dataset_id
</select> </select>

View File

@@ -0,0 +1,147 @@
package com.datamate.datamanagement.application;
import com.datamate.common.domain.service.FileService;
import com.datamate.datamanagement.domain.model.dataset.Dataset;
import com.datamate.datamanagement.domain.model.dataset.DatasetFile;
import com.datamate.datamanagement.infrastructure.persistence.repository.DatasetFileRepository;
import com.datamate.datamanagement.infrastructure.persistence.repository.DatasetRepository;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.junit.jupiter.api.io.TempDir;
import org.mockito.ArgumentCaptor;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.List;
import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.ArgumentMatchers.anyString;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;
@ExtendWith(MockitoExtension.class)
class DatasetFileApplicationServiceVersioningTest {
@TempDir
Path tempDir;
@Mock
DatasetFileRepository datasetFileRepository;
@Mock
DatasetRepository datasetRepository;
@Mock
FileService fileService;
@Mock
PdfTextExtractAsyncService pdfTextExtractAsyncService;
@Mock
DatasetFilePreviewService datasetFilePreviewService;
@Test
void copyFilesToDatasetDirWithSourceRoot_shouldArchiveOldFileAndCreateNewVersionWhenDuplicateLogicalPath()
throws Exception {
String datasetId = "dataset-1";
Path datasetRoot = tempDir.resolve("dataset-root");
Files.createDirectories(datasetRoot);
Path sourceRoot = tempDir.resolve("source-root");
Files.createDirectories(sourceRoot);
Path existingPath = datasetRoot.resolve("a.txt");
Files.writeString(existingPath, "old-content", StandardCharsets.UTF_8);
Path incomingPath = sourceRoot.resolve("a.txt");
Files.writeString(incomingPath, "new-content", StandardCharsets.UTF_8);
Dataset dataset = new Dataset();
dataset.setId(datasetId);
dataset.setPath(datasetRoot.toString());
DatasetFile oldRecord = DatasetFile.builder()
.id("old-file-id")
.datasetId(datasetId)
.fileName("a.txt")
.filePath(existingPath.toString())
.logicalPath(null)
.version(null)
.status(null)
.fileSize(Files.size(existingPath))
.build();
when(datasetRepository.getById(datasetId)).thenReturn(dataset);
when(datasetFileRepository.findAllVisibleByDatasetId(datasetId)).thenReturn(List.of(oldRecord));
when(datasetFileRepository.findLatestByDatasetIdAndLogicalPath(anyString(), anyString())).thenReturn(null);
DatasetFileApplicationService service = new DatasetFileApplicationService(
datasetFileRepository,
datasetRepository,
fileService,
pdfTextExtractAsyncService,
datasetFilePreviewService
);
List<DatasetFile> copied = service.copyFilesToDatasetDirWithSourceRoot(
datasetId,
sourceRoot,
List.of(incomingPath.toString())
);
assertThat(copied).hasSize(1);
assertThat(Files.readString(existingPath, StandardCharsets.UTF_8)).isEqualTo("new-content");
String logicalPathHash = sha256Hex("a.txt");
Path archivedPath = datasetRoot
.resolve(".datamate")
.resolve("versions")
.resolve(logicalPathHash)
.resolve("v1")
.resolve("old-file-id__a.txt")
.toAbsolutePath()
.normalize();
assertThat(Files.exists(archivedPath)).isTrue();
assertThat(Files.readString(archivedPath, StandardCharsets.UTF_8)).isEqualTo("old-content");
ArgumentCaptor<DatasetFile> archivedCaptor = ArgumentCaptor.forClass(DatasetFile.class);
verify(datasetFileRepository).updateById(archivedCaptor.capture());
DatasetFile archivedRecord = archivedCaptor.getValue();
assertThat(archivedRecord.getId()).isEqualTo("old-file-id");
assertThat(archivedRecord.getStatus()).isEqualTo("ARCHIVED");
assertThat(archivedRecord.getLogicalPath()).isEqualTo("a.txt");
assertThat(archivedRecord.getVersion()).isEqualTo(1L);
assertThat(Paths.get(archivedRecord.getFilePath()).toAbsolutePath().normalize()).isEqualTo(archivedPath);
ArgumentCaptor<DatasetFile> createdCaptor = ArgumentCaptor.forClass(DatasetFile.class);
verify(datasetFileRepository).saveOrUpdate(createdCaptor.capture());
DatasetFile newRecord = createdCaptor.getValue();
assertThat(newRecord.getId()).isNotEqualTo("old-file-id");
assertThat(newRecord.getStatus()).isEqualTo("ACTIVE");
assertThat(newRecord.getLogicalPath()).isEqualTo("a.txt");
assertThat(newRecord.getVersion()).isEqualTo(2L);
assertThat(Paths.get(newRecord.getFilePath()).toAbsolutePath().normalize()).isEqualTo(existingPath.toAbsolutePath().normalize());
}
private static String sha256Hex(String value) {
try {
MessageDigest digest = MessageDigest.getInstance("SHA-256");
byte[] hashed = digest.digest((value == null ? "" : value).getBytes(StandardCharsets.UTF_8));
StringBuilder builder = new StringBuilder(hashed.length * 2);
for (byte b : hashed) {
builder.append(String.format("%02x", b));
}
return builder.toString();
} catch (Exception e) {
return Integer.toHexString((value == null ? "" : value).hashCode());
}
}
}
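The assertions above pin down the on-disk layout used for archived versions: <datasetRoot>/.datamate/versions/<sha256(logicalPath)>/v<version>/<fileId>__<fileName>. The production path-building code is not part of this diff, so the following is only a sketch of that layout:

import java.nio.file.Path;
import java.nio.file.Paths;

final class ArchivePathSketch {
    static Path archivedPath(Path datasetRoot, String logicalPathSha256,
                             long version, String fileId, String fileName) {
        return datasetRoot
                .resolve(".datamate")
                .resolve("versions")
                .resolve(logicalPathSha256)
                .resolve("v" + version)
                .resolve(fileId + "__" + fileName)
                .toAbsolutePath()
                .normalize();
    }

    public static void main(String[] args) {
        // "<sha256-of-logical-path>" stands in for the real hex digest
        System.out.println(archivedPath(Paths.get("/data/datasets/dataset-1"),
                "<sha256-of-logical-path>", 1L, "old-file-id", "a.txt"));
    }
}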

View File

@@ -74,6 +74,26 @@ public class FileService {
.build(); .build();
} }
/**
* 取消上传
*/
@Transactional
public void cancelUpload(String reqId) {
if (reqId == null || reqId.isBlank()) {
throw BusinessException.of(CommonErrorCode.PARAM_ERROR);
}
ChunkUploadPreRequest preRequest = chunkUploadRequestMapper.findById(reqId);
if (preRequest == null) {
return;
}
String uploadPath = preRequest.getUploadPath();
if (uploadPath != null && !uploadPath.isBlank()) {
File tempDir = new File(uploadPath, String.format(ChunksSaver.TEMP_DIR_NAME_FORMAT, preRequest.getId()));
ChunksSaver.deleteFolder(tempDir.getPath());
}
chunkUploadRequestMapper.deleteById(reqId);
}
private File uploadFile(ChunkUploadRequest fileUploadRequest, ChunkUploadPreRequest preRequest) { private File uploadFile(ChunkUploadRequest fileUploadRequest, ChunkUploadPreRequest preRequest) {
File savedFile = ChunksSaver.saveFile(fileUploadRequest, preRequest); File savedFile = ChunksSaver.saveFile(fileUploadRequest, preRequest);
preRequest.setTimeout(LocalDateTime.now().plusSeconds(DEFAULT_TIMEOUT)); preRequest.setTimeout(LocalDateTime.now().plusSeconds(DEFAULT_TIMEOUT));

View File

@@ -143,7 +143,20 @@ public class ArchiveAnalyzer {
private static Optional<FileUploadResult> extractEntity(ArchiveInputStream<?> archiveInputStream, ArchiveEntry archiveEntry, Path archivePath) private static Optional<FileUploadResult> extractEntity(ArchiveInputStream<?> archiveInputStream, ArchiveEntry archiveEntry, Path archivePath)
throws IOException { throws IOException {
byte[] buffer = new byte[DEFAULT_BUFFER_SIZE]; byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];
Path path = Paths.get(archivePath.getParent().toString(), archiveEntry.getName());
Path archiveRoot = archivePath.getParent().toAbsolutePath().normalize();
String entryName = archiveEntry.getName();
if (entryName == null || entryName.isBlank()) {
return Optional.empty();
}
entryName = entryName.replace("\\", "/");
while (entryName.startsWith("/")) {
entryName = entryName.substring(1);
}
Path path = archiveRoot.resolve(entryName).normalize();
if (!path.startsWith(archiveRoot)) {
log.warn("Skip unsafe archive entry path traversal: {}", archiveEntry.getName());
return Optional.empty();
}
File file = path.toFile(); File file = path.toFile();
long fileSize = 0L; long fileSize = 0L;
FileUtils.createParentDirectories(file); FileUtils.createParentDirectories(file);
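The change above is the standard zip-slip guard: normalize the resolved entry path and refuse anything that escapes the extraction root. A stand-alone sketch of the same check, independent of the ArchiveAnalyzer internals:

import java.nio.file.Path;
import java.nio.file.Paths;

final class ZipSlipGuardSketch {
    static boolean isSafe(Path extractionRoot, String rawEntryName) {
        if (rawEntryName == null || rawEntryName.isBlank()) {
            return false;
        }
        String entryName = rawEntryName.replace("\\", "/");
        while (entryName.startsWith("/")) {
            entryName = entryName.substring(1);
        }
        Path root = extractionRoot.toAbsolutePath().normalize();
        Path target = root.resolve(entryName).normalize();
        return target.startsWith(root); // reject anything outside the extraction root
    }

    public static void main(String[] args) {
        Path root = Paths.get("/tmp/extract");
        System.out.println(isSafe(root, "docs/readme.md"));   // true
        System.out.println(isSafe(root, "../../etc/passwd")); // false: path traversal rejected
    }
}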

View File

@@ -13,7 +13,10 @@ public class CommonUtils {
* @return 文件名(带后缀) * @return 文件名(带后缀)
*/ */
public static String trimFilePath(String filePath) { public static String trimFilePath(String filePath) {
int lastSlashIndex = filePath.lastIndexOf(File.separator);
if (filePath == null || filePath.isBlank()) {
return "";
}
int lastSlashIndex = Math.max(filePath.lastIndexOf('/'), filePath.lastIndexOf('\\'));
String filename = filePath; String filename = filePath;
if (lastSlashIndex != -1) { if (lastSlashIndex != -1) {
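With the change above, trimFilePath treats both '/' and '\' as separators instead of the platform-specific File.separator, and blank input now returns an empty string. Expected behavior, inferred from the new code (the import of CommonUtils is omitted because its package is not shown in this hunk):

public class TrimFilePathExample {
    public static void main(String[] args) {
        System.out.println(CommonUtils.trimFilePath("/data/upload/report.pdf")); // report.pdf
        System.out.println(CommonUtils.trimFilePath("C:\\data\\report.pdf"));    // report.pdf
        System.out.println(CommonUtils.trimFilePath("report.pdf"));              // report.pdf
        System.out.println(CommonUtils.trimFilePath(""));                        // empty string
    }
}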

View File

@@ -1,5 +1,5 @@
import { TaskItem } from "@/pages/DataManagement/dataset.model"; import { TaskItem } from "@/pages/DataManagement/dataset.model";
import { calculateSHA256, checkIsFilesExist } from "@/utils/file.util";
import { calculateSHA256, checkIsFilesExist, streamSplitAndUpload, StreamUploadResult } from "@/utils/file.util";
import { App } from "antd"; import { App } from "antd";
import { useRef, useState } from "react"; import { useRef, useState } from "react";
@@ -9,17 +9,18 @@ export function useFileSliceUpload(
uploadChunk, uploadChunk,
cancelUpload, cancelUpload,
}: { }: {
preUpload: (id: string, params: any) => Promise<{ data: number }>;
uploadChunk: (id: string, formData: FormData, config: any) => Promise<any>;
cancelUpload: ((reqId: number) => Promise<any>) | null;
preUpload: (id: string, params: Record<string, unknown>) => Promise<{ data: number }>;
uploadChunk: (id: string, formData: FormData, config: Record<string, unknown>) => Promise<unknown>;
cancelUpload: ((reqId: number) => Promise<unknown>) | null;
}, },
showTaskCenter = true // 上传时是否显示任务中心
showTaskCenter = true, // 上传时是否显示任务中心
enableStreamUpload = true // 是否启用流式分割上传
) { ) {
const { message } = App.useApp(); const { message } = App.useApp();
const [taskList, setTaskList] = useState<TaskItem[]>([]); const [taskList, setTaskList] = useState<TaskItem[]>([]);
const taskListRef = useRef<TaskItem[]>([]); // 用于固定任务顺序 const taskListRef = useRef<TaskItem[]>([]); // 用于固定任务顺序
const createTask = (detail: any = {}) => {
const createTask = (detail: Record<string, unknown> = {}) => {
const { dataset } = detail; const { dataset } = detail;
const title = `上传数据集: ${dataset.name} `; const title = `上传数据集: ${dataset.name} `;
const controller = new AbortController(); const controller = new AbortController();
@@ -37,6 +38,14 @@ export function useFileSliceUpload(
taskListRef.current = [task, ...taskListRef.current]; taskListRef.current = [task, ...taskListRef.current];
setTaskList(taskListRef.current); setTaskList(taskListRef.current);
// 立即显示任务中心,让用户感知上传已开始
if (showTaskCenter) {
window.dispatchEvent(
new CustomEvent("show:task-popover", { detail: { show: true } })
);
}
return task; return task;
}; };
@@ -60,7 +69,7 @@ export function useFileSliceUpload(
// 携带前缀信息,便于刷新后仍停留在当前目录 // 携带前缀信息,便于刷新后仍停留在当前目录
window.dispatchEvent( window.dispatchEvent(
new CustomEvent(task.updateEvent, { new CustomEvent(task.updateEvent, {
detail: { prefix: (task as any).prefix },
detail: { prefix: task.prefix },
}) })
); );
} }
@@ -71,7 +80,7 @@ export function useFileSliceUpload(
} }
}; };
async function buildFormData({ file, reqId, i, j }) { async function buildFormData({ file, reqId, i, j }: { file: { slices: Blob[]; name: string; size: number }; reqId: number; i: number; j: number }) {
const formData = new FormData(); const formData = new FormData();
const { slices, name, size } = file; const { slices, name, size } = file;
const checkSum = await calculateSHA256(slices[j]); const checkSum = await calculateSHA256(slices[j]);
@@ -86,12 +95,18 @@ export function useFileSliceUpload(
return formData; return formData;
} }
async function uploadSlice(task: TaskItem, fileInfo) { async function uploadSlice(task: TaskItem, fileInfo: { loaded: number; i: number; j: number; files: { slices: Blob[]; name: string; size: number }[]; totalSize: number }) {
if (!task) { if (!task) {
return; return;
} }
const { reqId, key } = task;
const { reqId, key, controller } = task;
const { loaded, i, j, files, totalSize } = fileInfo; const { loaded, i, j, files, totalSize } = fileInfo;
// 检查是否已取消
if (controller.signal.aborted) {
throw new Error("Upload cancelled");
}
const formData = await buildFormData({ const formData = await buildFormData({
file: files[i], file: files[i],
i, i,
@@ -101,6 +116,7 @@ export function useFileSliceUpload(
let newTask = { ...task }; let newTask = { ...task };
await uploadChunk(key, formData, { await uploadChunk(key, formData, {
signal: controller.signal,
onUploadProgress: (e) => { onUploadProgress: (e) => {
const loadedSize = loaded + e.loaded; const loadedSize = loaded + e.loaded;
const curPercent = Number((loadedSize / totalSize) * 100).toFixed(2); const curPercent = Number((loadedSize / totalSize) * 100).toFixed(2);
@@ -116,7 +132,7 @@ export function useFileSliceUpload(
}); });
} }
async function uploadFile({ task, files, totalSize }) { async function uploadFile({ task, files, totalSize }: { task: TaskItem; files: { slices: Blob[]; name: string; size: number; originFile: Blob }[]; totalSize: number }) {
console.log('[useSliceUpload] Calling preUpload with prefix:', task.prefix); console.log('[useSliceUpload] Calling preUpload with prefix:', task.prefix);
const { data: reqId } = await preUpload(task.key, { const { data: reqId } = await preUpload(task.key, {
totalFileNum: files.length, totalFileNum: files.length,
@@ -132,24 +148,29 @@ export function useFileSliceUpload(
reqId, reqId,
isCancel: false, isCancel: false,
cancelFn: () => { cancelFn: () => {
task.controller.abort(); // 使用 newTask 的 controller 确保一致性
newTask.controller.abort();
cancelUpload?.(reqId); cancelUpload?.(reqId);
if (task.updateEvent) window.dispatchEvent(new Event(task.updateEvent)); if (newTask.updateEvent) window.dispatchEvent(new Event(newTask.updateEvent));
}, },
}; };
updateTaskList(newTask); updateTaskList(newTask);
if (showTaskCenter) { // 注意:show:task-popover 事件已在 createTask 中触发,此处不再重复触发
window.dispatchEvent(
new CustomEvent("show:task-popover", { detail: { show: true } })
);
}
// // 更新数据状态 // // 更新数据状态
if (task.updateEvent) window.dispatchEvent(new Event(task.updateEvent)); if (task.updateEvent) window.dispatchEvent(new Event(task.updateEvent));
let loaded = 0; let loaded = 0;
for (let i = 0; i < files.length; i++) { for (let i = 0; i < files.length; i++) {
// 检查是否已取消
if (newTask.controller.signal.aborted) {
throw new Error("Upload cancelled");
}
const { slices } = files[i]; const { slices } = files[i];
for (let j = 0; j < slices.length; j++) { for (let j = 0; j < slices.length; j++) {
// 检查是否已取消
if (newTask.controller.signal.aborted) {
throw new Error("Upload cancelled");
}
await uploadSlice(newTask, { await uploadSlice(newTask, {
loaded, loaded,
i, i,
@@ -163,7 +184,7 @@ export function useFileSliceUpload(
removeTask(newTask); removeTask(newTask);
} }
const handleUpload = async ({ task, files }) => { const handleUpload = async ({ task, files }: { task: TaskItem; files: { slices: Blob[]; name: string; size: number; originFile: Blob }[] }) => {
const isErrorFile = await checkIsFilesExist(files); const isErrorFile = await checkIsFilesExist(files);
if (isErrorFile) { if (isErrorFile) {
message.error("文件被修改或删除,请重新选择文件上传"); message.error("文件被修改或删除,请重新选择文件上传");
@@ -189,10 +210,174 @@ export function useFileSliceUpload(
} }
}; };
/**
* 流式分割上传处理
* 用于大文件按行分割并立即上传的场景
*/
const handleStreamUpload = async ({ task, files }: { task: TaskItem; files: File[] }) => {
try {
console.log('[useSliceUpload] Starting stream upload for', files.length, 'files');
const totalSize = files.reduce((acc, file) => acc + file.size, 0);
// 存储所有文件的 reqId,用于取消上传
const reqIds: number[] = [];
const newTask: TaskItem = {
...task,
reqId: -1,
isCancel: false,
cancelFn: () => {
// 使用 newTask 的 controller 确保一致性
newTask.controller.abort();
// 取消所有文件的预上传请求
reqIds.forEach(id => cancelUpload?.(id));
if (newTask.updateEvent) window.dispatchEvent(new Event(newTask.updateEvent));
},
};
updateTaskList(newTask);
let totalUploadedLines = 0;
let totalProcessedBytes = 0;
const results: StreamUploadResult[] = [];
// 逐个处理文件,每个文件单独调用 preUpload
for (let i = 0; i < files.length; i++) {
// 检查是否已取消
if (newTask.controller.signal.aborted) {
throw new Error("Upload cancelled");
}
const file = files[i];
console.log(`[useSliceUpload] Processing file ${i + 1}/${files.length}: ${file.name}`);
const result = await streamSplitAndUpload(
file,
(formData, config) => uploadChunk(task.key, formData, {
...config,
signal: newTask.controller.signal,
}),
(currentBytes, totalBytes, uploadedLines) => {
// 检查是否已取消
if (newTask.controller.signal.aborted) {
return;
}
// 更新进度
const overallBytes = totalProcessedBytes + currentBytes;
const curPercent = Number((overallBytes / totalSize) * 100).toFixed(2);
const updatedTask: TaskItem = {
...newTask,
...taskListRef.current.find((item) => item.key === task.key),
size: overallBytes,
percent: curPercent >= 100 ? 99.99 : curPercent,
streamUploadInfo: {
currentFile: file.name,
fileIndex: i + 1,
totalFiles: files.length,
uploadedLines: totalUploadedLines + uploadedLines,
},
};
updateTaskList(updatedTask);
},
1024 * 1024, // 1MB chunk size
{
resolveReqId: async ({ totalFileNum, totalSize }) => {
const { data: reqId } = await preUpload(task.key, {
totalFileNum,
totalSize,
datasetId: task.key,
hasArchive: task.hasArchive,
prefix: task.prefix,
});
console.log(`[useSliceUpload] File ${file.name} preUpload response reqId:`, reqId);
reqIds.push(reqId);
return reqId;
},
hasArchive: newTask.hasArchive,
prefix: newTask.prefix,
signal: newTask.controller.signal,
maxConcurrency: 3,
}
);
results.push(result);
totalUploadedLines += result.uploadedCount;
totalProcessedBytes += file.size;
console.log(`[useSliceUpload] File ${file.name} processed, uploaded ${result.uploadedCount} lines`);
}
console.log('[useSliceUpload] Stream upload completed, total lines:', totalUploadedLines);
removeTask(newTask);
message.success(`成功上传 ${totalUploadedLines} 个文件(按行分割)`);
} catch (err) {
console.error('[useSliceUpload] Stream upload error:', err);
if (err.message === "Upload cancelled") {
message.info("上传已取消");
} else {
message.error("文件上传失败,请稍后重试");
}
removeTask({
...task,
isCancel: true,
...taskListRef.current.find((item) => item.key === task.key),
});
}
};
/**
* 注册流式上传事件监听
* 返回注销函数
*/
const registerStreamUploadListener = () => {
if (!enableStreamUpload) return () => {};
const streamUploadHandler = async (e: Event) => {
const customEvent = e as CustomEvent;
const { dataset, files, updateEvent, hasArchive, prefix } = customEvent.detail;
const controller = new AbortController();
const task: TaskItem = {
key: dataset.id,
title: `上传数据集: ${dataset.name} (按行分割)`,
percent: 0,
reqId: -1,
controller,
size: 0,
updateEvent,
hasArchive,
prefix,
};
taskListRef.current = [task, ...taskListRef.current];
setTaskList(taskListRef.current);
// 显示任务中心
if (showTaskCenter) {
window.dispatchEvent(
new CustomEvent("show:task-popover", { detail: { show: true } })
);
}
await handleStreamUpload({ task, files });
};
window.addEventListener("upload:dataset-stream", streamUploadHandler);
return () => {
window.removeEventListener("upload:dataset-stream", streamUploadHandler);
};
};
return { return {
taskList, taskList,
createTask, createTask,
removeTask, removeTask,
handleUpload, handleUpload,
handleStreamUpload,
registerStreamUploadListener,
}; };
} }

View File

@@ -1,6 +1,6 @@
import { useCallback, useEffect, useMemo, useRef, useState } from "react"; import { useCallback, useEffect, useMemo, useRef, useState } from "react";
import { App, Button, Card, List, Spin, Typography, Tag, Switch, Tree, Empty } from "antd";
import { LeftOutlined, ReloadOutlined, SaveOutlined, MenuFoldOutlined, MenuUnfoldOutlined, CheckOutlined } from "@ant-design/icons";
import { App, Button, Card, List, Spin, Typography, Tag, Empty } from "antd";
import { LeftOutlined, ReloadOutlined, SaveOutlined, MenuFoldOutlined, MenuUnfoldOutlined } from "@ant-design/icons";
import { useNavigate, useParams } from "react-router"; import { useNavigate, useParams } from "react-router";
import { import {
@@ -28,7 +28,6 @@ type EditorTaskListItem = {
hasAnnotation: boolean; hasAnnotation: boolean;
annotationUpdatedAt?: string | null; annotationUpdatedAt?: string | null;
annotationStatus?: AnnotationResultStatus | null; annotationStatus?: AnnotationResultStatus | null;
segmentStats?: SegmentStats;
}; };
type LsfMessage = { type LsfMessage = {
@@ -36,21 +35,6 @@ type LsfMessage = {
payload?: unknown; payload?: unknown;
}; };
type SegmentInfo = {
idx: number;
text: string;
start: number;
end: number;
hasAnnotation: boolean;
lineIndex: number;
chunkIndex: number;
};
type SegmentStats = {
done: number;
total: number;
};
type ApiResponse<T> = { type ApiResponse<T> = {
code?: number; code?: number;
message?: string; message?: string;
@@ -66,10 +50,11 @@ type EditorTaskPayload = {
type EditorTaskResponse = { type EditorTaskResponse = {
task?: EditorTaskPayload; task?: EditorTaskPayload;
segmented?: boolean; segmented?: boolean;
segments?: SegmentInfo[];
totalSegments?: number;
currentSegmentIndex?: number; currentSegmentIndex?: number;
}; };
type EditorTaskListResponse = { type EditorTaskListResponse = {
content?: EditorTaskListItem[]; content?: EditorTaskListItem[];
totalElements?: number; totalElements?: number;
@@ -91,8 +76,6 @@ type ExportPayload = {
requestId?: string | null; requestId?: string | null;
}; };
type SwitchDecision = "save" | "discard" | "cancel";
const LSF_IFRAME_SRC = "/lsf/lsf.html"; const LSF_IFRAME_SRC = "/lsf/lsf.html";
const TASK_PAGE_START = 0; const TASK_PAGE_START = 0;
const TASK_PAGE_SIZE = 200; const TASK_PAGE_SIZE = 200;
@@ -154,16 +137,6 @@ const isAnnotationResultEmpty = (annotation?: Record<string, unknown>) => {
}; };
const resolveTaskStatusMeta = (item: EditorTaskListItem) => { const resolveTaskStatusMeta = (item: EditorTaskListItem) => {
const segmentSummary = resolveSegmentSummary(item);
if (segmentSummary) {
if (segmentSummary.done >= segmentSummary.total) {
return { text: "已标注", type: "success" as const };
}
if (segmentSummary.done > 0) {
return { text: "标注中", type: "warning" as const };
}
return { text: "未标注", type: "secondary" as const };
}
if (!item.hasAnnotation) { if (!item.hasAnnotation) {
return { text: "未标注", type: "secondary" as const }; return { text: "未标注", type: "secondary" as const };
} }
@@ -216,25 +189,6 @@ const buildAnnotationSnapshot = (annotation?: Record<string, unknown>) => {
const buildSnapshotKey = (fileId: string, segmentIndex?: number) => const buildSnapshotKey = (fileId: string, segmentIndex?: number) =>
`${fileId}::${segmentIndex ?? "full"}`; `${fileId}::${segmentIndex ?? "full"}`;
const buildSegmentStats = (segmentList?: SegmentInfo[] | null): SegmentStats | null => {
if (!Array.isArray(segmentList) || segmentList.length === 0) return null;
const total = segmentList.length;
const done = segmentList.reduce((count, seg) => count + (seg.hasAnnotation ? 1 : 0), 0);
return { done, total };
};
const normalizeSegmentStats = (stats?: SegmentStats | null): SegmentStats | null => {
if (!stats) return null;
const total = Number(stats.total);
const done = Number(stats.done);
if (!Number.isFinite(total) || total <= 0) return null;
const safeDone = Math.min(Math.max(done, 0), total);
return { done: safeDone, total };
};
const resolveSegmentSummary = (item: EditorTaskListItem) =>
normalizeSegmentStats(item.segmentStats);
const mergeTaskItems = (base: EditorTaskListItem[], next: EditorTaskListItem[]) => { const mergeTaskItems = (base: EditorTaskListItem[], next: EditorTaskListItem[]) => {
if (next.length === 0) return base; if (next.length === 0) return base;
const seen = new Set(base.map((item) => item.fileId)); const seen = new Set(base.map((item) => item.fileId));
@@ -282,18 +236,13 @@ export default function LabelStudioTextEditor() {
resolve: (payload?: ExportPayload) => void; resolve: (payload?: ExportPayload) => void;
timer?: number; timer?: number;
} | null>(null); } | null>(null);
const exportCheckSeqRef = useRef(0);
const savedSnapshotsRef = useRef<Record<string, string>>({}); const savedSnapshotsRef = useRef<Record<string, string>>({});
const pendingAutoAdvanceRef = useRef(false); const pendingAutoAdvanceRef = useRef(false);
const segmentStatsCacheRef = useRef<Record<string, SegmentStats>>({});
const segmentStatsSeqRef = useRef(0);
const segmentStatsLoadingRef = useRef<Set<string>>(new Set());
const [loadingProject, setLoadingProject] = useState(true); const [loadingProject, setLoadingProject] = useState(true);
const [loadingTasks, setLoadingTasks] = useState(false); const [loadingTasks, setLoadingTasks] = useState(false);
const [loadingTaskDetail, setLoadingTaskDetail] = useState(false); const [loadingTaskDetail, setLoadingTaskDetail] = useState(false);
const [saving, setSaving] = useState(false); const [saving, setSaving] = useState(false);
const [segmentSwitching, setSegmentSwitching] = useState(false);
const [iframeReady, setIframeReady] = useState(false); const [iframeReady, setIframeReady] = useState(false);
const [lsReady, setLsReady] = useState(false); const [lsReady, setLsReady] = useState(false);
@@ -306,16 +255,19 @@ export default function LabelStudioTextEditor() {
const [prefetching, setPrefetching] = useState(false); const [prefetching, setPrefetching] = useState(false);
const [selectedFileId, setSelectedFileId] = useState<string>(""); const [selectedFileId, setSelectedFileId] = useState<string>("");
const [sidebarCollapsed, setSidebarCollapsed] = useState(false); const [sidebarCollapsed, setSidebarCollapsed] = useState(false);
const [autoSaveOnSwitch, setAutoSaveOnSwitch] = useState(false);
// 分段相关状态 // 分段相关状态
const [segmented, setSegmented] = useState(false); const [segmented, setSegmented] = useState(false);
const [segments, setSegments] = useState<SegmentInfo[]>([]);
const [currentSegmentIndex, setCurrentSegmentIndex] = useState(0); const [currentSegmentIndex, setCurrentSegmentIndex] = useState(0);
const [segmentTotal, setSegmentTotal] = useState(0);
const isTextProject = useMemo( const isTextProject = useMemo(
() => (project?.datasetType || "").toUpperCase() === "TEXT", () => (project?.datasetType || "").toUpperCase() === "TEXT",
[project?.datasetType], [project?.datasetType],
); );
const segmentIndices = useMemo(() => {
if (segmentTotal <= 0) return [] as number[];
return Array.from({ length: segmentTotal }, (_, index) => index);
}, [segmentTotal]);
const focusIframe = useCallback(() => { const focusIframe = useCallback(() => {
const iframe = iframeRef.current; const iframe = iframeRef.current;
@@ -330,70 +282,6 @@ export default function LabelStudioTextEditor() {
win.postMessage({ type, payload }, origin); win.postMessage({ type, payload }, origin);
}, [origin]); }, [origin]);
const applySegmentStats = useCallback((fileId: string, stats: SegmentStats | null) => {
if (!fileId) return;
const normalized = normalizeSegmentStats(stats);
setTasks((prev) =>
prev.map((item) =>
item.fileId === fileId
? { ...item, segmentStats: normalized || undefined }
: item
)
);
}, []);
const updateSegmentStatsCache = useCallback((fileId: string, stats: SegmentStats | null) => {
if (!fileId) return;
const normalized = normalizeSegmentStats(stats);
if (normalized) {
segmentStatsCacheRef.current[fileId] = normalized;
} else {
delete segmentStatsCacheRef.current[fileId];
}
applySegmentStats(fileId, normalized);
}, [applySegmentStats]);
const fetchSegmentStatsForFile = useCallback(async (fileId: string, seq: number) => {
if (!projectId || !fileId) return;
if (segmentStatsCacheRef.current[fileId] || segmentStatsLoadingRef.current.has(fileId)) return;
segmentStatsLoadingRef.current.add(fileId);
try {
const resp = (await getEditorTaskUsingGet(projectId, fileId, {
segmentIndex: 0,
})) as ApiResponse<EditorTaskResponse>;
if (segmentStatsSeqRef.current !== seq) return;
const data = resp?.data;
if (!data?.segmented) return;
const stats = buildSegmentStats(data.segments);
if (!stats) return;
segmentStatsCacheRef.current[fileId] = stats;
applySegmentStats(fileId, stats);
} catch (e) {
console.error(e);
} finally {
segmentStatsLoadingRef.current.delete(fileId);
}
}, [applySegmentStats, projectId]);
const prefetchSegmentStats = useCallback((items: EditorTaskListItem[]) => {
if (!projectId) return;
const fileIds = items
.map((item) => item.fileId)
.filter((fileId) => fileId && !segmentStatsCacheRef.current[fileId]);
if (fileIds.length === 0) return;
const seq = segmentStatsSeqRef.current;
let cursor = 0;
const workerCount = Math.min(3, fileIds.length);
const runWorker = async () => {
while (cursor < fileIds.length && segmentStatsSeqRef.current === seq) {
const fileId = fileIds[cursor];
cursor += 1;
await fetchSegmentStatsForFile(fileId, seq);
}
};
void Promise.all(Array.from({ length: workerCount }, () => runWorker()));
}, [fetchSegmentStatsForFile, projectId]);
const confirmEmptyAnnotationStatus = useCallback(() => { const confirmEmptyAnnotationStatus = useCallback(() => {
return new Promise<AnnotationResultStatus | null>((resolve) => { return new Promise<AnnotationResultStatus | null>((resolve) => {
let resolved = false; let resolved = false;
@@ -446,8 +334,6 @@ export default function LabelStudioTextEditor() {
const updateTaskSelection = useCallback((items: EditorTaskListItem[]) => { const updateTaskSelection = useCallback((items: EditorTaskListItem[]) => {
const isCompleted = (item: EditorTaskListItem) => { const isCompleted = (item: EditorTaskListItem) => {
const summary = resolveSegmentSummary(item);
if (summary) return summary.done >= summary.total;
return item.hasAnnotation; return item.hasAnnotation;
}; };
const defaultFileId = const defaultFileId =
@@ -508,9 +394,6 @@ export default function LabelStudioTextEditor() {
if (mode === "reset") { if (mode === "reset") {
prefetchSeqRef.current += 1; prefetchSeqRef.current += 1;
setPrefetching(false); setPrefetching(false);
segmentStatsSeqRef.current += 1;
segmentStatsCacheRef.current = {};
segmentStatsLoadingRef.current = new Set();
} }
if (mode === "append") { if (mode === "append") {
setLoadingMore(true); setLoadingMore(true);
@@ -591,20 +474,19 @@ export default function LabelStudioTextEditor() {
if (seq !== initSeqRef.current) return; if (seq !== initSeqRef.current) return;
// 更新分段状态 // 更新分段状态
const segmentIndex = data?.segmented const isSegmented = !!data?.segmented;
const segmentIndex = isSegmented
? resolveSegmentIndex(data.currentSegmentIndex) ?? 0 ? resolveSegmentIndex(data.currentSegmentIndex) ?? 0
: undefined; : undefined;
if (data?.segmented) { if (isSegmented) {
const stats = buildSegmentStats(data.segments);
setSegmented(true); setSegmented(true);
setSegments(data.segments || []);
setCurrentSegmentIndex(segmentIndex ?? 0); setCurrentSegmentIndex(segmentIndex ?? 0);
updateSegmentStatsCache(fileId, stats); const totalSegments = Number(data?.totalSegments ?? 0);
setSegmentTotal(Number.isFinite(totalSegments) && totalSegments > 0 ? totalSegments : 0);
} else { } else {
setSegmented(false); setSegmented(false);
setSegments([]);
setCurrentSegmentIndex(0); setCurrentSegmentIndex(0);
updateSegmentStatsCache(fileId, null); setSegmentTotal(0);
} }
const taskData = { const taskData = {
@@ -664,19 +546,14 @@ export default function LabelStudioTextEditor() {
} finally { } finally {
if (seq === initSeqRef.current) setLoadingTaskDetail(false); if (seq === initSeqRef.current) setLoadingTaskDetail(false);
} }
}, [iframeReady, message, postToIframe, project, projectId, updateSegmentStatsCache]);
}, [iframeReady, message, postToIframe, project, projectId]);
const advanceAfterSave = useCallback(async (fileId: string, segmentIndex?: number) => { const advanceAfterSave = useCallback(async (fileId: string, segmentIndex?: number) => {
if (!fileId) return; if (!fileId) return;
if (segmented && segments.length > 0) { if (segmented && segmentTotal > 0) {
const sortedSegmentIndices = segments const baseIndex = Math.max(segmentIndex ?? currentSegmentIndex, 0);
.map((seg) => seg.idx) const nextSegmentIndex = baseIndex + 1;
.sort((a, b) => a - b); if (nextSegmentIndex < segmentTotal) {
const baseIndex = segmentIndex ?? currentSegmentIndex;
const currentPos = sortedSegmentIndices.indexOf(baseIndex);
const nextSegmentIndex =
currentPos >= 0 ? sortedSegmentIndices[currentPos + 1] : sortedSegmentIndices[0];
if (nextSegmentIndex !== undefined) {
await initEditorForFile(fileId, nextSegmentIndex); await initEditorForFile(fileId, nextSegmentIndex);
return; return;
} }
@@ -698,7 +575,7 @@ export default function LabelStudioTextEditor() {
initEditorForFile, initEditorForFile,
message, message,
segmented, segmented,
segments, segmentTotal,
tasks, tasks,
]); ]);
@@ -772,16 +649,6 @@ export default function LabelStudioTextEditor() {
const snapshot = buildAnnotationSnapshot(isRecord(annotation) ? annotation : undefined); const snapshot = buildAnnotationSnapshot(isRecord(annotation) ? annotation : undefined);
savedSnapshotsRef.current[snapshotKey] = snapshot; savedSnapshotsRef.current[snapshotKey] = snapshot;
// 分段模式下更新当前段落的标注状态
if (segmented && segmentIndex !== undefined) {
const nextSegments = segments.map((seg) =>
seg.idx === segmentIndex
? { ...seg, hasAnnotation: true }
: seg
);
setSegments(nextSegments);
updateSegmentStatsCache(String(fileId), buildSegmentStats(nextSegments));
}
if (options?.autoAdvance) { if (options?.autoAdvance) {
await advanceAfterSave(String(fileId), segmentIndex); await advanceAfterSave(String(fileId), segmentIndex);
} }
@@ -800,69 +667,10 @@ export default function LabelStudioTextEditor() {
message, message,
projectId, projectId,
segmented, segmented,
segments,
selectedFileId, selectedFileId,
tasks, tasks,
updateSegmentStatsCache,
]); ]);
const requestExportForCheck = useCallback(() => {
if (!iframeReady || !lsReady) return Promise.resolve(undefined);
if (exportCheckRef.current) {
if (exportCheckRef.current.timer) {
window.clearTimeout(exportCheckRef.current.timer);
}
exportCheckRef.current.resolve(undefined);
exportCheckRef.current = null;
}
const requestId = `check_${Date.now()}_${++exportCheckSeqRef.current}`;
return new Promise<ExportPayload | undefined>((resolve) => {
const timer = window.setTimeout(() => {
if (exportCheckRef.current?.requestId === requestId) {
exportCheckRef.current = null;
}
resolve(undefined);
}, 3000);
exportCheckRef.current = {
requestId,
resolve,
timer,
};
postToIframe("LS_EXPORT_CHECK", { requestId });
});
}, [iframeReady, lsReady, postToIframe]);
const confirmSaveBeforeSwitch = useCallback(() => {
return new Promise<SwitchDecision>((resolve) => {
let resolved = false;
let modalInstance: { destroy: () => void } | null = null;
const settle = (decision: SwitchDecision) => {
if (resolved) return;
resolved = true;
resolve(decision);
};
const handleDiscard = () => {
if (modalInstance) modalInstance.destroy();
settle("discard");
};
modalInstance = modal.confirm({
title: "当前段落有未保存标注",
content: (
<div className="flex flex-col gap-2">
<Typography.Text></Typography.Text>
<Button type="link" danger style={{ padding: 0, height: "auto" }} onClick={handleDiscard}>
</Button>
</div>
),
okText: "保存并切换",
cancelText: "取消",
onOk: () => settle("save"),
onCancel: () => settle("cancel"),
});
});
}, [modal]);
const requestExport = useCallback((autoAdvance: boolean) => { const requestExport = useCallback((autoAdvance: boolean) => {
if (!selectedFileId) { if (!selectedFileId) {
message.warning("请先选择文件"); message.warning("请先选择文件");
@@ -875,7 +683,7 @@ export default function LabelStudioTextEditor() {
useEffect(() => { useEffect(() => {
const handleSaveShortcut = (event: KeyboardEvent) => { const handleSaveShortcut = (event: KeyboardEvent) => {
if (!isSaveShortcut(event) || event.repeat) return; if (!isSaveShortcut(event) || event.repeat) return;
if (saving || loadingTaskDetail || segmentSwitching) return;
if (saving || loadingTaskDetail) return;
if (!iframeReady || !lsReady) return; if (!iframeReady || !lsReady) return;
event.preventDefault(); event.preventDefault();
event.stopPropagation(); event.stopPropagation();
@@ -883,83 +691,7 @@ export default function LabelStudioTextEditor() {
}; };
window.addEventListener("keydown", handleSaveShortcut); window.addEventListener("keydown", handleSaveShortcut);
return () => window.removeEventListener("keydown", handleSaveShortcut); return () => window.removeEventListener("keydown", handleSaveShortcut);
}, [iframeReady, loadingTaskDetail, lsReady, requestExport, saving, segmentSwitching]);
}, [iframeReady, loadingTaskDetail, lsReady, requestExport, saving]);
// 段落切换处理
const handleSegmentChange = useCallback(async (newIndex: number) => {
if (newIndex === currentSegmentIndex) return;
if (segmentSwitching || saving || loadingTaskDetail) return;
if (!iframeReady || !lsReady) {
message.warning("编辑器未就绪,无法切换段落");
return;
}
setSegmentSwitching(true);
try {
const payload = await requestExportForCheck();
if (!payload) {
message.warning("无法读取当前标注,已取消切换");
return;
}
const payloadTaskId = payload.taskId;
if (expectedTaskIdRef.current && payloadTaskId) {
if (Number(payloadTaskId) !== expectedTaskIdRef.current) {
message.warning("已忽略过期的标注数据");
return;
}
}
const payloadFileId = payload.fileId || selectedFileId;
const payloadSegmentIndex = resolveSegmentIndex(payload.segmentIndex);
const resolvedSegmentIndex =
payloadSegmentIndex !== undefined
? payloadSegmentIndex
: segmented
? currentSegmentIndex
: undefined;
const annotation = isRecord(payload.annotation) ? payload.annotation : undefined;
const snapshotKey = payloadFileId
? buildSnapshotKey(String(payloadFileId), resolvedSegmentIndex)
: undefined;
const latestSnapshot = buildAnnotationSnapshot(annotation);
const lastSnapshot = snapshotKey ? savedSnapshotsRef.current[snapshotKey] : undefined;
const hasUnsavedChange = snapshotKey !== undefined && lastSnapshot !== undefined && latestSnapshot !== lastSnapshot;
if (hasUnsavedChange) {
if (autoSaveOnSwitch) {
const saved = await saveFromExport(payload);
if (!saved) return;
} else {
const decision = await confirmSaveBeforeSwitch();
if (decision === "cancel") return;
if (decision === "save") {
const saved = await saveFromExport(payload);
if (!saved) return;
}
}
}
await initEditorForFile(selectedFileId, newIndex);
} finally {
setSegmentSwitching(false);
}
}, [
autoSaveOnSwitch,
confirmSaveBeforeSwitch,
currentSegmentIndex,
iframeReady,
initEditorForFile,
loadingTaskDetail,
lsReady,
message,
requestExportForCheck,
saveFromExport,
segmented,
selectedFileId,
segmentSwitching,
saving,
]);
useEffect(() => { useEffect(() => {
setIframeReady(false); setIframeReady(false);
@@ -977,12 +709,9 @@ export default function LabelStudioTextEditor() {
expectedTaskIdRef.current = null; expectedTaskIdRef.current = null;
// 重置分段状态 // 重置分段状态
setSegmented(false); setSegmented(false);
setSegments([]);
setCurrentSegmentIndex(0); setCurrentSegmentIndex(0);
setSegmentTotal(0);
savedSnapshotsRef.current = {}; savedSnapshotsRef.current = {};
segmentStatsSeqRef.current += 1;
segmentStatsCacheRef.current = {};
segmentStatsLoadingRef.current = new Set();
if (exportCheckRef.current?.timer) { if (exportCheckRef.current?.timer) {
window.clearTimeout(exportCheckRef.current.timer); window.clearTimeout(exportCheckRef.current.timer);
} }
@@ -996,12 +725,6 @@ export default function LabelStudioTextEditor() {
loadTasks({ mode: "reset" }); loadTasks({ mode: "reset" });
}, [project?.supported, loadTasks]); }, [project?.supported, loadTasks]);
useEffect(() => {
if (!segmented) return;
if (tasks.length === 0) return;
prefetchSegmentStats(tasks);
}, [prefetchSegmentStats, segmented, tasks]);
useEffect(() => { useEffect(() => {
if (!selectedFileId) return; if (!selectedFileId) return;
initEditorForFile(selectedFileId); initEditorForFile(selectedFileId);
@@ -1026,60 +749,6 @@ export default function LabelStudioTextEditor() {
return () => window.removeEventListener("focus", handleWindowFocus); return () => window.removeEventListener("focus", handleWindowFocus);
}, [focusIframe, lsReady]); }, [focusIframe, lsReady]);
const segmentTreeData = useMemo(() => {
if (!segmented || segments.length === 0) return [];
const lineMap = new Map<number, SegmentInfo[]>();
segments.forEach((seg) => {
const list = lineMap.get(seg.lineIndex) || [];
list.push(seg);
lineMap.set(seg.lineIndex, list);
});
return Array.from(lineMap.entries())
.sort((a, b) => a[0] - b[0])
.map(([lineIndex, lineSegments]) => ({
key: `line-${lineIndex}`,
title: `${lineIndex + 1}`,
selectable: false,
children: lineSegments
.sort((a, b) => a.chunkIndex - b.chunkIndex)
.map((seg) => ({
key: `seg-${seg.idx}`,
title: (
<span className="flex items-center gap-1">
<span>{`${seg.chunkIndex + 1}`}</span>
{seg.hasAnnotation && (
<CheckOutlined style={{ fontSize: 10, color: "#52c41a" }} />
)}
</span>
),
})),
}));
}, [segmented, segments]);
const segmentLineKeys = useMemo(
() => segmentTreeData.map((item) => String(item.key)),
[segmentTreeData]
);
const inProgressSegmentedCount = useMemo(() => {
if (tasks.length === 0) return 0;
return tasks.reduce((count, item) => {
const summary = resolveSegmentSummary(item);
if (!summary) return count;
return summary.done < summary.total ? count + 1 : count;
}, 0);
}, [tasks]);
const handleSegmentSelect = useCallback((keys: Array<string | number>) => {
const [first] = keys;
if (first === undefined || first === null) return;
const key = String(first);
if (!key.startsWith("seg-")) return;
const nextIndex = Number(key.replace("seg-", ""));
if (!Number.isFinite(nextIndex)) return;
handleSegmentChange(nextIndex);
}, [handleSegmentChange]);
useEffect(() => { useEffect(() => {
const handler = (event: MessageEvent<LsfMessage>) => { const handler = (event: MessageEvent<LsfMessage>) => {
if (event.origin !== origin) return; if (event.origin !== origin) return;
@@ -1148,7 +817,7 @@ export default function LabelStudioTextEditor() {
const canLoadMore = taskTotalPages > 0 && taskPage + 1 < taskTotalPages; const canLoadMore = taskTotalPages > 0 && taskPage + 1 < taskTotalPages;
const saveDisabled = const saveDisabled =
!iframeReady || !selectedFileId || saving || segmentSwitching || loadingTaskDetail;
!iframeReady || !selectedFileId || saving || loadingTaskDetail;
const loadMoreNode = canLoadMore ? ( const loadMoreNode = canLoadMore ? (
<div className="p-2 text-center"> <div className="p-2 text-center">
<Button <Button
@@ -1265,11 +934,6 @@ export default function LabelStudioTextEditor() {
> >
<div className="px-3 py-2 border-b border-gray-200 bg-white font-medium text-sm flex items-center justify-between gap-2"> <div className="px-3 py-2 border-b border-gray-200 bg-white font-medium text-sm flex items-center justify-between gap-2">
<span></span> <span></span>
{segmented && (
<Tag color="orange" style={{ margin: 0 }}>
{inProgressSegmentedCount}
</Tag>
)}
</div> </div>
<div className="flex-1 min-h-0 overflow-auto"> <div className="flex-1 min-h-0 overflow-auto">
<List <List
@@ -1278,7 +942,6 @@ export default function LabelStudioTextEditor() {
dataSource={tasks} dataSource={tasks}
loadMore={loadMoreNode} loadMore={loadMoreNode}
renderItem={(item) => { renderItem={(item) => {
const segmentSummary = resolveSegmentSummary(item);
const statusMeta = resolveTaskStatusMeta(item); const statusMeta = resolveTaskStatusMeta(item);
return ( return (
<List.Item <List.Item
@@ -1300,11 +963,6 @@ export default function LabelStudioTextEditor() {
<Typography.Text type={statusMeta.type} style={{ fontSize: 11 }}> <Typography.Text type={statusMeta.type} style={{ fontSize: 11 }}>
{statusMeta.text} {statusMeta.text}
</Typography.Text> </Typography.Text>
{segmentSummary && (
<Typography.Text type="secondary" style={{ fontSize: 10 }}>
{segmentSummary.done}/{segmentSummary.total}
</Typography.Text>
)}
</div> </div>
{item.annotationUpdatedAt && ( {item.annotationUpdatedAt && (
<Typography.Text type="secondary" style={{ fontSize: 10 }}> <Typography.Text type="secondary" style={{ fontSize: 10 }}>
@@ -1323,21 +981,28 @@ export default function LabelStudioTextEditor() {
<div className="px-3 py-2 border-b border-gray-200 bg-gray-50 font-medium text-sm flex items-center justify-between"> <div className="px-3 py-2 border-b border-gray-200 bg-gray-50 font-medium text-sm flex items-center justify-between">
<span>/</span> <span>/</span>
<Tag color="blue" style={{ margin: 0 }}> <Tag color="blue" style={{ margin: 0 }}>
{currentSegmentIndex + 1} / {segments.length}
{segmentTotal > 0 ? currentSegmentIndex + 1 : 0} / {segmentTotal}
</Tag> </Tag>
</div> </div>
<div className="flex-1 min-h-0 overflow-auto px-2 py-2"> <div className="flex-1 min-h-0 overflow-auto px-2 py-2">
{segments.length > 0 ? ( {segmentTotal > 0 ? (
<Tree <div className="grid grid-cols-[repeat(auto-fill,minmax(44px,1fr))] gap-1">
showLine {segmentIndices.map((segmentIndex) => {
blockNode const isCurrent = segmentIndex === currentSegmentIndex;
selectedKeys={ return (
segmented ? [`seg-${currentSegmentIndex}`] : [] <div
key={segmentIndex}
className={
isCurrent
? "h-7 leading-7 rounded bg-blue-500 text-white text-center text-xs font-medium"
: "h-7 leading-7 rounded bg-gray-100 text-gray-700 text-center text-xs"
} }
expandedKeys={segmentLineKeys} >
onSelect={handleSegmentSelect} {segmentIndex + 1}
treeData={segmentTreeData} </div>
/> );
})}
</div>
) : ( ) : (
<div className="py-6"> <div className="py-6">
<Empty <Empty
@@ -1347,17 +1012,6 @@ export default function LabelStudioTextEditor() {
</div> </div>
)} )}
</div> </div>
<div className="px-3 py-2 border-t border-gray-200 flex items-center justify-between">
<Typography.Text style={{ fontSize: 12 }}>
</Typography.Text>
<Switch
size="small"
checked={autoSaveOnSwitch}
onChange={(checked) => setAutoSaveOnSwitch(checked)}
disabled={segmentSwitching || saving || loadingTaskDetail || !lsReady}
/>
</div>
</div> </div>
)} )}
</div> </div>

View File

@@ -3,16 +3,19 @@ import { get, post, put, del, download } from "@/utils/request";
// 导出格式类型 // 导出格式类型
export type ExportFormat = "json" | "jsonl" | "csv" | "coco" | "yolo"; export type ExportFormat = "json" | "jsonl" | "csv" | "coco" | "yolo";
type RequestParams = Record<string, unknown>;
type RequestPayload = Record<string, unknown>;
// 标注任务管理相关接口 // 标注任务管理相关接口
export function queryAnnotationTasksUsingGet(params?: any) { export function queryAnnotationTasksUsingGet(params?: RequestParams) {
return get("/api/annotation/project", params); return get("/api/annotation/project", params);
} }
export function createAnnotationTaskUsingPost(data: any) { export function createAnnotationTaskUsingPost(data: RequestPayload) {
return post("/api/annotation/project", data); return post("/api/annotation/project", data);
} }
export function syncAnnotationTaskUsingPost(data: any) { export function syncAnnotationTaskUsingPost(data: RequestPayload) {
return post(`/api/annotation/task/sync`, data); return post(`/api/annotation/task/sync`, data);
} }
@@ -25,7 +28,7 @@ export function getAnnotationTaskByIdUsingGet(taskId: string) {
return get(`/api/annotation/project/${taskId}`); return get(`/api/annotation/project/${taskId}`);
} }
export function updateAnnotationTaskByIdUsingPut(taskId: string, data: any) { export function updateAnnotationTaskByIdUsingPut(taskId: string, data: RequestPayload) {
return put(`/api/annotation/project/${taskId}`, data); return put(`/api/annotation/project/${taskId}`, data);
} }
@@ -35,17 +38,17 @@ export function getTagConfigUsingGet() {
} }
// 标注模板管理 // 标注模板管理
export function queryAnnotationTemplatesUsingGet(params?: any) { export function queryAnnotationTemplatesUsingGet(params?: RequestParams) {
return get("/api/annotation/template", params); return get("/api/annotation/template", params);
} }
export function createAnnotationTemplateUsingPost(data: any) { export function createAnnotationTemplateUsingPost(data: RequestPayload) {
return post("/api/annotation/template", data); return post("/api/annotation/template", data);
} }
export function updateAnnotationTemplateByIdUsingPut( export function updateAnnotationTemplateByIdUsingPut(
templateId: string | number, templateId: string | number,
data: any data: RequestPayload
) { ) {
return put(`/api/annotation/template/${templateId}`, data); return put(`/api/annotation/template/${templateId}`, data);
} }
@@ -65,7 +68,7 @@ export function getEditorProjectInfoUsingGet(projectId: string) {
return get(`/api/annotation/editor/projects/${projectId}`); return get(`/api/annotation/editor/projects/${projectId}`);
} }
export function listEditorTasksUsingGet(projectId: string, params?: any) { export function listEditorTasksUsingGet(projectId: string, params?: RequestParams) {
return get(`/api/annotation/editor/projects/${projectId}/tasks`, params); return get(`/api/annotation/editor/projects/${projectId}/tasks`, params);
} }
@@ -77,11 +80,19 @@ export function getEditorTaskUsingGet(
return get(`/api/annotation/editor/projects/${projectId}/tasks/${fileId}`, params); return get(`/api/annotation/editor/projects/${projectId}/tasks/${fileId}`, params);
} }
export function getEditorTaskSegmentUsingGet(
projectId: string,
fileId: string,
params: { segmentIndex: number }
) {
return get(`/api/annotation/editor/projects/${projectId}/tasks/${fileId}/segments`, params);
}
export function upsertEditorAnnotationUsingPut( export function upsertEditorAnnotationUsingPut(
projectId: string, projectId: string,
fileId: string, fileId: string,
data: { data: {
annotation: any; annotation: Record<string, unknown>;
expectedUpdatedAt?: string; expectedUpdatedAt?: string;
segmentIndex?: number; segmentIndex?: number;
} }
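A usage sketch for the new single-segment endpoint. The local interfaces simply mirror the backend `EditorTaskSegmentResponse` / `SegmentDetail` schema; the import path and the `StandardResponse` unwrapping are assumptions.

import { getEditorTaskSegmentUsingGet } from "@/pages/Annotation/annotation.api"; // path assumed

interface SegmentDetail {
  idx: number;
  text: string;
  hasAnnotation: boolean;
  lineIndex: number;
  chunkIndex: number;
}

interface EditorTaskSegmentResponse {
  segmented: boolean;
  segment: SegmentDetail | null;
  totalSegments: number;
  currentSegmentIndex: number;
}

// Fetch one segment's text; returns "" when segmentation is disabled.
export async function loadSegmentText(
  projectId: string,
  fileId: string,
  segmentIndex: number
): Promise<string> {
  const res = (await getEditorTaskSegmentUsingGet(projectId, fileId, {
    segmentIndex,
  })) as { data: EditorTaskSegmentResponse };
  return res.data.segment?.text ?? "";
}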


@@ -5,7 +5,7 @@ import { Dataset, DatasetType, DataSource } from "../../dataset.model";
import { useCallback, useEffect, useMemo, useState } from "react"; import { useCallback, useEffect, useMemo, useState } from "react";
import { queryTasksUsingGet } from "@/pages/DataCollection/collection.apis"; import { queryTasksUsingGet } from "@/pages/DataCollection/collection.apis";
import { updateDatasetByIdUsingPut } from "../../dataset.api"; import { updateDatasetByIdUsingPut } from "../../dataset.api";
import { sliceFile } from "@/utils/file.util"; import { sliceFile, shouldStreamUpload } from "@/utils/file.util";
import Dragger from "antd/es/upload/Dragger"; import Dragger from "antd/es/upload/Dragger";
const TEXT_FILE_MIME_PREFIX = "text/"; const TEXT_FILE_MIME_PREFIX = "text/";
@@ -90,14 +90,16 @@ async function splitFileByLines(file: UploadFile): Promise<UploadFile[]> {
const lines = text.split(/\r?\n/).filter((line: string) => line.trim() !== ""); const lines = text.split(/\r?\n/).filter((line: string) => line.trim() !== "");
if (lines.length === 0) return []; if (lines.length === 0) return [];
// 生成文件名:原文件名_序号.扩展名 // 生成文件名:原文件名_序号(不保留后缀)
const nameParts = file.name.split("."); const nameParts = file.name.split(".");
const ext = nameParts.length > 1 ? "." + nameParts.pop() : ""; if (nameParts.length > 1) {
nameParts.pop();
}
const baseName = nameParts.join("."); const baseName = nameParts.join(".");
const padLength = String(lines.length).length; const padLength = String(lines.length).length;
return lines.map((line: string, index: number) => { return lines.map((line: string, index: number) => {
const newFileName = `${baseName}_${String(index + 1).padStart(padLength, "0")}${ext}`; const newFileName = `${baseName}_${String(index + 1).padStart(padLength, "0")}`;
const blob = new Blob([line], { type: "text/plain" }); const blob = new Blob([line], { type: "text/plain" });
const newFile = new File([blob], newFileName, { type: "text/plain" }); const newFile = new File([blob], newFileName, { type: "text/plain" });
return { return {
@@ -164,17 +166,75 @@ export default function ImportConfiguration({
// 本地上传文件相关逻辑 // 本地上传文件相关逻辑
const handleUpload = async (dataset: Dataset) => { const handleUpload = async (dataset: Dataset) => {
let filesToUpload = const filesToUpload =
(form.getFieldValue("files") as UploadFile[] | undefined) || []; (form.getFieldValue("files") as UploadFile[] | undefined) || [];
// 如果启用分行分割,处理文件 // 如果启用分行分割,对大文件使用流式处理
if (importConfig.splitByLine && !hasNonTextFile) { if (importConfig.splitByLine && !hasNonTextFile) {
const splitResults = await Promise.all( // 检查是否有大文件需要流式分割上传
filesToUpload.map((file) => splitFileByLines(file)) const filesForStreamUpload: File[] = [];
); const filesForNormalUpload: UploadFile[] = [];
filesToUpload = splitResults.flat();
for (const file of filesToUpload) {
const originFile = file.originFileObj ?? file;
if (originFile instanceof File && shouldStreamUpload(originFile)) {
filesForStreamUpload.push(originFile);
} else {
filesForNormalUpload.push(file);
}
} }
// 大文件使用流式分割上传
if (filesForStreamUpload.length > 0) {
window.dispatchEvent(
new CustomEvent("upload:dataset-stream", {
detail: {
dataset,
files: filesForStreamUpload,
updateEvent,
hasArchive: importConfig.hasArchive,
prefix: currentPrefix,
},
})
);
}
// 小文件使用传统分割方式
if (filesForNormalUpload.length > 0) {
const splitResults = await Promise.all(
filesForNormalUpload.map((file) => splitFileByLines(file))
);
const smallFilesToUpload = splitResults.flat();
// 计算分片列表
const sliceList = smallFilesToUpload.map((file) => {
const originFile = (file.originFileObj ?? file) as Blob;
const slices = sliceFile(originFile);
return {
originFile: originFile,
slices,
name: file.name,
size: originFile.size || 0,
};
});
console.log("[ImportConfiguration] Uploading small files with currentPrefix:", currentPrefix);
window.dispatchEvent(
new CustomEvent("upload:dataset", {
detail: {
dataset,
files: sliceList,
updateEvent,
hasArchive: importConfig.hasArchive,
prefix: currentPrefix,
},
})
);
}
return;
}
// 未启用分行分割,使用普通上传
// 计算分片列表 // 计算分片列表
const sliceList = filesToUpload.map((file) => { const sliceList = filesToUpload.map((file) => {
const originFile = (file.originFileObj ?? file) as Blob; const originFile = (file.originFileObj ?? file) as Blob;
@@ -234,6 +294,10 @@ export default function ImportConfiguration({
if (!data) return; if (!data) return;
console.log('[ImportConfiguration] handleImportData called, currentPrefix:', currentPrefix); console.log('[ImportConfiguration] handleImportData called, currentPrefix:', currentPrefix);
if (importConfig.source === DataSource.UPLOAD) { if (importConfig.source === DataSource.UPLOAD) {
// 立即显示任务中心,让用户感知上传已开始(在文件分割等耗时操作之前)
window.dispatchEvent(
new CustomEvent("show:task-popover", { detail: { show: true } })
);
await handleUpload(data); await handleUpload(data);
} else if (importConfig.source === DataSource.COLLECTION) { } else if (importConfig.source === DataSource.COLLECTION) {
await updateDatasetByIdUsingPut(data.id, { await updateDatasetByIdUsingPut(data.id, {
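The split-by-line branch above boils down to a small routing decision: files above the streaming threshold go to the `upload:dataset-stream` path, the rest to the classic split-and-slice path. A sketch, assuming the `shouldStreamUpload` helper introduced in this PR:

import type { UploadFile } from "antd";
import { shouldStreamUpload } from "@/utils/file.util";

// Split the selected files into the two upload paths used above.
export function routeFilesForSplitUpload(files: UploadFile[]): {
  streamFiles: File[];
  normalFiles: UploadFile[];
} {
  const streamFiles: File[] = [];
  const normalFiles: UploadFile[] = [];
  for (const file of files) {
    const origin = file.originFileObj ?? file;
    if (origin instanceof File && shouldStreamUpload(origin)) {
      streamFiles.push(origin);
    } else {
      normalFiles.push(file);
    }
  }
  return { streamFiles, normalFiles };
}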


@@ -102,6 +102,13 @@ export interface DatasetTask {
executionHistory?: { time: string; status: string }[]; executionHistory?: { time: string; status: string }[];
} }
export interface StreamUploadInfo {
currentFile: string;
fileIndex: number;
totalFiles: number;
uploadedLines: number;
}
export interface TaskItem { export interface TaskItem {
key: string; key: string;
title: string; title: string;
@@ -113,4 +120,6 @@ export interface TaskItem {
updateEvent?: string; updateEvent?: string;
size?: number; size?: number;
hasArchive?: boolean; hasArchive?: boolean;
prefix?: string;
streamUploadInfo?: StreamUploadInfo;
} }
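For reference, a `StreamUploadInfo` snapshot as the task center might render it while the second of three files is being split and uploaded; the values are illustrative and the import path is assumed.

import type { StreamUploadInfo } from "@/pages/DataManagement/task.model"; // path assumed

export const exampleProgress: StreamUploadInfo = {
  currentFile: "reviews.jsonl", // illustrative file name
  fileIndex: 2,
  totalFiles: 3,
  uploadedLines: 12800,
};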


@@ -3,25 +3,28 @@ import {
preUploadUsingPost, preUploadUsingPost,
uploadFileChunkUsingPost, uploadFileChunkUsingPost,
} from "@/pages/DataManagement/dataset.api"; } from "@/pages/DataManagement/dataset.api";
import { Button, Empty, Progress } from "antd"; import { Button, Empty, Progress, Tag } from "antd";
import { DeleteOutlined } from "@ant-design/icons"; import { DeleteOutlined, FileTextOutlined } from "@ant-design/icons";
import { useEffect } from "react"; import { useEffect } from "react";
import { useFileSliceUpload } from "@/hooks/useSliceUpload"; import { useFileSliceUpload } from "@/hooks/useSliceUpload";
export default function TaskUpload() { export default function TaskUpload() {
const { createTask, taskList, removeTask, handleUpload } = useFileSliceUpload( const { createTask, taskList, removeTask, handleUpload, registerStreamUploadListener } = useFileSliceUpload(
{ {
preUpload: preUploadUsingPost, preUpload: preUploadUsingPost,
uploadChunk: uploadFileChunkUsingPost, uploadChunk: uploadFileChunkUsingPost,
cancelUpload: cancelUploadUsingPut, cancelUpload: cancelUploadUsingPut,
} },
true, // showTaskCenter
true // enableStreamUpload
); );
useEffect(() => { useEffect(() => {
const uploadHandler = (e: any) => { const uploadHandler = (e: Event) => {
console.log('[TaskUpload] Received upload event detail:', e.detail); const customEvent = e as CustomEvent;
const { files } = e.detail; console.log('[TaskUpload] Received upload event detail:', customEvent.detail);
const task = createTask(e.detail); const { files } = customEvent.detail;
const task = createTask(customEvent.detail);
console.log('[TaskUpload] Created task with prefix:', task.prefix); console.log('[TaskUpload] Created task with prefix:', task.prefix);
handleUpload({ task, files }); handleUpload({ task, files });
}; };
@@ -29,7 +32,13 @@ export default function TaskUpload() {
return () => { return () => {
window.removeEventListener("upload:dataset", uploadHandler); window.removeEventListener("upload:dataset", uploadHandler);
}; };
}, []); }, [createTask, handleUpload]);
// 注册流式上传监听器
useEffect(() => {
const unregister = registerStreamUploadListener();
return unregister;
}, [registerStreamUploadListener]);
return ( return (
<div <div
@@ -55,7 +64,22 @@ export default function TaskUpload() {
></Button> ></Button>
</div> </div>
<Progress size="small" percent={task.percent} /> <Progress size="small" percent={Number(task.percent)} />
{task.streamUploadInfo && (
<div className="flex items-center gap-2 text-xs text-gray-500 mt-1">
<Tag icon={<FileTextOutlined />} size="small">
</Tag>
<span>
: {task.streamUploadInfo.uploadedLines}
</span>
{task.streamUploadInfo.totalFiles > 1 && (
<span>
({task.streamUploadInfo.fileIndex}/{task.streamUploadInfo.totalFiles} )
</span>
)}
</div>
)}
</div> </div>
))} ))}
{taskList.length === 0 && ( {taskList.length === 0 && (
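The component above communicates with ImportConfiguration purely through window events. A sketch of that contract with a typed detail and an unregister function, mirroring `registerStreamUploadListener`; the `StreamUploadDetail` name is ours, its fields come from the dispatch sites shown earlier.

// Fields match the CustomEvent detail dispatched from ImportConfiguration.
interface StreamUploadDetail {
  dataset: unknown;
  files: File[];
  updateEvent?: string;
  hasArchive?: boolean;
  prefix?: string;
}

export function onStreamUpload(
  handler: (detail: StreamUploadDetail) => void
): () => void {
  const listener = (e: Event) =>
    handler((e as CustomEvent<StreamUploadDetail>).detail);
  window.addEventListener("upload:dataset-stream", listener);
  // Caller invokes the returned function to unregister, as in the useEffect above.
  return () => window.removeEventListener("upload:dataset-stream", listener);
}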


@@ -1,79 +1,657 @@
import { UploadFile } from "antd"; import { UploadFile } from "antd";
import jsSHA from "jssha"; import jsSHA from "jssha";
const CHUNK_SIZE = 1024 * 1024 * 60; // 默认分片大小:5MB(适合大多数网络环境)
export const DEFAULT_CHUNK_SIZE = 1024 * 1024 * 5;
// 大文件阈值:10MB
export const LARGE_FILE_THRESHOLD = 1024 * 1024 * 10;
// 最大并发上传数
export const MAX_CONCURRENT_UPLOADS = 3;
// 文本文件读取块大小:20MB(用于计算 SHA256)
const BUFFER_CHUNK_SIZE = 1024 * 1024 * 20;
export function sliceFile(file, chunkSize = CHUNK_SIZE): Blob[] { /**
* 将文件分割为多个分片
* @param file 文件对象
* @param chunkSize 分片大小(字节),默认 5MB
* @returns 分片数组(Blob 列表)
*/
export function sliceFile(file: Blob, chunkSize = DEFAULT_CHUNK_SIZE): Blob[] {
const totalSize = file.size; const totalSize = file.size;
const chunks: Blob[] = [];
// 小文件不需要分片
if (totalSize <= chunkSize) {
return [file];
}
let start = 0; let start = 0;
let end = start + chunkSize;
const chunks = [];
while (start < totalSize) { while (start < totalSize) {
const end = Math.min(start + chunkSize, totalSize);
const blob = file.slice(start, end); const blob = file.slice(start, end);
chunks.push(blob); chunks.push(blob);
start = end; start = end;
end = start + chunkSize;
} }
return chunks; return chunks;
} }
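// Usage sketch (illustrative, not part of this file's diff): slicing a 12 MB
// blob with the 5 MB default yields chunks of 5 MB, 5 MB and 2 MB.
export function exampleSliceFileUsage(): Blob[] {
  const blob = new Blob([new Uint8Array(12 * 1024 * 1024)]);
  const chunks = sliceFile(blob); // three chunks with DEFAULT_CHUNK_SIZE
  return chunks;
}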
export function calculateSHA256(file: Blob): Promise<string> { /**
let count = 0; * 计算文件的 SHA256 哈希值
const hash = new jsSHA("SHA-256", "ARRAYBUFFER", { encoding: "UTF8" }); * @param file 文件 Blob
* @param onProgress 进度回调(可选)
* @returns SHA256 哈希字符串
*/
export function calculateSHA256(
file: Blob,
onProgress?: (percent: number) => void
): Promise<string> {
return new Promise((resolve, reject) => { return new Promise((resolve, reject) => {
const hash = new jsSHA("SHA-256", "ARRAYBUFFER", { encoding: "UTF8" });
const reader = new FileReader(); const reader = new FileReader();
let processedSize = 0;
function readChunk(start: number, end: number) { function readChunk(start: number, end: number) {
const slice = file.slice(start, end); const slice = file.slice(start, end);
reader.readAsArrayBuffer(slice); reader.readAsArrayBuffer(slice);
} }
const bufferChunkSize = 1024 * 1024 * 20;
function processChunk(offset: number) { function processChunk(offset: number) {
const start = offset; const start = offset;
const end = Math.min(start + bufferChunkSize, file.size); const end = Math.min(start + BUFFER_CHUNK_SIZE, file.size);
count = end;
readChunk(start, end); readChunk(start, end);
} }
reader.onloadend = function () { reader.onloadend = function (e) {
const arraybuffer = reader.result; const arraybuffer = reader.result as ArrayBuffer;
if (!arraybuffer) {
reject(new Error("Failed to read file"));
return;
}
hash.update(arraybuffer); hash.update(arraybuffer);
if (count < file.size) { processedSize += (e.target as FileReader).result?.byteLength || 0;
processChunk(count);
if (onProgress) {
const percent = Math.min(100, Math.round((processedSize / file.size) * 100));
onProgress(percent);
}
if (processedSize < file.size) {
processChunk(processedSize);
} else { } else {
resolve(hash.getHash("HEX", { outputLen: 256 })); resolve(hash.getHash("HEX", { outputLen: 256 }));
} }
}; };
reader.onerror = () => reject(new Error("File reading failed"));
processChunk(0); processChunk(0);
}); });
} }
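// Usage sketch (illustrative, not part of this file's diff): hash a blob and
// surface the chunked-read progress added above.
export async function exampleHashUsage(file: Blob): Promise<string> {
  return calculateSHA256(file, (percent) => {
    console.log(`hashing: ${percent}%`);
  });
}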
/**
* 批量计算多个文件的 SHA256
* @param files 文件列表
* @param onFileProgress 单个文件进度回调(可选)
* @returns 哈希值数组
*/
export async function calculateSHA256Batch(
files: Blob[],
onFileProgress?: (index: number, percent: number) => void
): Promise<string[]> {
const results: string[] = [];
for (let i = 0; i < files.length; i++) {
const hash = await calculateSHA256(files[i], (percent) => {
onFileProgress?.(i, percent);
});
results.push(hash);
}
return results;
}
/**
* 检查文件是否存在(未被修改或删除)
* @param fileList 文件列表
* @returns 返回第一个不存在的文件,或 null(如果都存在)
*/
export function checkIsFilesExist( export function checkIsFilesExist(
fileList: UploadFile[] fileList: Array<{ originFile?: Blob }>
): Promise<UploadFile | null> { ): Promise<{ originFile?: Blob } | null> {
return new Promise((resolve) => { return new Promise((resolve) => {
const loadEndFn = (file: UploadFile, reachEnd: boolean, e) => { if (!fileList.length) {
const fileNotExist = !e.target.result; resolve(null);
return;
}
let checkedCount = 0;
const totalCount = fileList.length;
const loadEndFn = (file: { originFile?: Blob }, e: ProgressEvent<FileReader>) => {
checkedCount++;
const fileNotExist = !e.target?.result;
if (fileNotExist) { if (fileNotExist) {
resolve(file); resolve(file);
return;
} }
if (reachEnd) { if (checkedCount >= totalCount) {
resolve(null); resolve(null);
} }
}; };
for (let i = 0; i < fileList.length; i++) { for (const file of fileList) {
const { originFile: file } = fileList[i];
const fileReader = new FileReader(); const fileReader = new FileReader();
fileReader.readAsArrayBuffer(file); const actualFile = file.originFile;
fileReader.onloadend = (e) =>
loadEndFn(fileList[i], i === fileList.length - 1, e); if (!actualFile) {
checkedCount++;
if (checkedCount >= totalCount) {
resolve(null);
}
continue;
}
fileReader.readAsArrayBuffer(actualFile.slice(0, 1));
fileReader.onloadend = (e) => loadEndFn(file, e);
fileReader.onerror = () => {
checkedCount++;
resolve(file);
};
} }
}); });
} }
/**
* 判断文件是否为大文件
* @param size 文件大小(字节)
* @param threshold 阈值(字节),默认 10MB
*/
export function isLargeFile(size: number, threshold = LARGE_FILE_THRESHOLD): boolean {
return size > threshold;
}
/**
* 格式化文件大小为人类可读格式
* @param bytes 字节数
* @param decimals 小数位数
*/
export function formatFileSize(bytes: number, decimals = 2): string {
if (bytes === 0) return "0 B";
const k = 1024;
const sizes = ["B", "KB", "MB", "GB", "TB", "PB"];
const i = Math.floor(Math.log(bytes) / Math.log(k));
return `${parseFloat((bytes / Math.pow(k, i)).toFixed(decimals))} ${sizes[i]}`;
}
/**
* 并发执行异步任务
* @param tasks 任务函数数组
* @param maxConcurrency 最大并发数
* @param onTaskComplete 单个任务完成回调(可选)
*/
export async function runConcurrentTasks<T>(
tasks: (() => Promise<T>)[],
maxConcurrency: number,
onTaskComplete?: (index: number, result: T) => void
): Promise<T[]> {
const results: T[] = new Array(tasks.length);
let index = 0;
async function runNext(): Promise<void> {
const currentIndex = index++;
if (currentIndex >= tasks.length) return;
const result = await tasks[currentIndex]();
results[currentIndex] = result;
onTaskComplete?.(currentIndex, result);
await runNext();
}
const workers = Array(Math.min(maxConcurrency, tasks.length))
.fill(null)
.map(() => runNext());
await Promise.all(workers);
return results;
}
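// Usage sketch (illustrative, not part of this file's diff): run ten mock
// uploads with at most MAX_CONCURRENT_UPLOADS in flight; results come back
// index-aligned with the task array.
export async function exampleConcurrencyUsage(): Promise<number[]> {
  const tasks = Array.from({ length: 10 }, (_, i) => async () => {
    await new Promise((resolve) => setTimeout(resolve, 50)); // pretend work
    return i;
  });
  return runConcurrentTasks(tasks, MAX_CONCURRENT_UPLOADS);
}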
/**
* 按行分割文本文件内容
* @param text 文本内容
* @param skipEmptyLines 是否跳过空行,默认 true
* @returns 行数组
*/
export function splitTextByLines(text: string, skipEmptyLines = true): string[] {
const lines = text.split(/\r?\n/);
if (skipEmptyLines) {
return lines.filter((line) => line.trim() !== "");
}
return lines;
}
/**
* 创建分片信息对象
* @param file 原始文件
* @param chunkSize 分片大小
*/
export function createFileSliceInfo(
file: File | Blob,
chunkSize = DEFAULT_CHUNK_SIZE
): {
originFile: Blob;
slices: Blob[];
name: string;
size: number;
totalChunks: number;
} {
const slices = sliceFile(file, chunkSize);
return {
originFile: file,
slices,
name: (file as File).name || "unnamed",
size: file.size,
totalChunks: slices.length,
};
}
/**
* 支持的文本文件 MIME 类型前缀
*/
export const TEXT_FILE_MIME_PREFIX = "text/";
/**
* 支持的文本文件 MIME 类型集合
*/
export const TEXT_FILE_MIME_TYPES = new Set([
"application/json",
"application/xml",
"application/csv",
"application/ndjson",
"application/x-ndjson",
"application/x-yaml",
"application/yaml",
"application/javascript",
"application/x-javascript",
"application/sql",
"application/rtf",
"application/xhtml+xml",
"application/svg+xml",
]);
/**
* 支持的文本文件扩展名集合
*/
export const TEXT_FILE_EXTENSIONS = new Set([
".txt",
".md",
".markdown",
".csv",
".tsv",
".json",
".jsonl",
".ndjson",
".log",
".xml",
".yaml",
".yml",
".sql",
".js",
".ts",
".jsx",
".tsx",
".html",
".htm",
".css",
".scss",
".less",
".py",
".java",
".c",
".cpp",
".h",
".hpp",
".go",
".rs",
".rb",
".php",
".sh",
".bash",
".zsh",
".ps1",
".bat",
".cmd",
".svg",
".rtf",
]);
/**
* 判断文件是否为文本文件(支持 UploadFile 类型)
* @param file UploadFile 对象
*/
export function isTextUploadFile(file: UploadFile): boolean {
const mimeType = (file.type || "").toLowerCase();
if (mimeType) {
if (mimeType.startsWith(TEXT_FILE_MIME_PREFIX)) return true;
if (TEXT_FILE_MIME_TYPES.has(mimeType)) return true;
}
const fileName = file.name || "";
const dotIndex = fileName.lastIndexOf(".");
if (dotIndex < 0) return false;
const ext = fileName.slice(dotIndex).toLowerCase();
return TEXT_FILE_EXTENSIONS.has(ext);
}
/**
* 判断文件名是否为文本文件
* @param fileName 文件名
*/
export function isTextFileByName(fileName: string): boolean {
const lowerName = fileName.toLowerCase();
// 先检查 MIME 类型(如果有)
// 这里简化处理,主要通过扩展名判断
const dotIndex = lowerName.lastIndexOf(".");
if (dotIndex < 0) return false;
const ext = lowerName.slice(dotIndex);
return TEXT_FILE_EXTENSIONS.has(ext);
}
/**
* 获取文件扩展名
* @param fileName 文件名
*/
export function getFileExtension(fileName: string): string {
const dotIndex = fileName.lastIndexOf(".");
if (dotIndex < 0) return "";
return fileName.slice(dotIndex).toLowerCase();
}
/**
* 安全地读取文件为文本
* @param file 文件对象
* @param encoding 编码,默认 UTF-8
*/
export function readFileAsText(
file: File | Blob,
encoding = "UTF-8"
): Promise<string> {
return new Promise((resolve, reject) => {
const reader = new FileReader();
reader.onload = (e) => resolve(e.target?.result as string);
reader.onerror = () => reject(new Error("Failed to read file"));
reader.readAsText(file, encoding);
});
}
/**
* 流式分割文件并逐行上传
* 使用 Blob.slice 逐块读取,避免一次性加载大文件到内存
* @param file 文件对象
* @param datasetId 数据集ID
* @param uploadFn 上传函数,接收 FormData 和配置,返回 Promise
* @param onProgress 进度回调 (currentBytes, totalBytes, uploadedLines)
* @param chunkSize 每次读取的块大小,默认 1MB
* @param options 其他选项
* @returns 上传结果统计
*/
export interface StreamUploadOptions {
reqId?: number;
resolveReqId?: (params: { totalFileNum: number; totalSize: number }) => Promise<number>;
onReqIdResolved?: (reqId: number) => void;
fileNamePrefix?: string;
hasArchive?: boolean;
prefix?: string;
signal?: AbortSignal;
maxConcurrency?: number;
}
export interface StreamUploadResult {
uploadedCount: number;
totalBytes: number;
skippedEmptyCount: number;
}
async function processFileLines(
file: File,
chunkSize: number,
signal: AbortSignal | undefined,
onLine?: (line: string, index: number) => Promise<void> | void,
onProgress?: (currentBytes: number, totalBytes: number, processedLines: number) => void
): Promise<{ lineCount: number; skippedEmptyCount: number }> {
const fileSize = file.size;
let offset = 0;
let buffer = "";
let skippedEmptyCount = 0;
let lineIndex = 0;
while (offset < fileSize) {
if (signal?.aborted) {
throw new Error("Upload cancelled");
}
const end = Math.min(offset + chunkSize, fileSize);
const chunk = file.slice(offset, end);
const text = await readFileAsText(chunk);
const combined = buffer + text;
const lines = combined.split(/\r?\n/);
buffer = lines.pop() || "";
for (const line of lines) {
if (signal?.aborted) {
throw new Error("Upload cancelled");
}
if (!line.trim()) {
skippedEmptyCount++;
continue;
}
const currentIndex = lineIndex;
lineIndex += 1;
if (onLine) {
await onLine(line, currentIndex);
}
}
offset = end;
onProgress?.(offset, fileSize, lineIndex);
}
if (buffer.trim()) {
const currentIndex = lineIndex;
lineIndex += 1;
if (onLine) {
await onLine(buffer, currentIndex);
}
} else if (buffer.length > 0) {
skippedEmptyCount++;
}
return { lineCount: lineIndex, skippedEmptyCount };
}
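// Usage sketch (illustrative, not part of this file's diff): count the
// non-empty lines of a large text file without loading it all at once.
export async function exampleCountLines(file: File): Promise<number> {
  const { lineCount, skippedEmptyCount } = await processFileLines(
    file,
    1024 * 1024, // read in 1 MB windows
    undefined    // no AbortSignal
  );
  console.log(`skipped ${skippedEmptyCount} empty lines`);
  return lineCount;
}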
export async function streamSplitAndUpload(
file: File,
uploadFn: (formData: FormData, config?: { onUploadProgress?: (e: { loaded: number; total: number }) => void }) => Promise<unknown>,
onProgress?: (currentBytes: number, totalBytes: number, uploadedLines: number) => void,
chunkSize: number = 1024 * 1024, // 1MB
options: StreamUploadOptions
): Promise<StreamUploadResult> {
const {
reqId: initialReqId,
resolveReqId,
onReqIdResolved,
fileNamePrefix,
prefix,
signal,
maxConcurrency = 3,
} = options;
const fileSize = file.size;
let uploadedCount = 0;
let skippedEmptyCount = 0;
// 获取文件名基础部分和扩展名
const originalFileName = fileNamePrefix || file.name;
const lastDotIndex = originalFileName.lastIndexOf(".");
const baseName = lastDotIndex > 0 ? originalFileName.slice(0, lastDotIndex) : originalFileName;
const fileExtension = lastDotIndex > 0 ? originalFileName.slice(lastDotIndex) : "";
let resolvedReqId = initialReqId;
if (!resolvedReqId) {
const scanResult = await processFileLines(file, chunkSize, signal);
const totalFileNum = scanResult.lineCount;
skippedEmptyCount = scanResult.skippedEmptyCount;
if (totalFileNum === 0) {
return {
uploadedCount: 0,
totalBytes: fileSize,
skippedEmptyCount,
};
}
if (signal?.aborted) {
throw new Error("Upload cancelled");
}
if (!resolveReqId) {
throw new Error("Missing pre-upload request id");
}
resolvedReqId = await resolveReqId({ totalFileNum, totalSize: fileSize });
if (!resolvedReqId) {
throw new Error("Failed to resolve pre-upload request id");
}
onReqIdResolved?.(resolvedReqId);
}
if (!resolvedReqId) {
throw new Error("Missing pre-upload request id");
}
/**
* 上传单行内容
* 每行作为独立文件上传,fileNo 对应行序号,chunkNo 固定为 1
*/
async function uploadLine(line: string, index: number): Promise<void> {
// 检查是否已取消
if (signal?.aborted) {
throw new Error("Upload cancelled");
}
if (!line.trim()) {
skippedEmptyCount++;
return;
}
// 保留原始文件扩展名
const fileIndex = index + 1;
const newFileName = `${baseName}_${String(fileIndex).padStart(6, "0")}${fileExtension}`;
const blob = new Blob([line], { type: "text/plain" });
const lineFile = new File([blob], newFileName, { type: "text/plain" });
// 计算分片(小文件通常只需要一个分片)
const slices = sliceFile(lineFile, DEFAULT_CHUNK_SIZE);
const checkSum = await calculateSHA256(slices[0]);
// 检查是否已取消(计算哈希后)
if (signal?.aborted) {
throw new Error("Upload cancelled");
}
const formData = new FormData();
formData.append("file", slices[0]);
formData.append("reqId", resolvedReqId.toString());
// 每行作为独立文件上传
formData.append("fileNo", fileIndex.toString());
formData.append("chunkNo", "1");
formData.append("fileName", newFileName);
formData.append("fileSize", lineFile.size.toString());
formData.append("totalChunkNum", "1");
formData.append("checkSumHex", checkSum);
if (prefix !== undefined) {
formData.append("prefix", prefix);
}
await uploadFn(formData, {
onUploadProgress: () => {
// 单行文件很小,进度主要用于追踪上传状态
},
});
}
const inFlight = new Set<Promise<void>>();
let uploadError: unknown = null;
const enqueueUpload = async (line: string, index: number) => {
if (signal?.aborted) {
throw new Error("Upload cancelled");
}
if (uploadError) {
throw uploadError;
}
const uploadPromise = uploadLine(line, index)
.then(() => {
uploadedCount++;
})
.catch((err) => {
uploadError = err;
});
inFlight.add(uploadPromise);
uploadPromise.finally(() => inFlight.delete(uploadPromise));
if (inFlight.size >= maxConcurrency) {
await Promise.race(inFlight);
if (uploadError) {
throw uploadError;
}
}
};
let uploadResult: { lineCount: number; skippedEmptyCount: number } | null = null;
try {
uploadResult = await processFileLines(
file,
chunkSize,
signal,
enqueueUpload,
(currentBytes, totalBytes) => {
onProgress?.(currentBytes, totalBytes, uploadedCount);
}
);
if (uploadError) {
throw uploadError;
}
} finally {
if (inFlight.size > 0) {
await Promise.allSettled(inFlight);
}
}
if (!uploadResult || (initialReqId && uploadResult.lineCount === 0)) {
return {
uploadedCount: 0,
totalBytes: fileSize,
skippedEmptyCount: uploadResult?.skippedEmptyCount ?? 0,
};
}
if (!initialReqId) {
skippedEmptyCount = skippedEmptyCount || uploadResult.skippedEmptyCount;
} else {
skippedEmptyCount = uploadResult.skippedEmptyCount;
}
return {
uploadedCount,
totalBytes: fileSize,
skippedEmptyCount,
};
}
/**
* 判断文件是否需要流式分割上传
* @param file 文件对象
* @param threshold 阈值,默认 5MB
*/
export function shouldStreamUpload(file: File, threshold: number = 5 * 1024 * 1024): boolean {
return file.size > threshold;
}
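Tying the pieces together, a minimal sketch of driving streamSplitAndUpload end to end. The chunk-upload endpoint, the placeholder reqId, and the helper name are assumptions for illustration; in the PR the real pre-upload and chunk-upload calls come from dataset.api (preUploadUsingPost / uploadFileChunkUsingPost).

import { shouldStreamUpload, streamSplitAndUpload } from "@/utils/file.util";

export async function uploadLargeTextFile(file: File): Promise<void> {
  if (!shouldStreamUpload(file)) return; // small files keep the classic path

  const controller = new AbortController(); // wire to a cancel button

  const result = await streamSplitAndUpload(
    file,
    async (formData) => {
      // Stand-in for the real chunk-upload request; endpoint assumed.
      await fetch("/api/upload/chunk", { method: "POST", body: formData });
    },
    (currentBytes, totalBytes, uploadedLines) => {
      console.log(`${uploadedLines} lines uploaded, ${currentBytes}/${totalBytes} bytes read`);
    },
    1024 * 1024,
    {
      // Stand-in for the pre-upload call; must return the numeric reqId.
      resolveReqId: async ({ totalFileNum, totalSize }) => {
        console.log(`pre-upload: ${totalFileNum} line-files, ${totalSize} bytes`);
        return 1; // placeholder reqId
      },
      signal: controller.signal,
      maxConcurrency: 3,
    }
  );

  console.log(result); // { uploadedCount, totalBytes, skippedEmptyCount }
}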


@@ -92,6 +92,14 @@ class Request {
}); });
} }
// 监听 AbortSignal 来中止请求
if (config.signal) {
config.signal.addEventListener("abort", () => {
xhr.abort();
reject(new Error("上传已取消"));
});
}
// 监听上传进度 // 监听上传进度
xhr.upload.addEventListener("progress", function (event) { xhr.upload.addEventListener("progress", function (event) {
if (event.lengthComputable) { if (event.lengthComputable) {
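From the caller's side the new abort handling looks like this: a minimal XHR wrapper that honours an AbortSignal the same way the Request class now does, cancelled via AbortController. The wrapper name and URL are illustrative, not the project's API.

function xhrPost(url: string, body: FormData, signal?: AbortSignal): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest();
    xhr.open("POST", url);
    if (signal) {
      // Mirrors the request-layer change: abort the XHR and reject.
      signal.addEventListener("abort", () => {
        xhr.abort();
        reject(new Error("Upload cancelled"));
      });
    }
    xhr.onload = () => resolve(xhr.response);
    xhr.onerror = () => reject(new Error("Network error"));
    xhr.send(body);
  });
}

// Usage: abort() from a cancel button stops the in-flight request.
const controller = new AbortController();
void xhrPost("/api/upload/chunk", new FormData(), controller.signal); // URL assumed
controller.abort();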


@@ -61,13 +61,15 @@ class DatasetFiles(Base):
dataset_id = Column(String(36), nullable=False, comment="所属数据集ID(UUID)") dataset_id = Column(String(36), nullable=False, comment="所属数据集ID(UUID)")
file_name = Column(String(255), nullable=False, comment="文件名") file_name = Column(String(255), nullable=False, comment="文件名")
file_path = Column(String(1000), nullable=False, comment="文件路径") file_path = Column(String(1000), nullable=False, comment="文件路径")
logical_path = Column(String(1000), nullable=False, comment="文件逻辑路径(相对数据集根目录)")
version = Column(BigInteger, nullable=False, default=1, comment="文件版本号(同 logical_path 递增)")
file_type = Column(String(50), nullable=True, comment="文件格式:JPG/PNG/DCM/TXT等") file_type = Column(String(50), nullable=True, comment="文件格式:JPG/PNG/DCM/TXT等")
file_size = Column(BigInteger, default=0, comment="文件大小(字节)") file_size = Column(BigInteger, default=0, comment="文件大小(字节)")
check_sum = Column(String(64), nullable=True, comment="文件校验和") check_sum = Column(String(64), nullable=True, comment="文件校验和")
tags = Column(JSON, nullable=True, comment="文件标签信息") tags = Column(JSON, nullable=True, comment="文件标签信息")
tags_updated_at = Column(TIMESTAMP, nullable=True, comment="标签最后更新时间") tags_updated_at = Column(TIMESTAMP, nullable=True, comment="标签最后更新时间")
dataset_filemetadata = Column("metadata", JSON, nullable=True, comment="文件元数据") dataset_filemetadata = Column("metadata", JSON, nullable=True, comment="文件元数据")
status = Column(String(50), default='ACTIVE', comment="文件状态:ACTIVE/DELETED/PROCESSING") status = Column(String(50), default='ACTIVE', comment="文件状态:ACTIVE/ARCHIVED/DELETED/PROCESSING")
upload_time = Column(TIMESTAMP, server_default=func.current_timestamp(), comment="上传时间") upload_time = Column(TIMESTAMP, server_default=func.current_timestamp(), comment="上传时间")
last_access_time = Column(TIMESTAMP, nullable=True, comment="最后访问时间") last_access_time = Column(TIMESTAMP, nullable=True, comment="最后访问时间")
created_at = Column(TIMESTAMP, server_default=func.current_timestamp(), comment="创建时间") created_at = Column(TIMESTAMP, server_default=func.current_timestamp(), comment="创建时间")


@@ -19,6 +19,7 @@ from app.db.session import get_db
from app.module.annotation.schema.editor import ( from app.module.annotation.schema.editor import (
EditorProjectInfo, EditorProjectInfo,
EditorTaskListResponse, EditorTaskListResponse,
EditorTaskSegmentResponse,
EditorTaskResponse, EditorTaskResponse,
UpsertAnnotationRequest, UpsertAnnotationRequest,
UpsertAnnotationResponse, UpsertAnnotationResponse,
@@ -87,6 +88,21 @@ async def get_editor_task(
return StandardResponse(code=200, message="success", data=task) return StandardResponse(code=200, message="success", data=task)
@router.get(
"/projects/{project_id}/tasks/{file_id}/segments",
response_model=StandardResponse[EditorTaskSegmentResponse],
)
async def get_editor_task_segment(
project_id: str = Path(..., description="标注项目ID(t_dm_labeling_projects.id)"),
file_id: str = Path(..., description="文件ID(t_dm_dataset_files.id)"),
segment_index: int = Query(..., ge=0, alias="segmentIndex", description="段落索引(从0开始)"),
db: AsyncSession = Depends(get_db),
):
service = AnnotationEditorService(db)
result = await service.get_task_segment(project_id, file_id, segment_index)
return StandardResponse(code=200, message="success", data=result)
@router.put( @router.put(
"/projects/{project_id}/tasks/{file_id}/annotation", "/projects/{project_id}/tasks/{file_id}/annotation",
response_model=StandardResponse[UpsertAnnotationResponse], response_model=StandardResponse[UpsertAnnotationResponse],


@@ -150,6 +150,18 @@ async def create_mapping(
labeling_project, snapshot_file_ids labeling_project, snapshot_file_ids
) )
# 如果启用了分段且为文本数据集,预生成切片结构
if dataset_type == TEXT_DATASET_TYPE and request.segmentation_enabled:
try:
from ..service.editor import AnnotationEditorService
editor_service = AnnotationEditorService(db)
# 异步预计算切片(不阻塞创建响应)
segmentation_result = await editor_service.precompute_segmentation_for_project(labeling_project.id)
logger.info(f"Precomputed segmentation for project {labeling_project.id}: {segmentation_result}")
except Exception as e:
logger.warning(f"Failed to precompute segmentation for project {labeling_project.id}: {e}")
# 不影响项目创建,只记录警告
response_data = DatasetMappingCreateResponse( response_data = DatasetMappingCreateResponse(
id=mapping.id, id=mapping.id,
labeling_project_id=str(mapping.labeling_project_id), labeling_project_id=str(mapping.labeling_project_id),


@@ -79,12 +79,9 @@ class EditorTaskListResponse(BaseModel):
class SegmentInfo(BaseModel): class SegmentInfo(BaseModel):
"""段落信息(用于文本分段标注)""" """段落摘要(用于文本分段标注)"""
idx: int = Field(..., description="段落索引") idx: int = Field(..., description="段落索引")
text: str = Field(..., description="段落文本")
start: int = Field(..., description="在原文中的起始位置")
end: int = Field(..., description="在原文中的结束位置")
has_annotation: bool = Field(False, alias="hasAnnotation", description="该段落是否已有标注") has_annotation: bool = Field(False, alias="hasAnnotation", description="该段落是否已有标注")
line_index: int = Field(0, alias="lineIndex", description="JSONL 行索引(从0开始)") line_index: int = Field(0, alias="lineIndex", description="JSONL 行索引(从0开始)")
chunk_index: int = Field(0, alias="chunkIndex", description="行内分片索引(从0开始)") chunk_index: int = Field(0, alias="chunkIndex", description="行内分片索引(从0开始)")
@@ -100,7 +97,29 @@ class EditorTaskResponse(BaseModel):
# 分段相关字段 # 分段相关字段
segmented: bool = Field(False, description="是否启用分段模式") segmented: bool = Field(False, description="是否启用分段模式")
segments: Optional[List[SegmentInfo]] = Field(None, description="段落列表") total_segments: int = Field(0, alias="totalSegments", description="段落")
current_segment_index: int = Field(0, alias="currentSegmentIndex", description="当前段落索引")
model_config = ConfigDict(populate_by_name=True)
class SegmentDetail(BaseModel):
"""段落内容"""
idx: int = Field(..., description="段落索引")
text: str = Field(..., description="段落文本")
has_annotation: bool = Field(False, alias="hasAnnotation", description="该段落是否已有标注")
line_index: int = Field(0, alias="lineIndex", description="JSONL 行索引(从0开始)")
chunk_index: int = Field(0, alias="chunkIndex", description="行内分片索引(从0开始)")
model_config = ConfigDict(populate_by_name=True)
class EditorTaskSegmentResponse(BaseModel):
"""编辑器单段内容响应"""
segmented: bool = Field(False, description="是否启用分段模式")
segment: Optional[SegmentDetail] = Field(None, description="段落内容")
total_segments: int = Field(0, alias="totalSegments", description="总段落数") total_segments: int = Field(0, alias="totalSegments", description="总段落数")
current_segment_index: int = Field(0, alias="currentSegmentIndex", description="当前段落索引") current_segment_index: int = Field(0, alias="currentSegmentIndex", description="当前段落索引")


@@ -36,7 +36,9 @@ from app.module.annotation.schema.editor import (
EditorProjectInfo, EditorProjectInfo,
EditorTaskListItem, EditorTaskListItem,
EditorTaskListResponse, EditorTaskListResponse,
EditorTaskSegmentResponse,
EditorTaskResponse, EditorTaskResponse,
SegmentDetail,
SegmentInfo, SegmentInfo,
UpsertAnnotationRequest, UpsertAnnotationRequest,
UpsertAnnotationResponse, UpsertAnnotationResponse,
@@ -538,6 +540,50 @@ class AnnotationEditorService:
return value return value
return raw_text return raw_text
def _build_segment_contexts(
self,
records: List[Tuple[Optional[Dict[str, Any]], str]],
record_texts: List[str],
segment_annotation_keys: set[str],
) -> Tuple[List[SegmentInfo], List[Tuple[Optional[Dict[str, Any]], str, str, int, int]]]:
splitter = AnnotationTextSplitter(max_chars=self.SEGMENT_THRESHOLD)
segments: List[SegmentInfo] = []
segment_contexts: List[Tuple[Optional[Dict[str, Any]], str, str, int, int]] = []
segment_cursor = 0
for record_index, ((payload, raw_text), record_text) in enumerate(zip(records, record_texts)):
normalized_text = record_text or ""
if len(normalized_text) > self.SEGMENT_THRESHOLD:
raw_segments = splitter.split(normalized_text)
for chunk_index, seg in enumerate(raw_segments):
segments.append(
SegmentInfo(
idx=segment_cursor,
hasAnnotation=str(segment_cursor) in segment_annotation_keys,
lineIndex=record_index,
chunkIndex=chunk_index,
)
)
segment_contexts.append((payload, raw_text, seg["text"], record_index, chunk_index))
segment_cursor += 1
else:
segments.append(
SegmentInfo(
idx=segment_cursor,
hasAnnotation=str(segment_cursor) in segment_annotation_keys,
lineIndex=record_index,
chunkIndex=0,
)
)
segment_contexts.append((payload, raw_text, normalized_text, record_index, 0))
segment_cursor += 1
if not segments:
segments = [SegmentInfo(idx=0, hasAnnotation=False, lineIndex=0, chunkIndex=0)]
segment_contexts = [(None, "", "", 0, 0)]
return segments, segment_contexts
async def get_project_info(self, project_id: str) -> EditorProjectInfo: async def get_project_info(self, project_id: str) -> EditorProjectInfo:
project = await self._get_project_or_404(project_id) project = await self._get_project_or_404(project_id)
@@ -668,6 +714,124 @@ class AnnotationEditorService:
return await self._build_text_task(project, file_record, file_id, segment_index) return await self._build_text_task(project, file_record, file_id, segment_index)
async def get_task_segment(
self,
project_id: str,
file_id: str,
segment_index: int,
) -> EditorTaskSegmentResponse:
project = await self._get_project_or_404(project_id)
dataset_type = self._normalize_dataset_type(await self._get_dataset_type(project.dataset_id))
if dataset_type != DATASET_TYPE_TEXT:
raise HTTPException(
status_code=400,
detail="当前仅支持 TEXT 项目的段落内容",
)
file_result = await self.db.execute(
select(DatasetFiles).where(
DatasetFiles.id == file_id,
DatasetFiles.dataset_id == project.dataset_id,
)
)
file_record = file_result.scalar_one_or_none()
if not file_record:
raise HTTPException(status_code=404, detail=f"文件不存在或不属于该项目: {file_id}")
if not self._resolve_segmentation_enabled(project):
return EditorTaskSegmentResponse(
segmented=False,
segment=None,
totalSegments=0,
currentSegmentIndex=0,
)
text_content = await self._fetch_text_content_via_download_api(project.dataset_id, file_id)
assert isinstance(text_content, str)
label_config = await self._resolve_project_label_config(project)
primary_text_key = self._resolve_primary_text_key(label_config)
file_name = str(getattr(file_record, "file_name", "")).lower()
records: List[Tuple[Optional[Dict[str, Any]], str]] = []
if file_name.endswith(JSONL_EXTENSION):
records = self._parse_jsonl_records(text_content)
else:
parsed_payload = self._try_parse_json_payload(text_content)
if parsed_payload:
records = [(parsed_payload, text_content)]
if not records:
records = [(None, text_content)]
record_texts = [
self._resolve_primary_text_value(payload, raw_text, primary_text_key)
for payload, raw_text in records
]
if not record_texts:
record_texts = [text_content]
needs_segmentation = len(records) > 1 or any(
len(text or "") > self.SEGMENT_THRESHOLD for text in record_texts
)
if not needs_segmentation:
return EditorTaskSegmentResponse(
segmented=False,
segment=None,
totalSegments=0,
currentSegmentIndex=0,
)
ann_result = await self.db.execute(
select(AnnotationResult).where(
AnnotationResult.project_id == project.id,
AnnotationResult.file_id == file_id,
)
)
ann = ann_result.scalar_one_or_none()
segment_annotations: Dict[str, Dict[str, Any]] = {}
if ann and isinstance(ann.annotation, dict):
segment_annotations = self._extract_segment_annotations(ann.annotation)
segment_annotation_keys = set(segment_annotations.keys())
segments, segment_contexts = self._build_segment_contexts(
records,
record_texts,
segment_annotation_keys,
)
total_segments = len(segment_contexts)
if total_segments == 0:
return EditorTaskSegmentResponse(
segmented=False,
segment=None,
totalSegments=0,
currentSegmentIndex=0,
)
if segment_index < 0 or segment_index >= total_segments:
raise HTTPException(
status_code=400,
detail=f"segmentIndex 超出范围: {segment_index}",
)
segment_info = segments[segment_index]
_, _, segment_text, line_index, chunk_index = segment_contexts[segment_index]
segment_detail = SegmentDetail(
idx=segment_info.idx,
text=segment_text,
hasAnnotation=segment_info.has_annotation,
lineIndex=line_index,
chunkIndex=chunk_index,
)
return EditorTaskSegmentResponse(
segmented=True,
segment=segment_detail,
totalSegments=total_segments,
currentSegmentIndex=segment_index,
)
async def _build_text_task( async def _build_text_task(
self, self,
project: LabelingProject, project: LabelingProject,
@@ -723,7 +887,8 @@ class AnnotationEditorService:
needs_segmentation = segmentation_enabled and ( needs_segmentation = segmentation_enabled and (
len(records) > 1 or any(len(text or "") > self.SEGMENT_THRESHOLD for text in record_texts) len(records) > 1 or any(len(text or "") > self.SEGMENT_THRESHOLD for text in record_texts)
) )
segments: Optional[List[SegmentInfo]] = None segments: List[SegmentInfo] = []
segment_contexts: List[Tuple[Optional[Dict[str, Any]], str, str, int, int]] = []
current_segment_index = 0 current_segment_index = 0
display_text = record_texts[0] if record_texts else text_content display_text = record_texts[0] if record_texts else text_content
selected_payload = records[0][0] if records else None selected_payload = records[0][0] if records else None
@@ -732,46 +897,13 @@ class AnnotationEditorService:
display_text = "\n".join(record_texts) if record_texts else text_content display_text = "\n".join(record_texts) if record_texts else text_content
if needs_segmentation: if needs_segmentation:
splitter = AnnotationTextSplitter(max_chars=self.SEGMENT_THRESHOLD) _, segment_contexts = self._build_segment_contexts(
segment_contexts: List[Tuple[Optional[Dict[str, Any]], str, str, int, int]] = [] records,
segments = [] record_texts,
segment_cursor = 0 segment_annotation_keys,
)
for record_index, ((payload, raw_text), record_text) in enumerate(zip(records, record_texts)):
normalized_text = record_text or ""
if len(normalized_text) > self.SEGMENT_THRESHOLD:
raw_segments = splitter.split(normalized_text)
for chunk_index, seg in enumerate(raw_segments):
segments.append(SegmentInfo(
idx=segment_cursor,
text=seg["text"],
start=seg["start"],
end=seg["end"],
hasAnnotation=str(segment_cursor) in segment_annotation_keys,
lineIndex=record_index,
chunkIndex=chunk_index,
))
segment_contexts.append((payload, raw_text, seg["text"], record_index, chunk_index))
segment_cursor += 1
else:
segments.append(SegmentInfo(
idx=segment_cursor,
text=normalized_text,
start=0,
end=len(normalized_text),
hasAnnotation=str(segment_cursor) in segment_annotation_keys,
lineIndex=record_index,
chunkIndex=0,
))
segment_contexts.append((payload, raw_text, normalized_text, record_index, 0))
segment_cursor += 1
if not segments:
segments = [SegmentInfo(idx=0, text="", start=0, end=0, hasAnnotation=False, lineIndex=0, chunkIndex=0)]
segment_contexts = [(None, "", "", 0, 0)]
current_segment_index = segment_index if segment_index is not None else 0 current_segment_index = segment_index if segment_index is not None else 0
if current_segment_index < 0 or current_segment_index >= len(segments): if current_segment_index < 0 or current_segment_index >= len(segment_contexts):
current_segment_index = 0 current_segment_index = 0
selected_payload, _, display_text, _, _ = segment_contexts[current_segment_index] selected_payload, _, display_text, _, _ = segment_contexts[current_segment_index]
@@ -849,8 +981,7 @@ class AnnotationEditorService:
task=task, task=task,
annotationUpdatedAt=annotation_updated_at, annotationUpdatedAt=annotation_updated_at,
segmented=needs_segmentation, segmented=needs_segmentation,
segments=segments, totalSegments=len(segment_contexts) if needs_segmentation else 1,
totalSegments=len(segments) if segments else 1,
currentSegmentIndex=current_segment_index, currentSegmentIndex=current_segment_index,
) )
@@ -1185,3 +1316,195 @@ class AnnotationEditorService:
except Exception as exc: except Exception as exc:
logger.warning("标注同步知识管理失败:%s", exc) logger.warning("标注同步知识管理失败:%s", exc)
async def precompute_segmentation_for_project(
self,
project_id: str,
max_retries: int = 3
) -> Dict[str, Any]:
"""
为指定项目的所有文本文件预计算切片结构并持久化到数据库
Args:
project_id: 标注项目ID
max_retries: 失败重试次数
Returns:
统计信息:{total_files, succeeded, failed}
"""
project = await self._get_project_or_404(project_id)
dataset_type = self._normalize_dataset_type(await self._get_dataset_type(project.dataset_id))
# 只处理文本数据集
if dataset_type != DATASET_TYPE_TEXT:
logger.info(f"项目 {project_id} 不是文本数据集,跳过切片预生成")
return {"total_files": 0, "succeeded": 0, "failed": 0}
# 检查是否启用分段
if not self._resolve_segmentation_enabled(project):
logger.info(f"项目 {project_id} 未启用分段,跳过切片预生成")
return {"total_files": 0, "succeeded": 0, "failed": 0}
# 获取项目的所有文本文件(排除源文档)
files_result = await self.db.execute(
select(DatasetFiles)
.join(LabelingProjectFile, LabelingProjectFile.file_id == DatasetFiles.id)
.where(
LabelingProjectFile.project_id == project_id,
DatasetFiles.dataset_id == project.dataset_id,
)
)
file_records = files_result.scalars().all()
if not file_records:
logger.info(f"项目 {project_id} 没有文件,跳过切片预生成")
return {"total_files": 0, "succeeded": 0, "failed": 0}
# 过滤源文档文件
valid_files = []
for file_record in file_records:
file_type = str(getattr(file_record, "file_type", "") or "").lower()
file_name = str(getattr(file_record, "file_name", "")).lower()
is_source_document = (
file_type in SOURCE_DOCUMENT_TYPES or
any(file_name.endswith(ext) for ext in SOURCE_DOCUMENT_EXTENSIONS)
)
if not is_source_document:
valid_files.append(file_record)
total_files = len(valid_files)
succeeded = 0
failed = 0
label_config = await self._resolve_project_label_config(project)
primary_text_key = self._resolve_primary_text_key(label_config)
for file_record in valid_files:
file_id = str(file_record.id) # type: ignore
file_name = str(getattr(file_record, "file_name", ""))
for retry in range(max_retries):
try:
# 读取文本内容
text_content = await self._fetch_text_content_via_download_api(project.dataset_id, file_id)
if not isinstance(text_content, str):
logger.warning(f"文件 {file_id} 内容不是字符串,跳过切片")
failed += 1
break
# 解析文本记录
records: List[Tuple[Optional[Dict[str, Any]], str]] = []
if file_name.lower().endswith(JSONL_EXTENSION):
records = self._parse_jsonl_records(text_content)
else:
parsed_payload = self._try_parse_json_payload(text_content)
if parsed_payload:
records = [(parsed_payload, text_content)]
if not records:
records = [(None, text_content)]
record_texts = [
self._resolve_primary_text_value(payload, raw_text, primary_text_key)
for payload, raw_text in records
]
if not record_texts:
record_texts = [text_content]
# 判断是否需要分段
needs_segmentation = len(records) > 1 or any(
len(text or "") > self.SEGMENT_THRESHOLD for text in record_texts
)
if not needs_segmentation:
# 不需要分段的文件,跳过
succeeded += 1
break
# 执行切片
splitter = AnnotationTextSplitter(max_chars=self.SEGMENT_THRESHOLD)
segment_cursor = 0
segments = {}
for record_index, ((payload, raw_text), record_text) in enumerate(zip(records, record_texts)):
normalized_text = record_text or ""
if len(normalized_text) > self.SEGMENT_THRESHOLD:
raw_segments = splitter.split(normalized_text)
for chunk_index, seg in enumerate(raw_segments):
segments[str(segment_cursor)] = {
SEGMENT_RESULT_KEY: [],
SEGMENT_CREATED_AT_KEY: datetime.utcnow().isoformat() + "Z",
SEGMENT_UPDATED_AT_KEY: datetime.utcnow().isoformat() + "Z",
}
segment_cursor += 1
else:
segments[str(segment_cursor)] = {
SEGMENT_RESULT_KEY: [],
SEGMENT_CREATED_AT_KEY: datetime.utcnow().isoformat() + "Z",
SEGMENT_UPDATED_AT_KEY: datetime.utcnow().isoformat() + "Z",
}
segment_cursor += 1
if not segments:
succeeded += 1
break
# 构造分段标注结构
final_payload = {
SEGMENTED_KEY: True,
"version": 1,
SEGMENTS_KEY: segments,
SEGMENT_TOTAL_KEY: segment_cursor,
}
# 检查是否已存在标注
existing_result = await self.db.execute(
select(AnnotationResult).where(
AnnotationResult.project_id == project_id,
AnnotationResult.file_id == file_id,
)
)
existing = existing_result.scalar_one_or_none()
now = datetime.utcnow()
if existing:
# 更新现有标注
existing.annotation = final_payload # type: ignore[assignment]
existing.annotation_status = ANNOTATION_STATUS_IN_PROGRESS # type: ignore[assignment]
existing.updated_at = now # type: ignore[assignment]
else:
# 创建新标注记录
record = AnnotationResult(
id=str(uuid.uuid4()),
project_id=project_id,
file_id=file_id,
annotation=final_payload,
annotation_status=ANNOTATION_STATUS_IN_PROGRESS,
created_at=now,
updated_at=now,
)
self.db.add(record)
await self.db.commit()
succeeded += 1
logger.info(f"成功为文件 {file_id} 预生成 {segment_cursor} 个切片")
break
except Exception as e:
logger.warning(
f"为文件 {file_id} 预生成切片失败 (重试 {retry + 1}/{max_retries}): {e}"
)
if retry == max_retries - 1:
failed += 1
await self.db.rollback()
logger.info(
f"项目 {project_id} 切片预生成完成: 总计 {total_files}, 成功 {succeeded}, 失败 {failed}"
)
return {
"total_files": total_files,
"succeeded": succeeded,
"failed": failed,
}


@@ -375,9 +375,9 @@ def _register_output_dataset(
insert_file_sql = text( insert_file_sql = text(
""" """
INSERT INTO t_dm_dataset_files ( INSERT INTO t_dm_dataset_files (
id, dataset_id, file_name, file_path, file_type, file_size, status id, dataset_id, file_name, file_path, logical_path, version, file_type, file_size, status
) VALUES ( ) VALUES (
:id, :dataset_id, :file_name, :file_path, :file_type, :file_size, :status :id, :dataset_id, :file_name, :file_path, :logical_path, :version, :file_type, :file_size, :status
) )
""" """
) )
@@ -395,6 +395,7 @@ def _register_output_dataset(
for file_name, file_path, file_size in image_files: for file_name, file_path, file_size in image_files:
ext = os.path.splitext(file_name)[1].lstrip(".").upper() or None ext = os.path.splitext(file_name)[1].lstrip(".").upper() or None
logical_path = os.path.relpath(file_path, output_dir).replace("\\", "/")
conn.execute( conn.execute(
insert_file_sql, insert_file_sql,
{ {
@@ -402,6 +403,8 @@ def _register_output_dataset(
"dataset_id": output_dataset_id, "dataset_id": output_dataset_id,
"file_name": file_name, "file_name": file_name,
"file_path": file_path, "file_path": file_path,
"logical_path": logical_path,
"version": 1,
"file_type": ext, "file_type": ext,
"file_size": int(file_size), "file_size": int(file_size),
"status": "ACTIVE", "status": "ACTIVE",
@@ -411,6 +414,7 @@ def _register_output_dataset(
for file_name, file_path, file_size in annotation_files: for file_name, file_path, file_size in annotation_files:
ext = os.path.splitext(file_name)[1].lstrip(".").upper() or None ext = os.path.splitext(file_name)[1].lstrip(".").upper() or None
logical_path = os.path.relpath(file_path, output_dir).replace("\\", "/")
conn.execute( conn.execute(
insert_file_sql, insert_file_sql,
{ {
@@ -418,6 +422,8 @@ def _register_output_dataset(
"dataset_id": output_dataset_id, "dataset_id": output_dataset_id,
"file_name": file_name, "file_name": file_name,
"file_path": file_path, "file_path": file_path,
"logical_path": logical_path,
"version": 1,
"file_type": ext, "file_type": ext,
"file_size": int(file_size), "file_size": int(file_size),
"status": "ACTIVE", "status": "ACTIVE",


@@ -1,9 +1,9 @@
{ {
"query_sql": "SELECT * FROM t_task_instance_info WHERE instance_id IN (:instance_id)", "query_sql": "SELECT * FROM t_task_instance_info WHERE instance_id IN (:instance_id)",
"insert_sql": "INSERT INTO t_task_instance_info (instance_id, meta_file_name, meta_file_type, meta_file_id, meta_file_size, file_id, file_size, file_type, file_name, file_path, status, operator_id, error_code, incremental, child_id, slice_num) VALUES (:instance_id, :meta_file_name, :meta_file_type, :meta_file_id, :meta_file_size, :file_id, :file_size, :file_type, :file_name, :file_path, :status, :operator_id, :error_code, :incremental, :child_id, :slice_num)", "insert_sql": "INSERT INTO t_task_instance_info (instance_id, meta_file_name, meta_file_type, meta_file_id, meta_file_size, file_id, file_size, file_type, file_name, file_path, status, operator_id, error_code, incremental, child_id, slice_num) VALUES (:instance_id, :meta_file_name, :meta_file_type, :meta_file_id, :meta_file_size, :file_id, :file_size, :file_type, :file_name, :file_path, :status, :operator_id, :error_code, :incremental, :child_id, :slice_num)",
"insert_dataset_file_sql": "INSERT INTO t_dm_dataset_files (id, dataset_id, file_name, file_path, file_type, file_size, status, upload_time, last_access_time, created_at, updated_at) VALUES (:id, :dataset_id, :file_name, :file_path, :file_type, :file_size, :status, :upload_time, :last_access_time, :created_at, :updated_at)", "insert_dataset_file_sql": "INSERT INTO t_dm_dataset_files (id, dataset_id, file_name, file_path, logical_path, version, file_type, file_size, status, upload_time, last_access_time, created_at, updated_at) VALUES (:id, :dataset_id, :file_name, :file_path, :logical_path, :version, :file_type, :file_size, :status, :upload_time, :last_access_time, :created_at, :updated_at)",
"insert_clean_result_sql": "INSERT INTO t_clean_result (instance_id, src_file_id, dest_file_id, src_name, dest_name, src_type, dest_type, src_size, dest_size, status, result) VALUES (:instance_id, :src_file_id, :dest_file_id, :src_name, :dest_name, :src_type, :dest_type, :src_size, :dest_size, :status, :result)", "insert_clean_result_sql": "INSERT INTO t_clean_result (instance_id, src_file_id, dest_file_id, src_name, dest_name, src_type, dest_type, src_size, dest_size, status, result) VALUES (:instance_id, :src_file_id, :dest_file_id, :src_name, :dest_name, :src_type, :dest_type, :src_size, :dest_size, :status, :result)",
"query_dataset_sql": "SELECT file_size FROM t_dm_dataset_files WHERE dataset_id = :dataset_id", "query_dataset_sql": "SELECT file_size FROM t_dm_dataset_files WHERE dataset_id = :dataset_id AND (status IS NULL OR status <> 'ARCHIVED')",
"update_dataset_sql": "UPDATE t_dm_datasets SET size_bytes = :total_size, file_count = :file_count WHERE id = :dataset_id;", "update_dataset_sql": "UPDATE t_dm_datasets SET size_bytes = :total_size, file_count = :file_count WHERE id = :dataset_id;",
"update_task_sql": "UPDATE t_clean_task SET status = :status, after_size = :total_size, finished_at = :finished_time WHERE id = :task_id", "update_task_sql": "UPDATE t_clean_task SET status = :status, after_size = :total_size, finished_at = :finished_time WHERE id = :task_id",
"create_tables_sql": "CREATE TABLE IF NOT EXISTS t_task_instance_info (instance_id VARCHAR(255), meta_file_name TEXT, meta_file_type VARCHAR(100), meta_file_id BIGINT, meta_file_size VARCHAR(100), file_id BIGINT, file_size VARCHAR(100), file_type VARCHAR(100), file_name TEXT, file_path TEXT, status INT, operator_id VARCHAR(255), error_code VARCHAR(100), incremental VARCHAR(50), child_id BIGINT, slice_num INT DEFAULT 0);", "create_tables_sql": "CREATE TABLE IF NOT EXISTS t_task_instance_info (instance_id VARCHAR(255), meta_file_name TEXT, meta_file_type VARCHAR(100), meta_file_id BIGINT, meta_file_size VARCHAR(100), file_id BIGINT, file_size VARCHAR(100), file_type VARCHAR(100), file_name TEXT, file_path TEXT, status INT, operator_id VARCHAR(255), error_code VARCHAR(100), incremental VARCHAR(50), child_id BIGINT, slice_num INT DEFAULT 0);",


@@ -54,19 +54,22 @@ CREATE TABLE IF NOT EXISTS t_dm_dataset_files (
dataset_id VARCHAR(36) NOT NULL COMMENT '所属数据集ID(UUID)', dataset_id VARCHAR(36) NOT NULL COMMENT '所属数据集ID(UUID)',
file_name VARCHAR(255) NOT NULL COMMENT '文件名', file_name VARCHAR(255) NOT NULL COMMENT '文件名',
file_path VARCHAR(1000) NOT NULL COMMENT '文件路径', file_path VARCHAR(1000) NOT NULL COMMENT '文件路径',
logical_path VARCHAR(1000) NOT NULL COMMENT '文件逻辑路径(相对数据集根目录)',
version BIGINT NOT NULL DEFAULT 1 COMMENT '文件版本号(同 logical_path 递增)',
file_type VARCHAR(50) COMMENT '文件格式:JPG/PNG/DCM/TXT等', file_type VARCHAR(50) COMMENT '文件格式:JPG/PNG/DCM/TXT等',
file_size BIGINT DEFAULT 0 COMMENT '文件大小(字节)', file_size BIGINT DEFAULT 0 COMMENT '文件大小(字节)',
check_sum VARCHAR(64) COMMENT '文件校验和', check_sum VARCHAR(64) COMMENT '文件校验和',
tags JSON COMMENT '文件标签信息', tags JSON COMMENT '文件标签信息',
tags_updated_at TIMESTAMP NULL COMMENT '标签最后更新时间', tags_updated_at TIMESTAMP NULL COMMENT '标签最后更新时间',
metadata JSON COMMENT '文件元数据', metadata JSON COMMENT '文件元数据',
status VARCHAR(50) DEFAULT 'ACTIVE' COMMENT '文件状态:ACTIVE/DELETED/PROCESSING', status VARCHAR(50) DEFAULT 'ACTIVE' COMMENT '文件状态:ACTIVE/ARCHIVED/DELETED/PROCESSING',
upload_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT '上传时间', upload_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT '上传时间',
last_access_time TIMESTAMP NULL COMMENT '最后访问时间', last_access_time TIMESTAMP NULL COMMENT '最后访问时间',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间', created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间',
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新时间', updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新时间',
FOREIGN KEY (dataset_id) REFERENCES t_dm_datasets(id) ON DELETE CASCADE, FOREIGN KEY (dataset_id) REFERENCES t_dm_datasets(id) ON DELETE CASCADE,
INDEX idx_dm_dataset (dataset_id), INDEX idx_dm_dataset (dataset_id),
INDEX idx_dm_dataset_logical_path (dataset_id, logical_path, version),
INDEX idx_dm_file_type (file_type), INDEX idx_dm_file_type (file_type),
INDEX idx_dm_file_status (status), INDEX idx_dm_file_status (status),
INDEX idx_dm_upload_time (upload_time) INDEX idx_dm_upload_time (upload_time)