Commit Graph

13 Commits

Author SHA1 Message Date
hhhhsc701
ab4523b556 add export type settings and enhance metadata structure (#181)
* fix(session): enhance database connection settings with pool pre-ping and recycle options

* feat(metadata): add export type settings and enhance metadata structure

* fix(base_op): improve sample handling by introducing target_type key and consolidating text/data retrieval logic

* feat(metadata): add export type settings and enhance metadata structure

* feat(metadata): add export type settings and enhance metadata structure
2025-12-19 11:54:08 +08:00
hhhhsc701
be875086db feat: add operator-packages-volume to docker-compose and update Docke… (#179)
* feat: add operator-packages-volume to docker-compose and update Dockerfile for site-packages path

* feat: add retry
2025-12-18 20:32:42 +08:00
hhhhsc701
761f7f6a51 fix: optimize PDF parsing by implementing concurrent processing with … (#177)
* fix: optimize PDF parsing by implementing concurrent processing with ThreadPoolExecutor

* Refactor to async processing for file extraction

Refactor the file processing to use asyncio for improved performance and concurrency.
2025-12-18 15:28:30 +08:00
hhhhsc701
924d977d6f 支持mineru npu处理 (#174)
* feature: unstructured支持简单pdf处理

* feature: update values.yaml to enhance ray-cluster configuration with security context, environment variables, and resource limits

* feature: update deploy.yaml and process.py for mineru server configuration and PDF processing enhancements

* feature: update deploy.yaml and process.py for mineru server configuration and PDF processing enhancements

* feature: improve PDF processing logic and update dependencies in process.py and pyproject.toml

* feature: improve PDF processing logic and update dependencies in process.py and pyproject.toml

* feature: update Dockerfile for improved package source mirrors and add mineru-npu to build targets
2025-12-17 16:31:06 +08:00
hhhhsc701
d59c167da4 算子将抽取与落盘固定到流程中 (#134)
* feature: 将抽取动作移到每一个算子中

* feature: 落盘算子改为默认执行

* feature: 优化前端展示

* feature: 使用pyproject管理依赖
2025-12-05 17:26:29 +08:00
hhhhsc701
f1bffdcd61 bugfix: 创建清洗任务时修改数据集状态;无法删除已在模板/运行任务的算子
* bugfix: 创建清洗任务时修改数据集状态;无法删除已在模板/运行任务的算子
2025-11-27 17:34:53 +08:00
hhhhsc701
f78475e29f Develop hsc (#58)
feature: 优化镜像构建/部署
2025-11-06 17:14:54 +08:00
hhhhsc701
05b26a2981 feature: 更新算子名称;增加创建任务、模板校验 (#57)
* feature: 更新算子名称;增加创建任务、模板校验

* feature: 镜像构建增加缓存
2025-11-05 17:38:03 +08:00
Startalker
a600c1d793 feature: modify UnstructuredFormatter and ExternalPDFFormatter description (#44)
* feature: add UnstructuredFormatter

* feature: add UnstructuredFormatter in db

* feature: add unstructured[docx]==0.18.15

* feature: support doc

* feature: add mineru

* feature: add external pdf extract operator by using mineru

* feature: mineru docker install bugfix

* feature: add unstructured xlsx/xls/csv/pptx/ppt

* feature: modify UnstructuredFormatter and ExternalPDFFormatter description

---------

Co-authored-by: Startalker <438747480@qq.com>
2025-10-31 10:32:14 +08:00
Startalker
06b05a65a9 feature: add unstructured xlsx/xls/csv/pptx/ppt (#41)
* feature: add UnstructuredFormatter

* feature: add UnstructuredFormatter in db

* feature: add unstructured[docx]==0.18.15

* feature: support doc

* feature: add mineru

* feature: add external pdf extract operator by using mineru

* feature: mineru docker install bugfix

* feature: add unstructured xlsx/xls/csv/pptx/ppt

---------

Co-authored-by: Startalker <438747480@qq.com>
2025-10-30 20:21:12 +08:00
Startalker
155603b1ca feature: add external pdf extract operator by using mineru (#36)
* feature: add UnstructuredFormatter

* feature: add UnstructuredFormatter in db

* feature: add unstructured[docx]==0.18.15

* feature: support doc

* feature: add mineru

* feature: add external pdf extract operator by using mineru

* feature: mineru docker install bugfix

---------

Co-authored-by: Startalker <438747480@qq.com>
2025-10-30 15:55:10 +08:00
Startalker
f86d4fae25 feature: add unstructured formatter operator for doc/docx (#17)
* feature: add UnstructuredFormatter

* feature: add UnstructuredFormatter in db

* feature: add unstructured[docx]==0.18.15

* feature: support doc

---------

Co-authored-by: Startalker <438747480@qq.com>
2025-10-23 16:49:03 +08:00
Dallas98
1c97afed7d init datamate 2025-10-21 23:00:48 +08:00