q792602257/DataMate

q792602257 f707ce9dae

feat(auto-annotation): add batch progress updates to reduce DB write pressure

Throttle progress updates to reduce database write operations during large dataset processing.

Key features:
- Add PROGRESS_UPDATE_INTERVAL config (default 2.0s, configurable via AUTO_ANNOTATION_PROGRESS_INTERVAL env)
- Conditional progress updates: Only write to DB when (now - last_update) >= interval
- Use time.monotonic() for timing (immune to system clock adjustments)
- Final status updates (completed/stopped/failed) always execute (not throttled)

Implementation:
- Initialize last_progress_update timestamp before as_completed() loop
- Replace unconditional _update_task_status() with conditional call based on time interval
- Update docstring to reflect throttling capability
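
The throttling described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `process_files`, `handle_file`, and `update_status` are hypothetical names standing in for the real task runner and `_update_task_status()`, and the environment variable parsing is an assumption based on the config description.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Assumed parsing of the env var named in the commit message (default 2.0 s).
PROGRESS_UPDATE_INTERVAL = float(os.getenv("AUTO_ANNOTATION_PROGRESS_INTERVAL", "2.0"))


def process_files(files, handle_file, update_status,
                  interval=PROGRESS_UPDATE_INTERVAL, workers=4):
    """Process files concurrently; write progress at most once per `interval` seconds.

    `update_status(status, processed, total)` is a hypothetical stand-in for the
    DB write. The final status is always written, regardless of the interval.
    """
    files = list(files)
    total = len(files)
    processed = 0
    # Initialize the timestamp before the as_completed() loop.
    last_progress_update = time.monotonic()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(handle_file, f) for f in files]
        for future in as_completed(futures):
            future.result()  # re-raise any worker exception
            processed += 1
            now = time.monotonic()
            # Conditional write: skipped while the interval has not elapsed.
            # With interval == 0 the condition is always true (per-file updates).
            if now - last_progress_update >= interval:
                update_status("running", processed, total)
                last_progress_update = now

    # Final status update is never throttled.
    update_status("completed", processed, total)
    return processed
```

Because `time.monotonic()` never goes backwards, the elapsed-time check stays valid even if the system clock is adjusted while a task is running.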

Performance impact (T=2s):
- 1,000 files / 100s processing: DB writes reduced from 1,000 to ~50 (95% reduction)
- 10,000 files / 500s processing: DB writes reduced from 10,000 to ~250 (97.5% reduction)
- Small datasets (10 files): Minimal difference
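
The figures above follow from dividing total processing time by the interval. A small sketch of that arithmetic, assuming files complete faster than once per interval and ignoring the single final status write (`estimate_db_writes` and `reduction_percent` are illustrative helper names, not part of the codebase):

```python
def estimate_db_writes(num_files, total_seconds, interval):
    """Approximate number of throttled progress writes (final write excluded)."""
    if interval <= 0:
        return num_files  # throttling disabled: one write per file
    # At most one write per elapsed interval, and never more than one per file.
    return min(num_files, int(total_seconds // interval))


def reduction_percent(num_files, writes):
    """Percentage of per-file writes avoided."""
    return 100.0 * (num_files - writes) / num_files


for files, seconds in [(1_000, 100), (10_000, 500)]:
    writes = estimate_db_writes(files, seconds, 2.0)
    print(files, writes, reduction_percent(files, writes))
# prints:
# 1000 50 95.0
# 10000 250 97.5
```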

Backward compatibility:
- PROGRESS_UPDATE_INTERVAL=0: Updates every file (identical to previous behavior)
- Heartbeat mechanism unaffected (2s interval << 300s timeout)
- Stop check mechanism independent of progress updates
- Final status updates always execute

Testing:
- 14 unit tests all passed (11 existing + 3 new):
  * Fast processing with throttling
  * PROGRESS_UPDATE_INTERVAL=0 updates every file
  * Slow processing (per-file > T) updates every file
- py_compile syntax check passed

Edge cases handled:
- Single file task: Works normally
- Very slow processing: Degrades to per-file updates
- Concurrent FILE_WORKERS > 1: Counters accurate (lock-protected), DB reflects with max T seconds delay

2026-02-10 16:49:37 +08:00

.github/workflows

feature: add mysql collection and starrocks collection (#222)

2026-01-04 19:05:08 +08:00

backend

feat(auth): role management CRUD and role-permission binding

2026-02-10 00:09:48 +08:00

deployment

2026-02-02 16:09:25 +08:00

editions

feature: integrate with deer-flow (#54)

2025-11-04 20:30:40 +08:00

frontend

feat(auto-annotation): add LLM-based annotation operators

2026-02-10 15:22:23 +08:00

runtime

feat(auto-annotation): add batch progress updates to reduce DB write pressure

2026-02-10 16:49:37 +08:00

scripts

feat(annotation): support non-image datasets (TEXT/AUDIO/VIDEO) in auto-annotation tasks

2026-02-09 23:23:05 +08:00

.editorconfig

feature: Implement the basic knowledge generation function (#40)

2025-10-30 16:50:54 +08:00

.gitignore

feat: Improve makefile readability, Add user control on volume keep at uninstallation, Add Label Studio install and uninstall via Make. (#106)

2025-11-25 17:37:28 +08:00

LICENSE

Change license to MIT with additional conditions

2025-11-06 11:32:49 +08:00

Makefile

chore(gateway): remove offline-mode arguments from the Dockerfile

2026-01-30 14:13:16 +08:00

Makefile.offline.mk

feat(scripts): add APT cache pre-installation to fix offline build issues

2026-02-03 13:16:17 +08:00

README-zh.md

bugfix (#164)

2025-12-11 23:17:01 +08:00

README.md

bugfix (#164)

2025-12-11 23:17:01 +08:00

README.md

DataMate All-in-One Data Work Platform

DataMate is an enterprise-level data processing platform for model fine-tuning and RAG retrieval, supporting core functions such as data collection, data management, operator marketplace, data cleaning, data synthesis, data annotation, data evaluation, and knowledge generation.

简体中文 | English

If you like this project, please give it a Star⭐️!

🌟 Core Features

Core Modules: Data Collection, Data Management, Operator Marketplace, Data Cleaning, Data Synthesis, Data Annotation, Data Evaluation, Knowledge Generation.
Visual Orchestration: Drag-and-drop data processing workflow design.
Operator Ecosystem: Rich built-in operators and support for custom operators.

🚀 Quick Start

Prerequisites

Git (for pulling source code)
Make (for building and installing)
Docker (for building images and deploying services)
Docker-Compose (for service deployment - Docker method)
Kubernetes (for service deployment - k8s method)
Helm (for service deployment - k8s method)

This project supports two deployment methods: docker-compose and helm. After running a deployment command, enter the number of the deployment method you want at the prompt, which looks like this:

Choose a deployment method:
1. Docker/Docker-Compose
2. Kubernetes/Helm
Enter choice:

Clone the Code

git clone git@github.com:ModelEngine-Group/DataMate.git
cd DataMate

Deploy the basic services

make install

If make is not installed on your machine, run the following commands to deploy instead:

# Windows
set REGISTRY=ghcr.io/modelengine-group/
docker compose -f ./deployment/docker/datamate/docker-compose.yml up -d
docker compose -f ./deployment/docker/milvus/docker-compose.yml up -d

# Linux/Mac
export REGISTRY=ghcr.io/modelengine-group/
docker compose -f ./deployment/docker/datamate/docker-compose.yml up -d
docker compose -f ./deployment/docker/milvus/docker-compose.yml up -d

Once the containers are running, open http://localhost:30000 in a browser to view the front-end interface.

To list all available Make targets, flags and help text, run:

make help

Build and deploy MinerU Enhanced PDF Processing

make build-mineru
make install-mineru

Deploy the DeerFlow service

make install-deer-flow

Local Development and Deployment

After modifying the local code, please execute the following commands to build the image and deploy using the local image.

make build
make install dev=true

Uninstall

make uninstall

When running make uninstall, the installer will prompt once whether to delete volumes; that single choice is applied to all components. The uninstall order is: milvus -> label-studio -> datamate, which ensures the datamate network is removed cleanly after services that use it have stopped.

🤝 Contribution Guidelines

Thank you for your interest in this project! We warmly welcome contributions from the community. Whether it's submitting bug reports, suggesting new features, or directly participating in code development, all forms of help make the project better.

• 📮 GitHub Issues: Submit bugs or feature suggestions.

• 🔧 GitHub Pull Requests: Contribute code improvements.

📄 License

DataMate is open source under the MIT license. You are free to use, modify, and distribute the code of this project in compliance with the license terms.

Languages

JavaScript 41.9%

TypeScript 19.9%

Java 16.7%

Python 15.6%

Smarty 4.4%

Other 1.5%