feat(auto-annotation): add batch progress updates to reduce DB write pressure

Throttle progress updates to reduce database write operations during large dataset processing.

Key features:
- Add PROGRESS_UPDATE_INTERVAL config (default 2.0s, configurable via AUTO_ANNOTATION_PROGRESS_INTERVAL env)
- Conditional progress updates: Only write to DB when (now - last_update) >= interval
- Use time.monotonic() for timing (immune to system clock adjustments)
- Final status updates (completed/stopped/failed) always execute (not throttled)

Implementation:
- Initialize last_progress_update timestamp before as_completed() loop
- Replace unconditional _update_task_status() with conditional call based on time interval
- Update docstring to reflect throttling capability
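The steps above can be sketched as a standalone loop (a minimal sketch with hypothetical helper names `process` and `write_progress`; the actual worker calls `_update_task_status` inside its `as_completed()` loop):

```python
import time

def process_with_throttle(files, process, write_progress, interval=2.0):
    """Process files, writing progress at most once per `interval` seconds."""
    # monotonic() is immune to wall-clock adjustments (NTP, manual changes)
    last_update = time.monotonic()
    for i, f in enumerate(files, start=1):
        process(f)
        now = time.monotonic()
        # interval <= 0 disables throttling: write after every file
        if interval <= 0 or (now - last_update) >= interval:
            write_progress(i)
            last_update = now
    write_progress(len(files))  # final status update is never throttled
```

With `interval=0` every file produces a write; with a large interval only the final (unthrottled) write is guaranteed.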

Performance impact (T=2s):
- 1,000 files / 100s processing: DB writes reduced from 1,000 to ~50 (95% reduction)
- 10,000 files / 500s processing: DB writes reduced from 10,000 to ~250 (97.5% reduction)
- Small datasets (10 files): Minimal difference
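Assuming files complete steadily, the expected write count is roughly processing time divided by the interval, capped at one write per file. A hypothetical estimator reproducing the figures above:

```python
def approx_db_writes(total_seconds: float, interval: float, n_files: int) -> int:
    """Rough estimate of throttled progress writes (excluding the final
    status write, which always executes)."""
    if interval <= 0:
        return n_files  # throttling disabled: one write per file
    # At most one write per elapsed interval, never more than one per file
    return min(n_files, int(total_seconds / interval))
```

For T=2s this gives 50 writes for 1,000 files over 100s and 250 for 10,000 files over 500s, matching the reductions listed above.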

Backward compatibility:
- PROGRESS_UPDATE_INTERVAL=0: Updates every file (identical to previous behavior)
- Heartbeat mechanism unaffected (2s interval << 300s timeout)
- Stop check mechanism independent of progress updates
- Final status updates always execute
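Setting the interval to zero short-circuits the throttle guard and restores the previous per-file behavior (a sketch; the env var name and default mirror the config line added in this commit):

```python
import os

os.environ["AUTO_ANNOTATION_PROGRESS_INTERVAL"] = "0"  # opt out of throttling

# Mirrors the config line added in this commit.
PROGRESS_UPDATE_INTERVAL = float(os.getenv("AUTO_ANNOTATION_PROGRESS_INTERVAL", "2.0"))

# The guard `PROGRESS_UPDATE_INTERVAL <= 0 or ...` is then always true,
# so a progress write happens after every file, exactly as before.
assert PROGRESS_UPDATE_INTERVAL <= 0
```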

Testing:
- All 14 unit tests pass (11 existing + 3 new):
  * Fast processing with throttling
  * PROGRESS_UPDATE_INTERVAL=0 updates every file
  * Slow processing (per-file > T) updates every file
- py_compile syntax check passed

Edge cases handled:
- Single file task: Works normally
- Very slow processing: Degrades to per-file updates
- Concurrent FILE_WORKERS > 1: Counters accurate (lock-protected), DB reflects with max T seconds delay
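The concurrency point can be illustrated in isolation (a sketch with a hypothetical counter; the worker guards its shared counters the same way, while the DB view lags by at most T seconds):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

processed = 0
lock = threading.Lock()

def record_done(_):
    """Increment the shared counter under a lock, as each file completes."""
    global processed
    with lock:  # without this, concurrent read-modify-write could lose counts
        processed += 1

with ThreadPoolExecutor(max_workers=4) as executor:
    list(executor.map(record_done, range(100)))

# Counter is exact even with multiple workers; only the DB write is delayed.
```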
2026-02-10 16:49:37 +08:00
parent 9988ff00f5
commit f707ce9dae
2 changed files with 86 additions and 12 deletions


@@ -13,6 +13,8 @@ the datamate-python backend and frontend can display real-time status.
- In-task file concurrency: the thread pool size is configurable via AUTO_ANNOTATION_FILE_WORKERS,
  so a single task can process multiple files in parallel (especially effective for LLM I/O-bound workloads).
  Operator chains are isolated via an object pool; each thread uses its own chain instance.
- Progress update throttling: AUTO_ANNOTATION_PROGRESS_INTERVAL controls how often progress is written,
  avoiding the write pressure of per-file DB writes on large datasets (default: 2-second interval).
- On startup, running tasks whose heartbeat has timed out are recovered automatically: unprocessed files
  are reset to pending, tasks with partial progress are marked failed, and the user decides whether to retry manually.
"""
@@ -133,6 +135,8 @@ WORKER_COUNT = int(os.getenv("AUTO_ANNOTATION_WORKER_COUNT", "1"))
FILE_WORKERS = int(os.getenv("AUTO_ANNOTATION_FILE_WORKERS", "1"))
PROGRESS_UPDATE_INTERVAL = float(os.getenv("AUTO_ANNOTATION_PROGRESS_INTERVAL", "2.0"))
def _recover_stale_running_tasks() -> int:
"""启动时恢复心跳超时的 running 任务。
@@ -1274,6 +1278,7 @@ def _process_single_task(task: Dict[str, Any]) -> None:
# --- 并发文件处理 ---
stop_check_interval = max(1, effective_file_workers * 2)
completed_since_check = 0
last_progress_update = time.monotonic()
with ThreadPoolExecutor(max_workers=effective_file_workers) as executor:
future_to_file = {
@@ -1299,18 +1304,21 @@ def _process_single_task(task: Dict[str, Any]) -> None:
 current_processed = processed
 current_detected = detected_total
-progress = int(current_processed * 100 / total_images) if total_images > 0 else 100
-_update_task_status(
-    task_id,
-    run_token=run_token,
-    status="running",
-    progress=progress,
-    processed_images=current_processed,
-    detected_objects=current_detected,
-    total_images=total_images,
-    output_path=output_dir,
-    output_dataset_id=output_dataset_id,
-)
+now = time.monotonic()
+if PROGRESS_UPDATE_INTERVAL <= 0 or (now - last_progress_update) >= PROGRESS_UPDATE_INTERVAL:
+    progress = int(current_processed * 100 / total_images) if total_images > 0 else 100
+    _update_task_status(
+        task_id,
+        run_token=run_token,
+        status="running",
+        progress=progress,
+        processed_images=current_processed,
+        detected_objects=current_detected,
+        total_images=total_images,
+        output_path=output_dir,
+        output_dataset_id=output_dataset_id,
+    )
+    last_progress_update = now
except Exception as e:
logger.error(
"Failed to process file for task {}: file_path={}, error={}",