You've already forked DataMate
Some checks failed
Throttle progress updates to reduce database write operations during large dataset processing. Key features: - Add PROGRESS_UPDATE_INTERVAL config (default 2.0s, configurable via AUTO_ANNOTATION_PROGRESS_INTERVAL env) - Conditional progress updates: Only write to DB when (now - last_update) >= interval - Use time.monotonic() for timing (immune to system clock adjustments) - Final status updates (completed/stopped/failed) always execute (not throttled) Implementation: - Initialize last_progress_update timestamp before as_completed() loop - Replace unconditional _update_task_status() with conditional call based on time interval - Update docstring to reflect throttling capability Performance impact (T=2s): - 1,000 files / 100s processing: DB writes reduced from 1,000 to ~50 (95% reduction) - 10,000 files / 500s processing: DB writes reduced from 10,000 to ~250 (97.5% reduction) - Small datasets (10 files): Minimal difference Backward compatibility: - PROGRESS_UPDATE_INTERVAL=0: Updates every file (identical to previous behavior) - Heartbeat mechanism unaffected (2s interval << 300s timeout) - Stop check mechanism independent of progress updates - Final status updates always execute Testing: - 14 unit tests all passed (11 existing + 3 new): * Fast processing with throttling * PROGRESS_UPDATE_INTERVAL=0 updates every file * Slow processing (per-file > T) updates every file - py_compile syntax check passed Edge cases handled: - Single file task: Works normally - Very slow processing: Degrades to per-file updates - Concurrent FILE_WORKERS > 1: Counters accurate (lock-protected), DB reflects with max T seconds delay