feat(auto-annotation): add batch progress updates to reduce DB write pressure

Throttle progress updates to reduce database write operations during large dataset processing.

Key features:
- Add PROGRESS_UPDATE_INTERVAL config (default 2.0s, configurable via AUTO_ANNOTATION_PROGRESS_INTERVAL env)
- Conditional progress updates: Only write to DB when (now - last_update) >= interval
- Use time.monotonic() for timing (immune to system clock adjustments)
- Final status updates (completed/stopped/failed) always execute (not throttled)
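The throttle check described above can be sketched as a small predicate. This is an illustrative sketch, not the worker's actual code; the names `should_write_progress` and the module-level constant are assumptions, though `PROGRESS_UPDATE_INTERVAL` and `AUTO_ANNOTATION_PROGRESS_INTERVAL` mirror the config names in this commit:

```python
import os
import time

# Assumed wiring: default 2.0s, overridable via the env var named in the commit
PROGRESS_UPDATE_INTERVAL = float(
    os.environ.get("AUTO_ANNOTATION_PROGRESS_INTERVAL", "2.0")
)


def should_write_progress(last_update: float,
                          interval: float = PROGRESS_UPDATE_INTERVAL) -> bool:
    """Return True when a DB progress write is due.

    interval <= 0 disables throttling (a write on every file), matching
    the backward-compatible behavior. time.monotonic() is used so the
    check is immune to system clock adjustments.
    """
    return interval <= 0 or (time.monotonic() - last_update) >= interval
```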

Implementation:
- Initialize last_progress_update timestamp before as_completed() loop
- Replace unconditional _update_task_status() with conditional call based on time interval
- Update docstring to reflect throttling capability
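The implementation steps above can be sketched as a minimal loop. `process_files`, `handle_file`, and `update_status` are hypothetical stand-ins for the worker's real callables (`_process_single_task` and `_update_task_status` in the commit); only the control flow is the point here:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_files(files, handle_file, update_status, interval=2.0):
    """Sketch of the throttled progress loop (names are illustrative)."""
    processed = 0
    # Timestamp initialized before the as_completed() loop
    last_progress_update = time.monotonic()
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(handle_file, f) for f in files]
        for fut in as_completed(futures):
            fut.result()
            processed += 1
            now = time.monotonic()
            # Conditional update replaces the previous unconditional call
            if interval <= 0 or (now - last_progress_update) >= interval:
                update_status(processed=processed, total=len(files))
                last_progress_update = now
    # Final status update is never throttled
    update_status(processed=processed, total=len(files), done=True)
    return processed
```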

Performance impact (T=2s):
- 1,000 files / 100s processing: DB writes reduced from 1,000 to ~50 (95% reduction)
- 10,000 files / 500s processing: DB writes reduced from 10,000 to ~250 (97.5% reduction)
- Small datasets (10 files): Minimal difference
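The reduction figures above follow from one write per elapsed interval, capped at one per file. A quick sanity check of that arithmetic (the helper name is illustrative):

```python
import math


def expected_db_writes(total_files: int, duration_s: float,
                       interval_s: float) -> int:
    """Approximate throttled DB writes: one per interval elapsed,
    never more than one per file; interval <= 0 means every file."""
    if interval_s <= 0:
        return total_files
    return min(total_files, math.ceil(duration_s / interval_s))
```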

Backward compatibility:
- PROGRESS_UPDATE_INTERVAL=0: Updates every file (identical to previous behavior)
- Heartbeat mechanism unaffected (2s interval << 300s timeout)
- Stop check mechanism independent of progress updates
- Final status updates always execute

Testing:
- 14 unit tests all passed (11 existing + 3 new):
  * Fast processing with throttling
  * PROGRESS_UPDATE_INTERVAL=0 updates every file
  * Slow processing (per-file > T) updates every file
- py_compile syntax check passed

Edge cases handled:
- Single file task: Works normally
- Very slow processing: Degrades to per-file updates
- Concurrent FILE_WORKERS > 1: Counters accurate (lock-protected), DB reflects with max T seconds delay
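The lock-protected counter claim in the last bullet can be demonstrated in isolation. `count_with_lock` is a hypothetical harness, not the worker's code; it only shows that the counter stays accurate under concurrent completions:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def count_with_lock(n_items: int, workers: int = 4) -> int:
    """Counter incremented under a lock: accurate with FILE_WORKERS > 1."""
    processed = 0
    lock = threading.Lock()

    def on_done(_):
        nonlocal processed
        with lock:
            processed += 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Force all tasks to run to completion
        list(pool.map(on_done, range(n_items)))
    return processed
```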
2026-02-10 16:49:37 +08:00
parent 9988ff00f5
commit f707ce9dae
2 changed files with 86 additions and 12 deletions


@@ -320,5 +320,71 @@ class TestWorkerLoopSimplified(unittest.TestCase):
        mock_recover.assert_not_called()


class TestProgressThrottling(unittest.TestCase):
    """Test time-based progress update throttling (improvement #5)."""

    def test_progress_updates_throttled(self):
        """With PROGRESS_UPDATE_INTERVAL>0, rapid completions should batch DB writes."""
        update_calls: List[float] = []
        lock = threading.Lock()

        def mock_update(*args, **kwargs):
            with lock:
                update_calls.append(time.monotonic())

        interval = 0.05  # 50ms throttle interval
        processed = 0
        # Initialize in the past so the first file triggers an update
        last_progress_update = time.monotonic() - interval
        total_files = 50

        # Simulate the throttled update loop from _process_single_task
        for i in range(total_files):
            processed += 1
            now = time.monotonic()
            if interval <= 0 or (now - last_progress_update) >= interval:
                mock_update(processed=processed, total=total_files)
                last_progress_update = now
            # Simulate very fast file processing (~1ms)
            time.sleep(0.001)

        # With 50 files at ~1ms each (~50ms total) and 50ms interval,
        # should get far fewer updates than total_files
        self.assertLess(len(update_calls), total_files)
        self.assertGreater(len(update_calls), 0)

    def test_progress_interval_zero_updates_every_file(self):
        """PROGRESS_UPDATE_INTERVAL=0 should update on every file completion."""
        update_count = 0
        interval = 0.0
        total_files = 20
        last_progress_update = time.monotonic()

        for i in range(total_files):
            now = time.monotonic()
            if interval <= 0 or (now - last_progress_update) >= interval:
                update_count += 1
                last_progress_update = now

        self.assertEqual(update_count, total_files)

    def test_progress_throttle_with_slow_processing(self):
        """When each file takes longer than the interval, every file triggers an update."""
        update_count = 0
        interval = 0.01  # 10ms interval
        total_files = 5
        last_progress_update = time.monotonic() - 1.0  # Start in the past

        for i in range(total_files):
            time.sleep(0.02)  # 20ms per file > 10ms interval
            now = time.monotonic()
            if interval <= 0 or (now - last_progress_update) >= interval:
                update_count += 1
                last_progress_update = now

        # Every file should trigger an update since processing time > interval
        self.assertEqual(update_count, total_files)


if __name__ == "__main__":
    unittest.main()