Throttle progress updates to reduce database write operations during large dataset processing.
Key features:
- Add PROGRESS_UPDATE_INTERVAL config (default 2.0s, configurable via AUTO_ANNOTATION_PROGRESS_INTERVAL env)
- Conditional progress updates: Only write to DB when (now - last_update) >= interval
- Use time.monotonic() for timing (immune to system clock adjustments)
- Final status updates (completed/stopped/failed) always execute (not throttled)
Implementation:
- Initialize last_progress_update timestamp before as_completed() loop
- Replace unconditional _update_task_status() with conditional call based on time interval
- Update docstring to reflect throttling capability
Performance impact (T=2s):
- 1,000 files / 100s processing: DB writes reduced from 1,000 to ~50 (95% reduction)
- 10,000 files / 500s processing: DB writes reduced from 10,000 to ~250 (97.5% reduction)
- Small datasets (10 files): Minimal difference
Backward compatibility:
- PROGRESS_UPDATE_INTERVAL=0: Updates every file (identical to previous behavior)
- Heartbeat mechanism unaffected (2s interval << 300s timeout)
- Stop check mechanism independent of progress updates
- Final status updates always execute
Testing:
- 14 unit tests all passed (11 existing + 3 new):
* Fast processing with throttling
* PROGRESS_UPDATE_INTERVAL=0 updates every file
* Slow processing (per-file > T) updates every file
- py_compile syntax check passed
Edge cases handled:
- Single file task: Works normally
- Very slow processing: Degrades to per-file updates
- Concurrent FILE_WORKERS > 1: Counters accurate (lock-protected), DB reflects with max T seconds delay
Automatically recover running tasks with stale heartbeats on worker startup, preventing tasks from being permanently stuck after container restarts.
Key changes:
- Add HEARTBEAT_TIMEOUT_SECONDS constant (default 300s, configurable via env)
- Add _recover_stale_running_tasks() function:
* Scans for status='running' tasks with heartbeat timeout
* No progress (processed=0) → reset to pending (auto-retry)
* Has progress (processed>0) → mark as failed with Chinese error message
* Each task recovery is independent (single failure doesn't affect others)
* Skip recovery if timeout is 0 or negative (disable feature)
- Call recovery function in _worker_loop() before polling loop
- Update file header comments to reflect recovery mechanism
Recovery logic:
- Query: status='running' AND (heartbeat_at IS NULL OR heartbeat_at < NOW() - timeout)
- Decision based on processed_images count
- Clear run_token to allow other workers to claim
- Single transaction per task for atomicity
Edge cases handled:
- Database unavailable: recovery failure doesn't block worker startup
- Concurrent recovery: UPDATE WHERE status='running' prevents duplicates
- NULL heartbeat: extreme case (crash right after claim) also recovered
- stop_requested tasks: automatically excluded by _fetch_pending_task()
Testing:
- 8 unit tests all passed:
* No timeout tasks
* Timeout disabled
* No progress → pending
* Has progress → failed
* NULL heartbeat recovery
* Multiple tasks mixed processing
* DB error doesn't crash
* Negative timeout disables feature
Automatically convert auto-annotation outputs to Label Studio format and write to t_dm_annotation_results table, enabling seamless editing in the annotation editor.
New file:
- runtime/python-executor/datamate/annotation_result_converter.py
* 4 converters for different annotation types:
- convert_text_classification → choices type
- convert_ner → labels (span) type
- convert_relation_extraction → labels + relation type
- convert_object_detection → rectanglelabels type
* convert_annotation() dispatcher (auto-detects task_type)
* generate_label_config_xml() for dynamic XML generation
* Pipeline introspection utilities
* Label Studio ID generation logic
Modified file:
- runtime/python-executor/datamate/auto_annotation_worker.py
* Preserve file_id through processing loop (line 918)
* Collect file_results as (file_id, annotations) pairs
* New _create_labeling_project_with_annotations() function:
- Creates labeling project linked to source dataset
- Snapshots all files
- Converts results to Label Studio format
- Writes to t_dm_annotation_results in single transaction
* label_config XML stored in t_dm_labeling_projects.configuration
Key features:
- Supports 4 annotation types: text classification, NER, relation extraction, object detection
- Deterministic region IDs for entity references in relation extraction
- Pixel to percentage conversion for object detection
- XML escaping handled by xml.etree.ElementTree
- Partial results preserved on task stop
Users can now view and edit auto-annotation results seamlessly in the annotation editor.
Add three new LLM-powered auto-annotation operators:
- LLMTextClassification: Text classification using LLM
- LLMNamedEntityRecognition: Named entity recognition with type validation
- LLMRelationExtraction: Relation extraction with entity and relation type validation
Key features:
- Load LLM config from t_model_config table via modelId parameter
- Lazy loading of LLM configuration on first execute()
- Result validation with whitelist checking for entity/relation types
- Fault-tolerant: returns empty results on LLM failure instead of throwing
- Fully compatible with existing Worker pipeline
Files added:
- runtime/ops/annotation/_llm_utils.py: Shared LLM utilities
- runtime/ops/annotation/llm_text_classification/: Text classification operator
- runtime/ops/annotation/llm_named_entity_recognition/: NER operator
- runtime/ops/annotation/llm_relation_extraction/: Relation extraction operator
Files modified:
- runtime/ops/annotation/__init__.py: Register 3 new operators
- runtime/python-executor/datamate/auto_annotation_worker.py: Add to Worker whitelist
- frontend/src/pages/DataAnnotation/OperatorCreate/hooks/useOperatorOperations.ts: Add to frontend whitelist
* Enhance CleaningTaskService to track cleaning process progress and update ExecutorType to DATAMATE
* Refactor project to use 'datamate' naming convention for services and configurations