feat: optimize the question generation process and COT data generation (#169)

* fix(chart): update Helm chart helpers and values for improved configuration

* feat(SynthesisTaskTab): enhance task table with tooltip support and improved column widths

* feat(CreateTask, SynthFileTask): improve task creation and detail view with enhanced payload handling and UI updates

* feat(SynthFileTask): enhance file display with progress tracking and delete action

* feat(SynthDataDetail): add delete action for chunks with confirmation prompt

* feat(SynthDataDetail): update edit and delete buttons to icon-only format

* feat(SynthDataDetail): add confirmation modals for chunk and synthesis data deletion

* feat(DocumentSplitter): add enhanced document splitting functionality with CJK support and metadata detection

* feat(DataSynthesis): refactor data synthesis models and update task handling logic

* feat(DataSynthesis): streamline synthesis task handling and enhance chunk processing logic

* feat(DataSynthesis): refactor data synthesis models and update task handling logic

* fix(generation_service): ensure processed chunks are incremented regardless of question generation success

* feat(CreateTask): enhance task creation with new synthesis templates and improved configuration options

Dallas98 authored on 2025-12-18 16:51:18 +08:00, committed by GitHub
parent 761f7f6a51
commit e0e9b1d94d
14 changed files with 1362 additions and 571 deletions


@@ -14,7 +14,8 @@ def call_openai_style_model(base_url, api_key, model_name, prompt, **kwargs):
     )
     return response.choices[0].message.content
-def _extract_json_substring(raw: str) -> str:
+def extract_json_substring(raw: str) -> str:
     """Extract the most likely JSON string fragment from the raw LLM reply.
     Approach:
@@ -22,11 +23,21 @@ def _extract_json_substring(raw: str) -> str:
     - First look in the text for the first '{' or '[' as the JSON start;
     - Then scan backwards for the last '}' or ']' as the end;
     - If no suitable boundaries are found, fall back to the original string.
+    - Some models may include internal `<think>...</think>` reasoning in the reply; strip it before parsing.
     This method does not guarantee that the extracted span is valid JSON, but it significantly improves the success rate of json.loads.
     """
     if not raw:
         return raw
+    # First remove all <think>...</think> sections (including ones spanning multiple lines)
+    try:
+        import re
+        raw = re.sub(r"<think>[\s\S]*?</think>", "", raw, flags=re.IGNORECASE)
+    except Exception:
+        # If the regex step raises, don't affect the downstream logic; keep using the original text
+        pass
     start = None
     end = None
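
For context, below is a minimal, self-contained sketch of what the updated helper could look like end to end, since the diff cuts off right after start = None / end = None. The boundary-finding and slicing logic, the example reply string, and the trailing print call are assumptions illustrating the approach described in the docstring, not the repository's actual implementation.

# Hedged sketch, not the repo's code: the boundary-finding half is inferred from the docstring.
import json
import re


def extract_json_substring(raw: str) -> str:
    """Extract the most likely JSON fragment from a raw LLM reply."""
    if not raw:
        return raw
    # Strip <think>...</think> reasoning blocks, including multi-line ones.
    try:
        raw = re.sub(r"<think>[\s\S]*?</think>", "", raw, flags=re.IGNORECASE)
    except Exception:
        # If the regex step fails, keep using the original text.
        pass
    # First '{' or '[' marks the start; last '}' or ']' marks the end.
    starts = [i for i in (raw.find("{"), raw.find("[")) if i != -1]
    ends = [i for i in (raw.rfind("}"), raw.rfind("]")) if i != -1]
    if not starts or not ends:
        return raw  # No plausible boundaries; fall back to the original string.
    start, end = min(starts), max(ends)
    if end <= start:
        return raw
    return raw[start:end + 1]


# Hypothetical usage: parse a model reply that mixes prose with a JSON payload.
reply = '<think>internal reasoning</think>Sure: {"questions": ["Q1", "Q2"]} hope that helps'
print(json.loads(extract_json_substring(reply)))  # {'questions': ['Q1', 'Q2']}

Falling back to the original string when no plausible boundaries are found mirrors the documented behaviour and leaves json.loads as the single point where parse failures surface.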