DocParser API

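Overview

DocParser is a GPU-accelerated PDF parsing API service with two specialized engines: struct, which targets document structure analysis and formula recognition, and polyglot, which targets Chinese and other multilingual documents. An asynchronous task queue distributes work across GPUs, so multiple machines with multiple GPUs can process tasks concurrently.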

Base URL: https://docparser.deconbear.cn

Authentication

All endpoints except /health and /gpustatus require an API key in the HTTP request headers:

X-API-Key: dp_YOUR_API_KEY

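API keys are created and managed in the admin panel on the GPU machine (local access only). Keys are stored as SHA-256 hashes, and the plaintext is shown only once, at creation time.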
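
To illustrate what hashed key storage with a one-time plaintext display can look like, here is a minimal sketch. It is not DocParser's actual implementation: the function names, the key length, and the use of `secrets.token_urlsafe` are assumptions, and only the `dp_` prefix and SHA-256 come from the text above.

```python
import hashlib
import hmac
import secrets


def generate_api_key(prefix: str = "dp_") -> tuple[str, str]:
    """Return (plaintext_key, sha256_hex_digest).

    Hypothetical helper: the plaintext is shown to the user once,
    and only the digest would be persisted.
    """
    plaintext = prefix + secrets.token_urlsafe(32)
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, digest


def verify_api_key(presented: str, stored_digest: str) -> bool:
    """Hash the presented key and compare it to the stored digest in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)


key, digest = generate_api_key()
print(key.startswith("dp_"), len(digest))  # True 64
print(verify_api_key(key, digest))         # True
```

Because only the digest is kept, a lost key cannot be recovered and would have to be replaced with a new one.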

Quick start

Parsing takes three steps: submit the PDF, poll the status, then fetch the result.

Step 1: Submit the PDF

curl -X POST https://docparser.deconbear.cn/parse \
  -F "file=@paper.pdf" \
  -F "engine=struct" \
  -H "X-API-Key: dp_YOUR_KEY"

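import requests

# Open the file in a context manager so the handle is closed after the upload.
with open("paper.pdf", "rb") as f:
    r = requests.post(
        "https://docparser.deconbear.cn/parse",
        files={"file": f},
        data={"engine": "struct"},
        headers={"X-API-Key": "dp_YOUR_KEY"},
    )
task_id = r.json()["task_id"]  # reused in steps 2 and 3
print(r.json())  # {"task_id":"...", "status":"queued"}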

const form = new FormData();
form.append("file", pdfFile);
form.append("engine", "struct");

const res = await fetch("https://docparser.deconbear.cn/parse", {
  method: "POST",
  headers: { "X-API-Key": "dp_YOUR_KEY" },
  body: form,
});
const { task_id } = await res.json();
// {"task_id":"...", "status":"queued"}

The server responds with status code 202 and a task_id; the task is now in the GPU processing queue.

Step 2: Poll the status

curl -H "X-API-Key: dp_YOUR_KEY" \
  https://docparser.deconbear.cn/status/TASK_ID

import time

while True:
    r = requests.get(
        f"https://docparser.deconbear.cn/status/{task_id}",
        headers={"X-API-Key": "dp_YOUR_KEY"},
    )
    status = r.json()["status"]
    if status in ("success", "failure"):
        break
    time.sleep(5)

let status: string;
do {
  await new Promise(r => setTimeout(r, 5000));
  const res = await fetch(
    `https://docparser.deconbear.cn/status/${task_id}`,
    { headers: { "X-API-Key": "dp_YOUR_KEY" } },
  );
  ({ status } = await res.json());
} while (status !== "success" && status !== "failure");

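Status flow: queued → started → success or failure. A task may also report retrying while it waits for a free GPU. Polling every 5 seconds is recommended.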
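
The loops above run until the task finishes, however long that takes. A bounded variant might look like the sketch below. The `wait_for_task` helper, its 600-second default timeout, and the injectable `fetch_status` and `sleep` callables are assumptions made for illustration; only the 5-second interval and the terminal statuses come from this page.

```python
import time
from typing import Callable

TERMINAL = {"success", "failure"}


def wait_for_task(
    fetch_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,  # assumed default, not a documented limit
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status() until it returns a terminal status or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still '{status}' after {timeout_s:.0f}s")
        sleep(interval_s)


# Offline demonstration: a scripted sequence stands in for real HTTP calls.
scripted = iter(["queued", "started", "started", "success"])
print(wait_for_task(lambda: next(scripted), sleep=lambda _s: None))  # success
```

In real use, `fetch_status` would wrap the GET /status/{task_id} request shown above and return the `status` field.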

Step 3: Fetch the result

curl -H "X-API-Key: dp_YOUR_KEY" \
  https://docparser.deconbear.cn/result/TASK_ID

r = requests.get(
    f"https://docparser.deconbear.cn/result/{task_id}",
    headers={"X-API-Key": "dp_YOUR_KEY"},
)
data = r.json()
print(f"Engine: {data['engine']}, Time: {data['parse_time_s']:.1f}s")
print(data["markdown"])

const res = await fetch(
  `https://docparser.deconbear.cn/result/${task_id}`,
  { headers: { "X-API-Key": "dp_YOUR_KEY" } },
);
const { markdown, engine, parse_time_s } = await res.json();
console.log(`Engine: ${engine}, Time: ${parse_time_s}s`);
console.log(markdown);

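Once the task has completed, the endpoint returns the parsing result in Markdown format. If the task has not finished yet, it returns status code 409.

API reference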

POST `/parse`

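Submits a PDF file for parsing. The file is saved and added to the asynchronous task queue, and a task_id is returned immediately.

Requires an API key

Request parameters

Parameter	Type	Required	Description
`file`	file (PDF)	Yes	The PDF file to parse, 100 MB maximum
`engine`	string	No	Parsing engine: `struct` (default, suited to academic papers) or `polyglot` (suited to Chinese and multilingual documents)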
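
Checking these constraints on the client before uploading avoids a round trip that would end in 400 or 413. The sketch below is one way to do that. The `check_upload` helper is hypothetical, the limit is assumed to be 100 × 1024 × 1024 bytes because the page does not say how "100 MB" is counted, and the `%PDF-` signature test is a heuristic rather than the server's own validation.

```python
import os
import tempfile

MAX_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" is read as 100 MiB
ENGINES = {"struct", "polyglot"}


def check_upload(path: str, engine: str = "struct") -> list[str]:
    """Return a list of problems found; an empty list means the request looks valid."""
    problems = []
    if engine not in ENGINES:
        problems.append(f"unknown engine: {engine!r}")
    size = os.path.getsize(path)
    if size == 0:
        problems.append("file is empty")
    elif size > MAX_BYTES:
        problems.append(f"file is {size} bytes, over the 100 MB limit")
    with open(path, "rb") as f:
        # Only inspect the header when there is something to read.
        if size and f.read(5) != b"%PDF-":
            problems.append("file does not start with the %PDF- signature")
    return problems


# Demonstration on a throwaway file that contains only a PDF header line.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.7\n")
print(check_upload(tmp.name, "polyglot"))  # []
os.remove(tmp.name)
```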

Response (202)

{"task_id": "f3a1b9c0d2e4", "status": "queued", "message": "Task submitted"}

GET `/status/{task_id}`

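Returns the current status and progress of a parsing task.

Requires an API key

Path parameters

Parameter	Type	Description
`task_id`	string	The task ID returned by POST /parse

Response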

{
  "task_id": "f3a1b9c0d2e4",
  "status": "started",
  "engine": "struct",
  "created_at": "2025-06-15T14:32:00Z",
  "started_at": "2025-06-15T14:32:05Z",
  "completed_at": null,
  "error": null
}

Status values: queued | started | retrying | success | failure
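
The timestamps in the status response can be used to measure how long a task waited in the queue. The sketch below assumes they are ISO 8601 strings in UTC, as in the sample response. `parse_ts` and `queue_wait_seconds` are hypothetical helper names.

```python
import json
from datetime import datetime
from typing import Optional


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse a timestamp such as 2025-06-15T14:32:00Z; None stays None."""
    if value is None:
        return None
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def queue_wait_seconds(payload: dict) -> Optional[float]:
    """Seconds between created_at and started_at, or None if the task has not started."""
    created = parse_ts(payload.get("created_at"))
    started = parse_ts(payload.get("started_at"))
    if created is None or started is None:
        return None
    return (started - created).total_seconds()


sample = json.loads("""
{
  "task_id": "f3a1b9c0d2e4",
  "status": "started",
  "engine": "struct",
  "created_at": "2025-06-15T14:32:00Z",
  "started_at": "2025-06-15T14:32:05Z",
  "completed_at": null,
  "error": null
}
""")
print(sample["status"], queue_wait_seconds(sample))  # started 5.0
```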

GET `/result/{task_id}`

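Returns the result of a completed parse (Markdown format). Returns 409 if the task has not finished.

Requires an API key

Path parameters

Parameter	Type	Description
`task_id`	string	The task ID returned by POST /parse

Response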

{
  "task_id": "f3a1b9c0d2e4",
  "status": "success",
  "markdown": "# Paper Title\n\n## Abstract\n\n...",
  "engine": "struct",
  "parse_time_s": 42.3,
  "error": null
}

DELETE `/task/{task_id}`

Cancels a queued or running task. Completed tasks are not affected.

Requires an API key

Path parameters

Parameter	Type	Description
`task_id`	string	The task ID returned by POST /parse

Response

{"task_id": "f3a1b9c0d2e4", "status": "revoked"}
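
A cancellation request can be sketched with only the standard library, as below. `build_cancel_request` and `cancel_task` are hypothetical helper names, and the 30-second timeout is an assumption. The demonstration builds the request without sending it.

```python
import json
import urllib.request

HOST = "https://docparser.deconbear.cn"


def build_cancel_request(task_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the DELETE /task/{task_id} request."""
    return urllib.request.Request(
        f"{HOST}/task/{task_id}",
        method="DELETE",
        headers={"X-API-Key": api_key},
    )


def cancel_task(task_id: str, api_key: str) -> dict:
    """Send the request and return the decoded JSON body.

    urlopen() raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = build_cancel_request(task_id, api_key)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)


req = build_cancel_request("f3a1b9c0d2e4", "dp_YOUR_KEY")
print(req.get_method(), req.full_url)
# DELETE https://docparser.deconbear.cn/task/f3a1b9c0d2e4
```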

GET `/health`

Service health check. Returns GPU status, Redis connection status, and the number of active workers.

No authentication required

Response

{
  "status": "healthy",
  "gpus": [{"index":0, "name":"NVIDIA GeForce RTX 4090",
            "utilization_pct":15.0, "memory_pct":33.3}],
  "redis_connected": true,
  "workers": 2
}
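
A client could turn this payload into a simple go/no-go check before submitting a batch. The sketch below is one possible policy, not something the service prescribes. The `is_ready` helper and the 90% utilization cut-off are assumptions.

```python
def is_ready(health: dict, max_util_pct: float = 90.0) -> bool:
    """True if the service is healthy, Redis is connected, at least one worker
    is running, and at least one GPU is below max_util_pct utilization."""
    if health.get("status") != "healthy":
        return False
    if not health.get("redis_connected") or health.get("workers", 0) < 1:
        return False
    return any(g["utilization_pct"] < max_util_pct for g in health.get("gpus", []))


sample = {
    "status": "healthy",
    "gpus": [{"index": 0, "name": "NVIDIA GeForce RTX 4090",
              "utilization_pct": 15.0, "memory_pct": 33.3}],
    "redis_connected": True,
    "workers": 2,
}
print(is_ready(sample))  # True
```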

GET `/gpustatus`

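Returns only a snapshot of GPU utilization and memory usage. It is lighter than /health and suits scripts that check GPU availability in bulk.

No authentication required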

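Error codes

HTTP status code	Meaning
`202`	Task accepted and queued for processing
`400`	Malformed request (not a PDF file, or an empty file)
`401`	Missing or invalid API key
`403`	The admin panel is accessible from the local machine only
`404`	Task not found (expired or wrong ID)
`409`	Task exists but has not finished; poll /status first
`413`	PDF file exceeds the 100 MB size limit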
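
Client code often reduces a status table like this to a decision about what to do next. The sketch below shows one such mapping. The action names and the `next_action` helper are illustrative assumptions, not part of the API.

```python
def next_action(status_code: int) -> str:
    """Map a status code from the table above to a suggested client action."""
    actions = {
        202: "start-polling",     # task accepted and queued
        400: "fix-request",       # not a PDF, or an empty file
        401: "fix-credentials",   # missing or invalid API key
        403: "give-up",           # admin panel is local-only
        404: "check-task-id",     # expired or wrong ID
        409: "keep-polling",      # result not ready yet
        413: "fix-request",       # file over the size limit
    }
    return actions.get(status_code, "unexpected")


for code in (202, 409, 413, 500):
    print(code, next_action(code))
# 202 start-polling
# 409 keep-polling
# 413 fix-request
# 500 unexpected
```

Of the codes listed, only 409 signals a condition that clears up by waiting. The others call for a change to the request.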

Architecture

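Layer	Component	Role
Edge	Nginx + SSL	HTTPS termination / reverse proxy
Relay	SSH Reverse Tunnel	Encrypted TCP bridge
Application	FastAPI	REST API / authentication / scheduling
Queue	Redis + Celery	Distributed task queue
Engine	Struct / Polyglot	GPU parsing engines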

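Multi-machine, multi-GPU: each GPU runs one Celery worker (./run_worker.sh 0), and all workers share the same Redis queue. Before taking a task, each worker checks through NVML whether its GPU is idle. If the GPU is busy, the task is retried automatically and can be picked up by a worker on another, idle machine.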
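
The check-then-retry behaviour can be sketched as a pure decision function, shown below. This is not the service's worker code: the idle thresholds, the attempt limit, and the linear back-off are all assumptions. In a real worker the snapshot would presumably come from NVML, and the "retry" outcome would be handed to Celery's task retry mechanism.

```python
from dataclasses import dataclass


@dataclass
class GpuSnapshot:
    utilization_pct: float
    memory_pct: float


def gpu_is_idle(s: GpuSnapshot, max_util_pct: float = 20.0,
                max_mem_pct: float = 50.0) -> bool:
    """Treat the GPU as idle when both readings are under the assumed thresholds."""
    return s.utilization_pct <= max_util_pct and s.memory_pct <= max_mem_pct


def claim_or_retry(snapshot: GpuSnapshot, attempt: int,
                   max_attempts: int = 5,
                   base_delay_s: float = 10.0) -> tuple[str, float]:
    """Return ("run", 0.0), ("retry", delay_s) or ("fail", 0.0).

    attempt is zero-based; the delay grows linearly with each retry.
    """
    if gpu_is_idle(snapshot):
        return ("run", 0.0)
    if attempt + 1 >= max_attempts:
        return ("fail", 0.0)
    return ("retry", base_delay_s * (attempt + 1))


print(claim_or_retry(GpuSnapshot(15.0, 33.3), attempt=0))  # ('run', 0.0)
print(claim_or_retry(GpuSnapshot(95.0, 80.0), attempt=1))  # ('retry', 20.0)
```

Returning the task to the shared queue after a delay is what gives a worker on a less loaded machine the chance to claim it.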

Complete example

import requests
import time

HOST = "https://docparser.deconbear.cn"
HEADERS = {"X-API-Key": "dp_YOUR_KEY"}


def parse_pdf(filepath: str, engine: str = "struct") -> str:
    # 1. Submit
    with open(filepath, "rb") as f:
        r = requests.post(
            f"{HOST}/parse",
            files={"file": f},
            data={"engine": engine},
            headers=HEADERS,
        )
    r.raise_for_status()
    task_id = r.json()["task_id"]
    print(f"Task submitted: {task_id}")

    # 2. Poll until done
    while True:
        r = requests.get(
            f"{HOST}/status/{task_id}",
            headers=HEADERS,
        )
        r.raise_for_status()
        status = r.json()["status"]
        print(f"  {status}")
        if status in ("success", "failure"):
            break
        time.sleep(5)

    if status == "failure":
        raise RuntimeError(f"Parse failed: {r.json().get('error')}")

    # 3. Get result
    r = requests.get(
        f"{HOST}/result/{task_id}",
        headers=HEADERS,
    )
    r.raise_for_status()
    data = r.json()
    print(f"Done. engine={data['engine']}, time={data['parse_time_s']:.1f}s")
    return data["markdown"]


if __name__ == "__main__":
    md = parse_pdf("paper.pdf", "struct")
    print(md[:500])

/**
 * DocParser API — complete TypeScript example
 * Requires: Node.js 18+ (built-in fetch)
 */
const HOST = "https://docparser.deconbear.cn";
const KEY = "dp_YOUR_KEY";

interface ParseResponse {
  task_id: string;
  status: string;
  message: string;
}

interface TaskStatus {
  task_id: string;
  status: "queued" | "started" | "retrying" | "success" | "failure";
  engine: string;
  error: string | null;
}

interface TaskResult {
  task_id: string;
  status: string;
  markdown: string | null;
  engine: string | null;
  parse_time_s: number | null;
  error: string | null;
}

async function parsePDF(
  file: File | Blob,
  engine: "struct" | "polyglot" = "struct",
): Promise<string> {
  // 1. Submit
  const form = new FormData();
  form.append("file", file);
  form.append("engine", engine);

  const submitRes = await fetch(`${HOST}/parse`, {
    method: "POST",
    headers: { "X-API-Key": KEY },
    body: form,
  });
  if (!submitRes.ok) {
    throw new Error(`Submit failed: ${submitRes.status}`);
  }
  const { task_id } = (await submitRes.json()) as ParseResponse;

  // 2. Poll until done
  let status: string;
  do {
    await new Promise((resolve) => setTimeout(resolve, 5000));
    const statusRes = await fetch(`${HOST}/status/${task_id}`, {
      headers: { "X-API-Key": KEY },
    });
    if (!statusRes.ok) {
      throw new Error(`Status check failed: ${statusRes.status}`);
    }
    ({ status } = (await statusRes.json()) as TaskStatus);
  } while (status !== "success" && status !== "failure");

  if (status === "failure") {
    throw new Error("PDF parsing failed");
  }

  // 3. Get result
  const resultRes = await fetch(`${HOST}/result/${task_id}`, {
    headers: { "X-API-Key": KEY },
  });
  if (!resultRes.ok) {
    throw new Error(`Result fetch failed: ${resultRes.status}`);
  }

  const { markdown } = (await resultRes.json()) as TaskResult;
  if (markdown === null) {
    throw new Error("Empty result");
  }
  return markdown;
}

// Usage (Node.js)
import { readFileSync } from "node:fs";
const pdfBlob = new Blob([readFileSync("paper.pdf")]);
const markdown = await parsePDF(pdfBlob, "struct");
console.log(markdown.slice(0, 500));
