Overview

This demo showcases how to use Trigger.dev with Python to extract structured form data from a PDF file available at a URL.

Prerequisites

Features

GitHub repo

View the project on GitHub

Click here to view the full code for this project in our examples repository on GitHub. You can fork it and use it as a starting point for your own project.

The code

Build configuration

After you’ve initialized your project with Trigger.dev, add these build settings to your trigger.config.ts file:

trigger.config.ts
import { pythonExtension } from "@trigger.dev/python/extension";
import { defineConfig } from "@trigger.dev/sdk/v3";

export default defineConfig({
  runtime: "node",
  project: "<your-project-ref>",
  // Your other config settings...
  build: {
    extensions: [
      pythonExtension({
        // The path to your requirements.txt file
        requirementsFile: "./requirements.txt",
        // The path to your Python binary
        devPythonBinaryPath: `venv/bin/python`,
        // The paths to your Python scripts to run
        scripts: ["src/python/**/*.py"],
      }),
    ],
  },
});

Learn more about executing scripts in your Trigger.dev project using our Python build extension here.

Task code

This task uses the python.runScript method to run the image-processing.py script with the given image URL as an argument. You can adjust the image processing parameters in the payload, with options such as height, width, quality, output format, etc.

src/trigger/pythonPdfTask.ts
import { task } from "@trigger.dev/sdk/v3";
import { python } from "@trigger.dev/python";

export const processPdfForm = task({
  id: "process-pdf-form",
  run: async (payload: { pdfUrl: string }, io: any) => {
    const { pdfUrl } = payload;
    const args = [pdfUrl];

    const result = await python.runScript("./src/python/extract-pdf-form.py", args);

    // Parse the JSON output from the script
    let formData;
    try {
      formData = JSON.parse(result.stdout);
    } catch (error) {
      throw new Error(`Failed to parse JSON output: ${result.stdout}`);
    }

    return {
      formData,
      stderr: result.stderr,
      exitCode: result.exitCode,
    };
  },
});

Add a requirements.txt file

Add the following to your requirements.txt file. This is required in Python projects to install the dependencies.

requirements.txt
PyMuPDF==1.23.8
requests==2.31.0

The Python script

The Python script uses PyMuPDF to extract form data from a PDF file. You can see the original script in our examples repository here.

src/python/extract-pdf-form.py
import fitz  # PyMuPDF
import requests
import os
import json
import sys
from urllib.parse import urlparse

def download_pdf(url):
    """Download PDF from URL to a temporary file"""
    response = requests.get(url)
    response.raise_for_status()

    # Get filename from URL or use default
    filename = os.path.basename(urlparse(url).path) or "downloaded.pdf"
    filepath = os.path.join("/tmp", filename)

    with open(filepath, 'wb') as f:
        f.write(response.content)
    return filepath

def extract_form_data(pdf_path):
    """Extract form data from a PDF file."""
    doc = fitz.open(pdf_path)
    form_data = {}

    for page_num, page in enumerate(doc):
        fields = page.widgets()
        for field in fields:
            field_name = field.field_name or f"unnamed_field_{page_num}_{len(form_data)}"
            field_type = field.field_type_string
            field_value = field.field_value

            # For checkboxes, convert to boolean
            if field_type == "CheckBox":
                field_value = field_value == "Yes"

            form_data[field_name] = {
                "type": field_type,
                "value": field_value,
                "page": page_num + 1
            }

    return form_data

def main():
    if len(sys.argv) < 2:
        print(json.dumps({"error": "PDF URL is required as an argument"}), file=sys.stderr)
        return 1

    url = sys.argv[1]

    try:
        pdf_path = download_pdf(url)
        form_data = extract_form_data(pdf_path)

        # Convert to JSON for structured output
        structured_output = json.dumps(form_data, indent=2)
        print(structured_output)
        return 0
    except Exception as e:
        print(json.dumps({"error": str(e)}), file=sys.stderr)
        return 1

if __name__ == "__main__":
    sys.exit(main())

Testing your task

  1. Create a virtual environment python -m venv venv
  2. Activate the virtual environment, depending on your OS: On Mac/Linux: source venv/bin/activate, on Windows: venv\Scripts\activate
  3. Install the Python dependencies pip install -r requirements.txt
  4. Copy the project ref from your Trigger.dev dashboard and add it to the trigger.config.ts file.
  5. Run the Trigger.dev CLI dev command (it may ask you to authorize the CLI if you haven’t already).
  6. Test the task in the dashboard by providing a valid PDF URL.
  7. Deploy the task to production using the Trigger.dev CLI deploy command.

Learn more about using Python with Trigger.dev

Python build extension

Learn how to use our built-in Python build extension to install dependencies and run your Python code.