
Commit 4667fdf

feat: tweak pdf parser for corner cases and add 120s demo

1 parent 243194d · commit 4667fdf

File tree: 3 files changed, +387 −4 lines changed
Lines changed: 381 additions & 0 deletions
@@ -0,0 +1,381 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run this Fenic demo, click **Runtime** > **Run all**.\n",
    "\n",
    "<div class=\"align-center\">\n",
    "<a href=\"https://github.com/typedef-ai/fenic\"><img src=\"https://github.com/typedef-ai/fenic/blob/main/docs/images/typedef-fenic-logo-github-yellow.png?raw=true\" height=\"50\"></a>\n",
    "<a href=\"https://discord.gg/GdqF3J7huR\"><img src=\"https://github.com/typedef-ai/fenic/blob/main/docs/images/join-the-discord.png?raw=true\" height=\"50\"></a>\n",
    "<a href=\"https://docs.fenic.ai/latest/\"><img src=\"https://github.com/typedef-ai/fenic/blob/main/docs/images/documentation.png?raw=true\" height=\"50\"></a>\n",
    "\n",
    "Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/typedef-ai/fenic).\n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip uninstall -y sklearn-compat ibis-framework imbalanced-learn google-genai\n",
    "!pip install polars==1.30.0\n",
    "!pip install huggingface_hub\n",
    "# === OPENAI (Default) ===\n",
    "#!pip install fenic\n",
    "# === GOOGLE GEMINI (used by this demo for PDF parsing) ===\n",
    "!pip install \"fenic[google]\"\n",
    "# === ANTHROPIC CLAUDE ===\n",
    "#!pip install \"fenic[anthropic]\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import getpass\n",
    "\n",
    "# 🔌 MULTI-PROVIDER SETUP - set keys for the providers in your semantic config\n",
    "# The session below uses both OpenAI and Google models, so both keys are needed\n",
    "\n",
    "# === OPENAI ===\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n",
    "\n",
    "# === GOOGLE GEMINI ===\n",
    "os.environ[\"GOOGLE_API_KEY\"] = getpass.getpass(\"Google API Key:\")\n",
    "\n",
    "# === ANTHROPIC CLAUDE (not used in this demo) ===\n",
    "#os.environ[\"ANTHROPIC_API_KEY\"] = getpass.getpass(\"Anthropic API Key:\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 📄 PDF Processing & Analysis\n",
    "\n",
    "**Hook:** *\"Transform PDFs into structured, queryable data in seconds\"*\n",
    "\n",
    "Research papers, whitepapers, technical documents - PDFs contain valuable information but are notoriously difficult to work with. Traditional PDF processing requires complex parsing, layout analysis, and manual extraction. Watch AI-powered PDF processing convert unstructured documents into structured, searchable data.\n",
    "\n",
    "**What you'll see in this 2-minute demo:**\n",
    "- 📚 **PDF to Markdown** - Intelligent conversion preserving structure and formatting\n",
    "- 🧠 **Content Categorization** - Automatic classification of document sections\n",
    "- 📊 **Structured Extraction** - Products, training methods, key topics identified\n",
    "- ⚡ **Batch Processing** - Multiple PDFs processed and analyzed efficiently\n",
    "\n",
    "Perfect for research analysis, document management, and content discovery.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import fenic as fc\n",
    "from pydantic import BaseModel, Field\n",
    "from typing import List\n",
    "import huggingface_hub as hf\n",
    "import shutil\n",
    "\n",
    "# ⚡ Configure for PDF processing with multiple models\n",
    "session = fc.Session.get_or_create(fc.SessionConfig(\n",
    "    app_name=\"pdf_processing_demo\",\n",
    "    semantic=fc.SemanticConfig(\n",
    "        language_models={\n",
    "            \"parse_model\": fc.GoogleDeveloperLanguageModel(\n",
    "                model_name=\"gemini-2.5-flash-lite\",\n",
    "                rpm=500,\n",
    "                tpm=1_000_000,\n",
    "            ),\n",
    "            \"cheap_model\": fc.OpenAILanguageModel(\n",
    "                model_name=\"gpt-5-nano\",\n",
    "                rpm=500,\n",
    "                tpm=200_000,\n",
    "            ),\n",
    "        },\n",
    "        default_language_model=\"cheap_model\"\n",
    "    )\n",
    "))\n",
    "\n",
    "print(\"✅ PDF processing session configured with Gemini for parsing and GPT-5-nano for analysis\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📄 Step 1: Download Sample PDFs\n",
    "\n",
    "Let's grab some real whitepapers to process - these are complex technical documents perfect for demonstrating AI-powered PDF analysis.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 📚 Download sample whitepapers from Hugging Face\n",
    "data_dir = \"sample_pdfs/\"\n",
    "os.makedirs(data_dir, exist_ok=True)\n",
    "\n",
    "repo_id = \"typedef-ai/pdf_data\"\n",
    "files = hf.list_repo_files(repo_id=repo_id, repo_type=\"dataset\")\n",
    "\n",
    "print(f\"📥 Downloading whitepapers from {repo_id}...\")\n",
    "for file in files:\n",
    "    if file.startswith(\"whitepapers/\"):\n",
    "        hf.hf_hub_download(repo_id=repo_id, repo_type=\"dataset\", filename=file, local_dir=data_dir)\n",
    "        print(f\"  ✅ Downloaded: {file}\")\n",
    "\n",
    "print(f\"📁 PDFs saved to: {data_dir}\")\n"
   ]
  },
143+
{
144+
"cell_type": "markdown",
145+
"metadata": {},
146+
"source": [
147+
"## 🧠 Step 2: AI-Powered PDF to Markdown Conversion\n",
148+
"\n",
149+
"Now the magic happens - watch AI convert complex PDFs into clean, structured markdown while preserving all the important formatting and hierarchy.\n",
150+
"\n",
151+
"First we can filter which documents we parse based on the PDF metadata. In this case, we're only interested in longer, unencrypted documents.\n"
152+
]
153+
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdf_filtered_df = session.read.pdf_metadata(f\"{data_dir}/**/*.pdf\").filter(\n",
    "    (fc.col(\"page_count\") > 3) & (~fc.col(\"is_encrypted\"))\n",
    ")\n",
    "\n",
    "print(f\"📊 Found {pdf_filtered_df.count()} valid PDFs to process\")\n",
    "pdf_filtered_df.select(\"title\", \"page_count\", \"file_path\").show()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 🚀 Convert PDFs to Markdown using AI\n",
    "print(\"🤖 Converting PDFs to markdown using Gemini...\")\n",
    "pdf_to_md_content = pdf_filtered_df.with_column(\n",
    "    \"markdown_content\",\n",
    "    fc.semantic.parse_pdf(fc.col(\"file_path\"), model_alias=\"parse_model\")\n",
    ").cache()\n",
    "\n",
    "print(\"✅ PDF to Markdown conversion complete!\")\n",
    "print(f\"📄 Processed {pdf_to_md_content.count()} documents\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📊 Step 3: Extract Document Structure\n",
    "\n",
    "Fenic's markdown functions can extract structure from the converted content. Let's break each document into sections and generate a table of contents.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 📋 Extract document structure and table of contents\n",
    "pdf_sections_df = pdf_to_md_content.select(\n",
    "    fc.when(\n",
    "        fc.col(\"title\").is_not_null(),\n",
    "        fc.col(\"title\")\n",
    "    ).otherwise(\n",
    "        fc.text.split_part(fc.col(\"file_path\"), \"/\", -1)\n",
    "    ).alias(\"name\"),\n",
    "    \"markdown_content\",\n",
    "    # Extract sections up to level 3 headers\n",
    "    fc.markdown.extract_header_chunks(fc.col(\"markdown_content\"), header_level=3).alias(\"sections\"),\n",
    "    # Generate table of contents\n",
    "    fc.markdown.generate_toc(fc.col(\"markdown_content\")).alias(\"toc\")\n",
    ")\n",
    "\n",
    "print(\"📊 Document structure extracted:\")\n",
    "pdf_sections_df.select(\"name\", \"sections\", \"toc\").show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧠 Step 4: AI-Powered Content Analysis\n",
    "\n",
    "Now let's use AI to analyze the content and extract structured insights - what products are mentioned, what sections discuss model training, and more.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 🎯 Define content categorization schema\n",
    "class PDFContentCategorization(BaseModel):\n",
    "    \"\"\"AI-powered PDF content categorization.\"\"\"\n",
    "    summary: str = Field(description=\"Brief one sentence summary of the PDF given its table of contents\")\n",
    "    sections_about_model_training: List[str] = Field(description=\"List of headings that are specifically about model training\")\n",
    "    products_mentioned: List[str] = Field(description=\"All product names mentioned in the PDF table of contents\")\n",
    "\n",
    "print(\"🎯 Content categorization schema defined\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 🤖 AI-powered content analysis using table of contents\n",
    "pdf_filtered_details = pdf_sections_df.with_column(\n",
    "    \"content_categorization\",\n",
    "    fc.semantic.extract(fc.col(\"toc\").cast(fc.StringType), PDFContentCategorization, model_alias=\"cheap_model\")\n",
    ").cache()\n",
    "\n",
    "print(\"✅ AI content analysis complete!\")\n",
    "\n",
    "#pdf_filtered_details.to_polars()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📊 Step 5: Display Results\n",
    "\n",
    "Let's see what insights AI extracted from our PDFs - summaries, products mentioned, and training-related sections.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 📊 Display whitepaper summaries and insights\n",
    "print(\"=\"*70)\n",
    "print(\"📄 WHITEPAPER ANALYSIS RESULTS\")\n",
    "print(\"=\"*70)\n",
    "\n",
    "for row in pdf_filtered_details.to_pylist():\n",
    "    print(f\"\\n📚 Whitepaper: {row['name']}\")\n",
    "    print(f\"📝 Summary: {row['content_categorization']['summary']}\")\n",
    "    print(f\"🏷️ Products mentioned: {row['content_categorization']['products_mentioned']}\")\n",
    "    print(f\"🧠 Training sections: {row['content_categorization']['sections_about_model_training']}\")\n",
    "    print(\"-\" * 50)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🔍 Step 6: Deep Dive into Training Sections\n",
    "\n",
    "Let's filter and examine only the sections that discuss model training - perfect for researchers analyzing AI methodologies.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 🔍 Filter sections specifically about model training\n",
    "model_training_sections_df = pdf_filtered_details.explode(\"sections\").filter(\n",
    "    fc.col(\"sections\").is_not_null() &\n",
    "    fc.col(\"content_categorization\").is_not_null() &\n",
    "    fc.array_contains(fc.col(\"content_categorization\").sections_about_model_training, fc.col(\"sections\").heading)\n",
    ")\n",
    "\n",
    "print(\"=\"*70)\n",
    "print(\"🧠 MODEL TRAINING SECTIONS ANALYSIS\")\n",
    "print(\"=\"*70)\n",
    "print(f\"📊 Found {model_training_sections_df.count()} sections about model training:\")\n",
    "print()\n",
    "\n",
    "# Display training sections\n",
    "for row in model_training_sections_df.to_pylist():\n",
    "    print(f\"📚 Document: {row['name']}\")\n",
    "    print(f\"📖 Section: {row['sections']['heading']}\")\n",
    "    print(f\"📝 Content preview: {row['sections']['content'][:200]}...\")\n",
    "    print(\"-\" * 50)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🎉 What Just Happened?\n",
    "\n",
    "**You just witnessed the future of document processing:**\n",
    "\n",
    "1. **📄 PDF → Markdown**: AI converted complex PDFs into clean, structured markdown while preserving formatting and hierarchy\n",
    "2. **🧠 Content Analysis**: AI analyzed document structure and extracted key insights like products mentioned and training sections\n",
    "3. **📊 Structured Data**: Transformed unstructured PDFs into queryable, structured data\n",
    "4. **🔍 Smart Filtering**: Automatically identified and extracted only relevant sections\n",
    "\n",
    "**This is semantic AI in action** - understanding document content, not just extracting text. Perfect for research analysis, document management, and content discovery.\n",
    "\n",
    "**Try this with your own PDFs** - just change the `data_dir` path and watch AI work its magic! (A minimal sketch follows this file listing.)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 🧹 Cleanup\n",
    "print(\"🧹 Cleaning up downloaded files...\")\n",
    "shutil.rmtree(data_dir)\n",
    "session.stop()\n",
    "print(\"✅ Cleanup complete!\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
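As the wrap-up cell notes, the pipeline is reusable as-is. A minimal sketch of pointing it at your own documents, reusing only calls already shown in the notebook (the `my_pdfs/` folder name is a hypothetical placeholder):

```python
# Hypothetical folder of your own PDFs; swap in any path you like.
data_dir = "my_pdfs/"

# Same metadata scan and filter as the notebook's Step 2 cell.
pdf_filtered_df = session.read.pdf_metadata(f"{data_dir}/**/*.pdf").filter(
    (fc.col("page_count") > 3) & (~fc.col("is_encrypted"))
)
```

From here, the parsing, structure-extraction, and analysis cells run unchanged.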

src/fenic/_backends/local/semantic_operators/parse_pdf.py

Lines changed: 3 additions & 2 deletions
@@ -24,8 +24,9 @@ class ParsePDF(BaseSingleColumnFilePathOperator[str, str]):
     """Operator for parsing PDF files using language models with PDF parsing capabilities."""
     SYSTEM_PROMPT = jinja2.Template(dedent("""\
         Transcribe the main content of this PDF document to clean, well-formatted markdown.
-        - Output should be raw markdown, don't surround in code fences or backticks.
-        - Preserve the structure, formatting, headings, lists, and any tables to the best of your ability
+        - Output should be raw markdown, don't surround the whole output in code fences or backticks.
+        - For each topic, create a markdown heading. For key terms, use bold text.
+        - Preserve the structure, formatting, headings, lists, table of contents, and any tables using markdown syntax.
         - Format tables as github markdown tables, however:
             - for table headings, immediately add ' |' after the table heading
         {% if multiple_pages %}