
Commit 4667fdf

feat: tweak pdf parser for corner cases and add 120s demo

1 parent 243194d · commit 4667fdf

File tree: 3 files changed, +387 −4 lines changed
Lines changed: 381 additions & 0 deletions
@@ -0,0 +1,381 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run this Fenic demo, click **Runtime** > **Run all**.\n",
    "\n",
    "<div class=\"align-center\">\n",
    "<a href=\"https://github.com/typedef-ai/fenic\"><img src=\"https://github.com/typedef-ai/fenic/blob/main/docs/images/typedef-fenic-logo-github-yellow.png?raw=true\" height=\"50\"></a>\n",
    "<a href=\"https://discord.gg/GdqF3J7huR\"><img src=\"https://github.com/typedef-ai/fenic/blob/main/docs/images/join-the-discord.png?raw=true\" height=\"50\"></a>\n",
    "<a href=\"https://docs.fenic.ai/latest/\"><img src=\"https://github.com/typedef-ai/fenic/blob/main/docs/images/documentation.png?raw=true\" height=\"50\"></a>\n",
    "\n",
    "Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/typedef-ai/fenic).\n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip uninstall -y sklearn-compat ibis-framework imbalanced-learn google-genai\n",
    "!pip install polars==1.30.0\n",
    "!pip install huggingface_hub\n",
    "# === OPENAI (Default) ===\n",
    "#!pip install fenic\n",
    "# === GOOGLE GEMINI (used by this demo for PDF parsing) ===\n",
    "!pip install \"fenic[google]\"\n",
    "# === ANTHROPIC CLAUDE ===\n",
    "#!pip install \"fenic[anthropic]\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import getpass\n",
    "\n",
    "# 🔌 MULTI-PROVIDER SETUP - set keys for the providers in your semantic config\n",
    "# The session below uses both OpenAI and Google models, so both keys are needed\n",
    "\n",
    "# === OPENAI ===\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n",
    "\n",
    "# === GOOGLE GEMINI ===\n",
    "os.environ[\"GOOGLE_API_KEY\"] = getpass.getpass(\"Google API Key:\")\n",
    "\n",
    "# === ANTHROPIC CLAUDE (not used in this demo) ===\n",
    "#os.environ[\"ANTHROPIC_API_KEY\"] = getpass.getpass(\"Anthropic API Key:\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 📄 PDF Processing & Analysis\n",
    "\n",
    "**Hook:** *\"Transform PDFs into structured, queryable data in seconds\"*\n",
    "\n",
    "Research papers, whitepapers, technical documents - PDFs contain valuable information but are notoriously difficult to work with. Traditional PDF processing requires complex parsing, layout analysis, and manual extraction. Watch AI-powered PDF processing convert unstructured documents into structured, searchable data.\n",
    "\n",
    "**What you'll see in this 2-minute demo:**\n",
    "- 📚 **PDF to Markdown** - Intelligent conversion preserving structure and formatting\n",
    "- 🧠 **Content Categorization** - Automatic classification of document sections\n",
    "- 📊 **Structured Extraction** - Products, training methods, key topics identified\n",
    "- ⚡ **Batch Processing** - Multiple PDFs processed and analyzed efficiently\n",
    "\n",
    "Perfect for research analysis, document management, and content discovery.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import fenic as fc\n",
    "from pydantic import BaseModel, Field\n",
    "from typing import List\n",
    "import huggingface_hub as hf\n",
    "import shutil\n",
    "\n",
    "# ⚡ Configure for PDF processing with multiple models\n",
    "session = fc.Session.get_or_create(fc.SessionConfig(\n",
    "    app_name=\"pdf_processing_demo\",\n",
    "    semantic=fc.SemanticConfig(\n",
    "        language_models={\n",
    "            \"parse_model\": fc.GoogleDeveloperLanguageModel(\n",
    "                model_name=\"gemini-2.5-flash-lite\",\n",
    "                rpm=500,\n",
    "                tpm=1_000_000,\n",
    "            ),\n",
    "            \"cheap_model\": fc.OpenAILanguageModel(\n",
    "                model_name=\"gpt-5-nano\",\n",
    "                rpm=500,\n",
    "                tpm=200_000,\n",
    "            ),\n",
    "        },\n",
    "        default_language_model=\"cheap_model\"\n",
    "    )\n",
    "))\n",
    "\n",
    "print(\"✅ PDF processing session configured with Gemini for parsing and GPT-5-nano for analysis\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📄 Step 1: Download Sample PDFs\n",
    "\n",
    "Let's grab some real whitepapers to process - these are complex technical documents perfect for demonstrating AI-powered PDF analysis.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 📚 Download sample whitepapers from Hugging Face\n",
    "data_dir = \"sample_pdfs/\"\n",
    "os.makedirs(data_dir, exist_ok=True)\n",
    "\n",
    "repo_id = \"typedef-ai/pdf_data\"\n",
    "files = hf.list_repo_files(repo_id=repo_id, repo_type=\"dataset\")\n",
    "\n",
    "print(f\"📥 Downloading whitepapers from {repo_id}...\")\n",
    "for file in files:\n",
    "    if file.startswith(\"whitepapers/\"):\n",
    "        hf.hf_hub_download(repo_id=repo_id, repo_type=\"dataset\", filename=file, local_dir=data_dir)\n",
    "        print(f\"  ✅ Downloaded: {file}\")\n",
    "\n",
    "print(f\"📁 PDFs saved to: {data_dir}\")\n"
   ]
  },
143+
{
144+
"cell_type": "markdown",
145+
"metadata": {},
146+
"source": [
147+
"## 🧠 Step 2: AI-Powered PDF to Markdown Conversion\n",
148+
"\n",
149+
"Now the magic happens - watch AI convert complex PDFs into clean, structured markdown while preserving all the important formatting and hierarchy.\n",
150+
"\n",
151+
"First we can filter which documents we parse based on the PDF metadata. In this case, we're only interested in longer, unencrypted documents.\n"
152+
]
153+
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdf_filtered_df = session.read.pdf_metadata(f\"{data_dir}/**/*.pdf\").filter(\n",
    "    (fc.col(\"page_count\") > 3) & (~fc.col(\"is_encrypted\"))\n",
    ")\n",
    "\n",
    "print(f\"📊 Found {pdf_filtered_df.count()} valid PDFs to process\")\n",
    "pdf_filtered_df.select(\"title\", \"page_count\", \"file_path\").show()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 🚀 Convert PDFs to Markdown using AI\n",
    "print(\"🤖 Converting PDFs to markdown using Gemini...\")\n",
    "pdf_to_md_content = pdf_filtered_df.with_column(\n",
    "    \"markdown_content\",\n",
    "    fc.semantic.parse_pdf(fc.col(\"file_path\"), model_alias=\"parse_model\")\n",
    ").cache()\n",
    "\n",
    "print(\"✅ PDF to Markdown conversion complete!\")\n",
    "print(f\"📄 Processed {pdf_to_md_content.count()} documents\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📊 Step 3: Extract Document Structure\n",
    "\n",
    "Fenic's markdown functions can extract structure from the converted content. Let's break each document into sections and generate a table of contents.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 📋 Extract document structure and table of contents\n",
    "pdf_sections_df = pdf_to_md_content.select(\n",
    "    fc.when(\n",
    "        fc.col(\"title\").is_not_null(),\n",
    "        fc.col(\"title\")\n",
    "    ).otherwise(\n",
    "        fc.text.split_part(fc.col(\"file_path\"), \"/\", -1)\n",
    "    ).alias(\"name\"),\n",
    "    \"markdown_content\",\n",
    "    # Extract sections up to level 3 headers\n",
    "    fc.markdown.extract_header_chunks(fc.col(\"markdown_content\"), header_level=3).alias(\"sections\"),\n",
    "    # Generate table of contents\n",
    "    fc.markdown.generate_toc(fc.col(\"markdown_content\")).alias(\"toc\")\n",
    ")\n",
    "\n",
    "print(\"📊 Document structure extracted:\")\n",
    "pdf_sections_df.select(\"name\", \"sections\", \"toc\").show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧠 Step 4: AI-Powered Content Analysis\n",
    "\n",
    "Now let's use AI to analyze the content and extract structured insights - what products are mentioned, what sections discuss model training, and more.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 🎯 Define content categorization schema\n",
    "class PDFContentCategorization(BaseModel):\n",
    "    \"\"\"AI-powered PDF content categorization.\"\"\"\n",
    "    summary: str = Field(description=\"Brief one sentence summary of the PDF given its table of contents\")\n",
    "    sections_about_model_training: List[str] = Field(description=\"List of headings that are specifically about model training\")\n",
    "    products_mentioned: List[str] = Field(description=\"All product names mentioned in the PDF table of contents\")\n",
    "\n",
    "print(\"🎯 Content categorization schema defined\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 🤖 AI-powered content analysis using table of contents\n",
    "pdf_filtered_details = pdf_sections_df.with_column(\n",
    "    \"content_categorization\",\n",
    "    fc.semantic.extract(fc.col(\"toc\").cast(fc.StringType), PDFContentCategorization, model_alias=\"cheap_model\")\n",
    ").cache()\n",
    "\n",
    "print(\"✅ AI content analysis complete!\")\n",
    "\n",
    "#pdf_filtered_details.to_polars()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📊 Step 5: Display Results\n",
    "\n",
    "Let's see what insights AI extracted from our PDFs - summaries, products mentioned, and training-related sections.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 📊 Display whitepaper summaries and insights\n",
    "print(\"=\"*70)\n",
    "print(\"📄 WHITEPAPER ANALYSIS RESULTS\")\n",
    "print(\"=\"*70)\n",
    "\n",
    "for row in pdf_filtered_details.to_pylist():\n",
    "    print(f\"\\n📚 Whitepaper: {row['name']}\")\n",
    "    print(f\"📝 Summary: {row['content_categorization']['summary']}\")\n",
    "    print(f\"🏷️ Products mentioned: {row['content_categorization']['products_mentioned']}\")\n",
    "    print(f\"🧠 Training sections: {row['content_categorization']['sections_about_model_training']}\")\n",
    "    print(\"-\" * 50)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🔍 Step 6: Deep Dive into Training Sections\n",
    "\n",
    "Let's filter and examine only the sections that discuss model training - perfect for researchers analyzing AI methodologies.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 🔍 Filter sections specifically about model training\n",
    "model_training_sections_df = pdf_filtered_details.explode(\"sections\").filter(\n",
    "    fc.col(\"sections\").is_not_null() &\n",
    "    fc.col(\"content_categorization\").is_not_null() &\n",
    "    fc.array_contains(fc.col(\"content_categorization\").sections_about_model_training, fc.col(\"sections\").heading)\n",
    ")\n",
    "\n",
    "print(\"=\"*70)\n",
    "print(\"🧠 MODEL TRAINING SECTIONS ANALYSIS\")\n",
    "print(\"=\"*70)\n",
    "print(f\"📊 Found {model_training_sections_df.count()} sections about model training:\")\n",
    "print()\n",
    "\n",
    "# Display training sections\n",
    "for row in model_training_sections_df.to_pylist():\n",
    "    print(f\"📚 Document: {row['name']}\")\n",
    "    print(f\"📖 Section: {row['sections']['heading']}\")\n",
    "    print(f\"📝 Content preview: {row['sections']['content'][:200]}...\")\n",
    "    print(\"-\" * 50)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🎉 What Just Happened?\n",
    "\n",
    "**You just witnessed the future of document processing:**\n",
    "\n",
    "1. **📄 PDF → Markdown**: AI converted complex PDFs into clean, structured markdown while preserving formatting and hierarchy\n",
    "2. **🧠 Content Analysis**: AI analyzed document structure and extracted key insights like products mentioned and training sections\n",
    "3. **📊 Structured Data**: Transformed unstructured PDFs into queryable, structured data\n",
    "4. **🔍 Smart Filtering**: Automatically identified and extracted only relevant sections\n",
    "\n",
    "**This is semantic AI in action** - understanding document content, not just extracting text. Perfect for research analysis, document management, and content discovery.\n",
    "\n",
    "**Try this with your own PDFs** - just change the `data_dir` path and watch AI work its magic! (A minimal sketch follows this file listing.)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 🧹 Cleanup\n",
    "print(\"🧹 Cleaning up downloaded files...\")\n",
    "shutil.rmtree(data_dir)\n",
    "session.stop()\n",
    "print(\"✅ Cleanup complete!\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
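As the wrap-up cell notes, the pipeline is reusable as-is. A minimal sketch of pointing it at your own documents, reusing only calls already shown in the notebook (the `my_pdfs/` folder name is a hypothetical placeholder):

```python
# Hypothetical folder of your own PDFs; swap in any path you like.
data_dir = "my_pdfs/"

# Same metadata scan and filter as the notebook's Step 2 cell.
pdf_filtered_df = session.read.pdf_metadata(f"{data_dir}/**/*.pdf").filter(
    (fc.col("page_count") > 3) & (~fc.col("is_encrypted"))
)
```

From here, the parsing, structure-extraction, and analysis cells run unchanged.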

src/fenic/_backends/local/semantic_operators/parse_pdf.py

Lines changed: 3 additions & 2 deletions
@@ -24,8 +24,9 @@ class ParsePDF(BaseSingleColumnFilePathOperator[str, str]):
     """Operator for parsing PDF files using language models with PDF parsing capabilities."""
     SYSTEM_PROMPT = jinja2.Template(dedent("""\
         Transcribe the main content of this PDF document to clean, well-formatted markdown.
-        - Output should be raw markdown, don't surround in code fences or backticks.
-        - Preserve the structure, formatting, headings, lists, and any tables to the best of your ability
+        - Output should be raw markdown, don't surround the whole output in code fences or backticks.
+        - For each topic, create a markdown heading. For key terms, use bold text.
+        - Preserve the structure, formatting, headings, lists, table of contents, and any tables using markdown syntax.
         - Format tables as github markdown tables, however:
             - for table headings, immediately add ' |' after the table heading
         {% if multiple_pages %}