Documentation Scraper

A Python script to crawl and scrape documentation websites, converting their content into a single, consolidated Markdown file. This is useful for offline reading, archiving, or feeding documentation into other systems.

Features

Site Crawler: Discovers all unique, internal pages of a documentation site starting from a given base URL.
Content Extraction: Extracts the main content from each page, preferring the <main> element and falling back to the article, .main-content, or #content selectors.
HTML to Markdown: Converts the extracted HTML content into clean Markdown.
Consolidated Output: Combines content from all scraped pages into a single Markdown file.
Dynamic Filename Generation: Creates a descriptive filename for the output Markdown file based on the source URL.
Error Handling: Gracefully handles network errors and issues during page processing.
Command-Line Interface: Takes the target URL as a single command-line argument.
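
The Site Crawler feature depends on deciding which links count as unique, internal pages. The sketch below shows one way to make that decision with only the standard library. The function name `internal_links` is hypothetical, and the rule that a link must stay under the base URL's path is an assumption, not confirmed behaviour of this package.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(base_url, page_url, hrefs):
    """Return the unique links from one page that stay inside base_url.

    Hypothetical helper for illustration; not part of scrape-api-docs.
    """
    base = urlparse(base_url)
    found = []
    for href in hrefs:
        # Resolve relative links against the page and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        same_site = (
            parsed.scheme in ("http", "https") and parsed.netloc == base.netloc
        )
        # Assumption: "internal" also means the path stays under the base path.
        under_base = parsed.path.startswith(base.path)
        if same_site and under_base and absolute not in found:
            found.append(absolute)
    return found
```

Dropping fragments before the uniqueness check keeps `page/#install` and `page/#usage` from being queued as two separate pages.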

Requirements

Python 3.9 or higher
Poetry (for dependency management and installation)

Installation

Using Poetry (Recommended)

Clone the repository:

```
git clone https://github.com/thepingdoctor/scrape-api-docs
cd scrape-api-docs
```

Install with Poetry:
```
poetry install
```
Activate the virtual environment:
```
poetry shell
```

Using pip (Alternative)

You can also install directly from the repository:

```
pip install git+https://github.com/thepingdoctor/scrape-api-docs.git
```

Or install in development mode:

```
git clone https://github.com/thepingdoctor/scrape-api-docs
cd scrape-api-docs
pip install -e .
```

Legacy Method (Direct Script)

If you prefer to use the standalone script without packaging:

```
python scrape.py <URL>
```

Usage

Web Interface (Streamlit UI)

Launch the interactive web interface:

```
scrape-docs-ui
```

This opens a web interface in your browser with:

📝 Easy URL input with validation
⚙️ Advanced configuration options (timeout, max pages, custom filename)
📊 Real-time progress tracking with visual feedback
📄 Results preview and downloadable output
⚠️ Detailed error reporting
🎨 Modern, user-friendly interface

Or run with Streamlit directly:

```
streamlit run src/scrape_api_docs/streamlit_app.py
```

For detailed UI usage instructions, see the Streamlit UI Guide (STREAMLIT_UI_GUIDE.md).

Command-Line Interface

For quick command-line usage:

```
scrape-docs <URL>
```

Example:

```
scrape-docs https://netboxlabs.com/docs/netbox/
```
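
The output filename is derived from the source URL. The exact naming scheme is not documented here, so the following is only a minimal sketch of how such a name could be built. The function `output_filename` and the underscore-based slug are assumptions for illustration.

```python
import re
from urllib.parse import urlparse


def output_filename(url):
    """Build a filesystem-safe Markdown filename from a documentation URL.

    Hypothetical helper; the real tool's naming scheme may differ.
    """
    parsed = urlparse(url)
    # Join host and path, then collapse every run of non-alphanumerics to "_".
    raw = parsed.netloc + parsed.path
    slug = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_").lower()
    # Fall back to a generic name if the URL yields an empty slug.
    return f"{slug or 'documentation'}.md"


print(output_filename("https://netboxlabs.com/docs/netbox/"))
# prints: netboxlabs_com_docs_netbox.md
```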

Development

Setting up for development

Clone the repository and install dependencies:

```
git clone https://github.com/thepingdoctor/scrape-api-docs
cd scrape-api-docs
poetry install
```

Run tests (when available):
```
poetry run pytest
```
Format code with Black:
```
poetry run black src/
```
Run linting:
```
poetry run flake8 src/
```

Building and Publishing

Build the package:

```
poetry build
```

Publish to PyPI (requires credentials):

```
poetry publish
```

Disclaimer

This script is designed for legitimate purposes, such as the archival of API documentation for personal or internal team use. Users are responsible for ensuring they have the right to scrape any website and must comply with the website's terms of service and robots.txt file. The author is not responsible for any misuse of this script.
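
As a starting point for checking robots.txt before scraping, the standard library's `urllib.robotparser` can evaluate a site's rules against a URL. This is a sketch only: the helper name `allowed_by_robots` and the user agent string are hypothetical, and nothing here implies that the scraper performs this check itself.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url.

    Hypothetical helper for illustration; not part of scrape-api-docs.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Example rules, written inline so the sketch runs without network access.
rules = ["User-agent: *", "Disallow: /private/"]
print(allowed_by_robots(rules, "docs-scraper", "https://example.com/docs/"))
print(allowed_by_robots(rules, "docs-scraper", "https://example.com/private/x"))
```

To check a live site instead, call `set_url()` with the site's robots.txt address followed by `read()` on the parser, in place of `parse()`.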

This script is provided "as is" without warranty of any kind, express or implied, and no support is provided.
