This repository contains the source code for the Unstructured Data RAG Platform, a Python-based framework for handling unstructured documents in safety-critical and cybersecurity software development lifecycles.
It ingests documents (PDF, Word, PlantUML, Draw.io, Mermaid), converts them to structured JSON, generates embeddings stored in PGVector, keeps raw artifacts in MinIO, and enables agentic RAG using LangChain DeepAgent.
The Unstructured Data RAG Platform is a Python-based core project for the Software Development Life Cycle (SDLC) that provides AI/ML capabilities for software development workflows. The repository contains modules for LLM clients, intelligent agents, memory management, prompt engineering, document retrieval, skill execution, and various utilities. A Python core is combined with TypeScript for Azure DevOps pipeline configurations.
- Primary Language: Python 3.10-3.12
- Secondary Languages: TypeScript (for Azure pipelines), Shell scripts
- Project Type: AI/ML library and tooling for SDLC workflows
Related repositories include:
- unstructuredDataHandler Documentation (Placeholder)
The plan for the unstructuredDataHandler is described here and will be updated as the project proceeds.
The project specification, plan, and task breakdown are defined in YAML files:
The Unstructured Data RAG Platform provides:
- Ingestion and parsing for priority formats: PDF, Word, PlantUML, Draw.io, Mermaid.
- Unified JSON schema for structured content.
- Storage in Postgres + PGVector for semantic search.
- Raw object storage in MinIO, bi-directionally linked with Postgres entries.
- Context-preserving chunking for embeddings.
- Agentic workflows with LangChain DeepAgent.
- LLM-as-judge to validate consistency and traceability.
- A React frontend for manual review, labeling, and editing.
- Multi-LLM support (OpenAI API, Anthropic, local LLaMA2).
- Structured logging and error capture for debugging and compliance.
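The ingestion side of these capabilities boils down to: upload the raw file to MinIO, parse it into structured JSON, chunk with context, embed, and store the chunks in Postgres/PGVector with a back-link to the raw object. The sketch below illustrates that flow; every helper in it is a placeholder standing in for the real modules under `src/`, whose actual APIs may differ.

```python
# Illustrative ingestion flow. Every helper below is a placeholder, not the repo's real API.
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Chunk:
    doc_id: str
    text: str
    section: str                      # heading path kept so chunks preserve context
    raw_object_key: str               # back-link to the raw artifact stored in MinIO
    embedding: list[float] = field(default_factory=list)


def upload_raw(path: Path) -> str:
    """Stub: would PUT the file into a MinIO bucket and return its object key."""
    return f"raw-docs/{path.name}"


def parse_document(path: Path) -> list[tuple[str, str]]:
    """Stub: would dispatch to the PDF/Word/PlantUML/Draw.io/Mermaid parsers."""
    return [("1 Introduction", path.read_text(errors="ignore")[:1000])]


def embed(text: str) -> list[float]:
    """Stub: would call the configured embedding model."""
    return [float(len(text))]


def store_chunk(chunk: Chunk) -> None:
    """Stub: would INSERT the JSON + vector into Postgres/PGVector."""
    print(f"stored {chunk.doc_id}/{chunk.section} -> {chunk.raw_object_key}")


def ingest(path: Path) -> list[Chunk]:
    raw_key = upload_raw(path)                       # raw file -> MinIO
    chunks = []
    for section, text in parse_document(path):       # structured JSON per section
        chunk = Chunk(path.stem, text, section, raw_key, embed(text))
        store_chunk(chunk)                           # embeddings + metadata -> Postgres
        chunks.append(chunk)
    return chunks
```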
```
             ┌───────────────┐
             │   Frontend    │
             │ (React/Next)  │
             └───────▲───────┘
                     │
                     ▼
             ┌───────────────┐
             │    Backend    │
             │   (FastAPI)   │
             └───────▲───────┘
                     │
    ┌────────────────┼──────────────────┐
    ▼                ▼                  ▼
[Parsers]   [Postgres+PGVector]      [MinIO]
(PDF/Word/  (JSON + embeddings)      (raw files,
 PlantUML/Drawio)                     binaries, images)

   ┌─────────────────────────────────────┐
   │        LangChain DeepAgent          │
   │   Retrieval + Generation + Judge    │
   └─────────────────────────────────────┘
```
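The backend box above is the seam between the React frontend and the retrieval stack. As a rough illustration, a query endpoint could look like the FastAPI sketch below; the route, request/response models, and the `answer_with_agent` helper are assumptions made for this example, not the repository's actual API.

```python
# Hypothetical /query endpoint; route, models, and helper names are illustrative only.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Unstructured Data RAG backend (sketch)")


class QueryRequest(BaseModel):
    question: str
    top_k: int = 5          # how many chunks to retrieve from PGVector


class QueryResponse(BaseModel):
    answer: str
    sources: list[str]      # MinIO object keys of the raw documents used


def answer_with_agent(question: str, top_k: int) -> QueryResponse:
    """Stub: would run retrieval + generation + judge via the DeepAgent workflow."""
    return QueryResponse(answer=f"(stub) no answer for: {question}", sources=[])


@app.post("/query", response_model=QueryResponse)
def query(req: QueryRequest) -> QueryResponse:
    return answer_with_agent(req.question, req.top_k)
```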
The agents module provides the core components for creating AI agents. It includes a flexible FlexibleAgent
(formerly SDLCFlexibleAgent) that can be configured to use different LLM providers (like OpenAI, Gemini, and Ollama)
and a set of tools. The module is designed to be extensible, allowing for the creation of custom agents with specialized skills.
Key components include a planner and an executor (currently placeholders for future development) and a MockAgent for testing and CI.
The agents module integrates LangChain DeepAgent, which handles retrieval from PGVector, answer generation, and LLM-as-judge evaluations, and it supports multiple LLM providers (OpenAI, Anthropic, and local LLaMA2 via Ollama).
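To make the retrieve–generate–judge loop concrete, the sketch below wires three placeholder callables together and regenerates the answer until an (assumed) judge threshold is met. It is only an illustration; the real FlexibleAgent / DeepAgent integration in `src/agents` will differ.

```python
# Illustrative retrieve -> generate -> judge loop; all callables are placeholders,
# not the actual FlexibleAgent/DeepAgent interfaces in src/agents.
from typing import Callable

Retriever = Callable[[str], list[str]]          # question -> relevant chunks
Generator = Callable[[str, list[str]], str]     # question + chunks -> draft answer
Judge = Callable[[str, str, list[str]], float]  # question, answer, chunks -> score 0..1


def answer(question: str, retrieve: Retriever, generate: Generator,
           judge: Judge, min_score: float = 0.7, max_attempts: int = 2) -> str:
    """Regenerate until the LLM-as-judge score clears an (assumed) threshold."""
    chunks = retrieve(question)                     # semantic search over PGVector
    draft = generate(question, chunks)
    for _ in range(max_attempts):
        if judge(question, draft, chunks) >= min_score:
            break
        draft = generate(question, chunks)          # retry with the same context
    return draft


if __name__ == "__main__":
    # Trivial stand-ins so the sketch runs without any LLM configured.
    print(answer(
        "What formats are supported?",
        retrieve=lambda q: ["PDF, Word, PlantUML, Draw.io, Mermaid"],
        generate=lambda q, ctx: f"Based on the docs: {ctx[0]}",
        judge=lambda q, a, ctx: 1.0,
    ))
```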
The parsers module is a powerful utility for parsing various diagram-as-code formats, including PlantUML, Mermaid, and DrawIO.
It extracts structured information from diagram files, such as elements, relationships, and metadata, and stores it in a local SQLite database.
This allows for complex querying, analysis, and export of diagram data. The module is built on a base parser abstraction,
making it easy to extend with new diagram formats. It also includes a suite of utility functions for working with the diagram database,
such as exporting to JSON/CSV, finding orphaned elements, and detecting circular dependencies.
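For instance, circular-dependency detection over the stored relationships is essentially cycle detection on a directed graph. The sketch below shows one way to do it in plain Python over an in-memory edge list; the actual utility works against the SQLite diagram database and its real schema, so treat the function and its signature as illustrative.

```python
# Cycle detection over diagram relationships (source -> target element IDs).
# Plain in-memory sketch; the repo's utility queries the SQLite diagram database instead.
from collections import defaultdict


def find_cycle(edges: list[tuple[str, str]]) -> list[str] | None:
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2           # unvisited / on current path / done
    color: dict[str, int] = defaultdict(int)
    path: list[str] = []

    def dfs(node: str) -> list[str] | None:
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:                      # back edge -> cycle found
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE and (cycle := dfs(nxt)):
                return cycle
        color[node] = BLACK
        path.pop()
        return None

    for node in list(graph):
        if color[node] == WHITE and (cycle := dfs(node)):
            return cycle
    return None


print(find_cycle([("ServiceA", "ServiceB"), ("ServiceB", "ServiceC"), ("ServiceC", "ServiceA")]))
# ['ServiceA', 'ServiceB', 'ServiceC', 'ServiceA']
```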
The parsers module processes multiple formats:
- PDF & Word via Unstructured.io
- PlantUML, Draw.io, Mermaid via custom parsers
Extracted text, metadata, and relationships are normalized into JSON, stored in Postgres, and linked to raw files in MinIO.
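As an illustration of what such a normalized record might carry, the dictionary below sketches a single PlantUML element together with its MinIO back-link. The field names are assumptions for this example; the authoritative JSON schema lives in the project documentation.

```python
# Hypothetical shape of a normalized record; field names are illustrative only --
# the authoritative JSON schema is defined in the project docs.
import json

record = {
    "doc_id": "auth-sequence-diagram",
    "source_format": "plantuml",
    "minio_object_key": "raw-docs/auth_sequence.puml",   # raw artifact in MinIO
    "element": {
        "id": "AuthService",
        "type": "participant",
        "relationships": [
            {"target": "UserDB", "kind": "calls", "label": "validate credentials"},
        ],
    },
    "text": "AuthService validates credentials against UserDB.",
    "embedding_id": 4711,            # row in the PGVector-backed embeddings table
}

print(json.dumps(record, indent=2))
```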
Q1: Why not store raw docs directly in Postgres? A1: To separate structured vs. unstructured storage. Raw files live in MinIO; Postgres stores structured JSON + embeddings with bi-directional links.
Q2: Can I use my own LLM? A2: Yes. The platform supports OpenAI, Anthropic, local models (via Ollama), or self-hosted LLaMA2.
All project documentation is located at softwaremodule-docs. Architecture, schema, and developer guides are maintained here.
If you would like to contribute to the documentation, please submit a pull request on the unstructuredDataHandler Documentation repository.
📁 specs/ – Project specification
📁 plan/ – High-level plan
📁 tasks/ – Task breakdown
📁 config/ – YAML config for models, prompts, logging
📁 data/ – Prompts, embeddings, and other dynamic content
📁 examples/ – Minimal scripts to test key features
📁 notebooks/ – Quick experiments and prototyping
📁 src/ – The core engine; all logic lives here (./src/README.md)
📁 tests/ – Unit, integration, and end-to-end tests
📁 doc/ – Architecture, schema, guides, and API documentation (doc/codeDocs/)
📁 deployments/ – Docker, Kubernetes, Helm, monitoring
- Track prompt versions and results
- Separate configs using YAML files
- Maintain separation between model clients
- Structure code by clear module boundaries
- Cache responses to reduce latency and cost
- Handle errors with custom exceptions
- Use notebooks for rapid testing and iteration
- Monitor API usage and set rate limits
- Keep code and docs in sync
- Normalize all parsed content into JSON schema.
- Chunk documents with context preservation (see the sketch after this list).
- Monitor agents via LangSmith.
- Store only raw files in MinIO, not Git.
- Run CI/CD pipelines for linting, testing, and type checks.
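One straightforward reading of the chunking guideline above is to split each section on a character budget while repeating the section heading in every chunk and overlapping neighbouring chunks. The sketch below does exactly that; the chunk size and overlap are arbitrary assumptions, not the project's tuned values.

```python
# Context-preserving chunking sketch: keep the heading with every chunk and
# overlap neighbours so no statement is cut off from its context.
# The chunk size and overlap below are arbitrary, not the project's tuned values.
def chunk_section(heading: str, text: str, max_chars: int = 800, overlap: int = 200) -> list[str]:
    chunks = []
    step = max_chars - overlap
    for start in range(0, max(len(text), 1), step):
        body = text[start:start + max_chars]
        if body.strip():
            chunks.append(f"{heading}\n\n{body}")   # heading repeated for context
    return chunks


if __name__ == "__main__":
    pieces = chunk_section("3.2 Threat Model", "A long requirements paragraph... " * 100)
    print(len(pieces), "chunks, first begins with:", pieces[0][:40])
```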
- Clone the repository.
- Set up your Python environment. A Python version between 3.10 and 3.12 is recommended.
- Install dependencies. The project's dependencies are split into several files. For general development, you will need `requirements-dev.txt`: `pip install -r requirements-dev.txt`
- Set up your environment variables. Copy the `.env.template` file to `.env` and fill in the required API keys for the LLM providers you want to use.
- Explore the examples. The `examples/` directory contains scripts that demonstrate the key features of the project.
- Experiment in notebooks. The `notebooks/` directory is a great place to start experimenting with the codebase.
- Use modular structure
- Test components early
- Track with version control
- Keep datasets fresh
- Keep documentation updated
- Monitor API usage and limits
- Keep LLM API usage monitored.
- Keep specs/plans/tasks updated in version control.
- Test parsers with representative docs.
- Use notebooks for quick experiments.
- Run type checks (`mypy`) and linting (`ruff`) before PRs.
- `specs/spec.yaml` – System specification
- `plan/plan.yaml` – Roadmap & phases
- `tasks/tasks.yaml` – Task breakdown with dependencies
- `requirements.txt` – Core package dependencies for the project
- `requirements-dev.txt` – Dependencies for development and testing
- `requirements-docs.txt` – Dependencies for generating documentation
- `AGENTS.md` – Instructions for AI agents working with this repository
- `README.md` – Project overview and usage
- `Dockerfile` – Container build instructions
We provide a small helper script that creates an isolated virtualenv and runs the test suite.
Run the full test suite locally:
```bash
./scripts/run-tests.sh
```

Or run just the deepagent tests (fast):

```bash
./scripts/run-tests.sh test/unit -k deepagent
```

You can also use the Makefile targets:

```bash
make test
make lint
make typecheck
make format
make coverage       # terminal coverage summary
make coverage-html  # generate HTML report in ./htmlcov/
make ci             # tests with coverage + ruff + mypy + pylint
```

- Preferred (isolated venv): use `./scripts/run-tests.sh`. It creates `.venv_ci`, pins pytest, and runs with `PYTHONPATH` set correctly.
- Alternative (your own venv):

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -U pip
  pip install -r requirements-dev.txt
  PYTHONPATH=. python -m pytest test/ -v
  ```

To collect coverage:

- Isolated venv script (add flags after the script path):

  ```bash
  ./scripts/run-tests.sh --cov=src --cov-report=term-missing
  ```

- Local venv (after installing `requirements-dev.txt`):

  ```bash
  PYTHONPATH=. python -m pytest test/ --cov=src --cov-report=term-missing
  ```

- Makefile shortcuts:
  - `make coverage` – terminal summary
  - `make coverage-html` – generates an HTML report in `./htmlcov/`

For linting and type checks:

- Makefile shortcuts:

  ```bash
  make lint
  make lint-fix   # ruff check with autofix
  make typecheck  # mypy with router exclusion
  make format     # ruff formatter
  ```

- Manual (useful in CI or local venv):

  ```bash
  python -m pylint src/ --exit-zero
  python -m mypy src/ --ignore-missing-imports --exclude="src/llm/router.py"
  ```
Note: The mypy exclusion for src/llm/router.py avoids a duplicate module conflict with src/fallback/router.py during type analysis.
We are excited to work with the community to build and enhance this project.
BEFORE you start work on a feature/fix, please read & follow our Contributor's Guide to help avoid any wasted or duplicate effort.
To keep code quality consistent, we provide pre-commit hooks for ruff (lint+format) and mypy; and a pre-push hook that runs tests with coverage.
- Install dev deps (once): `pip install -r requirements-dev.txt`
- Install hooks (once): `pre-commit install --install-hooks`
- Optional: enable the pre-push test runner: `pre-commit install --hook-type pre-push`
Hooks configured in `.pre-commit-config.yaml`:
- ruff (with autofix) and ruff-format
- mypy with the router exclusion
- pre-push: `./scripts/run-tests.sh --cov=src --cov-report=term-missing`
The easiest way to communicate with the team is via GitHub issues.
Please file new issues, feature requests and suggestions, but DO search for similar open/closed preexisting issues before creating a new issue.
If you would like to ask a question that you feel doesn't warrant an issue (yet), please reach out to us via email: info@softwaredevlabs.com
Please review these brief docs below about our coding practices.
If you find something missing from these docs, feel free to contribute to any of our documentation files anywhere in the repository (or write some new ones!)
This is a work in progress as we learn what we'll need to provide people in order to be effective contributors to our project.
This project has adopted the Code of Conduct. For more information see the Code of Conduct or contact info@softwaredevlabs.com with any additional questions or comments.