This repository contains the source code for the Unstructured Data RAG Platform, a Python-based framework for handling unstructured documents in safety-critical and cybersecurity software development lifecycles.
It ingests documents (PDF, Word, PlantUML, Draw.io, Mermaid), converts them to structured JSON, generates embeddings stored in PGVector, keeps raw artifacts in MinIO, and enables agentic RAG using LangChain DeepAgent.
The Unstructured Data RAG Platform is a Python-based core project for the Software Development Life Cycle (SDLC) that provides AI/ML capabilities for software development workflows. The repository contains modules for LLM clients, intelligent agents, memory management, prompt engineering, document retrieval, skill execution, and various utilities. A Python core is combined with TypeScript for Azure DevOps pipeline configurations.
- Primary Language: Python 3.10-3.12
- Secondary Languages: TypeScript (for Azure pipelines), Shell scripts
- Project Type: AI/ML library and tooling for SDLC workflows
Related repositories include:
- unstructuredDataHandler Documentation (Placeholder)
The plan for the unstructuredDataHandler is described here and will be updated as the project proceeds.
The project specification, plan, and task breakdown are defined in YAML files:
The Unstructured Data RAG Platform provides:
- Ingestion and parsing for priority formats: PDF, Word, PlantUML, Draw.io, Mermaid.
- Unified JSON schema for structured content.
- Storage in Postgres + PGVector for semantic search.
- Raw object storage in MinIO, bi-directionally linked with Postgres entries.
- Context-preserving chunking for embeddings.
- Agentic workflows with LangChain DeepAgent.
- LLM-as-judge to validate consistency and traceability.
- A React frontend for manual review, labeling, and editing.
- Multi-LLM support (OpenAI API, Anthropic, local LLaMA2).
- Structured logging and error capture for debugging and compliance.
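The ingestion side of these capabilities boils down to: upload the raw file to MinIO, parse it into structured JSON, chunk with context, embed, and store the chunks in Postgres/PGVector with a back-link to the raw object. The sketch below illustrates that flow; every helper in it is a placeholder standing in for the real modules under `src/`, whose actual APIs may differ.

```python
# Illustrative ingestion flow. Every helper below is a placeholder, not the repo's real API.
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Chunk:
    doc_id: str
    text: str
    section: str                      # heading path kept so chunks preserve context
    raw_object_key: str               # back-link to the raw artifact stored in MinIO
    embedding: list[float] = field(default_factory=list)


def upload_raw(path: Path) -> str:
    """Stub: would PUT the file into a MinIO bucket and return its object key."""
    return f"raw-docs/{path.name}"


def parse_document(path: Path) -> list[tuple[str, str]]:
    """Stub: would dispatch to the PDF/Word/PlantUML/Draw.io/Mermaid parsers."""
    return [("1 Introduction", path.read_text(errors="ignore")[:1000])]


def embed(text: str) -> list[float]:
    """Stub: would call the configured embedding model."""
    return [float(len(text))]


def store_chunk(chunk: Chunk) -> None:
    """Stub: would INSERT the JSON + vector into Postgres/PGVector."""
    print(f"stored {chunk.doc_id}/{chunk.section} -> {chunk.raw_object_key}")


def ingest(path: Path) -> list[Chunk]:
    raw_key = upload_raw(path)                       # raw file -> MinIO
    chunks = []
    for section, text in parse_document(path):       # structured JSON per section
        chunk = Chunk(path.stem, text, section, raw_key, embed(text))
        store_chunk(chunk)                           # embeddings + metadata -> Postgres
        chunks.append(chunk)
    return chunks
```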
```
             ┌───────────────┐
             │   Frontend    │
             │ (React/Next)  │
             └───────▲───────┘
                     │
                     ▼
             ┌───────────────┐
             │    Backend    │
             │   (FastAPI)   │
             └───────▲───────┘
                     │
    ┌────────────────┼──────────────────┐
    ▼                ▼                  ▼
[Parsers]   [Postgres+PGVector]      [MinIO]
(PDF/Word/  (JSON + embeddings)      (raw files,
 PlantUML/Drawio)                     binaries, images)

   ┌─────────────────────────────────────┐
   │        LangChain DeepAgent          │
   │   Retrieval + Generation + Judge    │
   └─────────────────────────────────────┘
```
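The backend box above is the seam between the React frontend and the retrieval stack. As a rough illustration, a query endpoint could look like the FastAPI sketch below; the route, request/response models, and the `answer_with_agent` helper are assumptions made for this example, not the repository's actual API.

```python
# Hypothetical /query endpoint; route, models, and helper names are illustrative only.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Unstructured Data RAG backend (sketch)")


class QueryRequest(BaseModel):
    question: str
    top_k: int = 5          # how many chunks to retrieve from PGVector


class QueryResponse(BaseModel):
    answer: str
    sources: list[str]      # MinIO object keys of the raw documents used


def answer_with_agent(question: str, top_k: int) -> QueryResponse:
    """Stub: would run retrieval + generation + judge via the DeepAgent workflow."""
    return QueryResponse(answer=f"(stub) no answer for: {question}", sources=[])


@app.post("/query", response_model=QueryResponse)
def query(req: QueryRequest) -> QueryResponse:
    return answer_with_agent(req.question, req.top_k)
```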
The agents module provides the core components for creating AI agents. It includes a flexible FlexibleAgent
(formerly SDLCFlexibleAgent) that can be configured to use different LLM providers (like OpenAI, Gemini, and Ollama)
and a set of tools. The module is designed to be extensible, allowing for the creation of custom agents with specialized skills.
Key components include a planner and an executor (currently placeholders for future development) and a MockAgent for testing and CI.
The agents module integrates LangChain DeepAgent, which handles retrieval from PGVector, answer generation, and LLM-as-judge evaluations, and it supports multiple LLM providers (OpenAI, Anthropic, and local LLaMA2 via Ollama).
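To make the retrieve–generate–judge loop concrete, the sketch below wires three placeholder callables together and regenerates the answer until an (assumed) judge threshold is met. It is only an illustration; the real FlexibleAgent / DeepAgent integration in `src/agents` will differ.

```python
# Illustrative retrieve -> generate -> judge loop; all callables are placeholders,
# not the actual FlexibleAgent/DeepAgent interfaces in src/agents.
from typing import Callable

Retriever = Callable[[str], list[str]]          # question -> relevant chunks
Generator = Callable[[str, list[str]], str]     # question + chunks -> draft answer
Judge = Callable[[str, str, list[str]], float]  # question, answer, chunks -> score 0..1


def answer(question: str, retrieve: Retriever, generate: Generator,
           judge: Judge, min_score: float = 0.7, max_attempts: int = 2) -> str:
    """Regenerate until the LLM-as-judge score clears an (assumed) threshold."""
    chunks = retrieve(question)                     # semantic search over PGVector
    draft = generate(question, chunks)
    for _ in range(max_attempts):
        if judge(question, draft, chunks) >= min_score:
            break
        draft = generate(question, chunks)          # retry with the same context
    return draft


if __name__ == "__main__":
    # Trivial stand-ins so the sketch runs without any LLM configured.
    print(answer(
        "What formats are supported?",
        retrieve=lambda q: ["PDF, Word, PlantUML, Draw.io, Mermaid"],
        generate=lambda q, ctx: f"Based on the docs: {ctx[0]}",
        judge=lambda q, a, ctx: 1.0,
    ))
```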
The parsers module is a powerful utility for parsing various diagram-as-code formats, including PlantUML, Mermaid, and DrawIO.
It extracts structured information from diagram files, such as elements, relationships, and metadata, and stores it in a local SQLite database.
This allows for complex querying, analysis, and export of diagram data. The module is built on a base parser abstraction,
making it easy to extend with new diagram formats. It also includes a suite of utility functions for working with the diagram database,
such as exporting to JSON/CSV, finding orphaned elements, and detecting circular dependencies.
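For instance, circular-dependency detection over the stored relationships is essentially cycle detection on a directed graph. The sketch below shows one way to do it in plain Python over an in-memory edge list; the actual utility works against the SQLite diagram database and its real schema, so treat the function and its signature as illustrative.

```python
# Cycle detection over diagram relationships (source -> target element IDs).
# Plain in-memory sketch; the repo's utility queries the SQLite diagram database instead.
from collections import defaultdict


def find_cycle(edges: list[tuple[str, str]]) -> list[str] | None:
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2           # unvisited / on current path / done
    color: dict[str, int] = defaultdict(int)
    path: list[str] = []

    def dfs(node: str) -> list[str] | None:
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:                      # back edge -> cycle found
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE and (cycle := dfs(nxt)):
                return cycle
        color[node] = BLACK
        path.pop()
        return None

    for node in list(graph):
        if color[node] == WHITE and (cycle := dfs(node)):
            return cycle
    return None


print(find_cycle([("ServiceA", "ServiceB"), ("ServiceB", "ServiceC"), ("ServiceC", "ServiceA")]))
# ['ServiceA', 'ServiceB', 'ServiceC', 'ServiceA']
```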
The parsers module processes multiple formats:
- PDF & Word via Unstructured.io
- PlantUML, Draw.io, Mermaid via custom parsers
Extracted text, metadata, and relationships are normalized into JSON, stored in Postgres, and linked to raw files in MinIO.
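As an illustration of what such a normalized record might carry, the dictionary below sketches a single PlantUML element together with its MinIO back-link. The field names are assumptions for this example; the authoritative JSON schema lives in the project documentation.

```python
# Hypothetical shape of a normalized record; field names are illustrative only --
# the authoritative JSON schema is defined in the project docs.
import json

record = {
    "doc_id": "auth-sequence-diagram",
    "source_format": "plantuml",
    "minio_object_key": "raw-docs/auth_sequence.puml",   # raw artifact in MinIO
    "element": {
        "id": "AuthService",
        "type": "participant",
        "relationships": [
            {"target": "UserDB", "kind": "calls", "label": "validate credentials"},
        ],
    },
    "text": "AuthService validates credentials against UserDB.",
    "embedding_id": 4711,            # row in the PGVector-backed embeddings table
}

print(json.dumps(record, indent=2))
```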
Q1: Why not store raw docs directly in Postgres? A1: To separate structured vs. unstructured storage. Raw files live in MinIO; Postgres stores structured JSON + embeddings with bi-directional links.
Q2: Can I use my own LLM? A2: Yes. The platform supports OpenAI, Anthropic, local models (via Ollama), or self-hosted LLaMA2.
All project documentation is located at softwaremodule-docs. Architecture, schema, and developer guides are maintained here.
If you would like to contribute to the documentation, please submit a pull request on the unstructuredDataHandler Documentation repository.
📁 specs/ – Project specification
📁 plan/ – High-level plan
📁 tasks/ – Task breakdown
📁 config/ – YAML config for models, prompts, logging
📁 data/ – Prompts, embeddings, and other dynamic content
📁 examples/ – Minimal scripts to test key features
📁 notebooks/ – Quick experiments and prototyping
📁 src/ – The core engine; all logic lives here (./src/README.md)
📁 tests/ – Unit, integration, and end-to-end tests
📁 doc/ – Architecture, schema, guides, and API documentation (doc/codeDocs/)
📁 deployments/ – Docker, Kubernetes, Helm, monitoring
- Track prompt versions and results
- Separate configs using YAML files
- Maintain separation between model clients
- Structure code by clear module boundaries
- Cache responses to reduce latency and cost
- Handle errors with custom exceptions
- Use notebooks for rapid testing and iteration
- Monitor API usage and set rate limits
- Keep code and docs in sync
- Normalize all parsed content into JSON schema.
- Chunk documents with context preservation (see the sketch after this list).
- Monitor agents via LangSmith.
- Store only raw files in MinIO, not Git.
- Run CI/CD pipelines for linting, testing, and type checks.
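One straightforward reading of the chunking guideline above is to split each section on a character budget while repeating the section heading in every chunk and overlapping neighbouring chunks. The sketch below does exactly that; the chunk size and overlap are arbitrary assumptions, not the project's tuned values.

```python
# Context-preserving chunking sketch: keep the heading with every chunk and
# overlap neighbours so no statement is cut off from its context.
# The chunk size and overlap below are arbitrary, not the project's tuned values.
def chunk_section(heading: str, text: str, max_chars: int = 800, overlap: int = 200) -> list[str]:
    chunks = []
    step = max_chars - overlap
    for start in range(0, max(len(text), 1), step):
        body = text[start:start + max_chars]
        if body.strip():
            chunks.append(f"{heading}\n\n{body}")   # heading repeated for context
    return chunks


if __name__ == "__main__":
    pieces = chunk_section("3.2 Threat Model", "A long requirements paragraph... " * 100)
    print(len(pieces), "chunks, first begins with:", pieces[0][:40])
```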
- Clone the repository.
- Set up your Python environment. A Python version between 3.10 and 3.12 is recommended.
- Install dependencies. The project's dependencies are split into several files. For general development, you will need `requirements-dev.txt`: `pip install -r requirements-dev.txt`
- Set up your environment variables. Copy the `.env.template` file to `.env` and fill in the required API keys for the LLM providers you want to use.
- Explore the examples. The `examples/` directory contains scripts that demonstrate the key features of the project.
- Experiment in notebooks. The `notebooks/` directory is a great place to start experimenting with the codebase.
- Use modular structure
- Test components early
- Track with version control
- Keep datasets fresh
- Keep documentation updated
- Monitor API usage and limits
- Keep LLM API usage monitored.
- Keep specs/plans/tasks updated in version control.
- Test parsers with representative docs.
- Use notebooks for quick experiments.
- Run type checks (`mypy`) and linting (`ruff`) before PRs.
- `specs/spec.yaml` – System specification
- `plan/plan.yaml` – Roadmap & phases
- `tasks/tasks.yaml` – Task breakdown with dependencies
- `requirements.txt` – Core package dependencies for the project
- `requirements-dev.txt` – Dependencies for development and testing
- `requirements-docs.txt` – Dependencies for generating documentation
- `AGENTS.md` – Instructions for AI agents working with this repository
- `README.md` – Project overview and usage
- `Dockerfile` – Container build instructions
We provide a small helper script that creates an isolated virtualenv and runs the test suite.
Run the full test suite locally:
```bash
./scripts/run-tests.sh
```

Or run just the deepagent tests (fast):

```bash
./scripts/run-tests.sh test/unit -k deepagent
```

You can also use the Makefile targets:

```bash
make test
make lint
make typecheck
make format
make coverage       # terminal coverage summary
make coverage-html  # generate HTML report in ./htmlcov/
make ci             # tests with coverage + ruff + mypy + pylint
```

- Preferred (isolated venv): use `./scripts/run-tests.sh`. It creates `.venv_ci`, pins pytest, and runs with `PYTHONPATH` set correctly.
- Alternative (your own venv):

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -U pip
  pip install -r requirements-dev.txt
  PYTHONPATH=. python -m pytest test/ -v
  ```

To collect coverage:

- Isolated venv script (add flags after the script path):

  ```bash
  ./scripts/run-tests.sh --cov=src --cov-report=term-missing
  ```

- Local venv (after installing `requirements-dev.txt`):

  ```bash
  PYTHONPATH=. python -m pytest test/ --cov=src --cov-report=term-missing
  ```

- Makefile shortcuts:
  - `make coverage` – terminal summary
  - `make coverage-html` – generates an HTML report in `./htmlcov/`

For linting and type checks:

- Makefile shortcuts:

  ```bash
  make lint
  make lint-fix   # ruff check with autofix
  make typecheck  # mypy with router exclusion
  make format     # ruff formatter
  ```

- Manual (useful in CI or local venv):

  ```bash
  python -m pylint src/ --exit-zero
  python -m mypy src/ --ignore-missing-imports --exclude="src/llm/router.py"
  ```
Note: The mypy exclusion for src/llm/router.py avoids a duplicate module conflict with src/fallback/router.py during type analysis.
We are excited to work with the community to build and enhance this project.
BEFORE you start work on a feature/fix, please read & follow our Contributor's Guide to help avoid any wasted or duplicate effort.
To keep code quality consistent, we provide pre-commit hooks for ruff (lint+format) and mypy; and a pre-push hook that runs tests with coverage.
- Install dev deps (once): `pip install -r requirements-dev.txt`
- Install hooks (once): `pre-commit install --install-hooks`
- Optional: enable the pre-push test runner: `pre-commit install --hook-type pre-push`
Hooks configured in `.pre-commit-config.yaml`:
- ruff (with autofix) and ruff-format
- mypy with the router exclusion
- pre-push: `./scripts/run-tests.sh --cov=src --cov-report=term-missing`
The easiest way to communicate with the team is via GitHub issues.
Please file new issues, feature requests and suggestions, but DO search for similar open/closed preexisting issues before creating a new issue.
If you would like to ask a question that you feel doesn't warrant an issue (yet), please reach out to us via email: info@softwaredevlabs.com
Please review these brief docs below about our coding practices.
If you find something missing from these docs, feel free to contribute to any of our documentation files anywhere in the repository (or write some new ones!)
This is a work in progress as we learn what we'll need to provide people in order to be effective contributors to our project.
This project has adopted the Code of Conduct. For more information see the Code of Conduct or contact info@softwaredevlabs.com with any additional questions or comments.