
Welcome to the Unstructured Data RAG Platform Repo

This repository contains the source code for the Unstructured Data RAG Platform, a Python-based framework for handling unstructured documents in safety-critical and cybersecurity software development lifecycles.

It ingests documents (PDF, Word, PlantUML, Draw.io, Mermaid), converts them to structured JSON, stores embeddings in Postgres with PGVector, keeps raw artifacts in MinIO, and enables agentic RAG via LangChain DeepAgent.

Unstructured Data RAG Platform Overview

Unstructured Data RAG Platform is a Python-based core project for the Software Development Life Cycle (SDLC) that provides AI/ML capabilities for software development workflows. The repository contains modules for LLM clients, intelligent agents, memory management, prompt engineering, document retrieval, skill execution, and various utilities. It combines a Python core with TypeScript for Azure DevOps pipeline configurations.

  • Primary Language: Python 3.10-3.12
  • Secondary Languages: TypeScript (for Azure pipelines), Shell scripts
  • Project Type: AI/ML library and tooling for SDLC workflows

unstructuredDataHandler Roadmap

The plan for the unstructuredDataHandler is described in plan/plan.yaml and will be updated as the project proceeds.

The project specification, plan, and task breakdown are defined in YAML files: specs/spec.yaml, plan/plan.yaml, and tasks/tasks.yaml.


Overview

The Unstructured Data RAG Platform provides:

  • Ingestion and parsing for priority formats: PDF, Word, PlantUML, Draw.io, Mermaid.
  • Unified JSON schema for structured content.
  • Storage in Postgres + PGVector for semantic search.
  • Raw object storage in MinIO, bi-directionally linked with Postgres entries.
  • Context-preserving chunking for embeddings.
  • Agentic workflows with LangChain DeepAgent.
  • LLM-as-judge to validate consistency and traceability.
  • A React frontend for manual review, labeling, and editing.
  • Multi-LLM support (OpenAI API, Anthropic, local LLaMA2).
  • Structured logging and error capture for debugging and compliance.

Architecture

                ┌───────────────┐
                │   Frontend    │
                │ (React/Next)  │
                └───────▲───────┘
                        │
                        ▼
                 ┌──────────────┐
                 │   Backend    │
                 │  (FastAPI)   │
                 └──────▲───────┘
                        │
    ┌───────────────────┼───────────────────┐
    ▼                   ▼                   ▼
 [Parsers]         [Postgres+PGVector]    [MinIO]
 (PDF/Word/        (JSON + embeddings)    (raw files,
 PlantUML/Drawio)                          binaries, images)

          ┌───────────────────────────────────┐
          │       LangChain DeepAgent         │
          │  Retrieval + Generation + Judge   │
          └───────────────────────────────────┘

🚀 Modules

Agents

The agents module provides the core components for creating AI agents. It includes a flexible FlexibleAgent (formerly SDLCFlexibleAgent) that can be configured to use different LLM providers (like OpenAI, Gemini, and Ollama) and a set of tools. The module is designed to be extensible, allowing for the creation of custom agents with specialized skills. Key components include a planner and an executor (currently placeholders for future development) and a MockAgent for testing and CI.

The agents module integrates LangChain DeepAgent, handling retrieval from PGVector, answer generation, and LLM-as-judge evaluations. It supports multiple LLM providers (OpenAI, Anthropic, and local LLaMA2 via Ollama).
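Provider selection can be pictured as a small registry keyed by provider name. The MockClient and make_client names below are hypothetical stand-ins; the real FlexibleAgent wires up actual LangChain clients instead.

```python
# Hedged sketch of multi-provider selection; all names are illustrative.
from typing import Callable

class MockClient:
    """Stands in for an OpenAI/Anthropic/Ollama client."""
    def __init__(self, name: str):
        self.name = name

    def complete(self, prompt: str) -> str:
        # Echo the provider name so tests can see which client answered.
        return f"[{self.name}] {prompt}"

PROVIDERS: dict[str, Callable[[], MockClient]] = {
    "openai":    lambda: MockClient("openai"),
    "anthropic": lambda: MockClient("anthropic"),
    "ollama":    lambda: MockClient("ollama"),   # local LLaMA2 via Ollama
}

def make_client(provider: str) -> MockClient:
    """Fail loudly on unknown providers instead of silently defaulting."""
    try:
        return PROVIDERS[provider]()
    except KeyError:
        raise ValueError(f"unknown provider: {provider}")

client = make_client("ollama")
```

A registry like this keeps provider-specific construction in one place, which is what makes swapping providers a configuration change rather than a code change.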

Parsers

The parsers module is a powerful utility for parsing various diagram-as-code formats, including PlantUML, Mermaid, and DrawIO. It extracts structured information from diagram files, such as elements, relationships, and metadata, and stores it in a local SQLite database. This allows for complex querying, analysis, and export of diagram data. The module is built on a base parser abstraction, making it easy to extend with new diagram formats. It also includes a suite of utility functions for working with the diagram database, such as exporting to JSON/CSV, finding orphaned elements, and detecting circular dependencies.
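The kind of querying this enables can be sketched against a toy schema. The elements and relationships tables below are an assumption for illustration, not the module's actual schema; see the parsers module for the real one.

```python
# Sketch: find orphaned diagram elements in a SQLite database.
# Schema is hypothetical; the real parsers module defines its own.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE elements (id TEXT PRIMARY KEY, kind TEXT, label TEXT);
CREATE TABLE relationships (src TEXT, dst TEXT, kind TEXT);
""")
con.executemany("INSERT INTO elements VALUES (?, ?, ?)",
                [("a", "class", "Parser"),
                 ("b", "class", "BaseParser"),
                 ("c", "class", "Orphan")])
con.execute("INSERT INTO relationships VALUES ('a', 'b', 'extends')")

# Orphaned = participates in no relationship, inbound or outbound.
orphans = [row[0] for row in con.execute("""
    SELECT label FROM elements e
    WHERE e.id NOT IN (SELECT src FROM relationships)
      AND e.id NOT IN (SELECT dst FROM relationships)
""")]
```

The same relational shape supports the other utilities mentioned above, such as JSON/CSV export and circular-dependency detection via recursive queries.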

The parsers module processes multiple formats:

  • PDF & Word via Unstructured.io
  • PlantUML, Draw.io, Mermaid via custom parsers

Extracted text, metadata, and relationships are normalized into JSON, stored in Postgres, and linked to raw files in MinIO.
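One possible shape for such a normalized record follows; the field names are illustrative, not the platform's frozen schema.

```python
# Illustrative normalized record for one parsed diagram element.
# "minio_key" is a hypothetical link back to the raw file in MinIO.
import json

record = {
    "source": {
        "format": "plantuml",
        "minio_key": "raw/3f9c2a",   # hypothetical object key
    },
    "element": {"kind": "class", "label": "OrderService"},
    "relationships": [
        {"target": "PaymentService", "kind": "depends_on"},
    ],
    "text": "OrderService --> PaymentService",
}

payload = json.dumps(record, indent=2)
```

Keeping the raw-file key inside the JSON record is what lets a reviewer jump from a search hit back to the original artifact.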



FAQ

Q1: Why not store raw docs directly in Postgres?
A1: To separate structured vs. unstructured storage. Raw files live in MinIO; Postgres stores structured JSON + embeddings with bi-directional links.

Q2: Can I use my own LLM?
A2: Yes. The platform supports OpenAI, Anthropic, local models (via Ollama), or self-hosted LLaMA2.


Documentation

All project documentation is located in softwaremodule-docs, where the architecture, schema, and developer guides are maintained.

If you would like to contribute to the documentation, please submit a pull request on the unstructuredDataHandler Documentation repository.


🔧 Key Components

πŸ“ specs/        β†’ Project specification
πŸ“ plan/         β†’ High-level plan
πŸ“ tasks/        β†’ Task breakdown
πŸ“ config/       β†’ YAML config for models, prompts, logging
πŸ“ data/         β†’ Prompts, embeddings, and other dynamic content
πŸ“ examples/     β†’ Minimal scripts to test key features
πŸ“ notebooks/    β†’ Quick experiments and prototyping
πŸ“ src/          β†’ The core engine β€” all logic lives here (./src/README.md)
πŸ“ tests/        β†’ Unit, integration, and end-to-end tests
πŸ“ doc/          β†’ Architecture, schema, guides, and API documentation (doc/codeDocs/)
πŸ“ deployments/  β†’ Docker, Kubernetes, Helm, monitoring

⚡ Best Practices

  • Track prompt versions and results
  • Separate configs using YAML files
  • Maintain separation between model clients
  • Structure code by clear module boundaries
  • Cache responses to reduce latency and cost
  • Handle errors with custom exceptions
  • Use notebooks for rapid testing and iteration
  • Monitor API usage and set rate limits
  • Keep code and docs in sync
  • Normalize all parsed content into the unified JSON schema
  • Chunk documents with context preservation
  • Monitor agents via LangSmith
  • Store raw files only in MinIO, never in Git
  • Run CI/CD pipelines for linting, testing, and type checks

🧭 Getting Started

  1. Clone the repository.

  2. Set up your Python environment. A Python version between 3.10 and 3.12 is recommended.

  3. Install dependencies. The project's dependencies are split into several files. For general development, you will need requirements-dev.txt.

    pip install -r requirements-dev.txt
  4. Set up your environment variables. Copy the .env.template file to .env and fill in the required API keys for the LLM providers you want to use.

  5. Explore the examples. The examples/ directory contains scripts that demonstrate the key features of the project.

  6. Experiment in notebooks. The notebooks/ directory is a great place to start experimenting with the codebase.
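Step 4 above can be backed by a fail-fast check at startup; the variable name below is an example, so use whatever .env.template actually lists.

```python
# Fail fast if a required API key is missing from the environment.
# REQUIRED is illustrative; extend it per provider you enable.
import os

REQUIRED = ["OPENAI_API_KEY"]

def check_env(env: dict[str, str]) -> list[str]:
    """Return the required variables that are missing or empty."""
    return [k for k in REQUIRED if not env.get(k)]

missing = check_env(dict(os.environ))
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Checking once at startup gives a clear error message instead of a confusing authentication failure deep inside an LLM call.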


💡 Development Tips

  • Use a modular structure
  • Test components early
  • Track changes with version control
  • Keep datasets fresh
  • Keep documentation updated
  • Monitor LLM API usage and limits
  • Keep specs/plans/tasks updated in version control
  • Test parsers with representative docs
  • Use notebooks for quick experiments
  • Run type checks (mypy) and linting (ruff) before PRs

πŸ“ Core Files

  • specs/spec.yaml – System specification
  • plan/plan.yaml – Roadmap & phases
  • tasks/tasks.yaml – Task breakdown with dependencies
  • requirements.txt – Core package dependencies
  • requirements-dev.txt – Dependencies for development and testing
  • requirements-docs.txt – Dependencies for generating documentation
  • AGENTS.md – Instructions for AI agents working with this repository
  • README.md – Project overview and usage
  • Dockerfile – Container build instructions

Running tests

We provide a small helper script that creates an isolated virtualenv and runs the test suite.

Run the full test suite locally:

./scripts/run-tests.sh

Or run just the deepagent tests (fast):

./scripts/run-tests.sh test/unit -k deepagent

You can also use the Makefile targets:

make test
make lint
make typecheck
make format
make coverage       # terminal coverage summary
make coverage-html  # generate HTML report in ./htmlcov/
make ci             # tests with coverage + ruff + mypy + pylint

How to test locally (two options)

  • Preferred (isolated venv): Use ./scripts/run-tests.sh. It creates .venv_ci, pins pytest, and runs with PYTHONPATH set correctly.
  • Alternative (your own venv):
    1. python3 -m venv .venv
    2. source .venv/bin/activate
    3. pip install -U pip
    4. pip install -r requirements-dev.txt
    5. PYTHONPATH=. python -m pytest test/ -v

Optional: run with coverage

  • Isolated venv script (add flags after the script path):
    • ./scripts/run-tests.sh --cov=src --cov-report=term-missing
  • Local venv (after installing requirements-dev.txt):
    • PYTHONPATH=. python -m pytest test/ --cov=src --cov-report=term-missing

Makefile shortcuts:

  • make coverage β€” terminal summary
  • make coverage-html β€” generates an HTML report in ./htmlcov/

Quick lint and type checks

  • Makefile shortcut:
    • make lint
    • make lint-fix # ruff check with autofix
    • make typecheck # mypy with router exclusion
    • make format # ruff formatter
  • Manual (useful in CI or local venv):
    • python -m pylint src/ --exit-zero
    • python -m mypy src/ --ignore-missing-imports --exclude="src/llm/router.py"

Note: The mypy exclusion for src/llm/router.py avoids a duplicate module conflict with src/fallback/router.py during type analysis.


Contributing

We are excited to work with the community to build and enhance this project.

BEFORE you start work on a feature/fix, please read & follow our Contributor's Guide to help avoid any wasted or duplicate effort.

Developer setup: pre-commit hooks (optional but recommended)

To keep code quality consistent, we provide pre-commit hooks for ruff (lint + format) and mypy, plus a pre-push hook that runs the tests with coverage.

  1. Install dev deps (once): pip install -r requirements-dev.txt
  2. Install hooks (once): pre-commit install --install-hooks
  3. Optional: enable pre-push test runner: pre-commit install --hook-type pre-push

Hooks configured in .pre-commit-config.yaml:

  • ruff (with autofix) and ruff-format
  • mypy with the router exclusion
  • pre-push: ./scripts/run-tests.sh --cov=src --cov-report=term-missing

Communicating with the Team

The easiest way to communicate with the team is via GitHub issues.

Please file new issues, feature requests, and suggestions, but DO search for similar open or closed issues before creating a new one.

If you would like to ask a question that you feel doesn't warrant an issue (yet), please reach out to us via email: info@softwaredevlabs.com

Developer Guidance

Please review these brief docs below about our coding practices.

👉 If you find something missing from these docs, feel free to contribute to any of our documentation files anywhere in the repository (or write some new ones!)

This is a work in progress as we learn what we'll need to provide people in order to be effective contributors to our project.


Code of Conduct

This project has adopted the Code of Conduct. For more information see the Code of Conduct or contact info@softwaredevlabs.com with any additional questions or comments.
