DemoDiff is a diffusion-based molecular foundation model for in-context inverse molecular design. It uses a graph diffusion transformer to generate molecules conditioned on contextual examples, enabling few-shot molecular design across diverse chemical tasks without task-specific fine-tuning.
- In-Context Learning: Generate molecules using only contextual examples (no fine-tuning required)
- Graph-Based Tokenization: Novel molecular graph tokenization with BPE-style vocabulary
- Comprehensive Benchmarks: 30+ downstream tasks covering drug discovery, docking, and polymer design
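Conceptually, a task is specified entirely by its demonstration examples. The snippet below is a purely illustrative sketch of that idea: the commented-out `model.generate` call is hypothetical, and the repo's real entry point is `inference.py` (described below).

```python
# In-context inverse design: the task is defined by (molecule, score) pairs.
# SMILES strings and scores here are made-up placeholders.
context = [
    ("CCO", 0.92),
    ("c1ccccc1O", 0.87),
    ("CC(=O)Nc1ccc(O)cc1", 0.81),
]

# The pairs are serialized into the model's context window, and the diffusion
# transformer denoises new molecular graphs conditioned on them. This call is
# hypothetical -- use inference.py for actual generation.
# candidates = model.generate(context=context, num_samples=8)
```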
The main interface for using DemoDiff on molecular design tasks.
Core Scripts:
- `inference.py`: Main inference script for molecular generation and evaluation
- `prepare_model.py`: Downloads the pretrained DemoDiff-0.7B model from HuggingFace
- `prepare_data_and_oracle.py`: Downloads benchmark datasets and oracles
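A typical first run looks like the following; the two `prepare_*` scripts should finish before `inference.py` can find the checkpoint and benchmark data. Task-selection flags are omitted here because they are script-specific; consult each script's argument parser.

```bash
# Download the pretrained DemoDiff-0.7B checkpoint from HuggingFace
python prepare_model.py

# Fetch the benchmark datasets and property oracles
python prepare_data_and_oracle.py

# Generate and evaluate molecules for a benchmark task
python inference.py
```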
Key Components:
- `context_data/`: 30+ benchmark tasks organized by category (downloadable via `prepare_data_and_oracle.py`)
- `docking/`: Molecular docking infrastructure built on AutoDock Vina (see the sketch after this list)
- `oracle/`: Property prediction oracles for specialized tasks (downloadable via `prepare_data_and_oracle.py`)
- `pretrained/`: Model checkpoints and configuration files
- `models/`: Neural network architectures
- `utils/`: Molecular processing and tokenization utilities
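The `docking/` components build on AutoDock Vina. The repo's own wrapper is not shown here, but the sketch below uses Vina's official Python bindings to illustrate the underlying workflow; the receptor/ligand files and box coordinates are placeholders.

```python
from vina import Vina  # AutoDock Vina's official Python bindings

v = Vina(sf_name="vina")
v.set_receptor("receptor.pdbqt")        # placeholder receptor file
v.set_ligand_from_file("ligand.pdbqt")  # placeholder ligand file

# Define the search box around the binding site (coordinates are illustrative)
v.compute_vina_maps(center=[15.0, 53.0, 16.0], box_size=[20.0, 20.0, 20.0])

# Dock and report the best pose's energy terms (kcal/mol)
v.dock(exhaustiveness=8, n_poses=5)
print(v.energies(n_poses=1))
```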
Coming soon
- `processed/vocab3000ring300/`: Cached metadata
- `tokenizer/vocab3000ring300/`: Molecular graph tokenization files
- `pretrain-token.node`: Node vocabulary
- `pretrain-token.edge`: Edge vocabulary
- `pretrain-token.ring`: Ring structure vocabulary
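The vocabulary files can be inspected directly. The sketch below assumes the `pretrain-token.*` files live under `tokenizer/vocab3000ring300/` and use a simple one-entry-per-line text format; the actual serialization may differ.

```python
from pathlib import Path

tok_dir = Path("tokenizer/vocab3000ring300")

def load_vocab(path: Path) -> list[str]:
    """Read one token entry per line (assumed format; the real files may differ)."""
    return [line.rstrip("\n") for line in path.open() if line.strip()]

nodes = load_vocab(tok_dir / "pretrain-token.node")
edges = load_vocab(tok_dir / "pretrain-token.edge")
rings = load_vocab(tok_dir / "pretrain-token.ring")

# Expect roughly a 3000-token graph vocabulary plus 300 ring tokens in total
print(len(nodes), len(edges), len(rings))
```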
config.yaml: Hyperparameters, training settings, and tokenization config
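Reading the config is plain YAML (requires PyYAML). The path below assumes the file sits under `pretrained/` per the layout above, and the top-level key names probed are assumptions about the schema, not guarantees.

```python
import yaml

with open("pretrained/config.yaml") as f:  # assumed location; adjust if needed
    cfg = yaml.safe_load(f)

# Key names are illustrative; inspect cfg.keys() for the real schema
for key in ("model", "training", "tokenizer"):
    print(key, "->", cfg.get(key))
```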
- Model Type: Discrete diffusion transformer with marginal transitions
- Architecture: 24-layer transformer, 16 attention heads, 1280 hidden dimensions
- Tokenization: Graph BPE with 3000 vocabulary + 300 ring tokens
- Training: 500 diffusion steps with a cosine noise schedule (sketched below)
- Context Length: Up to 150 tokens per molecule
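For intuition about the 500-step schedule, here is the standard cosine schedule of Nichol & Dhariwal (2021) as a minimal sketch. DemoDiff is a discrete diffusion model with marginal transitions, so its exact noise parameterization may differ, but the cumulative-noise shape is the same idea.

```python
import numpy as np

def cosine_alpha_bar(T: int = 500, s: float = 0.008) -> np.ndarray:
    """Cumulative signal level alpha_bar_t for t = 0..T (standard cosine schedule)."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]  # normalize so alpha_bar_0 = 1

alpha_bar = cosine_alpha_bar(T=500)  # 500 steps, matching the spec above
# Per-step noise rates, clipped as in the original paper
betas = np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
print(f"alpha_bar[0]={alpha_bar[0]:.3f}, alpha_bar[T]={alpha_bar[-1]:.6f}")
```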