This repository contains the implementation for the NeurIPS 2025 paper "Composing Linear Layers from Irreducibles" by Travis Pence, Daisuke Yamada, and Vikas Singh.
Paper: arXiv:2507.11688
This work demonstrates how linear layers in large language models can be decomposed into geometric primitives (bivectors) using Clifford algebra, achieving exponential parameter reduction from O(d²) to O(log²d) while maintaining competitive performance. We replace key, query, and value projections in LLM attention layers with rotor-based transformations that compose simple geometric rotations. The bivector-to-rotor mapping via invariant decomposition is visualized below; the paper describes this process, and the differentiable algorithm we present for it, in much more detail.
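As intuition for why composing rotations is so parameter-efficient, here is a minimal sketch in plain linear algebra (Givens rotations, not the paper's Clifford-algebra rotor construction): k planar rotations are described by k angles, yet compose into a full orthogonal d × d transform, whereas a dense linear layer stores d² entries.

```python
import numpy as np

def givens(d, i, j, theta):
    """Dense d x d planar rotation by `theta` in the (i, j) coordinate plane."""
    G = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = c; G[j, j] = c
    G[i, j] = -s; G[j, i] = s
    return G

d = 8
rng = np.random.default_rng(0)
# A handful of planar rotations, each parameterized by a single angle...
planes = [(0, 1), (2, 3), (4, 5), (1, 6)]
angles = rng.uniform(0, 2 * np.pi, size=len(planes))

# ...compose into one orthogonal d x d transform.
R = np.eye(d)
for (i, j), theta in zip(planes, angles):
    R = givens(d, i, j, theta) @ R

print(np.allclose(R.T @ R, np.eye(d)))  # composition stays orthogonal -> True
print(f"{len(angles)} angles vs {d * d} dense matrix entries")
```

The rotor-based layers in this repository build on the same composition idea, with bivectors as the underlying geometric primitives.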
- Clone this repo, the `submission` branch of torch_ga_fix (https://github.com/TravisNP/torch_ga_fix/tree/submission), and fast-hadamard-transform (https://github.com/Dao-AILab/fast-hadamard-transform.git):

```bash
git clone git@github.com:vsingh-group/ComposingLinearLayers.git
git clone git@github.com:TravisNP/torch_ga_fix.git
git clone git@github.com:Dao-AILab/fast-hadamard-transform.git
cd torch_ga_fix
git checkout submission
cd ../
```

- Pull the PyTorch docker image and create/start/attach to a container:
```bash
docker pull pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel
docker run -it \
    --name ComposingLinearLayers \
    --gpus all \
    -v "$(pwd):/workspace" \
    pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel \
    bash
```

- Install torch_ga_fix and fast-hadamard-transform:
```bash
cd torch_ga_fix/
pip install .
cd ../fast-hadamard-transform/
pip install .
```

- Install the requirements:

```bash
cd ../ComposingLinearLayers/
pip install -r requirements.txt
```

- Set the HUGGINGTOKEN environment variable:

```bash
export HUGGINGTOKEN=<yourtokenhere>
```

To stop the container, type `exit`. To start/attach to the container later, use `docker start -ai ComposingLinearLayers`.
To replace attention layers in different LLMs, navigate to the ComposingLinearLayers directory and run the corresponding script:
```bash
./run/run_<model_name>.sh
```

Available models:
- `run_llama.sh` - LLaMA-3.2 1B / LLaMA-3.2 3B
- `run_qwen.sh` - Qwen-2.5 1.5B
- `run_fox.sh` - Fox-1.0 1.6B
Below are the average PPL (perplexity) values for replacing up to three transformer layers of LLaMA and Qwen.
To reproduce the projection convergence analysis:
```bash
python -m run.test_projection_convergence
```

Below are the results, showing that while larger rotors initially require more iterations to converge, they eventually converge just as quickly as smaller ones.
The main script `main.py` accepts the following arguments:
- `--layers`: Comma-separated layer indices to replace (e.g., `"12,13,14"`)
- `--root`: Root directory for storing data and model outputs
- `--config`: Path to YAML configuration file (without the `.yaml` extension)
- `--dataset`: Dataset for evaluation. Options: `arc_challenge`, `hellaswag`, `wikitext`, `c4`, `ptb`
- `--model`: LLM model to use. Options: `llama1B`, `llama3B`, `Qwen2.5-1.5B`, `fox`
- `--replacement_type`: Type of layer replacement. Options: `rotor`, `lowrank_linear`, `bh_linear`
- `--train_projo`: If set, trains the output projection layer (`o_proj`) after replacing attention layers
- `--eval_datatype`: Data type for evaluation. Options: `float32` (default), `bfloat16`. Note: it is not certain that `bfloat16` works.
- `--rank`: Rank for the low-rank linear approximation (required when `--replacement_type=lowrank_linear`)
- `--llm_batch_size`: Number of prompts to process simultaneously during data extraction
- `--remove`: If set, deletes extracted training data after processing to save disk space
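For example, a hypothetical invocation might look like the following (the config path, layer indices, and other argument values here are illustrative, not values used in the paper's experiments):

```shell
python main.py \
  --layers "12,13,14" \
  --root ./data \
  --config configs/rotor \
  --dataset wikitext \
  --model llama1B \
  --replacement_type rotor \
  --train_projo
```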
See the scripts in the `run` directory; they recreate the experiments for the Fox, LLaMA, and Qwen models.
Configuration files (.yaml) contain hyperparameters specific to each replacement type. See the paper (Appendix C) for details on hyperparameter settings used in experiments.
The pipeline consists of three main steps for each layer:

- **Data Extraction**: Extract hidden states (inputs) and projection outputs (targets) from the original model using the specified dataset
- **Training**: Train the replacement layer (rotor, low-rank, or block-Hadamard) to minimize MSE between predicted and true projection outputs
  - Optionally retrain the output projection (`o_proj`) if `--train_projo` is set
- **Evaluation**: Evaluate perplexity (for language modeling datasets) or accuracy (for multiple-choice benchmarks)
When replacing multiple layers, they are processed sequentially: each new layer is trained with all previously replaced layers active.
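As a concrete analogue of the extract/train/evaluate loop, here is a minimal NumPy sketch (not the repository's code) of fitting a low-rank replacement for a frozen linear projection by minimizing MSE on extracted (input, output) pairs; the matrix names and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, rank = 64, 512, 8

# Step 1 (data-extraction stand-in): inputs X and targets Y = X @ W.T,
# where W plays the role of a frozen projection from the original model.
W = rng.normal(size=(d, d)) / np.sqrt(d)
X = rng.normal(size=(n, d))
Y = X @ W.T

# Step 2 (training): fit a rank-8 factorization B @ A by gradient
# descent on the MSE between predicted and true projection outputs.
A = rng.normal(size=(rank, d)) * 0.1
B = rng.normal(size=(d, rank)) * 0.1
init_mse = float(np.mean((X @ (B @ A).T - Y) ** 2))
lr = 0.05
for _ in range(500):
    err = X @ (B @ A).T - Y             # (n, d) residuals
    grad = (err.T @ X) / n              # gradient wrt B @ A (up to a constant)
    B -= lr * grad @ A.T
    A -= lr * B.T @ grad
mse = float(np.mean((X @ (B @ A).T - Y) ** 2))

# Step 3 (evaluation stand-in): the factored layer uses 2*d*rank
# parameters instead of d*d for the dense projection.
print(f"MSE {init_mse:.3f} -> {mse:.3f}; params {2 * d * rank} vs {d * d}")
```

The repository's rotor replacement trains against the same MSE objective but composes geometric rotations instead of factoring the weight matrix.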
If you use this code, please cite:
```bibtex
@inproceedings{pence2025composing,
  title={Composing Linear Layers from Irreducibles},
  author={Pence, Travis and Yamada, Daisuke and Singh, Vikas},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}
```

For questions or issues, please open a GitHub issue or contact Travis Pence at tnpence at wisc dot edu.


