Protein language models reveal evolutionary constraints on synonymous codon choice

This repository contains all code used in data analysis, figure creation, and model training and evaluation for the codon constraints project, currently available as a preprint at https://www.biorxiv.org/content/10.1101/2025.08.05.668603v1.

Setup and installation

Dependencies are listed in the codon_env.yml file.

A conda environment containing required packages can be set up via:

conda create --name codon_env --file spec_file
conda activate codon_env
pip3 install -r requirements.txt

Replicating Figures and Analysis

Most of the figures and statistics used in the paper can be replicated by running analysis/main_figures.ipynb. This notebook takes two files as input - data/results/model_results.csv, which includes the output of our model on positions from the S. cerevisiae test set, and data/results/pooled_comp_results.csv, which records the results from our pooled competition experiment. Instructions on how to replicate these files are provided below. Figure panels are saved in the figures folder. Figure panels for Supplementary Figure 3 (Paired Fluorescent Competion) can be replicated by running paired_competition/paired_competition_analysis.ipynb. Code used to train the random forest comparison can be found in analysis/random_forest.ipynb.

Evaluating our model on the S. cerevisiae test dataset

Trained models are stored under data/models, including both our main model trained on a balanced dataset (nov12_24_undersample) and a comparison model trained on the full, unbalanced dataset (june18_24_unweighted). In order to repeat the evaluation process, run model/generate_esm_rep_scer.py to create ESM2 representations (https://github.com/facebookresearch/esm) for all S. cerevisiae genes. Then run model/evaluate.py to load and evaluate both models on S. cerevisiae. Next, call model/combine_results.py in order to create data/results/model_results.csv.

Replicating pooled competition analysis

A library of synonymous yeast variants was grown together in a pooled competition for 81 generations, with two editing replicates (AB and CD) and two experimental replicates per editing replidate (A, B, C, and D). The library was sequenced at the start (T0) and end (T5) of the competition. Raw sequencing data is obtainable from the NCBI Sequence Read Archive via accession number PRJNA1297196, and was processed using pooled_competition/process_illumina.py. The raw read counts per variant per sample are stored in data/pooled_comp/counts_db.csv. DESeq2 was used to call significantly advantageous or deleterious variants. This step can be replicated via pooled_competition/run_deseq2.py, and requires DESeq2 and R to be installed. The final results file (data/results/pooled_comp_results.csv) is created using pooled_competition/combine_results.py.

Training the model

Trained model weights are saved in data/models. In order to replicate the model training process, first generate an HDF5 file containing ESM2 representations of all genes in the training dataset using model/process_data.py. Model training can be replicated via model/train_model.py. Model settings and parameters can be changed within the file.

Additional information

Plasmids and oligo sequences can be found under data/plasmids_and_oligos. OD measurements from the pooled competition can be found under data/pooled_comp/pooled_comp_od.csv.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Protein language models reveal evolutionary constraints on synonymous codon choice

Setup and installation

Replicating Figures and Analysis

Evaluating our model on the S. cerevisiae test dataset

Replicating pooled competition analysis

Training the model

Additional information

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
analysis		analysis
data		data
figures		figures
model		model
paired_competition		paired_competition
pooled_competition		pooled_competition
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
codon_env.yml		codon_env.yml
requirements.txt		requirements.txt
spec_file.txt		spec_file.txt

License

lareaulab/codon_constraints

Folders and files

Latest commit

History

Repository files navigation

Protein language models reveal evolutionary constraints on synonymous codon choice

Setup and installation

Replicating Figures and Analysis

Evaluating our model on the S. cerevisiae test dataset

Replicating pooled competition analysis

Training the model

Additional information

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages