This repo contains files that will run kraken on the HPC to classify reads as Magnaporthe oryzae (Mo) or rice (Oryza sativa, Os), split those reads into separate files, and run the Mo reads through the kallisto count step, generating count files for each sample.
Snakemake is a workflow manager. Crucially, it can restart where it left off, so if one step of the pipeline fails, the earlier steps do not need to be re-run. It manages the steps that need to be run for you. This makes it very useful for bioinformatics pipelines, especially this one, where we want to divide some steps into many similar small ones. Managing a Snakemake process is easier than managing the (literally) hundreds of files that come out of the splitting, classifying and counting steps in a pipeline like this, especially when one goes wrong.
1. extract reads matching Mo from raw Fastq (kraken)
2. count reads extracted
3. count extracted reads per Mo transcript per sample (kallisto)
4. process count files into count matrix
5. summarise number of reads extracted
The raw reads are compared to a kraken database containing Rice and Mo transcript sequences, and each read is classified according to its most likely origin. The classified reads are then separated into files of only Mo or only Os reads as appropriate.
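The exact commands live in the `scripts` folder, but the classification step is, roughly, a kraken2 run against the combined database. A minimal sketch, assuming kraken2 and gzipped paired-end fastq input (the read file names here are placeholders):

```bash
# Classify one sample's read pair against the Rice + Mo kraken database.
# The '#' in --classified-out is replaced by 1/2 for the two mates.
kraken2 --db /tsl/data/krakendb/rice_magna \
        --paired --gzip-compressed \
        --output co39_leaf_0_1.kraken \
        --report co39_leaf_0_1.kreport \
        --classified-out co39_leaf_0_1_classified#.fq \
        reads_R1.fastq.gz reads_R2.fastq.gz
```

The per-read classifications in the `.kraken` output are what the pipeline's own scripts then use to split reads into Mo-only and Os-only files.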
The number of reads extracted per sample is summarised in results files.
The Mo reads are put through the kallisto count pipeline to generate estimates of transcript abundance in each read file. A single output file is generated for each sample. A count matrix is compiled.
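The kallisto part of the pipeline is roughly equivalent to an index/quant run per sample. A minimal sketch, assuming kallisto is on your path (the index name, extracted-read file names and bootstrap count are illustrative):

```bash
# Index the Mo reference transcripts (path taken from config.yaml),
# then quantify one sample's extracted Mo reads into its own output directory.
kallisto index -i mo_cdna.idx \
    /tsl/data/sequences/fungi/magnaporthe/ensemble_genome/Magnaporthe_oryzae.MG8.50.cdna.all.fa

kallisto quant -i mo_cdna.idx -o co39_leaf_0_1 -b 100 \
    co39_leaf_0_1_mo_R1.fq co39_leaf_0_1_mo_R2.fq
```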
The pipeline is just a collection of scripts - see the `scripts` folder. Do not run these directly; they are run from a master script! To install the pipeline, pull this repo into a folder in your home area and cd into that folder. This will be your working directory.
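For example (the repo URL is a placeholder; use the address of this repository):

```bash
# Clone into your home area and move into the new folder - this becomes
# your working directory for the rest of the instructions
cd ~
git clone <repo-url> split_and_count
cd split_and_count
```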
Snakemake pipelines are very tightly linked to the filesystem, so you need to tell the pipeline where everything it needs is. The `config.yaml` file does this:
```yaml
scratch: "/some/place/in/your/scratch"
projroot: "/some/place/in/your/home/split_and_count"
kraken_db: "/tsl/data/krakendb/rice_magna"
reference_genome: "/tsl/data/sequences/fungi/magnaporthe/ensemble_genome/Magnaporthe_oryzae.MG8.50.cdna.all.fa"
tax_id: 318829
```
- `scratch` is the place in your scratch area that you want the intermediate files to go to
- `projroot` is the place in your home area that you are working from; it should be the place you pulled the repo to
- `kraken_db` is the location of the kraken database; the given value is currently correct
- `reference_genome` is the file of reference sequences that kallisto will use; this is currently correct for Mo genes, change it for Os genes (you will need a cultivar-specific file)
- `tax_id` is the NCBI taxonomy id of the species you want to use, `318829` for Mo, `4530` for Os (either cultivar)
The pipeline needs to know how samples map to the input files. The file `lib/samples_to_files.csv` does this. The structure of the file is inherited from an earlier project. It looks like this:
| name | timepoint | biorep | samplename | alt_samplename | misc_info | fq1 | fq2 |
|---|---|---|---|---|---|---|---|
| co39_leaf | 0 | 1 | co39_leaf_0_1 | B1 | NA | /path/to/fq1 | /path/to/fq2 |
You must modify this file for your own samples, but the column positions and names must stay the same.
The key columns are:

- `name`, which must contain the sample name
- `samplename`, which must contain a unique sample identifier - usually the concatenation of `name`, `timepoint` and `biorep`
- `fq1` and `fq2`, which contain the full paths to the fastq files on the HPC
Note these must be in the 1st, 4th, 7th and 8th columns of a .csv file with a header, otherwise the data structure cannot be made.
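A quick way to check that your edited file has the key columns in the right positions (a sketch using standard awk, nothing pipeline-specific):

```bash
# Print name, samplename, fq1 and fq2 (columns 1, 4, 7 and 8),
# skipping the header line, so you can eyeball the sample-to-file mapping
awk -F',' 'NR > 1 { print $1, $4, $7, $8 }' lib/samples_to_files.csv
```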
Submitting the snakemake pipeline to the cluster is done with the script `scripts/do_count_matrix.sh`:
1. See the help - `bash scripts/do_count_matrix.sh -h`
2. Do a dryrun (check what remains of the pipeline to run and check everything is in place) - `bash scripts/do_count_matrix.sh dryrun`
3. Check the log to see that all is ok - `less split_and_count.log`
4. Submit the pipeline - `bash scripts/do_count_matrix.sh`
The pipeline creates a master job called split_and_count, which in turn creates subjobs; the master job persists for as long as the pipeline runs. Output from the main process goes into a file `split_and_count.log`. Output from the other jobs goes into job-id-specific `slurm*.out` files.
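To keep an eye on a running pipeline you can follow the master log and check the job queue (standard commands, assuming your own cluster username):

```bash
# Follow the master process log
tail -f split_and_count.log

# Check the master job and its subjobs in the SLURM queue
squeue -u $USER
```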
In this pipeline all output goes into a folder called `results` in the working directory in your home area. Expect the following files:
- "results/all_samples_tpm_matrix.txt" - a tpm matrix
- "results/run_metadata.txt" - the metadata file needed for
sleuth - "results/kallisto_abundances.gz" - a compressed file of abundance files for
sleuth - "results/read_counts.txt" - a summary of read_counts per sample
- "results/percent_reads_per_taxid.txt" - a summary of what percent of reads mapped to each taxid in the
krakenstep
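Once the run completes, a quick command-line sanity check of the main outputs might look like this:

```bash
# Peek at the count matrix and the read-count summaries
head results/all_samples_tpm_matrix.txt
cat results/read_counts.txt
cat results/percent_reads_per_taxid.txt
```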
If the pipeline fails (usually because some file wasn't created somewhere), then once a fix is made it can be restarted from the point it failed: just redo the submission step with `bash scripts/do_count_matrix.sh`. The pipeline will carry on from as far as it got previously, so if you need to redo earlier steps you must delete their output files. If the process crashes, a lock will sometimes be left on the folder; remove it with `bash scripts/do_count_matrix.sh unlock`.
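For example, to force the count-matrix step to be redone (a sketch; delete whichever output files correspond to the steps you want to re-run):

```bash
# Remove the output of the step to redo, then resubmit the pipeline
rm results/all_samples_tpm_matrix.txt
bash scripts/do_count_matrix.sh

# If a crashed run left a lock on the working directory, clear it first
bash scripts/do_count_matrix.sh unlock
```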