Some patterns commonly encountered when writing CWL workflows
- Manifest file via Javascript
- Embedding scripts
- Embedding a bash script (style 2)
- Manipulating a list of files using expressions
- Link input files to working directory
- How to handle port type mismatches
Which of the Workflow Patterns Initiative patterns does CWL support?
My tool takes a list of filenames as input. I used to pass each filename on the command line, but when I scale up I run into Unix "Argument list too long" errors. What can I do?
If the program accepts, or can be altered to accept, a manifest file containing a list of input file paths, the CWL can be written to generate this manifest file on the fly before invoking the tool (example).
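As a rough sketch of the manifest idea, the JavaScript below shows the kind of function body an expression could use to build the manifest contents. The input name `files`, the helper name `buildManifest`, and the one-path-per-line format are assumptions for illustration; the format has to match whatever your program actually reads.

```javascript
// Hypothetical stand-in for the CWL `inputs` object. A real runner supplies
// this; `files` is an assumed input name.
var inputs = {
  files: [
    { class: "File", path: "/data/a.txt" },
    { class: "File", path: "/data/b.txt" }
  ]
};

// One path per line with a trailing newline; an empty list gives an empty manifest.
function buildManifest(files) {
  var paths = files.map(function (f) { return f.path; });
  return paths.length ? paths.join("\n") + "\n" : "";
}

var manifest = buildManifest(inputs.files);
console.log(manifest);
```

The tool then receives only the manifest's path on the command line, so the argument list stays short no matter how many files are listed.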
I have a Python script that I want to embed in my tool wrapper (not make part of the docker image). How would I do this?
You can embed the script using InitialWorkDirRequirement (example).
I use good software practices. My Python/Ruby/Haskell/... script is in a separate file from my CWL wrapper.
You can use the $include directive to pull the file into your CWL. It works in InitialWorkDirRequirement (example) or in the contents field of a File (example).
My embedded (Bash/Python/R) script has "$" signs in it, and these conflict with CWL parameter references. How do I get them to play nicely together?
You can escape the $ in your script with \$.
I'd rather just paste the script into the CWL and not have to go through it, escaping the "$" signs. Do I have another option?
You could embed your script in the "contents" field of the default value of a File input. Note that with this solution a user can supply a different file to the script input and override the default script. This can be considered a bug or a feature, depending on your use case (example).
A bash script can also be passed as a string argument on the command line and invoked from there (example).
I have an input that is a list of files. I wish to do processing based on the file paths.
This depends a bit on what the expression is intended to do. The easiest case is when the whole processing can be done in JavaScript. In this case the pattern looks like
${
  var cmd = "";
  for (var i = 0; i < inputs.files.length; i++) {
    cmd += "\n echo " + inputs.files[i].path;
  }
  return cmd;
}
(example)
This works well when a JSON or simple string return value from the JS code is
all we need. If what we really want is to generate an embedded script (say a
bash or Python script), for example via InitialWorkDirRequirement, writing it
in this fashion becomes cumbersome. One way around this is to write a succinct
JS expression that converts the passed JSON object into a list of paths in the
syntax accepted by the script.
Here is an example for Python and for bash.
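The conversion described above might look roughly like the sketch below. The helper names `toPythonList` and `toBashArray` and the sample paths are made up for illustration and are not taken from the linked examples.

```javascript
// Hypothetical File objects, shaped like the entries of `inputs.files`.
var files = [
  { class: "File", path: "/data/a.txt" },
  { class: "File", path: "/data/my file.txt" }
];

// Python: for ordinary paths, a JSON array of strings also reads as a
// Python list literal, so JSON.stringify does the quoting for us.
function toPythonList(files) {
  return JSON.stringify(files.map(function (f) { return f.path; }));
}

// bash: single-quote each path and escape any embedded single quote,
// then wrap the result in array parentheses.
function toBashArray(files) {
  var quoted = files.map(function (f) {
    return "'" + f.path.replace(/'/g, "'\\''") + "'";
  });
  return "(" + quoted.join(" ") + ")";
}

console.log("paths = " + toPythonList(files));
console.log("paths=" + toBashArray(files));
```

The returned string can then be spliced into the first line of the embedded script, leaving the rest of the script free of expressions.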
I have a tool that does not do well with arbitrary file paths. I'd like to link the files into the working directory so I don't have to deal with arbitrary mount paths and so on.
You can use InitialWorkDirRequirement to link the files (example).
You can mix this with embedding scripts (example).
- Tool A produces a list of Files (or strings, ints ...)
- Tool B accepts only a single File (or string, int ...)
- How do I connect A to B?
If you are sure this is not going to be a problem, e.g. in this context A will
only ever produce one file, or you are only interested in one file, you can use
a step valueFrom expression to convert the types.
Here is a workflow that will raise validation warnings and will fail on execution because of port type mismatches.
Here is the same workflow with valueFrom added to make the port types match.
You can tailor the input/output types to your situation.
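For the common case of taking a single item from a list, the valueFrom expression can be as short as `$(self[0])`, where `self` is the value arriving from the source port. The sketch below spells out that logic as a plain function with a guard; the function name `firstItem` and the sample data are assumptions for illustration.

```javascript
// Hypothetical value arriving from tool A's output port: an array of Files.
var upstream = [
  { class: "File", path: "/out/result_0.txt" },
  { class: "File", path: "/out/result_1.txt" }
];

// Same logic as a `$(self[0])` expression, plus a check that fails loudly
// if the upstream step produced nothing.
function firstItem(self) {
  if (!Array.isArray(self) || self.length === 0) {
    throw new Error("expected a non-empty array from the upstream step");
  }
  return self[0];
}

var single = firstItem(upstream);
console.log(single.path);
```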
flatten-nestedarray.cwl shows how to flatten an array such as [["a0", "a1"], ["b0", "b1"]] into ["a0", "a1", "b0", "b1"].
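A minimal sketch of one-level flattening in ES5-style JavaScript is given below. It is an illustration of the idea and not necessarily how flatten-nestedarray.cwl is written; the function name `flatten` is an assumption.

```javascript
// Concatenate the inner arrays of a one-level nested array.
function flatten(nested) {
  return nested.reduce(function (acc, inner) {
    return acc.concat(inner);
  }, []);
}

var flat = flatten([["a0", "a1"], ["b0", "b1"]]);
console.log(JSON.stringify(flat));
```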
batch-array.cwl shows how to divide an array such as ["a0", "a1", "a2", "a3", "a4"] into a nested array [["a0", "a1"], ["a2", "a3"], ["a4"]] with batch size 2. With batch size 3, the batched nested array would be [["a0", "a1", "a2"], ["a3", "a4"]].
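The batching step can be sketched in JavaScript as follows. This is an illustration under the assumption that the last batch is allowed to be shorter, as in the examples above; it is not necessarily the code in batch-array.cwl, and the name `batch` is hypothetical.

```javascript
// Split `items` into consecutive chunks of at most `size` elements.
function batch(items, size) {
  if (size < 1) {
    throw new Error("batch size must be at least 1");
  }
  var out = [];
  for (var i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

var input = ["a0", "a1", "a2", "a3", "a4"];
console.log(JSON.stringify(batch(input, 2)));
console.log(JSON.stringify(batch(input, 3)));
```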
array-to-dir.cwl shows how to stage an array such as [file0, dir0, file1, dir1] in a new directory, with file0, dir0, file1, dir1 as its content.
nestedarray-to-dir.cwl shows how to stage a nested array such as [[file0, dir0], [file1, dir1]] in a new directory, with file0, dir0, file1, dir1 as its content.
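Both staging cases come down to returning a Directory object whose `listing` holds the items. The sketch below shows the shape of such an object, flattening first for the nested case. The names `toDirectory`, `flattenOnce`, and the `staged` directory name are assumptions for illustration and not taken from the example files.

```javascript
// Hypothetical File and Directory objects as a runner would pass them.
var file0 = { class: "File", basename: "file0" };
var dir0 = { class: "Directory", basename: "dir0" };
var file1 = { class: "File", basename: "file1" };
var dir1 = { class: "Directory", basename: "dir1" };

// Concatenate the inner arrays of a one-level nested array.
function flattenOnce(nested) {
  return nested.reduce(function (acc, inner) { return acc.concat(inner); }, []);
}

// Wrap the items in a new Directory object; `name` is the directory's basename.
function toDirectory(name, items) {
  return { class: "Directory", basename: name, listing: items };
}

var fromArray = toDirectory("staged", [file0, dir0, file1, dir1]);
var fromNested = toDirectory("staged", flattenOnce([[file0, dir0], [file1, dir1]]));
console.log(JSON.stringify(fromNested, null, 2));
```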
get-vcfs.cwl is a bioinformatics-specific tool that processes a 'Directory' with files such as a.vcf.gz, a.vcf.gz.tbi, b.vcf.gz, b.vcf.gz.tbi into an array [a.vcf.gz, b.vcf.gz], where a.vcf.gz.tbi is the secondary file of a.vcf.gz and b.vcf.gz.tbi is the secondary file of b.vcf.gz.
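The pairing logic can be sketched as below, assuming the directory listing is available as an array of File objects with a `basename` field. The function name `pairVcfs` and the sample listing are hypothetical, and the actual get-vcfs.cwl may be implemented differently.

```javascript
// Hypothetical directory listing.
var listing = [
  { class: "File", basename: "a.vcf.gz" },
  { class: "File", basename: "a.vcf.gz.tbi" },
  { class: "File", basename: "b.vcf.gz" },
  { class: "File", basename: "b.vcf.gz.tbi" }
];

// Return each .vcf.gz file with its .tbi index attached as a secondary file.
function pairVcfs(listing) {
  var byName = {};
  listing.forEach(function (f) { byName[f.basename] = f; });

  return listing
    .filter(function (f) { return /\.vcf\.gz$/.test(f.basename); })
    .map(function (f) {
      // Shallow copy so the original listing entries are left untouched.
      var copy = {};
      Object.keys(f).forEach(function (k) { copy[k] = f[k]; });
      var index = byName[f.basename + ".tbi"];
      copy.secondaryFiles = index ? [index] : [];
      return copy;
    });
}

var vcfs = pairVcfs(listing);
console.log(JSON.stringify(vcfs, null, 2));
```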