User guide
Broadly, the input of a minute run consists of a set of multiplexed paired-end FASTQ files, which are demultiplexed and aligned to a reference genome, with quality-control metrics and intermediate steps such as deduplication and optional filtering by mapping quality and/or user-defined excluded loci. The final numbers of aligned reads for the input and each ChIP are used to normalize and produce MINUTE-scaled bigWig files.
Standard run
This is the most common use case for minute users, and the recommended starting point.
Configuration
In order to process your data, minute needs the barcode sequence corresponding to each library in the pool, and to know which of them is used as the reference for scaling. You can simply specify these in a tab-separated barcodes.tsv file:
# Barcode information file
# Columns: barcode name, replicate id, barcode, reference
reference_condition 1 CCGGCGTT mm39
reference_condition 2 TTATTTCT mm39
treatment_condition 1 GATGTTTG mm39
treatment_condition 2 TTGTGAAA mm39
This file has the following columns:
- Condition name. Any name you want to attach to the barcode. It should not contain space characters.
- Replicate id. Distinguishes different replicates of the same condition name.
- Barcode sequence. Nucleotide sequence that identifies the condition-replicate pair.
- Genome reference. Reference identifier to map the reads to. This identifier should match an identifier in the minute.yaml file.
Configuration steps
- Create a barcodes.tsv file describing the barcodes for each condition of your experiment, as described above.
- Activate the minute conda environment (see the installation guide):

  conda activate minute

- Run:

  minute init myexperiment --reads path/to/fastq --barcodes path/to/barcodes.tsv --input inp_prefix

  Note that inp_prefix must match one of the FASTQ read pairs in your fastq directory: inp_prefix_R1.fastq.gz, inp_prefix_R2.fastq.gz.

  This will create a myexperiment directory that contains a fastq subdirectory with symlinks to each FASTQ file in path/to/fastq, a template minute.yaml configuration file, and libraries.tsv and groups.tsv files.

  Optionally, if you have run minute before on your system and you already have a minute.yaml configuration you want to reuse, you can specify it via the --config parameter. This will make a copy of the provided file in the directory where minute will run:

  minute init myexperiment \
      --reads path/to/fastq \
      --barcodes path/to/barcodes.tsv \
      --input inp_prefix \
      --config path/to/minute.yaml
- Edit minute.yaml if required. References present in barcodes.tsv must be specified in it. For example, if you specify mm39 as reference genome, there must be an mm39 entry in minute.yaml with paths to matching bowtie2 indexes (if aligning with bowtie2) and the FASTA reference, as sketched below.
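A minimal mm39 entry could look like this, following the reference schema of minute.yaml shown later in this guide; the file paths here are placeholders to adapt to your system:

references:
  mm39:
    # FASTA reference (may be gzip-compressed); if aligning with bowtie2,
    # a matching Bowtie2 index must exist in the same location.
    fasta: "ref/mm39.fa.gz"
    # Optional BED file with regions to exclude (empty: no regions excluded).
    exclude: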
Note
minute runs on paired-end FASTQ files. Read 1 and read 2 files must have names ending in _R1.fastq.gz and _R2.fastq.gz, respectively.
This configuration is enough for a standard minute
run.
Running minute
Once you have set up a minute
run, your experiment directory should look
similar to this (with your own FASTQ file pairs):
myexperiment
├── fastq
│ ├── H3K4me3-ChIP_R1.fastq.gz
│ ├── H3K4me3-ChIP_R2.fastq.gz
│ ├── INPUT_R1.fastq.gz
│ └── INPUT_R2.fastq.gz
├── groups.tsv
├── libraries.tsv
└── minute.yaml
At this point, you just have to move to your experiment directory and run minute:
cd myexperiment && minute run
Note
You probably want to run minute run
with the option -n
first, which
will do a “dry run”, that is, it only shows which steps would be executed and
does not actually run them.
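For example, to preview which steps would be executed:

cd myexperiment && minute run -n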
Result folders and files
- reports/multiqc_report.html: MultiQC report
- final/bam: Final BAM files
- final/bigwig: Scaled and unscaled bigWig files
- final/fastq: Demultiplexed FASTQ files
- tmp/: Intermediate files
Advanced options
It is possible to further customize a minute run. You only need to do
this if you have one of the following specific use cases:
- Scaling groups do not correspond 1:1 to each FASTQ R1/R2 pair.
- Some FASTQ R1/R2 pairs have different barcode/conditions.
- The reference is not always the same on each scaling group, and/or you do not want to use the pooled sample final read count as scaling factor.
- You would like to align some samples to a different reference.
Note that in any of these cases it is probably easier to further edit the
configuration obtained from a minute init
run,
so it is still recommended to first configure a standard run.
Custom configuration
Alternatively, you can edit or manually create the libraries.tsv and groups.tsv files as described below.
- Create an empty folder somewhere (called myexperiment in the following).
- Create a subfolder fastq in the myexperiment folder.
- Create symbolic links within the fastq/ folder that point to your input FASTQ files (those that contain the multiplexed libraries).
- Copy the testdata/minute.yaml file into myexperiment/ and edit it as required.
- Create libraries.tsv and groups.tsv files describing your libraries (see below).
Once the custom configuration is done, you can proceed to run it. You can see more details about the configuration files below.
Configuration files
The configuration files libraries.tsv, groups.tsv and minute.yaml need to be placed in the myexperiment/ directory. Use the provided example files under testdata as templates and adjust as needed.
The libraries.tsv file
This is a text file in tab-separated value format describing the sequenced libraries, one row per library.
# Columns: sample name, replicate id, barcode, FASTQ base name
#
# If the barcode is ".", the FASTQ file will be used directly.
# If the barcode is set to a nucleotide sequence, the FASTQ file
# will be demultiplexed.
#
H3K4me3_reference_condition 1 CCGGCGTT H3K4me3-ChIP
H3K4me3_reference_condition 2 TTATTTCT H3K4me3-ChIP
H3K4me3_treatment_condition 1 GATGTTTG H3K4me3-ChIP
H3K4me3_treatment_condition 2 TTGTGAAA H3K4me3-ChIP
Input_reference_condition 1 CCGGCGTT INPUT
Input_reference_condition 2 TTATTTCT INPUT
Input_treatment_condition 1 GATGTTTG INPUT
Input_treatment_condition 2 TTGTGAAA INPUT
The columns are:
- Sample name. Any name you want to give to the sample.
- Replicate number.
- Barcode sequence if applicable, a dot (.) otherwise. If a . is specified, the demultiplexing step for this library is skipped. This feature makes it possible to reprocess already demultiplexed MINUTE-ChIP datasets.
- FASTQ base name. This must match the prefix of the FASTQ file pairs that are going to be processed (without _R1.fastq.gz, _R2.fastq.gz).
Sample names in the first column must include some information about the matched FASTQ file pair, since no combination of sample name plus replicate can be repeated in the sample sheet. As this is a multiplexed experiment, it is usually the case that condition, replicate and barcode are the same for each ChIP and input. If that is the case, configuring your experiment as a standard run will simplify the process.
The FASTQ base name refers to files within the fastq/
folder. The suffixes
_R1.fastq.gz
and _R2.fastq.gz
will be added automatically.
If the barcode is ".", the FASTQ file will be used directly. If the barcode is set to a nucleotide sequence, the FASTQ file pair will be demultiplexed.
In the example above, the expected FASTQ files according to the fourth column are H3K4me3-ChIP_R1.fastq.gz, H3K4me3-ChIP_R2.fastq.gz, INPUT_R1.fastq.gz and INPUT_R2.fastq.gz.
If you use an SRA accession for the FASTQ base name, a separate command can be used to automatically download the listed samples from the SRA, see downloading data from SRA.
The groups.tsv file
This is a text file in tab-separated value format that defines which of the libraries are the treatments, which are the controls (or “inputs”) and to which reference they need to be scaled. It is possible to specify different scaling groups, each one with its own reference library. For each scaling group, the first library will be taken as reference. Note that a given library cannot be specified in multiple scaling groups at the same time.
# Columns: treatment name, replicate id, control name, scaling group, ref genome
#
H3K4me3_reference_condition pooled Input_reference_condition H3K4me3 mm39
H3K4me3_treatment_condition pooled Input_treatment_condition H3K4me3 mm39
The columns are:
- Treatment sample name.
- Replicate id. This column can also contain the reserved word pooled to indicate that the replicates are to be pooled before scaling.
- Control sample name.
- Scaling group.
- Reference genome. Reads will be mapped to this genome. This identifier must match a reference in the minute.yaml file.
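As a sketch, the same experiment scaled per replicate rather than pooled would list the replicate ids directly; the first row still defines the scaling reference of the H3K4me3 group:

# Columns: treatment name, replicate id, control name, scaling group, ref genome
#
H3K4me3_reference_condition 1 Input_reference_condition H3K4me3 mm39
H3K4me3_reference_condition 2 Input_reference_condition H3K4me3 mm39
H3K4me3_treatment_condition 1 Input_treatment_condition H3K4me3 mm39
H3K4me3_treatment_condition 2 Input_treatment_condition H3K4me3 mm39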
The minute.yaml file
The minute.yaml file is used to configure everything else. If you initialized your experiment as a standard run using minute init, you should have a template minute.yaml file in your myexperiment directory. Open the file in an editor, read through the comments and edit as required.
# Configuration settings for the minute pipeline
# Paths (unless absolute) are relative to the directory in which snakemake is run.
references:
  mini:  # Arbitrary name for this reference. This is also used in output file names.
    # Path to a reference FASTA file (may be gzip-compressed).
    # A matching Bowtie2 index must exist in the same location.
    fasta: "ref/ref1.fa.gz"

    # Path to a BED file with regions to exclude.
    # If this is empty, no regions are excluded.
    exclude: "exclude1.bed"

  small:
    fasta: "ref/ref2.fa.gz"
    exclude:

# Length of the 5' UMI
umi_length: 6

# Fragment length (insert size)
fragment_size: 150

# Allow this many errors in the barcode when demultiplexing
max_barcode_errors: 1

# Filter out reads under certain mapping quality (0: no filtering)
mapping_quality: 0

# If the filtered_bigwigs rule is run, additional bigWig files are produced,
# where reads with MQ lower than this value are filtered out. This parameter
# is independent of the mapping_quality parameter above.
mapping_quality_bigwig: 20

# Aligner to use in the genome alignment step
# Valid options: bowtie2, strobealign
aligner: "bowtie2"

# Bowtie2 alignment mode:
# fast, fast-local, sensitive, sensitive-local, very-sensitive
bowtie2_mode: "fast"
Downloading data from the Sequence Read Archive (SRA)
To use data from SRA that you have not already downloaded, create the
libraries.tsv
file listing your libraries as described above.
For all samples to be downloaded from the SRA, write a dot (".") in
the barcode (third) column and put the SRA run accession
(starting with SRR, ERR, or DRR) into the FASTQ base name (fourth) column.
The minute.yaml
and groups.tsv
files are not needed for this step.
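For example, a libraries.tsv entry for a library to be fetched from SRA could look like this; SRR1234567 is a placeholder accession:

# Columns: sample name, replicate id, barcode, FASTQ base name
H3K4me3_reference_condition 1 . SRR1234567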
Then run:
minute download
This will download the files into the fastq/
directory.
If your libraries.tsv
refers to non-SRA entries and you have not
placed those files into the fastq/
directory, you will get an error
message for them, but the other files will still be downloaded.
You can now move, copy or link the downloaded files into a separate folder somewhere for later use. The next time they are needed, you can avoid the download and instead use the existing files.
If you download the data yourself, note that minute expects file names ending in _R1.fastq.gz and _R2.fastq.gz. Because fastq-dump creates files named ..._1.fastq.gz and ..._2.fastq.gz, they need to be renamed.
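Assuming the downloaded files sit in the fastq/ directory and follow the fastq-dump naming, a shell sketch for the rename:

for f in fastq/*_1.fastq.gz; do mv "$f" "${f%_1.fastq.gz}_R1.fastq.gz"; done
for f in fastq/*_2.fastq.gz; do mv "$f" "${f%_2.fastq.gz}_R2.fastq.gz"; done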
Running on SLURM clusters
Snakemake supports running on HPC environments. As such, it is possible to
run minute on SLURM clusters. Handling of minute.yaml
and libraries.tsv
files
will work the same. You just need to have conda available
and an active minute environment that you can install as described in the
setup section.
To specify cluster-specific parameters you can create a cluster.yaml
config file with your defaults:
__default__:
    partition: "core"
    cpus: "{threads}"
    time: "0-06:00:00"
    project: "project-code"
    jobname: "{rule}_{jobid}"
Note that you can use rule-dependent parameters such as rule, jobid and threads. Then you call minute run:
minute run --jobs 20 --cluster-config path/to/cluster.yaml --cluster 'sbatch -A {cluster.project} -t {cluster.time} -c {cluster.cpus} -e logs_slurm/{cluster.jobname}.err -o logs_slurm/{cluster.jobname}.out -J {cluster.jobname}'
The project
field is required, as SLURM will not queue your jobs if they are
not attached to a computing project. The --jobs
parameter in the snakemake
command limits the maximum number of jobs to be queued, and it's also required.
In this example, separate log files for each job (stdout and stderr) are written
to a logs_slurm
folder (otherwise you get a bunch of slurm-<jobid>.out
files
in the working directory). If you want this behavior, you need to create that
directory before running snakemake
.
You can also wrap your minute pipeline call in an sbatch file itself, so that the scheduler does not run on a login node:
- Create the logs_slurm directory in the myexperiment directory.
- Assign a -t (time) parameter long enough, as this job will mostly be waiting for other jobs to run. Consider that this includes not only running time but also queuing time.
- Ask only for 1 core.
- Make sure you call conda activate minute in the wrapper.