MapSplice User Guide


 

Note: For inquiries and bug reports, please contact mapsplice (at) netlab.uky.edu.

 

 


System requirements

OS: Ubuntu 9.04 (64-bit), Red Hat 4.1.2 (64-bit), Red Hat 4.3.0 (64-bit)
Compiler: g++ 4.1.2, g++ 4.3.0, g++ 4.3.3 or higher
Script: Python 2.4.3, Python 2.5.1, Python 2.6

* MapSplice 1.15.2 has been tested in the environments listed above.

 

Obtaining and installing MapSplice

You can download the MapSplice 1.15.2 release package here.

The MapSplice pipeline uses Bowtie for segment mapping. The bowtie and bowtie-build executables are located in MapSplice/bin/. MapSplice 1.15.2 was tested with Bowtie 0.12.7.

Running MapSplice with a configuration file

1. Download the MapSplice 1.15.2 package.

2. Edit the MapSplice.cfg file to set your input data files and output directory. You may also need to change the default settings.

3. Run the MapSplice pipeline with "python bin/mapsplice_segments.py MapSplice.cfg".
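
The run step above can also be driven from a script. The following is a minimal sketch, assuming it is started from the MapSplice package directory; the helper names build_mapsplice_command and run_mapsplice are hypothetical and not part of MapSplice.

```python
import subprocess


def build_mapsplice_command(cfg_path="MapSplice.cfg",
                            script="bin/mapsplice_segments.py"):
    """Return the argument list for a configuration-file run."""
    # Equivalent to: python bin/mapsplice_segments.py MapSplice.cfg
    return ["python", script, cfg_path]


def run_mapsplice(cfg_path="MapSplice.cfg"):
    """Launch the pipeline and return its exit status."""
    return subprocess.call(build_mapsplice_command(cfg_path))
```

Calling run_mapsplice() starts the pipeline with the default configuration file name; pass another path to use a different configuration file.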


Inputs and Command-line options

 The following is a detailed description of the options used to control the MapSplice script:

Usage:  
 

python bin/mapsplice_segments.py MapSplice.cfg

or

python bin/mapsplice_segments.py [inputs|options] MapSplice.cfg

or

python bin/mapsplice_segments.py [inputs|options]

Inputs and output:
 
-u/--reads-file <string>

 A comma-separated list (no blank spaces) of FASTA or FASTQ read files (including path)
 Notes:
 -For paired-end reads, the order should be as follows: reads1_end1,reads1_end2,reads2_end1,reads2_end2...
 -For two ends from the same read, the read names should be in the following format: read_base_name/1 and read_base_name/2
 -The read_base_name should be the same for both ends
 -Format constraint: read names after '@' or '>' should not contain a blank space or tab
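
The naming rules above can be checked before a run. This is a minimal sketch, not part of MapSplice; the function names are hypothetical, and rejecting paths that contain a comma or blank is an assumption made so that the joined list stays unambiguous.

```python
import re


def check_read_name(name):
    """Return True if a read name (the text after '@' or '>') has no blank or tab."""
    return len(name) > 0 and re.search(r"[ \t]", name) is None


def is_mate_pair(name1, name2):
    """Return True if the names are read_base_name/1 and read_base_name/2."""
    return (len(name1) > 2
            and name1.endswith("/1")
            and name2.endswith("/2")
            and name1[:-2] == name2[:-2])


def reads_file_argument(paths):
    """Join read file paths into the comma-separated value passed to -u."""
    for path in paths:
        if "," in path or " " in path:
            raise ValueError("path must not contain a comma or blank: %r" % path)
    return ",".join(paths)
```

For paired-end data, pass the paths in the order reads1_end1, reads1_end2, reads2_end1, reads2_end2 so that the joined string follows the order described above.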

-c/--chromosome-files-dir <string>

The directory containing the sequence files corresponding to the reference genome (in FASTA format)
 -One chromosome per file
 -The chromosome name after '>' should not contain a tab or a blank space
 -The chromosome name should be the same as the basename of the chromosome file
 -The suffix of the chromosome file name should be 'fa'
 -E.g. if the chromosome name after '>' is 'chr1', then the file name should be 'chr1.fa'
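
The requirements on chromosome files can be verified with a short script. The sketch below is an illustration only; check_chromosome_file is a hypothetical helper, not a MapSplice function.

```python
import os


def check_chromosome_file(path):
    """Return a list of problems found in one chromosome FASTA file."""
    problems = []
    base, ext = os.path.splitext(os.path.basename(path))
    if ext != ".fa":
        problems.append("file suffix is not 'fa'")
    # Collect every header line (the text after '>').
    headers = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                headers.append(line[1:].rstrip("\r\n"))
    if len(headers) != 1:
        problems.append("expected one chromosome per file, found %d" % len(headers))
    for name in headers:
        if " " in name or "\t" in name:
            problems.append("chromosome name contains a blank or tab: %r" % name)
        elif name != base:
            problems.append("chromosome name %r differs from file basename %r"
                            % (name, base))
    return problems
```

An empty list means the file satisfies all of the rules listed above.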

-B/--Bowtieidx <string>

The path and basename of the index to be searched by Bowtie.

 -E.g. if the index file name is index.1.ebwt, then the basename is index
 -If the index does not exist, it will be built with bowtie-build from the reference genome indicated by the -c option.

(The index only needs to be built once. Pre-built indexes for various reference genomes can be downloaded from Bowtie's page.)

However, use caution when downloading a pre-built index: know what you are downloading, and make sure the Bowtie index is consistent with the chromosome files specified with the -c option.
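
The relation between index file names and the basename given to -B can be sketched as follows. This is an illustration under the assumption that the index consists of the six .ebwt files written by Bowtie 1; the helper names are hypothetical.

```python
import os

# File name endings of Bowtie 1 index files. The ".rev" endings come first
# so that they are matched before the shorter ".1.ebwt" and ".2.ebwt".
_EBWT_SUFFIXES = (".rev.1.ebwt", ".rev.2.ebwt",
                  ".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt")


def index_basename(index_file):
    """Strip the Bowtie suffix from an index file path."""
    for suffix in _EBWT_SUFFIXES:
        if index_file.endswith(suffix):
            return index_file[:-len(suffix)]
    raise ValueError("not a Bowtie index file: %r" % index_file)


def index_exists(basename):
    """Return True if all six index files are present for this basename."""
    return all(os.path.exists(basename + suffix) for suffix in _EBWT_SUFFIXES)
```

A check such as index_exists can be used to decide in advance whether a run will trigger an index build.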

-o/--output-dir <string>

The name of the directory in which MapSplice will write its output. The default is "mapsplice_out/" in the directory from which MapSplice is run.

-t/--avoid-regions <string> (optional)

Regions to avoid (i.e. mask) while searching for alignments
 -GFF format required
 -e.g. ~/examples/islands.gff

-T/--interested-regions <string> (optional)

Regions of interest while searching for alignments
 -GFF format required
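
Both region options take a GFF file, in which columns 1, 4 and 5 hold the sequence name, start and end of each region. The sketch below shows one way to inspect such a file before passing it to MapSplice; read_gff_regions is a hypothetical helper and does not reflect how MapSplice itself reads the file.

```python
def read_gff_regions(lines):
    """Yield (seqid, start, end) from GFF lines; coordinates are 1-based, inclusive."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and comment or directive lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError("GFF line has fewer than 5 columns: %r" % line)
        yield fields[0], int(fields[3]), int(fields[4])
```

Pass an open file object (or any iterable of lines) and wrap the result in list() to collect the regions.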

-M/--sam-file <string> (optional)

A comma-separated list (no blank spaces) of SAM files (including path)
        -Only supports single-end reads
        -If this value is specified, the reads_file option will not be used
        -The unmapped reads in the SAM files will be converted into FASTQ format to be used as input reads
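
The conversion of unmapped reads to FASTQ can be illustrated as follows. This is a minimal sketch of the idea, not MapSplice's own conversion code: it relies on the SAM convention that flag bit 0x4 marks an unmapped read and that columns 10 and 11 hold the sequence and qualities. The function name is hypothetical.

```python
def unmapped_sam_to_fastq(sam_lines):
    """Yield one FASTQ record (as a 4-line string) per unmapped SAM read."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment line
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & 0x4:  # read is unmapped
            yield "@%s\n%s\n+\n%s\n" % (name, seq, qual)
```

Records with missing qualities (a '*' in column 11) are passed through unchanged here and would need separate handling.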

--bam <string> (optional)

A comma-separated list (no blank spaces) of BAM files (including path)
        -Only supports single-end reads
        -If this value is specified, the reads_file option will not be used
        -The unmapped reads in the BAM files will be converted into FASTQ format to be used as input reads

--filter-fusion-by-repeat <string> (optional)

Filter out a fusion junction if its donor sequence and acceptor sequence appear repeatedly
        -blat needs to be installed on the system, and a chromosome index in blat format needs to be provided
        -e.g. human index in blat format: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
        -The output is "fusion_remap_junction.unique.chr_seq.extracted.repeat_filtered"

 

 
Basic options: options that should be specified for MapSplice to run correctly
-L/--seglen <int>

 Description: Length of read segments
 -Suggested to be in the range [18,25]; if a segment is too short, it will be mapped to many locations
 -Segment length should not be longer than half of the read length
 -Segment length should not be longer than 25
 -If the read length is not evenly divisible by the segment length, the read sequence is currently truncated at the end (e.g. a segment length of 25 for a 60 bp read will use segments of nucleotides 1-25 and 26-50)
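
The segmentation rule above can be written out as a short calculation. This sketch only reproduces the arithmetic described in the notes; split_into_segments is a hypothetical helper, not a MapSplice function.

```python
def split_into_segments(read_length, seglen):
    """Return 1-based, inclusive (start, end) pairs of the whole segments."""
    if seglen > 25:
        raise ValueError("segment length should not be longer than 25")
    if seglen * 2 > read_length:
        raise ValueError("segment length should not be longer than half the read length")
    # Integer division drops the leftover bases at the end of the read.
    count = read_length // seglen
    return [(i * seglen + 1, (i + 1) * seglen) for i in range(count)]


print(split_into_segments(60, 25))  # [(1, 25), (26, 50)]
```

With a 60 bp read and a segment length of 25, the last 10 bases are not covered by any segment.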

-Q/--reads-format <string>

Format of input reads: fa or fq

--pairend

Whether the input reads are paired-end or single-end. Must be specified for paired-end reads

   
Advanced options:  
-E/--segment-mismatches <int>

The maximum number of mismatches (Hamming distance) allowed in an unspliced aligned read and segment. The default is 1. Must be in the range [0-3].
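
The Hamming distance counts the positions at which two equal-length sequences differ. The following sketch illustrates the mismatch limit; both function names are hypothetical and the code is not taken from MapSplice.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def within_mismatch_limit(segment, reference, max_mismatches=1):
    """Return True if the segment matches the reference within the limit."""
    if not 0 <= max_mismatches <= 3:
        raise ValueError("mismatch limit must be in the range [0-3]")
    return hamming_distance(segment, reference) <= max_mismatches
```

With the default limit of 1, a segment that differs from the reference at two positions is rejected.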

--non-canonical | --semi-canonical

Whether semi-canonical and non-canonical junctions should be output

If --non-canonical is specified, all junctions are output.

If --semi-canonical is specified, semi-canonical and canonical junctions are output.

If neither is specified, only canonical junctions are output.
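
The effect of the two flags can be summarized as a small decision function. This is only a restatement of the rules above in code; the function name and the class labels are chosen for illustration.

```python
def reported_junction_classes(non_canonical=False, semi_canonical=False):
    """Return the set of junction classes that are output for the given flags."""
    if non_canonical:
        # --non-canonical: all junctions are output.
        return {"canonical", "semi-canonical", "non-canonical"}
    if semi_canonical:
        # --semi-canonical: semi-canonical and canonical junctions.
        return {"canonical", "semi-canonical"}
    # Neither flag: canonical junctions only.
    return {"canonical"}
```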

--fusion-non-canonical | --fusion-semi-canonical

Whether semi-canonical and non-canonical fusion junctions should be output

If --fusion-non-canonical is specified, all fusion junctions are output.

If --fusion-semi-canonical is specified, semi-canonical and canonical fusion junctions are output.

If neither is specified, only canonical fusion junctions are output.

Outputting only canonical fusion junctions is suggested.

--not-rem-temp

If specified, do not remove temporary directory and files after MapSplice is finished running 

--full-running

If specified, run a remapping step to increase the junction coverage 

-n/--min-anchor <int>

The anchor length that will be used for single anchored spliced alignment
 -Decreasing this value will find more alignments but use more running time

 -Should be greater than or equal to 6


-R/--remap-mismatches <int>

The maximum number of mismatches allowed during remapping. The default is 2. Should be in the range [0-3].

-m/--splice-mismatches <int>

The maximum number of mismatches allowed in a segment crossing a splice junction. The default is 1.

-i/--min-intron-length <int>

The "minimum intron length". MapSplice will not report alignments with a gap of fewer than this many bases. The default is 1.

-x/--max-intron-length <int>

The "maximum intron length". MapSplice will not report alignments with a gap longer than this many bases for a single anchored spliced alignment. The default is 200000.

-X/--threads <int>

Number of threads Bowtie uses when mapping reads

--max-hits <int>

max_hits x 10 is the maximum number of repeated hits permitted during segment mapping and read mapping (default is 4 x 10 = 40)

-r/--max-insert <int>

The maximum small indel length (default is 3, suggested to be in [0-3])

--min-missed-seg <int>

An option to output incomplete alignments.
        -The minimal number of segments contained in alignment.
        -E.g. if the read length is 75 bp and segment_length is 25, then setting min_missed_seg to 1 will output 50 bp alignments if there are no 75 bp alignments for the corresponding reads
        -By default, only alignments of full read length are output

--search-whole-chromosome

If specified, search up to the maximum intron length away in both exonic and non-exonic regions.
        -Exonic region: a region covered by mapped segments during segment mapping
        -Normally MapSplice will only search up to the maximum intron length away in exonic regions for fractions (i.e. small exons < segment length) of a spliced segment
        -This enables MapSplice to find spliced alignments in small exons (< segment length) at the head and tail across the chromosome, but will increase running time
 

--map-segments-directly

        -If specified, MapSplice will try to find both spliced and unspliced alignments of a read and select the best alignment (this will increase running time)
        -If not specified, MapSplice will first try to find unspliced alignments of a read; only if none are found will it try to find spliced alignments for the read
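
The two search strategies differ only in control flow, which the sketch below makes explicit. All names here are hypothetical: find_unspliced and find_spliced stand in for the two alignment searches, and score stands in for whatever criterion selects the best alignment, which this guide does not specify.

```python
def align_read(read, find_unspliced, find_spliced, score,
               map_segments_directly=False):
    """Return the alignments reported for one read under either strategy."""
    if map_segments_directly:
        # Search both kinds of alignment and keep the best one.
        candidates = find_unspliced(read) + find_spliced(read)
        return [max(candidates, key=score)] if candidates else []
    # Default: spliced alignments are searched only when no unspliced
    # alignment was found.
    unspliced = find_unspliced(read)
    if unspliced:
        return unspliced
    return find_spliced(read)
```

The default path skips the spliced search for every read that has an unspliced alignment, which is where the difference in running time comes from.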

--run-MapPER

If specified, run MapPER (PMID 20576625) and generate read mappings based on a probabilistic framework. Valid for paired-end reads.

--fusion

Whether fusion junctions should be output
        -Reads not aligned as normal unspliced or spliced alignments are considered fusion candidates
        -The outputs are "fusion.junction" and "fusion_junction.unique" if full-running is not turned on
        -The output is "fusion_remap_junction.unique.chr_seq.extracted" if full-running is turned on

--cluster

Whether to use paired-end reads to generate cluster regions for fusion read mappings
        -Uses paired-end reads to find fusion alignments with a single anchored method
        -e.g. use 2x50 paired reads and a 25 bp segment length to find fusion alignments
        -Only valid for paired-end reads with full running and fusion turned on (set full_running = yes and do_fusion = yes)

Help and version options:  
-h/--help Print the help message and exit
-v/--version Print the version of MapSplice and exit


 

 


Examples

Three examples run on the hg18 chr20 reference genome

Before running the examples, make sure that bowtie and bowtie-build are in the MapSplice path, and that the reference genome and index are in the paths given in the command-line options.

Example 1: 1M 36 bp FASTQ reads

python mapsplice_segments.py -Q fq -o 1M_36bp_output_path -c chr20_sequence_index_path -u reads_path/1M_36bp_fastq.txt -B chr20_sequence_index_path/index -L 18 2>36bp_time.log

Example 2: 1M 50 bp FASTQ reads

python mapsplice_segments.py -Q fq -o 1M_50bp_output_path -c chr20_sequence_index_path -u reads_path/1M_50bp_fastq.txt -B chr20_sequence_index_path/index -L 25 2>50bp_time.log

Example 3: 1M 100 bp FASTQ reads

python mapsplice_segments.py -Q fq -o 1M_100bp_output_path -c chr20_sequence_index_path -u reads_path/1M_100bp_fastq.txt -B chr20_sequence_index_path/index -L 25 2>100bp_time.log

*Hints:

*Building the index with bowtie-build can take hours, but the index only needs to be built once and is reusable, so do not delete it.

*The time log is written to stderr. It can be redirected with '2>'.
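
The example commands and the stderr redirection can also be reproduced from Python. This is a minimal sketch that uses only the options shown in the examples; the helper names are hypothetical, and the placeholder paths are the same ones used in the commands above.

```python
import subprocess


def example_command(read_file, output_dir, seglen,
                    chrom_dir="chr20_sequence_index_path",
                    script="mapsplice_segments.py"):
    """Build the argument list used in the examples."""
    return ["python", script,
            "-Q", "fq",
            "-o", output_dir,
            "-c", chrom_dir,
            "-u", read_file,
            "-B", chrom_dir + "/index",
            "-L", str(seglen)]


def run_with_time_log(cmd, log_path):
    """Run a command and write its stderr (the time log) to log_path."""
    with open(log_path, "w") as log:
        return subprocess.call(cmd, stderr=log)
```

For instance, run_with_time_log(example_command("reads_path/1M_36bp_fastq.txt", "1M_36bp_output_path", 18), "36bp_time.log") corresponds to Example 1.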

 

 

 

MapSplice Output


best_remapped_junction.bed | best_junction.bed    Junctions in UCSC BED format
alignments.sam    Spliced and unspliced read alignments in SAM format
fusion_junction | fusion_remap_junction.unique.chr_seq.extracted (remapped junction) | fusion_remap_junction.unique.chr_seq.extracted.repeat_filtered (remapped repeat filtered junction)    Format
fusion.remapped.unique    Format
prob_alignment.sam    Predicted read alignments based on a probabilistic framework

 

  
 

 

Last Update: 07/08