What is MapSplice?
MapSplice is software for mapping RNA-seq reads to a reference genome for splice junction discovery.
- It depends only on the reference genome, not on any further annotations.
- It supports both paired-end and single-end reads, and uses the pairing information of paired-end reads for better mapping accuracy.
- It supports variable-length reads.
- It finds unspliced and spliced alignments simultaneously.
- It detects:
  - novel canonical, semi-canonical and non-canonical splice junctions
  - novel insertions and deletions
  - novel gene fusion events
How does MapSplice work?
MapSplice first splits reads into segments and maps them to the reference genome using Bowtie. For segments that remain unmapped, MapSplice then tries to place them as gapped alignments, with each gap corresponding to a splice junction. A later remapping step identifies spliced alignments in the presence of small exons. MapSplice leverages the quality and diversity of the read alignments supporting a given splice junction to increase accuracy.
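The segmentation idea can be illustrated with a minimal Python sketch. It is not MapSplice's actual code: the function names are hypothetical, the handling of leftover bases in the last segment is an assumption, and the gap calculation is a simplification that only shows how the distance between two mapped neighbouring segments can be read as a candidate intron.

```python
def split_into_segments(read, seglen=25):
    """Split a read into consecutive segments of seglen bases.

    Assumption for this sketch: the last segment simply keeps any
    leftover bases. MapSplice itself may treat them differently.
    """
    return [read[i:i + seglen] for i in range(0, len(read), seglen)]


def implied_gap(left_start, left_len, right_start):
    """Number of reference bases between two mapped segments.

    left_start and right_start are reference coordinates of the first
    base of each segment. A positive result is a candidate splice gap.
    """
    return right_start - (left_start + left_len)


read = "ACGT" * 15                      # a 60 bp toy read
segments = split_into_segments(read)    # segment lengths: 25, 25, 10
gap = implied_gap(100, 25, 1125)        # 1000 reference bases skipped
```

With a 25 bp segment length, a segment mapped at position 100 and the next one at position 1125 leave a 1000 bp gap, which a spliced aligner would examine as a possible intron.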
What's new in MapSplice 2?
- MapSplice 2 has improved mapping sensitivity.
- MapSplice 2 now supports multithreading, which dramatically reduces running time on multi-core systems.
- MapSplice 2 now supports variable-length reads.
- MapSplice 2 is optimized for repeats.
- All the command line parameters have been re-designed for easier use.
System Requirements
- OS: Linux x86 64-bit system
- Memory: 6 GB
- Compiler: g++ 4.3.3 or higher
- Script: Python 2.4.3 or higher
Obtaining & Building MapSplice 2
- You can download the latest version of MapSplice here. For compatibility and convenience, Bowtie 0.12.7 and SAMtools 0.1.9 are included in the package.
- To build MapSplice, extract the compressed file, go to the MapSplice directory, and run "make".
python mapsplice.py [options]* -c <Reference_Sequence> -x <Bowtie_Index> -1 <Read_List1> -2 <Read_List2>
(note: mapsplice.py is in the root directory of MapSplice, not in the "bin" directory)
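As an illustration, the command line above can be assembled programmatically. This is a sketch under stated assumptions: the helper name `build_mapsplice_command` and all paths in the example are hypothetical, and only the `-c`, `-x`, `-1` and `-2` arguments from the usage line are used.

```python
def build_mapsplice_command(ref_dir, bowtie_index, reads1, reads2=None,
                            extra_options=()):
    """Return the mapsplice.py command as an argument list.

    reads1 and reads2 are lists of file names; each list is joined with
    commas and no blank spaces, as the -1 / -2 arguments expect.
    """
    cmd = ["python", "mapsplice.py"]
    cmd.extend(extra_options)
    cmd.extend(["-c", ref_dir, "-x", bowtie_index, "-1", ",".join(reads1)])
    if reads2:
        # -2 is only passed for paired-end runs
        cmd.extend(["-2", ",".join(reads2)])
    return cmd


# Hypothetical paths, for illustration only
cmd = build_mapsplice_command(
    "hg19/chromosomes",
    "hg19/index/hg19",
    ["sample_1.fastq"],
    ["sample_2.fastq"],
)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.call(cmd)` from the MapSplice root directory, since that is where mapsplice.py is located.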
|-c <Reference_Sequence>||The directory containing the sequence files of the reference genome. All sequence files are required to:
|-x <Bowtie_Index>||The basename (including directory path) of the Bowtie index to be searched. The basename is the name of any of the index files up to but not including the final .1.ebwt / .rev.1.ebwt / etc.|
|-1 <string>||Comma-separated (no blank space) list of read sequence files in FASTA/FASTQ format. When running with paired-end reads, this should contain the #1 mates (filename usually includes _1).|
|-2 <string>||Comma-separated (no blank space) list of read sequence files in FASTA/FASTQ format. -2 is only used when running with paired-end reads. This should contain the #2 mates (filename usually includes _2). Files must be in the same order as those specified in -1.|
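The rules for the index basename and the read lists can be checked before launching a run. The following Python sketch is not part of MapSplice; the helper names are hypothetical. It strips the standard Bowtie 1 index suffixes to recover the basename, and verifies that the -1 and -2 lists contain no blank spaces and have the same number of files.

```python
# Suffixes of the six files that make up a Bowtie 1 index.
# The ".rev." suffixes are listed first so they are matched before ".1.ebwt".
BOWTIE_SUFFIXES = (".rev.1.ebwt", ".rev.2.ebwt",
                   ".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt")


def bowtie_basename(index_file):
    """Return the index basename for any one of the index files."""
    for suffix in BOWTIE_SUFFIXES:
        if index_file.endswith(suffix):
            return index_file[:-len(suffix)]
    return index_file  # already a basename


def parse_read_list(value):
    """Split a comma-separated read list, rejecting blank spaces."""
    if " " in value:
        raise ValueError("read list must not contain blank spaces")
    return value.split(",")


def pair_read_lists(list1, list2):
    """Pair the -1 and -2 files by position; the lists must match in length."""
    files1 = parse_read_list(list1)
    files2 = parse_read_list(list2)
    if len(files1) != len(files2):
        raise ValueError("-1 and -2 must list the same number of files")
    return list(zip(files1, files2))
```

Pairing by position mirrors the requirement that the files in -2 appear in the same order as those in -1. The sketch cannot tell whether two files really are mates; it only catches lists of unequal length.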
- Input/Output and Performance options
- Alignment options
|-s / --seglen <int>||Reads are divided into <int> bp segments for initial alignment. Default is 25.|
MapSplice will only report read alignments that can be completely mapped or mapped no less than <int> bases. Default is 50. Set this option to 0 to only report completely mapped reads.
|-i / --min-intron <int>||Minimum length of splice junctions. MapSplice will not search for any splice junctions with a gap shorter than <int> bp. Default is 50.|
|-I / --max-intron <int>||Maximum length of splice junctions. MapSplice will not search for any splice junctions with a gap longer than <int> bp. Default is 300,000.|
|--non-canonical-double-anchor||Search for double anchored non-canonical junctions in addition to the default canonical and semi-canonical junctions.|
|--non-canonical-single-anchor||Search for single anchored non-canonical junctions in addition to the default canonical and semi-canonical junctions.|
|-m / --splice-mis <int>||Maximum number of mismatches allowed in the first/last segment crossing a splice junction, in the range [0, 2]. Default is 1. (The maximum number of mismatches allowed in a middle segment crossing a splice junction is fixed at 2.)|
|--max-append-mis <int>||Maximum number of mismatches allowed to append a high error exonic segment next to an adjacent low error segment. Default is 3.|
|--ins <int>||Maximum insertion length (insertion in read / deletion in reference genome). Default is 6; must be in the range [0, 10].|
|--del <int>||Maximum deletion length. (deletion in read / insertion in reference genome). Default is 6.|
|--fusion | --fusion-non-canonical||--fusion: Search for canonical and semi-canonical fusion junctions. --fusion-non-canonical: Search for canonical, semi-canonical, and non-canonical fusion junctions.|
|--min-fusion-distance <int>||Minimum distance between two segments for them to be considered a fusion candidate. This threshold applies when the two segments align to the same chromosome but on different strands or in correct order. Default is 10,000. Consider setting this to 200 if you are detecting circular RNA.|
Gene annotation file in GTF format, used to annotate fusion junctions. It can be downloaded from the ENSEMBL FTP site (e.g., for human hg19: Homo_sapiens.GRCh37.66.gtf.gz; older versions that lack the "gene_biotype" field are NOT supported). Required for the detection of circular RNA.
The stringency level of filtering splice junctions in the range of [1, 2]. Default is 2.
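To make the numeric bounds in the alignment options concrete, here is a small Python sketch. The helper names are hypothetical and this is not how MapSplice implements its checks. It tests whether a gap falls inside the intron search range given by --min-intron and --max-intron, and whether an option value lies in its documented range.

```python
def gap_in_search_range(gap, min_intron=50, max_intron=300000):
    """True if a gap of `gap` bp would be searched as a splice junction.

    Gaps shorter than min_intron or longer than max_intron are skipped.
    The defaults are the documented defaults of -i and -I.
    """
    return min_intron <= gap <= max_intron


def check_range(name, value, low, high):
    """Return value if low <= value <= high, otherwise raise ValueError."""
    if not low <= value <= high:
        raise ValueError("%s must be in the range [%d, %d], got %d"
                         % (name, low, high, value))
    return value


# Documented ranges: --splice-mis in [0, 2], --ins in [0, 10]
splice_mis = check_range("--splice-mis", 1, 0, 2)
max_ins = check_range("--ins", 6, 0, 10)
```

For example, with the default settings a 40 bp gap is too short to be searched as a junction, whereas a 1,000 bp gap falls inside the search range.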
- Other Options
|-h/--help||Print the usage message|
|-v/--version||Print the version of MapSplice|
Output of MapSplice 2
By default, read alignments are reported in SAM format to alignments.sam. If --bam is specified, read alignments are reported in BAM format to alignments.bam. Please see the SAM / BAM format specification.
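Spliced alignments in the SAM output can be recognised by the N operation in the CIGAR string, which marks a skipped region of the reference. The Python sketch below is a hypothetical helper and not part of MapSplice. It extracts the skipped intervals from one SAM alignment line, using 1-based inclusive reference coordinates as in the SAM POS field. The coordinate convention of junctions.txt may differ.

```python
import re

# One CIGAR operation: a length followed by an operation code
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_sam_line(line):
    """Return (chrom, first_base, last_base) for every N gap in a SAM line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, cigar = fields[2], int(fields[3]), fields[5]
    introns = []
    ref_pos = pos  # reference position of the next aligned base
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((chrom, ref_pos, ref_pos + length - 1))
        if op in "MDN=X":
            # only these operations consume reference bases
            ref_pos += length
    return introns


sam_line = "\t".join(["read1", "0", "chr1", "100", "255", "25M1000N25M",
                      "*", "0", "0", "A" * 50, "I" * 50])
print(introns_from_sam_line(sam_line))
```

For the toy record above, which starts at position 100 with CIGAR 25M1000N25M, the skipped region spans reference positions 125 to 1124.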
- Normal Splice Junction
Splice junctions are reported to "junctions.txt". Please see the detailed description of all the columns here.
Insertions are reported to insertions.txt. Please see the detailed description of all the columns here.
Deletions are reported to deletions.txt. Please see the detailed description of all the columns here.
- Fusion alignment
- Fusion Splice junction
If --fusion or --fusion-non-canonical is specified, raw fusion splice junctions are reported to fusions_raw.txt / fusions_candidates.txt. If --gene-gtf is also specified, annotated fusion splice junctions are reported to fusions_well_annotated.txt / fusions_not_well_annotated.txt / circular_RNAs.txt. Please see the detailed description of all the columns here.
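As a summary of this section, the following Python sketch lists the output files to expect for a given combination of options. The function is hypothetical and only restates the file names given above. It assumes that the annotated fusion files are written only when a fusion option and --gene-gtf are used together, which is one reading of the description.

```python
def expected_outputs(bam=False, fusion=False, gene_gtf=False):
    """Return the output file names implied by the chosen options."""
    files = [
        "alignments.bam" if bam else "alignments.sam",
        "junctions.txt",
        "insertions.txt",
        "deletions.txt",
    ]
    if fusion:  # --fusion or --fusion-non-canonical
        files += ["fusions_raw.txt", "fusions_candidates.txt"]
        if gene_gtf:  # assumption: annotation needs a fusion option as well
            files += [
                "fusions_well_annotated.txt",
                "fusions_not_well_annotated.txt",
                "circular_RNAs.txt",
            ]
    return files


for name in expected_outputs(bam=True, fusion=True, gene_gtf=True):
    print(name)
```

A script that post-processes a run could use such a list to confirm that every expected file exists in the output directory before continuing.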