DiffSplice: Running DiffSplice

 User Guide





System requirements

OS: Ubuntu 9.04 (64-bit)
Compiler: g++ 4.3.3 or higher

Installing DiffSplice

You can download the DiffSplice release package here. The release package contains both the binary files and the source code. To use the binaries, go to the directory where the DiffSplice folder is located and execute the binary diffsplice with the configuration explained below. Alternatively, the diffsplice binary can be built from the sources with the command make, which uses g++ to compile. If the binaries do not work and your installed compiler is not compatible with g++, please modify the file Makefile in the release package to use the command your compiler requires.


Running DiffSplice


1. Download the latest DiffSplice package, unzip it, and go to the DiffSplice folder.
2. List the SAM files of the RNA-seq reads and their grouping information in the file datafile.cfg, using the format described below.
3. Edit the running parameters in file settings.cfg.
4. Run DiffSplice with "./diffsplice settings.cfg datafile.cfg path_to_result > runname.log"

We have prepared a test data set that you can play with, available at the download page. The test data package contains SAM files from a simulated two-group comparison along with config files. Please download it together with the latest DiffSplice release for a test run.



The full command line for running DiffSplice is

./diffsplice  [options]  settings.cfg  datafile.cfg  path_to_result > runname.log

 1. options

 -p  parse the SAM files and prepare input files for the following steps of DiffSplice
 -s  construct the expression-weighted splice graph (ESG), derive the alternative splice modules (ASMs), and estimate the abundance of the alternative paths
 -x  perform permutation test to select differentially expressed genes and differentially transcribed ASMs

This field is optional. If you are running DiffSplice for the first time, please skip this field. Specify these options only when you already have DiffSplice results from a previous run and want to rerun only the later steps, such as generating the ESG and ASMs or performing the differential test.

The DiffSplice pipeline consists of 3 steps: preprocessing the SAM files, constructing the ESG and deriving the ASMs, and performing the differential tests. The first step extracts the information required by DiffSplice from the SAM files you provide. It creates a folder named data under the target folder you specify with the argument path_to_result and stores its outputs there. The second step collects junctions and expressed exonic units from the preprocessed files, constructs the ESG, generates the ASMs, and estimates the abundance of the alternative transcription paths. Its outputs are stored in the folder named result and include the GTF tracks of the splice graph and the alternative splicing modules. The last step performs the differential tests and selects genes and ASMs with significant changes in expression. Its outputs are also written to the result folder and contain the two final tables (differential_expression.txt and differential_transcription.txt) as well as the false discovery rates at different cutoffs (FDR_expression_all.txt and FDR_transcription_all.txt).
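
After a run, it can be handy to confirm that the final tables were written. The following is a minimal Python sketch of such a check, not part of DiffSplice. It assumes that the result folder sits directly under path_to_result (as described for the data folder); the helper name missing_outputs is hypothetical.

```python
from pathlib import Path

# File names of the final outputs, as listed in the paragraph above.
FINAL_OUTPUTS = [
    "differential_expression.txt",
    "differential_transcription.txt",
    "FDR_expression_all.txt",
    "FDR_transcription_all.txt",
]


def missing_outputs(path_to_result):
    """Return the names of the final output files that are not present.

    Assumption: the "result" folder is located directly under path_to_result.
    """
    result_dir = Path(path_to_result) / "result"
    return [name for name in FINAL_OUTPUTS if not (result_dir / name).is_file()]


# Example call; /NGS_result/ is the example path used in this guide.
print(missing_outputs("/NGS_result/"))
```

An empty list means all four files were found.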

These steps can be run independently, provided that the results of the previous steps are available. For example, use the following command to redo steps 2 and 3:

./diffsplice  -sx  settings.cfg  datafile.cfg  path_to_result > runname.log
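
If you launch DiffSplice from a script, the command line can be assembled programmatically. The sketch below is only an illustration in Python: the step names "parse", "graph" and "test" and the helper build_command are hypothetical, while the flags p, s and x and the argument order come from the description above.

```python
import shlex

# Map hypothetical step names to the documented option letters.
STEP_FLAGS = {"parse": "p", "graph": "s", "test": "x"}


def build_command(settings, datafile, result_dir, steps=()):
    """Return the argument list for one DiffSplice run.

    With an empty `steps`, the options field is omitted, as recommended
    for a first run. Otherwise the letters are combined, e.g. "-sx".
    """
    argv = ["./diffsplice"]
    if steps:
        argv.append("-" + "".join(STEP_FLAGS[s] for s in steps))
    argv += [settings, datafile, result_dir]
    return argv


cmd = build_command("settings.cfg", "datafile.cfg", "/NGS_result/",
                    steps=("graph", "test"))
# Show the equivalent shell command, with the log redirection appended.
print(shlex.join(cmd) + " > runname.log")
```

The resulting list can be passed to subprocess.run together with an open log file as stdout.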


 2. settings.cfg

This file gives the settings for running DiffSplice.

Parameter name    Value    Meaning
thresh_junction_filter_max_read_support    numeric    a splice junction is filtered out if the maximum number of its junction-spanning reads over all samples is <= thresh_junction_filter_max_read_support
thresh_junction_filter_mean_read_support    numeric    a splice junction is filtered out if the average number of its junction-spanning reads over all samples is <= thresh_junction_filter_mean_read_support
thresh_junction_filter_num_samples_presence    numeric    a splice junction is filtered out if the junction is found in <= thresh_junction_filter_num_samples_presence samples
ignore_minor_alternative_splicing_variants    yes/no    if "yes", only the "major" transcript paths, i.e. those with an estimated proportion (averaged over all samples) of no less than 5% in the ASM, are considered when cataloging alternative splicing and calculating the differential transcription signal
thresh_average_read_coverage_exon    numeric    expression threshold on exons (averaged over all samples)
thresh_average_read_coverage_intron    numeric    expression threshold on introns (averaged over all samples)
balanced_design_for_permutation_test    yes/no    if "yes", the permutation test matches the individual name and shuffles only the samples from the same individual; otherwise, the permutation test randomly shuffles all samples in the groups; see datafile.cfg below for more explanation
false_discovery_rate    numeric    false discovery rate threshold for the differential test
thresh_foldchange_up    numeric    minimum fold change for significantly upregulated gene expression from group1 to group2
thresh_foldchange_down    numeric    maximum fold change for significantly downregulated gene expression from group1 to group2
thresh_sqrtJSD    numeric    minimum value of the square root of the Jensen-Shannon divergence (JSD) for significant differential transcription from group1 to group2
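
To make the three junction filters concrete, here is a small Python sketch of how they could be applied to the read counts of one junction. It is not DiffSplice's code. It assumes that a junction is dropped as soon as any one of the three conditions holds, and that a junction counts as "found" in a sample when it has at least one spanning read. The function name and the threshold values in the example are hypothetical.

```python
def junction_is_filtered(read_support, max_thresh, mean_thresh, presence_thresh):
    """Decide whether a splice junction is filtered out.

    read_support: number of junction-spanning reads, one entry per sample.
    The three thresholds correspond to the thresh_junction_filter_*
    parameters. Assumption: the conditions are combined with a logical OR.
    """
    max_support = max(read_support)
    mean_support = sum(read_support) / len(read_support)
    # Assumption: "found in a sample" means at least one spanning read.
    num_present = sum(1 for count in read_support if count > 0)
    return (max_support <= max_thresh
            or mean_support <= mean_thresh
            or num_present <= presence_thresh)


weak = [0, 1, 0, 2, 0, 0, 1, 0]          # sparse support in 8 samples
strong = [10, 12, 8, 9, 11, 10, 7, 13]   # consistent support in 8 samples
print(junction_is_filtered(weak, 3, 1, 2))    # True: maximum support 2 <= 3
print(junction_is_filtered(strong, 3, 1, 2))  # False: no condition holds
```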

 3. datafile.cfg

The statistical significance of the differential transcription and differential expression between two sample groups is assessed through a permutation test. This file lists the data files and their grouping information in the following format:

     group_name    individual_name    sample_name    data_file

For example, a basic specification might be

     g1    id1    s1    file1.sam
     g1    id1    s2    file2.sam
     g1    id1    s3    file3.sam
     g1    id1    s4    file4.sam
     g2    id1    s1    file5.sam
     g2    id1    s2    file6.sam
     g2    id1    s3    file7.sam
     g2    id1    s4    file8.sam

This specification puts files 1-4 into group g1 and files 5-8 into group g2. During the permutation test, all samples in the same group are treated equally, and all 8 samples are shuffled together.
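
The four-column layout is easy to check before starting a run. The following Python sketch reads such a listing and collects the files per group. It assumes that the columns are separated by whitespace and that file names contain no spaces; the function name parse_datafile is hypothetical and not part of DiffSplice.

```python
from collections import defaultdict


def parse_datafile(text):
    """Parse rows of: group_name individual_name sample_name data_file.

    Returns a dict mapping each group name to a list of
    (individual, sample, data_file) tuples, in file order.
    """
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        group, individual, sample, data_file = fields
        groups[group].append((individual, sample, data_file))
    return dict(groups)


example = """
     g1    id1    s1    file1.sam
     g1    id1    s2    file2.sam
     g2    id1    s1    file5.sam
     g2    id1    s2    file6.sam
"""
for group, rows in parse_datafile(example).items():
    print(group, [data_file for _, _, data_file in rows])
```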

In some cases, it may be preferable to separate the samples in a group into subgroups, or blocks. For example, suppose that the 8 samples in the data set (samples 1-4 for treatment/condition group 1 and samples 5-8 for treatment/condition group 2) actually come from 2 individuals: samples 1-2 and samples 5-6 are from individual 1, and samples 3-4 and samples 7-8 are from individual 2. If we want to test the difference between the two groups while controlling for the variation between individuals 1 and 2, we can specify the balanced permutation in settings.cfg (balanced_design_for_permutation_test) and set up the data files as follows

     g1    id1    s1    file1.sam
     g1    id1    s2    file2.sam
     g1    id2    s1    file3.sam
     g1    id2    s2    file4.sam
     g2    id1    s1    file5.sam
     g2    id1    s2    file6.sam
     g2    id2    s1    file7.sam
     g2    id2    s2    file8.sam

During the permutation test, samples 1,2,5,6 will be shuffled together and samples 3,4,7,8 will be shuffled together. However, please note that this block design decreases the number of permutations and may affect the accuracy of the significance estimates. Therefore, avoid the block design and use the basic style if the sample size is limited.
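
The effect of the block design on the number of permutations can be worked out for the 8-sample example. The Python sketch below illustrates the idea only and is not DiffSplice's implementation. It assumes exactly two groups and counts the distinct ways of reassigning the group labels while keeping the group sizes fixed (within each individual when balanced). The function names are hypothetical.

```python
import random
from collections import defaultdict
from math import comb, prod

# (group, individual, file) rows from the blocked example above.
samples = [
    ("g1", "id1", "file1.sam"), ("g1", "id1", "file2.sam"),
    ("g1", "id2", "file3.sam"), ("g1", "id2", "file4.sam"),
    ("g2", "id1", "file5.sam"), ("g2", "id1", "file6.sam"),
    ("g2", "id2", "file7.sam"), ("g2", "id2", "file8.sam"),
]


def blocks_of(samples, balanced):
    """Group sample indices by individual (balanced) or into one block."""
    blocks = defaultdict(list)
    for idx, (_, individual, _) in enumerate(samples):
        blocks[individual if balanced else "all"].append(idx)
    return list(blocks.values())


def count_relabelings(samples, balanced):
    """Count distinct group-label assignments (assumes two groups)."""
    total = []
    for idxs in blocks_of(samples, balanced):
        labels = [samples[i][0] for i in idxs]
        total.append(comb(len(labels), labels.count(labels[0])))
    return prod(total)


def shuffle_labels(samples, balanced, rng):
    """Return one random relabeling; labels move only within a block."""
    new_labels = [None] * len(samples)
    for idxs in blocks_of(samples, balanced):
        labels = [samples[i][0] for i in idxs]
        rng.shuffle(labels)
        for i, label in zip(idxs, labels):
            new_labels[i] = label
    return new_labels


print(count_relabelings(samples, balanced=False))  # 70 = C(8,4)
print(count_relabelings(samples, balanced=True))   # 36 = C(4,2) * C(4,2)
print(shuffle_labels(samples, balanced=True, rng=random.Random(0)))
```

With only 36 distinct relabelings instead of 70, the permutation distribution is coarser, which is why the basic style is preferable for small sample sizes.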

 4. path_to_result

The absolute path of the directory where the results should be stored, for example /NGS_result/. Please do not use a relative path such as ./ or similar.
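
When DiffSplice is started from a script, the path can be checked or converted beforehand. This is a minimal Python sketch using only the standard library; the helper name is hypothetical.

```python
import os.path


def is_valid_result_path(path):
    """Return True if `path` is absolute, as required for path_to_result."""
    return os.path.isabs(path)


print(is_valid_result_path("/NGS_result/"))  # True
print(is_valid_result_path("./"))            # False
# A relative path can be turned into an absolute one before the call:
print(os.path.abspath("results"))
```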