Manual 0.7
0. Before you start, you should know
The pipeline is optimized by mtDNA sequencing data, thus there may be some problems when running it with other types of data (starting position on the reference is not 1;or too much data), we would likt to work with you to solve all the problems you met.
Indels are always assigned to the forward strand, thus became strand-specific, we will solve this problem in a later version
code in the dotted box is an example using the Poisson method
I am particularly interested in detecting low-level mutations in tumor tissues, let me know if you want to initiate a callaboration. I am also looking for a postdoc now.
1.Generating pileup file from sorted SAM file
To build a pileup file, you need CRISP package CRISP Website (in case they shut down the webpage sometimes, you can contact its author or me)
python sam_to_pileup.py indi1.sorted.sam refsequence.fasta > indi1.pileup
samtools mpileup can not output the mismatch number
2. Generating ssp file from the pileup file (indi.pileup)
Refine the pileup file by mismatch number, quality score, mapping quality score
perl filter_and_summary.pl
-i pileup file
-d number of mismatches allowed
-q minimum quality score
-m minimum mapping quality score
-r length of the reads
-s length of the read bins <10bp>
-a your targeted region: chrX:Y-Z, e.g., MT:100-1500 (only one region is supported in this version)
to save the output, use "> output.file"
3. Generating error profile by using reference panel (for POISSON/Fisher method)
All the ssp files should be saved in one folder, and with a specific suffix
perl error_profile_pois.pl
-d folder including all population data(in ssp format)
-s suffix of the ssp files
-i length of the read bins<10bp>
-r length of the reads
The present folder should include all the ssp files: test.ssp, test1.ssp, test2.ssp ..., they should have the same suffix specified by [-s]
after running this scrip, you should have error_pois.index and error_pois.position under the same folder
3.1 Generating error profile by using reference panel (for EMP method)
All the ssp files should be saved in one folder, and with a specific suffix.
perl error_profile_emp.pl
-d folder including all population data(in ssp format)
-s suffix of the ssp files
-i length of the read bins<10bp>
-m minimum number of samples,otherwise,combine nucleotides
-t minimum coverage in each bin
After running this scrip, you should have error_emp.index under the same folder
-t: the bin whose coverage is lower than that defined by <-t> won't be included in the reference error database
-m: if the number of error rates (reference sample size) in the bin is lower than that defined by <-m>, different nucleotides will be merged if the T-test p-value>0.005
If the number of error rates in the bin is lower than 10, the sample will be skipped as error rates could be highly biased
If the number of error rates in the bin is >50 and lower than <-m>, psudo error rate would be generated assuming a normal distribution
4.Quality score calculation with Poisson/Fisher distribution
perl Dreep_poisson.pl
-i ssp file
-r error profile file (error_pois.index)
-n length of the reads bins(10bp)
-f minimum error rate (0.002), does not applicable for Fisher method
-t remove the C-stretch and STR (for mitochondrial data)
-e use it when the questioned individual is included in the reference panel
-d minimum coverage to apply consensus-specific error profile(1000)
-b output Bias instead of Phred-scaled Quality Score
By running this script, you should have an output file of test_pois.log, please include -e if your questioned samples are used to calculate the error_pois.index
If you don't want to use the empirical error rate(which we highly recommendate), you can give a fixed error rate, e.g.,
This could result in more conserved result unless the error rate is higher than the one you specified, this option could be used to scan common variations
4.1. With empirical distribution
perl Dreep_emp.pl
-i ssp file
-r error profile file (error_emp.index)
-l length of the read bins(10bp)
-m minimum number of error rate entries in error_emp.index,otherwise,merge different nucleotides at the same position
-s lowest p_value for each bin(the empirical p_value depends on the sample size, <-s> is used to define the lower bound 0.01~0.001 would be fine).
-b output Bias instead of Phred-scaled Quality Score
An output with suffix of pois.log/emp.log will be generated for each sample
5. Specify your own heteroplasmy(LLM) criteria
perl Dreep_poi_filter.pl
-n directory including all the log file (output of Dreep_emp.pl Dreep_pois.pl)
-s suffix of all the result file (log)
-a minor allele frequency
-b minor allele frequency per strand
-c minor allele count
-d minor allele count per strand
-e minor allele count (distinct reads)
-f minor allele count per strand(distinct reads)
-g minium coverage
-h minimum QS
-i minimum QS per strand
-j minimum perc of supported reads (for pois only)
-k minimum perc of supported reads per strand (for pois only)
-l maximum perc of the 3rd/3rd+4th allele (out of all non-major allele)
-t ignore the C-stretch region
-r minimum p_value allowed for the Fisher Exact test for minor allele counts and major allele counts on different strands (0,deactivated)
-q if you want to skip some samples, put their IDs in a file (per ID per line).
-p ignore the 10-bp adjacent to an indel
The present directory should include all the _pois.log files that having the same suffix
By running this command, you would have all low-level mutations candidate printed on the screen. or use > to save it elsewhere
If you want it works for one file, give more specific suffix