1. What should be used as the reference: the target region or the whole genome?
Versions before 0.6 cannot retrieve a specified region of the reference (there is no option to define a region). Instead, they always start from the first position on the reference, even if that position is not in your pileup file. Using your target region as the reference will therefore make the run faster.
From version 0.6, this problem is solved by the -a option in filter_and_summary.pl.
2. How to define a bin?
The length of a read bin should be chosen according to how the error rate changes along the read. In our test dataset (mtDNA mock mixture, sequencing coverage = 2000x, 76 bp reads), bin sizes of 5 bp, 10 bp, and 20 bp gave the same result (no false positives or false negatives). Theoretically, higher depth in a bin leads to better inference. At the same time, we suggest using at least 4 bins to counteract the effect of duplicate reads.
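To see how a bin size translates into a number of bins, here is a minimal sketch. It assumes that bins are consecutive, fixed-width windows along the read and that a shorter final bin still counts as a bin; the tool itself may bin differently. The function names bin_count and bin_index are hypothetical and are not part of the program.

```python
import math


def bin_count(read_length: int, bin_size: int) -> int:
    """Number of fixed-width bins needed to cover a read.

    Assumption: a partial bin at the end of the read counts as one bin.
    """
    if read_length <= 0 or bin_size <= 0:
        raise ValueError("read_length and bin_size must be positive")
    return math.ceil(read_length / bin_size)


def bin_index(read_pos: int, bin_size: int) -> int:
    """0-based bin index for a 1-based position on the read."""
    if read_pos < 1 or bin_size <= 0:
        raise ValueError("read_pos must be >= 1 and bin_size positive")
    return (read_pos - 1) // bin_size


if __name__ == "__main__":
    # The 76 bp reads and the bin sizes mentioned in the text.
    for size in (5, 10, 20):
        print(size, "bp bins ->", bin_count(76, size), "bins")
```

Under these assumptions, a 20 bp bin on 76 bp reads yields 4 bins, which is the suggested minimum; a larger bin size would fall below it.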
3. Which method should be used?
The POISSON method assumes that sequencing errors follow a Poisson distribution; the Fisher and EMP methods do not assume any specific distribution of the sequencing error (please see the details in the forthcoming paper). If you are working with reads generated in the same lane, the EMP method is preferred, as it is more sensitive than the POIS/Fisher methods; otherwise, POIS/Fisher should be preferred, as they are more robust to error variation among different lanes/runs. I have to admit that the Poisson/Fisher methods are much easier to use than the EMP method (fewer parameters), and you do need to know how the EMP method works to specify its parameters. I therefore suggest starting with the POISSON/Fisher method; if you see the benefit of the approach, you can then try the EMP method.
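For intuition only, the sketch below shows what a Poisson-based test of this general kind can look like: given an expected error count in a bin, it computes the probability of seeing at least the observed minor allele count by chance. This is a generic upper-tail Poisson calculation, not the program's actual implementation; the function name and the example numbers are hypothetical.

```python
import math


def poisson_upper_tail(observed: int, expected: float) -> float:
    """P(X >= observed) for X ~ Poisson(expected).

    Computed as 1 - P(X <= observed - 1) by summing the probability mass
    term by term. Suitable for moderate expected counts only.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)  # P(X = 0)
    cdf = 0.0
    for i in range(observed):
        cdf += term
        term *= expected / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    # Hypothetical bin: 2 errors expected, 9 minor alleles observed.
    print(poisson_upper_tail(9, 2.0))
```

A small tail probability means the observed minor allele count is hard to explain by sequencing error alone under the Poisson assumption.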
4. What does our new statistic "Bias" mean (from which our Quality Score is obtained)?
Bias > 1 indicates that the observed minor allele count is lower than the expected error count, with higher Bias values indicating a lower minor allele count (relative to the expected error count). Bias < 1 indicates that the observed minor allele count is higher than the expected error count, with lower Bias values indicating a higher minor allele count (relative to the expected error count). Moreover, we have shown in this study that a cut-off of 0.1 could remove all the false positives in the simulation and distinguish all the real LLMs from sequencing errors in the artificially mixed samples. The p-value in each bin has a lower bound of 0.000001, so the range of Bias (after taking the n-th root) is [0.000001, 1000000]. However, only positions with Bias in [0.000001, 1] are of interest for LLM detection (i.e., minor allele count > expected error count). For convenience, we convert these Bias values into a Phred-scaled score in the range [0, 60]; all other Bias values are converted to 0, which indicates that the minor allele count is equal to or less than the expected error count.
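The conversion described above can be sketched as follows. The sketch assumes the standard Phred scaling, Q = -10 * log10(Bias), which is consistent with the stated ranges (Bias of 0.000001 gives 60, Bias of 1 gives 0); the exact formula used by the program is not given here, and the function name bias_to_quality is hypothetical.

```python
import math

BIAS_FLOOR = 1e-6  # lower bound of Bias stated in the text


def bias_to_quality(bias: float) -> float:
    """Convert a Bias value to a Phred-scaled score in [0, 60].

    Assumption: Q = -10 * log10(Bias) for Bias in [1e-6, 1].
    Bias values of 1 or more map to 0.
    """
    if bias <= 0:
        raise ValueError("Bias must be positive")
    if bias >= 1.0:
        return 0.0
    # Clamp to the floor so the score never exceeds 60.
    return -10.0 * math.log10(max(bias, BIAS_FLOOR))


if __name__ == "__main__":
    for b in (1e-6, 0.1, 1.0, 1000.0):
        print(b, "->", bias_to_quality(b))
```

Under this assumed scaling, the Bias cut-off of 0.1 mentioned above corresponds to a Quality Score of 10.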
5. Format of the output
The log file is organized as follows:
Each row corresponds to a position on the reference; a position is skipped if no minor allele is observed at that position.
Within each row, the fields appear in this order:
[position]
[Flag] (reserved)
[Quality Score for the forward strand]
[Percentage of the reads that support a real LLM: proportion of reads located in bins whose minor allele count is greater than the expected error count]
[Quality Score for the reverse strand]
[Percentage of the reads that support a real LLM]
[Quality Score for all reads]
[Percentage of the reads that support a real LLM]
[Coverage at this position]
[Minor allele frequency]
[Major allele]
[Major allele count]
[Minor allele]
[Minor allele count]
[Major allele:supported by (1|2) strands:forward-reverse:unique reads forward-unique reads reverse] (unique reads are defined by their positions on reads)
[Minor allele:supported by (1|2) strands:forward-reverse:unique reads forward-unique reads reverse]
[Minor allele frequency on the forward strand]
[Minor allele frequency on the reverse strand]
[All minor alleles]
[Count of minor alleles other than the first and second alleles]
Example: 12 L 3.4 76 0.0 0 1.0 51 409 0.002 T 408 A 1 FC:2:(273-135)(12-17) SC:1:(1-0)(1-0) 0.004 0.000 A 0
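A row like the example above can be read into named fields with a short script. This is a minimal sketch that assumes the fields are separated by whitespace and always number 20, as in the example; the field names are hypothetical labels chosen for illustration and do not come from the program.

```python
# (name, converter) pairs in the order the fields appear in a log row.
# The names are illustrative labels, not identifiers used by the tool.
FIELDS = [
    ("position", int),
    ("flag", str),
    ("qs_forward", float),
    ("pct_support_forward", float),
    ("qs_reverse", float),
    ("pct_support_reverse", float),
    ("qs_all", float),
    ("pct_support_all", float),
    ("coverage", int),
    ("minor_allele_freq", float),
    ("major_allele", str),
    ("major_count", int),
    ("minor_allele", str),
    ("minor_count", int),
    ("major_strand_info", str),
    ("minor_strand_info", str),
    ("maf_forward", float),
    ("maf_reverse", float),
    ("all_minor_alleles", str),
    ("other_minor_count", int),
]


def parse_log_row(line: str) -> dict:
    """Split one whitespace-delimited log row into a dict of typed fields."""
    tokens = line.split()
    if len(tokens) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} fields, got {len(tokens)}: {line!r}"
        )
    return {name: cast(tok) for (name, cast), tok in zip(FIELDS, tokens)}


if __name__ == "__main__":
    example = ("12 L 3.4 76 0.0 0 1.0 51 409 0.002 T 408 A 1 "
               "FC:2:(273-135)(12-17) SC:1:(1-0)(1-0) 0.004 0.000 A 0")
    row = parse_log_row(example)
    print(row["position"], row["major_allele"], row["minor_allele"],
          row["qs_all"], row["coverage"])
```

In the example row, the major and minor allele counts (408 and 1) add up to the reported coverage of 409, which is a quick sanity check on the field order.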
6. How to use a control dataset?
A control dataset (either without low-level mutations or with designated LLMs) can help you determine your cut-off value and is therefore highly recommended.
7. Should I use the other filters available in Dreep_result_filter.pl?
Yes, especially when you want to detect heteroplasmy at a frequency lower than 5%. My suggested options are -b 0.02 -d 3 -f 3 -g 100 -i 10 -t -r 0.00001; use -r 0 if you want to include indels.