19. dmc_Bayes.py¶
19.1. Description¶
Different from statistical testing, this program tries to estimates “how different the means between the two groups are” using the Bayesian approach. An MCMC is used to estimate the “means”, “difference of means”, “95% HDI (highest posterior density interval)”, and the posterior probability that the HDI does NOT include “0”.
It is similar to John Kruschke’s BEST algorithm (Bayesian Estimation Supersedes T test)
Notes
This program is much slower than T-test due to MCMC (Markov chain Monte Carlo) step. Running it with multiple threads is highly recommended.
19.2. Options¶
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -i INPUT_FILE, --input_file=INPUT_FILE
Data file containing beta values with the 1st row containing sample IDs (must be unique) and the 1st column containing CpG positions or probe IDs (must be unique). Except for the 1st row and 1st column, any non-numerical values will be considered as “missing values” and ignored. This file can be a regular text file or compressed file (.gz, .bz2).
- -g GROUP_FILE, --group=GROUP_FILE
Group file defining the biological group of each sample. It is a comma-separated 2 columns file with the 1st column containing sample IDs, and the 2nd column containing group IDs. It must have a header row. Sample IDs should match to the “Data file”. Note: Only for two group comparison.
- -n N_ITER, --niter=N_ITER
Iteration times when using MCMC Metropolis-Hastings’s agorithm to draw samples from the posterior distribution. default=5000
- -b N_BURN, --burnin=N_BURN
Number of simulated samples to discard. Thes initial samples are usually not completely valid because the Markov Chain has not stabilized to the stationary distribution. default=500.
- -p N_PROCESS, --processor=N_PROCESS
The number of processes. default=1
- -s SEED, --seed=SEED
The seed used by the random number generator. default=99
- -o OUT_FILE, --output=OUT_FILE
The prefix of the output file.
19.3. Input files (examples)¶
19.4. Command¶
$ dmc_Bayes.py -i test_05_TwoGroup.tsv.gz -g test_05_TwoGroup.grp.csv.gz -p 10 -o dmc_output
19.5. Output files¶
dmc_output.bayes.tsv: this file consists of 6 columns:
ID : CpG ID
mu1 : Mean methylation level estimated from group1
mu2 : Mean methylation level estimated from gropu2
mu_diff : Difference between mu1 and mu2
mu_diff (95% HDI) : 95% of “High Density Interval” of mu_diff. The HDI indicates which points of distribution are most credible. This interval spans 95% of mu_diff’s distribution.
The probability that mu1 and mu2 are different.
$head -10 dmc_output.bayes.tsv
ID mu1 mu2 mu_diff mu_diff (95% HDI) Probability
cg00001099 0.775209 0.795404 -0.020196 (-0.065148,0.023974) 0.811024
cg00000363 0.610565 0.469523 0.141042 (0.030769,0.232965) 0.994665
cg00000884 0.845973 0.873761 -0.027787 (-0.051976,-0.004398) 0.984882
cg00000714 0.190868 0.199233 -0.008365 (-0.030071,0.014006) 0.816141
cg00000957 0.772905 0.827528 -0.054623 (-0.092116,-0.016465) 0.995327
cg00000292 0.748394 0.766326 -0.017932 (-0.051286,0.012583) 0.889729
cg00000807 0.729162 0.683732 0.045430 (-0.001523,0.086588) 0.981551
cg00000721 0.935903 0.935080 0.000823 (-0.013210,0.018628) 0.508686
cg00000948 0.898609 0.897536 0.001073 (-0.020663,0.026813) 0.518238