1. Overview

The CpGtools package provides a number of Python programs to annotate, QC, visualize, and analyze DNA methylation data generated from the Illumina HumanMethylation450 BeadChip (450K) or MethylationEPIC BeadChip (850K) arrays, or from RRBS / WGBS.

These programs can be divided into three groups:

  • CpG position analysis modules

  • CpG signal analysis modules

  • Differential CpG analysis modules

1.1. CpG position analysis modules

These modules are primarily used to analyze the genomic locations of CpGs.

Name

Description

CpG_anno_probe.py

Adds comprehensive annotation information to each 450K/850K probe ID.

CpG_anno_position.py

Adds comprehensive annotation information to each CpG based on its genomic coordinate.

CpG_aggregation.py

Aggregates proportion values of a list of CpGs that are located in given genomic regions.

CpG_distrb_chrom.py

Calculates the distribution of CpGs over chromosomes.

CpG_distrb_gene_centered.py

Calculates the distribution of CpGs over gene-centered genomic regions including ‘Coding exons’, ‘UTR exons’, ‘Introns’, ‘Upstream intergenic regions’, and ‘Downstream intergenic regions’.

CpG_distrb_region.py

Calculates the distribution of CpGs over user-specified genomic regions (such as promoters, enhancers).

CpG_logo.py

Generates a DNA motif logo for a given set of CpGs (to visualize the genomic context of these CpGs).

CpG_to_gene.py

Assigns CpGs to their putative target genes. Follows the “Basal plus extension” rules used by the GREAT algorithm.
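As an illustration of the kind of summary produced by the chromosome distribution module, the sketch below counts CpGs per chromosome. It is not the code of CpG_distrb_chrom.py: the function name `count_cpgs_per_chrom` is hypothetical, and the input is assumed to be BED-like, tab-separated lines with the chromosome name in the first column.

```python
from collections import Counter


def count_cpgs_per_chrom(bed_lines):
    """Count CpGs per chromosome from BED-like lines (chrom, start, end, ...).

    Hypothetical helper for illustration; assumes tab-separated fields with
    the chromosome name in the first column.
    """
    counts = Counter()
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split("\t")[0]
        counts[chrom] += 1
    return counts


# Made-up example records, one CpG per line.
example = [
    "chr1\t10468\t10469",
    "chr1\t10470\t10471",
    "chr2\t500\t501",
    "# a comment line that is ignored",
]

chrom_counts = count_cpgs_per_chrom(example)
total = sum(chrom_counts.values())

# Print chromosome, CpG count, and fraction of all CpGs.
for chrom, n in sorted(chrom_counts.items()):
    print(f"{chrom}\t{n}\t{n / total:.3f}")
```

The same counting pattern extends to region-based distributions if each CpG is first assigned to a region label instead of a chromosome.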

1.2. CpG signal analysis modules

These modules are primarily used to analyze the DNA methylation beta values of CpGs.

Name

Description

beta_PCA.py

Performs PCA for samples.

beta_jitter_plot.py

Generates a jitter plot and a bean plot for each sample (column).

beta_m_conversion.py

Converts Beta-values into M-values, or vice versa.

beta_profile_gene_centered.py

Calculates the methylation profile (i.e., average beta value) for gene-centered genomic regions.

beta_profile_region.py

Calculates the methylation profile (i.e., average beta value) for user-specified genomic regions.

beta_stacked_barplot.py

Creates a stacked barplot for each sample, showing the proportions of CpGs whose beta values fall into four ranges: [0.00, 0.25], (0.25, 0.50], (0.50, 0.75], and (0.75, 1.00].

beta_stats.py

Gives basic information on CpGs located in genomic regions, including “Number of CpGs”, “Min methylation level”, “Max methylation level”, “Mean methylation level across all CpGs”, “Median methylation level across all CpGs”, and “Standard deviation”.

beta_topN.py

This program picks the N most variable CpGs from the input file. The resulting file can be used for PCA/t-SNE or clustering analysis.

beta_trichotmize.py

This program uses a Bayesian Gaussian Mixture model to trichotomize beta values into three states: “Un-methylated”, “Semi-methylated”, and “Full-methylated”. Values that cannot be assigned to a state are labeled “unassigned”.

beta_tSNE.py

This program performs t-SNE (t-Distributed Stochastic Neighbor Embedding) analysis.
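Two of the operations listed above are simple enough to sketch directly. The first is the Beta-value / M-value conversion, using the standard logit relationship M = log2(Beta / (1 - Beta)) and its inverse. The second is the four-range binning that underlies the stacked barplot. This is a minimal sketch and not the code of beta_m_conversion.py or beta_stacked_barplot.py: the function names and the clamping constant `eps` are assumptions made for illustration.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a Beta-value to an M-value: M = log2(beta / (1 - beta)).

    Beta is clamped to [eps, 1 - eps] so the logarithm stays finite at
    exactly 0 or 1. The value of eps is an assumed choice.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Convert an M-value back to a Beta-value: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


def beta_bin_proportions(betas):
    """Return the proportion of beta values falling into each of the ranges
    [0.00, 0.25], (0.25, 0.50], (0.50, 0.75], and (0.75, 1.00]."""
    if not betas:
        raise ValueError("at least one beta value is required")
    upper_edges = (0.25, 0.50, 0.75, 1.00)
    counts = [0, 0, 0, 0]
    for b in betas:
        if not 0.0 <= b <= 1.0:
            raise ValueError(f"beta value out of range: {b}")
        # Assign the value to the first range whose upper edge it does not exceed.
        for i, upper in enumerate(upper_edges):
            if b <= upper:
                counts[i] += 1
                break
    return [c / len(betas) for c in counts]


# Made-up beta values for one sample.
sample = [0.0, 0.25, 0.3, 0.5, 0.9, 1.0]
proportions = beta_bin_proportions(sample)
print(proportions)
print(beta_to_m(0.5), m_to_beta(0.0))
```

M-values are unbounded, so they are often preferred for statistical testing, while beta values remain easier to interpret as a methylation proportion.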

1.3. Differential CpG analysis modules

These modules are primarily used to identify CpGs that are differentially methylated between groups.

Name

Description

dmc_Bayes.py

Unlike statistical testing, this program estimates how different the means of the two groups are, using a Bayesian approach. MCMC is used to estimate the “means”, the “difference of means”, the “95% HDI (highest posterior density interval)”, and the posterior probability that the HDI does NOT include “0”. It is similar to John Kruschke’s BEST algorithm.

dmc_bb.py

This program performs differential CpG analysis using the beta-binomial model based on methylation proportions (in the form of “c,n”, where “c” is the number of reads with methylated C and “n” is the total number of reads).

dmc_fisher.py

This program performs differential CpG analysis using Fisher’s Exact Test. It applies only to two-sample comparisons with no replicates.

dmc_glm.py

This program performs differential CpG analysis using the linear regression model.

dmc_logit.py

This program performs differential CpG analysis using the logistic regression model.

dmc_nonparametric.py

This program performs differential CpG analysis using the Mann-Whitney test for two-group comparisons, and the Kruskal-Wallis H-test for multiple-group comparisons.

dmc_ttest.py

This program performs differential CpG analysis using the t-test for two-group comparisons, and ANOVA for multiple-group comparisons.
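To make the two-sample, no-replicate case concrete, the sketch below runs a two-sided Fisher's exact test on a single CpG using only the Python standard library. It is not the code of dmc_fisher.py. The helper names `parse_cn` and `fisher_exact_two_sided` are hypothetical, and the input is assumed to use the same “c,n” form described for dmc_bb.py.

```python
from math import comb


def parse_cn(field):
    """Parse a "c,n" string into (methylated reads, total reads)."""
    c, n = (int(x) for x in field.split(","))
    if not 0 <= c <= n:
        raise ValueError(f"invalid c,n value: {field}")
    return c, n


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of seeing x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)


# Made-up read counts for one CpG in two samples.
c1, n1 = parse_cn("8,10")
c2, n2 = parse_cn("1,10")

# Rows are samples; columns are methylated and unmethylated read counts.
p_value = fisher_exact_two_sided(c1, n1 - c1, c2, n2 - c2)
print(f"p = {p_value:.6f}")
```

With replicates, a single 2x2 table no longer captures the within-group variability, which is why the regression-based and beta-binomial modules above are the better fit for that design.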