1. CpG_aggregation.py

1.1. Description

Aggregate proportion values of a list of CpGs that located in give genomic regions (eg. CpG islands, promoters, exons, etc.).

Example of input file

Chrom  Start   End     score
chr1   100017748       100017749       3,10
chr1   100017769       100017770       0,10
chr1   100017853       100017854       16,21

Notes

Outlier CpG will be removed if the probability of observing its proportion value is less than p-cutoff. For example, if alpha set to 0.05, and there are 10 CpGs (n - 10) located in a particular genomic region, the p-cutoff of this genomic region is 0.005 (0.05/10). Supposing the total reads mapped to this region is 100, out of which 25 are methylated reads (i.e. regional methylation level (beta) - 25/100 - 0.25)

  • The probability of observing CpG (3,10) is : pbinom(q=3, size=10, prob=0.25) = 0.7759

  • The probability of observing CpG (0,10) is : qpbinom(q=0, size=10, prob=0.25) = 0.05631

  • The probability of observing CpG (16,21) is : pbinom(q=16, size=21, prob=0.25, lower.tail=F) = 1.19e-07 (outlier)

1.2. Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input=INPUT_FILE

Input CpG file in BED format. The first 3 columns contain “Chrom”, “Start”, and “End”. The 4th column contains proportion values.

-a ALPHA_CUT, --alpha=ALPHA_CUT

The chance of mistakingly assign a particular CpG as an outlier for each genomic region. default-0.05

-b BED_FILE, --bed=BED_FILE

BED3+ file specifying the genomic regions.

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

1.4. Command

$CpG_aggregation.py -b hg19.RefSeq.union.1Kpromoter.bed.gz  -i 0_du145_133_glp_sh1.bed -o out

1.5. Output

chr1    567292  568293  3       0       93      3       0       93
chr1    713567  714568  6       0       100     6       0       100
chr1    762401  763402  7       0       110     7       0       110
chr1    762470  763471  10      0       158     10      0       158
chr1    854571  855572  2       12      16      2       12      16
chr1    860620  861621  16      91      232     16      91      232
chr1    894178  895179  12      151     229     41      506     735

Description

  • Column1-3: Genome coordinates

  • Column4-6: numbers of “CpG”, “aggregated methyl reads”, and “aggregate total reads” after outlier filtering

  • Column7-9: numbers of “CpG”, “aggregated methyl reads”, and “aggregate total reads” before outlier filtering