3. CpG_anno_probe.py
3.1. Description
This program annotates each 450K/850K array probe ID by appending 17 columns to the original input data file. From left to right, the columns are:
Header Name | Description |
hg19_pos | The genomic position of the CpG on human genome assembly hg19 (or GRCh37). |
hg38_pos | The genomic position of the CpG on human genome assembly hg38 (or GRCh38). |
strand | Strand of the CpG. Values: “R” (reverse strand) or “F” (forward strand). |
geneSymbol | Genes the CpG has been assigned to. “N/A” indicates no genes were found. This is retrieved from the Illumina MethylationEpic v1.0 B4 manifest file. |
CpGisland | The CpG island (CGI) that overlaps with this CpG. “N/A” indicates no CGIs were found. |
with_450K | Boolean indicating whether this CpG probe is also included in 450K. “0” = No, “1” = Yes. |
SNP_ID | SNPs (rsID) that are close to this CpG. Multiple SNPs are separated by “;”. “N/A” indicates no SNPs were found. |
SNP_distance | The nucleotide distances between SNPs and the CpG. |
SNP_MAF | The minor allele frequencies (MAF) of SNPs. |
Cross_Reactive | Boolean (“0” = No, “1” = Yes) indicating whether this CpG could be affected by cross-hybridization or underlying genetic variation as reported by this paper. |
ENCODE_TF_ChIP | Transcription factor (TF) binding sites identified from ChIP-seq experiments performed by the ENCODE project. Peaks from 1264 experiments representing 338 transcription factors in 130 cell types are combined (N = 10,560,472). The BED format file was downloaded from the UCSC Table Browser, and a detailed description is provided here. |
ENCODE_DNaseI | DNase I hypersensitivity sites identified from ENCODE DNase-seq experiments. Peaks from 125 cell types are combined (N = 1,867,665). The BED format file was downloaded from the UCSC Table Browser, and a detailed description is provided here. |
ENCODE_H3K27ac_ChIP | H3K27ac peaks identified from ENCODE histone ChIP-seq experiments. Peaks from 11 cell types (GM12878, H1-hESC, HMEC, HSMM, HUVEC, HeLaS3, HepG2, K562, Monocytes-CD14+_RO01746, NHEK, NHLF) are combined (N = 665,650). |
ENCODE_H3K4me1_ChIP | H3K4me1 peaks identified from ENCODE histone ChIP-seq experiments. Peaks from 11 cell types (GM12878, H1-hESC, HMEC, HSMM, HUVEC, HeLaS3, HepG2, K562, Monocytes-CD14+_RO01746, NHEK, NHLF) are combined (N = 1,435,550). |
ENCODE_H3K4me3_ChIP | H3K4me3 peaks identified from ENCODE histone ChIP-seq experiments. Peaks from 11 cell types (GM12878, H1-hESC, HMEC, HSMM, HUVEC, HeLaS3, HepG2, K562, Monocytes-CD14+_RO01746, NHEK, NHLF) are combined (N = 525,824). |
ENCODE_chromHMM | Chromatin state segmentation by chromHMM from ENCODE. Chromatin states across 9 cell types (GM12878, H1-hESC, K562, HepG2, HUVEC, HMEC, HSMM, NHEK, NHLF) were learned computationally by integrating 9 factors (CTCF, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H4K20me1) plus input. A total of 15 states were identified: state-1 (Active Promoter), state-2 (Weak Promoter), state-3 (Inactive/poised Promoter), state-4 and 5 (Strong enhancer), state-6 and 7 (Weak/poised enhancer), state-8 (Insulator), state-9 (Transcriptional transition), state-10 (Transcriptional elongation), state-11 (Weak transcribed), state-12 (Polycomb-repressed), state-13 (Heterochromatin or low signal), state-14 and 15 (Repetitive/Copy Number Variation). The original chromatin state BED file was downloaded from the UCSC Table Browser, and a detailed description is provided here. |
FANTOM_enhancer | FANTOM5 human enhancers downloaded from here. |
3.2. Notes
For peaks identified from ENCODE ChIP-seq and DNase-seq (ENCODE_TF_ChIP, ENCODE_H3K27ac_ChIP, ENCODE_H3K4me1_ChIP, ENCODE_H3K4me3_ChIP, and ENCODE_DNaseI), a probe is assigned to a peak only if it is located within the 100 bp window centered on the middle of the peak.
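The window rule can be illustrated with a minimal sketch. This is not the program's own code: the function name probe_in_peak_window, the integer midpoint, and the half-open window boundaries are all assumptions made for illustration.

```python
def probe_in_peak_window(probe_pos, peak_start, peak_end, window=100):
    """Return True if probe_pos lies in the `window`-bp window centered
    on the midpoint of the peak [peak_start, peak_end)."""
    midpoint = (peak_start + peak_end) // 2
    half = window // 2
    # Assumption: the window is half-open, [midpoint - half, midpoint + half).
    return midpoint - half <= probe_pos < midpoint + half


# A peak spanning 1000-1400 has its midpoint at 1200,
# so the 100 bp window covers positions 1150 to 1249.
print(probe_in_peak_window(1210, 1000, 1400))  # True
print(probe_in_peak_window(1300, 1000, 1400))  # False
```

Under this rule a probe that overlaps a wide peak but sits far from its center (position 1300 above) is not annotated with that peak.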
3.3. Options
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -i INPUT_FILE, --input_file=INPUT_FILE
Tab-separated input data file in which one column contains 450K/850K array CpG IDs. This file can be a regular text file or a compressed file (.gz, .bz2).
- -a ANNO_FILE, --annotation=ANNO_FILE
Annotation file. This file can be a regular text file or a compressed file (.gz, .bz2).
- -o OUT_FILE, --output=OUT_FILE
Prefix of the output file.
- -p PROBE_COL, --probe_column=PROBE_COL
The index of the column that contains probe IDs. Note: the column index starts at 0. Default: 0.
- -l, --header
Input data file has a header row.
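How the -p and -l options interact with plain or compressed input can be sketched as follows. This is an illustration under assumptions, not the program's implementation: the helper names open_text and read_probe_ids, the extension-based format detection, and the demo file contents (including the probe IDs) are hypothetical.

```python
import bz2
import gzip
import os
import tempfile


def open_text(path):
    """Open a plain, gzip, or bzip2 file in text mode, chosen by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def read_probe_ids(path, probe_col=0, header=False):
    """Collect the probe ID from the 0-based column probe_col of each row."""
    ids = []
    with open_text(path) as handle:
        for line_no, line in enumerate(handle):
            if header and line_no == 0:
                continue  # skip the header row, as with -l
            fields = line.rstrip("\n").split("\t")
            ids.append(fields[probe_col])
    return ids


# Demo on a made-up gzipped BED6-like file; probe IDs sit in the 4th column.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.bed6.gz")
    with gzip.open(demo, "wt") as out:
        out.write("chrom\tstart\tend\tprobe\tscore\tstrand\n")
        out.write("chr1\t100\t101\tcg_demo_1\t0\t+\n")
        out.write("chr1\t200\t201\tcg_demo_2\t0\t-\n")
    demo_ids = read_probe_ids(demo, probe_col=3, header=True)

print(demo_ids)  # ['cg_demo_1', 'cg_demo_2']
```

Because the index is 0-based, the 4th column corresponds to probe_col=3, which matches -p 3 on the command line.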
3.4. Input files (examples)
3.5. Command
# probe IDs are located in the 4th column (-p 3)
$ CpG_anno_probe.py -p 3 -l -a MethylationEPIC_CpGtools.tsv -i test_01.hg19.bed6 -o output
or, taking gzipped files as input:
$ CpG_anno_probe.py -p 3 -l -a MethylationEPIC_CpGtools.tsv.gz -i test_01.hg19.bed6.gz -o output
@ 2019-06-28 09:12:41: Read annotation file "../epic/MethylationEPIC_CpGtools.tsv" ...
@ 2019-06-28 09:12:52: Add annotation information to "test_01.hg19.bed6" ...
3.6. Output files
output.anno.txt
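A downstream use of the annotated file might look like the sketch below. It assumes the output is tab-separated with a header row whose names match the columns listed in the Description; the inline demo table, the "probe" column name, and all IDs in it are made up for illustration, and real output contains more columns.

```python
import csv
import io

# Made-up excerpt standing in for output.anno.txt (only three columns shown).
demo_output = (
    "probe\tSNP_ID\tCross_Reactive\n"
    "cg_demo_1\trs_demo_1;rs_demo_2\t0\n"
    "cg_demo_2\tN/A\t1\n"
    "cg_demo_3\tN/A\t0\n"
)


def parse_snp_ids(value):
    """Split a ';'-separated SNP_ID field; 'N/A' means no SNPs were found."""
    return [] if value == "N/A" else value.split(";")


kept = {}
for row in csv.DictReader(io.StringIO(demo_output), delimiter="\t"):
    if row["Cross_Reactive"] == "1":
        continue  # drop probes flagged as cross-reactive
    kept[row["probe"]] = parse_snp_ids(row["SNP_ID"])

print(kept)  # {'cg_demo_1': ['rs_demo_1', 'rs_demo_2'], 'cg_demo_3': []}
```

To run this against a real file, replace io.StringIO(demo_output) with an open file handle and adjust the probe column name to whatever the input header uses.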