CpG_anno_probe.py
==================
Description
-----------------
This program annotates each 450K/850K array probe ID by appending 17 columns to the
original input data file. From left to right, these columns are:
+-----------------------+-------------------------------------------------------------------------+
| Header Name |Description |
+-----------------------+-------------------------------------------------------------------------+
| hg19_pos |The genomic position of the CpG on human genome assembly `hg19 (or |
| |GRCh37) `_ |
+-----------------------+-------------------------------------------------------------------------+
| hg38_pos |The genomic position of the CpG on human genome assembly `hg38 (or |
| |GRCh38) `_. |
+-----------------------+-------------------------------------------------------------------------+
| strand |Strand of the CpG. Value - "R" (reverse strand) or "F" (forward strand). |
+-----------------------+-------------------------------------------------------------------------+
| geneSymbol |Genes the CpG has been assigned to. "N/A" indicates no genes were found. |
| |This is retrieved from the Illumina `MethylationEpic v1.0 B4 |
| |`_ manifest file. |
+-----------------------+-------------------------------------------------------------------------+
| CpGisland |The CpG island (CGI) that overlaps with this CpG. "N/A" indicates no |
| |CGIs were found. |
+-----------------------+-------------------------------------------------------------------------+
| with_450K |Boolean indicating whether this CpG probe is also included in the 450K |
| |array. "0" - No, "1" - Yes. |
+-----------------------+-------------------------------------------------------------------------+
| SNP_ID |SNPs (rsID) that are close to this CpG. Multiple SNPs are separated |
| |by ";". "N/A" indicates no SNPs were found. |
+-----------------------+-------------------------------------------------------------------------+
| SNP_distance |The nucleotide distances between SNPs and the CpG. |
+-----------------------+-------------------------------------------------------------------------+
| SNP_MAF |The `minor allele frequencies (MAF) `_ of SNPs. |
+-----------------------+-------------------------------------------------------------------------+
| Cross_Reactive |Boolean ("0" - No, "1" - Yes) indicating whether this CpG could be |
| |affected by cross-hybridization or underlying genetic variation as |
| |reported by this `paper `_. |
+-----------------------+-------------------------------------------------------------------------+
| ENCODE_TF_ChIP |Transcription factor (TF) binding sites identified from ChIP-seq |
| |experiments performed by the `ENCODE `_ |
| |project. Peaks from 1264 experiments representing 338 transcription |
| |factors in 130 cell types are combined (N = 10,560,472). |
| |BED format file was downloaded from the `UCSC Table Browser |
| |`_, and a detailed description |
| |is provided `here `_. |
+-----------------------+-------------------------------------------------------------------------+
| ENCODE_DNaseI |DNase I hypersensitivity sites identified from ENCODE `DNase-seq |
| |`_ experiments. Peaks from |
| |125 cell types are combined (N = 1,867,665). BED format file was |
| |downloaded from the `UCSC Table Browser |
| |`_, and a detailed description |
| |is provided `here `_. |
+-----------------------+-------------------------------------------------------------------------+
|ENCODE_H3K27ac_ChIP |H3K27ac peaks identified from ENCODE histone ChIP-seq experiments. Peaks |
| |from 11 cell types (GM12878, H1-hESC, HMEC, HSMM, HUVEC, HeLaS3, HepG2, |
| |K562, Monocytes-CD14+_RO01746, NHEK, NHLF) are combined (N = 665,650) |
+-----------------------+-------------------------------------------------------------------------+
|ENCODE_H3K4me1_ChIP |H3K4me1 peaks identified from ENCODE histone ChIP-seq experiments. Peaks |
| |from 11 cell types (GM12878, H1-hESC, HMEC, HSMM, HUVEC, HeLaS3, HepG2, |
| |K562, Monocytes-CD14+_RO01746, NHEK, NHLF) are combined (N = 1,435,550) |
+-----------------------+-------------------------------------------------------------------------+
|ENCODE_H3K4me3_ChIP |H3K4me3 peaks identified from ENCODE histone ChIP-seq experiments. Peaks |
| |from 11 cell types (GM12878, H1-hESC, HMEC, HSMM, HUVEC, HeLaS3, HepG2, |
| |K562, Monocytes-CD14+_RO01746, NHEK, NHLF) are combined (N = 525,824) |
+-----------------------+-------------------------------------------------------------------------+
|ENCODE_chromHMM |Chromatin state segmentation by `chromHMM `_ from ENCODE. Chromatin |
| |states across 9 cell types |
| |(GM12878, H1-hESC, K562, HepG2, HUVEC, HMEC, HSMM, NHEK, NHLF) were |
| |learned computationally by integrating 9 factors (CTCF, H3K27ac, |
| |H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H4K20me1) |
| |plus input. A total of 15 states were identified: state-1 |
| |(Active Promoter), state-2 (Weak Promoter), state-3 (Inactive/poised |
| |Promoter), state-4 and 5 (Strong enhancer), state-6 and 7 |
| |(Weak/poised enhancer), state-8 (Insulator), state-9 (Transcriptional |
| |transition), state-10 (Transcriptional elongation), state-11 (Weak |
| |transcribed), state-12 (Polycomb-repressed), state-13 (Heterochromatin or|
| |low signal), state-14 and 15 (Repetitive/Copy Number Variation). |
| |Original chromatin state BED file was downloaded from `UCSC Table Browser |
| |`_, and a detailed description |
| |is provided `here `_. |
+-----------------------+-------------------------------------------------------------------------+
|FANTOM_enhancer |FANTOM5 human enhancers downloaded from `here `_. |
+-----------------------+-------------------------------------------------------------------------+
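
The ``SNP_ID`` conventions described in the table (multiple rsIDs separated by ";", and
"N/A" when no SNP was found) can be handled with a small helper when reading the annotated
output. The following is a minimal sketch; the function name ``parse_snp_ids`` is
hypothetical and is not part of CpG_anno_probe.py.

```python
def parse_snp_ids(field):
    """Split a SNP_ID value into a list of rsIDs.

    Follows the conventions in the column table: multiple SNPs are
    separated by ";" and "N/A" means no SNPs were found.
    """
    field = field.strip()
    if not field or field == "N/A":
        return []
    # Drop empty pieces in case of a trailing separator.
    return [snp for snp in field.split(";") if snp]


if __name__ == "__main__":
    print(parse_snp_ids("rs123;rs456"))
    print(parse_snp_ids("N/A"))
```

Returning an empty list for "N/A" lets downstream code test ``if snps:`` instead of
comparing against the placeholder string.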
Notes
-------
- For peaks identified from ENCODE ChIP-seq and DNase-seq experiments (ENCODE_TF_ChIP,
  ENCODE_H3K27ac_ChIP, ENCODE_H3K4me1_ChIP, ENCODE_H3K4me3_ChIP, and ENCODE_DNaseI), a probe
  is assigned to a peak only if it lies within the 100 bp window centered on the **middle** of the peak.
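
The window rule above can be illustrated with a short check. This is only a sketch of the
stated criterion, not the code used to build the annotation file: the function name, the
integer midpoint, and the half-open boundary convention are all assumptions.

```python
def probe_in_peak_window(probe_pos, peak_start, peak_end, window=100):
    """Return True if probe_pos falls in the window centered on the peak midpoint.

    Assumptions (not specified in the documentation): the midpoint is
    computed with integer division, and the window is half-open,
    i.e. [midpoint - window/2, midpoint + window/2).
    """
    midpoint = (peak_start + peak_end) // 2
    half = window // 2
    return midpoint - half <= probe_pos < midpoint + half


if __name__ == "__main__":
    # Peak spanning 1000-2000 has its midpoint at 1500.
    print(probe_in_peak_window(1520, 1000, 2000))
    print(probe_in_peak_window(1600, 1000, 2000))
```

Under this rule a probe can overlap a broad peak and still be left unannotated if it sits
far from the peak center.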
Options
-------
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input_file=INPUT_FILE
Input data file (Tab-separated) with a certain column
containing 450K/850K array CpG IDs. This file can be
a regular text file or compressed file (.gz, .bz2).
-a ANNO_FILE, --annotation=ANNO_FILE
Annotation file. This file can be a regular text file
or compressed file (.gz, .bz2).
-o OUT_FILE, --output=OUT_FILE
Prefix of the output file.
-p PROBE_COL, --probe_column=PROBE_COL
The number specifying which column contains probe IDs.
Note: the column index starts with 0. default-0.
-l, --header Input data file has a header row.
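
To make the ``-i``, ``-p`` and ``-l`` options concrete, the sketch below reads probe IDs
from a tab-separated file that may be plain text, ``.gz`` or ``.bz2``, using a 0-based
column index and optionally skipping a header row. The helper names are hypothetical, and
choosing the opener by file extension is an assumption; the program's own implementation
may differ.

```python
import bz2
import gzip


def open_maybe_compressed(path):
    """Open a plain, .gz, or .bz2 file in text mode, chosen by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def read_probe_ids(path, probe_col=0, has_header=False):
    """Collect probe IDs from column probe_col (0-based) of a tab-separated file."""
    probe_ids = []
    with open_maybe_compressed(path) as handle:
        for line_no, line in enumerate(handle):
            if has_header and line_no == 0:
                continue  # skip the header row (the -l option)
            fields = line.rstrip("\n").split("\t")
            if len(fields) > probe_col and fields[probe_col]:
                probe_ids.append(fields[probe_col])
    return probe_ids
```

For a BED6 file whose probe IDs are in the 4th column, the call would be
``read_probe_ids("test_01.hg19.bed6", probe_col=3, has_header=True)``.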
Input files (examples)
----------------------
- `test_01.hg19.bed6 `_
- `MethylationEPIC_CpGtools.tsv.gz `_
Command
-------
::

 # probe IDs are located in the 4th column (-p 3)
 $ CpG_anno_probe.py -p 3 -l -a MethylationEPIC_CpGtools.tsv -i test_01.hg19.bed6 -o output

 # or (take gzipped files as input)
 $ CpG_anno_probe.py -p 3 -l -a MethylationEPIC_CpGtools.tsv.gz -i test_01.hg19.bed6.gz -o output

 @ 2019-06-28 09:12:41: Read annotation file "../epic/MethylationEPIC_CpGtools.tsv" ...
 @ 2019-06-28 09:12:52: Add annotation information to "test_01.hg19.bed6" ...
Output files
-------------
- output.anno.txt
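
As one possible downstream use of ``output.anno.txt``, the sketch below keeps only probes
that are not flagged as cross-reactive and have no nearby SNP. It assumes the output is
tab-separated and begins with a header row containing the column names listed in the
Description table; both points should be checked against your own output. The function
names are hypothetical.

```python
import csv


def load_annotated(path):
    """Yield each row of a tab-separated annotated file as a dict keyed by header name."""
    with open(path, newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def clean_probes(rows):
    """Yield rows that are not cross-reactive and have no annotated SNP."""
    for row in rows:
        if row["Cross_Reactive"] == "0" and row["SNP_ID"] == "N/A":
            yield row


if __name__ == "__main__":
    # Small in-memory example; the rows are illustrative only.
    demo = [
        {"Cross_Reactive": "0", "SNP_ID": "N/A"},
        {"Cross_Reactive": "1", "SNP_ID": "N/A"},
    ]
    print(len(list(clean_probes(demo))))
```

With a real file this would be used as ``clean_probes(load_annotated("output.anno.txt"))``.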