beta_tSNE.py
=============
Description
------------
This program performs t-SNE (t-Distributed Stochastic Neighbor Embedding)
analysis for samples.

**Example of input data file**

::

    ID      Sample_01  Sample_02  Sample_03  Sample_04
    cg_001  0.831035   0.878022   0.794427   0.880911
    cg_002  0.249544   0.209949   0.234294   0.236680
    cg_003  0.845065   0.843957   0.840184   0.824286
    ...


**Example of input group file**

::

    Sample,Group
    Sample_01,normal
    Sample_02,normal
    Sample_03,tumor
    Sample_04,tumor
    ...

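
To illustrate how the two input files relate, the sketch below reads a group file with Python's standard ``csv`` module and checks that every sample named in the data file header has a group assigned. The function name ``read_groups`` and the inline text standing in for the files are hypothetical; this is not necessarily how beta_tSNE.py parses its input.

```python
import csv
import io


def read_groups(handle):
    """Return {sample_id: group} from a comma-separated file with a Sample,Group header."""
    reader = csv.DictReader(handle)
    return {row["Sample"].strip(): row["Group"].strip() for row in reader}


# Inline stand-in for a group file (hypothetical content modeled on the example format).
group_text = (
    "Sample,Group\n"
    "Sample_01,normal\n"
    "Sample_02,normal\n"
    "Sample_03,tumor\n"
    "Sample_04,tumor\n"
)
groups = read_groups(io.StringIO(group_text))

# Header line of the tab-separated data file; the first field labels the CpG ID column.
header = "ID\tSample_01\tSample_02\tSample_03\tSample_04".split("\t")

# Samples present in the data file but absent from the group file.
missing = [sample for sample in header[1:] if sample not in groups]
print(groups)
print("samples without a group:", missing)
```

A check like this can catch misspelled sample IDs before the plot is drawn, since a sample without a group cannot be assigned a color.
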

**Notes**

- Rows with missing values will be removed.
- Beta values will be standardized into z-scores.
- Only the first two components will be visualized.
- Different perplexity values can produce significantly different results.
- Even with the same data and the same parameters, different runs might give
  (slightly) different results. It is perfectly fine to run t-SNE several times
  (with the same data and parameters) and to select the visualization with the
  lowest value of the objective function as the final visualization.

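
The preprocessing and run-selection steps in the notes above can be sketched in plain Python. This is a minimal illustration, not the code used by beta_tSNE.py. It assumes that each CpG is standardized across samples using the population standard deviation, and the helper names (``drop_incomplete_rows``, ``zscore``, ``pick_best_run``) are hypothetical.

```python
import math
import statistics


def drop_incomplete_rows(rows):
    """Keep only CpG rows in which every beta value is a finite number."""
    return {
        cpg: vals
        for cpg, vals in rows.items()
        if all(v is not None and math.isfinite(v) for v in vals)
    }


def zscore(values):
    """Standardize one CpG's beta values across samples (population SD, an assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    # A constant row has sd == 0; return zeros rather than dividing by zero.
    return [0.0 if sd == 0 else (v - mean) / sd for v in values]


def pick_best_run(run_once, seeds):
    """Call run_once(seed) -> (embedding, objective) for each seed; keep the lowest objective."""
    return min((run_once(seed) for seed in seeds), key=lambda result: result[1])


rows = {
    "cg_001": [0.831035, 0.878022, 0.794427, 0.880911],
    "cg_002": [0.249544, None, 0.234294, 0.236680],  # missing value, so this row is dropped
    "cg_003": [0.845065, 0.843957, 0.840184, 0.824286],
}
clean = drop_incomplete_rows(rows)
scaled = {cpg: zscore(vals) for cpg, vals in clean.items()}
print(sorted(scaled))
```

``pick_best_run`` is deliberately independent of any t-SNE library. With scikit-learn, for example, ``run_once`` could fit ``sklearn.manifold.TSNE(random_state=seed)`` and return the embedding together with the fitted model's ``kl_divergence_`` attribute, which serves as the objective value to compare across runs.
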
Options
--------

  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file=INPUT_FILE
                        Tab-separated data frame file containing beta values
                        with the 1st row containing sample IDs and the 1st
                        column containing CpG IDs.
  -g GROUP_FILE, --group=GROUP_FILE
                        Comma-separated group file defining the biological
                        groups of each sample. Different groups will be
                        colored differently in the t-SNE plot.
  -p PERPLEXITY_VALUE, --perplexity=PERPLEXITY_VALUE
                        This is a tunable parameter of t-SNE, and has a
                        profound effect on the resulting 2D map. Consider
                        selecting a value between 5 and 50, and the selected
                        value should be smaller than the number of samples
                        (i.e., number of points on the t-SNE 2D map). Default
                        = 5
  -n N_COMPONENTS, --ncomponent=N_COMPONENTS
                        Number of components. default=2
  --n_iter=N_ITERATIONS
                        The maximum number of iterations for the optimization.
                        Should be at least 250. default=5000
  --learning_rate=LEARNING_RATE
                        The learning rate for t-SNE is usually in the range
                        [10.0, 1000.0]. If the learning rate is too high, the
                        data may look like a ‘ball’ with any point
                        approximately equidistant from its nearest neighbors.
                        If the learning rate is too low, most points may look
                        compressed in a dense cloud with few outliers. If the
                        cost function gets stuck in a bad local minimum
                        increasing the learning rate may help. default=200.0
  -o OUT_FILE, --output=OUT_FILE
                        The prefix of the output file.

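
Because the perplexity should be smaller than the number of samples, it can help to validate the value before running the analysis. The following is a small sketch of such a check; ``check_perplexity`` is a hypothetical helper and is not part of beta_tSNE.py. It uses a hypothetical data set with only four samples, for which the default perplexity of 5 is already too large.

```python
def check_perplexity(perplexity, n_samples):
    """Return perplexity if it is positive and smaller than n_samples, else raise ValueError."""
    if perplexity <= 0:
        raise ValueError("perplexity must be positive")
    if perplexity >= n_samples:
        raise ValueError(
            f"perplexity ({perplexity}) must be smaller than "
            f"the number of samples ({n_samples})"
        )
    return perplexity


# Hypothetical data set with four samples: the default perplexity of 5 is rejected.
n_samples = 4
try:
    check_perplexity(5, n_samples)
    outcome = "ok"
except ValueError:
    outcome = "rejected"
print(outcome)

# A smaller value passes the check for the same four samples.
usable = check_perplexity(3, n_samples)
print(usable)
```

For small data sets, passing a lower value with ``-p`` avoids this situation.
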
Input files (examples)
-------------------------
- cirrHCV_vs_normal.data.tsv
- cirrHCV_vs_normal.grp.csv

Command
----------

::

    $ beta_tSNE.py -i cirrHCV_vs_normal.data.tsv -g cirrHCV_vs_normal.grp.csv -o HCV_vs_normal

Output files
---------------
- HCV_vs_normal.t-SNE.r
- HCV_vs_normal.t-SNE.tsv
- HCV_vs_normal.t-SNE.pdf

.. image:: ../_static/HCV_vs_normal.tSNE.png
   :height: 450 px
   :width: 450 px
   :scale: 100 %