16. beta_tSNE.py¶
16.1. Description¶
This program performs t-SNE (t-Distributed Stochastic Neighbor Embedding) analysis for samples.
Example of input data file
ID Sample_01 Sample_02 Sample_03 Sample_04
cg_001 0.831035 0.878022 0.794427 0.880911
cg_002 0.249544 0.209949 0.234294 0.236680
cg_003 0.845065 0.843957 0.840184 0.824286
...
Example of input group file
Sample,Group
Sample_01,normal
Sample_02,normal
Sample_03,tumor
Sample_04,tumo
...
Notes
Rows with missing values will be removed
Beta values will be standardized into z scores
Only the first two components will be visualized
Different perplexity values can result in significantly different results
Even with same data and save parameters, different run might give you (slightly) different result. It is perfectly fine to run t-SNE a number of times (with the same data and parameters), and to select the visualization with the lowest value of the objective function as your final visualization.
16.2. Options¶
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -i INPUT_FILE, --input_file=INPUT_FILE
Tab-separated data frame file containing beta values with the 1st row containing sample IDs and the 1st column containing CpG IDs.
- -g GROUP_FILE, --group=GROUP_FILE
Comma-separated group file defining the biological groups of each sample. Different groups will be colored differently in the t-SNE plot.
- -p PERPLEXITY_VALUE, --perplexity=PERPLEXITY_VALUE
This is a tunable parameter of t-SNE, and has a profound effect on the resulting 2D map. Consider selecting a value between 5 and 50, and the selected value should be smaller than the number of samples (i.e., number of points on the t-SNE 2D map). Default = 5
- -n N_COMPONENTS, --ncomponent=N_COMPONENTS
Number of components. default=2
- --n_iter=N_ITERATIONS
The maximum number of iterations for the optimization. Should be at least 250. default=5000
- --learning_rate=LEARNING_RATE
The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If the learning rate is too high, the data may look like a ‘ball’ with any point approximately equidistant from its nearest neighbors. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers. If the cost function gets stuck in a bad local minimum increasing the learning rate may help. default=200.0
- -o OUT_FILE, --output=OUT_FILE
The prefix of the output file.
16.3. Input files (examples)¶
16.4. Command¶
$beta_tSNE.py -i cirrHCV_vs_normal.data.tsv -g cirrHCV_vs_normal.grp.csv -o HCV_vs_normal