Higher-Dimensional Data in R (PCA)

Part 1

Introduction:

~ Understand the concept behind PCA. I use biological data as the example here; you can substitute your own dataset. ~

Suppose we have 10 tumor samples from a real-time PCR experiment, with expression measurements for two genes, g1 and g2. These data can be displayed as a scatter plot with g1 on the x-axis and g2 on the y-axis, so that each dot in the graph represents one sample. This is a 2D space.
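As a minimal sketch of such a plot, the snippet below simulates expression values for 10 samples. The numbers are random placeholders, not real PCR measurements, and the data frame name expr is made up for illustration; g1 and g2 follow the text.

```r
# Simulated expression values for 10 tumor samples (placeholder data, not real measurements)
set.seed(1)
expr <- data.frame(
  g1 = rnorm(10, mean = 5, sd = 1),
  g2 = rnorm(10, mean = 3, sd = 1),
  row.names = paste0("sample", 1:10)
)

# One dot per sample: g1 on the x-axis, g2 on the y-axis
plot(expr$g1, expr$g2,
     xlab = "g1 expression", ylab = "g2 expression",
     main = "10 samples in 2D", pch = 19)
```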

This is how data is typically visualized. With three genes (gene 1, gene 2, and gene 3) the dataset is three-dimensional, and we would need several pairwise graphs to show it. Such a dataset is easier to view after reducing its dimension.
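A quick count shows why pairwise graphs stop scaling: with p genes there are choose(p, 2) distinct gene-versus-gene scatter plots. The gene counts below are arbitrary illustrative values.

```r
# Number of distinct pairwise scatter plots for a given number of genes
n_genes <- c(2, 3, 10, 100)   # illustrative gene counts
n_plots <- choose(n_genes, 2) # 1 3 45 4950

data.frame(genes = n_genes, pairwise_plots = n_plots)
```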

Now consider an 8000-sample microarray dataset: it cannot be drawn with simple graphs like these, and we run into mathematical problems when attempting to analyze the data. High-throughput technologies such as image analysis, RNA sequencing, and microarrays call for more sophisticated visualization techniques.

~Dimension reduction of data:

It is challenging to visualize and analyze higher-dimensional data. Dimension reduction techniques project (embed) the data into lower dimensions with minimal loss of relevant information. Two common dimension reduction methods in biology are PCA and t-SNE.

~Principal Component Analysis in R :

~Directory :

First, set the working directory to a folder on your desktop. The setwd() function in R sets the working directory to the specified path. In this case, my working directory is set to "C:/Users/Desktop/PCA".

The getwd() function checks the current working directory in R. When you run getwd() after running setwd("C:/Users/Desktop/PCA"), it should return "C:/Users/Desktop/PCA".
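The two calls can be sketched as follows. So that the snippet runs on any machine, a folder inside R's temporary directory stands in for "C:/Users/Desktop/PCA"; the names pca_dir and old_wd are made up for illustration.

```r
# A temporary folder stands in for "C:/Users/Desktop/PCA" (assumption for portability)
pca_dir <- file.path(tempdir(), "PCA")
dir.create(pca_dir, showWarnings = FALSE)

old_wd <- setwd(pca_dir)  # setwd() invisibly returns the previous working directory
getwd()                   # the path now ends in "PCA"
```

Keeping the previous directory in old_wd makes it easy to switch back later with setwd(old_wd).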

~Run Principal Component Analysis in R :

The script performs Principal Component Analysis (PCA) using the 'ggbiplot' package in R. PCA is a technique used for dimensionality reduction and visualization of high-dimensional data.

To use the 'ggbiplot' package, you first need to install the 'devtools' package from the R repository (CRAN) using install.packages("devtools"). After installing 'devtools', you can install the 'ggbiplot' package from GitHub using devtools::install_github("vqv/ggbiplot").
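One way to wrap these installation steps so that re-running the script does not reinstall anything is sketched below. The helper name ensure_ggbiplot is hypothetical; the install calls themselves are the ones named above, and they need an internet connection.

```r
# Hypothetical helper: install devtools and ggbiplot only if they are missing
ensure_ggbiplot <- function() {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  if (!requireNamespace("ggbiplot", quietly = TRUE)) {
    devtools::install_github("vqv/ggbiplot")
  }
  # Report (invisibly) whether ggbiplot is now available
  invisible(requireNamespace("ggbiplot", quietly = TRUE))
}
```

Call ensure_ggbiplot() once at the top of the script, then load the package with library(ggbiplot).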

The script assumes that you have already installed the required packages and loaded the 'ggbiplot' package using library(ggbiplot).

It is better to have a recent R version installed, as some packages may not work with older versions of R. If you encounter any issues running the script, check your R version first.

~Import your data in PCA :

The code below performs Principal Component Analysis (PCA) on a dataset stored in a matrix called ge.data. The first step is to extract the subset of the data that you want to analyze: ge.data[, 1:3] selects the first three columns of the original matrix, and the result is stored in a new matrix called d.

# We need to reduce the dimension from 3D to 2D to visualize the data

d <- ge.data[, 1:3]

# To perform PCA,

# use the prcomp() function

d.pca <- prcomp(d, center = TRUE, scale. = TRUE)

The next step is to perform PCA using the prcomp() function in R. PCA is a technique for dimensionality reduction that aims to identify the most important patterns in the data by projecting it onto a lower-dimensional space.

The prcomp() function takes the data matrix d as its input and performs the PCA calculation. The center = TRUE argument centers each variable to a mean of zero, and the scale. = TRUE argument scales each variable to unit variance. Note the spelling: the argument names are center and scale. (with a trailing dot), and prcomp() has no argument spelled centre. These steps are typically performed in PCA to ensure that each variable has the same influence in the analysis.
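The effect of centering and scaling can be checked on a small example. This is a sketch on random placeholder data: the layout of ge.data (samples in rows, genes in columns) and the names d.std and d.pca.manual are assumptions made for illustration.

```r
# Placeholder expression matrix: 10 samples (rows) x 5 genes (columns), random values
set.seed(42)
ge.data <- matrix(rnorm(50, mean = 8, sd = 2), nrow = 10,
                  dimnames = list(paste0("sample", 1:10), paste0("gene", 1:5)))

d <- ge.data[, 1:3]                               # keep the first three genes
d.pca <- prcomp(d, center = TRUE, scale. = TRUE)  # center and scale, then run PCA

# Standardizing by hand first gives the same component standard deviations
d.std <- scale(d)
d.pca.manual <- prcomp(d.std, center = FALSE, scale. = FALSE)
all.equal(d.pca$sdev, d.pca.manual$sdev)

# With standardized variables, the component variances add up to the number of variables
sum(d.pca$sdev^2)
```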

~Summary as an output in R:

After running the code, summary() gives output that looks like this:

summary(d.pca)

Importance of components:
                          PC1    PC2    PC3
Standard deviation     1.2686 1.0030 0.6202
Proportion of Variance 0.5364 0.3353 0.1282
Cumulative Proportion  0.5364 0.8718 1.0000

The summary() function applied to the d.pca object provides a summary of the results of the PCA analysis. The output includes information about the standard deviation of each principal component, the proportion of variance explained by each principal component, and the cumulative proportion of variance explained up to each principal component.

In this specific example, the output shows that the first principal component (PC1) has a standard deviation of 1.2686, which is the largest among the three principal components, indicating that PC1 captures the most variation in the data. PC2 and PC3 have smaller standard deviations of 1.0030 and 0.6202, respectively.

The proportion of variance explained by each principal component is also provided. In this example, PC1 explains 53.64% of the total variance in the data, PC2 explains 33.53%, and PC3 explains 12.82%. The cumulative proportion of variance explained up to each principal component is also shown, indicating that PC1 and PC2 together explain 87.18% of the variance in the data.
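The proportions follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. The short check below re-derives them from the three standard deviations reported in the summary; the variable names are made up for illustration.

```r
# Standard deviations of PC1, PC2, PC3 as reported by summary(d.pca)
sdev <- c(1.2686, 1.0030, 0.6202)

# Variance of each component as a share of the total variance
var_explained <- sdev^2 / sum(sdev^2)
round(var_explained, 4)          # 0.5364 0.3353 0.1282

# Running total across components
round(cumsum(var_explained), 4)  # 0.5364 0.8718 1.0000
```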

This information can be used to determine the optimal number of principal components to retain for further analysis or to visualize the data in a lower-dimensional space. In this case, selecting the first two principal components would explain a large proportion of the variation in the data and allow for visualization in a 2D space.
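A sketch of both steps, choosing how many components to keep and plotting the samples on the first two, is given below. It uses random placeholder data and base R's plot() rather than ggbiplot; the 0.80 threshold and the names cum_var, n_keep, and scores are arbitrary choices for illustration.

```r
# Placeholder data: 10 samples x 5 genes, PCA on the first three genes
set.seed(42)
ge.data <- matrix(rnorm(50, mean = 8, sd = 2), nrow = 10)
d.pca <- prcomp(ge.data[, 1:3], center = TRUE, scale. = TRUE)

# Smallest number of components whose cumulative variance reaches the threshold
cum_var <- cumsum(d.pca$sdev^2) / sum(d.pca$sdev^2)
n_keep <- which(cum_var >= 0.80)[1]  # 0.80 is an illustrative cutoff, not a rule

# Sample coordinates on PC1 and PC2, drawn as a 2D scatter plot
scores <- d.pca$x[, 1:2]
plot(scores[, 1], scores[, 2], xlab = "PC1", ylab = "PC2", pch = 19)
```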

Conclusion :

To sum up, this article has explained how PCA works in R and how it can be used to handle higher-dimensional data.
