Elements of R programming
R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.
It was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. The R language is an implementation of the S programming language. One of R’s strengths is the ease with which well-designed, publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. R is available as free, open-source software. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows, and macOS.
The R environment
R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. It includes
- an effective data handling and storage facility,
- a suite of operators for calculations on arrays, in particular matrices,
- a large, coherent, integrated collection of intermediate tools for data analysis,
- graphical facilities for data analysis and display either on-screen or on hardcopy, and
- a well-developed, simple, and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities.
 
Features of R / Why use R?
Univariate, Bivariate and Multivariate Analysis
Univariate Analysis
Univariate analysis examines a single variable at a time. It can be carried out in three ways:
1. Summary statistics - describe the variable's center and spread.
2. Frequency tables - show how often each value occurs.
3. Charts - give a visual representation of the distribution of values.
1. Summary Statistics
Consider a numeric vector and apply the summary statistics functions to it:
- mean() gives the arithmetic mean of the variable.
- median() gives the middle value of the data.
- range() gives the minimum and maximum of the variable.
- IQR() gives the interquartile range, the spread of the middle 50 percent of values.
- sd() gives the standard deviation, which is especially important for continuous variables.
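As a minimal sketch, assuming a small hypothetical vector of values (this example data is not from the article):

```r
# hypothetical data vector (assumed for illustration)
x <- c(5, 5, 7.5, 8, 8, 8, 10, 12, 12, 15)

mean(x)    # arithmetic mean: 9.05
median(x)  # middle value: 8
range(x)   # minimum and maximum: 5 15
IQR(x)     # spread of the middle 50 percent of values
sd(x)      # standard deviation
```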
2. Frequency Table
The term “frequency” refers to how often something occurs; the observation frequency is the number of times a value appears.
A frequency distribution table can summarise either numeric (quantitative) data or categorical (qualitative) data. The distribution provides a glimpse of the data and allows you to identify trends.
To create a frequency table for our variable, we can use the following syntax:
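The article does not show the underlying vector, so here is a hypothetical one, chosen so that base R's table() function reproduces the counts described below:

```r
# hypothetical data vector (assumed for illustration)
x <- c(5, 5, 7.5, 8, 8, 8, 10, 12, 12, 15)

# table() counts how often each distinct value occurs
table(x)
```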
We can infer the output like,
The value 5 occurs 2 times
The value 7.5 occurs 1 time
The value 8 occurs 3 times
And so on.
3. Charts
A boxplot is a graph that displays a dataset’s five-number summary.
The following are the five numbers that make up the five-number summary:
The minimum value.
The first quartile.
The median.
The third quartile.
The maximum value.
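In base R, the five-number summary can be computed directly with fivenum() (note that its hinges can differ slightly from the quartiles returned by quantile()); the vector here is a hypothetical example:

```r
x <- c(5, 5, 7.5, 8, 8, 8, 10, 12, 12, 15)  # hypothetical data

fivenum(x)  # minimum, lower hinge (Q1), median, upper hinge (Q3), maximum
```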
A histogram is a type of chart that displays frequencies using vertical bars. It is a helpful way to show the distribution of values in a dataset.
A density curve is a curve on a graph that represents the distribution of values in a dataset.
It is especially useful for viewing a distribution's “shape”: for example, whether the distribution has one or more “peaks” of frequently occurring values, and whether it is skewed to the left or right.
Each of these graphs provides a different perspective on the distribution of values for our variable.
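The three chart types above can be drawn with base R's boxplot(), hist(), and density() combined with plot(); the data vector is a hypothetical example:

```r
x <- c(5, 5, 7.5, 8, 8, 8, 10, 12, 12, 15)  # hypothetical data

boxplot(x, main = "Boxplot")        # displays the five-number summary
hist(x, main = "Histogram")         # frequencies as vertical bars
plot(density(x), main = "Density")  # smooth view of the distribution's shape
```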
Bivariate Analysis
The purpose of bivariate analysis is to understand the relationship between two variables.
1. Scatterplots
2. Correlation Coefficients
3. Simple Linear Regression
```r
# create data frame
df <- data.frame(hours = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 8),
                 score = c(75, 66, 68, 74, 78, 72, 85, 82, 90, 82, 80, 88, 85, 90, 92, 94, 94, 88, 91, 96))

# view first six rows of data frame
head(df)

  hours score
1     1    75
2     1    66
3     1    68
4     2    74
5     2    78
6     2    72
```
1. Scatterplots
```r
# create scatterplot of hours studied vs. exam score
plot(df$hours, df$score, pch = 16, col = 'steelblue',
     main = 'Hours Studied vs. Exam Score',
     xlab = 'Hours Studied', ylab = 'Exam Score')
```

2. Correlation Coefficients
A Pearson Correlation Coefficient is a way to quantify the linear relationship between two variables.
We can use the cor() function in R to calculate the Pearson Correlation Coefficient between two variables:
```r
# calculate correlation between hours studied and exam score received
cor(df$hours, df$score)

[1] 0.891306
```
This output value is close to 1, which indicates a strong positive correlation between hours studied and exam score received.
3. Simple Linear Regression
Simple linear regression is a statistical method we can use to find the equation of the line that best “fits” a dataset, which we can then use to understand the exact relationship between two variables.
We can use the lm() function in R to fit the model:
```r
# fit simple linear regression model
fit <- lm(score ~ hours, data = df)

# view summary of model
summary(fit)

Call:
lm(formula = score ~ hours, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-6.920 -3.927  1.309  1.903  9.385

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  69.0734     1.9651   35.15  < 2e-16 ***
hours         3.8471     0.4613    8.34 1.35e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.171 on 18 degrees of freedom
Multiple R-squared:  0.7944, Adjusted R-squared:  0.783
F-statistic: 69.56 on 1 and 18 DF,  p-value: 1.347e-07
```
Multivariate Analysis
Multivariate analysis examines more than two variables at once. The lattice package for statistical graphics, which ships with every distribution of R, is well suited to it: densityplot() is used to study the distribution of a numerical variable, bwplot() makes box-and-whisker plots for numerical variables, and xyplot() draws scatterplots, optionally conditioned on further variables.
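As a brief sketch of these three lattice functions, reusing the hours/score data frame from the bivariate example:

```r
library(lattice)  # ships with R as a recommended package

df <- data.frame(hours = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 8),
                 score = c(75, 66, 68, 74, 78, 72, 85, 82, 90, 82, 80, 88, 85, 90, 92, 94, 94, 88, 91, 96))

densityplot(~ score, data = df)   # distribution of a numerical variable
bwplot(~ score, data = df)        # box-and-whisker plot
xyplot(score ~ hours, data = df)  # scatterplot of score against hours
```

Each call returns a "trellis" object, which draws itself when printed, as happens automatically at the top level of an interactive session.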

