Elements of R programming
R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.
It was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. The R language is an implementation of the S programming language. One of R’s strengths is the ease with which well-designed, publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. R is available as free, open-source software. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows, and macOS.
The R environment
R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. It includes
- an effective data handling and storage facility,
- a suite of operators for calculations on arrays, in particular matrices,
- a large, coherent, integrated collection of intermediate tools for data analysis,
- graphical facilities for data analysis and display either on-screen or on hardcopy, and
- a well-developed, simple, and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities.
 
Features of R / Why use R?
Univariate, Bivariate and Multivariate Analysis
Univariate Analysis
Univariate analysis examines a single variable at a time. It can be carried out in three ways:
1. Summary statistics - describe the variable's center and spread.
2. Frequency tables - show how often each value occurs.
3. Charts - give a visual representation of the distribution of values.
1. Summary Statistics
Consider a numeric vector and apply the summary statistics functions to it:
- mean() gives the arithmetic mean of the variable.
- median() gives the middle value of the data.
- range() gives the minimum and maximum of the variable.
- IQR() gives the interquartile range, the spread of the middle 50 percent of values.
- sd() gives the standard deviation, which is especially important for continuous variables.
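As a minimal sketch, assuming a small hypothetical vector of values (this example data is not from the article):

```r
# hypothetical data vector (assumed for illustration)
x <- c(5, 5, 7.5, 8, 8, 8, 10, 12, 12, 15)

mean(x)    # arithmetic mean: 9.05
median(x)  # middle value: 8
range(x)   # minimum and maximum: 5 15
IQR(x)     # spread of the middle 50 percent of values
sd(x)      # standard deviation
```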
2. Frequency Table
The term “frequency” refers to how often something occurs; the observation frequency is the number of times a value appears.
A frequency distribution table can summarise either numeric (quantitative) data or categorical (qualitative) data. The distribution provides a glimpse of the data and allows you to identify trends.
To create a frequency table for our variable, we can use the following syntax:
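The article does not show the underlying vector, so here is a hypothetical one, chosen so that base R's table() function reproduces the counts described below:

```r
# hypothetical data vector (assumed for illustration)
x <- c(5, 5, 7.5, 8, 8, 8, 10, 12, 12, 15)

# table() counts how often each distinct value occurs
table(x)
```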
We can infer the output like,
The value 5 occurs 2 times
The value 7.5 occurs 1 time
The value 8 occurs 3 times
And so on.
3. Charts
A boxplot is a graph that displays a dataset’s five-number summary.
The following are the five numbers that make up the five-number summary:
The minimum value.
The first quartile.
The median.
The third quartile.
The maximum value.
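In base R, the five-number summary can be computed directly with fivenum() (note that its hinges can differ slightly from the quartiles returned by quantile()); the vector here is a hypothetical example:

```r
x <- c(5, 5, 7.5, 8, 8, 8, 10, 12, 12, 15)  # hypothetical data

fivenum(x)  # minimum, lower hinge (Q1), median, upper hinge (Q3), maximum
```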
A histogram is a type of chart that displays frequencies using vertical bars. It is a helpful way to show the distribution of values in a dataset.
A density curve is a curve on a graph that represents the distribution of values in a dataset.
It is especially useful for viewing a distribution's “shape”: for example, whether the distribution has one or more “peaks” of frequently occurring values, and whether it is skewed to the left or right.
Each of these graphs provides a different perspective on the distribution of values for our variable.
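The three chart types above can be drawn with base R's boxplot(), hist(), and density() combined with plot(); the data vector is a hypothetical example:

```r
x <- c(5, 5, 7.5, 8, 8, 8, 10, 12, 12, 15)  # hypothetical data

boxplot(x, main = "Boxplot")        # displays the five-number summary
hist(x, main = "Histogram")         # frequencies as vertical bars
plot(density(x), main = "Density")  # smooth view of the distribution's shape
```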
Bivariate Analysis
The purpose of bivariate analysis is to understand the relationship between two variables.
1. Scatterplots
2. Correlation Coefficients
3. Simple Linear Regression
```r
# create data frame
df <- data.frame(hours = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 8),
                 score = c(75, 66, 68, 74, 78, 72, 85, 82, 90, 82, 80, 88, 85, 90, 92, 94, 94, 88, 91, 96))

# view first six rows of data frame
head(df)

  hours score
1     1    75
2     1    66
3     1    68
4     2    74
5     2    78
6     2    72
```
1. Scatterplots
```r
# create scatterplot of hours studied vs. exam score
plot(df$hours, df$score, pch = 16, col = 'steelblue',
     main = 'Hours Studied vs. Exam Score',
     xlab = 'Hours Studied', ylab = 'Exam Score')
```

2. Correlation Coefficients
A Pearson Correlation Coefficient is a way to quantify the linear relationship between two variables.
We can use the cor() function in R to calculate the Pearson Correlation Coefficient between two variables:
```r
# calculate correlation between hours studied and exam score received
cor(df$hours, df$score)

[1] 0.891306
```
This output value is close to 1, which indicates a strong positive correlation between hours studied and exam score received.
3. Simple Linear Regression
Simple linear regression is a statistical method we can use to find the equation of the line that best “fits” a dataset, which we can then use to understand the exact relationship between two variables.
We can use the lm() function in R to fit the model:
```r
# fit simple linear regression model
fit <- lm(score ~ hours, data = df)

# view summary of model
summary(fit)

Call:
lm(formula = score ~ hours, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-6.920 -3.927  1.309  1.903  9.385

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  69.0734     1.9651   35.15  < 2e-16 ***
hours         3.8471     0.4613    8.34 1.35e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.171 on 18 degrees of freedom
Multiple R-squared:  0.7944, Adjusted R-squared:  0.783
F-statistic: 69.56 on 1 and 18 DF,  p-value: 1.347e-07
```
Multivariate Analysis
Multivariate analysis examines more than two variables at once. The lattice package for statistical graphics, which ships with every distribution of R, is well suited to it: densityplot() is used to study the distribution of a numerical variable, bwplot() makes box-and-whisker plots for numerical variables, and xyplot() draws scatterplots, optionally conditioned on further variables.
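As a brief sketch of these three lattice functions, reusing the hours/score data frame from the bivariate example:

```r
library(lattice)  # ships with R as a recommended package

df <- data.frame(hours = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 8),
                 score = c(75, 66, 68, 74, 78, 72, 85, 82, 90, 82, 80, 88, 85, 90, 92, 94, 94, 88, 91, 96))

densityplot(~ score, data = df)   # distribution of a numerical variable
bwplot(~ score, data = df)        # box-and-whisker plot
xyplot(score ~ hours, data = df)  # scatterplot of score against hours
```

Each call returns a "trellis" object, which draws itself when printed, as happens automatically at the top level of an interactive session.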

