Introduction R software

R is used for Statistical Computing. It is a free software available for various operating systems. It can be downloaded from https://www.r-project.org/. R runs on a wide variety of platforms (e.g. Linux, Windows and MacOS). Many books in Statistics available in TCD libraries show example of R code to illustrate the topics they address.

Why learn about R? Data Analysts Captivated by R s Power, New York times 2009

Some references

If needed you can find help with

videos on Youtube e.g. ... videos explaining how R is installed:

Forecasting with R

Lab Notes for Time Series / Applied Forecasting (ST3010/ST7005)

Applied Linear Statistical models with R

Multiple Regression Using lm() and glm() (Case study: Carbohydrate Diet)

In this dataset (taken from: An Introduction to Generalized Linear Models, A. J. Dobson & A. G. Barnett, 3rd edition p.96), the response variable y corresponds to the percentage of total calories obtained from complex carbohydrates for 20 male insulin-dependent diabetics who have been on a high carbohydrate diet for six months. Additional information is collected about the individuals taking part in the study including age (in years), weight (relative to ideal weight) and other calories intake from protein (as percentage).

carbohydrate=c(33,40,37,27,30,43,34,48,30,38,50,51,30,36,41,42,46,24,35,37)
age=c(33,47,49,35,46,52,62,23,32,42,31,61,63,40,50,64,56,61,48,28)
weight=c(100,92,135,144,140,101,95,101,98,105,108,85,130,127,109,107,117,100,118,102)
protein=c(14,15,18,12,15,15,14,17,15,14,17,19,19,20,15,16,18,13,18,14)

#using lm
res.lm=lm(carbohydrate~age+weight+protein)
summary(res.lm)

#using glm
res.glm=glm(carbohydrate~age+weight+protein,family=gaussian)
summary(res.glm)

Computation of Least square estimate using Linear algebra in R (Case study: Carbohydrate Diet)

In this dataset (taken from: An Introduction to Generalized Linear Models, A. J. Dobson & A. G. Barnett, 3rd edition p.96), the response variable y corresponds to the percentage of total calories obtained from complex carbohydrates for 20 male insulin-dependent diabetics who have been on a high carbohydrate diet for six months. Additional information is collected about the individuals taking part in the study including age (in years), weight (relative to ideal weight) and other calories intake from protein (as percentage).

This code used linear algebra to recover some of the results found by functions lm() and glm() used in the previous example (same dataset).

carbohydrate=c(33,40,37,27,30,43,34,48,30,38,50,51,30,36,41,42,46,24,35,37) # response vector
age=c(33,47,49,35,46,52,62,23,32,42,31,61,63,40,50,64,56,61,48,28)
weight=c(100,92,135,144,140,101,95,101,98,105,108,85,130,127,109,107,117,100,118,102)
protein=c(14,15,18,12,15,15,14,17,15,14,17,19,19,20,15,16,18,13,18,14)

#
X=matrix(1,20,4)
X[,1]=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
X[,2]=age
X[,3]=weight
X[,4]=protein

# least square estimate
BetaHat=solve(t(X)%*%X)%*%t(X)%*%carbohydrate

fitted =X %*% BetaHat
residuals=carbohydrate-fitted
SSE=t(residuals)%*%residuals
sigmahatsq=SSE/(length(carbohydrate)-length(BetaHat))
#Residual standard error
sqrt(sigmahatsq)

# uncertainty of BetaHat (standard error)
CovBetaHat=solve(t(X)%*%X)*as.numeric(sigmahatsq)
Std.Error.BetaHat=sqrt(diag(CovBetaHat))

#Multiple R-squared: (coefficient of determination)
S0=sum((carbohydrate-mean(carbohydrate))*(carbohydrate-mean(carbohydrate)))
Rsq=(S0-SSE)/S0

Logistic Regression with Bernouilli distribution (Case Study Surgery)

The dataset is explained in this pdf file: Case Study Surgery.

graphics.off()
rm(list = ls(all = TRUE))

# data

patientnumber=(1:40)
y=0*patientnumber
y[c(14,15,19, 22,23,24,25,27,30,32,34,35,37,39)]=1

Age=c(50,50,51,51,53,54,54,54,55,55,56,56,56,57,57,57,57,58,59,60,61,61,61,62,62,62,62,63,63,63,64,64,65,67,67,68,68,69,70,71)
Age2<-Age-mean(Age)

## Logistic Regression with Bernouilli ( Bernouilli special case of Binomial distribution)
data.mat=cbind(y,1-y)

Survival Analysis (Case Study Remission)

rm(list = ls(all = TRUE))
require(MASS)
require(survival)
data(gehan)

temp=as.numeric(gehan\$treat)
# gehan\$treat encodes 2 control, and 1 for 6 MP
# changing the explanatory variable
x=2-temp # 0 control, 1 6-MP

# exponential distribution: lambda=1/scale=1
res1<-survreg(Surv(gehan\$time,gehan\$cens)~x,dist="weibull",scale=1)
# equivalent to
res2<-survreg(Surv(gehan\$time,gehan\$cens)~x,dist="exponential")

# Weibull distribution: the scale 1/lambda is estimated as well
res3<-survreg(Surv(gehan\$time,gehan\$cens)~x,dist="weibull")

extractAIC(res1)
extractAIC(res2)
extractAIC(res3)

Poisson Regression (Case Study Insurance Claims)

graphics.off()
rm(list = ls(all = TRUE))

## Example  Insurance dataset
data ( Insurance , package = "MASS" )

## Info on dataset with: ? MASS::Insurance

## Poisson Regression
result.poisson= glm( Claims ~ District + Group + Age + offset ( log ( Holders ) ) , data = Insurance , family = poisson )

Poisson and Binomial Regression (Case Study Beetles)

The explanatory variable represents the dose level x of toxic substance given to a group having n beetles where the response variable, y , is the number of beetles that died consequently.

graphics.off()
rm(list = ls(all = TRUE))

#####################################
##   Beetle data
#####################################

y=c(6,13,18,28,52,53,61,60)  ## responses stored in one vector
n=c(59,60,62,56,63,59,62,60)  ## exposed populations stored in one vector
x=c(1.6907,1.7242,1.7552,1.7842,1.8113,1.8369,1.8610,1.8839) ## doses stored in one vector
n_y=n-y  ## difference between vectors
beetle.mat=cbind(y,n_y)  ## create a matrix with two columns: y and n-y

#####################################
##   Binomial distribution + logit link function
## function glm is used and the output is stored in "result.binomial.logit"
#####################################

#####################################
##   Binomial distribution + probit link function
#####################################

#####################################
##   Poisson distribution + log link function
##
#####################################

## use 'summary' function to see information about outputs

#####################################
## Visualisation
#####################################
windows(5,5)
plot(x,result.binomial.logit\$fitted.values,type="b",col="red",lwd=2,main="Case study: Beetles (red: Binomial+logit, green: Binomial+probit, blue: Poisson+log, black: saturated solutions)",xlab="drug dose",ylim=c(0,1),ylab="Proportion theta of dead beetles")
par(new=TRUE,ann=FALSE)
plot(x,result.binomial.probit\$fitted.values,type="b",ylim=c(0,1),col="green",lwd=2)
par(new=TRUE,ann=FALSE)
plot(x,result.poisson.log\$fitted.values/n,type="b",ylim=c(0,1),col="blue",lwd=2)
par(new=TRUE,ann=FALSE)
plot(x,y/n,type="p",col="black",lwd=3,ylim=c(0,1))

Stochastic processes in space and time with R

Kriging with GSTAT package

1. The meuse data set: a brief tutorial for the gstat R package, Edzer Pebesma, 2015 with corresponding R code for convenience. Read the tutorial proposed by E. Pebesma (doc 2) up to section 7.
2. A Practical Guide to Geostatistical Mapping by Tomislav Hengl 2009. Read overviews of R and Google Earth in Hengl's book (e.g. section 3.2 pages 72 to 77, and section 3.3 pages 78 to 87)
3. meuse_lead.kml: Meuse dataset locations data in KML format to visualise with Google earth
4. Other: QGIS: A Free and Open Source Geographic Information System

Periodogram in R: example Brightness of a star

install.packages("TSA")
require(TSA)

#load 'star' time series
data(star)

# ?star in the console to get help
#Brightness (magnitude) of a particular
#star at midnight on 600 consecutive nights

plot(star,xlab='Day',ylab='Brightness')

#periodogram
spectrum(star)

spectrum(star)\$freq

spectrum(star)\$spec

Finding period(s) in noisy time series In R we create a time series as follow:

N<-1000

t<-seq(1,N)

Noise <- rnorm(N)

PureSine <- 3*sin(2*pi*t/5)

NoisySine <- PureSine+Noise

freq<- t/N

Look at the help file associated with any function you are not familiar with.
1. Visualise the time plot of these time series (Noise, PureSine, NoisySine ) e.g.

windows(5,5)

plot(t, Noise,type="l",main="time plot : Noise ")

Do you notice any obvious seasonal behaviour in these time series?
2. Visualise the Power Spectrum of these of these time series (Noise, PureSine, NoisySine ) e.g.

windows(5,5)

plot(freq,Mod(fft(NoisySine)),type="l",main="DFT: NoisySine")

Can you see any seasonal pattern now in the Fourier domain? At what frequencies? Does it make sense (considering the definition of PureSine)? Comment on the Power Spectrum of the Noise.
3. Create a Noise with higher standard deviation (see rnorm help file). What is the impact on the power Spectrum?

Run and understand the following code (i.e. copy the code in a R script and insert comment lines) :

with(cars, {

plot(speed, dist)

lines(ksmooth(speed, dist, "normal", bandwidth = 2), col = 2)

lines(ksmooth(speed, dist, "normal", bandwidth = 5), col = 3)

})

Comment on the effect of changing the bandwidth. Change the kernel and comment.

Regression on basis of functions Install and load in memory the fda package (and its dependencies). Run and understand the following code (i.e. copy the code in a R script and insert comment lines):

gaitbasis3 <- create.fourier.basis(nbasis=5)

gaitfd3 <- Data2fd(gait[,1,1], basisobj=gaitbasis3,nderiv=0)

plotfit.fd(gait[,1,1], seq(0,1,len=20), gaitfd3)

Plot the basis of functions used for regression. Increase the number of functions used for regression. Change the Fourier basis (e.g. to bspline).