Sunday, August 30, 2015

8.28 Coursera Regression Models Project Comments

Apparently, I don't think I can get full credit on this project, because the uncertainty is not fully explained by giving confidence intervals. However, these comments do help me improve, and they remind me to keep learning regression models.
 
peer 1 → I believe the student did a good job in selecting the key predictors. Also the student provided a good explanation of how the variables were selected and how the final model was derived. I believe the student left out some key steps, but I can see he did not want to have more than six pages.
The last sentence makes me laugh. But I need to learn how to create a PDF within page limits. My way of compiling PDFs is not the best.
 
peer 2 → Strengths: Nice structure and analysis. I came up with the same conclusions via slightly different methods. Weaknesses: It doesn't appear that you used knitr to generate this PDF. The code appears to be more of a word document format rather than output of a ".Rmd" processed by knitr into a PDF as required. Additionally, you omitted a bunch of steps on how you arrived at your optimal model.
I should include key steps, and I probably need to take "Reproducible Research" to learn how to make PDFs using knitr.
 
peer 3 → I think the final model suffered a little by focusing on r^2 and the p-values. I think that testing the assumptions behind linear regression should also be considered because the final model has a few violations with unequal variance of the error term and a non-normal distribution. A really good thing I saw is that the final model doesn't appear to have any issues with multi-collinearity, which was a big issue not addressed in the other 3 projects I saw. 
These are the most specific comments of the three. They point out some weaknesses of my strategy, and I believe this person has more knowledge of math than I do.

8.28 Coursera Exploratory Analysis Project Comments

Motor vehicles include more than just motorcycles. As a data scientist, one should always try to answer the question correctly.
 
peer 1 → I see no issues. Full marks given.
peer 2 → Nice use of colouring and graphics in general. However, the student has assumed that "motor vehicles" are the same as "motor cycles", and this is not the case, so I subtracted 1 point because of this.
peer 3 → Overall, work is good, with the exception that the two plots on motor-vehicle sources use only motorcycle data. Though discussion on the forums made it clear that there could be several interpretations of motor vehicles, I think it is quite clear that more than just motorcycles should be included, especially since the motorcycle-only emissions represent a tiny fraction of the total motor vehicle emissions (if defined as on-road emissions). The code does appear to produce the plots, so I gave credit for that, but deducted points on the plots since they don't tell enough of the story to meaningfully address the motor vehicle question in my opinion.
peer 4 → [This area was left blank by the evaluator.]

Saturday, August 29, 2015

8.28 Why everything crams up again

I need to archive everything. My student laptop, my email account: there are so many files!!

Life is treating me frreeaking well.

8.28 Coursera Regression Models Quiz 2 Question 9

Question 9 states the following:

Refer back to the mtcars data set with mpg as an outcome and weight (wt) as the predictor. About what is the ratio of the sum of the squared errors, $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$, when comparing a model with just an intercept (denominator) to the model with the intercept and slope (numerator)?

I am confused about the term numerator and denominator.

Michele Coleman's comment:

Yukai: the use of "numerator" and "denominator" are very specific to this question and not applicable to regression models in general.

In this question, the word "denominator" refers to the phrase "a model with just an intercept" and the word "numerator" refers to the phrase "the model with the intercept and slope".

The question is asking you to compare two different models by calculating the ratio of the sum of squared errors. You are asked to find A / B, where A = the sum of squared errors for model 1. B = the sum of squared errors for model 2, where model 1 has a slope and an intercept and model 2 has just an intercept.

A/B 
= (the sum of squared errors for model 1) / (the sum of squared errors for model 2)

= (the sum of squared errors for the model with an intercept and slope) / (the sum of squared errors for the model with just an intercept)
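To make the wording concrete, the ratio can be computed directly in R (a quick sketch; the variable names are mine):

```r
# Compare the sums of squared errors of the two models on mtcars
data(mtcars)

fit_slope     <- lm(mpg ~ wt, data = mtcars)  # intercept and slope (numerator)
fit_intercept <- lm(mpg ~ 1,  data = mtcars)  # intercept only (denominator)

sse_slope     <- sum(resid(fit_slope)^2)
sse_intercept <- sum(resid(fit_intercept)^2)

sse_slope / sse_intercept  # about 0.25
```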

Thursday, August 20, 2015

8.20 Some posts on the FSL forum (recommended by Donna)

August 20:
AW: [FSL] AW: [FSL] FDR correction in TBSS
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1106&L=fsl&P=R40394&1=fsl&9=A&I=-3&J=on&d=No+Match%3BMatch%3BMatches&z=4

Re: TFCE-randomise and number of permutations
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind0803&L=fsl&P=R41695&1=fsl&9=A&I=-3&J=on&d=No+Match%3BMatch%3BMatches&z=4

August 18:
Re: fslstats vs fslmeants
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1411&L=fsl&P=R16798&1=fsl&9=A&I=-3&J=on&d=No+Match%3BMatch%3BMatches&z=4

Re: Extracting the actual FA values from TBSS results
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1408&L=fsl&P=R80535&1=fsl&9=A&I=-3&J=on&d=No+Match%3BMatch%3BMatches&z=4

*** Re: how to get FA value from tbss fslmeants (and labels)
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1508&L=fsl&P=R78969&1=fsl&9=A&I=-3&J=on&d=No+Match%3BMatch%3BMatches&z=4

Re: how to calculate FA values (** inside of a mask)
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1301&L=fsl&P=R145712&1=fsl&9=A&I=-3&J=on&K=4&d=No+Match%3BMatch%3BMatches&z=4

** How to visualize JHU labels in FSLVIEW and how to 'merge' with TBSS skeleton
AW: [FSL] Calculating the skeleton white matter mean FA
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1411&L=fsl&P=R60755&1=fsl&9=A&I=-3&J=on&K=2&d=No+Match%3BMatch%3BMatches&z=4

Re: FA values for each voxel
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=fsl;e892de49.1508

August 17:
Re: how to extract FA and MD values after running TBSS
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1308&L=fsl&P=R8552&1=fsl&9=A&J=on&d=No+Match%3BMatch%3BMatches&z=4

Re: about TBSS results
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1110&L=fsl&P=R34948&1=fsl&9=A&J=on&d=No+Match%3BMatch%3BMatches&z=4


Re: Display TBSS results
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1010&L=fsl&P=R23801&1=fsl&9=A&J=on&d=No+Match%3BMatch%3BMatches&z=4

Calling an atlas in FSL: http://brainder.org/2012/07/30/automatic-atlas-queries-in-fsl/

Wednesday, August 19, 2015

8.19 Reshape Matrix in R

http://www.sixhat.net/how-to-reshape-data-in-r.html

Basically, use the matrix() call, specifying the dimensions you want, and the output will simply be reshaped!

> a <- c(1, 2, 3, 4, 5, 6)
> matrix(a, 2, 3)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> matrix(a)
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5
[6,]    6

Sunday, August 16, 2015

8.16 Coursera - Multiple R-squared and Adjusted R-squared

This is part of a discussion in the Regression Models course forum. I will paste the response here in case the forum is closed once the class finishes.

Michele's response:

Multiple R-squared  (sometimes just called R-squared) *always* will go up when you add more variables! The model with the added variables will better match the training data. The multiple R-squared is a measure of how good the fit is on the training data. But probably your training data has some spurious signals and coincidences that are not true features of the underlying population of interest.

Since multiple R-squared always gets higher when you include a new variable you cannot use it to decide if including that new variable is a good idea. If the new variable actually isn't related to your outcome you are just adding noise and "overfitting" your model.

The adjusted R-squared attempts to compensate for this by adding a "penalty" for including the extra variable. If the adjusted R-squared goes down when you include a new variable that suggests that it is not valuable to keep that variable in your model.

I am not a statistics expert so I will not try to explain how the adjusted R-square accomplishes this magic. My explanation would probably be not wholly correct.
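Michele's point is easy to check in R: add a pure-noise predictor to a model and watch what happens to each statistic (a sketch; the noise column is made up by construction):

```r
# Multiple R-squared never decreases when a predictor is added,
# even if the predictor is pure noise; adjusted R-squared penalizes it.
set.seed(1)
dat <- mtcars
dat$noise <- rnorm(nrow(dat))   # unrelated to mpg by construction

fit_small <- lm(mpg ~ wt, data = dat)
fit_big   <- lm(mpg ~ wt + noise, data = dat)

summary(fit_small)$r.squared      # baseline fit
summary(fit_big)$r.squared        # never lower than the baseline
summary(fit_small)$adj.r.squared
summary(fit_big)$adj.r.squared    # usually lower: the noise variable "costs" a penalty
```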

Thursday, August 6, 2015

8.6 Programming is such an essential working skill

Since learning MATLAB, and later R, my ability to work efficiently has greatly strengthened. But I believe that what I have used is just a tiny portion of it. The true power is, well, I don't know yet.

The core of programming is to identify patterns that are repeatable. Though everything differs a little from everything else, by forming assumptions, certain things can be grouped together. So start with a simple model first, then add extra stuff to make it more flexible and specific.

By learning to write R code and make it executable from the command line over ssh, I should be able to modify my previous code to design the matrix and generate summaries with it.

I should also code up the part where I type in a bunch of cases to put them in the mytbss folder. What I need, in essence, is just a list of scancodes. I don't have to go back and forth editing the directory every time. Just using the list of scancodes, with the task specified (e.g. -m cp) and the expected rename pattern (-n s/ns/whatever), I can get a txt file with all the cases to be copied; then all I need to do is copy and paste, and there we go!

So my focus should not be spending too much time on that part. With the help of code, I will be able to focus on more important stuff.

# Can't wait. I just draft it here.
tbsscp <- function(scancode, group) {
        # Build one filename per scancode, prefixed with the group label
        paste(group, scancode, ".nii", sep = "")
}

scancode <- c("CCS001-1a11", "CCS002-1a22", "CCS003-1a33")
group <- "ns"

# Refresh: http://stackoverflow.com/questions/2470248/write-lines-of-text-to-a-file-in-r
tbsscplist <- file("tbsscplisttest.txt")
writeLines(tbsscp(scancode, group), tbsscplist)
close(tbsscplist)

8.6 I should probably start thinking about investing fitness

Standard Membership
• Access to Mission Bay or Parnassus fitness center
• Workout towel service
• Discounted parking: early morning, evening, and weekends
• Exclusive member activities
• Members-only pricing on massage, personal training, swim lessons, and other programs
• 10% discount at Sports Basement

Enrollment Fee: $50
Mission Bay: $52/month
Parnassus: $49/month
- See more at: http://www.campuslifeservices.ucsf.edu/fitnessrecreation/information/program_index/membership_ucsf_staff#sthash.NChIqh6O.dpuf

Oops. They thought I wanted some more information about the standard membership. I should have replied using my email address, haha.

8.6 Coursera Biostat Quiz 2 Question 6

Let X be a uniform random variable with support of length 1, but where we don't know where it starts, so that the density is f(x) = 1 for x ∈ (θ, θ + 1) and 0 otherwise. We observe a random variable from this distribution, say x1. What does the likelihood look like?
  • A horizontal line between x1 and x1 - 1
  • A parabola
  • A diagonal line from x1 to x1 + 1
  • A point at x1
Thoughts after getting the right answer: doing some coding and plotting helps so much in making the concept intuitive. The point is that the likelihood function is just another way to look at the probability density function: fix the variable, then see how the function changes as theta (just another parameter) changes.

Discussion Forum posts:
Eric:
You don't need F(x). I think of the likelihood function L(θ) as a probability density function for the parameter θ given the data X = {x1, x2, …}.

Often, you can use Bayes' theorem to go from f(x|θ) to f(θ|x) = L(θ). (still hard to see how Bayes' theorem relates to this content) For this problem though, I think it just takes wrapping your head around the likelihood function concept. Only one of the multiple choice answers makes sense.

I'll post the answer explanation on Monday, after the deadline. 
 


Michelle:
You don't need $$F(x)$$. You can think of the ordinary density function $$f(x)$$ as actually a function of $$f(x,\theta)$$ (This is perfect, to look at the function with an extra parameter). Usually we treat $$\theta$$ as known and then ask what the density is at different values of $$x$$. A normal density plot has $$x$$ on the horizontal axis and $$f(x,\theta)$$ on the vertical axis, for a single, constant value of $$\theta$$ .

For this problem instead think of the plot with $$\theta$$ on the horizontal axis and $$f(x,\theta)$$ on the vertical axis, for a single constant value of $$x$$.

My approach was that I just assigned a value to x, such as $$x=2.5$$. Then I made a table with different values of $$\theta$$ in one column and then $$f(x,\theta)$$ in another column. I calculated half a dozen values, then made my plot of $$f(x,\theta)$$ vs. $$\theta$$.
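Michelle's table approach is easy to reproduce in R (a sketch; x1 = 2.5 and the theta grid are my own choices):

```r
# Density of Uniform(theta, theta + 1), viewed as a function of theta for fixed x
f <- function(x, theta) ifelse(theta < x & x < theta + 1, 1, 0)

x1    <- 2.5
theta <- seq(0, 4, by = 0.25)
cbind(theta, likelihood = f(x1, theta))
# Nonzero (and flat at 1) exactly for theta in (x1 - 1, x1):
# a horizontal line between x1 - 1 and x1.
```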

Tuesday, August 4, 2015

8.4 Interweave in dataframe in R

# Interweave in dataframe
# Useful Link: http://stackoverflow.com/questions/14784861/interweave-two-data-frames-in-r

a = data.frame(x=1:5, y=5:1)
b = data.frame(x=2:6, y=4:0)

# Interweave rows
df <- rbind(data.frame(a, index = 1:nrow(a)), data.frame(b, index = 1:nrow(b)))
df <- df[order(df$index), c("x", "y")]

# Interweave columns
dfc <- cbind(rbind(a, 1:ncol(a)), rbind(b, 1:ncol(b)))
dfc[order(dfc[6,])]

# Try: interweave columns using names of the columns
dfc[order(names(dfc))] # This is faster and single line call!

8.4 Compare two dataframes + Only focus on the differences

# Find elements that are not identical in two dataframes
# Useful link: http://stackoverflow.com/questions/3171426/compare-two-data-frames-to-find-the-rows-in-data-frame-1-that-are-not-present-in

library(xlsx)
library(proto)
library(sqldf)
library(compare)

x <- c(1,2,3,4,5)
y1 <- c(1,2,3,4,5)
y2 <- c(1,2,3,4,6)

d1 <- data.frame(x, y1)
d2 <- data.frame(x, y2)

compare(d1,d2)

# SQLDF solution
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

require(sqldf)

a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')

a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')

# DPLYR solution
library(dplyr)

anti_join(a1,a2)
semi_join(a1,a2)
full_join(a1,a2)

(8.6 update) To focus only on the columns containing different elements, instead of viewing a bunch of identical columns (such as age and education, which you know will all be the same and are unlikely to differ), the trick is to add an extra row below the original dataframe holding a logical value that indicates whether the two columns are identical. The identical() function in R can do that. Then it is just a matter of dataframe subsetting.

Pretty cool. Huh?
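A minimal sketch of that trick, with hypothetical toy dataframes (in practice d1 and d2 would be the real data): compute one logical per column with identical(), then subset to the columns that differ.

```r
d1 <- data.frame(age = c(30, 41), edu = c(16, 12), score = c(88, 75))
d2 <- data.frame(age = c(30, 41), edu = c(16, 12), score = c(90, 75))

# One logical per column: TRUE if the column is the same in both dataframes
same <- sapply(names(d1), function(n) identical(d1[[n]], d2[[n]]))

d1[, !same, drop = FALSE]  # only the columns that actually differ
d2[, !same, drop = FALSE]
```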

Monday, August 3, 2015

8.3 Writing R that can be run in UNIX command lines

#! /usr/bin/env Rscript
# Useful website: http://stackoverflow.com/questions/2151212/how-can-i-read-command-line-parameters-from-an-r-script
options(echo=FALSE) # if you don't want to see commands in output file
args <- commandArgs(trailingOnly = TRUE)
# trailingOnly=TRUE means that only your arguments are returned, check:
# print(commandArgs(trailingOnly=FALSE))

cat("Hello", args, "\n")

Return in UNIX:
>YukaideMacBook-Air:Desktop yukaizou$ ./script.R Kai

>Hello Kai 

Alternatively,
>YukaideMacBook-Air:Desktop yukaizou$ Rscript script.R Kai
>Hello Kai 

Another useful website here: http://www.cureffi.org/2014/01/15/running-r-batch-mode-linux/
It also shows how to use the optparse package to create options. That guy is a genius. I might follow the blog for some other materials.

# 8.6 update: learning how to add options
library(getopt)
library(optparse)

options(echo=FALSE) # if you don't want to see commands in output file
args <- commandArgs(trailingOnly = TRUE)
# trailingOnly=TRUE means that only your arguments are returned, check:
# print(commandArgs(trailingOnly=FALSE))

option_list <- list(make_option(c("-a", "--avar"), action = "store", default = NA, type = "character",
                                help = "just a variable name"),
                    make_option(c("-v", "--verbose"), action = "store_true", default = TRUE,
                                help = "Should the program print the extra stuff out? [default %default]"),
                    make_option(c("-q", "--quiet"), action = "store_false", dest = "verbose",
                                help = "Make the program not verbose." )
                    )
opt <- parse_args(OptionParser(option_list = option_list))

if (opt$verbose) {
        cat("Haha, you might think this is verbose. Quiet me by adding a little -q!", "\n")
}

if (!is.na(opt$avar) ) {
        cat("You have just typed in:", "\n")
        cat(opt$avar, "\n")
} else {
        cat("Please specify variable a", "\n")
}

8.3 Such a low efficiency

It's August already.

Such low efficiency now. It's 3 p.m. What am I doing at the VA? Waiting for a bunch of cases to be archived... No motivation to do other things.

Done with Quiz 1, Quiz 2, and Project 1 of Exploratory Data Analysis; still working on Project 2, which is ~40% complete. I'll have more time to work on Regression Models and Biostat.