https://researchtraining.nih.gov/programs/career-development
If I can do this during my PhD, that would be good; it is better to get familiar with the process early.
Wednesday, July 6, 2016
Wednesday, June 29, 2016
6.29 Reminder on calculating degrees of freedom
http://ron.dotsch.org/degrees-of-freedom/
When reporting the F statistics, follow this format: F(df1, df2) = ...
- df1: # of levels of the factor - 1
- df2: residual; # of subjects - # of levels
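A minimal sketch in R (made-up data; the group and subject counts are hypothetical):
set.seed(1)
dat <- data.frame(gp = factor(rep(c("A", "B", "C"), each = 10)),
                  y  = rnorm(30))
summary(aov(y ~ gp, data = dat))  # one-way ANOVA: 3 levels, 30 subjects
# df1 = 3 - 1 = 2; df2 = 30 - 3 = 27
# Report as F(2, 27) = <F value>, p = <p-value>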
Thursday, June 23, 2016
6.23 Paired t-test versus ANCOVA
A paired t-test compares the difference between two means, so it does less than ANCOVA. ANCOVA can compare differences between means while controlling for covariates, and its fitted coefficients can be used to make predictions.
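A minimal sketch in R (made-up data) contrasting the two, using one common ANCOVA-style formulation on the paired differences:
set.seed(2)
n    <- 20
pre  <- rnorm(n, mean = 50, sd = 10)
post <- pre + rnorm(n, mean = 2, sd = 5)
age  <- sample(20:60, n, replace = TRUE)

# Paired t-test: only tests whether the mean difference is zero
t.test(post, pre, paired = TRUE)

# ANCOVA-style model: tests the mean difference (the intercept) while
# controlling for a covariate, and the coefficients can be used to predict
fit <- lm(I(post - pre) ~ age)
summary(fit)
predict(fit, newdata = data.frame(age = 40))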
6.23 On choosing a model
Always think about the question one needs to address; a different model may no longer fit that question.
Including a covariate does not mean it has to stay in the model forever. If it is not significant, it can be removed and the raw values reported (which is statistically equivalent to keeping it in the model).
Monday, June 20, 2016
6.20 Can we study brains cultivated from stem cells?
I was excited to read Dr. Guy McKhann's article on how iPSCs can be differentiated and developed into brain cells, which can inspire many questions, such as studying the Zika virus.
The most exciting part of this research involves creating "mini-brains." These mini-brains, about the size of the head of a pin, can be used to study the development of the human brain and how development is altered by a virus. Hopkins researchers are currently using mini-brains to study the Zika virus. This study is just the tip of the iceberg.
6.20 Better organizing skills are urgently needed
Inevitably, I write scraps of draft paper every day during daily work. If not promptly archived, they start flying all over my desk, and I simply can't get a clear mind out of these residues. Keeping the desk clean helps me keep my mind sharp.
There are usually multiple steps in the image-processing pipelines, so when I run a pipeline I need to remind myself of where I am.
Also, there are sometimes short-hand calculations, for example the adjusted p-value after multiple-comparison correction.
It also happens when I need to construct a big table whose format I should follow so that the statistical analysis runs smoothly.
There are also printed reports with figures and tables to be discussed with the PI.
It's also important to keep several notebooks: one for daily recording, one for technical details, etc.
Thursday, June 16, 2016
6.16 Notes of Statistics
- It is not appropriate to use the standard Bonferroni correction for the non-a-priori regions because, as always, our neuroimaging dependent measures will be at least moderately intercorrelated. The standard Bonferroni assumes orthogonality (independence), which is not the case for our FA measures. It would be much better to use the modified Bonferroni method that I've used forever (see Sankoh's paper), or FDR.
- Given that there are multiple ways to modify the Bonferroni, which seems like a black box to me, just use FDR for now (see the sketch after this list).
- I used Spearman's rho to investigate associations between smoking and drinking measures and FA in all groups. All of these associations must be adjusted for age because of the association between age and FA. It does not matter whether the groups differ in age; we want to know if the associations are significant after adjusting for the influence of age. Additionally, lifetime years of smoking is related to age, so age absolutely must be used as a covariate. You cannot use covariates with the standard Spearman method, so these analyses must be repeated with linear regression, using age as a covariate (also sketched below).
- In SPSS, there's something called part correlation (i.e., semi-partial correlation, which is different from partial correlation).
- Maybe scores of zero were assigned to non-smokers for lifetime years of smoking. If this did happen, it is a fatal design flaw. You can't assign a score of zero to someone who does not have the behavior, i.e., a history of smoking. A score of zero is meaningless and creates a "zero clumping" issue that will absolutely lead to spurious results for simple correlations or linear regression.
- E.g., non-smokers can't have "0" for lifetime years of smoking, which would treat them as smokers with no smoking history. This confuses the group design and messes up the analysis.
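A minimal sketch in R of two of these points; all data and variable names below are made up:
set.seed(3)
n         <- 40
Age       <- sample(25:65, n, replace = TRUE)
smk_years <- 0.8 * (Age - 25) + runif(n, 0, 10)   # smoking years track age
FA        <- 0.6 - 0.001 * Age + rnorm(n, 0, 0.02)

# Spearman's rho cannot be adjusted for covariates
cor.test(FA, smk_years, method = "spearman")

# Linear regression gives the smoking-FA association adjusted for age
summary(lm(FA ~ smk_years + Age))

# FDR (Benjamini-Hochberg) instead of standard Bonferroni for a set of
# intercorrelated tests
p <- c(0.001, 0.008, 0.020, 0.041, 0.300)  # hypothetical p-values
p.adjust(p, method = "BH")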
Monday, June 6, 2016
6.6 Reasons I unenrolled from ML Classification
I already finished the case-study course and the regression course and got an idea of what machine learning is about. When I have time, I can always come back and review the material I've learned, or pick up new knowledge without starting from nothing. The regression course is of particular importance, as I will learn most of the material again during my PhD.
I cannot have the quizzes graded unless I pay Coursera $79 to upgrade to a verified certificate. It is pointless to just watch the videos and do ungraded quizzes, and it is unfair to make learners who do not wish to pay for a verified certificate pay just to have quizzes graded. This is total nonsense.
I have more tasks on my plate: more images to process, more data to analyze, and a manuscript to work on. I am simply running out of time. The classification course won't finish until the end of July, when I will already be on a road trip. Either I study while traveling (which is weird), or I finish the course earlier to make time for the trip, which adds stress.
At this stage, completing the first two courses is sufficient for me to pick up Python. I am thinking about strengthening my understanding of data structures and algorithms so as to write better code; I should think about the best strategy to fit that in. But right now I am deeply disappointed with Coursera: the company tried to squeeze out money rather than benefit the learning community.
Monday, May 16, 2016
Tuesday, May 10, 2016
5.10 A note to myself
(May 2nd, 2016 posted on Facebook)
Thanks for everyone’s birthday wishes!
Today turned out to be a day of hard work. Although I still have a lot to work on, I am glad to see my first scientific manuscript getting better every single day.
After work, with a little spirit of celebration, I got myself a new pair of Stan Smiths, a S'mores Frappuccino, and a wonderful dinner in SOMA at the Kimchi Burrito place while watching the Spurs game.
I want to remind myself that in 2015, I wrote down several TODOs:
- Learn R programming
I am so glad that I've been picking up programming skills, which helped me accomplish my master's thesis and have now become my primary tool for daily work. Every day I learn a bit and immediately apply it to problem solving, which gives me a full sense of accomplishment.
- Keep traveling
I spent a wonderful year traveling around the US and China, as well as Okinawa for a PhD interview. Nothing is more enjoyable than reuniting with old friends and family, and meeting new friends. I will certainly keep the spirit going!
- Keep reading
I am not reading all that much; tons of fragmented daily readings don't count. Two of my current interests are “How to Be a Modern Scientist” by Jeff Leek and “The Art of Data Science” by Roger Peng; both authors are instructors of Johns Hopkins University's Data Science Specialization on Coursera. I will try reading things that are seemingly irrelevant but genuinely inspiring.
- Keep being grateful for life
Living in a different culture and country is certainly not easy. Though I did experience down moments and hard times, I always had support of all kinds that helped me get through. Life is certainly wonderful in both its explainable truths and its unexplainable beauties.
- Make more jokes about life
Have I? I will keep making more!
- Try not to get honked at by SF drivers
I've become one of them?!
At the age of 26, I want to:
- Get my first vehicle and get my kicks on Route 66
- Keep the positive attitude
- Learn machine learning with Python
- Not be discouraged by the ongoing challenges
- Try my best to be an excellent PhD student
#StillOnMyWay
Friday, May 6, 2016
5.6 Stats Questions in R
A reminder of these recent conversations about learning statistics through R.
Thanks Tim, your comments are very helpful.
1) The reason I excluded the intercept (i.e., -1) is only for the convenience of extracting the coefficients: without the intercept, lm() returns a separate estimate (the Estimate column) for each group. I did include the intercept (i.e., no -1) when running the actual analyses. Taking the intercept in or out only changes the table output; it does not affect any result of the analyses.
2) Dieter - Tim was correct. These two commands in fact do the same thing in R: both give the main effects in addition to the interaction term.
>lm(FA_308_27 ~ gp*smoker + Age - 1, data = dat3.2)
Estimate Std. Error t value Pr(>|t|)
gp1mALC 0.610268207 0.0425865499 14.330069 4.398040e-22
gp1wkALC 0.603016628 0.0469235498 12.851045 9.528076e-20
smokers 0.028498249 0.0194763982 1.463220 1.480830e-01
Age -0.001429693 0.0007651574 -1.868495 6.606585e-02
gp1wkALC:smokers -0.040359689 0.0368267023 -1.095935 2.770308e-01
>lm(FA_308_27 ~ gp + smoker + gp*smoker + Age - 1, data = dat3.2)
Estimate Std. Error t value Pr(>|t|)
gp1mALC 0.610268207 0.0425865499 14.330069 4.398040e-22
gp1wkALC 0.603016628 0.0469235498 12.851045 9.528076e-20
smokers 0.028498249 0.0194763982 1.463220 1.480830e-01
Age -0.001429693 0.0007651574 -1.868495 6.606585e-02
gp1wkALC:smokers -0.040359689 0.0368267023 -1.095935 2.770308e-01
3) I have read about pairwise comparisons in R (http://www.r-bloggers.com/r-tutorial-series-two-way-anova-with-pairwise-comparisons/) and two-way ANOVA with interactions and simple main effects (http://rtutorialseries.blogspot.com/2011/02/r-tutorial-series-two-way-anova-with.html). I will change my statistical procedure accordingly.
Sincerely,
Yukai
===
Hi Yukai,
1) I'm wondering why you do not include the intercept in the model. In your equation, you have -1. I strongly suggest you include the intercept (i.e., +1). Depending on the statistical procedure you run, it can make a huge difference. I don't remember if it matters in the standard lm model, but I always include it for form.
2) It is fine to do gp*smoking. In lm, if you use *, e.g., variable1*variable2, the main effects for each variable in the interaction term will automatically be included.
3) The reason you do not see means for the 4 groups (ns1mALC, s1mALC, ns1wkALC, and s1wkALC) is that the model you built only includes the main effects and interactions. To get the adjusted means and standard errors for each individual group, you need to do pairwise comparisons among those groups. Just google pairwise comparisons in R or pairwise t-tests in R, or try http://www.r-statistics.com/. There should be some sample code that you can adapt, and it is relatively straightforward to execute.
Tim
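A sketch of what 3) could look like in base R (made-up data that only mimics the variable names above):
set.seed(4)
n   <- 80
dat <- data.frame(
  gp     = factor(rep(c("1mALC", "1wkALC"), each = n / 2)),
  smoker = factor(rep(c("ns", "s"), times = n / 2)),
  Age    = sample(30:70, n, replace = TRUE)
)
dat$FA <- 0.6 - 0.001 * dat$Age + rnorm(n, 0, 0.02)

# With the intercept, lm reports an intercept plus contrasts; with - 1 the
# same fit is reported as one estimate per gp level instead
fit <- lm(FA ~ gp * smoker + Age, data = dat)

# Adjusted (model-based) means for the 4 gp-by-smoker cells at the mean age
newd <- expand.grid(gp = levels(dat$gp), smoker = levels(dat$smoker),
                    Age = mean(dat$Age))
cbind(newd, FA_adj = predict(fit, newdata = newd))

# Pairwise t-tests among the 4 cells (BH-adjusted, no covariate adjustment)
pairwise.t.test(dat$FA, interaction(dat$gp, dat$smoker),
                p.adjust.method = "BH")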
Wednesday, April 27, 2016
4.27 UW Machine Learning Regression Week 2 Assignment 1
4 quiz questions that I got wrong:
1) If you double the value of a given feature (i.e., a specific column of the feature matrix), what happens to the least-squares estimated coefficients for every other feature? (Assume no other feature depends on the doubled feature, i.e., no interaction terms.)
It is impossible to tell from the information provided (wrong)
They stay the same
When interpreting a parameter, we hold the other features constant: doubling one feature's column halves that feature's own coefficient, but the least-squares coefficients for every other feature stay the same (a quick numeric check follows below).
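A quick numeric check in R (made-up data):
set.seed(5)
d   <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
d$y <- 1 + 2 * d$x1 + 3 * d$x2 + rnorm(20, 0, 0.1)
coef(lm(y ~ x1 + x2, data = d))

d$x1d <- 2 * d$x1                 # double one feature column
coef(lm(y ~ x1d + x2, data = d))  # x1d coefficient halves; x2 is unchanged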
2) Gradient descent/ascent is...
An approximation to simple linear regression (wrong)
A modeling technique in machine learning (wrong)
An algorithm for minimizing/maximizing a function
by definition...
3) Let's analyze how many computations are required to fit a multiple linear regression model using the closed-form solution based on a data set with 50 observations and 10 features. In the videos, we said that computing the inverse of the 10x10 matrix (H^T)H was on the order of D^3 operations. Let's focus on forming this matrix prior to inversion. How many multiplications are required to form the matrix (H^T)H?
1000 (wrong): did not read the question carefully...
N x N x D = 50 x 50 x 10 = 25000 (wrong): did not do the linear algebra correctly...
N x D x D = 50 x 10 x 10 = 5000 (review linear algebra; be patient)
4) More generally, if you have D features and N observations what is the total complexity of computing ((H^T)H)^(-1)?
O(D^3) (wrong) see first failure in 3)
O(N^2D + D^3) (wrong) see second failure in 3)
O(ND^2 + D^3) that's how I got 3) correct
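A quick sanity check in R of the counts in 3) and 4):
N <- 50; D <- 10
N * D * D  # multiplications to form t(H) %*% H: 5000
D^3        # order of operations to invert the D x D matrix: 1000
# Total complexity: O(N * D^2 + D^3)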