Sunday, November 29, 2015

11.29 Coursera Projects Feedback

Anyway, it seems I stole these points successfully...


Reproducible Research

Nice work! Your report does a good job hitting all the basic requirements. I gave you full points. There are some questions you might want to think about to further strengthen the report.
  1. Why do you add injuries and fatalities together? It seems to me they are different levels of impact and cannot be directly compared.
  2. Why do you only consider property damage and not crop damage for economic impact?
  3. Did you see that the PROPDMGEXP column should have been used to change the magnitude of the property damage value?
  4. Would a bar plot be a better visualization than a pie chart to compare the difference of the event types?
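Question 3 in the feedback is about scaling PROPDMG by the magnitude code in PROPDMGEXP. A minimal sketch of that idea follows. It assumes only the codes K, M and B (thousands, millions, billions) carry a multiplier and treats every other code as 1, which is a simplification; the names EXPONENT_MULTIPLIER and damage_in_dollars are made up for illustration and are not from my report.

```python
# Assumption: only K/M/B are interpreted; any other code means "no scaling".
EXPONENT_MULTIPLIER = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def damage_in_dollars(value, exp_code):
    """Scale a raw damage value by its magnitude code (hypothetical helper)."""
    code = (exp_code or "").strip().upper()
    return value * EXPONENT_MULTIPLIER.get(code, 1)


# Made-up (PROPDMG, PROPDMGEXP) rows, only to show the call.
rows = [(25.0, "K"), (2.5, "M"), (5.0, "")]
total = sum(damage_in_dollars(value, code) for value, code in rows)
print(total)
```

The same helper could be applied to the crop damage columns before adding the two together, which would also address question 2.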


Data Product

peer 1 → Very good application for the assignment. The documentation should be combined with the app and attached as a separate link, but it is ok. The presentation does not work...

peer 2 → Your app is very simple. Dead simple. Seems like you skipped the machine learning class, as you are using a complicated approach to prediction by using individual lm coefficients instead of properly building the model and predicting values with appropriate functions. This is just to explain, why I did not give you the extra +1 for the course project. Apart from that, great effort and clean code. Good job!

peer 3 → It worked well! Nice sales job on presentation. Very energetic.

peer 4 → Hi, good work. The only missing thing is an HTML output. Anyway, great work!!

Monday, November 23, 2015

Thursday, November 19, 2015

11.19 A Review of What Overfitting Is

(From lecture notes of Regression Modeling) The estimate of the residual variance is biased if we underfit the model, i.e. leave out necessary covariates. It is unbiased if we fit the correct model or overfit it, i.e. include all necessary covariates, with or without unnecessary ones. When unnecessary variables are included (overfitting), the variance of the variance estimate is larger.
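A small simulation can make these three statements concrete. This is only a sketch under my own assumptions: a fixed design with five observations, a true model y = 2 * x1 + noise with no intercept, noise variance 1, and an unrelated covariate x2. All names and numbers are made up for illustration.

```python
import random
import statistics


def simulate(n_sims=5000, sigma=1.0, beta1=2.0, seed=42):
    """Return (mean, variance) of the residual variance estimate per model."""
    rng = random.Random(seed)
    x1 = [1.0, 2.0, 3.0, 4.0, 5.0]    # necessary covariate (assumed design)
    x2 = [1.0, -1.0, 2.0, -2.0, 0.0]  # unnecessary covariate (assumed design)
    n = len(x1)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    det = s11 * s22 - s12 * s12
    estimates = {"underfit": [], "correct": [], "overfit": []}
    for _ in range(n_sims):
        y = [beta1 * a + rng.gauss(0.0, sigma) for a in x1]
        s1y = sum(a * v for a, v in zip(x1, y))
        s2y = sum(b * v for b, v in zip(x2, y))
        # Underfit: x1 is omitted, so no parameters are estimated at all.
        estimates["underfit"].append(sum(v * v for v in y) / n)
        # Correct: least squares through the origin on x1 (1 parameter).
        b = s1y / s11
        rss = sum((v - b * a) ** 2 for a, v in zip(x1, y))
        estimates["correct"].append(rss / (n - 1))
        # Overfit: least squares on x1 and x2 via the 2x2 normal equations.
        b1 = (s22 * s1y - s12 * s2y) / det
        b2 = (s11 * s2y - s12 * s1y) / det
        rss = sum((v - b1 * a - b2 * c) ** 2 for a, c, v in zip(x1, x2, y))
        estimates["overfit"].append(rss / (n - 2))
    return {
        name: (statistics.mean(vals), statistics.variance(vals))
        for name, vals in estimates.items()
    }


results = simulate()
for name, (mean_est, var_est) in results.items():
    print(f"{name:9s} mean = {mean_est:8.3f}   variance = {var_est:8.3f}")
```

With the true variance set to 1, the underfit mean should land far above 1, the other two means close to 1, and the overfit row should show the larger spread.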

(From lecture notes of Practical Machine Learning) Overfitting leads to an out-of-sample error rate that is higher than the in-sample error rate.
Data have two parts: signal and noise. The goal of a predictor is to find the signal. A perfect in-sample predictor can always be designed, but it captures the noise along with the signal, so it won't perform as well on new samples.

An overfit model does worse when extrapolating to data it has not seen. It can fit the training data perfectly, but it generalizes poorly and is less robust in the outside world.
Adding an extra variable is equivalent to constraining the fit in a new dimension.
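To see the in-sample versus out-of-sample gap, here is a rough sketch comparing a straight-line fit with a predictor that simply memorizes the training data (nearest training point). The signal y = 2x, the noise level and the sample sizes are assumptions chosen for illustration, and the function names are my own.

```python
import random

rng = random.Random(0)


def sample(n):
    """Draw n points from an assumed signal y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 3.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares line: only tries to capture the signal."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return lambda x: my + slope * (x - mx)


def fit_memorizer(xs, ys):
    """Return the y of the nearest training x: perfect in-sample by design."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = sample(20)
test_x, test_y = sample(200)

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

line_in, line_out = mse(line, train_x, train_y), mse(line, test_x, test_y)
memo_in = mse(memorizer, train_x, train_y)
memo_out = mse(memorizer, test_x, test_y)

print("line      in-sample:", line_in, " out-of-sample:", line_out)
print("memorizer in-sample:", memo_in, " out-of-sample:", memo_out)
```

The memorizer has zero in-sample error because it has learned the noise too, which is exactly what stops it from carrying over to the new sample.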

11.19 Reproducible Research Checklist

A comprehensive checklist from the "Reproducible Research" class in the Data Science Specialization.

Do:
  • Start with good science
  • Teach a computer
  • Use some version control (e.g. commits on GitHub)
  • Keep track of software environment
  • Set seed for random number generators 
  • Think about the entire pipeline
Don't:
  • Don't do everything by hand; if must, document it
  • Don't use point-and-click software
  • Don't save output (except perhaps temporarily, for efficiency)
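Two of the "Do" items, keeping track of the software environment and setting a seed, are easy to sketch in code. The example below is a minimal illustration using only the Python standard library; describe_environment and run_analysis are hypothetical names, and the "analysis" is just a few random draws standing in for a real pipeline.

```python
import platform
import random
import sys


def describe_environment():
    """Collect basic facts worth recording next to any result."""
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
    }


def run_analysis(seed):
    """Stand-in for an analysis step that depends on random numbers."""
    rng = random.Random(seed)  # explicit seed makes the draws repeatable
    return [rng.random() for _ in range(3)]


env = describe_environment()
first = run_analysis(seed=1234)
second = run_analysis(seed=1234)

print(env)
print("same seed, same result:", first == second)
```

Because the output can be regenerated from the code, the seed and the recorded environment, there is little reason to keep the output itself around.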