Friday, April 1, 2016

4.1 Thoughts in Data Analysis

Constructing a clean table takes patience and a lot of hard work, but the payoff is huge in downstream analyses.

Double-checking the result of each query is very important; it keeps bad data from spreading through the rest of the analysis.

What's wrong with my ML's answer of "predicting the majority class"?
(ref: Carlos in the ML class)
The majority class classifier was explained in the video "What's a good accuracy?" (under Week 3, Section 2: "Evaluating classification models"), but I also have a hard time understanding it, and I would like an instructor to help me understand it better. This is what I understand so far:

A "class" is one of the groups into which a classifier can place an event. For example, when flipping a coin there are 2 classes: heads or tails. This is binary classification, and I have a 50% chance of guessing a coin flip correctly. After many coin tosses, chances are my guesses will be about 0.5 accurate. If a classification problem has more classes (k), the accuracy of guessing uniformly at random will be 1/k.

For example, rolling a die has k = 6 classes, so the accuracy of random guessing would be 1/6 = 0.1666...
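The 1/k rule for random guessing can be checked with a small simulation. This is only a sketch: the function names, the number of trials, and the seed are made up for illustration, and the simulated "truth" is a fair coin or die.

```python
import random

def expected_random_guess_accuracy(k):
    """Expected accuracy when guessing uniformly at random among k classes."""
    return 1.0 / k

def simulate_random_guess_accuracy(k, trials=20000, seed=0):
    """Estimate the same accuracy by simulation (seed fixed for repeatability)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        truth = rng.randrange(k)   # the actual outcome, e.g. a die face
        guess = rng.randrange(k)   # a guess made without looking at the truth
        hits += (truth == guess)
    return hits / trials

coin = expected_random_guess_accuracy(2)   # binary case: heads or tails
die = expected_random_guess_accuracy(6)    # six-sided die
die_sim = simulate_random_guess_accuracy(6)
print(coin, die, die_sim)
```

The simulated value should land close to 1/6, though never exactly on it, since it is an estimate from a finite number of rolls.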

Now, for the example in the video, Carlos explains that we should always compare our model's accuracy with a simple baseline. A simple baseline could be a Random Guess (50% accurate for binary classification) or a Majority Class (guess the most common class for everything; the accuracy of your guess then equals that class's share of the data).

Carlos explains that 90% of mail is spam, so spam is the "Majority Class".

That means that if I had a dataset with tons of emails and I just looked at the file without even opening any mail, and then declared: "Fu** it, all of this mail is spam. Let's go have a beer", I would be 90% accurate. THEN, if I build a classifier model, it must be better than 90% accurate, or else the "Fu** it" strategy is a better investment of my time.
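Here is a minimal sketch of the "all of this mail is spam" strategy. The 100-email list is invented to match the 90% figure from the video, and `majority_class_accuracy` is a hypothetical helper name, not something from the course materials.

```python
from collections import Counter

def majority_class_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Made-up dataset: 90 spam emails and 10 legitimate ("ham") emails.
labels = ["spam"] * 90 + ["ham"] * 10

baseline_label, baseline_acc = majority_class_accuracy(labels)
print(baseline_label, baseline_acc)

# A trained model is only worth the effort if it beats this number.
model_acc = 0.88  # hypothetical accuracy of some trained classifier
print("beats baseline:", model_acc > baseline_acc)
```

With these made-up numbers, a model that is 88% accurate still loses to the do-nothing baseline, which is exactly the trap Carlos warns about.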

So, let's say that I arrive at the bar to have my beer for a job well done, and I declare: "This bar is full of good beer!". The barman (who also happens to be a statistician during the day) then corrects me: "The truth is, we sell 65 beer brands here. If you choose one at random, there's a 1/65 chance that you will like it. If I recommend one, there's a 50% chance you will like it. I could recommend our best-selling beer, which is just 2% of our business, or I could share with you this dataset of beer lovers' reviews ..."

That being the case, my time would be better invested in building a predictor model to guess which beers from the bar's selection I might like. If I make a predictor with, say, 80% accuracy, I'm truly in for some good beer without getting wasted drinking lots of untasty beer.

Then, this is the approach I will take for question 7. "Which of the following ranges contains the accuracy of the majority class classifier, which simply predicts the majority class on the test_data?":

I will look for the accuracy of the majority class classifier on the test data, based on the 'sentiment' column: just sum the entire column (to find the number of items labeled 1) and divide by the length of the SFrame. That gives the fraction of 1s; if it is below 0.5, the majority class is 0 and the baseline accuracy is one minus that fraction.
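The sum-and-divide approach can be sketched in plain Python. This assumes the 'sentiment' column holds 0/1 labels; a small made-up list stands in for the real `test_data['sentiment']` column, and the function name is hypothetical.

```python
def majority_accuracy_from_binary(sentiment):
    """Majority class baseline accuracy for a column of 0/1 labels."""
    # Summing a 0/1 column counts the 1s; dividing by the length gives their share.
    positive_fraction = sum(sentiment) / len(sentiment)
    # If 1s are the minority, the majority class is 0, so take the larger share.
    return max(positive_fraction, 1.0 - positive_fraction)

# Toy stand-in for the real column (values invented for illustration).
toy_sentiment = [1, 1, 1, 0, 1, 0, 1, 1]
acc = majority_accuracy_from_binary(toy_sentiment)
print(acc)

# The same function still works when 0 is the majority class.
mostly_negative = [0, 0, 0, 1]
print(majority_accuracy_from_binary(mostly_negative))
```

Taking the larger of the two fractions is the one step the plain sum-and-divide recipe leaves out: on its own it reports the share of 1s, which is only the baseline accuracy when 1 happens to be the majority class.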
