This simple book introduces you to statistical thinking. It will not take you ten thousand hours to complete. This book provides specific directions so you save time. By slowly building your skills through practice and data journeys, you will begin to experience the joy of stats as you tackle bigger problems. You will learn how to use stats to collect data, to see what the data tell you, and to share discoveries with others.
With the help of this book, you will learn the simple skills that all data-driven decision-makers need:
That's it. This will be extremely hard at first, and perhaps seem a bit dull. It's important that you stick with it. By working through this book, completing each exercise in an hour or two, you'll develop a strong foundation for embarking on your own data expeditions.
There is a rule of thumb that true mastery occurs when you invest ten thousand hours working in a field of study. [1] This book will not teach you to be a statistical master or data scientist extraordinaire. That's a big mission and beyond the scope of just one book. What this book does aim to do is to get you started on the journey of becoming a data Yoda, a wise decision maker.
[1] The 10,000-hour concept is based on research by K. Anders Ericsson, Professor of Psychology at Florida State University, and was popularized by Malcolm Gladwell's book Outliers and Geoff Colvin's book Talent Is Overrated. Approximately ten thousand hours of investment in a field correlates with, or contributes directly to, mastery.
We will use a programming language called Python to help us understand statistics and, most importantly, to be able to apply that understanding to real data.
Why can't I just do statistics 'by hand' instead of with a computer?
Hands are handy for many things, including for writing quick 'back of the napkin' calculations to think about a problem before you go to the trouble of making a computer do the heavy lifting -- the math. Hands are not very good at performing millions of calculations each second, making publication-quality charts, or sharing discoveries with others, especially online.
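To make that concrete, here is a minimal sketch of the kind of heavy lifting a computer does without complaint. It simulates one million rolls of a fair die and summarizes them. The die-rolling scenario and the variable names are only illustrative assumptions, not an exercise from this book, and it uses nothing beyond Python's standard library.

```python
import random
import statistics

random.seed(42)  # fixed seed so repeated runs give the same rolls

# Simulate one million rolls of a fair six-sided die.
rolls = [random.randint(1, 6) for _ in range(1_000_000)]

# Average roll; for a fair die this should land close to 3.5.
mean_roll = statistics.fmean(rolls)

# Share of rolls that came up six; this should land close to 1/6.
share_of_sixes = rolls.count(6) / len(rolls)

print(f"number of rolls: {len(rolls)}")
print(f"average roll:    {mean_roll:.3f}")
print(f"share of sixes:  {share_of_sixes:.3f}")
```

A napkin calculation tells you what to expect (an average of 3.5, and a six about one time in six). The computer then checks that expectation against a million rolls in about a second.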
Why can't I just use a spreadsheet tool, like Excel?
Spreadsheets are only one way of thinking about and playing with data. Limiting yourself to a single piece of software, like Excel, can lead to a blind way of living. That's okay for some people. It's often the case that imaginary statisticians (usually business managers and their ilk) think that if it can't be done in Excel, then it can't be done at all. This is unfortunate, as there is much that can be done beyond Excel, or with just a little of Excel's help. That's why we're using a programming language.
Why programming?
Your laptop, phone, power grid, vehicle (at least post-1980), economy, and (most importantly, of course) your social network profile are all computer-dependent, from prototyping to construction to maintenance. Solar flares and power outages be darned, we will continue to integrate computers into our lives. It's increasingly likely you will have to bend technology to fit your needs. It's like offering ice cream to a kid as a reward for getting potty trained: speak the same language and you can get others (e.g., computers, kids, whatever) to do what you want.
There are many programming languages, all with the same goal: Make the computer do what you want. We will use Python.
Why Python?
But a statistician told me to use R
R is a language designed by statisticians, for statisticians. It is a powerful and focused language. R provides many statistical tools.
However, for a non-programmer, R can be tricky to learn. Python is:
- able to replicate much of the functionality of R, such as with the pandas and pylab modules,
- easier to learn (but still hard, like any other new language),
- able to stay with you even when you're not doing statistics, such as when making a computer game, a webapp, or whatever.
Attention to detail is critical to success in any profession, and this holds true in statistics and programming. As you work your way through this book, read the code in each exercise carefully. Likewise, you have to be a decent typist to be able to type obscure symbols as you write your own code.
We all make mistakes. Perhaps you'll even spot some mistakes I've made while writing this book. Accept that you will screw up, and make a commitment to try again. Persistence is a key factor in separating poor performance from good performance.