Data Analysis with Open Source Tools
Data Analysis with
Open Source Tools
Philipp K. Janert
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Data Analysis with Open Source Tools
by Philipp K. Janert
Copyright © 2011 Philipp K. Janert. All rights reserved. Printed in the United States of America.
Published by O’Reilly Media, Inc. 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online
editions are also available for most titles (http://my.safaribooksonline.com). For more information,
contact our corporate/institutional sales department: (800) 998-9938 or
[email protected].
Editor: Mike Loukides
Indexer: Fred Brown
Production Editor: Sumita Mukherji
Cover Designer: Karen Montgomery
Copyeditor: Matt Darnell
Interior Designer: Edie Freedman and Ron Bilodeau
Production Services: MPS Limited, a Macmillan Company, and Newgen North America, Inc.
Illustrator: Philipp K. Janert
Printing History:
November 2010: First Edition.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Analysis with Open Source
Tools, the image of a common kite, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc.
was aware of a trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and author
assume no responsibility for errors or omissions, or for damages resulting from the use of the
information contained herein.
ISBN: 978-0-596-80235-6
[M]
[2011-05-27]
Furious activity is no substitute for understanding.
—H. H. Williams
CONTENTS

PREFACE

1 INTRODUCTION
    Data Analysis
    What’s in This Book
    What’s with the Workshops?
    What’s with the Math?
    What You’ll Need
    What’s Missing

PART I Graphics: Looking at Data

2 A SINGLE VARIABLE: SHAPE AND DISTRIBUTION
    Dot and Jitter Plots
    Histograms and Kernel Density Estimates
    The Cumulative Distribution Function
    Rank-Order Plots and Lift Charts
    Only When Appropriate: Summary Statistics and Box Plots
    Workshop: NumPy
    Further Reading

3 TWO VARIABLES: ESTABLISHING RELATIONSHIPS
    Scatter Plots
    Conquering Noise: Smoothing
    Logarithmic Plots
    Banking
    Linear Regression and All That
    Showing What’s Important
    Graphical Analysis and Presentation Graphics
    Workshop: matplotlib
    Further Reading

4 TIME AS A VARIABLE: TIME-SERIES ANALYSIS
    Examples
    The Task
    Smoothing
    Don’t Overlook the Obvious!
    The Correlation Function
    Optional: Filters and Convolutions
    Workshop: scipy.signal
    Further Reading

5 MORE THAN TWO VARIABLES: GRAPHICAL MULTIVARIATE ANALYSIS
    False-Color Plots
    A Lot at a Glance: Multiplots
    Composition Problems
    Novel Plot Types
    Interactive Explorations
    Workshop: Tools for Multivariate Graphics
    Further Reading

6 INTERMEZZO: A DATA ANALYSIS SESSION
    A Data Analysis Session
    Workshop: gnuplot
    Further Reading

PART II Analytics: Modeling Data

7 GUESSTIMATION AND THE BACK OF THE ENVELOPE
    Principles of Guesstimation
    How Good Are Those Numbers?
    Optional: A Closer Look at Perturbation Theory and Error Propagation
    Workshop: The Gnu Scientific Library (GSL)
    Further Reading

8 MODELS FROM SCALING ARGUMENTS
    Models
    Arguments from Scale
    Mean-Field Approximations
    Common Time-Evolution Scenarios
    Case Study: How Many Servers Are Best?
    Why Modeling?
    Workshop: Sage
    Further Reading

9 ARGUMENTS FROM PROBABILITY MODELS
    The Binomial Distribution and Bernoulli Trials
    The Gaussian Distribution and the Central Limit Theorem
    Power-Law Distributions and Non-Normal Statistics
    Other Distributions
    Optional: Case Study—Unique Visitors over Time
    Workshop: Power-Law Distributions
    Further Reading

10 WHAT YOU REALLY NEED TO KNOW ABOUT CLASSICAL STATISTICS
    Genesis
    Statistics Defined
    Statistics Explained
    Controlled Experiments Versus Observational Studies
    Optional: Bayesian Statistics—The Other Point of View
    Workshop: R
    Further Reading

11 INTERMEZZO: MYTHBUSTING—BIGFOOT, LEAST SQUARES, AND ALL THAT
    How to Average Averages
    The Standard Deviation
    Least Squares
    Further Reading

PART III Computation: Mining Data

12 SIMULATIONS
    A Warm-Up Question
    Monte Carlo Simulations
    Resampling Methods
    Workshop: Discrete Event Simulations with SimPy
    Further Reading

13 FINDING CLUSTERS
    What Constitutes a Cluster?
    Distance and Similarity Measures
    Clustering Methods
    Pre- and Postprocessing
    Other Thoughts
    A Special Case: Market Basket Analysis
    A Word of Warning
    Workshop: Pycluster and the C Clustering Library
    Further Reading

14 SEEING THE FOREST FOR THE TREES: FINDING IMPORTANT ATTRIBUTES
    Principal Component Analysis
    Visual Techniques
    Kohonen Maps
    Workshop: PCA with R
    Further Reading

15 INTERMEZZO: WHEN MORE IS DIFFERENT
    A Horror Story
    Some Suggestions
    What About Map/Reduce?
    Workshop: Generating Permutations
    Further Reading

PART IV Applications: Using Data

16 REPORTING, BUSINESS INTELLIGENCE, AND DASHBOARDS
    Business Intelligence
    Corporate Metrics and Dashboards
    Data Quality Issues
    Workshop: Berkeley DB and SQLite
    Further Reading

17 FINANCIAL CALCULATIONS AND MODELING
    The Time Value of Money
    Uncertainty in Planning and Opportunity Costs
    Cost Concepts and Depreciation
    Should You Care?
    Is This All That Matters?
    Workshop: The Newsvendor Problem
    Further Reading

18 PREDICTIVE ANALYTICS
    Introduction
    Some Classification Terminology
    Algorithms for Classification
    The Process
    The Secret Sauce
    The Nature of Statistical Learning
    Workshop: Two Do-It-Yourself Classifiers
    Further Reading

19 EPILOGUE: FACTS ARE NOT REALITY

A PROGRAMMING ENVIRONMENTS FOR SCIENTIFIC COMPUTATION AND DATA ANALYSIS
    Software Tools
    A Catalog of Scientific Software
    Writing Your Own
    Further Reading

B RESULTS FROM CALCULUS
    Common Functions
    Calculus
    Useful Tricks
    Notation and Basic Math
    Where to Go from Here
    Further Reading

C WORKING WITH DATA
    Sources for Data
    Cleaning and Conditioning
    Sampling
    Data File Formats
    The Care and Feeding of Your Data Zoo
    Skills
    Terminology
    Further Reading

INDEX

Preface
This book grew out of my experience of working with data for various companies in the tech
industry. It is a collection of those concepts and techniques that I have found to be the
most useful, including many topics that I wish I had known earlier—but didn’t.
My degree is in physics, but I also worked as a software engineer for several years. The
book reflects this dual heritage. On the one hand, it is written for programmers and others
in the software field: I assume that you, like me, have the ability to write your own
programs to manipulate data in any way you want.
On the other hand, the way I think about data has been shaped by my background and
education. As a physicist, I am not content merely to describe data or to make black-box
predictions: the purpose of an analysis is always to develop an understanding for the
processes or mechanisms that give rise to the data that we observe.
The instrument to express such understanding is the model: a description of the system
under study (in other words, not just a description of the data!), simplified as necessary
but nevertheless capturing the relevant information. A model may be crude (“Assume a
spherical cow . . . ”), but if it helps us develop better insight on how the system works, it is
a successful model nevertheless. (Additional precision can often be obtained at a later
time, if it is really necessary.)
This emphasis on models and simplified descriptions is not universal: other authors and
practitioners will make different choices. But it is essential to my approach and point of
view.
This is a rather personal book. Although I have tried to be reasonably comprehensive, I
have selected the topics that I consider relevant and useful in practice—whether they are
part of the “canon” or not. Also included are several topics that you won’t find in any
other book on data analysis. Although neither new nor original, they are usually not used
or discussed in this particular context—but I find them indispensable.
Throughout the book, I freely offer specific, explicit advice, opinions, and assessments.
These remarks are reflections of my personal interest, experience, and understanding. I do
not claim that my point of view is necessarily correct: evaluate what I say for yourself and
feel free to adapt it to your needs. In my view, a specific, well-argued position is of greater
use than a sterile laundry list of possible algorithms—even if you later decide to disagree
with me. The value is not in the opinion but rather in the arguments leading up to it. If
your arguments are better than mine, or even just more agreeable to you, then I will have
achieved my purpose!
Data analysis, as I understand it, is not a fixed set of techniques. It is a way of life, and it
has a name: curiosity. There is always something else to find out and something more to
learn. This book is not the last word on the matter; it is merely a snapshot in time: things I
knew about and found useful today.
“Works are of value only if they give rise to better ones.”
(Alexander von Humboldt, writing to Charles Darwin, 18 September 1839)
Before We Begin
More data analysis efforts seem to go bad because of an excess of sophistication rather
than a lack of it.
This may come as a surprise, but it has been my experience again and again. As a
consultant, I am often called in when the initial project team has already gotten stuck.
Rarely (if ever) does the problem turn out to be that the team did not have the required
skills. On the contrary, I usually find that they tried to do something unnecessarily
complicated and are now struggling with the consequences of their own invention!
Based on what I have seen, two particular risk areas stand out:
• The use of “statistical” concepts that are only partially understood (and given the
  relative obscurity of most of statistics, this includes virtually all statistical concepts)
• Complicated (and expensive) black-box solutions when a simple and transparent
  approach would have worked at least as well or better
I strongly recommend that you make it a habit to avoid all statistical language. Keep it
simple and stick to what you know for sure. There is absolutely nothing wrong with
speaking of the “range over which points spread,” because this phrase means exactly what
it says: the range over which points spread, and only that! Once we start talking about
“standard deviations,” this clarity is gone. Are we still talking about the observed width of
the distribution? Or are we talking about one specific measure for this width? (The
standard deviation is only one of several that are available.) Are we already making an
implicit assumption about the nature of the distribution? (The standard deviation is only
suitable under certain conditions, which are often not fulfilled in practice.) Or are we even
confusing the predictions we could make if these assumptions were true with the actual
data? (The moment someone talks about “95 percent anything” we know it’s the latter!)
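To make the point about competing measures of width concrete, here is a minimal Python sketch that is not part of the original text. The function name width_measures, the choice of four measures, and the sample values are all assumptions made for illustration. It computes several measures of spread for the same small data set, which contains one outlier.

```python
import statistics


def width_measures(values):
    """Return several common measures of the spread of a data set."""
    # Quartile cut points (default "exclusive" method of the statistics module).
    q1, _, q3 = statistics.quantiles(values, n=4)
    med = statistics.median(values)
    return {
        # The plain "range over which points spread".
        "range": max(values) - min(values),
        # Sample standard deviation: sensitive to outliers.
        "std_dev": statistics.stdev(values),
        # Interquartile range: width of the middle half of the data.
        "iqr": q3 - q1,
        # Median absolute deviation from the median.
        "mad": statistics.median(abs(v - med) for v in values),
    }


# Hypothetical sample: eight tightly clustered values and one outlier.
sample = [9, 10, 10, 11, 11, 12, 12, 13, 50]

for name, value in width_measures(sample).items():
    print(f"{name:8s} {value:.2f}")
```

For this sample the four numbers differ widely, because the single outlier dominates the range and the standard deviation but barely affects the interquartile range or the median absolute deviation. Which one counts as "the width" depends on the question being asked, so it helps to state the measure explicitly.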
I’d also like to remind you not to discard simple methods until they have been proven
insufficient. Simple solutions are frequently rather effective: the marginal benefit that
more complicated methods can deliver is often quite small (and may be in no reasonable
relation to the increased cost). More importantly, simple methods have fewer
opportunities to go wrong or to obscure the obvious.
True story: a company was tracking the occurrence of defects over time. Of course, the
actual number of defects varied quite a bit from one day to the next, and they were
looking for a way to obtain an estimate for the typical number of expected defects. The
solution proposed by their IT department involved a compute cluster running a neural
network! (I am not making this up.) In fact, a one-line calculation (involving a moving
average or single exponential smoothing) is all that was needed.
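As a rough sketch of what such a calculation might look like (this code is not from the original text; the function names, the window size, the smoothing parameter alpha, and the daily counts are all hypothetical), both techniques fit in a few lines of plain Python:

```python
def moving_average(counts, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(counts[i:i + window]) / window
        for i in range(len(counts) - window + 1)
    ]


def exponential_smoothing(counts, alpha):
    """Single exponential smoothing: s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = [counts[0]]  # start from the first observation
    for x in counts[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical number of defects observed on seven consecutive days.
daily_defects = [12, 15, 9, 14, 20, 11, 13]

print(moving_average(daily_defects, window=3))
print(exponential_smoothing(daily_defects, alpha=0.3))
```

The last element of either result can serve as an estimate of the typical number of defects to expect. A larger window or a smaller alpha gives a smoother but slower-reacting estimate.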
I think the primary reason for this tendency to make data analysis projects more
complicated than they are is discomfort: discomfort with an unfamiliar problem space and
uncertainty about how to proceed. This discomfort and uncertainty create a desire to
bring in the “big guns”: fancy terminology, heavy machinery, large projects. In reality, of
course, the big guns have the opposite effect: the complexities of the “solution” overwhelm
the original problem, and nothing gets accomplished.
Data analysis does not have to be all that hard. Although there are situations when
elementary methods will no longer be sufficient, they are much less prevalent than you
might expect. In the vast majority of cases, curiosity and a healthy dose of common sense
will serve you well.
The attitude that I am trying to convey can be summarized in a few points:
Simple is better than complex.
Cheap is better than expensive.
Explicit is better than opaque.
Purpose is more important than process.
Insight is more important than precision.
Understanding is more important than technique.
Think more, work less.
Although I do acknowledge that the items on the right are necessary at times, I will give
preference to those on the left whenever possible.
It is in this spirit that I am offering the concepts and techniques that make up the rest of
this book.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, and email addresses
Constant width
Used to refer to language and script elements
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in this
book in your programs and documentation. You do not need to contact us for permission
unless you’re reproducing a significant portion of the code. For example, writing a
program that uses several chunks of code from this book does not require permission.
Selling or distributing a CD-ROM of examples from O’Reilly books does require
permission. Answering a question by citing this book and quoting example code does not
require permission. Incorporating a significant amount of example code from this book
into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Data Analysis with Open Source Tools, by Philipp
K. Janert. Copyright 2011 Philipp K. Janert, 978-0-596-80235-6.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at
[email protected].
Safari® Books Online
Safari Books Online is an on-demand digital library that lets you easily search over 7,500
technology and creative reference books and videos to find the answers you need quickly.
With a subscription, you can read any page and watch any video from our library online.
Read books on your cell phone and mobile devices. Access new titles before they are
available for print, and get exclusive access to manuscripts in development and post
feedback for the authors. Copy and paste code samples, organize your favorites, download
chapters, bookmark key sections, create notes, print out pages, and benefit from tons of
other time-saving features.
O’Reilly Media has uploaded this book to the Safari Books Online service. To have full
digital access to this book and others on similar topics from O’Reilly and other publishers,
sign up for free at http://my.safaribooksonline.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at:
http://oreilly.com/catalog/9780596802356
To comment or ask technical questions about this book, send email to:
[email protected]
For more information about our books, conferences, Resource Centers, and the O’Reilly
Network, see our website at:
http://oreilly.com
Acknowledgments
It was a pleasure to work with O’Reilly on this project. In particular, O’Reilly has been
most accommodating with regard to the technical challenges raised by my need to include
(for an O’Reilly book) an uncommonly large amount of mathematical material in the
manuscript.
Mike Loukides has accompanied this project as the editor since its beginning. I have
enjoyed our conversations about life, the universe, and everything, and I appreciate his
comments about the manuscript—either way.
I’d like to thank several of my friends for their help in bringing this book about:
• Elizabeth Robson, for making the connection
• Austin King, for pointing out the obvious
• Scott White, for suffering my questions gladly
• Richard Kreckel, for much-needed advice
As always, special thanks go to PAUL Schrader (Bremen).
The manuscript benefited from the feedback I received from various reviewers. Michael E.
Driscoll, Zachary Kessin, and Austin King read all or parts of the manuscript and provided
valuable comments.
I enjoyed personal correspondence with Joseph Adler, Joe Darcy, Hilary Mason, Stephen
Weston, Scott White, and Brian Zimmer. All very generously provided expert advice on
specific topics.
Particular thanks go to Richard Kreckel, who provided uncommonly detailed and
insightful feedback on most of the manuscript.
During the preparation of this book, the excellent collection at the University of
Washington libraries was an especially valuable resource to me.