
Document: Data Analysis with Open Source Tools


Description:

www.it-ebooks.info

Data Analysis with Open Source Tools
Philipp K. Janert

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Data Analysis with Open Source Tools
by Philipp K. Janert

Copyright © 2011 Philipp K. Janert. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc. 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or [email protected].

Editor: Mike Loukides
Production Editor: Sumita Mukherji
Copyeditor: Matt Darnell
Production Services: MPS Limited, a Macmillan Company, and Newgen North America, Inc.
Indexer: Fred Brown
Cover Designer: Karen Montgomery
Interior Designer: Edie Freedman and Ron Bilodeau
Illustrator: Philipp K. Janert

Printing History:
November 2010: First Edition.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Analysis with Open Source Tools, the image of a common kite, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-0-596-80235-6
[M] [2011-05-27]

Furious activity is no substitute for understanding.
—H. H. Williams

CONTENTS

PREFACE

1 INTRODUCTION
  Data Analysis
  What’s in This Book
  What’s with the Workshops?
  What’s with the Math?
  What You’ll Need
  What’s Missing

PART I Graphics: Looking at Data

2 A SINGLE VARIABLE: SHAPE AND DISTRIBUTION
  Dot and Jitter Plots
  Histograms and Kernel Density Estimates
  The Cumulative Distribution Function
  Rank-Order Plots and Lift Charts
  Only When Appropriate: Summary Statistics and Box Plots
  Workshop: NumPy
  Further Reading

3 TWO VARIABLES: ESTABLISHING RELATIONSHIPS
  Scatter Plots
  Conquering Noise: Smoothing
  Logarithmic Plots
  Banking
  Linear Regression and All That
  Showing What’s Important
  Graphical Analysis and Presentation Graphics
  Workshop: matplotlib
  Further Reading

4 TIME AS A VARIABLE: TIME-SERIES ANALYSIS
  Examples
  The Task
  Smoothing
  Don’t Overlook the Obvious!
  The Correlation Function
  Optional: Filters and Convolutions
  Workshop: scipy.signal
  Further Reading

5 MORE THAN TWO VARIABLES: GRAPHICAL MULTIVARIATE ANALYSIS
  False-Color Plots
  A Lot at a Glance: Multiplots
  Composition Problems
  Novel Plot Types
  Interactive Explorations
  Workshop: Tools for Multivariate Graphics
  Further Reading

6 INTERMEZZO: A DATA ANALYSIS SESSION
  A Data Analysis Session
  Workshop: gnuplot
  Further Reading

PART II Analytics: Modeling Data

7 GUESSTIMATION AND THE BACK OF THE ENVELOPE
  Principles of Guesstimation
  How Good Are Those Numbers?
  Optional: A Closer Look at Perturbation Theory and Error Propagation
  Workshop: The Gnu Scientific Library (GSL)
  Further Reading

8 MODELS FROM SCALING ARGUMENTS
  Models
  Arguments from Scale
  Mean-Field Approximations
  Common Time-Evolution Scenarios
  Case Study: How Many Servers Are Best?
  Why Modeling?
  Workshop: Sage
  Further Reading

9 ARGUMENTS FROM PROBABILITY MODELS
  The Binomial Distribution and Bernoulli Trials
  The Gaussian Distribution and the Central Limit Theorem
  Power-Law Distributions and Non-Normal Statistics
  Other Distributions
  Optional: Case Study—Unique Visitors over Time
  Workshop: Power-Law Distributions
  Further Reading

10 WHAT YOU REALLY NEED TO KNOW ABOUT CLASSICAL STATISTICS
  Genesis
  Statistics Defined
  Statistics Explained
  Controlled Experiments Versus Observational Studies
  Optional: Bayesian Statistics—The Other Point of View
  Workshop: R
  Further Reading

11 INTERMEZZO: MYTHBUSTING—BIGFOOT, LEAST SQUARES, AND ALL THAT
  How to Average Averages
  The Standard Deviation
  Least Squares
  Further Reading

PART III Computation: Mining Data

12 SIMULATIONS
  A Warm-Up Question
  Monte Carlo Simulations
  Resampling Methods
  Workshop: Discrete Event Simulations with SimPy
  Further Reading

13 FINDING CLUSTERS
  What Constitutes a Cluster?
  Distance and Similarity Measures
  Clustering Methods
  Pre- and Postprocessing
  Other Thoughts
  A Special Case: Market Basket Analysis
  A Word of Warning
  Workshop: Pycluster and the C Clustering Library
  Further Reading

14 SEEING THE FOREST FOR THE TREES: FINDING IMPORTANT ATTRIBUTES
  Principal Component Analysis
  Visual Techniques
  Kohonen Maps
  Workshop: PCA with R
  Further Reading

15 INTERMEZZO: WHEN MORE IS DIFFERENT
  A Horror Story
  Some Suggestions
  What About Map/Reduce?
  Workshop: Generating Permutations
  Further Reading

PART IV Applications: Using Data

16 REPORTING, BUSINESS INTELLIGENCE, AND DASHBOARDS
  Business Intelligence
  Corporate Metrics and Dashboards
  Data Quality Issues
  Workshop: Berkeley DB and SQLite
  Further Reading

17 FINANCIAL CALCULATIONS AND MODELING
  The Time Value of Money
  Uncertainty in Planning and Opportunity Costs
  Cost Concepts and Depreciation
  Should You Care?
  Is This All That Matters?
  Workshop: The Newsvendor Problem
  Further Reading

18 PREDICTIVE ANALYTICS
  Introduction
  Some Classification Terminology
  Algorithms for Classification
  The Process
  The Secret Sauce
  The Nature of Statistical Learning
  Workshop: Two Do-It-Yourself Classifiers
  Further Reading

19 EPILOGUE: FACTS ARE NOT REALITY

A PROGRAMMING ENVIRONMENTS FOR SCIENTIFIC COMPUTATION AND DATA ANALYSIS
  Software Tools
  A Catalog of Scientific Software
  Writing Your Own
  Further Reading

B RESULTS FROM CALCULUS
  Common Functions
  Calculus
  Useful Tricks
  Notation and Basic Math
  Where to Go from Here
  Further Reading

C WORKING WITH DATA
  Sources for Data
  Cleaning and Conditioning
  Sampling
  Data File Formats
  The Care and Feeding of Your Data Zoo
  Skills
  Terminology
  Further Reading

INDEX

Preface

This book grew out of my experience of working with data for various companies in the tech industry. It is a collection of those concepts and techniques that I have found to be the most useful, including many topics that I wish I had known earlier—but didn’t. My degree is in physics, but I also worked as a software engineer for several years. The book reflects this dual heritage.
On the one hand, it is written for programmers and others in the software field: I assume that you, like me, have the ability to write your own programs to manipulate data in any way you want. On the other hand, the way I think about data has been shaped by my background and education. As a physicist, I am not content merely to describe data or to make black-box predictions: the purpose of an analysis is always to develop an understanding for the processes or mechanisms that give rise to the data that we observe. The instrument to express such understanding is the model: a description of the system under study (in other words, not just a description of the data!), simplified as necessary but nevertheless capturing the relevant information. A model may be crude (“Assume a spherical cow . . . ”), but if it helps us develop better insight on how the system works, it is a successful model nevertheless. (Additional precision can often be obtained at a later time, if it is really necessary.) This emphasis on models and simplified descriptions is not universal: other authors and practitioners will make different choices. But it is essential to my approach and point of view.

This is a rather personal book. Although I have tried to be reasonably comprehensive, I have selected the topics that I consider relevant and useful in practice—whether they are part of the “canon” or not. Also included are several topics that you won’t find in any other book on data analysis. Although neither new nor original, they are usually not used or discussed in this particular context—but I find them indispensable.

Throughout the book, I freely offer specific, explicit advice, opinions, and assessments. These remarks are reflections of my personal interest, experience, and understanding. I do not claim that my point of view is necessarily correct: evaluate what I say for yourself and feel free to adapt it to your needs.
In my view, a specific, well-argued position is of greater use than a sterile laundry list of possible algorithms—even if you later decide to disagree with me. The value is not in the opinion but rather in the arguments leading up to it. If your arguments are better than mine, or even just more agreeable to you, then I will have achieved my purpose!

Data analysis, as I understand it, is not a fixed set of techniques. It is a way of life, and it has a name: curiosity. There is always something else to find out and something more to learn. This book is not the last word on the matter; it is merely a snapshot in time: things I knew about and found useful today.

“Works are of value only if they give rise to better ones.”
(Alexander von Humboldt, writing to Charles Darwin, 18 September 1839)

Before We Begin

More data analysis efforts seem to go bad because of an excess of sophistication rather than a lack of it. This may come as a surprise, but it has been my experience again and again. As a consultant, I am often called in when the initial project team has already gotten stuck. Rarely (if ever) does the problem turn out to be that the team did not have the required skills. On the contrary, I usually find that they tried to do something unnecessarily complicated and are now struggling with the consequences of their own invention!

Based on what I have seen, two particular risk areas stand out:

• The use of “statistical” concepts that are only partially understood (and given the relative obscurity of most of statistics, this includes virtually all statistical concepts)
• Complicated (and expensive) black-box solutions when a simple and transparent approach would have worked at least as well or better

I strongly recommend that you make it a habit to avoid all statistical language. Keep it simple and stick to what you know for sure.
There is absolutely nothing wrong with speaking of the “range over which points spread,” because this phrase means exactly what it says: the range over which points spread, and only that! Once we start talking about “standard deviations,” this clarity is gone. Are we still talking about the observed width of the distribution? Or are we talking about one specific measure for this width? (The standard deviation is only one of several that are available.) Are we already making an implicit assumption about the nature of the distribution? (The standard deviation is only suitable under certain conditions, which are often not fulfilled in practice.) Or are we even confusing the predictions we could make if these assumptions were true with the actual data? (The moment someone talks about “95 percent anything” we know it’s the latter!)

I’d also like to remind you not to discard simple methods until they have been proven insufficient. Simple solutions are frequently rather effective: the marginal benefit that more complicated methods can deliver is often quite small (and may be in no reasonable relation to the increased cost). More importantly, simple methods have fewer opportunities to go wrong or to obscure the obvious.

True story: a company was tracking the occurrence of defects over time. Of course, the actual number of defects varied quite a bit from one day to the next, and they were looking for a way to obtain an estimate for the typical number of expected defects. The solution proposed by their IT department involved a compute cluster running a neural network! (I am not making this up.) In fact, a one-line calculation (involving a moving average or single exponential smoothing) is all that was needed.

I think the primary reason for this tendency to make data analysis projects more complicated than they are is discomfort: discomfort with an unfamiliar problem space and uncertainty about how to proceed.
This discomfort and uncertainty creates a desire to bring in the “big guns”: fancy terminology, heavy machinery, large projects. In reality, of course, the opposite is true: the complexities of the “solution” overwhelm the original problem, and nothing gets accomplished.

Data analysis does not have to be all that hard. Although there are situations when elementary methods will no longer be sufficient, they are much less prevalent than you might expect. In the vast majority of cases, curiosity and a healthy dose of common sense will serve you well.

The attitude that I am trying to convey can be summarized in a few points:

Simple is better than complex.
Cheap is better than expensive.
Explicit is better than opaque.
Purpose is more important than process.
Insight is more important than precision.
Understanding is more important than technique.
Think more, work less.

Although I do acknowledge that the items on the right are necessary at times, I will give preference to those on the left whenever possible.

It is in this spirit that I am offering the concepts and techniques that make up the rest of this book.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
  Indicates new terms, URLs, and email addresses
Constant width
  Used to refer to language and script elements

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission.
Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Analysis with Open Source Tools, by Philipp K. Janert. Copyright 2011 Philipp K. Janert, 978-0-596-80235-6.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].

Safari® Books Online

Safari Books Online is an on-demand digital library that lets you easily search over 7,500 technology and creative reference books and videos to find the answers you need quickly.

With a subscription, you can read any page and watch any video from our library online. Read books on your cell phone and mobile devices. Access new titles before they are available for print, and get exclusive access to manuscripts in development and post feedback for the authors. Copy and paste code samples, organize your favorites, download chapters, bookmark key sections, create notes, print out pages, and benefit from tons of other time-saving features.

O’Reilly Media has uploaded this book to the Safari Books Online service. To have full digital access to this book and others on similar topics from O’Reilly and other publishers, sign up for free at http://my.safaribooksonline.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information.
You can access this page at:

http://oreilly.com/catalog/9780596802356

To comment or ask technical questions about this book, send email to:

[email protected]

For more information about our books, conferences, Resource Centers, and the O’Reilly Network, see our website at:

http://oreilly.com

Acknowledgments

It was a pleasure to work with O’Reilly on this project. In particular, O’Reilly has been most accommodating with regard to the technical challenges raised by my need to include (for an O’Reilly book) an uncommonly large amount of mathematical material in the manuscript.

Mike Loukides has accompanied this project as the editor since its beginning. I have enjoyed our conversations about life, the universe, and everything, and I appreciate his comments about the manuscript—either way.

I’d like to thank several of my friends for their help in bringing this book about:

• Elizabeth Robson, for making the connection
• Austin King, for pointing out the obvious
• Scott White, for suffering my questions gladly
• Richard Kreckel, for much-needed advice

As always, special thanks go to PAUL Schrader (Bremen).

The manuscript benefited from the feedback I received from various reviewers. Michael E. Driscoll, Zachary Kessin, and Austin King read all or parts of the manuscript and provided valuable comments.

I enjoyed personal correspondence with Joseph Adler, Joe Darcy, Hilary Mason, Stephen Weston, Scott White, and Brian Zimmer. All very generously provided expert advice on specific topics.

Particular thanks go to Richard Kreckel, who provided uncommonly detailed and insightful feedback on most of the manuscript.

During the preparation of this book, the excellent collection at the University of Washington libraries was an especially valuable resource to me.