24 September 2023

Leo Breiman - Collected Quotes

"Probability theory has a right and a left hand. On the right is the rigorous foundational work using the tools of measure theory. The left hand 'thinks probabilistically', reduces problems to gambling situations, coin-tossing, motions of a physical particle." (Leo Breiman, "Probability", 1992) 

"Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems. The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"As I left consulting to go back to the university, these were the perceptions I had about working with data to find answers to problems: (a) Focus on finding a good solution–that’s what consultants get paid for. (b) Live with the data before you plunge into modelling. (c) Search for a model that gives a good solution, either algorithmic or data. (d) Predictive accuracy on test sets is the criterion for how good the model is. (e) Computers are an indispensable partner." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science Vol. 16(3), 2001)

"Data modeling has given the statistics field many successes in analyzing data and getting information about the mechanisms producing the data. But there is also misuse leading to questionable conclusions about the underlying mechanism." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"One goal of statistics is to extract information from the data about the underlying mechanism producing the data. The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"Prediction is rarely perfect. There are usually many unmeasured variables whose effect is referred to as 'noise'. But the extent to which the model box emulates nature's box is a measure of how well our model can reproduce the natural phenomenon producing the data." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"Residual analysis is similarly unreliable. In a discussion after a presentation of residual analysis in a seminar at Berkeley in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read on using residual analysis to check lack of fit are confined to data sets with two or three variables. With higher dimensions, the interactions between the variables can produce passable residual plots for a variety of models. A residual plot is a goodness-of-fit test, and lacks power in more than a few dimensions. An acceptable residual plot does not imply that the model is a good fit to the data." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science Vol. 16(3), 2001)

"The goals in statistics are to use data to predict and to get information about the underlying data mechanism. Nowhere is it written on a stone tablet what kind of model should be used to solve problems involving data. To make my position clear, I am not against data models per se. In some situations they are the most appropriate way to solve the problem. But the emphasis needs to be on the problem and on the data." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses [...] different models, all of them equally good, may give different pictures of the relation between the predictor and response variables [...] One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes–no answer. With the lack of power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes–no methods for gauging fit, of determining which is the better model." (Leo Breiman, "Statistical Modeling: The two cultures", Statistical Science 16(3), 2001)

"The point of a model is to get useful information about the relation between the response and predictor variables. Interpretability is a way of getting information. But a model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"The roots of statistics, as in science, lie in working with data and checking theory against data." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

No comments:

Post a Comment

Related Posts Plugin for WordPress, Blogger...

On Data: Longitudinal Data

  "Longitudinal data sets are comprised of repeated observations of an outcome and a set of covariates for each of many subjects. One o...