01 November 2023

On Correlation (2010-)

"Given the important role that correlation plays in structural equation modeling, we need to understand the factors that affect establishing relationships among multivariable data points. The key factors are the level of measurement, restriction of range in data values (variability, skewness, kurtosis), missing data, nonlinearity, outliers, correction for attenuation, and issues related to sampling variation, confidence intervals, effect size, significance, sample size, and power." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Need to consider outliers as they can affect statistics such as means, standard deviations, and correlations. They can either be explained, deleted, or accommodated (using either robust statistics or obtaining additional data to fill-in). Can be detected by methods such as box plots, scatterplots, histograms or frequency distributions." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Outliers or influential data points can be defined as data values that are extreme or atypical on either the independent (X variables) or dependent (Y variables) variables or both. Outliers can occur as a result of observation errors, data entry errors, instrument errors based on layout or instructions, or actual extreme values from self-report data. Because outliers affect the mean, the standard deviation, and correlation coefficient values, they must be explained, deleted, or accommodated by using robust statistics." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"[...] if you want to show change through time, use a time-series chart; if you need to compare, use a bar chart; or to display correlation, use a scatter-plot - because some of these rules make good common sense." (Alberto Cairo, "The Functional Art", 2011)

"Economists should study financial markets as they actually operate, not as they assume them to operate - observing the way in which information is actually processed, observing the serial correlations, bonanzas, and sudden stops, not assuming these away as noise around the edges of efficient and rational markets." (Adair Turner, "Economics after the Crisis: Objectives and means", 2012)

"If the distance from the mean for one variable tends to be broadly consistent with distance from the mean for the other variable (e.g., people who are far from the mean for height in either direction tend also to be far from the mean in the same direction for weight), then we would expect a strong positive correlation. If distance from the mean for one variable tends to correspond to a similar distance from the mean for the second variable in the other direction (e.g., people who are far above the mean in terms of exercise tend to be far below the mean in terms of weight), then we would expect a strong negative correlation. If two variables do not tend to deviate from the mean in any meaningful pattern (e.g., shoe size and exercise) then we would expect little or no correlation." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"The correlation coefficient has two fabulously attractive characteristics. First, for math reasons that have been relegated to the appendix, it is a single number ranging from –1 to 1. A correlation of 1, often described as perfect correlation, means that every change in one variable is associated with an equivalent change in the other variable in the same direction. A correlation of –1, or perfect negative correlation, means that every change in one variable is associated with an equivalent change in the other variable in the opposite direction. The closer the correlation is to 1 or –1, the stronger the association. […] The second attractive feature of the correlation coefficient is that it has no units attached to it. […] The correlation coefficient does a seemingly miraculous thing: It collapses a complex mess of data measured in different units (like our scatter plots of height and weight) into a single, elegant descriptive statistic." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Without precise predictability, control is impotent and almost meaningless. In other words, the lesser the predictability, the harder the entity or system is to control, and vice versa. If our universe actually operated on linear causality, with no surprises, uncertainty, or abrupt changes, all future events would be absolutely predictable in a sort of waveless orderliness." (Lawrence K Samuels, "Defense of Chaos", 2013)

"The problem of complexity is at the heart of mankind’s inability to predict future events with any accuracy. Complexity science has demonstrated that the more factors found within a complex system, the more chances of unpredictable behavior. And without predictability, any meaningful control is nearly impossible. Obviously, this means that you cannot control what you cannot predict. The ability ever to predict long-term events is a pipedream. Mankind has little to do with changing climate; complexity does." (Lawrence K Samuels, "The Real Science Behind Changing Climate", LewRockwell.com, August 1, 2014)

"An event occurring at one node will cause a cascade of events: often this cascade or avalanche propagates to affect only one or two further elements, occasionally it affects more, and more rarely it affects many. The mathematical theory of this - which is very much part of complexity theory - shows that propagations of events causing further events show characteristic properties such as power laws (caused by many and frequent small propagations, few and infrequent large ones), heavy tailed probability distributions (lengthy propagations though rare appear more frequently than normal distributions would predict), and long correlations (events can and do propagate for long distances and times)." (W Brian Arthur, "Complexity and the Economy", 2015) 

"The correlational technique known as multiple regression is used frequently in medical and social science research. This technique essentially correlates many independent (or predictor) variables simultaneously with a given dependent variable (outcome or output). It asks, 'Net of the effects of all the other variables, what is the effect of variable A on the dependent variable?' Despite its popularity, the technique is inherently weak and often yields misleading results. The problem is due to self-selection. If we don’t assign cases to a particular treatment, the cases may differ in any number of ways that could be causing them to differ along some dimension related to the dependent variable. We can know that the answer given by a multiple regression analysis is wrong because randomized control experiments, frequently referred to as the gold standard of research techniques, may give answers that are quite different from those obtained by multiple regression analysis." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The theory behind multiple regression analysis is that if you control for everything that is related to the independent variable and the dependent variable by pulling their correlations out of the mix, you can get at the true causal relation between the predictor variable and the outcome variable. That’s the theory. In practice, many things prevent this ideal case from being the norm." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"We are superb causal-hypothesis generators. Given an effect, we are rarely at a loss for an explanation. Seeing a difference in observations over time, we readily come up with a causal interpretation. Much of the time, no causality at all is going on—just random variation. The compulsion to explain is particularly strong when we habitually see that one event typically occurs in conjunction with another event. Seeing such a correlation almost automatically provokes a causal explanation. It’s tremendously useful to be on our toes looking for causal relationships that explain our world. But there are two problems: (1) The explanations come too easily. If we recognized how facile our causal hypotheses were, we’d place less confidence in them. (2) Much of the time, no causal interpretation at all is appropriate and wouldn’t even be made if we had a better understanding of randomness." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"We don’t recognize how easy it is to generate hypotheses about the world. If we did, we’d generate fewer of them, or at least hold them more tentatively. We sprout causal theories in abundance when we learn of a correlation, and we readily find causal explanations for the failure of the world to confirm our hypotheses. We don’t realize how easy it is for us to explain away evidence that would seem on the surface to contradict our hypotheses. And we fail to generate tests of a hypothesis that could falsify the hypothesis if in fact the hypothesis is wrong. This is one type of confirmation bias." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"A correlation is simply a bivariate relationship - a fancy way of saying that there is a relationship between two ('bi') variables ('variate'). And a bivariate relationship doesn’t prove that one thing caused the other. Think of it this way: you can observe that two things appear to be related statistically, but that doesn’t tell you the answer to any of the questions you might really care about - why is there a relationship and what does it mean to us as a consumer of data?" (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Confirmation bias can affect nearly every aspect of the way you look at data, from sampling and observation to forecasting - so it’s something  to keep in mind anytime you’re interpreting data. When it comes to correlation versus causation, confirmation bias is one reason that some people ignore omitted variables - because they’re making the jump from correlation to causation based on preconceptions, not the actual evidence." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"In the real world, statistical issues rarely exist in isolation. You’re going to come across cases where there’s more than one problem with the data. For example, just because you identify some sampling errors doesn’t mean there aren’t also issues with cherry picking and correlations and averages and forecasts - or simply more sampling issues, for that matter. Some cases may have no statistical issues, some may have dozens. But you need to keep your eyes open in order to spot them all." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Parameter estimation is a basic aspect of model construction and historically it has been assumed that data are sufficient to estimate the parameters, for instance, correlations that are part of the model; however, when the number of parameters is too large for the amount of data, accurate parameter estimation becomes impossible. The result is model uncertainty." (Edward R Dougherty, "The Evolution of Scientific Knowledge: From certainty to uncertainty", 2016)

"Correlation is not equivalent to cause for one major reason. Correlation is well defined in terms of a mathematical formula. Cause is not well defined." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"How do you know when a correlation indicates causation? One way is to conduct a controlled experiment. Another is to apply logic. But be careful - it’s easy to get bogged down in semantics." (Daniel J Levitin, "Weaponized Lies", 2017)

"The degree to which one variable can be predicted from another can be calculated as the correlation between them. The square of the correlation (R^2) is the proportion of the variance of one that can be 'explained' by knowledge of the other." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Using noise (the uncorrelated variables) to fit noise (the residual left from a simple model on the genuinely correlated variables) is asking for trouble." (Steven S Skiena, "The Data Science Design Manual", 2017)

"It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient [...]. A Pearson correlation runs between −1 and 1, and expresses how close to a straight line the dots or data-points fall. A correlation of 1 occurs if all the points lie on a straight line going upwards, while a correlation of −1 occurs if all the points lie on a straight line going downwards. A correlation near 0 can come from a random scatter of points, or any other pattern in which there is no systematic trend upwards or downwards [...]." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Network theory confirms the view that information can take on 'a life of its own'. In the yeast network my colleagues found that 40 per cent of node pairs that are correlated via information transfer are not in fact physically connected; there is no direct chemical interaction. Conversely, about 35 per cent of node pairs transfer no information between them even though they are causally connected via a 'chemical wire' (edge). Patterns of information traversing the system may appear to be flowing down the 'wires' (along the edges of the graph) even when they are not. For some reason, 'correlation without causation' seems to be amplified in the biological case relative to random networks." (Paul Davies, "The Demon in the Machine: How Hidden Webs of Information Are Solving the Mystery of Life", 2019)

"Another problem is that while data visualizations may appear to be objective, the designer has a great deal of control over the message a graphic conveys. Even using accurate data, a designer can manipulate how those data make us feel. She can create the illusion of a correlation where none exists, or make a small difference between groups look big." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Correlation doesn't imply causation - but apparently it doesn't sell newspapers either."(Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Correlation quantifies the relationship between features. The purpose of correlation analysis is to understand the dependencies between features, so that observed effects can be explained or desired effects can be achieved." (Thomas A Runkler, "Data Analytics: Models and Algorithms for Intelligent Data Analysis" 3rd Ed., 2020)
