Correlation Analysis in Engineering


Summary

Correlation analysis in engineering measures the strength and direction of the relationship between variables, often to decide which factors matter most when building accurate models or making informed decisions. It uses correlation coefficients such as Pearson's r, Spearman's rho, and Kendall's tau to quantify how two sets of data move together, helping engineers interpret patterns and trends in everything from quality control to system design.

  • Select the right method: Choose a correlation approach based on your data’s characteristics, such as linearity, presence of outliers, or whether values are ranked or categorized.
  • Visualize relationships: Use scatter plots or other charts to spot patterns, outliers, and understand the type of connection between variables before running calculations.
  • Question your results: Always consider whether correlation truly reflects cause, and combine statistical findings with real-world understanding to avoid being misled by numbers.
Summarized by AI based on LinkedIn member posts
  • 🔍 Correlation coefficients are powerful tools for understanding relationships between variables, but with so many options available (Pearson, Spearman, Kendall, and more), how do you know which one is right for your data? Choosing the wrong coefficient can distort your analysis, so here is how to pick one for your specific needs.

    💡 Understand the Types of Correlation Coefficients

    Before deciding, it helps to understand the most common correlation measures.

    Pearson Correlation Coefficient
    🔹 Measures the linear relationship between two continuous variables.
    🔹 Assumes:
      🔹 The variables are approximately normally distributed (this matters mainly for significance tests and confidence intervals).
      🔹 The relationship is linear.
      🔹 There are no significant outliers.
    🔹 Best for: straightforward linear relationships, like height vs. weight or temperature vs. ice cream sales.

    Spearman Rank Correlation
    🔹 Measures monotonic relationships (whether linear or not) based on ranks rather than raw values.
    🔹 Does not assume normality or linearity.
    🔹 Best for: non-linear but consistently increasing or decreasing trends, such as survey rankings or skewed data.

    Kendall's Tau
    🔹 Measures the strength of association by counting concordant and discordant pairs.
    🔹 Generally less sensitive to outliers than Pearson, and often somewhat less than Spearman.
    🔹 Best for: small datasets, ordinal data, or when robustness to outliers is critical.

    Other Options
    🔹 Point-Biserial Correlation: for one continuous variable and one binary variable (e.g., gender vs. income).
    🔹 Phi Coefficient: for two binary variables (e.g., pass/fail vs. attended/didn't attend training).

    ❓ Ask the Right Questions About Your Data

    Choosing a correlation coefficient starts with understanding your data and the goals of your analysis. Ask yourself:

    Are the Variables Continuous, Ordinal, or Binary?
    🔹 If both variables are continuous, Pearson or Spearman may work.
    🔹 If one or both variables are ordinal, consider Spearman or Kendall.
    🔹 If one or both variables are binary, consider the point-biserial or phi coefficient.

    Is the Relationship Linear or Non-Linear?
    🔹 Use Pearson if the relationship is linear.
    🔹 Use Spearman if the relationship is monotonic but not necessarily linear.

    Is the Data Normally Distributed?
    🔹 Inference with Pearson (p-values, confidence intervals) assumes approximate normality. If your data is far from normal, Spearman or Kendall might be better choices.

    Are There Outliers?
    🔹 Pearson is sensitive to outliers; Spearman and Kendall are more robust.
    🔹 If outliers are present, consider using Spearman or Kendall.

    What's the Sample Size?
    🔹 Kendall's tau is often preferred for small samples or data with many ties; any coefficient estimated from only a handful of points is unstable, so treat small-sample results with caution.

    📈 Visualize Your Data

    A scatter plot is your best friend when choosing a correlation coefficient. It helps you:
    🔹 Identify whether the relationship is linear, monotonic, or something else.
    🔹 Spot outliers that might influence the correlation.
    🔹 Determine if transformations (e.g., log scaling) are needed before calculating the coefficient.
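To make the linear-versus-monotonic distinction concrete, here is a minimal sketch that computes the three main coefficients from their definitions on a small made-up dataset. The function names (`pearson`, `average_ranks`, `spearman`, `kendall_tau_a`) and the data are illustrative assumptions, not part of the post; in practice a library routine such as `scipy.stats.pearsonr`, `spearmanr`, or `kendalltau` would normally be used. The Kendall version shown is tau-a, which applies no tie correction.

```python
import math

def pearson(x, y):
    """Pearson r: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho: Pearson r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / (n * (n - 1) / 2)

# Illustrative data: y rises with x at every step, but far from linearly.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(v) for v in x]

print(f"Pearson  r   = {pearson(x, y):.3f}")
print(f"Spearman rho = {spearman(x, y):.3f}")
print(f"Kendall  tau = {kendall_tau_a(x, y):.3f}")
```

Because the relationship is perfectly monotonic, the two rank-based coefficients come out at 1, while Pearson is noticeably lower since the points do not lie on a straight line.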

  • Puneet Khandelwal

    JPMC | Quant Modelling Analyst | IIT KGP | CFA L1 | Masters in Financial Engineering

    🥊 The Correlation Battle 2: Pearson, Spearman, Kendall, Bicor, Distance

    "Is Pearson lying to you? Do we need other types of correlation because of outliers?"

    🧪 Let's learn from a simulation

    I plotted:
    📍 X-axis: Experience (Years)
    📍 Y-axis: Income ($)

    Then I gradually added an outlier, first increasing its distance from the rest of the data and then changing its direction. Here's what I found 👇

    🤯 Results? Let's just say: not all correlation metrics are created equal...

    1. Pearson
    🚨 Skyrocketed when the outlier followed the trend
    📉 Crashed hard when the outlier flipped direction
    ➡️ Most sensitive to outliers

    2. Spearman
    💪 Stayed steady as long as the rank order wasn't destroyed
    ⚠️ Dropped only when the outlier became extreme
    ➡️ More robust, but not invincible

    3. Kendall Tau
    🧱 Even more resistant than Spearman
    ✅ Only dropped under very aggressive distortion
    ➡️ Solid for ordinal or ranked data

    4. Biweight Midcorrelation (Bicor)
    🛡️ Practically ignored the outlier
    🧮 Downweighted its impact
    ➡️ Most robust of the five in this simulation

    5. Distance Correlation
    🔄 Captured complex, non-linear effects
    📉 Dropped as the dependency faded, not just because linearity was lost
    ➡️ Smart but moderately outlier-sensitive

    What it tells us:
    • One extreme point can artificially inflate or destroy your correlation.
    • Just because Pearson is easy and standard doesn't mean it's always right.

    💡 My takeaway:
    • Use Pearson for clean, linear data, or regret it later
    • Use Spearman or Kendall for monotonic trends or ranked data
    • Use Biweight or Distance when facing outliers, nonlinearities, or real-world noise

    🤔 Open Questions for You:
    • Which correlation metric do you trust when your data is noisy or complex?
    • Have you tried robust correlation methods in production models?
    • In which direction does an outlier hurt correlation the most? 📉 Left? Bottom? Diagonal?

    📌 TL;DR: Don't just trust the numbers; question them. Your model's insights are only as good as the assumptions behind your stats.
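A rough re-creation of this kind of experiment can be sketched as follows. It is not the author's simulation: the experience and income figures are invented, the single extra point (1.5 years of experience with a growing income) is just one possible outlier path, and all function names are hypothetical. The biweight midcorrelation follows the usual definition (median-centred values, Tukey biweight weights with a cutoff of 9 times the median absolute deviation). Kendall and distance correlation are left out to keep the sketch short.

```python
import math
from statistics import median

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; assumes no ties, which holds for the data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for position, index in enumerate(order, start=1):
        out[index] = position
    return out

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def biweight_terms(values):
    """Median-centred values times biweight weights, scaled to unit length."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # must be non-zero
    terms = []
    for v in values:
        u = (v - med) / (9 * mad)
        weight = (1 - u * u) ** 2 if abs(u) < 1 else 0.0  # far points get 0
        terms.append((v - med) * weight)
    norm = math.sqrt(sum(t * t for t in terms))
    return [t / norm for t in terms]

def bicor(x, y):
    """Biweight midcorrelation of x and y."""
    return sum(a * b for a, b in zip(biweight_terms(x), biweight_terms(y)))

# Invented, roughly linear base data (not taken from the post).
experience = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
income = [36_500, 38_000, 46_000, 49_500, 57_500,
          58_500, 65_500, 69_000, 77_000, 77_500]

results = []
for outlier_income in (37_500, 100_000, 300_000, 1_000_000):
    x = experience + [1.5]          # one junior employee...
    y = income + [outlier_income]   # ...with an increasingly unusual income
    row = (outlier_income, pearson(x, y), spearman(x, y), bicor(x, y))
    results.append(row)
    print(f"outlier income {row[0]:>9,}: "
          f"pearson={row[1]:+.2f} spearman={row[2]:+.2f} bicor={row[3]:+.2f}")
```

In this setup Pearson falls steadily and eventually changes sign, Spearman stops moving once the outlier has the largest income because its rank can no longer change, and the biweight midcorrelation recovers once the point is far enough from the median to receive zero weight.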

  • Danny Butvinik

    Chief Data Scientist | 100K+ Followers | FinCrime AI | Author | Inventor (10 Patents) |

    Pearson correlation and Spearman correlation are two commonly used methods to measure the strength and direction of the relationship between two variables. Both produce a coefficient between -1 and +1, but they differ in how that coefficient is calculated and interpreted.

    Pearson correlation is a measure of linear correlation between two continuous variables. A coefficient of +1 indicates a perfect positive linear relationship, a coefficient of -1 indicates a perfect negative linear relationship, and a coefficient of 0 indicates no linear relationship (a non-linear relationship may still exist).

    Pearson Correlation Characteristics
    ● assumes that the relationship between the two variables is linear; significance tests additionally assume that the data is approximately normally distributed
    ● sensitive to outliers and can be affected by extreme values in the data
    ● unchanged by linear changes of units (for example, metres to millimetres), but changed by non-linear transformations such as taking logarithms

    Spearman correlation is a non-parametric measure of correlation between two variables. It measures the strength and direction of the monotonic relationship between two variables on a scale ranging from -1 to +1, just like Pearson correlation.

    Spearman Correlation Characteristics
    ● does not assume that the relationship between the two variables is linear, nor does it assume that the data is normally distributed
    ● based on the ranks of the data rather than the actual values of the data
    ● less sensitive to outliers and can be used with both continuous and ordinal data
    ● unchanged by any strictly increasing transformation of either variable, because such transformations preserve ranks

    Correlation in Machine Learning

    Correlation analysis plays an important role in feature selection, which is the process of identifying the most relevant variables or features that contribute to a predictive model's accuracy. By examining the correlations between the features and the target variable, we can determine which features are most strongly related to the target and are candidates for inclusion in the model.

    In feature selection, Pearson correlation and Spearman correlation are commonly used to measure the relationship between each feature and the target variable. Features whose correlation with the target is large in absolute value are generally considered more important and may be selected for inclusion in the model; a strong negative correlation is just as informative as a strong positive one. Features whose correlation is close to zero may be removed, although a near-zero coefficient only rules out a linear (Pearson) or monotonic (Spearman) relationship.

    It's important to note that correlation does not imply causation. Just because two variables are strongly correlated does not necessarily mean that one causes the other. Domain knowledge and causal analysis should also be considered when selecting features for a predictive model.
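As a small illustration of correlation-based feature screening, the sketch below ranks three features by the absolute value of their Pearson correlation with a target. The feature names (`load_kn`, `clearance_mm`, `ambient_temp_c`), the target values, and the 0.5 cutoff are all invented for this example. It also shows why the sign should be ignored when filtering: a cutoff applied to the signed coefficient discards a strongly negatively correlated feature.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical target and candidate features (eight observations each).
target = [2.0, 2.9, 4.1, 5.0, 6.2, 6.8, 8.1, 9.0]
features = {
    "load_kn":        [1, 2, 3, 4, 5, 6, 7, 8],                   # rises with target
    "clearance_mm":   [8.1, 7.0, 6.2, 4.9, 4.1, 2.8, 2.2, 1.0],   # falls as target rises
    "ambient_temp_c": [20, 23, 19, 22, 21, 20, 23, 19],           # no clear pattern
}

correlations = {name: pearson(values, target) for name, values in features.items()}

# Rank by strength of association, ignoring direction.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)

# A cutoff on the signed value wrongly drops the negative feature;
# a cutoff on the absolute value keeps it.
THRESHOLD = 0.5  # arbitrary illustrative cutoff
naive_keep = [name for name, r in correlations.items() if r > THRESHOLD]
abs_keep = [name for name, r in correlations.items() if abs(r) > THRESHOLD]

for name in ranked:
    print(f"{name:15s} r = {correlations[name]:+.3f}")
print("signed cutoff keeps:  ", naive_keep)
print("absolute cutoff keeps:", abs_keep)
```

A screen like this looks at each feature in isolation, so it says nothing about redundancy between features or about non-linear effects; it is a first filter, not a substitute for the domain and causal checks mentioned above.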
