Essential Statistical Tests For Statistical Significance in Machine Learning
November 18, 2024

Must-know statistical tests for testing the statistical significance of features in a Regression Model: One-sided & Two-sided T-test, F-test, 2-Sample Z-test, 2-Sample T-test, Chi-Squared test

Image Source: Anna Nekrashevich

“Facts are stubborn things, but statistics are pliable.”
― Mark Twain

Trying to determine whether a variable or a set of variables has a real impact on another, dependent, variable is one of the most common reasons for using regression models such as Linear Regression.

In Regression Analysis, hypothesis testing and statistical significance are central, so you must know the two most popular statistical tests: the T-test and the F-test.

In this article, I will cover the topics you should know when using regression-type models for detecting statistical significance.

Note that it’s part of a more extended guide to the fundamentals of Statistics that every Data Scientist should know. If you have no prior statistical knowledge, you can simply skip the statistical derivations and formulas. However, if you want to learn or refresh your knowledge of the essential statistical concepts, you can check this article: Fundamentals of statistics for Data Scientists and Data Analysts

[Fundamentals Of Statistics For Data Scientists and Data Analysts
Key statistical concepts for your data science or data analytics journeytowardsdatascience.com](https://towardsdatascience.com/fundamentals-of-statistics-for-data-scientists-and-data-analysts-69d93a05aae7 "towardsdatascience.com/fundamentals-of-stat..")

Image Source: Melissa Thomas

Statistical Hypothesis testing

Testing a hypothesis in Statistics is a way to test the results of an experiment or survey to determine how meaningful the results are. Basically, one is testing whether the obtained results are valid by figuring out the odds that they have occurred by chance. If it is the latter, then the results are not reliable, and neither is the experiment. Hypothesis Testing is part of Statistical Inference.

Null and Alternative Hypothesis

Firstly, you need to determine the thesis you wish to test; then you need to formulate the Null Hypothesis and the Alternative Hypothesis. The test can have two possible outcomes: based on the statistical results, you either reject the Null Hypothesis or fail to reject it. As a rule of thumb, statisticians put the version or formulation of the hypothesis that needs to be rejected under the Null Hypothesis, whereas the acceptable and desired version is stated under the Alternative Hypothesis.

Statistical significance

Let’s look at an example where a Linear Regression model is used to investigate whether a penguin’s Flipper Length, the independent variable, has an impact on its Body Mass, the dependent variable. We can formulate this model with the following statistical expression:
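
$$\text{Body Mass} = \beta_0 + \beta_1 \cdot \text{Flipper Length} + \epsilon$$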

Then, once the coefficients are estimated via OLS (Ordinary Least Squares), we can formulate the following Null and Alternative Hypotheses to test whether the Flipper Length has a statistically significant impact on the Body Mass:
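
$$H_0: \text{Flipper Length has no impact on Body Mass}$$
$$H_1: \text{Flipper Length has an impact on Body Mass}$$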

where H0 and H1 represent the Null Hypothesis and Alternative Hypothesis, respectively. Rejecting the Null Hypothesis would mean that a one-unit increase in Flipper Length has a direct impact on Body Mass, given that the parameter estimate of β1 describes this impact of the independent variable, Flipper Length, on the dependent variable, Body Mass. This hypothesis can be reformulated as follows:
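
$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$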

where H0 states that the parameter estimate of β1 is equal to 0, that is, the Flipper Length effect on Body Mass is statistically insignificant, whereas H1 states that the parameter estimate of β1 is not equal to 0, suggesting that the Flipper Length effect on Body Mass is statistically significant.

Image Source: Josh Hild

Statistical Tests

Once the Null and the Alternative Hypotheses are stated and the test assumptions are defined, the next step is to determine which statistical test is appropriate and to calculate the test statistic. Whether to reject the Null can be determined by comparing the test statistic with the critical value. This comparison shows whether the observed test statistic is more extreme than the defined critical value, and it can have two possible results:

  • The test statistic is more extreme than the critical value → the null hypothesis can be rejected
  • The test statistic is not as extreme as the critical value → the null hypothesis cannot be rejected

The critical value is based on a prespecified significance level α (usually chosen to be equal to 5%) and the type of probability distribution the test statistic follows. The critical value divides the area under this probability distribution curve into the rejection region(s) and the non-rejection region. There are numerous statistical tests used to test various hypotheses. Examples of statistical tests are the Student’s t-test, F-test, Chi-squared test, Durbin-Wu-Hausman Endogeneity test, and White Heteroskedasticity test. In this article, we will look at several of these statistical tests.

Student’s t-test

One of the simplest and most popular statistical tests is the Student’s t-test, which can be used for testing various hypotheses, especially when dealing with a hypothesis where the main area of interest is to find evidence for the statistically significant effect of a single variable. The test statistic of the t-test follows [Student’s t distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution) and can be determined as follows:
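
$$T_{stat} = \frac{\hat{\beta} - h_0}{SE(\hat{\beta})}$$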

where h0 in the numerator is the value against which the parameter estimate is tested. So, the t-test statistic is equal to the parameter estimate minus the hypothesized value, divided by the standard error of the coefficient estimate. Recall the earlier stated hypothesis, where we wanted to test whether Flipper Length has a statistically significant impact on Body Mass. This test can be performed using a t-test, in which case h0 is equal to 0, since the slope coefficient estimate is tested against the value 0.

There are two versions of the t-test: a two-sided t-test and a one-sided t-test. Whether you need the former or the latter version of the test depends entirely on the hypothesis that you want to test.

Image Source: Michael Block

Two-sided t-test

The two-sided or two-tailed t-test can be used when the hypothesis is testing an equal versus not equal relationship under the Null and Alternative Hypotheses, similar to the following example:
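
$$H_0: \beta = h_0 \qquad H_1: \beta \neq h_0$$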

The two-sided t-test has two rejection regions as visualized in the figure below:

Image Source: Hartmann, K., Krois, J., Waske, B. (2018): E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin

In this version of the t-test, the Null is rejected if the calculated t-statistic is either too small or too large.

Here, the test statistic is compared to the critical values based on the sample size and the chosen significance level. To determine the exact value of the cutoff point, the two-sided t-distribution table can be used.

One-sided t-test

The one-sided or one-tailed t-test can be used when the hypothesis is testing a directional (greater-than or less-than) relationship under the Null and Alternative Hypotheses, similar to the following examples:
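
$$H_0: \beta \geq h_0 \qquad H_1: \beta < h_0$$
$$H_0: \beta \leq h_0 \qquad H_1: \beta > h_0$$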

The one-sided t-test has a single rejection region and depending on the hypothesis side the rejection region is either on the left-hand side or the right-hand side as visualized in the figure below:

Image Source: Hartmann, K., Krois, J., Waske, B. (2018): E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin

In this version of the t-test, the Null is rejected if the calculated t-statistic is smaller than the critical value (left-tailed test) or larger than the critical value (right-tailed test).

Image Source: Mark Neal


F-test For Joint Statistical Significance

The F-test is another very popular statistical test, often used to test the joint statistical significance of multiple variables. This is the case when you want to test whether multiple independent variables have a statistically significant impact on a dependent variable. Following is an example of a statistical hypothesis that can be tested using the F-test:
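
$$H_0: \beta_1 = \beta_2 = \beta_3 = 0$$
$$H_1: \text{at least one } \beta_j \neq 0, \quad j \in \{1, 2, 3\}$$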

where the Null states that the three variables corresponding to these coefficients are jointly statistically insignificant and the Alternative states that these three variables are jointly statistically significant. The test statistic of the F-test follows the F-distribution and can be determined as follows:
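
$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})\,/\,q}{SSR_{unrestricted}\,/\,(N - k - 1)}$$

(the −1 in the denominator’s degrees of freedom accounts for the intercept)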

where SSR_restricted is the sum of squared residuals of the restricted model, which is the model excluding the target variables stated as insignificant under the Null; SSR_unrestricted is the sum of squared residuals of the unrestricted model, which is the model that includes all variables; q represents the number of variables that are being jointly tested for insignificance under the Null; N is the sample size; and k is the total number of variables in the unrestricted model. SSR values are provided next to the parameter estimates after running the OLS regression, and the same holds for the F-statistic as well. Following is an example of an MLR (Multiple Linear Regression) model output where the SSR and F-statistic values are marked.

Image Source: Stock and Watson

The F-test has a single rejection region, as visualized below:

Image Source: U of Michigan

If the calculated F-statistic is larger than the critical value, then the Null can be rejected, which suggests that the independent variables are jointly statistically significant. The rejection rule can be expressed as follows:
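
$$\text{Reject } H_0 \quad \text{if} \quad F_{stat} > F_{crit}$$

where F_crit is the critical value of the F-distribution with q and N − k − 1 degrees of freedom at the chosen significance level.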

Image Source: Mario Cuadros

2-sample T-test

If you want to test whether there is a statistically significant difference between the Control and Experimental groups’ metrics that are in the form of averages (e.g. average purchase amount), when the metric follows the Student-t distribution and the sample size is smaller than 30, you can use the 2-sample T-test to test the following hypothesis:
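
$$H_0: \mu_{con} = \mu_{exp} \qquad H_1: \mu_{con} \neq \mu_{exp}$$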

where the sampling distribution of the means of the Control group follows the Student-t distribution with N_con − 1 degrees of freedom, and the sampling distribution of the means of the Experimental group follows the Student-t distribution with N_exp − 1 degrees of freedom. Note that N_con and N_exp are the numbers of users in the Control and Experimental groups, respectively.

Then an estimate for the pooled variance of the two samples can be calculated as follows:
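
$$S^2_{pooled} = \frac{\sigma^2_{con}}{N_{con}} + \frac{\sigma^2_{exp}}{N_{exp}}$$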

where σ²_con and σ²_exp are the sample variances of the Control and Experimental groups, respectively. Then the Standard Error is equal to the square root of the estimate of the pooled variance and can be defined as:
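
$$SE = \sqrt{\frac{\sigma^2_{con}}{N_{con}} + \frac{\sigma^2_{exp}}{N_{exp}}}$$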

Consequently, the test statistic of the 2-sample T-test for the hypothesis stated earlier can be calculated as follows:
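
$$T_{stat} = \frac{\bar{X}_{con} - \bar{X}_{exp}}{SE}$$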

In order to test the statistical significance of the observed difference between sample means, we need to calculate the p-value of our test statistic: the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the null hypothesis is true. Then the p-value of the test statistic can be calculated as follows:
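
$$p = 2 \cdot \Pr\big(T \geq |T_{stat}| \mid H_0\big)$$

where T follows the Student-t distribution with N_con + N_exp − 2 degrees of freedom.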

The interpretation of a p-value depends on the chosen significance level, alpha, which is chosen before running the test, during the power analysis. If the calculated p-value is smaller than or equal to alpha (e.g. 0.05 for a 5% significance level), we can reject the null hypothesis and state that there is a statistically significant difference between the primary metrics of the Control and Experimental groups.

Finally, to determine how accurate the obtained results are, and to comment on their practical significance, you can compute the Confidence Interval of your test by using the following formula:
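
$$CI = (\bar{X}_{con} - \bar{X}_{exp}) \pm t_{1-\alpha/2} \cdot SE$$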

where the t_(1-alpha/2) is the critical value of the test corresponding to the two-sided t-test with alpha significance level and can be found using the t-table.
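
To make this concrete, here is a minimal Python sketch of the 2-sample T-test using scipy — the purchase amounts and the 5% significance level are illustrative assumptions, not data from the article:

```python
import numpy as np
from scipy import stats

# Hypothetical average-purchase-amount samples (n < 30 per group)
control = np.array([12.1, 9.8, 11.4, 10.9, 12.7, 10.2, 11.8, 9.5])
experimental = np.array([13.4, 12.9, 11.7, 14.1, 12.2, 13.8, 12.5, 13.1])

# Two-sided 2-sample T-test; equal_var=False uses the per-group
# variances, matching the Standard Error defined above
t_stat, p_value = stats.ttest_ind(control, experimental, equal_var=False)

alpha = 0.05  # significance level chosen before running the test
print(f"T-statistic = {t_stat:.3f}, p-value = {p_value:.4f}")
if p_value <= alpha:
    print("Reject H0: the difference in means is statistically significant")
else:
    print("Fail to reject H0")
```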

2-sample Z-test

If you want to test whether there is a statistically significant difference between the Control and Experimental groups’ metrics that are in the form of averages (e.g. average purchase amount) or proportions (e.g. Click Through Rate), you can use the 2-sample Z-test when the metric follows a Normal distribution, or when the sample size is larger than 30 so that the Central Limit Theorem (CLT) can be used to state that the sampling distributions of the Control and Experimental groups are asymptotically Normal. Here we will make a distinction between two cases: where the primary metric is in the form of proportions (e.g. Click Through Rate) and where the primary metric is in the form of averages (e.g. average purchase amount).

Case 1: Z-test for comparing proportions (2-sided)

If you want to test whether there is a statistically significant difference between the Control and Experimental groups’ metrics that are in the form of proportions (e.g. CTR) and if the click event occurs independently, you can use a 2-sample Z-test to test the following hypothesis:
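
$$H_0: p_{con} = p_{exp} \qquad H_1: p_{con} \neq p_{exp}$$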

where each click event can be described by a random variable that takes two possible values, 1 (success) and 0 (failure), and follows a Bernoulli distribution (click: success, no click: failure), where p_con and p_exp are the probabilities of clicking (probability of success) in the Control and Experimental groups, respectively. That is:
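
$$\text{click}_{con} \sim \text{Bernoulli}(p_{con}) \qquad \text{click}_{exp} \sim \text{Bernoulli}(p_{exp})$$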

Hence, after collecting the interaction data of the Control and Experimental users, you can calculate the estimates of these two probabilities as follows:
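
$$\hat{p}_{con} = \frac{X_{con}}{N_{con}} \qquad \hat{p}_{exp} = \frac{X_{exp}}{N_{exp}}$$

where X_con and X_exp are the observed numbers of clicks in the Control and Experimental groups, respectively.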

Since we are testing for the difference in these probabilities, we need to obtain an estimate for the pooled probability of success and an estimate for pooled variance, which can be done as follows:
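
$$\hat{p}_{pooled} = \frac{X_{con} + X_{exp}}{N_{con} + N_{exp}}$$
$$S^2_{pooled} = \hat{p}_{pooled}\,(1 - \hat{p}_{pooled})\left(\frac{1}{N_{con}} + \frac{1}{N_{exp}}\right)$$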

Then the Standard Error is equal to the square root of the estimate of the pooled variance and can be defined as:
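
$$SE = \sqrt{\hat{p}_{pooled}\,(1 - \hat{p}_{pooled})\left(\frac{1}{N_{con}} + \frac{1}{N_{exp}}\right)}$$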

Consequently, the test statistic of the 2-sample Z-test for the difference in proportions can be calculated as follows:
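
$$Z_{stat} = \frac{\hat{p}_{con} - \hat{p}_{exp}}{SE}$$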

Then the p-value of this test statistic can be calculated as follows:
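
$$p = 2\,\big(1 - \Phi(|Z_{stat}|)\big)$$

where Φ is the CDF of the standard Normal distribution.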

Finally, you can compute the Confidence Interval of the test as follows:
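
$$CI = (\hat{p}_{con} - \hat{p}_{exp}) \pm z_{1-\alpha/2} \cdot SE$$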

where the z_(1-alpha/2) is the critical value of the test corresponding to the two-sided Z-test with alpha significance level and can be found using the Z-table. The rejection region of this two-sided 2-sample Z-test can be visualized by the following graph.

Image Source: The Author
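
Here is a minimal Python sketch of this 2-sample Z-test for proportions, mirroring the formulas above step by step — the click counts and group sizes are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical click data: number of clicks and users per group
x_con, n_con = 520, 10_000   # Control
x_exp, n_exp = 590, 10_000   # Experimental

# Estimated click probabilities per group
p_con_hat = x_con / n_con
p_exp_hat = x_exp / n_exp

# Pooled probability of success and pooled variance under H0
p_pooled = (x_con + x_exp) / (n_con + n_exp)
pooled_var = p_pooled * (1 - p_pooled) * (1 / n_con + 1 / n_exp)
se = np.sqrt(pooled_var)

# Test statistic, two-sided p-value, and 95% confidence interval
z_stat = (p_con_hat - p_exp_hat) / se
p_value = 2 * (1 - norm.cdf(abs(z_stat)))
z_crit = norm.ppf(1 - 0.05 / 2)
diff = p_con_hat - p_exp_hat
ci = (diff - z_crit * se, diff + z_crit * se)

print(f"Z = {z_stat:.3f}, p-value = {p_value:.4f}")
print(f"95% CI for the difference: ({ci[0]:.4f}, {ci[1]:.4f})")
```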

Case 2: Z-test for comparing means (2-sided)

If you want to test whether there is a statistically significant difference between the Control and Experimental groups’ metrics that are in the form of averages (e.g. average purchase amount), you can use a 2-sample Z-test to test the following hypothesis:
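
$$H_0: \mu_{con} = \mu_{exp} \qquad H_1: \mu_{con} \neq \mu_{exp}$$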

where the sampling distribution of the means of the Control group follows a Normal distribution with mean μ_con and variance σ²_con/N_con. Likewise, the sampling distribution of the means of the Experimental group follows a Normal distribution with mean μ_exp and variance σ²_exp/N_exp.

Then the difference in the means of the Control and Experimental groups also follows a Normal distribution, with mean μ_con − μ_exp and variance σ²_con/N_con + σ²_exp/N_exp:
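
$$\bar{X}_{con} - \bar{X}_{exp} \sim N\left(\mu_{con} - \mu_{exp},\; \frac{\sigma^2_{con}}{N_{con}} + \frac{\sigma^2_{exp}}{N_{exp}}\right)$$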

Consequently, the test statistic of the 2-sample Z-test for the difference in means can be calculated as follows:
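
$$Z_{stat} = \frac{\bar{X}_{con} - \bar{X}_{exp}}{SE}$$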

The Standard Error is equal to the square root of the estimate of the pooled variance and can be defined as:
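
$$SE = \sqrt{\frac{\sigma^2_{con}}{N_{con}} + \frac{\sigma^2_{exp}}{N_{exp}}}$$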

Then the p-value of this test statistic can be calculated as follows:
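
$$p = 2\,\big(1 - \Phi(|Z_{stat}|)\big)$$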

Finally, you can compute the Confidence Interval of the test as follows:
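
$$CI = (\bar{X}_{con} - \bar{X}_{exp}) \pm z_{1-\alpha/2} \cdot SE$$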

Chi-Squared test

If you want to test whether there is a statistically significant difference between the Control and Experimental groups’ performance metrics (for example, their conversions), and you are not interested in the direction of the difference (i.e. which one is better), you can use a Chi-squared test to test the following hypothesis:
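
$$H_0: p_{con} = p_{exp} \qquad H_1: p_{con} \neq p_{exp}$$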

Note that the metric should be in the form of a binary variable (e.g. conversion or no conversion, click or no click). The data can then be represented in the form of the following table, where O and E correspond to observed and expected (theoretical) values, respectively.
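
A sketch of this layout, with expected values in parentheses:

|              | Conversion (1)    | No conversion (0) |
|--------------|-------------------|-------------------|
| Control      | O_con,1 (E_con,1) | O_con,0 (E_con,0) |
| Experimental | O_exp,1 (E_exp,1) | O_exp,0 (E_exp,0) |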

Then the test statistic of the Chi-squared test can be expressed as follows:
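
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$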

where Observed corresponds to the observed data, Expected corresponds to the theoretical value, and i can take the values 0 (no conversion) and 1 (conversion). It’s important to note that each of these terms has a separate denominator. The formula for the test statistic when you have two groups only can be represented as follows:
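
$$\chi^2 = \sum_{i \in \{0,1\}} \left[\frac{(O_{con,i} - E_{con,i})^2}{E_{con,i}} + \frac{(O_{exp,i} - E_{exp,i})^2}{E_{exp,i}}\right]$$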

The expected value is simply equal to the number of times each version of the product is viewed multiplied by the probability of it leading to a conversion (or to a click, in the case of CTR), estimated from the pooled data under the Null.

Note that, since the Chi-2 test is not a parametric test, its Standard Error and Confidence Interval can’t be calculated in a standard way as it was done in the parametric Z-test or T-test.

The rejection region of the Chi-squared test lies in the upper tail of the Chi-squared distribution and can be visualized by the following graph.

Image Source: The Author
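
Here is a minimal Python sketch of the Chi-squared test using scipy’s chi2_contingency, which computes the expected (theoretical) counts for you — the observed conversion counts are illustrative assumptions:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of observed counts:
# rows = Control / Experimental, columns = conversion / no conversion
observed = np.array([[320, 9_680],
                     [380, 9_620]])

# correction=False gives the plain Chi-squared statistic from the formula above
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(f"Chi2 = {chi2_stat:.3f}, p-value = {p_value:.4f}, dof = {dof}")
print("Expected counts:\n", expected)
```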

P-Values of the t-test and F-test

Another quick way to determine whether to reject or support the Null Hypothesis is by using p-values. The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic. The smaller the p-value, the stronger the evidence against the Null Hypothesis, suggesting that it can be rejected.

The interpretation of a p-value is dependent on the chosen significance level. Most often, 1%, 5%, or 10% significance levels are used to interpret the p-value. So, instead of using the t-test and the F-test, the p-values of these test statistics can be used to test the same hypotheses.

The following figure shows a sample output of an OLS regression with two independent variables. In this table, the p-value of the t-test, testing the statistical significance of the class_size variable’s parameter estimate, and the p-value of the F-test, testing the joint statistical significance of the class_size and el_pct variables’ parameter estimates, are underlined.

Image Source: Stock and Watson

The p-value corresponding to the class_size variable is 0.011. Comparing this value to the significance levels 1% (0.01), 5% (0.05), and 10% (0.1), the following conclusions can be made:

  • 0.011 > 0.01 → the Null of the t-test can’t be rejected at the 1% significance level
  • 0.011 < 0.05 → the Null of the t-test can be rejected at the 5% significance level
  • 0.011 < 0.10 → the Null of the t-test can be rejected at the 10% significance level

So, this p-value suggests that the coefficient of the class_size variable is statistically significant at the 5% and 10% significance levels. The p-value corresponding to the F-test is 0.0000, and since it is smaller than all three cutoff values (0.01, 0.05, and 0.10), we can conclude that the Null of the F-test can be rejected in all three cases. This suggests that the coefficients of the class_size and el_pct variables are jointly statistically significant at the 1%, 5%, and 10% significance levels.
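
As a practical illustration, here is a minimal Python sketch that produces this kind of output with statsmodels on simulated data — the variable names mirror the example, but the numbers are illustrative assumptions, not the Stock and Watson dataset:

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in for the class_size / el_pct example
rng = np.random.default_rng(42)
n = 420
class_size = rng.normal(20, 4, n)
el_pct = rng.normal(10, 3, n)
test_score = 700 - 1.1 * class_size - 0.65 * el_pct + rng.normal(0, 10, n)

# OLS with an intercept and two regressors
X = sm.add_constant(np.column_stack([class_size, el_pct]))
model = sm.OLS(test_score, X).fit()

print(model.pvalues)   # p-values of the t-tests (const, class_size, el_pct)
print(model.f_pvalue)  # p-value of the F-test for joint significance
```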

Image Source: Pixabay

If you liked this article, here are some other articles you may enjoy:

[How to Crack Spotify Data Science Onsite Interview
End-to-end Data Science Case Study to Crack Spotify Onsite Interview with Tips and Python Implementationtowardsdatascience.com](https://towardsdatascience.com/spotify-data-science-case-study-what-makes-a-playlist-successful-28fec482c523 "towardsdatascience.com/spotify-data-science..")

[How To Crack Spotify Data Science Technical Screen Interview
List of exact Python/SQL commands and experimentation topics you should know to nail Spotify Tech Screentowardsdatascience.co](https://towardsdatascience.com/how-to-crack-spotify-data-science-technical-screen-interview-23f0f7205928 "towardsdatascience.com/how-to-crack-spotify..")

[Understanding Bias-Variance Trade-Off, Overfitting and Regularization in Machine Learning
Introduction to bias-variance trade-off, overfitting & how to solve overfitting using regularization: Ridge and Lasso…towardsdatascience.com](https://towardsdatascience.com/bias-variance-trade-off-overfitting-regularization-in-machine-learning-d79c6d8f20b4 "towardsdatascience.com/bias-variance-trade-..")

[Data Sampling Methods in Python
A ready-to-run code with different data sampling techniques to create a random sample in Pythontatev-aslanyan.medium.com](https://tatev-aslanyan.medium.com/data-sampling-methods-in-python-a4400628ea1b "tatev-aslanyan.medium.com/data-sampling-met..")

[Fundamentals Of Statistics For Data Scientists and Data Analysts
Key statistical concepts for your data science or data analytics journeytowardsdatascience.com](https://towardsdatascience.com/fundamentals-of-statistics-for-data-scientists-and-data-analysts-69d93a05aae7 "towardsdatascience.com/fundamentals-of-stat..")

[Simple and Complete Guide to A/B Testing
End-to-end A/B testing for your Data Science experiments for non-technical and technical specialists with examples and…towardsdatascience.com](https://towardsdatascience.com/simple-and-complet-guide-to-a-b-testing-c34154d0ce5a "towardsdatascience.com/simple-and-complet-g..")

[Monte Carlo Simulation and Variants with Python
Your Guide to Monte Carlo Simulation and Must Know Statistical Sampling Techniques With Python Implementationtowardsdatascience.com](https://towardsdatascience.com/monte-carlo-simulation-and-variants-with-python-43e3e7c59e1f "towardsdatascience.com/monte-carlo-simulati..")

About the Author — That’s Me!

Hello, fellow enthusiasts! I am Tatev, the person behind the insights and information shared in this blog. My journey in the vibrant world of Data Science and AI has been nothing short of incredible, and it’s a privilege to be able to share this wealth of knowledge with all of you.

Connect and Learn More:

Feel free to connect; whether it’s to discuss the latest trends, seek career advice, or just to share your own exciting journey in this field. I believe in fostering a community where knowledge meets passion, and I’m always here to support and guide aspiring individuals in this vibrant industry.

Want to learn everything about Data Science and how to land a Data Science job? Download this FREE Data Science and AI Career Handbook

Thanks for the read

I encourage you to join Medium today to have complete access to all of the great locked content published across Medium and on my feed where I publish about various Data Science, Machine Learning, and Deep Learning topics.

Happy learning!
