Welcome to the project on regression. We will use the Boston house price dataset for this project.
Objective¶
The problem at hand is to predict the housing prices of a town or a suburb based on the features of the locality provided to us. In the process, we need to identify the most important features affecting the price of the house. We need to employ techniques of data preprocessing and build a linear regression model that predicts the prices for the unseen data.
Dataset¶
Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Detailed attribute information can be found below:
Attribute Information:
- CRIM: Per capita crime rate by town
- ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: Proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: Nitric Oxide concentration (parts per 10 million)
- RM: The average number of rooms per dwelling
- AGE: Proportion of owner-occupied units built before 1940
- DIS: Weighted distances to five Boston employment centers
- RAD: Index of accessibility to radial highways
- TAX: Full-value property-tax rate per 10,000 dollars
- PTRATIO: Pupil-teacher ratio by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes in 1000 dollars
Importing the necessary libraries¶
# Importing libraries for data manipulation
import numpy as np
import pandas as pd
# Importing libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot
import scipy.stats as stats
# Importing libraries for building linear regression model
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression
# Importing libraries for model evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Importing library for splitting data
from sklearn.model_selection import train_test_split
# Importing library for data preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
# Importing filter to ignore deprecation warnings
import warnings
warnings.filterwarnings("ignore")
Mounting Drive¶
# Let colab access my google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Loading the dataset¶
# Using the pd.read_csv() function to load the dataset
df = pd.read_csv("/content/drive/MyDrive/MIT - Applied Data Science/Projects/Elective Project/Boston.csv")
Data Overview¶
- Observations
- Sanity checks
# Looking at the top 5 rows of the dataset to start building some intuition
df.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 |
So it appears that our dataset only has numerical values, with CHAS being the only categorical variable (a 0/1 dummy).
The variable MEDV is our variable of interest, or dependent variable. All the others are independent variables that influence the values of the MEDV target field and will be used by the predictive model to generate predictions.
# Checking the size of the dataset using the .shape method
df.shape
(506, 13)
- Our dataset has 506 rows and 13 columns.
# Using the .info() function to get information about the data types and also to gain some intuition about missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64
 9   TAX      506 non-null    int64
 10  PTRATIO  506 non-null    float64
 11  LSTAT    506 non-null    float64
 12  MEDV     506 non-null    float64
dtypes: float64(10), int64(3)
memory usage: 51.5 KB
We can see that the dataset is composed of only numerical datatypes:
10 columns have float64 and 3 columns have int64 datatypes.
We can also see that every column has 506 non-null entries, meaning there should be no missing values in the dataset.
# Just to double-check that in fact there are no missing values, let's use the .isnull().sum() function, that will return us a count of the missing values in our data
df.isnull().sum()
| | 0 |
|---|---|
| CRIM | 0 |
| ZN | 0 |
| INDUS | 0 |
| CHAS | 0 |
| NOX | 0 |
| RM | 0 |
| AGE | 0 |
| DIS | 0 |
| RAD | 0 |
| TAX | 0 |
| PTRATIO | 0 |
| LSTAT | 0 |
| MEDV | 0 |
- As we can see there are 0 missing values in all of our columns.
# Just to be absolutely sure, I'm now going to run the .isna().sum() function (an alias of .isnull()), which will return a count of any fields that have NaN values
df.isna().sum()
| | 0 |
|---|---|
| CRIM | 0 |
| ZN | 0 |
| INDUS | 0 |
| CHAS | 0 |
| NOX | 0 |
| RM | 0 |
| AGE | 0 |
| DIS | 0 |
| RAD | 0 |
| TAX | 0 |
| PTRATIO | 0 |
| LSTAT | 0 |
| MEDV | 0 |
- As we can observe, there are also no NaN values in our data. This will save us a lot of time, since we don't have to do any missing-value treatment on the dataset.
# I'm now running the .duplicated().sum() function, to know if there are any duplicated records on the dataset
df.duplicated().sum()
0
- No duplicate rows in the data, so there is no need to drop any records.
# In order to start building some more intuition on our data, I'm now using the .describe() function, which will return a statistical summary of our columns
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CRIM | 506.0 | 3.613524 | 8.601545 | 0.00632 | 0.082045 | 0.25651 | 3.677083 | 88.9762 |
| ZN | 506.0 | 11.363636 | 23.322453 | 0.00000 | 0.000000 | 0.00000 | 12.500000 | 100.0000 |
| INDUS | 506.0 | 11.136779 | 6.860353 | 0.46000 | 5.190000 | 9.69000 | 18.100000 | 27.7400 |
| CHAS | 506.0 | 0.069170 | 0.253994 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 1.0000 |
| NOX | 506.0 | 0.554695 | 0.115878 | 0.38500 | 0.449000 | 0.53800 | 0.624000 | 0.8710 |
| RM | 506.0 | 6.284634 | 0.702617 | 3.56100 | 5.885500 | 6.20850 | 6.623500 | 8.7800 |
| AGE | 506.0 | 68.574901 | 28.148861 | 2.90000 | 45.025000 | 77.50000 | 94.075000 | 100.0000 |
| DIS | 506.0 | 3.795043 | 2.105710 | 1.12960 | 2.100175 | 3.20745 | 5.188425 | 12.1265 |
| RAD | 506.0 | 9.549407 | 8.707259 | 1.00000 | 4.000000 | 5.00000 | 24.000000 | 24.0000 |
| TAX | 506.0 | 408.237154 | 168.537116 | 187.00000 | 279.000000 | 330.00000 | 666.000000 | 711.0000 |
| PTRATIO | 506.0 | 18.455534 | 2.164946 | 12.60000 | 17.400000 | 19.05000 | 20.200000 | 22.0000 |
| LSTAT | 506.0 | 12.653063 | 7.141062 | 1.73000 | 6.950000 | 11.36000 | 16.955000 | 37.9700 |
| MEDV | 506.0 | 22.532806 | 9.197104 | 5.00000 | 17.025000 | 21.20000 | 25.000000 | 50.0000 |
CRIM (Per capita crime rate) - A mean value of 3.61 but a max value of 88.97 suggests that there are probably some towns with extreme crime rates. These may generate potential outliers in our data.
ZN (Proportion of residential land zoned for lots over 25,000 sq.ft.) - Although the mean is 11.36, the 25th and 50th percentiles are 0. This implies that many suburbs have no large residential lots. This also has the potential to generate outliers in the data.
RAD (Index of accessibility to radial highways) - With a mean of 9.55 and a 75th percentile of 24 (also the maximum), the distribution looks bimodal: towns tend to be either highly accessible or not.
TAX (Full-value property-tax rate) - Here the mean value is 408 while the max value is 711, which indicates high variation of tax rates between towns.
LSTAT (% Lower status of the population) - This feature may correlate strongly with our target variable: the higher the LSTAT value, the lower the house price. I'll analyse this further ahead with a correlation matrix.
RM (Average number of rooms per dwelling) - With a mean value of 6.28 in a range between 3.56 and 8.78, these values tell us that most houses have several rooms. This may also correlate strongly with our target, MEDV.
PTRATIO (Pupil-teacher ratio) - Another variable that may have an important influence on MEDV: a higher PTRATIO means fewer teachers per student, which tends to lower house prices.
NOX (Nitric Oxide concentration) - This variable might also negatively affect housing prices.
DIS (Weighted distances to five Boston employment centers) - A higher value means the suburb is farther from employment centers, which may also have a negative effect on house prices.
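The heavy right tails noted for CRIM and ZN can be quantified with pandas' `.skew()`. A minimal sketch using a few illustrative values (in the notebook this would run on the full `df`):

```python
import pandas as pd

# Illustrative values only; in the notebook this would run on the full df
sample = pd.DataFrame({
    "CRIM": [0.006, 0.027, 0.032, 0.069, 88.976],  # one extreme crime rate
    "ZN":   [18.0, 0.0, 0.0, 0.0, 100.0],
})

# pandas' .skew() returns sample skewness; values well above ~1 flag a heavy right tail
skewness = sample.skew()
print(skewness.round(2))
```

A skewness well above 1, as CRIM shows here, is the numeric counterpart of the "mean far below max" pattern seen in the describe() table.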
Exploratory Data Analysis (EDA)¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Questions:
- What does the distribution of 'MEDV' look like?
- What can we infer from the correlation heatmap? Is there correlation between the dependent and independent variables?
- What are all the inferences that can be found by doing univariate analysis for different variables?
- Do bivariate analysis to visualize the relationship between the features having significant correlations (>= 0.7 or <= -0.7)
1. What does the distribution of 'MEDV' look like?¶
# So let's start the EDA, plotting a histogram for our target variable, MEDV, to check how its data is distributed
# I'll focus on the distribution type (normal or skewed) and extreme values that may be potential outliers
sns.histplot(df['MEDV'], kde=True, bins=30)
plt.title("Distribution of MEDV (House Prices)")
plt.xlabel("MEDV (in $1000)")
plt.ylabel("Frequency")
plt.show()
We can't say that our target variable is normally distributed, since it has a tail to the right, making it a right-skewed distribution.
I shall perform a log transformation on this feature and then plot it again to confirm that the distribution is now closer to normal.
# Using the np.log() function to transform the variable distribution
df['MEDV_log'] = np.log(df['MEDV'])
sns.histplot(df['MEDV_log'], kde=True, bins=30)
plt.title("Distribution of MEDV_log (House Prices)")
plt.xlabel("MEDV_log (in $1000)")
plt.ylabel("Frequency")
plt.show()
Now our target variable seems to have a more normal distribution, without such a long tail. It's not perfectly normal, but I'll assume it will be sufficient for our analysis.
I'll drop the original MEDV field when preparing the data for modeling, so the two versions of the target don't become redundant and influence our further analysis.
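The effect of the log transform on skewness can also be checked numerically. A small sketch with illustrative right-skewed prices (the notebook would use df['MEDV']):

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed prices; the notebook would use df['MEDV']
prices = pd.Series([10.0, 12.0, 15.0, 18.0, 22.0, 30.0, 50.0])

# np.log compresses the long right tail, pulling skewness toward zero
skew_before = prices.skew()
skew_after = np.log(prices).skew()
print(f"skew before log: {skew_before:.2f}")
print(f"skew after log : {skew_after:.2f}")
```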
2. What can we infer from the correlation heatmap? Is there correlation between the dependent and independent variables?¶
# Now I'm plotting a heatmap to analyse the feature correlations and check relationships between them
# I'll focus on which independent variables have a high correlation with our target variable MEDV
# I'll also check for the presence of multicollinearity among our features
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()
The heatmap shows strong correlations between MEDV and:
- RM (+0.7, positive): More rooms typically mean higher house prices.
- LSTAT (-0.74, negative): Higher percentage of lower-status population leads to lower house prices.
- PTRATIO (-0.51, negative): A higher student-to-teacher ratio negatively impacts house prices.
- NOX (-0.43, negative): Higher pollution levels slightly reduce house prices.
There is also significant multicollinearity between some independent variables:
- INDUS, NOX, and TAX have high positive correlations (above 0.7).
- DIS has a strong negative correlation with NOX (-0.77) and AGE (-0.75), meaning suburbs farther from employment centers tend to be newer and have lower pollution.
- RAD and TAX (correlation: +0.91): This suggests that areas with high accessibility to highways often have higher property taxes.
We'll need to handle multicollinearity before building our regression model, by removing redundant features or using regularization techniques like Ridge/Lasso.
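The Ridge regularization mentioned above shrinks the coefficients of near-collinear features toward each other instead of dropping them. A minimal sketch on synthetic data (the feature names, seed, and alpha value are illustrative, not from the project):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two nearly collinear synthetic features standing in for pairs like RAD/TAX
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)        # alpha is illustrative

# OLS splits the shared signal between the twins unstably;
# ridge shrinks the split toward an even division
print("OLS coefficients  :", ols.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))
```

The combined effect (the sum of the two coefficients) stays near 3 under both fits; only the unstable allocation between the collinear twins is regularized away.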
Univariate Analysis¶
3. What are all the inferences that can be found by doing univariate analysis for different variables?¶
# Now I'm going to analyse the variables independently so I'm able to do some inferences
# I'm focusing on checking the distribution in the data as well as trying to detect potential outliers present in each feature
features = df.columns
for feature in features:
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    sns.histplot(df[feature], kde=True, bins=30)
    plt.title(f"Distribution of {feature}")
    plt.subplot(1, 2, 2)
    sns.boxplot(y=df[feature])
    plt.title(f"Boxplot of {feature}")
    plt.show()
- CRIM: Shows outliers with very high crime values in some towns.
- ZN: Most values are 0, suggesting many towns don’t have large residential lots.
- TAX: Wide range, indicating large tax differences between towns.
- RM: Mostly between 5-7 rooms, but some larger houses exist.
- LSTAT: Right-skewed, with some areas having very high poverty levels.
I might have to handle outliers for variables like CRIM, ZN, and TAX.
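A common way to flag the outliers mentioned above is Tukey's 1.5×IQR rule, which the boxplots use for their whiskers. A sketch on a few illustrative CRIM-like values (in the notebook this would use df['CRIM']):

```python
import pandas as pd

# Tukey's 1.5*IQR rule on illustrative CRIM-like values (sorted for clarity)
crim = pd.Series([0.01, 0.03, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 9.0, 88.9])

q1, q3 = crim.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr   # points beyond this fence are flagged

outliers = crim[crim > upper_fence]
print(f"upper fence: {upper_fence:.2f}, points flagged: {len(outliers)}")
```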
Bivariate Analysis¶
4. Do bivariate analysis to visualize the relationship between the features having significant correlations (>= 0.7 or <= -0.7)¶
# I'm now plotting Scatter plots between the variables that show a strong correlation
# Focusing on positive and negative correlations with values between >= 0.7 or <= -0.7
sns.pairplot(df, vars=['MEDV', 'RM', 'LSTAT', 'NOX', 'DIS', 'INDUS', 'TAX', 'AGE', 'RAD'])
plt.show()
Scatter plots confirm strong relationships:
RM vs. MEDV (Positive correlation, 0.7): More rooms - Higher price.
LSTAT vs. MEDV (Negative correlation, -0.74): Higher lower-status % - Lower price.
NOX vs. DIS (Negative correlation, -0.77): Closer to industrial areas - Higher pollution.
TAX vs. RAD (Strong positive correlation, 0.91): Towns with higher highway accessibility tend to have higher property taxes.
AGE vs. DIS (Negative correlation, -0.75): Older houses are usually closer to city centers.
Data Preprocessing¶
- Missing value treatment
- Log transformation of dependent variable if skewed
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
Outlier Treatment¶
As we have observed, some outliers in our data tend to inflate the relationship between variables. I'm now removing these extreme values so they don't distort our data. After removing the outliers, I'll check the correlation value again to see whether it has decreased.
# Importing the required function
from scipy.stats import pearsonr
# Removing the outliers in TAX
df = df[df["TAX"] < 600]
# Checking the new correlation between TAX and RAD
corr_tax_rad = pearsonr(df["TAX"], df["RAD"])[0]
print(f"The correlation between TAX and RAD after outlier removal: {corr_tax_rad:.4f}")
The correlation between TAX and RAD after outlier removal: 0.2498
- We can confirm that the correlation between TAX and RAD has dropped significantly. The extremely high tax rates might be due to some other reason rather than accessibility to highways.
Splitting the Dataset¶
# Defining the independent variables (X) and the target variable (Y)
Y = df['MEDV_log']
X = df.drop(columns = {'MEDV', 'MEDV_log'})
# Add the intercept term
X = sm.add_constant(X)
# Splitting the data into train and test (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1)
Handling Multicollinearity¶
We've observed that multicollinearity is present in our data. I'll now handle this issue by calculating the Variance Inflation Factor (VIF) and dropping features that show a score higher than 5.
# Defining the function to calculate VIF
def calculate_vif(train):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = train.columns
    vif_data["VIF"] = [variance_inflation_factor(train.values, i) for i in range(len(train.columns))]
    return vif_data
# Displaying VIF scores for training data
print(calculate_vif(X_train))
      Feature         VIF
0       const  708.370336
1        CRIM    2.872608
2          ZN    2.583947
3       INDUS    2.964985
4        CHAS    1.104334
5         NOX    5.631089
6          RM    2.566053
7         AGE    2.718199
8         DIS    3.433860
9         RAD    1.196240
10        TAX    1.661921
11    PTRATIO    1.653781
12      LSTAT    3.052094
- Analysing the VIF scores, we can see that only the NOX feature is higher than 5. Since its value is very close to the threshold, I'm opting to keep it in the dataset, as it can be useful for prediction. (The large VIF on const reflects the intercept term, not a feature, and can be ignored.)
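Had any feature exceeded the threshold decisively, a common recipe is to iteratively drop the highest-VIF feature and recompute until everything is under 5. A sketch on synthetic data (features a, b, c are illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic features: `a` and `b` are strongly collinear, `c` is independent
rng = np.random.default_rng(42)
n = 100
a = rng.normal(size=n)
X = pd.DataFrame({"a": a, "b": a + rng.normal(scale=0.1, size=n), "c": rng.normal(size=n)})

# Drop the worst offender and recompute until every VIF is under the threshold
while True:
    vifs = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    if vifs.max() <= 5:
        break
    X = X.drop(columns=vifs.idxmax())

print("surviving features:", list(X.columns))
```

Recomputing after every drop matters: removing one member of a collinear pair is usually enough to bring its twin back under the threshold.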
Feature Scaling¶
Since we're building a regression model, I'm considering scaling the features using Z-score normalization, to make the coefficients more interpretable and to ensure that our features are centered with unit variance, which should also bring stability to our model.
# Initializing StandardScaler
scaler = StandardScaler()
# Applying scaling only on the independent variables (X_train and X_test)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Converting back to DataFrame to retain feature names
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
# Display a preview of scaled data
print("Scaled Training Data:\n", X_train_scaled.head())
Scaled Training Data:
const CRIM ZN INDUS CHAS NOX RM AGE \
0 0.0 -0.518552 1.079000 -0.836244 -0.30532 -0.713256 0.543155 -1.080120
1 0.0 -0.561877 0.325751 -0.589128 -0.30532 -0.820128 0.525468 -0.984791
2 0.0 -0.537923 -0.615810 -0.841465 -0.30532 -0.208043 0.938161 0.879417
3 0.0 0.012980 -0.615810 -0.355935 -0.30532 -0.062308 1.168089 0.653453
4 0.0 -0.523104 -0.615810 -0.871049 -0.30532 -0.489797 -0.740612 -1.256654
DIS RAD TAX PTRATIO LSTAT
0 0.937553 0.329909 1.237076 -1.162043 -0.987817
1 0.424121 -0.286878 -0.520378 0.552429 -0.891426
2 -0.517479 -1.520452 -0.685608 0.011017 -0.857689
3 -0.615493 2.180271 -0.129832 -0.169454 -0.719530
4 0.335576 -0.286878 1.717747 -0.395042 -0.139580
Now that we've successfully scaled our data we can observe that:
- The mean values are centered - most values are around 0.
- Unit variance is accomplished and the values are now standardized.
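These two properties can be verified numerically. A sketch on synthetic data (in the notebook, the same checks would run on X_train_scaled):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train (arbitrary location and spread)
rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=12.0, size=(200, 3))

scaled = StandardScaler().fit_transform(X)

# After z-scoring, each column has mean ~0 and (population) std ~1
print("column means:", scaled.mean(axis=0).round(6))
print("column stds :", scaled.std(axis=0).round(6))
```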
Model Building - Linear Regression¶
# I'm first training a linear regression model using sklearn.LinearRegression() to make the predictions
# Initializing the Linear Regression model
lin_reg = LinearRegression()
# Training the model on the scaled training data
lin_reg.fit(X_train_scaled, y_train)
# Making predictions on the training and test sets
y_train_pred = lin_reg.predict(X_train_scaled)
y_test_pred = lin_reg.predict(X_test_scaled)
# I'll now train the sm.OLS() model to check the statistical summary of the model
# Ensuring indices are aligned
y_train_aligned = y_train.reset_index(drop=True)
X_train_aligned = X_train_scaled.drop(columns=["const"], errors="ignore").reset_index(drop=True) # Remove 'const' if present
# Fitting the OLS model using the corrected training data
ols_model = sm.OLS(y_train_aligned, sm.add_constant(X_train_aligned)).fit()
# Displaying the full statistical summary
print(ols_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: MEDV_log R-squared: 0.869
Model: OLS Adj. R-squared: 0.862
Method: Least Squares F-statistic: 134.9
Date: Mon, 17 Feb 2025 Prob (F-statistic): 1.44e-100
Time: 23:09:19 Log-Likelihood: 193.70
No. Observations: 258 AIC: -361.4
Df Residuals: 245 BIC: -315.2
Df Model: 12
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.1531 0.007 432.132 0.000 3.139 3.167
CRIM -0.0077 0.012 -0.619 0.536 -0.032 0.017
ZN 0.0115 0.012 0.977 0.330 -0.012 0.035
INDUS 0.0090 0.013 0.713 0.476 -0.016 0.034
CHAS 0.0104 0.008 1.357 0.176 -0.005 0.026
NOX -0.0439 0.017 -2.534 0.012 -0.078 -0.010
RM 0.1869 0.012 15.994 0.000 0.164 0.210
AGE -0.0549 0.012 -4.566 0.000 -0.079 -0.031
DIS -0.0781 0.014 -5.779 0.000 -0.105 -0.052
RAD 0.0178 0.008 2.233 0.026 0.002 0.034
TAX -0.0388 0.009 -4.128 0.000 -0.057 -0.020
PTRATIO -0.0607 0.009 -6.473 0.000 -0.079 -0.042
LSTAT -0.0552 0.013 -4.328 0.000 -0.080 -0.030
==============================================================================
Omnibus: 15.519 Durbin-Watson: 1.965
Prob(Omnibus): 0.000 Jarque-Bera (JB): 18.382
Skew: 0.497 Prob(JB): 0.000102
Kurtosis: 3.849 Cond. No. 6.20
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Key insights from model performance
R² Score is 0.869 - This means our model explains ~86.9% of the variance in house prices.
Adjusted R² is 0.862 - Also high, indicating our features are genuinely adding value to the model.
F-statistic: 134.9, p-value ≈ 0 - The model as a whole is highly significant.
Durbin-Watson = 1.965 → No autocorrelation issues.
Condition Number = 6.20 → Multicollinearity is now well-controlled.
Key insights from p-values
What do the p-values tell us? A p-value < 0.05 means the feature is statistically significant (it contributes meaningfully to the model).
Highly significant features (p < 0.05, strong impact on house prices):
- RM (+0.1869, p = 0.000) - More rooms increase house price significantly.
- AGE (-0.0549, p = 0.000) - Older houses decrease house price significantly.
- DIS (-0.0781, p = 0.000) - Farther distances from job centers lower house prices.
- TAX (-0.0388, p = 0.000) - Higher property taxes reduce house prices.
- PTRATIO (-0.0607, p = 0.000) - Higher student-to-teacher ratio decreases prices.
- LSTAT (-0.0552, p = 0.000) - Higher lower-status population percentage lowers prices.
- NOX (-0.0439, p = 0.012) - Higher pollution negatively affects house prices.
- RAD (+0.0178, p = 0.026) - Surprisingly, accessibility to highways slightly increases prices.
Not significant (p > 0.05, weak or no impact on house prices)
- CRIM (-0.0077, p = 0.536) - Crime rate has no statistically significant effect.
- ZN (+0.0115, p = 0.330) - Zoning has no statistically significant effect.
- INDUS (+0.0090, p = 0.476) - Industrial land zoning has no statistically significant effect.
- CHAS (+0.0104, p = 0.176) - Surprisingly, being near the Charles River has no significant effect.
I'll now try to tune the model by training without the features that have a low impact on the prediction, to see if there are any improvements in our model.
# Defining the features to keep (removing CRIM, ZN, INDUS, CHAS)
X_train_refined = X_train_scaled.drop(columns=["CRIM", "ZN", "INDUS", "CHAS"])
X_test_refined = X_test_scaled.drop(columns=["CRIM", "ZN", "INDUS", "CHAS"])
# Ensuring indices are aligned
y_train_aligned = y_train.reset_index(drop=True)
X_train_refined = X_train_refined.reset_index(drop=True)
# Fitting the refined OLS model
ols_model_refined = sm.OLS(y_train_aligned, sm.add_constant(X_train_refined)).fit()
# Displaying the new OLS summary
print(ols_model_refined.summary())
OLS Regression Results
==============================================================================
Dep. Variable: MEDV_log R-squared: 0.866
Model: OLS Adj. R-squared: 0.862
Method: Least Squares F-statistic: 202.0
Date: Mon, 17 Feb 2025 Prob (F-statistic): 3.00e-104
Time: 20:38:49 Log-Likelihood: 191.68
No. Observations: 258 AIC: -365.4
Df Residuals: 249 BIC: -333.4
Df Model: 8
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.1531 0.007 432.235 0.000 3.139 3.167
const -2.58e-17 5.07e-18 -5.088 0.000 -3.58e-17 -1.58e-17
NOX -0.0473 0.014 -3.497 0.001 -0.074 -0.021
RM 0.1888 0.011 16.655 0.000 0.167 0.211
AGE -0.0569 0.012 -4.812 0.000 -0.080 -0.034
DIS -0.0775 0.012 -6.619 0.000 -0.101 -0.054
RAD 0.0157 0.008 2.058 0.041 0.001 0.031
TAX -0.0360 0.008 -4.320 0.000 -0.052 -0.020
PTRATIO -0.0638 0.009 -7.362 0.000 -0.081 -0.047
LSTAT -0.0533 0.013 -4.242 0.000 -0.078 -0.029
==============================================================================
Omnibus: 12.087 Durbin-Watson: 2.003
Prob(Omnibus): 0.002 Jarque-Bera (JB): 13.670
Skew: 0.426 Prob(JB): 0.00108
Kurtosis: 3.738 Cond. No. 2.80e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.05e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
As we can see from the summary statistics, there aren't really any significant improvements in the new model:
R² decreased slightly (original: 0.869, refined: 0.866), while adjusted R² is unchanged at 0.862.
p-values & significance remain almost identical.
The refined model raises multicollinearity warnings; these stem from the all-zero scaled const column being carried alongside the added intercept, visible as the duplicated const row in the summary.
Since the new model doesn't improve performance, we're keeping our first model.
Model Performance Check¶
- How is the model performing? Check using R-squared, RMSE, MAE, MAPE
- Is there multicollinearity? Check using VIF
- How does the model perform after cross-validation?
1. How is the model performing? Check using R-squared, RMSE, MAE, MAPE¶
# Evaluating model performance with R-squared
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
print("**Linear Regression Model Performance with R-squared**")
print(f"R² Score (Train): {train_r2:.4f}")
print(f"R² Score (Test) : {test_r2:.4f}")
**Linear Regression Model Performance with R-squared**
R² Score (Train): 0.8686
R² Score (Test) : 0.8567
R² Score (Train = 0.8686, Test = 0.8567)
This means our model explains ~86% of the variance in house prices.
The fact that train and test R² are very close suggests no overfitting.
# Evaluating model performance with MAE
train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)
print("**Linear Regression Model Performance with MAE**")
print(f"Mean Absolute Error (Train): {train_mae:.4f}")
print(f"Mean Absolute Error (Test) : {test_mae:.4f}")
**Linear Regression Model Performance with MAE**
Mean Absolute Error (Train): 0.0858
Mean Absolute Error (Test) : 0.0788
Mean Absolute Error (Train = 0.0858, Test = 0.0788)
- On average, our predictions are only ~0.08 log-units off from actual values. That's a small error margin, meaning our model is making very accurate predictions.
# Evaluating model performance with RMSE
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print("**Linear Regression Model Performance with RMSE**")
print(f"Root Mean Squared Error (Train): {train_rmse:.4f}")
print(f"Root Mean Squared Error (Test) : {test_rmse:.4f}")
**Linear Regression Model Performance with RMSE**
Root Mean Squared Error (Train): 0.1142
Root Mean Squared Error (Test) : 0.1051
Root Mean Squared Error (Train = 0.1142, Test = 0.1051)
Since RMSE penalizes larger errors more than MAE, this confirms that our model isn't making extreme mistakes.
Train and test RMSE are close, again suggesting no overfitting.
# Evaluating model performance with MAPE
# Defining a function to calculate MAPE
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # Convert to percentage
# Calculating MAPE for train and test sets
train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
test_mape = mean_absolute_percentage_error(y_test, y_test_pred)
print("**Linear Regression Model Performance with MAPE**")
print(f"Mean Absolute Percentage Error (Train): {train_mape:.2f}%")
print(f"Mean Absolute Percentage Error (Test) : {test_mape:.2f}%")
**Linear Regression Model Performance with MAPE**
Mean Absolute Percentage Error (Train): 2.77%
Mean Absolute Percentage Error (Test) : 2.52%
MAPE Score (Train = 2.77%, Test = 2.52%)
Our model's average prediction error is only ~2.5% to 2.8% (on the log scale).
This shows our model is not only well-fitted but also generalizing well to unseen data.
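One caveat: because the target is log-transformed, these MAPE values are on the log scale. Converting predictions back with np.exp gives errors on the original price scale, which are typically larger. A sketch with hypothetical values (stand-ins for y_test and y_test_pred):

```python
import numpy as np

# Hypothetical log-scale targets and predictions (stand-ins for y_test / y_test_pred)
y_log_true = np.array([3.0, 3.2, 2.9, 3.5])
y_log_pred = np.array([3.05, 3.15, 2.95, 3.40])

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# The same predictions look worse once mapped back to the price scale
log_mape = mape(y_log_true, y_log_pred)
price_mape = mape(np.exp(y_log_true), np.exp(y_log_pred))
print(f"log-scale MAPE  : {log_mape:.2f}%")
print(f"price-scale MAPE: {price_mape:.2f}%")
```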
2. Is there multicollinearity? Check using VIF¶
# Double-checking multicollinearity using VIF
# Computing VIF on the final feature set
final_features = X_train_scaled.columns
vif_data = pd.DataFrame()
vif_data["Feature"] = final_features
vif_data["VIF"] = [variance_inflation_factor(X_train_scaled.values, i) for i in range(len(final_features))]
# Display VIF scores
print("**Final Multicollinearity Check (VIF Scores):**")
print(vif_data)
**Final Multicollinearity Check (VIF Scores):**
Feature VIF
0 const NaN
1 CRIM 2.872608
2 ZN 2.583947
3 INDUS 2.964985
4 CHAS 1.104334
5 NOX 5.631089
6 RM 2.566053
7 AGE 2.718199
8 DIS 3.433860
9 RAD 1.196240
10 TAX 1.661921
11 PTRATIO 1.653781
12 LSTAT 3.052094
All VIF values are below 5, except for NOX (5.63), which is only slightly above the threshold; I've decided not to drop it, to preserve the predictive power of the model. (The NaN for const is expected: after scaling, the constant column is all zeros.)
No extreme multicollinearity detected, meaning our features are independent enough for regression.
The model is stable, and there's no need to remove any features.
3. How does the model perform after cross-validation?¶
# Cross-Validation performance check
# Initializing Linear Regression model
lin_reg = LinearRegression()
# Performing 5-Fold Cross-Validation (scoring based on R²)
cv_scores = cross_val_score(lin_reg, X_train_scaled, y_train, cv=5, scoring='r2')
# Displaying cross-validation results
print("**Cross-Validation Results:**")
print(f"Mean R² Score: {np.mean(cv_scores):.4f}")
print(f"Standard Deviation of R²: {np.std(cv_scores):.4f}")
print(f"All R² Scores: {cv_scores}")
**Cross-Validation Results:**
Mean R² Score: 0.8497
Standard Deviation of R²: 0.0148
All R² Scores: [0.85566448 0.82114747 0.85066249 0.85944711 0.86178792]
Mean R² Score = 0.8497 - This is very close to our test R² (~0.86), suggesting our model is consistent and reliable.
Low Standard Deviation (0.0148) - Minimal variation across splits indicates the model is stable and not sensitive to a particular choice of training data.
All R² scores are in the 0.82 - 0.86 range - No major outliers, meaning the model performs well across all folds of cross-validation.
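cross_val_score can also report error metrics directly; since scikit-learn maximizes scores, error metrics come back negated. A sketch with synthetic data standing in for the scaled training set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for X_train_scaled / y_train
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 4))
y = X @ np.array([0.2, -0.1, 0.05, 0.3]) + rng.normal(scale=0.1, size=150)

# Error-based scorers are negated so that "higher is better" still holds
neg_rmse = cross_val_score(LinearRegression(), X, y, cv=5,
                           scoring="neg_root_mean_squared_error")
print(f"Mean RMSE across folds: {-neg_rmse.mean():.4f}")
```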
Checking Linear Regression Assumptions¶
- In order to make statistical inferences from a linear regression model, it is important to ensure that the assumptions of linear regression are satisfied.
Checking the mean of the residuals¶
# We're aiming for a mean value close to 0
residuals = y_test - y_test_pred
mean_residuals = np.mean(residuals)
print(f"Mean of Residuals: {mean_residuals:.6f}")
Mean of Residuals: -0.005186
A value of -0.005186 confirms that our model doesn't have systematic bias, meaning predictions are evenly distributed around the actual values.
This assumption is satisfied.
Homoscedasticity Check¶
# Plotting the residuals vs. predicted values
plt.figure(figsize=(8,5))
sns.scatterplot(x=y_test_pred, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Values (Homoscedasticity Check)")
plt.show()
Residuals are randomly scattered around the 0 line across all predicted values.
There is no clear pattern or funnel shape, meaning the variance is constant across all levels of prediction.
This confirms that the model satisfies the homoscedasticity assumption.
Linearity of Variables¶
# Plotting a Q-Q plot to verify whether residuals align with a normal distribution
plt.figure(figsize=(8,5))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q Plot of Residuals")
plt.show()
Residuals follow the diagonal line closely, suggesting the errors are approximately normally distributed.
The slight S-shape indicates minor deviations in the tails, which is not uncommon in real-world data.
Strictly speaking, a Q-Q plot assesses the normality of residuals rather than linearity; linearity itself is supported by the absence of curvature in the residuals-vs-predicted plot above. Both assumptions can be considered satisfied.
Normality of Error Terms¶
# Plotting a histogram to check whether the errors follow a normal distribution
plt.figure(figsize=(8,5))
sns.histplot(residuals, bins=30, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Histogram of Residuals (Normality Check)")
plt.show()
The roughly bell-shaped distribution confirms that the residuals are approximately normal.
The slight left skew is not a major issue, especially since the log-transformed target (MEDV_log) had a similar distribution.
This confirms that the model meets the normality assumption; together with the earlier checks, all linear regression assumptions are reasonably satisfied.
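The histogram can likewise be complemented by a formal normality test. A sketch with scipy's Shapiro-Wilk test, where the synthetic sample stands in for the notebook's `residuals`:

```python
# Sketch of a Shapiro-Wilk normality test on a synthetic residual sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(size=150)  # stand-in for the model residuals
stat, p_value = stats.shapiro(sample)
print(f"Shapiro-Wilk statistic: {stat:.4f}, p-value: {p_value:.4f}")
# A p-value above 0.05 gives no evidence against normality.
```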
Final Model¶
# Extracting coefficients from the fitted OLS model
coef = ols_model.params
# Printing the regression equation term by term
print("log (Price) =", end=" ")
for i in range(len(coef)):
    term = f"({coef.iloc[i]:.4f}) * {coef.index[i]}"  # .iloc avoids deprecated positional [] on a Series
    if i == len(coef) - 1:  # No '+' after the last term
        print(term, end=" ")
    else:
        print(term, "+", end=" ")
log (Price) = (3.1531) * const + (-0.0077) * CRIM + (0.0115) * ZN + (0.0090) * INDUS + (0.0104) * CHAS + (-0.0439) * NOX + (0.1869) * RM + (-0.0549) * AGE + (-0.0781) * DIS + (0.0178) * RAD + (-0.0388) * TAX + (-0.0607) * PTRATIO + (-0.0552) * LSTAT
Model Equation¶
log (Price) = (3.1531) * const + (-0.0077) * CRIM + (0.0115) * ZN + (0.0090) * INDUS + (0.0104) * CHAS + (-0.0439) * NOX + (0.1869) * RM + (-0.0549) * AGE + (-0.0781) * DIS + (0.0178) * RAD + (-0.0388) * TAX + (-0.0607) * PTRATIO + (-0.0552) * LSTAT
RM (Number of Rooms) has the strongest positive effect (+0.1869) - More rooms increase house prices significantly.
NOX (Pollution), AGE (Older houses), DIS (Distance to jobs), TAX, PTRATIO, and LSTAT all have negative impacts - Higher values decrease house prices.
CRIM, ZN, INDUS, and CHAS have weak predictive influence, and their impact is not statistically significant.
Predictions Visualization¶
Next, I'll create a DataFrame with the predicted and actual values so we can compare how well the model performs, and then a scatter plot of actual vs. predicted prices to make the comparison easier to see.
# Creating a DataFrame with actual vs predicted values
predictions_df = pd.DataFrame({
"Actual MEDV_log": y_test,
"Predicted MEDV_log": y_test_pred,
"Actual MEDV": np.exp(y_test), # Convert log values back to original price
"Predicted MEDV": np.exp(y_test_pred), # Convert log predictions back
"Residuals": y_test - y_test_pred # Difference between actual and predicted
})
# Displaying the first few rows of the dataset
print("**Predictions DataFrame:**")
print(predictions_df.head())
# Scatter Plot: Actual vs. Predicted Values
plt.figure(figsize=(8, 6))
sns.scatterplot(x=predictions_df["Actual MEDV"], y=predictions_df["Predicted MEDV"], alpha=0.7)
plt.plot([min(predictions_df["Actual MEDV"]), max(predictions_df["Actual MEDV"])],
[min(predictions_df["Actual MEDV"]), max(predictions_df["Actual MEDV"])],
color='red', linestyle='--') # Perfect prediction line
plt.xlabel("Actual House Prices")
plt.ylabel("Predicted House Prices")
plt.title("Actual vs. Predicted House Prices")
plt.show()
**Predictions DataFrame:**
     Actual MEDV_log  Predicted MEDV_log  Actual MEDV  Predicted MEDV  Residuals
302         3.273364            3.357222         26.4       28.709338  -0.083858
122         3.020425            3.007894         20.5       20.244712   0.012531
228         3.843744            3.749873         46.7       42.515696   0.093871
257         3.912023            3.995833         50.0       54.371121  -0.083810
102         2.923162            3.053223         18.6       21.183506  -0.130061
In the dataframe:
We have both log-transformed and actual house prices (so we can compare results in both scales).
Residuals (errors) are relatively small, meaning our model is predicting well.
Predicted MEDV values are close to Actual MEDV values, confirming our model is performing solidly.
In the scatterplot:
Points align closely with the diagonal, meaning the model is making accurate predictions.
There are no major deviations or unusual patterns, so no systematic errors are apparent.
Performance is strong across all house price ranges.
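Since the model predicts log prices, error metrics can also be reported on the original dollar scale by exponentiating first. A sketch using just the five test rows shown in the predictions DataFrame above (the full test set would give a more reliable figure):

```python
# Sketch: MAE on the original MEDV scale, computed from the five sample rows
# of the predictions DataFrame shown above.
import numpy as np
from sklearn.metrics import mean_absolute_error

y_log_actual = np.array([3.273364, 3.020425, 3.843744, 3.912023, 2.923162])
y_log_pred = np.array([3.357222, 3.007894, 3.749873, 3.995833, 3.053223])
# Converting back from log scale to MEDV (1000s of dollars)
mae_dollars = mean_absolute_error(np.exp(y_log_actual), np.exp(y_log_pred))
print(f"MAE on the original scale: {mae_dollars:.2f} (in $1000s)")
```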
Actionable Insights and Recommendations¶
Key Factors Influencing House Prices
Rooms per dwelling (RM) has the strongest positive effect (+0.1869). More rooms = Higher house prices. Real Estate Developers should focus on properties with spacious layouts for higher value.
Pollution levels (NOX) negatively impact prices (-0.0439). Higher pollution lowers property value. Urban Planners & Environmental Agencies should work on reducing pollution in high-density areas.
Accessibility to highways (RAD) has a slight positive effect (+0.0178). Better transportation access = More valuable properties. Government Policies should invest in infrastructure near suburbs to increase demand.
Higher property tax (TAX) reduces home values (-0.0388). Heavy taxation discourages buyers, leading to lower prices. Local Authorities should evaluate tax policies to maintain a competitive housing market.
Homes farther from employment centers (DIS) are less valuable (-0.0781). Proximity to jobs = Higher demand. Companies & Businesses should consider expanding offices into suburban areas to drive up home prices.
Business Recommendations
For Real Estate Investors & Buyers:
- Target homes with more rooms (RM) & lower pollution (NOX) for long-term value. Invest in properties close to highways & job hubs (RAD, DIS).
For Urban Planners & Government Officials:
- Improve transportation networks (RAD) to increase home values.
- Reduce pollution in dense housing areas (NOX) to make locations more desirable.
For Tax Authorities & Policy Makers:
- High property tax (TAX) reduces market demand - Consider tax adjustments for affordability.