Regression Project: Boston House Price Prediction¶

Marks: 60¶

Welcome to the project on regression. We will use the Boston house price dataset for this project.


Objective¶


The problem at hand is to predict the housing prices of a town or a suburb based on the features of the locality provided to us. In the process, we need to identify the most important features affecting the price of the house. We need to employ techniques of data preprocessing and build a linear regression model that predicts the prices for the unseen data.


Dataset¶


Each record in the dataset describes a house in a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Detailed attribute information can be found below:

Attribute Information:

  • CRIM: Per capita crime rate by town
  • ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS: Proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX: Nitric Oxide concentration (parts per 10 million)
  • RM: The average number of rooms per dwelling
  • AGE: Proportion of owner-occupied units built before 1940
  • DIS: Weighted distances to five Boston employment centers
  • RAD: Index of accessibility to radial highways
  • TAX: Full-value property-tax rate per 10,000 dollars
  • PTRATIO: Pupil-teacher ratio by town
  • LSTAT: % lower status of the population
  • MEDV: Median value of owner-occupied homes in 1000 dollars

Importing the necessary libraries¶

In [31]:
# Importing libraries for data manipulation
import numpy as np
import pandas as pd

# Importing libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot
import scipy.stats as stats

# Importing libraries for building linear regression model
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression

# Importing libraries for model evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Importing library for splitting data
from sklearn.model_selection import train_test_split

# Importing library for data preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# Importing filter to ignore deprecation warnings
import warnings
warnings.filterwarnings("ignore")

Mounting Drive¶

In [2]:
# Let colab access my google drive

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Loading the dataset¶

In [3]:
# Using the pd.read_csv() function to load the dataset

df = pd.read_csv("/content/drive/MyDrive/MIT - Applied Data Science/Projects/Elective Project/Boston.csv")

Data Overview¶

  • Observations
  • Sanity checks
In [4]:
# Looking at the top 5 rows of the dataset to start building some intuition

df.head()
Out[4]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
  • Our dataset appears to contain only numerical values, with CHAS being the only categorical (dummy) variable in our data.

  • The variable MEDV is our dependent (target) variable. All the others are independent variables that influence the values of MEDV and will be used by the predictive model to generate predictions.

In [25]:
# Checking the size of the dataset using the .shape attribute

df.shape
Out[25]:
(506, 13)
  • Our dataset has 506 rows and 13 columns.
In [26]:
# Using the .info() function to get information about the data types and also to gain some intuition about missing values

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  LSTAT    506 non-null    float64
 12  MEDV     506 non-null    float64
dtypes: float64(10), int64(3)
memory usage: 51.5 KB
  • We can see that the dataset is composed of only numerical datatypes.

  • 10 columns have float and 3 columns have integers datatypes.

  • We can also conclude that all columns have 506 non-null entries, meaning there should be no missing values in the dataset.

In [27]:
# Just to double-check that there are in fact no missing values, let's use the .isnull().sum() function, which will return a count of the missing values in our data

df.isnull().sum()
Out[27]:
0
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
LSTAT 0
MEDV 0

  • As we can see there are 0 missing values in all of our columns.
In [8]:
# Just to be absolutely sure, I'm now going to run the .isna().sum() function, which will return a count of any fields that may have NaN values

df.isna().sum()
Out[8]:
0
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
LSTAT 0
MEDV 0

  • As we can observe, there are also no NaN values in our data. This will save us a lot of time, since we don't have to do any missing value treatment on the dataset.
In [9]:
# I'm now running the .duplicated().sum() function, to know if there are any duplicated records on the dataset

df.duplicated().sum()
Out[9]:
0
  • No duplicate records in the data, so there is no need to drop any rows.
In [10]:
# In order to start building some more intuition on our data, I'm now using the .describe() function, which will return a statistical summary of our columns

df.describe().T
Out[10]:
count mean std min 25% 50% 75% max
CRIM 506.0 3.613524 8.601545 0.00632 0.082045 0.25651 3.677083 88.9762
ZN 506.0 11.363636 23.322453 0.00000 0.000000 0.00000 12.500000 100.0000
INDUS 506.0 11.136779 6.860353 0.46000 5.190000 9.69000 18.100000 27.7400
CHAS 506.0 0.069170 0.253994 0.00000 0.000000 0.00000 0.000000 1.0000
NOX 506.0 0.554695 0.115878 0.38500 0.449000 0.53800 0.624000 0.8710
RM 506.0 6.284634 0.702617 3.56100 5.885500 6.20850 6.623500 8.7800
AGE 506.0 68.574901 28.148861 2.90000 45.025000 77.50000 94.075000 100.0000
DIS 506.0 3.795043 2.105710 1.12960 2.100175 3.20745 5.188425 12.1265
RAD 506.0 9.549407 8.707259 1.00000 4.000000 5.00000 24.000000 24.0000
TAX 506.0 408.237154 168.537116 187.00000 279.000000 330.00000 666.000000 711.0000
PTRATIO 506.0 18.455534 2.164946 12.60000 17.400000 19.05000 20.200000 22.0000
LSTAT 506.0 12.653063 7.141062 1.73000 6.950000 11.36000 16.955000 37.9700
MEDV 506.0 22.532806 9.197104 5.00000 17.025000 21.20000 25.000000 50.0000
  • CRIM (Per capita crime rate) - A mean value of 3.61 but a max value of 88.97 suggests that some towns probably have extreme crime rates. These may generate potential outliers in our data.

  • ZN (Proportion of residential land zoned for lots over 25,000 sq.ft.) - Although the mean is 11.36, the 25th and 50th percentiles are 0. This implies that many suburbs have no large residential lots. This also has the potential of generating outliers in the data.

  • RAD (Index of accessibility to radial highways) - With a mean of 9.55 and a 75th percentile of 24, accessibility appears bimodal: towns are either very accessible or not.

  • TAX (Full-value property-tax rate) - The mean value is 408 while the max value is 711, indicating high variation in tax rates between towns.

  • LSTAT (% lower status of the population) - This feature may be highly correlated with our target variable: the higher the LSTAT, the lower the house price. I'll analyse this further with a correlation matrix.

  • RM (Average number of rooms per dwelling) - With a mean of 6.28 in a range between 3.56 and 8.78, this feature tells us that most houses have several rooms. It may also be highly correlated with our target, MEDV.

  • PTRATIO (Pupil-teacher ratio) - Another variable that may have an important influence on MEDV: a higher PTRATIO means fewer teachers per student, which tends to lower house prices.

  • NOX (Nitric Oxide concentration) - This variable might also negatively affect housing prices.

  • DIS (Weighted distances to five Boston employment centers) - A higher value means the suburb is farther from employment centers, which likely has a negative effect on house prices.

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What does the distribution of 'MEDV' look like?
  2. What can we infer from the correlation heatmap? Is there correlation between the dependent and independent variables?
  3. What are all the inferences that can be found by doing univariate analysis for different variables?
  4. Do bivariate analysis to visualize the relationship between the features having significant correlations (>= 0.7 or <= -0.7)

1. What does the distribution of 'MEDV' look like?¶

In [37]:
# So let's start the EDA, plotting a histogram for our target variable, MEDV, to check how its data is distributed
# I'll focus on the distribution type (normal or skewed) and extreme values that can be potential outliers

sns.histplot(df['MEDV'], kde=True, bins=30)
plt.title("Distribution of MEDV (House Prices)")
plt.xlabel("MEDV (in $1000)")
plt.ylabel("Frequency")
plt.show()
  • We can't say that our target variable is normally distributed, since it has a tail to the right, making it a right-skewed distribution.

  • I shall perform a log transformation on this feature and then plot it again to confirm that the distribution is closer to normal.

In [5]:
# Using the np.log() function to transform the variable distribution

df['MEDV_log'] = np.log(df['MEDV'])

sns.histplot(df['MEDV_log'], kde=True, bins=30)
plt.title("Distribution of MEDV_log (House Prices)")
plt.xlabel("MEDV_log (in $1000)")
plt.ylabel("Frequency")
plt.show()
  • Now our target variable has a more normal-looking distribution without such a long tail. It's not perfectly normal, but I'll assume it will be sufficient for our analysis.

  • I'm now removing the original MEDV field from our dataset so our data doesn't become redundant and influence our further analysis.

2. What can we infer from the correlation heatmap? Is there correlation between the dependent and independent variables?¶

In [6]:
# Now I'm plotting a heatmap to analyse the feature correlations and check relationships between them
# I'll focus on which independent variables have a high correlation with our target variable MEDV
# Check for multicollinearity presence on our features

plt.figure(figsize=(12,8))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()

The heatmap shows strong correlations between MEDV and:

  • RM (+0.7, positive): More rooms typically mean higher house prices.
  • LSTAT (-0.74, negative): Higher percentage of lower-status population leads to lower house prices.
  • PTRATIO (-0.51, negative): A higher student-to-teacher ratio negatively impacts house prices.
  • NOX (-0.43, negative): Higher pollution levels slightly reduce house prices.

There is also significant multicollinearity between some independent variables:

  • INDUS, NOX, and TAX have high positive correlations (above 0.7).
  • DIS has a strong negative correlation with NOX (-0.77) and AGE (-0.75), meaning suburbs farther from employment centers tend to be newer and have lower pollution.
  • RAD and TAX (correlation: +0.91): This suggests that areas with high accessibility to highways often have higher property taxes.

We'll need to handle multicollinearity before building our regression model, by removing redundant features or using regularization techniques like Ridge/Lasso.
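As a sketch of the regularization route, Ridge shrinks the unstable coefficients that collinear features produce. This toy example uses synthetic data (not the Boston dataset) purely to illustrate the effect:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with two nearly collinear predictors (illustrative only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS splits the effect unstably between the collinear columns,
# while Ridge shrinks the coefficients toward a stable shared solution
print("OLS coefficients  :", ols.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))
```

Both models recover the combined effect of about 3, but Ridge distributes it evenly across the duplicated columns instead of letting the individual coefficients blow up.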

Univariate Analysis¶

3. What are all the inferences that can be found by doing univariate analysis for different variables?¶

In [40]:
# Now I'm going to analyse the variables independently so I'm able to do some inferences
# I'm focusing on checking the distribution in the data as well as trying to detect potential outliers present in each feature

features = df.columns

for feature in features:
    plt.figure(figsize=(12, 4))


    plt.subplot(1, 2, 1)
    sns.histplot(df[feature], kde=True, bins=30)
    plt.title(f"Distribution of {feature}")


    plt.subplot(1, 2, 2)
    sns.boxplot(y=df[feature])
    plt.title(f"Boxplot of {feature}")

    plt.show()
  • CRIM: Shows outliers with very high crime values in some towns.
  • ZN: Most values are 0, suggesting many towns don’t have large residential lots.
  • TAX: Wide range, indicating large tax differences between towns.
  • RM: Mostly between 5-7 rooms, but some larger houses exist.
  • LSTAT: Right-skewed, with some areas having very high poverty levels.

I might have to handle outliers for variables like CRIM, ZN, and TAX.
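One common option (not the only one) is the 1.5×IQR rule. A minimal sketch on a toy series; the same logic could be applied to CRIM, ZN, or TAX:

```python
import pandas as pd

# Toy series with one obvious outlier (illustrative, not the Boston data)
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # → [100]
```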

Bivariate Analysis¶

4. Do bivariate analysis to visualize the relationship between the features having significant correlations (>= 0.7 or <= -0.7)¶

In [41]:
# I'm now plotting Scatter plots between the variables that show a strong correlation
# Focusing on positive and negative correlations with values between >= 0.7 or <= -0.7

sns.pairplot(df, vars=['MEDV', 'RM', 'LSTAT', 'NOX', 'DIS', 'INDUS', 'TAX', 'AGE', 'RAD'])
plt.show()

Scatter plots confirm strong relationships:

  • RM vs. MEDV (positive correlation, +0.70): More rooms → higher price.
  • LSTAT vs. MEDV (negative correlation, -0.74): Higher lower-status % → lower price.
  • NOX vs. DIS (negative correlation, -0.77): Closer to industrial areas → higher pollution.
  • TAX vs. RAD (strong positive correlation, +0.91): Towns with higher highway accessibility tend to have higher property taxes.
  • AGE vs. DIS (negative correlation, -0.75): Older houses are usually closer to city centers.

Data Preprocessing¶

  • Missing value treatment
  • Log transformation of dependent variable if skewed
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

Outlier Treatment¶

As we have observed some outliers in our data that tend to inflate the relationship between variables, I'm now removing these extreme values so they don't distort our analysis. After removing the outliers, I'll check the correlation again to see if it has decreased.

In [8]:
# Importing the required function

from scipy.stats import pearsonr

# Removing the outliers in TAX

df = df[df["TAX"] < 600]

# Checking the new correlation between TAX and RAD

corr_tax_rad = pearsonr(df["TAX"], df["RAD"])[0]
print(f"The correlation between TAX and RAD after outlier removal: {corr_tax_rad:.4f}")
The correlation between TAX and RAD after outlier removal: 0.2498
  • We can confirm that the correlation between TAX and RAD has dropped significantly. The extremely high tax rates might be due to some reason other than accessibility to highways.

Splitting the Dataset¶

In [9]:
# Defining the independent variables (X) and target variable (Y)

Y = df['MEDV_log']

X = df.drop(columns = {'MEDV', 'MEDV_log'})

# Add the intercept term

X = sm.add_constant(X)

# Splitting the data into train and test (70% train, 30% test)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1)

Handling Multicollinearity¶

We've observed that multicollinearity is present in our data. I'll now handle this issue by calculating the Variance Inflation Factor (VIF) and dropping features that score higher than 5.

In [10]:
# Defining the function to calculate VIF

def calculate_vif(train):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = train.columns
    vif_data["VIF"] = [variance_inflation_factor(train.values, i) for i in range(len(train.columns))]
    return vif_data

# Displaying VIF scores for training data

print(calculate_vif(X_train))
    Feature         VIF
0     const  708.370336
1      CRIM    2.872608
2        ZN    2.583947
3     INDUS    2.964985
4      CHAS    1.104334
5       NOX    5.631089
6        RM    2.566053
7       AGE    2.718199
8       DIS    3.433860
9       RAD    1.196240
10      TAX    1.661921
11  PTRATIO    1.653781
12    LSTAT    3.052094
  • Analysing the VIF scores, we can see that only the NOX feature is higher than 5. Since its value is very close to our threshold, I'm opting to keep it in the dataset, as it can be useful for prediction.

Feature Scaling¶

Since we're building a regression model, I'm scaling the features using Z-score normalization to make the coefficients more interpretable and to ensure that our features are centered with unit variance, which should also bring stability to our model.

In [14]:
# Initializing StandardScaler

scaler = StandardScaler()

# Applying scaling only on the independent variables (X_train and X_test)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Converting back to DataFrame to retain feature names

X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# Display a preview of scaled data

print("Scaled Training Data:\n", X_train_scaled.head())
Scaled Training Data:
    const      CRIM        ZN     INDUS     CHAS       NOX        RM       AGE  \
0    0.0 -0.518552  1.079000 -0.836244 -0.30532 -0.713256  0.543155 -1.080120   
1    0.0 -0.561877  0.325751 -0.589128 -0.30532 -0.820128  0.525468 -0.984791   
2    0.0 -0.537923 -0.615810 -0.841465 -0.30532 -0.208043  0.938161  0.879417   
3    0.0  0.012980 -0.615810 -0.355935 -0.30532 -0.062308  1.168089  0.653453   
4    0.0 -0.523104 -0.615810 -0.871049 -0.30532 -0.489797 -0.740612 -1.256654   

        DIS       RAD       TAX   PTRATIO     LSTAT  
0  0.937553  0.329909  1.237076 -1.162043 -0.987817  
1  0.424121 -0.286878 -0.520378  0.552429 -0.891426  
2 -0.517479 -1.520452 -0.685608  0.011017 -0.857689  
3 -0.615493  2.180271 -0.129832 -0.169454 -0.719530  
4  0.335576 -0.286878  1.717747 -0.395042 -0.139580  

Now that we've successfully scaled our data we can observe that:

  • The mean values are centered - most values are around 0.
  • Unit variance is achieved and the values are now standardized.
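These two properties can be sanity-checked directly; a quick sketch on a toy matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy check: StandardScaler should give each column zero mean and unit variance
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # ≈ [0. 0.]
print(X_scaled.std(axis=0))   # ≈ [1. 1.]
```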

Model Building - Linear Regression¶

In [15]:
# I'm first training a linear regression model using sklearn's LinearRegression() to make the predictions


# Initializing the Linear Regression model

lin_reg = LinearRegression()

# Training the model on the scaled training data

lin_reg.fit(X_train_scaled, y_train)

# Making predictions on the training and test sets

y_train_pred = lin_reg.predict(X_train_scaled)
y_test_pred = lin_reg.predict(X_test_scaled)
In [16]:
# I'll now train the sm.OLS() model to check the statistical summary of the model

# Ensuring indices are aligned

y_train_aligned = y_train.reset_index(drop=True)
X_train_aligned = X_train_scaled.drop(columns=["const"], errors="ignore").reset_index(drop=True)  # Remove 'const' if present

# Fitting the OLS model using the aligned training data

ols_model = sm.OLS(y_train_aligned, sm.add_constant(X_train_aligned)).fit()

# Displaying the full statistical summary

print(ols_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               MEDV_log   R-squared:                       0.869
Model:                            OLS   Adj. R-squared:                  0.862
Method:                 Least Squares   F-statistic:                     134.9
Date:                Mon, 17 Feb 2025   Prob (F-statistic):          1.44e-100
Time:                        23:09:19   Log-Likelihood:                 193.70
No. Observations:                 258   AIC:                            -361.4
Df Residuals:                     245   BIC:                            -315.2
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.1531      0.007    432.132      0.000       3.139       3.167
CRIM          -0.0077      0.012     -0.619      0.536      -0.032       0.017
ZN             0.0115      0.012      0.977      0.330      -0.012       0.035
INDUS          0.0090      0.013      0.713      0.476      -0.016       0.034
CHAS           0.0104      0.008      1.357      0.176      -0.005       0.026
NOX           -0.0439      0.017     -2.534      0.012      -0.078      -0.010
RM             0.1869      0.012     15.994      0.000       0.164       0.210
AGE           -0.0549      0.012     -4.566      0.000      -0.079      -0.031
DIS           -0.0781      0.014     -5.779      0.000      -0.105      -0.052
RAD            0.0178      0.008      2.233      0.026       0.002       0.034
TAX           -0.0388      0.009     -4.128      0.000      -0.057      -0.020
PTRATIO       -0.0607      0.009     -6.473      0.000      -0.079      -0.042
LSTAT         -0.0552      0.013     -4.328      0.000      -0.080      -0.030
==============================================================================
Omnibus:                       15.519   Durbin-Watson:                   1.965
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               18.382
Skew:                           0.497   Prob(JB):                     0.000102
Kurtosis:                       3.849   Cond. No.                         6.20
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Key insights from model performance

  • R² Score is now 0.869 - This means our model now explains ~86.9% of the variance in house prices.

  • Adjusted R² is now 0.862 - Also excellent, indicating our features are really adding value to the model.

  • F-statistic: 134.9, p-value ≈ 0 - The model as a whole is highly significant.

  • Durbin-Watson = 1.965 → No autocorrelation issues

  • Condition Number = 6.20 → Multicollinearity is now well-controlled

Key insights from p-values

What do the p-values tell us? A p-value < 0.05 means the feature is statistically significant (it contributes meaningfully to the model).

Highly significant features (p < 0.05, strong impact on house prices):

  • RM (+0.1869, p = 0.000) - More rooms increase house price significantly.
  • AGE (-0.0549, p = 0.000) - Older houses decrease house price significantly.
  • DIS (-0.0781, p = 0.000) - Farther distances from job centers lower house prices.
  • TAX (-0.0388, p = 0.000) - Higher property taxes reduce house prices.
  • PTRATIO (-0.0607, p = 0.000) - Higher student-to-teacher ratio decreases prices.
  • LSTAT (-0.0552, p = 0.000) - Higher lower-status population percentage lowers prices.
  • NOX (-0.0439, p = 0.012) - Higher pollution negatively affects house prices.
  • RAD (+0.0178, p = 0.026) - Surprisingly, accessibility to highways slightly increases prices.

Not significant (p > 0.05, weak or no impact on house prices)

  • CRIM (-0.0077, p = 0.536) - Crime rate has no statistically significant effect.
  • ZN (+0.0115, p = 0.330) - Zoning has no statistically significant effect.
  • INDUS (+0.0090, p = 0.476) - Industrial land zoning has no statistically significant effect.
  • CHAS (+0.0104, p = 0.176) - Surprisingly being near the Charles River has no significant effect.

I'll now try to tune the model by training without the features that have a low impact on the prediction, to see if there are any improvements in our model.

In [17]:
# Defining the features to keep (removing CRIM, ZN, INDUS, CHAS)

X_train_refined = X_train_scaled.drop(columns=["CRIM", "ZN", "INDUS", "CHAS"])
X_test_refined = X_test_scaled.drop(columns=["CRIM", "ZN", "INDUS", "CHAS"])

# Ensuring indices are aligned

y_train_aligned = y_train.reset_index(drop=True)
X_train_refined = X_train_refined.reset_index(drop=True)

# Fitting the refined OLS model

ols_model_refined = sm.OLS(y_train_aligned, sm.add_constant(X_train_refined)).fit()

# Displaying the new OLS summary

print(ols_model_refined.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               MEDV_log   R-squared:                       0.866
Model:                            OLS   Adj. R-squared:                  0.862
Method:                 Least Squares   F-statistic:                     202.0
Date:                Mon, 17 Feb 2025   Prob (F-statistic):          3.00e-104
Time:                        20:38:49   Log-Likelihood:                 191.68
No. Observations:                 258   AIC:                            -365.4
Df Residuals:                     249   BIC:                            -333.4
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.1531      0.007    432.235      0.000       3.139       3.167
const       -2.58e-17   5.07e-18     -5.088      0.000   -3.58e-17   -1.58e-17
NOX           -0.0473      0.014     -3.497      0.001      -0.074      -0.021
RM             0.1888      0.011     16.655      0.000       0.167       0.211
AGE           -0.0569      0.012     -4.812      0.000      -0.080      -0.034
DIS           -0.0775      0.012     -6.619      0.000      -0.101      -0.054
RAD            0.0157      0.008      2.058      0.041       0.001       0.031
TAX           -0.0360      0.008     -4.320      0.000      -0.052      -0.020
PTRATIO       -0.0638      0.009     -7.362      0.000      -0.081      -0.047
LSTAT         -0.0533      0.013     -4.242      0.000      -0.078      -0.029
==============================================================================
Omnibus:                       12.087   Durbin-Watson:                   2.003
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               13.670
Skew:                           0.426   Prob(JB):                      0.00108
Kurtosis:                       3.738   Cond. No.                     2.80e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.05e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

As we can see from the summary statistics, there are no significant improvements in the new model:

  • R² decreased (Original: 0.869, Refined: 0.866), while Adjusted R² stayed at 0.862.

  • p-values & significance remain almost identical.

  • The refined model has multicollinearity warnings.

Since the refined model did not improve performance, we're keeping our first model.

Model Performance Check¶

  1. How is the model performing? Check using R-squared, RMSE, MAE, and MAPE
  2. Is there multicollinearity? Check using VIF
  3. How is the model performing after cross-validation?

1. How is the model performing? Check using R-squared, RMSE, MAE, and MAPE¶

In [20]:
# Evaluating model performance with Rsquared

train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("**Linear Regression Model Performance with Rsquared**")
print(f"R² Score (Train): {train_r2:.4f}")
print(f"R² Score (Test) : {test_r2:.4f}")
**Linear Regression Model Performance with Rsquared**
R² Score (Train): 0.8686
R² Score (Test) : 0.8567

R² Score (Train = 0.8686, Test = 0.8567)

  • This means our model explains ~86% of the variance in house prices.

  • The fact that train and test R² are very close means no overfitting.

In [21]:
# Evaluating model performance with MAE

train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)

print("**Linear Regression Model Performance with MAE**")
print(f"Mean Absolute Error (Train): {train_mae:.4f}")
print(f"Mean Absolute Error (Test) : {test_mae:.4f}")
**Linear Regression Model Performance with MAE**
Mean Absolute Error (Train): 0.0858
Mean Absolute Error (Test) : 0.0788

Mean Absolute Error (Train = 0.0858, Test = 0.0788)

  • On average, our predictions are only ~0.08 log-units off from actual values. That's a small error margin, meaning our model is making very accurate predictions.
In [22]:
# Evaluating model performance with RMSE

train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("**Linear Regression Model Performance with RMSE**")
print(f"Root Mean Squared Error (Train): {train_rmse:.4f}")
print(f"Root Mean Squared Error (Test) : {test_rmse:.4f}")
**Linear Regression Model Performance with RMSE**
Root Mean Squared Error (Train): 0.1142
Root Mean Squared Error (Test) : 0.1051

Root Mean Squared Error (Train = 0.1142, Test = 0.1051)

  • Since RMSE penalizes larger errors more than MAE, this confirms that our model isn't making extreme mistakes.

  • Train and Test RMSE are close, again proving no overfitting.
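A tiny numeric example shows why RMSE penalizes large errors more than MAE: two prediction sets with the same total absolute error but a different spread:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 10.0, 10.0, 10.0])

# Same total error (4 units): spread evenly vs concentrated in one large miss
even = np.array([11.0, 11.0, 11.0, 11.0])   # four errors of 1
spike = np.array([14.0, 10.0, 10.0, 10.0])  # one error of 4

print(mean_absolute_error(y_true, even), mean_absolute_error(y_true, spike))
# → 1.0 1.0  (MAE can't tell them apart)
print(np.sqrt(mean_squared_error(y_true, even)), np.sqrt(mean_squared_error(y_true, spike)))
# → 1.0 2.0  (RMSE doubles for the concentrated error)
```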

In [23]:
# Evaluating model performance in MAPE

# Defining a function to calculate MAPE

def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # Convert to percentage

# Calculating MAPE for train and test sets

train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
test_mape = mean_absolute_percentage_error(y_test, y_test_pred)

print("**Linear Regression Model Performance with MAPE**")
print(f"Mean Absolute Percentage Error (Train): {train_mape:.2f}%")
print(f"Mean Absolute Percentage Error (Test) : {test_mape:.2f}%")
**Linear Regression Model Performance with MAPE**
Mean Absolute Percentage Error (Train): 2.77%
Mean Absolute Percentage Error (Test) : 2.52%

MAPE Score (Train = 2.77%, Test = 2.52%)

  • Our model's average prediction error is only ~2.5% to 2.8%.

  • This shows our model is not only well-fitted but also generalizing extremely well to unseen data.
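One caveat: these MAPE values are computed on log prices, so they understate the percentage error on the original dollar scale. A small self-contained sketch with toy values (standing in for y_test and y_test_pred) illustrates the difference:

```python
import numpy as np

def mape(y_true, y_pred):
    # Mean absolute percentage error, expressed as a percentage
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Toy log-scale values standing in for the notebook's y_test / y_test_pred
y_log_true = np.array([3.0, 3.2, 2.9])
y_log_pred = np.array([3.05, 3.15, 2.95])

# MAPE on the log scale vs. on the back-transformed price scale
log_scale = mape(y_log_true, y_log_pred)
price_scale = mape(np.exp(y_log_true), np.exp(y_log_pred))
print(f"log-scale MAPE: {log_scale:.2f}%  price-scale MAPE: {price_scale:.2f}%")
```

The price-scale error is noticeably larger, so the ~2.5% figure should not be read as a 2.5% error in dollars.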

2. Is there multicollinearity? Check using VIF¶

In [25]:
# Double-checking multicollinearity using VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Computing VIF on the final feature set

final_features = X_train_scaled.columns
vif_data = pd.DataFrame()
vif_data["Feature"] = final_features
vif_data["VIF"] = [variance_inflation_factor(X_train_scaled.values, i) for i in range(len(final_features))]

# Display VIF scores

print("**Final Multicollinearity Check (VIF Scores):**")
print(vif_data)
**Final Multicollinearity Check (VIF Scores):**
    Feature       VIF
0     const       NaN
1      CRIM  2.872608
2        ZN  2.583947
3     INDUS  2.964985
4      CHAS  1.104334
5       NOX  5.631089
6        RM  2.566053
7       AGE  2.718199
8       DIS  3.433860
9       RAD  1.196240
10      TAX  1.661921
11  PTRATIO  1.653781
12    LSTAT  3.052094
  • All VIF values are below 5, except NOX (5.63), which is only slightly above the threshold; I've decided to keep it to preserve the model's predictive power. (The NaN for the constant term is expected, since the intercept is not a predictor.)

  • No extreme multicollinearity detected, meaning our features are independent enough for regression.

  • The model is stable, and there's no need to remove any features.
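For reference, the VIF of feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features. A self-contained sketch with synthetic data (not the Boston features) showing how a collinear column inflates the score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Make column 2 nearly a linear function of column 0 (strong collinearity)
X[:, 2] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=200)

def vif(X, j):
    # Regress column j on the remaining columns; VIF_j = 1 / (1 - R_j^2)
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

for j in range(X.shape[1]):
    print(f"feature {j}: VIF = {vif(X, j):.2f}")
```

The collinear column gets a VIF well above 5, while the independent column stays near 1, matching the rule of thumb used above.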

3. How is the model performing after cross-validation?¶

In [26]:
# Cross-Validation performance check

from sklearn.model_selection import cross_val_score

# Initializing Linear Regression model

lin_reg = LinearRegression()

# Performing 5-Fold Cross-Validation (scoring based on R²)

cv_scores = cross_val_score(lin_reg, X_train_scaled, y_train, cv=5, scoring='r2')

# Displaying cross-validation results

print("**Cross-Validation Results:**")
print(f"Mean R² Score: {np.mean(cv_scores):.4f}")
print(f"Standard Deviation of R²: {np.std(cv_scores):.4f}")
print(f"All R² Scores: {cv_scores}")
**Cross-Validation Results:**
Mean R² Score: 0.8497
Standard Deviation of R²: 0.0148
All R² Scores: [0.85566448 0.82114747 0.85066249 0.85944711 0.86178792]
  • Mean R² Score = 0.8497 - This is very close to our original Test R² (~0.86), confirming our model is consistent and reliable.

  • Low Standard Deviation (0.0148) - This means minimal variation across different splits, indicating our model is stable and not sensitive to a specific training split.

  • All R² Scores are in the 0.82 - 0.86 range - No major outliers, meaning the model performs well across all folds of cross-validation.
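The same cross-validation machinery can score on RMSE instead of R². A hedged sketch using synthetic data in place of X_train_scaled / y_train; note that sklearn returns the negated RMSE, since its scorers treat higher as better:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data standing in for the notebook's X_train_scaled / y_train
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# scoring='neg_root_mean_squared_error' returns negated RMSE per fold
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_root_mean_squared_error')
rmse_per_fold = -scores
print(f"Mean CV RMSE: {rmse_per_fold.mean():.2f} ± {rmse_per_fold.std():.2f}")
```

Comparing the cross-validated RMSE to the single train/test RMSE gives another check on stability.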

Checking Linear Regression Assumptions¶

  • In order to make statistical inferences from a linear regression model, it is important to ensure that the assumptions of linear regression are satisfied.

Checking the mean of the residuals¶

In [27]:
# We're aiming for a mean value close to 0

residuals = y_test - y_test_pred
mean_residuals = np.mean(residuals)
print(f"Mean of Residuals: {mean_residuals:.6f}")
Mean of Residuals: -0.005186

The mean of the residuals (-0.005186) is effectively zero, confirming that the model has no systematic bias: predictions are evenly distributed around the actual values.

This assumption is satisfied.

Homoscedasticity Check¶

In [28]:
# Plotting residuals vs. predicted values

plt.figure(figsize=(8,5))
sns.scatterplot(x=y_test_pred, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Values (Homoscedasticity Check)")
plt.show()
(Figure: Residuals vs. Predicted Values scatter plot with a dashed zero line)
  • Residuals are randomly scattered around the 0 line across all predicted values.

  • No clear pattern or funnel shape appears, meaning the variance is constant across all levels of predictions.

  • Confirms that our model satisfies the assumption of homoscedasticity.

Linearity of Variables¶

In [33]:
# Plotting a Q-Q plot to verify whether residuals align with a normal distribution

plt.figure(figsize=(8,5))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q Plot of Residuals")
plt.show()
(Figure: Q-Q Plot of Residuals)
  • Residuals follow the diagonal line closely, suggesting that the errors are close to normally distributed.

  • The slight S-shape indicates minor deviations from normality, which is not uncommon in real-world data.

  • This assumption can be considered satisfied.

Normality of Error Terms¶

In [34]:
# Plotting a histogram to check if the errors follow a normal distribution

plt.figure(figsize=(8,5))
sns.histplot(residuals, bins=30, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Histogram of Residuals (Normality Check)")
plt.show()
(Figure: Histogram of Residuals with KDE overlay)
  • The roughly bell-shaped distribution confirms that the residuals are approximately normal.

  • The slight left skew is not a major issue, especially since our target variable (MEDV_log) had a similar distribution.

  • This confirms that our model meets the normality assumption, and with it all the assumptions are satisfied.
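A formal complement to the histogram is the Shapiro-Wilk test from scipy. A sketch on toy residuals (standing in for the model's residuals); a p-value above 0.05 means normality cannot be rejected:

```python
import numpy as np
from scipy import stats

# Toy residuals standing in for the notebook's residuals
rng = np.random.default_rng(7)
residuals = rng.normal(loc=0.0, scale=0.11, size=150)

# Shapiro-Wilk returns (test statistic, p-value)
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic: {stat:.4f}, p-value: {p_value:.4f}")
```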

Final Model¶

In [35]:
# Extracting coefficients from the OLS model

coef = ols_model.params

# Building and printing the regression equation

equation = "log (Price) ="
print(equation, end=" ")
for i in range(len(coef)):
    if i == len(coef) - 1:  # No '+' after the last term
        print(f"({coef.iloc[i]:.4f}) * {coef.index[i]}", end=" ")
    else:
        print(f"({coef.iloc[i]:.4f}) * {coef.index[i]} +", end=" ")
log (Price) = (3.1531) * const + (-0.0077) * CRIM + (0.0115) * ZN + (0.0090) * INDUS + (0.0104) * CHAS + (-0.0439) * NOX + (0.1869) * RM + (-0.0549) * AGE + (-0.0781) * DIS + (0.0178) * RAD + (-0.0388) * TAX + (-0.0607) * PTRATIO + (-0.0552) * LSTAT 

Model Equation¶

log (Price) = (3.1531) * const + (-0.0077) * CRIM + (0.0115) * ZN + (0.0090) * INDUS + (0.0104) * CHAS + (-0.0439) * NOX + (0.1869) * RM + (-0.0549) * AGE + (-0.0781) * DIS + (0.0178) * RAD + (-0.0388) * TAX + (-0.0607) * PTRATIO + (-0.0552) * LSTAT

  • RM (Number of Rooms) has the strongest positive effect (+0.1869) - More rooms increase house prices significantly.

  • NOX (Pollution), AGE (Older houses), DIS (Distance to jobs), TAX, PTRATIO, and LSTAT all have negative impacts - Higher values decrease house prices.

  • CRIM, ZN, INDUS, and CHAS have weak predictive influence, and their impact isn't statistically significant.
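To use the fitted equation directly, take the dot product of the coefficients with a scaled feature vector and exponentiate. A sketch with the coefficients above and a purely hypothetical scaled feature vector; real predictions require applying the exact scaler fitted on the training data:

```python
import numpy as np

# Coefficients from the fitted equation, in order:
# const, CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, LSTAT
coefs = np.array([3.1531, -0.0077, 0.0115, 0.0090, 0.0104, -0.0439, 0.1869,
                  -0.0549, -0.0781, 0.0178, -0.0388, -0.0607, -0.0552])

# Hypothetical scaled feature vector (const = 1, then 12 scaled features)
x = np.array([1.0, -0.5, 0.0, -0.3, 0.0, -0.4, 1.2, -0.6, 0.1,
              -0.2, -0.3, -0.5, -0.9])

log_price = coefs @ x
price = np.exp(log_price)  # back-transform to $1000s of dollars
print(f"log(Price) = {log_price:.4f}  ->  MEDV = {price:.1f} (in $1000s)")
```

This makes the back-transform explicit: the model predicts log price, and exp() recovers the price in the original units.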

Predictions Visualization¶

Next, I'll create a new dataframe with the predicted and actual values so we can compare how well our model performs, along with a scatter plot to visualize both.

In [36]:
# Creating a DataFrame with actual vs predicted values

predictions_df = pd.DataFrame({
    "Actual MEDV_log": y_test,
    "Predicted MEDV_log": y_test_pred,
    "Actual MEDV": np.exp(y_test),  # Convert log values back to original price
    "Predicted MEDV": np.exp(y_test_pred),  # Convert log predictions back
    "Residuals": y_test - y_test_pred  # Difference between actual and predicted
})

# Displaying the first few rows of the dataset

print("**Predictions DataFrame:**")
print(predictions_df.head())

# Scatter Plot: Actual vs. Predicted Values

plt.figure(figsize=(8, 6))
sns.scatterplot(x=predictions_df["Actual MEDV"], y=predictions_df["Predicted MEDV"], alpha=0.7)
plt.plot([min(predictions_df["Actual MEDV"]), max(predictions_df["Actual MEDV"])],
         [min(predictions_df["Actual MEDV"]), max(predictions_df["Actual MEDV"])],
         color='red', linestyle='--')  # Perfect prediction line
plt.xlabel("Actual House Prices")
plt.ylabel("Predicted House Prices")
plt.title("Actual vs. Predicted House Prices")
plt.show()
**Predictions DataFrame:**
     Actual MEDV_log  Predicted MEDV_log  Actual MEDV  Predicted MEDV  \
302         3.273364            3.357222         26.4       28.709338   
122         3.020425            3.007894         20.5       20.244712   
228         3.843744            3.749873         46.7       42.515696   
257         3.912023            3.995833         50.0       54.371121   
102         2.923162            3.053223         18.6       21.183506   

     Residuals  
302  -0.083858  
122   0.012531  
228   0.093871  
257  -0.083810  
102  -0.130061  
(Figure: Actual vs. Predicted House Prices with the perfect-prediction diagonal)

In the dataframe:

  • We have both log-transformed and actual house prices (so we can compare results in both scales).

  • Residuals (errors) are relatively small, meaning our model is predicting well.

  • Predicted MEDV values are close to Actual MEDV values, confirming our model is performing solidly.

In the scatterplot:

  • Points align closely with the diagonal, meaning the model is making accurate predictions.

  • There are no major deviations or unusual patterns, i.e., no systematic errors.

  • Performance is strong across all house price ranges.

Actionable Insights and Recommendations¶

Key Factors Influencing House Prices

  • Rooms per dwelling (RM) has the strongest positive effect (+0.1869). More rooms = Higher house prices. Real Estate Developers should focus on properties with spacious layouts for higher value.

  • Pollution levels (NOX) negatively impact prices (-0.0439). Higher pollution lowers property value. Urban Planners & Environmental Agencies should work on reducing pollution in high-density areas.

  • Accessibility to highways (RAD) has a slight positive effect (+0.0178). Better transportation access = More valuable properties. Government Policies should invest in infrastructure near suburbs to increase demand.

  • Higher property tax (TAX) reduces home values (-0.0388). Heavy taxation discourages buyers, leading to lower prices. Local Authorities should evaluate tax policies to maintain a competitive housing market.

  • Homes farther from employment centers (DIS) are less valuable (-0.0781). Proximity to jobs = Higher demand. Companies & Businesses should consider expanding offices into suburban areas to drive up home prices.

Business Recommendations

For Real Estate Investors & Buyers:

  • Target homes with more rooms (RM) & lower pollution (NOX) for long-term value. Invest in properties close to highways & job hubs (RAD, DIS).

For Urban Planners & Government Officials:

  • Improve transportation networks (RAD) to increase home values.
  • Reduce pollution in dense housing areas (NOX) to make locations more desirable.

For Tax Authorities & Policy Makers:

  • High property tax (TAX) reduces market demand - Consider tax adjustments for affordability.