Used Cars Price Prediction¶

Problem Definition¶

The Context:¶

There's great demand for used cars in the Indian market: while sales of new cars have slowed down recently, the used car market has continued to grow and has surpassed the new car market. Cars4U is a tech start-up whose objective is to identify gaps in this market and take advantage of them to increase sales.

For a quick look at the numbers: the new car market sold 3.6 million units in 2018-19, compared to around 4 million units in the used car market.

The bigger challenge in this growing used car market is determining the price of the vehicles: unlike new cars, whose values are determined and managed by OEMs, used car prices carry large uncertainties because of variables, like mileage, year, and ownership, that influence the value set for every car. It's also difficult to predict and guarantee the supply. Coming up with a solution that facilitates pricing is of pivotal importance, not only for owners but for dealers as well.

The objective:¶

Build a pricing model that effectively predicts the price of used cars and that can help our business come up with profitable strategies using differential pricing.

The key questions:¶

  • How do the multiple variables affect the price of the cars?
  • For predicting the price, can we rule out some of the variables?
  • What are the most important features for predicting the price?
  • What are the least relevant features for the prediction?

The problem formulation:¶

Our goal is to develop a robust predictive model that can:

  • Estimate the price of a used car based on its features (age, mileage, brand, etc.).
  • Provide a data-driven pricing strategy for sellers and dealerships.
  • Reduce uncertainty in used car pricing, making the market more transparent.

We will explore multiple machine learning techniques to identify the best-performing model based on key evaluation metrics such as R² Score & RMSE, ensuring accurate and reliable price predictions.
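As a quick illustration of the two metrics named above, here is a minimal sketch on toy numbers (the price values are invented placeholders, not model output); the formulas are the same ones sklearn's mean_squared_error and r2_score implement:

```python
import numpy as np

# Toy vectors standing in for actual and predicted prices (illustrative only).
y_true = np.array([4.5, 12.5, 6.0, 17.74])
y_pred = np.array([5.0, 11.8, 6.5, 16.9])

# RMSE: root of the mean squared error, in the same units as the price.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R²: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```

RMSE penalizes large errors more heavily, while R² reports the share of price variance the model explains; reporting both gives a fuller picture.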

Data Dictionary¶

S.No. : Serial Number

Name : Name of the car which includes Brand name and Model name

Location : The location in which the car is being sold or is available for purchase (Cities)

Year : Manufacturing year of the car

Kilometers_driven : The total kilometers driven in the car by the previous owner(s) in KM

Fuel_Type : The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG)

Transmission : The type of transmission used by the car (Automatic / Manual)

Owner : Type of ownership

Mileage : The standard mileage offered by the car company in kmpl or km/kg

Engine : The displacement volume of the engine in CC

Power : The maximum power of the engine in bhp

Seats : The number of seats in the car

New_Price : The price of a new car of the same model in INR 100,000

Price : The price of the used car in INR 100,000 (Target Variable)

Loading libraries¶

In [6]:
# Importing libraries for data manipulation
import numpy as np
import pandas as pd

# Importing libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot
import scipy.stats as stats

# Importing libraries for building linear regression model
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso

# Importing libraries for tree Based models
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree
from sklearn.ensemble import RandomForestRegressor

# Importing libraries for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Importing libraries for model evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Importing library for splitting data
from sklearn.model_selection import train_test_split

# Importing library for data preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# Importing filter to ignore deprecation warnings
import warnings
warnings.filterwarnings("ignore")

# Removing the limit on the number of displayed columns.
pd.set_option("display.max_columns", None)

Let us load the data¶

In [7]:
# Letting Colab access my Google Drive

from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [8]:
# Using the pd.read_csv() function to load the dataset

df = pd.read_csv("/content/drive/MyDrive/MIT - Applied Data Science/Projects/Capstone/used_cars.csv")

Data Overview¶

  • Observations
  • Sanity checks
In [9]:
# Looking at the top 5 rows of the dataset to start building some intuition

df.head()
Out[9]:
S.No. Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Mileage Engine Power Seats New_price Price
0 0 Maruti Wagon R LXI CNG Mumbai 2010 72000 CNG Manual First 26.60 998.0 58.16 5.0 NaN 1.75
1 1 Hyundai Creta 1.6 CRDi SX Option Pune 2015 41000 Diesel Manual First 19.67 1582.0 126.20 5.0 NaN 12.50
2 2 Honda Jazz V Chennai 2011 46000 Petrol Manual First 18.20 1199.0 88.70 5.0 8.61 4.50
3 3 Maruti Ertiga VDI Chennai 2012 87000 Diesel Manual First 20.77 1248.0 88.76 7.0 NaN 6.00
4 4 Audi A4 New 2.0 TDI Multitronic Coimbatore 2013 40670 Diesel Automatic Second 15.20 1968.0 140.80 5.0 NaN 17.74
  • The first thing I notice in the first rows of our data is that 4 out of 5 values in the New_Price feature are null. I'll have to address this issue and treat the missing values accordingly after some further analysis, to see which method will be most effective for our modeling.

  • For the Name Feature, we can see that it combines the brand and model of the car, which might not be directly usable in its current form for modeling. We might want to extract the Brand and possibly Model separately as additional categorical features.

  • The Year variable (year of manufacture) could be transformed into a more meaningful feature, such as the age of the car, which may better represent depreciation.

  • In the features Mileage, Engine and Power, although these variables are numerical, they might contain implicit units (Mileage in kmpl, Engine in cc, Power in bhp). We should double-check that all values are consistent and formatted numerically.

  • For the categorical variables, Location, Fuel_Type, Transmission and Owner_Type, these will likely need to be encoded using techniques like one-hot encoding or label encoding.

  • The Seats feature appears clean at first glance, but later I might check whether unusual seat counts (unusually high or low numbers) exist.

  • For Price variable, it looks clean and numeric. We'll later examine its distribution to look for skewness or outliers.

  • I'll now run some other functions and methods to inspect the full dataset and see if there are any more features or topics that need attention.

In [10]:
# Checking the size of the dataset using the .shape method

df.shape
Out[10]:
(7253, 14)
  • Our dataset has 7253 rows and 14 columns.
In [11]:
# Displaying basic information about the dataset

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7253 entries, 0 to 7252
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   S.No.              7253 non-null   int64  
 1   Name               7253 non-null   object 
 2   Location           7253 non-null   object 
 3   Year               7253 non-null   int64  
 4   Kilometers_Driven  7253 non-null   int64  
 5   Fuel_Type          7253 non-null   object 
 6   Transmission       7253 non-null   object 
 7   Owner_Type         7253 non-null   object 
 8   Mileage            7251 non-null   float64
 9   Engine             7207 non-null   float64
 10  Power              7078 non-null   float64
 11  Seats              7200 non-null   float64
 12  New_price          1006 non-null   float64
 13  Price              6019 non-null   float64
dtypes: float64(6), int64(3), object(5)
memory usage: 793.4+ KB
  1. After examining the .info() output, we can easily see that we have missing values in the following features:

Mileage - 7251 non-null entries, so 2 missing values

Engine - 7207 non-null entries, so 46 missing values

Power - 7078 non-null entries, so 175 missing values

Seats - 7200 non-null entries, so 53 missing values

New_price - 1006 non-null entries, so 6247 missing values

Price - 6019 non-null entries, so 1234 missing values

  • New_Price has the highest number of missing values (6247 out of 7253 rows). This suggests we may need to drop this column unless we find a reliable way to impute these values.

  • Price (Target Variable) has 1234 missing values. Since this is what we are predicting, we need to remove these rows before training the model because we can't predict without a target.

  • Mileage, Engine, Power, and Seats have relatively fewer missing values, so I will impute them rather than drop rows.

  1. Regarding the datatypes, we have 11 numerical features, 3 being integers and 6 floats, and 5 object or string datatype variables:

Integers - S.No., Year and Kilometers_Driven

Floats - Mileage, Engine, Power, Seats, New_price and Price

Objects - Name, Location, Fuel_Type, Transmission and Owner_Type

  • Kilometers_Driven is an integer, but we may need to check for extreme values (e.g., unrealistic kilometer readings).

  • Year is an integer but, as I said before, might be better represented as the age of the car for modeling.

  • Power and Engine should be checked for unit consistency (bhp vs. cc vs. kmpl).

  1. For the categorical features (object type):
  • Name contains both brand & model, and we should extract Brand as a separate feature.

  • Location, Fuel_Type, Transmission, and Owner_Type will require encoding.

In [12]:
# Checking the missing values with .isnull().sum(), which returns a count of the missing values in our data

df.isnull().sum()
Out[12]:
0
S.No. 0
Name 0
Location 0
Year 0
Kilometers_Driven 0
Fuel_Type 0
Transmission 0
Owner_Type 0
Mileage 2
Engine 46
Power 175
Seats 53
New_price 6247
Price 1234

In [13]:
# I'm now running .isna().sum() (an alias of .isnull()) to confirm the count of NaN values

df.isna().sum()
Out[13]:
0
S.No. 0
Name 0
Location 0
Year 0
Kilometers_Driven 0
Fuel_Type 0
Transmission 0
Owner_Type 0
Mileage 2
Engine 46
Power 175
Seats 53
New_price 6247
Price 1234

  1. Price (Target Variable) has 1234 missing values
  • Since Price is our dependent variable, we must drop these rows before modeling.
  • I'll remove them right before training the model, ensuring we don't lose useful data during EDA.
  1. New_Price has 6247 missing values (~86% of the data)
  • This feature is mostly empty and likely not useful for modeling.

Two options:

Drop it entirely, since it doesn't add much information.

Try imputing it based on category (Brand/Model) (if we find a strong pattern).

  1. Power (175 missing values), Engine (46 missing values), and Mileage (2 missing values)
  • These are important numerical features and we should not drop them.

  • I'll use mean/median imputation based on similar car types (e.g., impute by Brand, Model, or Fuel Type).

  1. Seats has 53 missing values
  • Seat count is usually fixed per car model, likely safe to impute using mode (most frequent value).
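The two treatments described above (dropping rows with a missing target, mode-imputing Seats) can be sketched on a toy frame; the values below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the relevant columns (values are invented).
toy = pd.DataFrame({"Price": [1.75, np.nan, 4.50],
                    "Seats": [5.0, 5.0, np.nan]})

# Rows without a target cannot be used for supervised training.
model_df = toy.dropna(subset=["Price"]).copy()

# Seats: impute with the most frequent value (the mode).
model_df["Seats"] = model_df["Seats"].fillna(model_df["Seats"].mode()[0])
```

Note that dropping the target-missing rows is deferred until modeling so they can still contribute to the EDA.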
In [14]:
# Running .duplicated().sum() to check whether there are any duplicated records in the dataset

df.duplicated().sum()
Out[14]:
0
  • No duplicated rows in our dataset, so there's no need to drop any duplicate data.

Exploratory Data Analysis¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What are the summary statistics of the data? Explore summary statistics for the numerical and categorical variables
  2. Find the number of unique observations in each category of the categorical columns. Write your findings/observations/insights
  3. Check the extreme values in the different columns of the given data and write down the observations. Remove the data where the values are unrealistic
  1. What are the summary statistics of the data? Explore summary statistics for the numerical and categorical variables
In [15]:
# In order to start building some more intuition on our data, I'm now using the .describe() function, which will return a statistical summary of our columns

df.describe().T
Out[15]:
count mean std min 25% 50% 75% max
S.No. 7253.0 3626.000000 2093.905084 0.00 1813.000 3626.00 5439.0000 7252.00
Year 7253.0 2013.365366 3.254421 1996.00 2011.000 2014.00 2016.0000 2019.00
Kilometers_Driven 7253.0 58699.063146 84427.720583 171.00 34000.000 53416.00 73000.0000 6500000.00
Mileage 7251.0 18.141580 4.562197 0.00 15.170 18.16 21.1000 33.54
Engine 7207.0 1616.573470 595.285137 72.00 1198.000 1493.00 1968.0000 5998.00
Power 7078.0 112.765214 53.493553 34.20 75.000 94.00 138.1000 616.00
Seats 7200.0 5.280417 0.809277 2.00 5.000 5.00 5.0000 10.00
New_price 1006.0 22.779692 27.759344 3.91 7.885 11.57 26.0425 375.00
Price 6019.0 9.479468 11.187917 0.44 3.500 5.64 9.9500 160.00
  1. Year (Manufacturing Year)
  • Mean: 2013, Min: 1996, Max: 2019. Data includes cars ranging from 1996 to 2019.
  • Some very old cars (pre-2000s) might be outliers or rare cases in the dataset.
  • I'm going to convert Year into Car Age (Current Year - Manufacturing Year).
  1. Kilometers Driven
  • Mean: ~58,699 km, but Max: 6,500,000 km. Huge outlier.
  • Standard deviation (~84,428 km) is much larger than the mean, suggesting extreme values.
  • Minimum: 171 km. Possibly a listing error or nearly new cars.
  • I'll log-transform this variable to handle skewness.
  • I'll cap extreme outliers (e.g., above the 99th percentile).
  1. Mileage (KMPL)
  • Mean: 18.14 KMPL, but Min: 0.00. Indicates missing or incorrect data.
  • Max: 33.54 KMPL. Seems realistic.
  • I'll replace 0.00 values with mean/median based on Brand and Fuel Type.
  1. Engine (CC)
  • Mean: 1616 CC, but Min: 72 CC, Max: 5998 CC.
  • The 72 CC value seems highly unrealistic (possibly an error).
  • I'll examine and replace low values using median by Brand/Model.
  1. Power (BHP)
  • Mean: 112 BHP, Min: 34.2 BHP, Max: 616 BHP. High-end sports cars present.
  • Possible data entry errors for very low power cars.
  • I'll check for incorrect formatting (some datasets store Power as text like "110 bhp").
  • I'm replacing missing values using mean/median by Brand/Model.
  1. Seats
  • Mean: 5.28, Mostly 5-seaters, Min: 2, Max: 10.
  • Higher values (7-10 seats) seem to be SUVs, Vans, or Buses.
  • I'm going to impute missing values using mode (most common seat count per car type).
  1. New_Price
  • Mean: 22.77 INR Lakhs (2.27M INR), but only 1006 values available (out of 7253), so 86% missing.
  • I'll likely drop this column, unless we find a strong category-based imputation method.
  1. Used Car Price (Target Variable)
  • Mean: 9.47 INR Lakhs (947,000 INR), but Max: 160 Lakhs (16M INR).
  • High standard deviation (11.18) suggests a wide range in prices.
  • I'll apply a log transformation on Price for better prediction accuracy.
  • I'll keep the original Price for final evaluation (R² & RMSE).
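The 99th-percentile cap and the zero-mileage cleanup planned above could be sketched like this (synthetic values; the 0.99 threshold is an assumption to tune later):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    # 99 plausible readings plus one extreme outlier, mimicking the 6,500,000 km entry.
    "Kilometers_Driven": np.append(rng.integers(5_000, 150_000, 99), 6_500_000),
    # One impossible 0.0 kmpl reading among normal ones.
    "Mileage": [0.0] + [18.0] * 99,
})

# Cap Kilometers_Driven at its 99th percentile to tame extreme outliers.
cap = toy["Kilometers_Driven"].quantile(0.99)
toy["Kilometers_Driven"] = toy["Kilometers_Driven"].clip(upper=cap)

# Treat Mileage == 0 as missing so it can be imputed later with the other NaNs.
toy["Mileage"] = toy["Mileage"].replace(0.0, np.nan)
```

Capping (rather than dropping) keeps the rows while limiting the leverage of a single extreme reading.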
  1. Find the number of unique observations in each category of the categorical columns. Write your findings/observations/insights
In [16]:
# Checking for unique values in categorical columns

categorical_columns = df.select_dtypes(include=["object"]).columns
print("\n**Unique Values in Categorical Columns:**")
for col in categorical_columns:
    print(f"{col}: {df[col].nunique()} unique values")
**Unique Values in Categorical Columns:**
Name: 2041 unique values
Location: 11 unique values
Fuel_Type: 5 unique values
Transmission: 2 unique values
Owner_Type: 4 unique values
  1. Name (2041 unique values)
  • This confirms that car names are nearly unique per model, so directly using this column won't be effective.

  • I'll extract Brand from Name (e.g., "Maruti Wagon R" gives Brand = "Maruti"), then drop the full Name column afterward, since individual model names won't be useful.

  1. Location (11 unique values)
  • Since 11 is a manageable number, we can apply get_dummies() encoding to handle locations properly.

  • I'll use One-Hot Encoding.

  1. Fuel_Type (5 unique values)
  • This is a low number of categories, so we can use One-Hot Encoding. Categories: Petrol, Diesel, Electric, CNG, LPG.
  1. Transmission (2 unique values: Automatic, Manual)
  • Since it's binary (only 2 categories), we can use Label Encoding (0 = Manual, 1 = Automatic).

  • This prevents adding unnecessary dimensions to the dataset.

  1. Owner_Type (4 unique values: First, Second, Third, Fourth & Above)
  • We have 4 distinct categories, One-Hot Encoding is the best approach to retain interpretability.
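The encoding plan above in miniature, on toy rows (a sketch; the real columns would be encoded the same way):

```python
import pandas as pd

toy = pd.DataFrame({
    "Transmission": ["Manual", "Automatic", "Manual"],
    "Fuel_Type": ["Petrol", "Diesel", "CNG"],
})

# Binary label encoding for Transmission (0 = Manual, 1 = Automatic).
toy["Transmission"] = (toy["Transmission"] == "Automatic").astype(int)

# One-hot encoding for a multi-category column; drop_first avoids a
# redundant column (the dropped category is implied by all-zeros).
toy = pd.get_dummies(toy, columns=["Fuel_Type"], drop_first=True)
```

The same get_dummies call would cover Location and Owner_Type as well, keeping the binary column compact.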

Feature Engineering¶

  • The Name column contains both the brand and model, but the model names are too unique (2041 unique values).

  • By extracting the brand, we get a more generalizable categorical feature.

In [17]:
# Extracting the brand from the Name column. I'll extract the first word (brand name)

df["Brand"] = df["Name"].str.split().str[0]

# Dropping the original Name column

df.drop(columns=["Name"], inplace=True)

# Displaying the unique brands

print("Unique Brands in the Dataset:")
print(df["Brand"].nunique(), "unique brands")
print(df["Brand"].unique())
Unique Brands in the Dataset:
33 unique brands
['Maruti' 'Hyundai' 'Honda' 'Audi' 'Nissan' 'Toyota' 'Volkswagen' 'Tata'
 'Land' 'Mitsubishi' 'Renault' 'Mercedes-Benz' 'BMW' 'Mahindra' 'Ford'
 'Porsche' 'Datsun' 'Jaguar' 'Volvo' 'Chevrolet' 'Skoda' 'Mini' 'Fiat'
 'Jeep' 'Smart' 'Ambassador' 'Isuzu' 'ISUZU' 'Force' 'Bentley'
 'Lamborghini' 'Hindustan' 'OpelCorsa']
  • 33 brands is a manageable number, meaning we can apply One-Hot Encoding if needed.

  • Some brand names appear inconsistently (e.g., "ISUZU" vs "Isuzu"). I should standardize brand names (convert everything to uppercase/lowercase).

  • Some brands have extra words (e.g., "Land" instead of "Land Rover"). I should inspect if any brands need corrections.

Standardizing Brand Names

In [18]:
# Standardizing brand names (convert all to uppercase)

df["Brand"] = df["Brand"].str.upper()

# Displaying unique brands
print("Unique Brands After Standardization:")
print(df["Brand"].nunique(), "unique brands")
print(df["Brand"].unique())
Unique Brands After Standardization:
32 unique brands
['MARUTI' 'HYUNDAI' 'HONDA' 'AUDI' 'NISSAN' 'TOYOTA' 'VOLKSWAGEN' 'TATA'
 'LAND' 'MITSUBISHI' 'RENAULT' 'MERCEDES-BENZ' 'BMW' 'MAHINDRA' 'FORD'
 'PORSCHE' 'DATSUN' 'JAGUAR' 'VOLVO' 'CHEVROLET' 'SKODA' 'MINI' 'FIAT'
 'JEEP' 'SMART' 'AMBASSADOR' 'ISUZU' 'FORCE' 'BENTLEY' 'LAMBORGHINI'
 'HINDUSTAN' 'OPELCORSA']
  • All names are now uppercase, so I've eliminated inconsistencies like ISUZU vs Isuzu.

  • The number of brands reduced from 33 → 32, meaning a duplicate or inconsistency was resolved.

  • One potential correction: "LAND" might actually be LAND ROVER, I will verify if that's correct.

In [19]:
# Checking all car names containing "LAND"
print(df[df["Brand"] == "LAND"]["Brand"].value_counts())
print(df[df["Brand"] == "LAND"])
Brand
LAND    67
Name: count, dtype: int64
      S.No.    Location  Year  Kilometers_Driven Fuel_Type Transmission  \
13       13       Delhi  2014              72000    Diesel    Automatic   
14       14        Pune  2012              85000    Diesel    Automatic   
191     191  Coimbatore  2018              36091    Diesel    Automatic   
311     311       Delhi  2017              44000    Diesel    Automatic   
399     399   Hyderabad  2012              56000    Diesel    Automatic   
...     ...         ...   ...                ...       ...          ...   
6434   6434       Kochi  2012              89190    Diesel    Automatic   
6717   6717       Kochi  2018              23342    Diesel    Automatic   
6857   6857      Mumbai  2011              87000    Diesel    Automatic   
7157   7157   Hyderabad  2015              49000    Diesel    Automatic   
7198   7198   Hyderabad  2012             147202    Diesel    Automatic   

     Owner_Type  Mileage  Engine   Power  Seats  New_price  Price Brand  
13        First    12.70  2179.0  187.70    5.0        NaN  27.00  LAND  
14       Second     0.00  2179.0  115.00    5.0        NaN  17.50  LAND  
191       First    12.70  2179.0  187.70    5.0        NaN  55.76  LAND  
311       First    12.70  2179.0  187.70    5.0        NaN  44.00  LAND  
399       First    12.70  2179.0  187.70    5.0        NaN  30.00  LAND  
...         ...      ...     ...     ...    ...        ...    ...   ...  
6434     Second    11.40  2993.0  245.41    7.0        NaN    NaN  LAND  
6717      First    12.83  2179.0  147.50    5.0        NaN    NaN  LAND  
6857      First     0.00  2179.0  115.00    5.0        NaN    NaN  LAND  
7157     Second    12.70  2179.0  187.70    5.0        NaN    NaN  LAND  
7198      First    11.80  2993.0  241.60    7.0        NaN    NaN  LAND  

[67 rows x 14 columns]
  • We know from real-world knowledge that the only major brand that starts with LAND is LAND ROVER.

  • There is no separate ROVER brand in the dataset, which further confirms that LAND is likely an incorrect truncation.

  • Consistent engine sizes (2179cc, 2993cc), matches known LAND ROVER models.

  • Power values (187.70 BHP, 147.50 BHP), similar to LAND ROVER vehicles.

  • All cars labeled as LAND have high-end Diesel engines, matching LAND ROVER's lineup.

In [20]:
# Correcting LAND to LAND ROVER

df["Brand"] = df["Brand"].replace("LAND", "LAND ROVER")

# Verifying the correction

print(df["Brand"].unique())
['MARUTI' 'HYUNDAI' 'HONDA' 'AUDI' 'NISSAN' 'TOYOTA' 'VOLKSWAGEN' 'TATA'
 'LAND ROVER' 'MITSUBISHI' 'RENAULT' 'MERCEDES-BENZ' 'BMW' 'MAHINDRA'
 'FORD' 'PORSCHE' 'DATSUN' 'JAGUAR' 'VOLVO' 'CHEVROLET' 'SKODA' 'MINI'
 'FIAT' 'JEEP' 'SMART' 'AMBASSADOR' 'ISUZU' 'FORCE' 'BENTLEY'
 'LAMBORGHINI' 'HINDUSTAN' 'OPELCORSA']

Missing value treatment¶

  • I'll first treat the easiest missing values:

Seats - Imputing with mode, since seats are fixed per car type.

  • For the features Mileage, Power, Engine, I'll check their distributions in EDA first in order to choose the most appropriate way to treat them.
In [21]:
# Imputing Seats with the mode (assignment instead of inplace=True, which
# is deprecated for chained calls in recent pandas versions)

df["Seats"] = df["Seats"].fillna(df["Seats"].mode()[0])

# Verifying missing values again

print("Missing Values After Seats Imputation:")
print(df.isnull().sum())
Missing Values After Seats Imputation:
S.No.                   0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Mileage                 2
Engine                 46
Power                 175
Seats                   0
New_price            6247
Price                1234
Brand                   0
dtype: int64
  • Seats has now 0 missing values.

I'm now deciding to drop the New_price field for the following reasons:

  • 86% missing values (6247 out of 7253 rows). Too much missing data to reliably impute.

  • Used car prices are largely independent of new car prices: dealers set used car prices based on market conditions, not just the original price.

  • This feature isn't useful for modeling, as it doesn't directly impact our price-prediction goal.

  • Keeping it adds unnecessary complexity, while dropping it simplifies the dataset.

In [22]:
# Dropping the New_price column

df.drop(columns=["New_price"], inplace=True)

# Verifying that it's gone

print("Columns after dropping 'New_price':")
print(df.columns)
Columns after dropping 'New_price':
Index(['S.No.', 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type',
       'Transmission', 'Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats',
       'Price', 'Brand'],
      dtype='object')
  • As we can confirm from the resulting list of the columns of our dataset, the New_price field is no longer present in our data.

Univariate Analysis¶

Questions:

  1. Do univariate analysis for the numerical and categorical variables.
  2. Check the distribution of the different variables. Are the distributions skewed?
  3. Do we need to do a log transformation? If so, for which variables?
  4. Perform the log transformation (if needed) and write down your observations.
  1. Do univariate analysis for the numerical and categorical variables.

Let's start by analysing the numerical variables.

In [23]:
# Defining numerical columns to analyze

numerical_features = ["Year", "Kilometers_Driven", "Mileage", "Engine", "Power", "Seats", "Price"]

# Plotting histograms for numerical features

plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_features, 1):
    plt.subplot(3, 3, i)  # Adjusting rows and columns based on number of features
    sns.histplot(df[col], bins=30, kde=True)
    plt.title(f"Distribution of {col}")
plt.tight_layout()
plt.show()
[Figure: histograms with KDE overlays for the numerical features]
  1. Year (Manufacturing Year)
  • Left-skewed (most cars are from recent years).

  • No need for log transformation.

  • Instead, we should convert this to "Car Age".

  1. Kilometers Driven
  • Highly right-skewed with extreme outliers.
  • Log transformation to make it more normally distributed.
  1. Mileage (KMPL)
  • Looks roughly normal but has some low and high extremes.
  • No log transformation needed.
  • I'll handle missing values based on Fuel Type & Brand.
  1. Engine (CC)
  • Right-skewed, showing different peaks for different car segments.
  • Log transformation to normalize it.
  1. Power (BHP)
  • Right-skewed with multiple peaks (different categories of vehicles).
  • Log transformation to reduce skewness.
  1. Seats
  • Categorical in nature (5-seaters dominate).
  • No log transformation needed.
  1. Price (Target Variable)
  • Highly right-skewed
  • Log transformation needed.
  • I'll keep the original Price for final R² and RMSE evaluation.
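Since Price will be modeled on the log scale but evaluated on the original scale, the back-transform pairs np.log1p with its inverse np.expm1; a minimal sketch with invented numbers:

```python
import numpy as np

# Invented actual prices (INR lakh) and predictions made on the log1p scale.
price_actual = np.array([1.75, 12.50, 4.50])
price_log_pred = np.log1p(np.array([2.0, 11.0, 4.8]))

# Invert log1p with expm1 so the error is reported in the original units.
price_pred = np.expm1(price_log_pred)
rmse = np.sqrt(np.mean((price_actual - price_pred) ** 2))
```

Evaluating on the back-transformed scale keeps R² and RMSE interpretable in INR lakh rather than log units.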

As noted earlier, I'll transform the Year variable into the age of the car, which better represents depreciation.

In [24]:
# Converting the column Year to Car Age and dropping the Year field

df["Car_Age"] = 2024 - df["Year"]
df.drop(columns=["Year"], inplace=True)

Now I'll apply the Log Transformations on the required features.

In [25]:
# Applying log transformation to skewed numerical features

df["Kilometers_Driven_Log"] = np.log1p(df["Kilometers_Driven"])
df["Engine_Log"] = np.log1p(df["Engine"])
df["Power_Log"] = np.log1p(df["Power"])
df["Price_Log"] = np.log1p(df["Price"])  # Target variable, but keep original

# Dropping original versions of transformed features (except for Price)

df.drop(columns=["Kilometers_Driven", "Engine", "Power"], inplace=True)

# Verifying changes

print(df.head())
   S.No.    Location Fuel_Type Transmission Owner_Type  Mileage  Seats  Price  \
0      0      Mumbai       CNG       Manual      First    26.60    5.0   1.75   
1      1        Pune    Diesel       Manual      First    19.67    5.0  12.50   
2      2     Chennai    Petrol       Manual      First    18.20    5.0   4.50   
3      3     Chennai    Diesel       Manual      First    20.77    7.0   6.00   
4      4  Coimbatore    Diesel    Automatic     Second    15.20    5.0  17.74   

     Brand  Car_Age  Kilometers_Driven_Log  Engine_Log  Power_Log  Price_Log  
0   MARUTI       14              11.184435    6.906755   4.080246   1.011601  
1  HYUNDAI        9              10.621352    7.367077   4.845761   2.602690  
2    HONDA       13              10.736418    7.090077   4.496471   1.704748  
3   MARUTI       12              11.373675    7.130099   4.497139   1.945910  
4     AUDI       11              10.613271    7.585281   4.954418   2.930660  

Bivariate Analysis¶

Questions:

  1. Plot a scatter plot for the log-transformed values (if a log transformation was done in previous steps).
  2. What can we infer from the correlation heatmap? Is there correlation between the dependent and independent variables?
  3. Plot a box plot for the target variable and the categorical variable 'Location' and write your observations.
  1. Plot a scatter plot for the log-transformed values (if a log transformation was done in previous steps).
In [26]:
# Plotting Scatter Plots for Log-Transformed Features

numerical_features = ["Kilometers_Driven_Log", "Engine_Log", "Power_Log", "Car_Age"]

plt.figure(figsize=(12, 10))
for i, col in enumerate(numerical_features, 1):
    plt.subplot(2, 2, i)
    sns.scatterplot(x=df[col], y=df["Price_Log"], alpha=0.5)
    plt.title(f"Scatter Plot: Price_Log vs {col}")
plt.tight_layout()
plt.show()
[Figure: scatter plots of Price_Log vs Kilometers_Driven_Log, Engine_Log, Power_Log, and Car_Age]
  1. Price_Log vs Kilometers_Driven_Log
  • Weak negative correlation, as Kilometers_Driven_Log increases, Price_Log slightly decreases.

  • This makes sense, as cars with higher mileage usually have lower resale value. But there's a lot of spread, meaning mileage alone isn't a strong predictor.

  1. Price_Log vs Engine_Log
  • Strong positive correlation, bigger engines tend to have higher prices.
  • This aligns with expectations: luxury and performance cars usually have larger engines.
  1. Price_Log vs Power_Log
  • Strongest positive correlation among all features
  • More power directly influences price since powerful cars are more expensive.
  • I'll have to check if Power_Log and Engine_Log are highly correlated (multicollinearity risk).
  1. Price_Log vs Car_Age
  • Clear negative correlation, older cars have lower prices.
  1. What can we infer from the correlation heatmap? Is there correlation between the dependent and independent variables?
In [27]:
# Plotting the correlation heatmap

numerical_df = df.select_dtypes(include=["number"])

plt.figure(figsize=(12, 8))
sns.heatmap(numerical_df.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
[Figure: correlation heatmap of the numerical features]
  1. Price_Log (Target Variable)
  • Strong positive correlation with Power_Log (0.80), higher power leads to higher prices.

  • High correlation with Engine_Log (0.70), bigger engines also lead to higher prices, but slightly less than power.

  • Moderate negative correlation with Car_Age (-0.47), older cars have lower prices.

  • Weak negative correlation with Kilometers_Driven_Log (-0.20); kilometers driven affect the price, but not as strongly as power and engine size.

  1. Power_Log & Engine_Log (0.88)
  • These two are highly correlated, meaning we might have multicollinearity. I'll need to check Variance Inflation Factor when preparing data for modeling.
  1. Mileage has a moderate negative correlation with Power and Engine
  • Mileage is negatively correlated with Power_Log (-0.55) and Engine_Log (-0.59). This makes sense, since more fuel-efficient cars tend to have smaller, less powerful engines.
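The multicollinearity check planned for Power_Log and Engine_Log boils down to VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the others. A sketch on synthetic collinear data (the column roles are assumptions; the formula matches what statsmodels' variance_inflation_factor computes):

```python
import numpy as np

rng = np.random.default_rng(1)
engine_log = rng.normal(7.2, 0.4, 200)
# Power_Log built to be strongly collinear with Engine_Log (synthetic data).
power_log = 0.9 * engine_log + rng.normal(0.0, 0.1, 200)

# Regress Power_Log on Engine_Log (plus an intercept) and compute R².
X = np.column_stack([np.ones(200), engine_log])
beta, *_ = np.linalg.lstsq(X, power_log, rcond=None)
resid = power_log - X @ beta
r2 = 1 - resid.var() / power_log.var()

# VIF above roughly 5-10 is the usual red flag for multicollinearity.
vif = 1 / (1 - r2)
```

With a 0.88 correlation as in the heatmap, the VIF would come out well above that threshold, which is why one of the two features may need to be dropped or combined.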
  1. Plot a box plot for target variable and categorical variable 'Location' and write your observations.
In [28]:
# Plotting Box Plot for Price_Log vs Location

plt.figure(figsize=(15, 6))
sns.boxplot(x=df["Location"], y=df["Price_Log"])
plt.xticks(rotation=90)
plt.title("Box Plot: Price_Log vs Location")
plt.show()
[Figure: box plot of Price_Log by Location]
  1. Price Variation Across Locations
  • Bangalore and Coimbatore have the highest median Price_Log values.

  • Kolkata, Hyderabad, and Jaipur have the lowest median used car prices.

  • This suggests that used car prices are higher in certain cities, possibly due to demand, purchasing power, or local market trends.

  2. Presence of Outliers
  • All locations have significant outliers on the higher end.

  • This is expected because luxury and high-performance cars exist in every city, but they are not the majority.

  • We might need to handle extreme outliers carefully when training the model.

  3. Overall Distribution Similarity
  • Most locations have a similar Interquartile Range, meaning price distribution is somewhat consistent across cities.

  • However, Bangalore and Coimbatore show greater price variation, suggesting a mix of both affordable and luxury vehicles.

Missing value treatment (Part 2)¶

  1. Mileage (Missing: 2 values)
  • Since there are only 2 missing values, we can impute them using the median Mileage of the same Fuel_Type.

  • This is because fuel type affects mileage (Diesel cars typically have better mileage than Petrol cars).

  2. Engine_Log (Missing: 46 values)
  • I'll impute using the median Engine_Log based on Brand.

  • This is because different brands manufacture cars with different engine sizes.

  3. Power_Log (Missing: 175 values)
  • Similar to Engine_Log, we'll impute missing values using the median Power_Log based on Brand.
In [29]:
# Imputing Mileage based on Fuel_Type
# (assignment form avoids the deprecated chained fillna(..., inplace=True) pattern)

df["Mileage"] = df["Mileage"].fillna(df.groupby("Fuel_Type")["Mileage"].transform("median"))

# Imputing Engine_Log based on Brand

df["Engine_Log"] = df["Engine_Log"].fillna(df.groupby("Brand")["Engine_Log"].transform("median"))

# Imputing Power_Log based on Brand

df["Power_Log"] = df["Power_Log"].fillna(df.groupby("Brand")["Power_Log"].transform("median"))

# Verifying that all missing values are handled

print("Missing Values After Imputation:\n", df.isnull().sum())
Missing Values After Imputation:
 S.No.                       0
Location                    0
Fuel_Type                   0
Transmission                0
Owner_Type                  0
Mileage                     2
Seats                       0
Price                    1234
Brand                       0
Car_Age                     0
Kilometers_Driven_Log       0
Engine_Log                  0
Power_Log                   2
Price_Log                1234
dtype: int64
  • Mileage is still missing (2 values). The imputation should have worked, so I'll double-check and reapply if necessary.

  • Power_Log still has 2 missing values; I'll investigate whether these cars belong to a brand where all values were missing (so no median could be calculated).

  • Price & Price_Log still have 1234 missing values; I'll drop these rows before modeling.

In [30]:
# Re-checking Mileage and Power_Log for missing values

print(df[df["Mileage"].isnull()])  # Check the rows where Mileage is still missing
print(df[df["Power_Log"].isnull()])  # Check missing Power_Log rows
      S.No. Location Fuel_Type Transmission Owner_Type  Mileage  Seats  Price  \
4446   4446  Chennai  Electric    Automatic      First      NaN    5.0  13.00   
4904   4904   Mumbai  Electric    Automatic      First      NaN    5.0  12.75   

         Brand  Car_Age  Kilometers_Driven_Log  Engine_Log  Power_Log  \
4446  MAHINDRA        8              10.819798    4.290459   3.737670   
4904    TOYOTA       13              10.691968    7.494986   4.304065   

      Price_Log  
4446   2.639057  
4904   2.621039  
      S.No. Location Fuel_Type Transmission Owner_Type  Mileage  Seats  Price  \
915     915     Pune    Diesel    Automatic     Second      0.0    2.0    3.0   
6216   6216     Pune    Diesel       Manual     Second     14.1    5.0    NaN   

          Brand  Car_Age  Kilometers_Driven_Log  Engine_Log  Power_Log  \
915       SMART       16              11.542494    6.684612        NaN   
6216  HINDUSTAN       28              11.082158    7.598900        NaN   

      Price_Log  
915    1.386294  
6216        NaN  
  1. Mileage (2 missing values), both cars are Electric (Mahindra & Toyota)
  • Since electric cars don't have a "Mileage" value in the traditional sense, we can impute them with the median of other Electric cars (if available). If not, we can use a default estimate.
  2. Power_Log (2 missing values), both cars belong to rare brands (SMART and HINDUSTAN).
  • These brands likely had very few entries, so their median was NaN. I'll impute them with the median Power_Log of all cars instead.
In [31]:
# Fixing Mileage for Electric cars
# Note: if every Electric car has a missing Mileage, this median is NaN
# and the fill is a no-op (the check below will reveal this).

electric_mask = df["Fuel_Type"] == "Electric"
electric_median_mileage = df.loc[electric_mask, "Mileage"].median()
df.loc[electric_mask, "Mileage"] = df.loc[electric_mask, "Mileage"].fillna(electric_median_mileage)

# Fixing Power_Log for rare brands using overall median

overall_median_power = df["Power_Log"].median()
df["Power_Log"] = df["Power_Log"].fillna(overall_median_power)

# Verifying if all missing values are gone
print("Missing Values After Final Fix:\n", df.isnull().sum())
Missing Values After Final Fix:
 S.No.                       0
Location                    0
Fuel_Type                   0
Transmission                0
Owner_Type                  0
Mileage                     2
Seats                       0
Price                    1234
Brand                       0
Car_Age                     0
Kilometers_Driven_Log       0
Engine_Log                  0
Power_Log                   0
Price_Log                1234
dtype: int64
  • Mileage still has 2 missing values. Since I've already tried imputing them logically and they remain, I'll drop those rows; losing only 2 rows will not affect the dataset significantly.
In [32]:
# Dropping rows where Mileage is still missing

df.dropna(subset=["Mileage"], inplace=True)

# Dropping rows where Price is missing

df.dropna(subset=["Price"], inplace=True)

# Verifying that all missing values are gone

print("Final Missing Value Check:\n", df.isnull().sum())
Final Missing Value Check:
 S.No.                    0
Location                 0
Fuel_Type                0
Transmission             0
Owner_Type               0
Mileage                  0
Seats                    0
Price                    0
Brand                    0
Car_Age                  0
Kilometers_Driven_Log    0
Engine_Log               0
Power_Log                0
Price_Log                0
dtype: int64

Important Insights from EDA and Data Preprocessing¶

What are the most important observations and insights from the data based on the EDA and Data Preprocessing performed?

  1. Price Drivers: What Affects Car Prices the Most?
  • Power_Log & Engine_Log strongly correlate with Price_Log

  • Cars with higher power and bigger engines tend to have higher prices. (Correlation: Power_Log = 0.80, Engine_Log = 0.70 with Price_Log)

  • Car Age has a significant negative correlation with Price

  • Older cars are cheaper, depreciation plays a key role. (Correlation: Car_Age = -0.47 with Price_Log)

  • Kilometers Driven has only a weak negative correlation with Price

  • Higher mileage slightly reduces price, but not as much as age or engine power. (Correlation: Kilometers_Driven_Log = -0.20 with Price_Log)

  • Fuel Type & Transmission influence pricing

  • Diesel & Automatic cars generally have higher prices than Petrol & Manual cars.

  2. Market Trends: How Do Prices Vary Across Locations?
  • Bangalore & Coimbatore have the highest used car prices.

  • Suggests higher demand or luxury vehicle preference in these cities.

  • Kolkata, Hyderabad, and Jaipur have the lowest median prices.

  • Local market conditions, affordability, and demand could be factors.

  3. Data Quality & Fixes
  • We've handled missing values accordingly:

  • Used median values grouped by relevant features (Fuel_Type, Brand).

  • For extreme cases (rare brands, electric cars), we used a logical approach.

  • Dropped New_Price as it had too many missing values.

  • Log transformation improved the feature distributions.

  • Price, Kilometers Driven, Engine, and Power were highly skewed, and the log transform made their distributions much closer to symmetric.

  • The final dataset is fully clean and ready for modeling.
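The effect of the log transform on skew can be illustrated on synthetic right-skewed data (the lognormal sample below is a stand-in, not the actual Price column):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices": a lognormal sample, like used car prices,
# has a long right tail that log1p compresses toward symmetry.
rng = np.random.default_rng(42)
s = pd.Series(rng.lognormal(mean=1.5, sigma=0.9, size=1000))

skew_before = s.skew()
skew_after = np.log1p(s).skew()
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```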

Building Various Models¶

  1. What we want to predict is the "Price". A normalized version ('Price_Log') was created earlier; the models below are trained on the original Price, while Price_Log is kept for distribution analysis.
  2. Before we proceed to the model, we'll have to encode categorical features. We will drop categorical features like Name.
  3. We'll split the data into train and test, to be able to evaluate the model that we build on the train data.
  4. Build Regression models using train data.
  5. Evaluate the model performance.

Encoding¶

  • Since we have categorical features (Location, Fuel_Type, Transmission, Owner_Type, Brand), we need to convert them into numerical values so models can understand them.

  • Use One-Hot Encoding (pd.get_dummies()) for variables with more than 2 categories.

In [33]:
# One-Hot Encoding for categorical variables

df_encoded = pd.get_dummies(df, columns=["Location", "Fuel_Type", "Transmission", "Owner_Type", "Brand"], drop_first=True)

# Verify the new dataset

print("Encoded DataFrame Shape:", df_encoded.shape)
print("First Rows After Encoding:\n", df_encoded.head())
Encoded DataFrame Shape: (6017, 55)
First Rows After Encoding:
    S.No.  Mileage  Seats  Price  Car_Age  Kilometers_Driven_Log  Engine_Log  \
0      0    26.60    5.0   1.75       14              11.184435    6.906755   
1      1    19.67    5.0  12.50        9              10.621352    7.367077   
2      2    18.20    5.0   4.50       13              10.736418    7.090077   
3      3    20.77    7.0   6.00       12              11.373675    7.130099   
4      4    15.20    5.0  17.74       11              10.613271    7.585281   

   Power_Log  Price_Log  Location_Bangalore  Location_Chennai  \
0   4.080246   1.011601               False             False   
1   4.845761   2.602690               False             False   
2   4.496471   1.704748               False              True   
3   4.497139   1.945910               False              True   
4   4.954418   2.930660               False             False   

   Location_Coimbatore  Location_Delhi  Location_Hyderabad  Location_Jaipur  \
0                False           False               False            False   
1                False           False               False            False   
2                False           False               False            False   
3                False           False               False            False   
4                 True           False               False            False   

   Location_Kochi  Location_Kolkata  Location_Mumbai  Location_Pune  \
0           False             False             True          False   
1           False             False            False           True   
2           False             False            False          False   
3           False             False            False          False   
4           False             False            False          False   

   Fuel_Type_Diesel  Fuel_Type_LPG  Fuel_Type_Petrol  Transmission_Manual  \
0             False          False             False                 True   
1              True          False             False                 True   
2             False          False              True                 True   
3              True          False             False                 True   
4              True          False             False                False   

   Owner_Type_Fourth & Above  Owner_Type_Second  Owner_Type_Third  Brand_AUDI  \
0                      False              False             False       False   
1                      False              False             False       False   
2                      False              False             False       False   
3                      False              False             False       False   
4                      False               True             False        True   

   Brand_BENTLEY  Brand_BMW  Brand_CHEVROLET  Brand_DATSUN  Brand_FIAT  \
0          False      False            False         False       False   
1          False      False            False         False       False   
2          False      False            False         False       False   
3          False      False            False         False       False   
4          False      False            False         False       False   

   Brand_FORCE  Brand_FORD  Brand_HONDA  Brand_HYUNDAI  Brand_ISUZU  \
0        False       False        False          False        False   
1        False       False        False           True        False   
2        False       False         True          False        False   
3        False       False        False          False        False   
4        False       False        False          False        False   

   Brand_JAGUAR  Brand_JEEP  Brand_LAMBORGHINI  Brand_LAND ROVER  \
0         False       False              False             False   
1         False       False              False             False   
2         False       False              False             False   
3         False       False              False             False   
4         False       False              False             False   

   Brand_MAHINDRA  Brand_MARUTI  Brand_MERCEDES-BENZ  Brand_MINI  \
0           False          True                False       False   
1           False         False                False       False   
2           False         False                False       False   
3           False          True                False       False   
4           False         False                False       False   

   Brand_MITSUBISHI  Brand_NISSAN  Brand_PORSCHE  Brand_RENAULT  Brand_SKODA  \
0             False         False          False          False        False   
1             False         False          False          False        False   
2             False         False          False          False        False   
3             False         False          False          False        False   
4             False         False          False          False        False   

   Brand_SMART  Brand_TATA  Brand_TOYOTA  Brand_VOLKSWAGEN  Brand_VOLVO  
0        False       False         False             False        False  
1        False       False         False             False        False  
2        False       False         False             False        False  
3        False       False         False             False        False  
4        False       False         False             False        False  
  • The dataset now has 55 columns, which means our categorical features have been effectively converted into numerical form.

  • One-Hot Encoding worked correctly:

Location, Fuel_Type, Transmission, Owner_Type, and Brand have been transformed into binary indicator columns (True/False).

  • drop_first=True avoided the dummy variable trap (one perfectly redundant indicator column per categorical feature).

  • All categorical variables are now numerical.
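How drop_first removes the redundancy can be seen on a tiny example (the demo frame is illustrative, not from the dataset):

```python
import pandas as pd

demo = pd.DataFrame({"Transmission": ["Manual", "Automatic", "Manual"]})

# Without drop_first, the two indicator columns would always sum to 1 (redundant);
# with drop_first=True, the alphabetically first category ("Automatic")
# becomes the implicit baseline and only one column remains.
encoded = pd.get_dummies(demo, columns=["Transmission"], drop_first=True)
print(encoded.columns.tolist())  # ['Transmission_Manual']
```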

Split the Data¶

  • Step 1: Separating the independent variables (X) and the dependent variable (y).
  • Step 2: Encoding the categorical variables in X using pd.get_dummies.
  • Step 3: Splitting the data into train and test using train_test_split.
  • Question:

    1. Why should we drop 'Name', 'Price', 'Price_Log', and 'Kilometers_Driven' from X before splitting?
    1. Name (Already Dropped Earlier)
    • This column contained both Brand and Model names.

    • We've extracted Brand as a separate categorical feature.

    • Model names were too unique (~2000 values), making them useless for prediction.

    • Already dropped earlier in Feature Engineering.

    2. Price (Target Variable)
    • We're predicting Price, so it must be in y, NOT in X.

    • Including it in X would leak the target into the model, making predictions meaningless.

    3. Price_Log (Transformed Target Variable)
    • We've applied log transformation on Price to normalize it.

    • But we only use it for model performance analysis, like checking normality, not as an input feature.

    • We train the model on the original Price (y); if a model is later fit on the log scale, its predictions must be converted back before reporting.

    4. Kilometers_Driven (Replaced by Kilometers_Driven_Log)
    • We've log-transformed Kilometers_Driven into Kilometers_Driven_Log to reduce skewness.

    • The original Kilometers_Driven is no longer useful, as the model should use the transformed version.

    • Already dropped earlier in Feature Engineering.
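The Price_Log values shown in the encoded frame are consistent with np.log1p(Price) (e.g., log1p(1.75) ≈ 1.0116, matching the first row), so bringing log-scale predictions back to price units uses np.expm1. A quick sketch (the sample array is just the first three prices from the encoded frame):

```python
import numpy as np

price = np.array([1.75, 12.50, 4.50])   # sample prices from the encoded frame
price_log = np.log1p(price)             # forward transform used in feature engineering
recovered = np.expm1(price_log)         # inverse transform for reporting predictions

print(np.round(price_log, 4))           # matches the Price_Log column values
print(np.allclose(recovered, price))    # True
```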

    In [34]:
    # Defining features and target
    
    X = df_encoded.drop(columns=["Price", "Price_Log"])
    y = df_encoded["Price"]  # Target variable
    
    # Train-test split (80% Train, 20% Test)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Verifying split
    
    print(f"Training Set: {X_train.shape}, Test Set: {X_test.shape}")
    
    Training Set: (4813, 53), Test Set: (1204, 53)
    

    For Regression Problems, some of the algorithms used are :

    1) Linear Regression
    2) Ridge / Lasso Regression
    3) Decision Trees
    4) Random Forest

    1) Linear Regression

    In [35]:
    # Initializing the Linear Regression model
    
    lin_reg = LinearRegression()
    
    # Training the model
    
    lin_reg.fit(X_train, y_train)
    
    # Making predictions
    
    y_train_pred = lin_reg.predict(X_train)
    y_test_pred = lin_reg.predict(X_test)
    
    # Evaluating model performance
    
    # R2
    
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    # RMSE
    
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    
    # MAE
    
    train_mae = mean_absolute_error(y_train, y_train_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    
    # MAPE (note: assumes y_true contains no zeros; very cheap cars inflate this metric)
    
    def mean_absolute_percentage_error(y_true, y_pred):
        return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    
    train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
    test_mape = mean_absolute_percentage_error(y_test, y_test_pred)
    
    # Displaying results
    print("Linear Regression Model Performance:")
    print(f"R² Score (Train): {train_r2:.4f}")
    print(f"R² Score (Test) : {test_r2:.4f}")
    print(f"RMSE (Train)    : {train_rmse:.4f}")
    print(f"RMSE (Test)     : {test_rmse:.4f}")
    print(f"MAE (Train)     : {train_mae:.4f}")
    print(f"MAE (Test)      : {test_mae:.4f}")
    print(f"MAPE (Train)    : {train_mape:.2f}%")
    print(f"MAPE (Test)     : {test_mape:.2f}%")
    
    Linear Regression Model Performance:
    R² Score (Train): 0.7557
    R² Score (Test) : 0.7588
    RMSE (Train)    : 5.5733
    RMSE (Test)     : 5.3205
    MAE (Train)     : 3.0761
    MAE (Test)      : 3.0943
    MAPE (Train)    : 60.80%
    MAPE (Test)     : 65.63%
    
    1. R² Score (Train: 0.7557, Test: 0.7588)
    • This means our model explains ~75.6% of the variance in car prices.

    • The train and test R² are very close, meaning no overfitting.

    2. RMSE (Train: 5.5733, Test: 5.3205)
    • On average, our model's price predictions are off by ~5.3 (in the same units as Price).

    • This is quite reasonable, but we'll try reducing the error further using other models.

    3. MAE (Train: 3.08, Test: 3.09)
    • On average, our model predicts car prices within 3.08 - 3.09 of the actual price.

    • This is reasonable for a baseline model, but we'll aim to reduce it.

    4. MAPE (Train: 60.80%, Test: 65.63%)
    • This means that, on average, our model's predictions are ~60-65% off relative to the actual prices.

    • This is quite high, indicating that a linear model may not fully capture complex pricing patterns.
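One reason the MAPE is so high while MAE looks reasonable: percentage errors are dominated by cheap cars. A small illustration (the prices are made up):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# The same absolute error of 1 price unit yields wildly different MAPEs:
cheap = mape(np.array([1.0]), np.array([2.0]))     # 100% error on a cheap car
pricey = mape(np.array([20.0]), np.array([21.0]))  # 5% error on an expensive one
print(cheap, pricey)
```

So a fleet with many low-priced cars can show a large MAPE even when absolute errors are modest.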

    Checking Linear Regression Assumptions¶

    Checking the mean of the residuals¶

    In [36]:
    # Predicting on train data to get residuals
    
    y_train_pred = lin_reg.predict(X_train)
    residuals = y_train - y_train_pred
    
    # Checking the Mean of Residuals
    
    mean_residuals = np.mean(residuals)
    print(f"Mean of Residuals: {mean_residuals:.5f}")
    
    Mean of Residuals: 0.00000
    
    • Mean of Residuals = 0.00000, which confirms that the residuals are centered around zero. (For OLS with an intercept this holds by construction, so it is a sanity check rather than evidence of good predictions.)

    Homoscedasticity Check¶

    In [37]:
    # Homoscedasticity Check - Residuals vs Predicted
    
    plt.figure(figsize=(6, 4))
    sns.scatterplot(x=y_train_pred, y=residuals, alpha=0.5)
    plt.axhline(y=0, color="r", linestyle="--", linewidth=2)
    plt.xlabel("Predicted Values")
    plt.ylabel("Residuals")
    plt.title("Homoscedasticity Check: Residuals vs Predicted")
    plt.show()
    
    • The residuals do not appear to be randomly scattered around the zero line.

    • Instead, there is a visible funnel shape, meaning that the variance of residuals increases with higher predicted values.

    • This indicates heteroscedasticity, which means our model's errors are not constant across all predictions.

    • The model might not be capturing variance well, especially for higher-priced cars.

    • This violates the assumption of constant variance and suggests that some transformations or alternative modeling techniques might improve performance.
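The funnel-shape impression can be checked formally with a Breusch-Pagan-style test, sketched here without extra dependencies (the helper name and the hard-coded chi²(1) critical value 3.84 are my additions, not from the notebook):

```python
import numpy as np

def breusch_pagan_stat(y_pred: np.ndarray, residuals: np.ndarray) -> float:
    """Simplified Breusch-Pagan LM statistic: regress squared residuals on the
    fitted values; n * R² is approximately chi²(1) under homoscedasticity."""
    n = len(residuals)
    u2 = residuals ** 2
    X = np.column_stack([np.ones(n), y_pred])
    beta, *_ = np.linalg.lstsq(X, u2, rcond=None)
    r2 = 1 - ((u2 - X @ beta) ** 2).sum() / ((u2 - u2.mean()) ** 2).sum()
    return n * r2

# Demo on synthetic data where the error spread grows with the prediction:
rng = np.random.default_rng(1)
pred = rng.uniform(1, 30, 500)
resid = rng.normal(scale=0.3 * pred)    # funnel-shaped residuals
lm = breusch_pagan_stat(pred, resid)
print(lm > 3.84)  # exceeding the chi²(1) 95% critical value flags heteroscedasticity
```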

    Linearity of Variables¶

    In [38]:
    # Plotting a Q-Q plot to verify if residuals align with a normal distribution
    
    plt.figure(figsize=(6, 4))
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.title("Q-Q Plot for Linearity Check")
    plt.show()
    
    • The residuals should lie on the red diagonal line if they follow a normal distribution.

    • However, our residuals deviate significantly at both tails, especially at higher quantiles (right side).

    • The S-shape pattern indicates that the errors are not normally distributed.

    • This suggests the presence of non-linearity in the relationship between features and the target variable.

    • The model may be underestimating or overestimating extreme values.

    • This violates the assumption of normality of residuals.

    Normality of Error Terms¶

    In [39]:
    # Plotting a histogram to check if the errors follow a normal distribution
    
    plt.figure(figsize=(6, 4))
    sns.histplot(residuals, bins=30, kde=True)
    plt.xlabel("Residuals")
    plt.ylabel("Frequency")
    plt.title("Normality of Residuals")
    plt.show()
    
    • The histogram shows that most residuals are clustered around zero, with a sharp central peak and heavy tails, which indicates high kurtosis.

    • The distribution also has long tails, suggesting outliers and non-normality.

    • The right tail is significantly stretched, which means some predictions are much higher than expected.

    How is the model performing after cross-validation?¶

    In [40]:
    # Cross-Validation performance check
    
    # Performing 5-Fold Cross-Validation (scoring based on R²)
    
    cv_scores = cross_val_score(lin_reg, X_train, y_train, cv=5, scoring='r2')
    
    # Displaying cross-validation results
    
    print(f"\nCross-Validation Results:")
    print(f"Mean R² Score: {cv_scores.mean():.4f}")
    print(f"Standard Deviation of R²: {cv_scores.std():.4f}")
    print(f"All R² Scores: {cv_scores}")
    
    Cross-Validation Results:
    Mean R² Score: 0.7358
    Standard Deviation of R²: 0.0281
    All R² Scores: [0.74678498 0.76442222 0.70710684 0.76301225 0.69773929]
    
    • Mean R² Score: 0.7358

    • The model explains about 73.58% of the variance in the held-out validation folds on average.

    • This is consistent with our initial test R² score (0.7588), confirming that our linear regression model is relatively stable.

    • Standard Deviation of R²: 0.0281

    • A lower standard deviation indicates that the model's performance is fairly consistent across different validation folds. However, 0.0281 is not negligible, meaning some variation exists across different train-test splits.

    • The individual R² scores across the 5 folds range from 0.6977 to 0.7644, showing some fluctuation in predictive power.

    • The lower score (0.6977) suggests that in some folds, the model struggles with certain subsets of the data.

    2) Ridge / Lasso Regression

    I'll now train a Ridge and a Lasso Regression model

    • Ridge Regression adds L2 regularization, which helps reduce overfitting by penalizing large coefficients.

    • Lasso Regression adds L1 regularization, which shrinks some coefficients to zero, effectively performing feature selection.

    • Both models help with multicollinearity, especially since Power_Log and Engine_Log are highly correlated.

    In [41]:
    # Training Ridge Regression
    
    ridge = Ridge(alpha=1.0)  # Alpha is the regularization strength
    ridge.fit(X_train, y_train)
    y_train_pred_ridge = ridge.predict(X_train)
    y_test_pred_ridge = ridge.predict(X_test)
    
    # Training Lasso Regression
    
    lasso = Lasso(alpha=0.01)  # Alpha should be small for Lasso to avoid aggressive feature elimination
    lasso.fit(X_train, y_train)
    y_train_pred_lasso = lasso.predict(X_train)
    y_test_pred_lasso = lasso.predict(X_test)
    
    # Evaluating both models
    
    def evaluate_model(model_name, y_train, y_train_pred, y_test, y_test_pred):
        train_r2 = r2_score(y_train, y_train_pred)
        test_r2 = r2_score(y_test, y_test_pred)
        train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
        test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
        train_mae = mean_absolute_error(y_train, y_train_pred)
        test_mae = mean_absolute_error(y_test, y_test_pred)
        train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
        test_mape = mean_absolute_percentage_error(y_test, y_test_pred)
    
        print(f"\n{model_name} Model Performance:")
        print(f"R² Score (Train): {train_r2:.4f}")
        print(f"R² Score (Test) : {test_r2:.4f}")
        print(f"RMSE (Train)    : {train_rmse:.4f}")
        print(f"RMSE (Test)     : {test_rmse:.4f}")
        print(f"MAE (Train)     : {train_mae:.4f}")
        print(f"MAE (Test)      : {test_mae:.4f}")
        print(f"MAPE (Train)    : {train_mape:.2f}%")
        print(f"MAPE (Test)     : {test_mape:.2f}%")
    
    # Printing results
    
    evaluate_model("Ridge Regression", y_train, y_train_pred_ridge, y_test, y_test_pred_ridge)
    evaluate_model("Lasso Regression", y_train, y_train_pred_lasso, y_test, y_test_pred_lasso)
    
    Ridge Regression Model Performance:
    R² Score (Train): 0.7519
    R² Score (Test) : 0.7596
    RMSE (Train)    : 5.6158
    RMSE (Test)     : 5.3118
    MAE (Train)     : 3.1204
    MAE (Test)      : 3.1185
    MAPE (Train)    : 61.73%
    MAPE (Test)     : 66.54%
    
    Lasso Regression Model Performance:
    R² Score (Train): 0.7469
    R² Score (Test) : 0.7580
    RMSE (Train)    : 5.6724
    RMSE (Test)     : 5.3294
    MAE (Train)     : 3.1705
    MAE (Test)      : 3.1508
    MAPE (Train)    : 62.94%
    MAPE (Test)     : 67.32%
    
    • R² Scores (Train & Test) are nearly identical across all models.

    • Linear Regression: Train: 0.7557, Test: 0.7588

    • Ridge Regression: Train: 0.7519, Test: 0.7596

    • Lasso Regression: Train: 0.7469, Test: 0.7580

    • Conclusion: Regularization (Ridge & Lasso) didn't significantly improve generalization.

    • RMSE, MAE, and MAPE are also very close.

    • Lasso performs slightly worse in Train R² (0.7469) because Lasso shrinks some coefficients to zero, meaning it's removing some features.

    • Ridge performs almost exactly like Linear Regression.

    • MAPE is still high (~60-67%) in all models.

    • This suggests that a purely linear model may not fully capture the complexities of used car pricing.

    1. Ridge and Lasso didn't improve much.
    • Since our linear models are hitting a performance ceiling, let's train Decision Trees & Random Forest, which handle non-linear relationships better.
    2. We should check feature importance in Lasso.
    • Lasso removes some features by shrinking coefficients to zero. Let's print which features were eliminated.
    In [42]:
    # Getting feature names
    
    feature_names = X_train.columns
    
    # Getting Lasso coefficients
    lasso_coeffs = lasso.coef_
    
    # Identifying features with zero coefficients (dropped by Lasso)
    
    dropped_features = feature_names[lasso_coeffs == 0]
    
    # Printing dropped Features
    
    print("Features Dropped by Lasso Regression:")
    print(dropped_features)
    
    Features Dropped by Lasso Regression:
    Index(['Location_Pune', 'Fuel_Type_Diesel', 'Fuel_Type_LPG',
           'Owner_Type_Fourth & Above', 'Brand_BENTLEY', 'Brand_DATSUN',
           'Brand_FIAT', 'Brand_FORCE', 'Brand_ISUZU', 'Brand_JEEP',
           'Brand_MITSUBISHI', 'Brand_SMART', 'Brand_TOYOTA'],
          dtype='object')
    
    • Most of the dropped features are categorical variables (Location, Fuel_Type, Owner_Type, and specific Brands).

    • Some brands (e.g., Bentley, Jeep, Mitsubishi) were dropped, which means their effect on price is either negligible or already explained by other features.

    • Location_Pune was dropped, suggesting that (relative to the baseline city) it adds little predictive value once stronger features like Power, Engine, and Car Age are present.

    • Fuel_Type_Diesel & Fuel_Type_LPG were removed, which suggests that fuel type contributes little once the other variables are considered.

    3) Decision Trees

    In [43]:
    # Training Decision Tree Model
    
    dt_model = DecisionTreeRegressor(random_state=42, max_depth=10)  # Limiting depth to prevent overfitting
    dt_model.fit(X_train, y_train)
    
    # Making predictions
    y_train_pred_dt = dt_model.predict(X_train)
    y_test_pred_dt = dt_model.predict(X_test)
    
    # Evaluate model performance (reusing the evaluate_model helper defined in the Ridge/Lasso cell)
    
    # Print Decision Tree Results
    
    evaluate_model("Decision Tree", y_train, y_train_pred_dt, y_test, y_test_pred_dt)
    
    Decision Tree Model Performance:
    R² Score (Train): 0.9717
    R² Score (Test) : 0.7441
    RMSE (Train)    : 1.8960
    RMSE (Test)     : 5.4799
    MAE (Train)     : 1.0258
    MAE (Test)      : 2.0590
    MAPE (Train)    : 14.22%
    MAPE (Test)     : 26.36%
    
    1. R² Score (Train: 0.9717, Test: 0.7441)
    • The model fits the training data extremely well (~97% of variance explained). But there's a gap between Train & Test R², suggesting some overfitting.
    2. RMSE (Train: 1.8960, Test: 5.4799)
    • The training error is very low, but the test error is higher, indicating potential overfitting.
    3. MAE & MAPE are significantly better than the linear models.
    • MAE dropped from ~3.1 (Linear Regression) to ~2.06 on the test set, a great improvement over the linear models.

    • MAPE is much lower (~26.36% vs ~65% for Linear Regression), the predictions are now much closer to actual prices.
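One way to double-check the overfitting seen above is k-fold cross-validation, which averages performance over several splits instead of relying on a single train/test split. A minimal sketch, with synthetic data standing in for the notebook's X_train/y_train:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data in place of the notebook's features/target
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

# Unconstrained tree, like the base model above
dt = DecisionTreeRegressor(random_state=42)

# 5-fold cross-validated R²: a more stable generalization estimate
cv_scores = cross_val_score(dt, X, y, cv=5, scoring="r2")

print(f"CV R² per fold: {np.round(cv_scores, 4)}")
print(f"Mean CV R²    : {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
```

If the mean CV score sits well below the training R², that confirms the overfitting diagnosis from the single split.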

    Hyperparameter Tuning: Decision Tree¶

    In [44]:
    # Defining parameter grid
    
    param_grid = {
        "max_depth": [5, 7, 10, 15],
        "min_samples_split": [5, 10, 20],
        "min_samples_leaf": [5, 10, 15]
    }
    
    # Initializing decision tree model
    
    dt = DecisionTreeRegressor(random_state=42)
    
    # Running GridSearchCV to find the best parameters
    
    grid_search = GridSearchCV(dt, param_grid, cv=5, scoring="r2", n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    # Best parameters
    
    best_params = grid_search.best_params_
    print("Best Parameters:", best_params)
    
    # Training Decision Tree with best parameters
    
    best_dt = DecisionTreeRegressor(**best_params, random_state=42)
    best_dt.fit(X_train, y_train)
    
    # Making Predictions
    
    y_train_pred_best_dt = best_dt.predict(X_train)
    y_test_pred_best_dt = best_dt.predict(X_test)
    
    # Evaluating Tuned Model
    
    evaluate_model("Tuned Decision Tree", y_train, y_train_pred_best_dt, y_test, y_test_pred_best_dt)
    
    Best Parameters: {'max_depth': 15, 'min_samples_leaf': 5, 'min_samples_split': 20}
    
    Tuned Decision Tree Model Performance:
    R² Score (Train): 0.9208
    R² Score (Test) : 0.8195
    RMSE (Train)    : 3.1737
    RMSE (Test)     : 4.6028
    MAE (Train)     : 1.4059
    MAE (Test)      : 1.9617
    MAPE (Train)    : 14.97%
    MAPE (Test)     : 24.66%
    
    • R² Score (Test: 0.8195 vs 0.7441 before), a substantial improvement.

    • The model now explains ~82% of the variance in car prices.

    • Overfitting is reduced (Train R² went from 0.9717 to 0.9208, meaning the model is no longer memorizing the training data as much).

    • RMSE (Test: 4.6028 vs 5.4799 before), meaning lower error.

    • Since prices are in lakh, the model's typical prediction error dropped by roughly 88,000 INR.

    • MAE (Test: 1.9617) and MAPE (Test: 24.66%), meaning lower errors as well.

    • MAE improved slightly (from ~2.06 to ~1.96 lakh).

    • MAPE dropped from 26.36% to 24.66%, so predictions are about 1.7 percentage points closer to actual prices on average.
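The rupee figure above follows from a one-line unit conversion, assuming (as elsewhere in this notebook) that Price is measured in lakh INR:

```python
# Converting the tuned-tree RMSE improvement into rupees
# (assumes prices are in lakh INR, i.e. units of 100,000)
LAKH_INR = 100_000

rmse_before = 5.4799  # base Decision Tree, test RMSE (lakh)
rmse_after = 4.6028   # tuned Decision Tree, test RMSE (lakh)

drop_inr = (rmse_before - rmse_after) * LAKH_INR
print(f"Test RMSE reduction: {drop_inr:,.0f} INR")  # → Test RMSE reduction: 87,710 INR
```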

    In [45]:
    # Plotting the Decision Tree
    
    # Setting figure size
    
    plt.figure(figsize=(20, 10))
    
    # Plot the decision tree (limiting depth for better visualization)
    
    plot_tree(best_dt, feature_names=X_train.columns, filled=True, rounded=True, max_depth=4)  # Adjust max_depth if needed
    
    # Showing the plot
    plt.title("Decision Tree Visualization (Pruned)")
    plt.show()
    
    [Figure: Decision Tree visualization, pruned to a depth of 4]

    Feature Importance

    In [46]:
    # Extracting feature importances from the tuned decision tree model
    
    feature_importances = best_dt.feature_importances_
    
    # Creating a DataFrame for better visualization
    
    importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})
    
    # Sorting by importance in descending order
    
    importance_df = importance_df.sort_values(by='Importance', ascending=False)
    
    # Displaying the top 15 features
    
    plt.figure(figsize=(12, 6))
    plt.barh(importance_df['Feature'][:15], importance_df['Importance'][:15], color='skyblue')
    plt.xlabel("Feature Importance")
    plt.ylabel("Feature")
    plt.title("Top 15 Feature Importances - Decision Tree")
    plt.gca().invert_yaxis()
    plt.show()
    
    # Displaying the DataFrame
    
    print(importance_df)
    
    [Figure: Top 15 feature importances, Decision Tree]
                          Feature  Importance
    6                   Power_Log    0.697919
    3                     Car_Age    0.162960
    5                  Engine_Log    0.035850
    4       Kilometers_Driven_Log    0.030605
    38           Brand_LAND ROVER    0.015809
    1                     Mileage    0.012793
    11         Location_Hyderabad    0.009526
    20        Transmission_Manual    0.009494
    2                       Seats    0.006371
    41        Brand_MERCEDES-BENZ    0.004749
    42                 Brand_MINI    0.003954
    9         Location_Coimbatore    0.002448
    24                 Brand_AUDI    0.001512
    32                Brand_HONDA    0.001210
    0                       S.No.    0.001210
    33              Brand_HYUNDAI    0.000649
    50               Brand_TOYOTA    0.000544
    19           Fuel_Type_Petrol    0.000490
    10             Location_Delhi    0.000486
    14           Location_Kolkata    0.000339
    40               Brand_MARUTI    0.000193
    49                 Brand_TATA    0.000186
    22          Owner_Type_Second    0.000166
    26                  Brand_BMW    0.000144
    15            Location_Mumbai    0.000089
    39             Brand_MAHINDRA    0.000080
    7          Location_Bangalore    0.000058
    27            Brand_CHEVROLET    0.000040
    31                 Brand_FORD    0.000036
    51           Brand_VOLKSWAGEN    0.000034
    17           Fuel_Type_Diesel    0.000021
    8            Location_Chennai    0.000010
    16              Location_Pune    0.000010
    12            Location_Jaipur    0.000008
    13             Location_Kochi    0.000005
    46              Brand_RENAULT    0.000000
    45              Brand_PORSCHE    0.000000
    44               Brand_NISSAN    0.000000
    43           Brand_MITSUBISHI    0.000000
    48                Brand_SMART    0.000000
    47                Brand_SKODA    0.000000
    23           Owner_Type_Third    0.000000
    25              Brand_BENTLEY    0.000000
    37          Brand_LAMBORGHINI    0.000000
    36                 Brand_JEEP    0.000000
    35               Brand_JAGUAR    0.000000
    34                Brand_ISUZU    0.000000
    18              Fuel_Type_LPG    0.000000
    30                Brand_FORCE    0.000000
    29                 Brand_FIAT    0.000000
    28               Brand_DATSUN    0.000000
    21  Owner_Type_Fourth & Above    0.000000
    52                Brand_VOLVO    0.000000
    
    • Power_Log (69.8%) - By far, the most influential factor in predicting car prices. This aligns well with our expectations—higher power engines typically correlate with higher car prices.

    • Car_Age (16.3%) - The second most important feature. Newer cars tend to retain more value, while older ones depreciate.

    • Engine_Log (3.6%) and Kilometers_Driven_Log (3.1%) - These features still hold some significance, likely because engine size impacts performance and perceived value, while mileage affects depreciation.

    • Brand_LAND ROVER (1.6%) - This makes sense, as luxury brands like Land Rover command higher resale values.

    • Mileage, Location_Hyderabad, Transmission_Manual - These features contribute slightly to price prediction but are significantly less impactful than power and age.

    • Several categorical variables have little to no impact - Some brands and fuel types have near-zero importance, meaning they don't significantly influence the decision tree's splits.
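If those near-zero-importance features are truly uninformative, they can be dropped and the model refit on the reduced set. A hedged sketch on synthetic data (the 1e-3 threshold is an illustrative choice, not a tuned value; in the notebook this would filter X_train using importance_df):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with only 5 truly informative features out of 20
X_arr, y = make_regression(n_samples=400, n_features=20, n_informative=5,
                           noise=5.0, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])

tree = DecisionTreeRegressor(max_depth=10, random_state=42).fit(X, y)

# Keep only features whose importance clears an (illustrative) threshold
keep = X.columns[tree.feature_importances_ > 1e-3]
print(f"Kept {len(keep)} of {X.shape[1]} features")

# Refit on the reduced feature set
tree_small = DecisionTreeRegressor(max_depth=10, random_state=42).fit(X[keep], y)
print(f"Train R² with reduced features: {tree_small.score(X[keep], y):.4f}")
```

A simpler feature set is cheaper to collect and easier to explain; performance should be re-checked on held-out data before adopting the reduction.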

    4) Random Forest

    In [47]:
    # Initializing and training the random forest model
    
    rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
    rf_model.fit(X_train, y_train)
    
    # Predictions
    
    y_train_pred_rf = rf_model.predict(X_train)
    y_test_pred_rf = rf_model.predict(X_test)
    
    # Performance Evaluation
    
    rf_train_r2 = r2_score(y_train, y_train_pred_rf)
    rf_test_r2 = r2_score(y_test, y_test_pred_rf)
    rf_train_rmse = mean_squared_error(y_train, y_train_pred_rf) ** 0.5
    rf_test_rmse = mean_squared_error(y_test, y_test_pred_rf) ** 0.5
    rf_train_mae = mean_absolute_error(y_train, y_train_pred_rf)
    rf_test_mae = mean_absolute_error(y_test, y_test_pred_rf)
    rf_train_mape = np.mean(np.abs((y_train - y_train_pred_rf) / y_train)) * 100
    rf_test_mape = np.mean(np.abs((y_test - y_test_pred_rf) / y_test)) * 100
    
    # Displaying results
    
    print(f"Random Forest Model Performance:")
    print(f"R² Score (Train): {rf_train_r2:.4f}")
    print(f"R² Score (Test) : {rf_test_r2:.4f}")
    print(f"RMSE (Train)    : {rf_train_rmse:.4f}")
    print(f"RMSE (Test)     : {rf_test_rmse:.4f}")
    print(f"MAE (Train)     : {rf_train_mae:.4f}")
    print(f"MAE (Test)      : {rf_test_mae:.4f}")
    print(f"MAPE (Train)    : {rf_train_mape:.2f}%")
    print(f"MAPE (Test)     : {rf_test_mape:.2f}%")
    
    Random Forest Model Performance:
    R² Score (Train): 0.9851
    R² Score (Test) : 0.8839
    RMSE (Train)    : 1.3759
    RMSE (Test)     : 3.6912
    MAE (Train)     : 0.5495
    MAE (Test)      : 1.4871
    MAPE (Train)    : 5.95%
    MAPE (Test)     : 20.21%
    
    1. High R² Score on Test Data (0.8839)
    • This means 88.39% of the variance in car prices is explained by the model, which is a huge improvement over the previous models.

    • Compared to Decision Tree (81.95% after tuning), this is a major step up.

    2. RMSE & MAE Scores Are Lower
    • Test RMSE = 3.6912, lower than Decision Tree (4.6028), meaning better predictions.

    • Test MAE = 1.4871, the average absolute error is just 1.49 lakh, which is pretty good for car price prediction.

    3. MAPE (Mean Absolute Percentage Error)
    • Test MAPE = 20.21%, meaning on average, predictions are within +/- 20.21% of the actual price.
    • A significant improvement over the Decision Tree (24.66%).
    4. Train vs Test Performance
    • R² Score (Train) = 0.9851 is an almost perfect fit, which is slightly concerning.

    • R² Score (Test) = 0.8839 shows good generalization, though some overfitting is present.

    • MAPE (Train) = 5.95% vs Test = 20.21% confirms the overfitting, which hyperparameter tuning should help control.

    Hyperparameter Tuning: Random Forest¶

    In [48]:
    # Defining the parameter grid
    
    rf_param_grid = {
        "n_estimators": [100, 200],  # Reducing tree count options
        "max_depth": [10, 15],       # Keeping it within a reasonable range
        "min_samples_split": [10, 20],  # Higher values prevent overfitting
        "min_samples_leaf": [5, 10]   # Prevents too small leaves
    }
    
    # Initializing random forest regressor
    
    rf_model = RandomForestRegressor(random_state=42)
    
    # Using RandomizedSearchCV for faster tuning
    
    rf_random_search = RandomizedSearchCV(
        estimator=rf_model,
        param_distributions=rf_param_grid,
        n_iter=10,  # Samples 10 of the 16 possible combinations
        cv=3,  # Reduced cross-validation folds
        scoring="r2",
        verbose=2,
        n_jobs=-1  # Uses all available CPU cores
    )
    
    # Fitting the model
    
    rf_random_search.fit(X_train, y_train)
    
    # Getting the best parameters
    
    best_rf_params = rf_random_search.best_params_
    print("Best Parameters:", best_rf_params)
    
    # Training Random Forest with the best parameters
    
    best_rf = RandomForestRegressor(**best_rf_params, random_state=42)
    best_rf.fit(X_train, y_train)
    
    # Making predictions
    
    y_train_pred_best_rf = best_rf.predict(X_train)
    y_test_pred_best_rf = best_rf.predict(X_test)
    
    # Evaluating the tuned Random Forest model
    
    best_rf_train_r2 = r2_score(y_train, y_train_pred_best_rf)
    best_rf_test_r2 = r2_score(y_test, y_test_pred_best_rf)
    best_rf_train_rmse = mean_squared_error(y_train, y_train_pred_best_rf) ** 0.5
    best_rf_test_rmse = mean_squared_error(y_test, y_test_pred_best_rf) ** 0.5
    best_rf_train_mae = mean_absolute_error(y_train, y_train_pred_best_rf)
    best_rf_test_mae = mean_absolute_error(y_test, y_test_pred_best_rf)
    best_rf_train_mape = np.mean(np.abs((y_train - y_train_pred_best_rf) / y_train)) * 100
    best_rf_test_mape = np.mean(np.abs((y_test - y_test_pred_best_rf) / y_test)) * 100
    
    # Displaying results
    
    print("\n**Tuned Random Forest Model Performance:**")
    print(f"R² Score (Train): {best_rf_train_r2:.4f}")
    print(f"R² Score (Test) : {best_rf_test_r2:.4f}")
    print(f"RMSE (Train)    : {best_rf_train_rmse:.4f}")
    print(f"RMSE (Test)     : {best_rf_test_rmse:.4f}")
    print(f"MAE (Train)     : {best_rf_train_mae:.4f}")
    print(f"MAE (Test)      : {best_rf_test_mae:.4f}")
    print(f"MAPE (Train)    : {best_rf_train_mape:.2f}%")
    print(f"MAPE (Test)     : {best_rf_test_mape:.2f}%")
    
    Fitting 3 folds for each of 10 candidates, totalling 30 fits
    Best Parameters: {'n_estimators': 200, 'min_samples_split': 10, 'min_samples_leaf': 5, 'max_depth': 15}
    
    **Tuned Random Forest Model Performance:**
    R² Score (Train): 0.9369
    R² Score (Test) : 0.8571
    RMSE (Train)    : 2.8318
    RMSE (Test)     : 4.0956
    MAE (Train)     : 1.1745
    MAE (Test)      : 1.6715
    MAPE (Train)    : 12.83%
    MAPE (Test)     : 22.21%
    
    1. Better Generalization:
    • Test R² Score is now 0.8571 (was 0.8839 before tuning), a small decrease.

    • Train R² Score dropped to 0.9369 (was 0.9851 before tuning).

    • The train/test gap narrowed, which means less overfitting: the model is more balanced and should perform more reliably on new data.

    2. Error Metrics Trade-off:
    • RMSE (Test) rose slightly to 4.0956 (was 3.6912 before tuning).

    • MAE (Test) is now 1.6715, meaning on average, predictions are within about 1.67 lakh of actual prices.

    • MAPE (Test) rose to 22.21% (was 20.21%): still strong, but we sacrificed a little test accuracy to reduce overfitting.
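The trade-off is easiest to see as the train/test R² gap, computed directly from the numbers reported in this section:

```python
# Train/test R² gap before and after tuning, using the values reported above
metrics = {
    "base Random Forest":  {"train_r2": 0.9851, "test_r2": 0.8839},
    "tuned Random Forest": {"train_r2": 0.9369, "test_r2": 0.8571},
}

for name, m in metrics.items():
    gap = m["train_r2"] - m["test_r2"]
    print(f"{name}: R² gap = {gap:.4f}")
# base:  0.1012
# tuned: 0.0798
```

The tuned model's smaller gap is what "better generalization" means here, even though its raw test score is slightly lower.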

    Feature Importance

    In [49]:
    # Extracting feature importances from the tuned Random forest model
    
    rf_feature_importances = best_rf.feature_importances_
    
    # Creating a DataFrame for better visualization
    
    rf_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': rf_feature_importances})
    
    # Sorting by importance in descending order
    
    rf_importance_df = rf_importance_df.sort_values(by='Importance', ascending=False)
    
    # Displaying the top 15 features
    
    plt.figure(figsize=(12, 6))
    plt.barh(rf_importance_df['Feature'][:15], rf_importance_df['Importance'][:15], color='lightgreen')
    plt.xlabel("Feature Importance")
    plt.ylabel("Feature")
    plt.title("Top 15 Feature Importances - Random Forest")
    plt.gca().invert_yaxis()  # Invert y-axis for better readability
    plt.show()
    
    # Displaying the DataFrame
    
    print(rf_importance_df)
    
    [Figure: Top 15 feature importances, Random Forest]
                          Feature    Importance
    6                   Power_Log  7.141844e-01
    3                     Car_Age  1.634773e-01
    4       Kilometers_Driven_Log  3.420207e-02
    5                  Engine_Log  2.767288e-02
    1                     Mileage  1.625237e-02
    0                       S.No.  8.523266e-03
    20        Transmission_Manual  6.785359e-03
    2                       Seats  4.636363e-03
    41        Brand_MERCEDES-BENZ  4.201282e-03
    38           Brand_LAND ROVER  3.752296e-03
    42                 Brand_MINI  2.693046e-03
    24                 Brand_AUDI  1.726941e-03
    9         Location_Coimbatore  1.332947e-03
    26                  Brand_BMW  1.289749e-03
    11         Location_Hyderabad  1.231823e-03
    50               Brand_TOYOTA  1.189698e-03
    32                Brand_HONDA  1.009778e-03
    17           Fuel_Type_Diesel  7.312384e-04
    15            Location_Mumbai  6.892281e-04
    19           Fuel_Type_Petrol  6.132026e-04
    36                 Brand_JEEP  4.755899e-04
    22          Owner_Type_Second  4.492650e-04
    14           Location_Kolkata  4.293241e-04
    7          Location_Bangalore  4.158671e-04
    39             Brand_MAHINDRA  3.284861e-04
    10             Location_Delhi  3.168373e-04
    35               Brand_JAGUAR  2.922101e-04
    33              Brand_HYUNDAI  2.051417e-04
    40               Brand_MARUTI  1.729231e-04
    13             Location_Kochi  1.652453e-04
    47                Brand_SKODA  1.293168e-04
    49                 Brand_TATA  1.259220e-04
    31                 Brand_FORD  7.929768e-05
    16              Location_Pune  5.555644e-05
    27            Brand_CHEVROLET  4.188421e-05
    8            Location_Chennai  3.876059e-05
    51           Brand_VOLKSWAGEN  2.992249e-05
    46              Brand_RENAULT  2.395319e-05
    12            Location_Jaipur  1.434947e-05
    23           Owner_Type_Third  8.711058e-06
    43           Brand_MITSUBISHI  4.234619e-06
    44               Brand_NISSAN  1.936499e-06
    18              Fuel_Type_LPG  1.418812e-08
    48                Brand_SMART  0.000000e+00
    45              Brand_PORSCHE  0.000000e+00
    28               Brand_DATSUN  0.000000e+00
    29                 Brand_FIAT  0.000000e+00
    21  Owner_Type_Fourth & Above  0.000000e+00
    37          Brand_LAMBORGHINI  0.000000e+00
    34                Brand_ISUZU  0.000000e+00
    25              Brand_BENTLEY  0.000000e+00
    30                Brand_FORCE  0.000000e+00
    52                Brand_VOLVO  0.000000e+00
    
    • Power_Log is by far the most critical factor in predicting car prices.

    • Car_Age follows, which makes total sense, as older cars tend to be less expensive.

    • Kilometers_Driven_Log is also key, indicating that the more a car has been driven, the lower its price tends to be.

    • Engine_Log and Mileage contribute but with smaller effects.

    • Some locations and brands hold value, with premium brands, like Mercedes-Benz, Land Rover, Audi, BMW, appearing in the ranking.
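The two importance rankings can also be compared side by side. A hedged sketch on synthetic models; in the notebook this would merge importance_df (Decision Tree) and rf_importance_df (Random Forest) instead:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Stand-in data and models for illustration
X_arr, y = make_regression(n_samples=300, n_features=8, random_state=42)
cols = [f"f{i}" for i in range(8)]
X = pd.DataFrame(X_arr, columns=cols)

dt = DecisionTreeRegressor(random_state=42).fit(X, y)
rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# One table with both rankings, sorted by the Random Forest view
comparison = pd.DataFrame({
    "Feature": cols,
    "DT_Importance": dt.feature_importances_,
    "RF_Importance": rf.feature_importances_,
}).sort_values("RF_Importance", ascending=False)

print(comparison.head(5))
```

Features that rank highly under both models (here, Power_Log and Car_Age) are the safest bets for a pricing strategy.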

    Conclusions and Recommendations¶

    1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

    • How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?
    • Linear models (Linear, Ridge, Lasso) performed consistently but weakly. They struggled to capture complex relationships, yielding high errors and low R² scores.

    • Decision Tree (Base Model) overfitted massively (R² = 0.9717 on Train, 0.7441 on Test). After hyperparameter tuning, the overfitting reduced, and performance improved to R² = 0.8195.

    • Random Forest outperformed all models, delivering the best test accuracy with R² = 0.8839 and the lowest RMSE (3.6912) & MAPE (20.21%). The Tuned Random Forest gave up a little test accuracy (R² dropped from 0.8839 to 0.8571) in exchange for a smaller train/test gap, i.e. less overfitting.

    • There is scope for further improvements by:

    • Using more hyperparameter tuning (grid search with more granular settings)

    • Using ensemble techniques like stacking or boosting (XGBoost, LightGBM)

    • More feature engineering (possibly polynomial features)

    2. Refined insights:

    • What are the most meaningful insights relevant to the problem?

    Most Important Features Driving Car Prices:

    • Power (Horsepower) is the single most influential factor.
    • Car Age plays a significant role in depreciation.
    • Kilometers Driven & Engine Size also impact price significantly.
    • Brand & Location impact price but are secondary factors.

    Trends & Market Insights:

    • Luxury brands (BMW, Mercedes, Land Rover, Audi) retain value better than economy brands.

    • Automatic transmission vehicles have a slight price premium over manual ones.

    • Cars in locations like Bangalore, Delhi, and Mumbai tend to have higher prices.

    • Depreciation is significant after 5-10 years of usage.

    Final Recommendations for a Buyer/Seller in the Used Car Market:

    • If buying a used car: Focus on Power, Age, and Brand to find value-for-money deals.

    • If selling a used car: Maintain your car in good condition with low mileage, and consider selling before it reaches 10+ years old for maximum resale value.

    • For dealerships/platforms: Pricing algorithms should heavily weight power and age while considering location trends.

    3. Proposal for the final solution design:

    • What model do you propose to be adopted? Why is this the best solution to adopt?

    Best Model to Adopt: Tuned Random Forest

    • Best Generalization: While base Random Forest had the best test R² (0.8839), the tuned version still maintained a strong R² (0.8571) while reducing overfitting.

    • Better RMSE & MAPE Scores: The errors in price predictions are lower, meaning it's more reliable.

    • Handles Non-Linear Relationships: Unlike linear models, Random Forest captures complex interactions between variables.

    • Feature Importance Explainability: We can rank the most important features, making it a transparent, interpretable model.
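If the Tuned Random Forest is adopted, it can be persisted once and loaded by a pricing service without retraining. A hedged deployment sketch with joblib, using a stand-in model fitted on synthetic data (the filename is an assumption, not from the notebook):

```python
import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in for the notebook's tuned model and training data
X, y = make_regression(n_samples=200, n_features=10, random_state=42)
best_rf = RandomForestRegressor(n_estimators=50, max_depth=15,
                                min_samples_leaf=5, random_state=42).fit(X, y)

# Save once, then reload anywhere the same sklearn version is installed
joblib.dump(best_rf, "used_car_price_model.joblib")
loaded = joblib.load("used_car_price_model.joblib")

print(f"Round-trip prediction match: {all(loaded.predict(X[:5]) == best_rf.predict(X[:5]))}")
```

Note that joblib pickles are version-sensitive: the serving environment should pin the same scikit-learn version used for training.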