Used Cars Price Prediction¶
Problem Definition¶
The Context:¶
There's great demand for used cars in the Indian market. As sales of new cars have slowed down recently, the used car market has continued to grow and has surpassed the new car market. Cars4U is a tech start-up whose objective is to find gaps in this market and take advantage of them to drive bigger sales.
For a quick look at the numbers, the new car market sold 3.6 million units in 2018/19, as compared to around 4 million units for the used car market.
The bigger challenge with this growing market of used cars is determining the price of the vehicles: unlike new cars, whose values are determined and managed by OEMs, prices in the used car market carry large uncertainties because of variables like mileage, year, and ownership that influence the value of every car. It's also difficult to predict and guarantee the supply. Coming up with a solution that facilitates pricing is of pivotal importance, not only for owners but for dealers as well.
The objective:¶
Build a pricing model that effectively predicts the price of used cars and that can help our business come up with profitable strategies using differential pricing.
The key questions:¶
- How do the multiple variables affect the price of the cars?
- For predicting the price, can we rule out some of the variables?
- What are the most important features for predicting the price?
- What are the least relevant features for the prediction?
The problem formulation:¶
Our goal is to develop a robust predictive model that can:
- Estimate the price of a used car based on its features (age, mileage, brand, etc.).
- Provide a data-driven pricing strategy for sellers and dealerships.
- Reduce uncertainty in used car pricing, making the market more transparent.
We will explore multiple machine learning techniques to identify the best-performing model based on key evaluation metrics such as R² Score & RMSE, ensuring accurate and reliable price predictions.
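As a quick reference for those metrics, here is a minimal sketch (on toy arrays, not our dataset) of how R² and RMSE are computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy ground-truth prices and predictions (illustrative values only)
y_true = np.array([3.5, 5.6, 9.9, 12.5])
y_pred = np.array([4.0, 5.0, 10.5, 11.8])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more heavily
r2 = r2_score(y_true, y_pred)                       # share of variance explained by the model
print(f"RMSE: {rmse:.3f}, R2: {r2:.3f}")
```

We'll use RMSE on the original price scale so the error is interpretable in lakhs, and R² to compare models on variance explained.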
Data Dictionary¶
S.No. : Serial Number
Name : Name of the car which includes Brand name and Model name
Location : The location in which the car is being sold or is available for purchase (Cities)
Year : Manufacturing year of the car
Kilometers_driven : The total kilometers driven in the car by the previous owner(s) in KM
Fuel_Type : The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG)
Transmission : The type of transmission used by the car (Automatic / Manual)
Owner : Type of ownership
Mileage : The standard mileage offered by the car company in kmpl or km/kg
Engine : The displacement volume of the engine in CC
Power : The maximum power of the engine in bhp
Seats : The number of seats in the car
New_Price : The price of a new car of the same model in INR 100,000
Price : The price of the used car in INR 100,000 (Target Variable)
Loading libraries¶
# Importing libraries for data manipulation
import numpy as np
import pandas as pd
# Importing libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot
import scipy.stats as stats
# Importing libraries for building linear regression model
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso
# Importing libraries for tree-based models
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree
from sklearn.ensemble import RandomForestRegressor
# Importing libraries for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Importing libraries for model evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Importing library for splitting data
from sklearn.model_selection import train_test_split
# Importing library for data preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
# Importing filter to ignore deprecation warnings
import warnings
warnings.filterwarnings("ignore")
# Removing the limit from the number of displayed columns and rows.
pd.set_option("display.max_columns", None)
Let us load the data¶
# Letting Colab access my Google Drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Using the pd.read_csv() function to load the dataset
df = pd.read_csv("/content/drive/MyDrive/MIT - Applied Data Science/Projects/Capstone/used_cars.csv")
Data Overview¶
- Observations
- Sanity checks
# Looking at the top 5 rows of the dataset to start building some intuition
df.head()
| | S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.60 | 998.0 | 58.16 | 5.0 | NaN | 1.75 |
| 1 | 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 | 1582.0 | 126.20 | 5.0 | NaN | 12.50 |
| 2 | 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 |
| 3 | 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 | 1248.0 | 88.76 | 7.0 | NaN | 6.00 |
| 4 | 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.20 | 1968.0 | 140.80 | 5.0 | NaN | 17.74 |
The first thing I notice from the first rows of our data is that 4 out of the 5 rows have null values in the New_Price feature. I'll have to address this issue and treat the missing values accordingly after some further analysis, to see which method will be the most effective for our modeling.
For the Name Feature, we can see that it combines the brand and model of the car, which might not be directly usable in its current form for modeling. We might want to extract the Brand and possibly Model separately as additional categorical features.
The Year variable (year of manufacture) could be transformed into a more meaningful feature, such as "age of the car", which may better represent car depreciation.
Although the features Mileage, Engine and Power are numerical, they might contain units implicitly: Mileage in kmpl, Engine in cc, Power in bhp. We should double-check that all the values are consistent and formatted numerically.
For the categorical variables, Location, Fuel_Type, Transmission and Owner_Type, these will likely need to be encoded using techniques like one-hot encoding or label encoding.
The Seats feature appears clean at first glance, but later I might check whether unusually high or low seat counts exist.
The Price variable looks clean and numeric. We'll later examine its distribution to look for skewness or outliers.
I'll now run some other functions and methods to inspect the full dataset and see if there are any more features or issues that need attention.
# Checking the size of the dataset using the .shape method
df.shape
(7253, 14)
- Our dataset has 7253 rows and 14 columns.
# Displaying basic information about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7253 entries, 0 to 7252
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   S.No.              7253 non-null   int64
 1   Name               7253 non-null   object
 2   Location           7253 non-null   object
 3   Year               7253 non-null   int64
 4   Kilometers_Driven  7253 non-null   int64
 5   Fuel_Type          7253 non-null   object
 6   Transmission       7253 non-null   object
 7   Owner_Type         7253 non-null   object
 8   Mileage            7251 non-null   float64
 9   Engine             7207 non-null   float64
 10  Power              7078 non-null   float64
 11  Seats              7200 non-null   float64
 12  New_price          1006 non-null   float64
 13  Price              6019 non-null   float64
dtypes: float64(6), int64(3), object(5)
memory usage: 793.4+ KB
- After examining the .info() output, we can see that we have missing values in the following features:
Mileage - 7251 non-null entries, so 2 missing values
Engine - 7207 non-null entries, so 46 missing values
Power - 7078 non-null entries, so 175 missing values
Seats - 7200 non-null entries, so 53 missing values
New_price - 1006 non-null entries, so 6247 missing values
Price - 6019 non-null entries, so 1234 missing values
New_Price has the highest number of missing values (6247 out of 7253 rows). This suggests we may need to drop this column unless we find a reliable way to impute these values.
Price (Target Variable) has 1234 missing values. Since this is what we are predicting, we need to remove these rows before training the model because we can't predict without a target.
Mileage, Engine, Power, and Seats have relatively fewer missing values, so I will impute them rather than drop rows.
- Regarding the datatypes, we have 9 numerical features, 3 integers and 6 floats, plus 5 object (string) variables:
Integers - S.No., Year and Kilometers_Driven
Floats - Mileage, Engine, Power, Seats, New_price and Price
Objects - Name, Location, Fuel_Type, Transmission and Owner_Type
Kilometers_Driven is an integer, but we should check for extreme values such as unrealistic kilometer readings.
Year is an integer but might be better represented as "Age of Car", as mentioned before, for better modeling.
Mileage, Engine and Power should be checked for unit consistency (kmpl, cc and bhp respectively).
- For the categorical features (object type):
Name contains both brand & model, and we should extract Brand as a separate feature.
Location, Fuel_Type, Transmission, and Owner_Type will require encoding.
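As a preview of that encoding step, here is a minimal sketch on a toy frame (the column values are illustrative, not from our dataset):

```python
import pandas as pd

# Toy frame mimicking two of our categorical columns
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG"],
    "Transmission": ["Manual", "Automatic", "Manual"],
})

# One-hot encode Fuel_Type; drop_first removes one redundant dummy column
encoded = pd.get_dummies(toy, columns=["Fuel_Type"], drop_first=True)
print(encoded.columns.tolist())
```

With drop_first=True the first category (alphabetically, CNG here) becomes the implicit baseline, which avoids the dummy-variable trap in linear models.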
# Checking for missing values with .isnull().sum(), which returns a count of missing values per column
df.isnull().sum()
| 0 | |
|---|---|
| S.No. | 0 |
| Name | 0 |
| Location | 0 |
| Year | 0 |
| Kilometers_Driven | 0 |
| Fuel_Type | 0 |
| Transmission | 0 |
| Owner_Type | 0 |
| Mileage | 2 |
| Engine | 46 |
| Power | 175 |
| Seats | 53 |
| New_price | 6247 |
| Price | 1234 |
# Running .isna().sum() as well; .isna() is an alias of .isnull(), so the counts should match
df.isna().sum()
| 0 | |
|---|---|
| S.No. | 0 |
| Name | 0 |
| Location | 0 |
| Year | 0 |
| Kilometers_Driven | 0 |
| Fuel_Type | 0 |
| Transmission | 0 |
| Owner_Type | 0 |
| Mileage | 2 |
| Engine | 46 |
| Power | 175 |
| Seats | 53 |
| New_price | 6247 |
| Price | 1234 |
- Price (Target Variable) has 1234 missing values
- Since Price is our dependent variable, we must drop these rows before modeling.
- I'll remove them right before training the model, ensuring we don't lose useful data during EDA.
- New_Price has 6247 missing values (~86% of the data)
- This feature is mostly empty and likely not useful for modeling.
Two options:
Drop it entirely, since it doesn't add much information.
Try imputing it based on category (Brand/Model) (if we find a strong pattern).
- Power (175 missing values), Engine (46 missing values), and Mileage (2 missing values)
These are important numerical features and we should not drop them.
I'll use mean/median imputation based on similar car types (e.g., impute by Brand, Model, or Fuel Type).
- Seats has 53 missing values
- Seat count is usually fixed per car model, likely safe to impute using mode (most frequent value).
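Two of these treatments are straightforward; a minimal sketch on a toy frame (values are made up) of dropping missing-target rows and mode-imputing Seats:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Seats": [5.0, np.nan, 7.0, 5.0],
    "Price": [1.75, 12.50, np.nan, 4.50],
})

# Seats: fill with the most frequent value (mode)
toy["Seats"] = toy["Seats"].fillna(toy["Seats"].mode()[0])

# Price is the target: rows without it can't be used for supervised training
toy_model = toy.dropna(subset=["Price"])
print(len(toy_model), toy["Seats"].isna().sum())
```

The same two calls apply to the real frame later; the row drop is deferred until just before modeling so EDA can still use those rows.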
# I'm now running the .duplicated().sum() function, to know if there are any duplicated records on the dataset
df.duplicated().sum()
0
- No duplicated rows in our dataset, so there's no need to drop any duplicate data.
Exploratory Data Analysis¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Questions:
- What is the summary statistics of the data? Explore summary statistics for numerical variables and the categorical variables
- Find out number of unique observations in each category of categorical columns? Write your findings/observations/insights
- Check the extreme values in different columns of the given data and write down the observations. Remove the data where the values are unrealistic
- What is the summary statistics of the data? Explore summary statistics for numerical variables and the categorical variables
# In order to start building some more intuition on our data, I'm now using the .describe() function, which will return a statistical summary of our columns
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| S.No. | 7253.0 | 3626.000000 | 2093.905084 | 0.00 | 1813.000 | 3626.00 | 5439.0000 | 7252.00 |
| Year | 7253.0 | 2013.365366 | 3.254421 | 1996.00 | 2011.000 | 2014.00 | 2016.0000 | 2019.00 |
| Kilometers_Driven | 7253.0 | 58699.063146 | 84427.720583 | 171.00 | 34000.000 | 53416.00 | 73000.0000 | 6500000.00 |
| Mileage | 7251.0 | 18.141580 | 4.562197 | 0.00 | 15.170 | 18.16 | 21.1000 | 33.54 |
| Engine | 7207.0 | 1616.573470 | 595.285137 | 72.00 | 1198.000 | 1493.00 | 1968.0000 | 5998.00 |
| Power | 7078.0 | 112.765214 | 53.493553 | 34.20 | 75.000 | 94.00 | 138.1000 | 616.00 |
| Seats | 7200.0 | 5.280417 | 0.809277 | 2.00 | 5.000 | 5.00 | 5.0000 | 10.00 |
| New_price | 1006.0 | 22.779692 | 27.759344 | 3.91 | 7.885 | 11.57 | 26.0425 | 375.00 |
| Price | 6019.0 | 9.479468 | 11.187917 | 0.44 | 3.500 | 5.64 | 9.9500 | 160.00 |
- Year (Manufacturing Year)
- Mean: 2013, Min: 1996, Max: 2019. Data includes cars ranging from 1996 to 2019.
- Some very old cars (pre-2000s) might be outliers or rare cases in the dataset.
- I'm going to convert Year into Car Age (Current Year - Manufacturing Year).
- Kilometers Driven
- Mean: ~58,699 km, but Max: 6,500,000 km. Huge outlier.
- Standard deviation (~84,428 km) is much larger than the mean, suggesting extreme values.
- Minimum: 171 km. Possibly a listing error or nearly new cars.
- I'll log-transform this variable to handle the skewness.
- I'll also cap extreme outliers (e.g., above the 99th percentile).
- Mileage (KMPL)
- Mean: 18.14 KMPL, but Min: 0.00. Indicates missing or incorrect data.
- Max: 33.54 KMPL. Seems realistic.
- I'll replace 0.00 values with mean/median based on Brand and Fuel Type.
- Engine (CC)
- Mean: 1616 CC, but Min: 72 CC, Max: 5998 CC.
- The 72 CC value seems highly unrealistic (possibly an error).
- I'll examine and replace low values using median by Brand/Model.
- Power (BHP)
- Mean: 112 BHP, Min: 34.2 BHP, Max: 616 BHP. High-end sports cars present.
- Possible data entry errors for very low power cars.
- I'll check for incorrect formatting (some datasets store Power as text like "110 bhp").
- I'm replacing missing values using mean/median by Brand/Model.
- Seats
- Mean: 5.28, Mostly 5-seaters, Min: 2, Max: 10.
- Higher values (7-10 seats) seem to be SUVs, Vans, or Buses.
- I'm going to impute missing values using mode (most common seat count per car type).
- New_Price
- Mean: 22.77 INR Lakhs (2.27M INR), but only 1006 values available (out of 7253), so 86% missing.
- I'll likely drop this column, unless we find a strong category-based imputation method.
- Used Car Price (Target Variable)
- Mean: 9.47 INR Lakhs (947,000 INR), but Max: 160 Lakhs (16M INR).
- High standard deviation (11.18) suggests a wide range in prices.
- I'll apply a log transformation to Price for better prediction accuracy.
- I'll keep the original Price for final evaluation (R² & RMSE).
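Of the treatments listed above, the 99th-percentile cap for Kilometers_Driven can be sketched like this (on synthetic readings; the exact threshold is still an open choice):

```python
import numpy as np
import pandas as pd

# Synthetic kilometer readings, plus one extreme value like the 6,500,000 km max we observed
rng = np.random.default_rng(42)
km = pd.Series(rng.integers(1_000, 150_000, size=1_000).astype(float))
km.iloc[0] = 6_500_000.0

cap = km.quantile(0.99)          # 99th-percentile threshold
km_capped = km.clip(upper=cap)   # values above the cap are pulled down to it
print(km_capped.max(), cap)
```

Capping (winsorizing) keeps the row while limiting the leverage of a single absurd reading; log-transforming afterwards further compresses the remaining spread.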
- Find out number of unique observations in each category of categorical columns? Write your findings/observations/insights
# Checking for unique values in categorical columns
categorical_columns = df.select_dtypes(include=["object"]).columns
print("\n**Unique Values in Categorical Columns:**")
for col in categorical_columns:
print(f"{col}: {df[col].nunique()} unique values")
**Unique Values in Categorical Columns:**
Name: 2041 unique values
Location: 11 unique values
Fuel_Type: 5 unique values
Transmission: 2 unique values
Owner_Type: 4 unique values
- Name (2041 unique values)
This confirms that each car model is highly unique so directly using this column won't be effective.
I'll extract the Brand from Name (e.g., "Maruti Wagon R" → Brand = "Maruti"), then drop the full Name column afterward, since individual model names won't be useful.
- Location (11 unique values)
Since 11 is a manageable number, we can apply get_dummies() encoding to handle locations properly.
I'll use One-Hot Encoding.
- Fuel_Type (5 unique values)
- This is a low number of categories, so we can use One-Hot Encoding. Categories: Petrol, Diesel, Electric, CNG, LPG.
- Transmission (2 unique values: Automatic, Manual)
Since it's binary (only 2 categories), we can use Label Encoding (0 = Manual, 1 = Automatic).
This prevents adding unnecessary dimensions to the dataset.
- Owner_Type (4 unique values: First, Second, Third, Fourth & Above)
- We have 4 distinct categories, One-Hot Encoding is the best approach to retain interpretability.
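The binary encoding for Transmission is just a mapping; a minimal sketch on toy values:

```python
import pandas as pd

toy = pd.DataFrame({"Transmission": ["Manual", "Automatic", "Manual"]})

# 0 = Manual, 1 = Automatic, as planned above
toy["Transmission_Encoded"] = toy["Transmission"].map({"Manual": 0, "Automatic": 1})
print(toy["Transmission_Encoded"].tolist())
```

A single 0/1 column carries the same information as two dummy columns would, without adding an extra dimension.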
Feature Engineering¶
The Name column contains both the brand and model, but the model names are too unique (2041 unique values).
By extracting the brand, we get a more generalizable categorical feature.
# Extracting the brand from the Name column. I'll extract the first word (brand name)
df["Brand"] = df["Name"].str.split().str[0]
# Dropping the original Name column
df.drop(columns=["Name"], inplace=True)
# Displaying the unique brands
print("Unique Brands in the Dataset:")
print(df["Brand"].nunique(), "unique brands")
print(df["Brand"].unique())
Unique Brands in the Dataset: 33 unique brands ['Maruti' 'Hyundai' 'Honda' 'Audi' 'Nissan' 'Toyota' 'Volkswagen' 'Tata' 'Land' 'Mitsubishi' 'Renault' 'Mercedes-Benz' 'BMW' 'Mahindra' 'Ford' 'Porsche' 'Datsun' 'Jaguar' 'Volvo' 'Chevrolet' 'Skoda' 'Mini' 'Fiat' 'Jeep' 'Smart' 'Ambassador' 'Isuzu' 'ISUZU' 'Force' 'Bentley' 'Lamborghini' 'Hindustan' 'OpelCorsa']
33 brands is a manageable number, meaning we can apply One-Hot Encoding if needed.
Some brand names appear inconsistently (e.g., "ISUZU" vs "Isuzu"). I should standardize brand names (convert everything to uppercase/lowercase).
Some brands have extra words (e.g., "Land" instead of "Land Rover"). I should inspect if any brands need corrections.
Standardizing Brand Names
# Standardizing brand names (convert all to uppercase)
df["Brand"] = df["Brand"].str.upper()
# Displaying unique brands
print("Unique Brands After Standardization:")
print(df["Brand"].nunique(), "unique brands")
print(df["Brand"].unique())
Unique Brands After Standardization: 32 unique brands ['MARUTI' 'HYUNDAI' 'HONDA' 'AUDI' 'NISSAN' 'TOYOTA' 'VOLKSWAGEN' 'TATA' 'LAND' 'MITSUBISHI' 'RENAULT' 'MERCEDES-BENZ' 'BMW' 'MAHINDRA' 'FORD' 'PORSCHE' 'DATSUN' 'JAGUAR' 'VOLVO' 'CHEVROLET' 'SKODA' 'MINI' 'FIAT' 'JEEP' 'SMART' 'AMBASSADOR' 'ISUZU' 'FORCE' 'BENTLEY' 'LAMBORGHINI' 'HINDUSTAN' 'OPELCORSA']
All names are now uppercase, so I've eliminated inconsistencies like ISUZU vs Isuzu.
The number of brands reduced from 33 → 32, meaning a duplicate or inconsistency was resolved.
One potential correction: "LAND" might actually be LAND ROVER, I will verify if that's correct.
# Checking all car names containing "LAND"
print(df[df["Brand"] == "LAND"]["Brand"].value_counts())
print(df[df["Brand"] == "LAND"])
Brand
LAND 67
Name: count, dtype: int64
S.No. Location Year Kilometers_Driven Fuel_Type Transmission \
13 13 Delhi 2014 72000 Diesel Automatic
14 14 Pune 2012 85000 Diesel Automatic
191 191 Coimbatore 2018 36091 Diesel Automatic
311 311 Delhi 2017 44000 Diesel Automatic
399 399 Hyderabad 2012 56000 Diesel Automatic
... ... ... ... ... ... ...
6434 6434 Kochi 2012 89190 Diesel Automatic
6717 6717 Kochi 2018 23342 Diesel Automatic
6857 6857 Mumbai 2011 87000 Diesel Automatic
7157 7157 Hyderabad 2015 49000 Diesel Automatic
7198 7198 Hyderabad 2012 147202 Diesel Automatic
Owner_Type Mileage Engine Power Seats New_price Price Brand
13 First 12.70 2179.0 187.70 5.0 NaN 27.00 LAND
14 Second 0.00 2179.0 115.00 5.0 NaN 17.50 LAND
191 First 12.70 2179.0 187.70 5.0 NaN 55.76 LAND
311 First 12.70 2179.0 187.70 5.0 NaN 44.00 LAND
399 First 12.70 2179.0 187.70 5.0 NaN 30.00 LAND
... ... ... ... ... ... ... ... ...
6434 Second 11.40 2993.0 245.41 7.0 NaN NaN LAND
6717 First 12.83 2179.0 147.50 5.0 NaN NaN LAND
6857 First 0.00 2179.0 115.00 5.0 NaN NaN LAND
7157 Second 12.70 2179.0 187.70 5.0 NaN NaN LAND
7198 First 11.80 2993.0 241.60 7.0 NaN NaN LAND
[67 rows x 14 columns]
We know from real-world knowledge that the only major brand that starts with LAND is LAND ROVER.
There is no separate ROVER brand in the dataset, which further confirms that LAND is likely an incorrect truncation.
Consistent engine sizes (2179 cc, 2993 cc) match known LAND ROVER models.
Power values (187.70 bhp, 147.50 bhp) are similar to LAND ROVER vehicles.
All cars labeled as LAND have high-end diesel engines, matching LAND ROVER's lineup.
# Correcting LAND to LAND ROVER
df["Brand"] = df["Brand"].replace("LAND", "LAND ROVER")
# Verifying the correction
print(df["Brand"].unique())
['MARUTI' 'HYUNDAI' 'HONDA' 'AUDI' 'NISSAN' 'TOYOTA' 'VOLKSWAGEN' 'TATA' 'LAND ROVER' 'MITSUBISHI' 'RENAULT' 'MERCEDES-BENZ' 'BMW' 'MAHINDRA' 'FORD' 'PORSCHE' 'DATSUN' 'JAGUAR' 'VOLVO' 'CHEVROLET' 'SKODA' 'MINI' 'FIAT' 'JEEP' 'SMART' 'AMBASSADOR' 'ISUZU' 'FORCE' 'BENTLEY' 'LAMBORGHINI' 'HINDUSTAN' 'OPELCORSA']
Missing value treatment¶
- I'll first treat the easiest missing values:
Seats - Imputing with mode, since seats are fixed per car type.
- For the features Mileage, Power, Engine, I'll check their distributions in EDA first in order to choose the most appropriate way to treat them.
# Imputing Seats with the mode; assignment avoids the inplace fillna on a column slice, which pandas is deprecating
df["Seats"] = df["Seats"].fillna(df["Seats"].mode()[0])
# Verifying missing values again
print("Missing Values After Seats Imputation:")
print(df.isnull().sum())
Missing Values After Seats Imputation:
S.No.                    0
Location                 0
Year                     0
Kilometers_Driven        0
Fuel_Type                0
Transmission             0
Owner_Type               0
Mileage                  2
Engine                  46
Power                  175
Seats                    0
New_price             6247
Price                 1234
Brand                    0
dtype: int64
- Seats has now 0 missing values.
I'm now deciding to drop the New_price field for the following reasons:
86% missing values (6247 out of 7253 rows): too much missing data to impute reliably.
Used car prices are largely independent of new car prices; dealers set used car prices based on market conditions, not just the original price.
This feature isn't essential for modeling, as it doesn't directly serve our price prediction goal.
Keeping it adds unnecessary complexity, while dropping it simplifies the dataset.
# Dropping the New_price column
df.drop(columns=["New_price"], inplace=True)
# Verifying that it's gone
print("Columns after dropping 'New_price':")
print(df.columns)
Columns after dropping 'New_price':
Index(['S.No.', 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type',
'Transmission', 'Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats',
'Price', 'Brand'],
dtype='object')
- As we can confirm from the resulting list of the columns of our dataset, the New_price field is no longer present in our data.
Univariate Analysis¶
Questions:
- Do univariate analysis for numerical and categorical variables?
- Check the distribution of the different variables. Are the distributions skewed?
- Do we need a log transformation? If so, for which variables?
- Perform the log transformation (if needed) and write down your observations.
- Do univariate analysis for numerical and categorical variables.
Let's start by analysing the numerical variables first
# Defining numerical columns to analyze
numerical_features = ["Year", "Kilometers_Driven", "Mileage", "Engine", "Power", "Seats", "Price"]
# Plotting histograms for numerical features
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_features, 1):
plt.subplot(3, 3, i) # Adjusting rows and columns based on number of features
sns.histplot(df[col], bins=30, kde=True)
plt.title(f"Distribution of {col}")
plt.tight_layout()
plt.show()
- Year (Manufacturing Year)
Left-skewed (most cars are from recent years).
No need for log transformation.
Instead, we should convert this to "Car Age".
- Kilometers Driven
- Highly right-skewed with extreme outliers.
- Log transformation to make it more normally distributed.
- Mileage (KMPL)
- Looks roughly normal but has some low and high extremes.
- No log transformation needed.
- I'll handle missing values based on Fuel Type & Brand.
- Engine (CC)
- Right-skewed, showing different peaks for different car segments.
- Log transformation to normalize it.
- Power (BHP)
- Right-skewed with multiple peaks (different categories of vehicles).
- Log transformation to reduce skewness.
- Seats
- Categorical in nature (5-seaters dominate).
- No log transformation needed.
- Price (Target Variable)
- Highly right-skewed
- Log transformation needed.
- I'll keep the original Price for final R² and RMSE evaluation.
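Evaluating on the original scale then means undoing the transform with np.expm1, the inverse of np.log1p; a minimal sketch with illustrative prices:

```python
import numpy as np

# Illustrative original-scale prices (lakh INR)
price_actual = np.array([1.75, 12.50, 4.50])

price_log = np.log1p(price_actual)   # scale the model would be trained on
price_back = np.expm1(price_log)     # inverse transform before computing final metrics
print(np.round(price_back, 2))
```

log1p/expm1 are used instead of log/exp so that a price of 0 maps cleanly to 0 and back.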
As noted earlier, the Year variable (year of manufacture) can be transformed into a more meaningful feature, "age of the car", which better represents depreciation.
# Converting the Year column to Car Age and dropping the Year field
df["Car_Age"] = 2024 - df["Year"]
df.drop(columns=["Year"], inplace=True)
Now I'll apply the Log Transformations on the required features.
# Applying log transformation to skewed numerical features
df["Kilometers_Driven_Log"] = np.log1p(df["Kilometers_Driven"])
df["Engine_Log"] = np.log1p(df["Engine"])
df["Power_Log"] = np.log1p(df["Power"])
df["Price_Log"] = np.log1p(df["Price"]) # Target variable, but keep original
# Dropping original versions of transformed features (except for Price)
df.drop(columns=["Kilometers_Driven", "Engine", "Power"], inplace=True)
# Verifying changes
print(df.head())
S.No. Location Fuel_Type Transmission Owner_Type Mileage Seats Price \
0 0 Mumbai CNG Manual First 26.60 5.0 1.75
1 1 Pune Diesel Manual First 19.67 5.0 12.50
2 2 Chennai Petrol Manual First 18.20 5.0 4.50
3 3 Chennai Diesel Manual First 20.77 7.0 6.00
4 4 Coimbatore Diesel Automatic Second 15.20 5.0 17.74
Brand Car_Age Kilometers_Driven_Log Engine_Log Power_Log Price_Log
0 MARUTI 14 11.184435 6.906755 4.080246 1.011601
1 HYUNDAI 9 10.621352 7.367077 4.845761 2.602690
2 HONDA 13 10.736418 7.090077 4.496471 1.704748
3 MARUTI 12 11.373675 7.130099 4.497139 1.945910
4 AUDI 11 10.613271 7.585281 4.954418 2.930660
Bivariate Analysis¶
Questions:
- Plot a scatter plot for the log transformed values(if log_transformation done in previous steps)?
- What can we infer from the correlation heatmap? Is there correlation between the dependent and independent variables?
- Plot a box plot for target variable and categorical variable 'Location' and write your observations?
- Plot a scatter plot for the log transformed values(if log_transformation done in previous steps)?
# Plotting Scatter Plots for Log-Transformed Features
numerical_features = ["Kilometers_Driven_Log", "Engine_Log", "Power_Log", "Car_Age"]
plt.figure(figsize=(12, 10))
for i, col in enumerate(numerical_features, 1):
plt.subplot(2, 2, i)
sns.scatterplot(x=df[col], y=df["Price_Log"], alpha=0.5)
plt.title(f"Scatter Plot: Price_Log vs {col}")
plt.tight_layout()
plt.show()
- Price_Log vs Kilometers_Driven_Log
Weak negative correlation, as Kilometers_Driven_Log increases, Price_Log slightly decreases.
This makes sense, as cars with higher mileage usually have lower resale value. But there's a lot of spread, meaning mileage alone isn't a strong predictor.
- Price_Log vs Engine_Log
- Strong positive correlation, bigger engines tend to have higher prices.
- This aligns with expectations: luxury and performance cars usually have larger engines.
- Price_Log vs Power_Log
- Strongest positive correlation among all features
- More power directly influences price since powerful cars are more expensive.
- I'll have to check if Power_Log and Engine_Log are highly correlated (multicollinearity risk).
- Price_Log vs Car_Age
- Clear negative correlation, older cars have lower prices.
- What can we infer from the correlation heatmap? Is there correlation between the dependent and independent variables?
# Plotting the correlation heatmap
numerical_df = df.select_dtypes(include=["number"])
plt.figure(figsize=(12, 8))
sns.heatmap(numerical_df.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
- Price_Log (Target Variable)
Strong positive correlation with Power_Log (0.80), higher power leads to higher prices.
High correlation with Engine_Log (0.70), bigger engines also lead to higher prices, but slightly less than power.
Moderate negative correlation with Car_Age (-0.47), older cars have lower prices.
Weak correlation with Kilometers_Driven_Log (-0.20), Mileage affects Price, but not as strongly as power and engine size.
- Power_Log & Engine_Log (0.88)
- These two are highly correlated, meaning we might have multicollinearity. I'll need to check Variance Inflation Factor when preparing data for modeling.
- Mileage has a moderate negative correlation with Power and Engine
- Mileage is negatively correlated with Power_Log (-0.55) and Engine_Log (-0.59). This makes sense since cars with higher mileage tend to be less powerful
- Plot a box plot for target variable and categorical variable 'Location' and write your observations.
# Plotting Box Plot for Price_Log vs Location
plt.figure(figsize=(15, 6))
sns.boxplot(x=df["Location"], y=df["Price_Log"])
plt.xticks(rotation=90)
plt.title("Box Plot: Price_Log vs Location")
plt.show()
- Price Variation Across Locations
Bangalore and Coimbatore have the highest median Price_Log values.
Kolkata, Hyderabad, and Jaipur have the lowest median used car prices.
This suggests that used car prices are higher in certain cities, possibly due to demand, purchasing power, or local market trends.
- Presence of Outliers
All locations have significant outliers on the higher end.
This is expected because luxury and high-performance cars exist in every city, but they are not the majority.
We might need to handle extreme outliers carefully when training the model.
- Overall Distribution Similarity
Most locations have a similar Interquartile Range, meaning price distribution is somewhat consistent across cities.
However, Bangalore and Coimbatore show greater price variation, suggesting a mix of both affordable and luxury vehicles.
Missing value treatment (Part 2)¶
- Mileage (Missing: 2 values)
Since there are only 2 missing values, we can impute them using the median Mileage of the same Fuel_Type.
Because fuel type affects mileage (Diesel cars have better mileage than Petrol cars).
- Engine_Log (Missing: 46 values)
I'll impute using the median Engine_Log based on Brand.
Because different brands manufacture cars with different engine sizes.
- Power_Log (Missing: 175 values)
- Similar to Engine_Log, we'll impute missing values using the median Power_Log based on Brand.
# Imputing Mileage based on Fuel_Type; assignment avoids the inplace fillna on a column slice
df["Mileage"] = df["Mileage"].fillna(df.groupby("Fuel_Type")["Mileage"].transform("median"))
# Imputing Engine_Log based on Brand
df["Engine_Log"] = df["Engine_Log"].fillna(df.groupby("Brand")["Engine_Log"].transform("median"))
# Imputing Power_Log based on Brand
df["Power_Log"] = df["Power_Log"].fillna(df.groupby("Brand")["Power_Log"].transform("median"))
# Verifying that all missing values are handled
print("Missing Values After Imputation:\n", df.isnull().sum())
Missing Values After Imputation:
S.No.                       0
Location                    0
Fuel_Type                   0
Transmission                0
Owner_Type                  0
Mileage                     2
Seats                       0
Price                    1234
Brand                       0
Car_Age                     0
Kilometers_Driven_Log       0
Engine_Log                  0
Power_Log                   2
Price_Log                1234
dtype: int64
Mileage is still missing (2 values). The group-median imputation should have filled these, so I'll double-check why it didn't and reapply if necessary.
Power_Log still has 2 missing values. I'll investigate whether these cars belong to a brand where all values were missing (so no median could be computed).
Price and Price_Log still have 1234 missing values; I'll drop these rows before modeling.
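The leftover NaNs can be reproduced in isolation. A minimal sketch (synthetic data; the column names mirror the notebook's, but the values are made up) showing that `transform("median")` returns NaN for a group whose values are all missing, so the fill is a no-op for that group:

```python
import pandas as pd
import numpy as np

# Toy frame: the "Electric" group has no observed Mileage at all,
# so its group median is NaN and the fillna below cannot fill it.
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Petrol", "Diesel", "Electric", "Electric"],
    "Mileage":   [17.0,     np.nan,   22.0,     np.nan,     np.nan],
})

group_median = toy.groupby("Fuel_Type")["Mileage"].transform("median")
toy["Mileage"] = toy["Mileage"].fillna(group_median)

# The Petrol NaN is filled with 17.0, but the two Electric rows stay NaN.
print(toy["Mileage"].isnull().sum())  # 2
```

This is exactly why the Electric-car Mileage values survive the first imputation pass.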
# Re-checking Mileage and Power_Log for missing values
print(df[df["Mileage"].isnull()]) # Check the rows where Mileage is still missing
print(df[df["Power_Log"].isnull()]) # Check missing Power_Log rows
S.No. Location Fuel_Type Transmission Owner_Type Mileage Seats Price \
4446 4446 Chennai Electric Automatic First NaN 5.0 13.00
4904 4904 Mumbai Electric Automatic First NaN 5.0 12.75
Brand Car_Age Kilometers_Driven_Log Engine_Log Power_Log \
4446 MAHINDRA 8 10.819798 4.290459 3.737670
4904 TOYOTA 13 10.691968 7.494986 4.304065
Price_Log
4446 2.639057
4904 2.621039
S.No. Location Fuel_Type Transmission Owner_Type Mileage Seats Price \
915 915 Pune Diesel Automatic Second 0.0 2.0 3.0
6216 6216 Pune Diesel Manual Second 14.1 5.0 NaN
Brand Car_Age Kilometers_Driven_Log Engine_Log Power_Log \
915 SMART 16 11.542494 6.684612 NaN
6216 HINDUSTAN 28 11.082158 7.598900 NaN
Price_Log
915 1.386294
6216 NaN
- Mileage (2 missing values), both cars are Electric (Mahindra & Toyota)
- Since electric cars don't have a "Mileage" value in the traditional sense, we can impute them with the median of other Electric cars (if available). If not, we can use a default estimate.
- Power_Log (2 missing values), both cars belong to rare brands (SMART and HINDUSTAN).
- These brands likely had very few entries, so their median was NaN. I'll impute them with the median Power_Log of all cars instead.
# Fixing Mileage for Electric cars
# Note: if every Electric car is missing Mileage, this group median is itself NaN,
# so the fill below is a no-op and those rows stay missing.
electric_mask = df["Fuel_Type"] == "Electric"
electric_median_mileage = df.loc[electric_mask, "Mileage"].median()
df.loc[electric_mask, "Mileage"] = df.loc[electric_mask, "Mileage"].fillna(electric_median_mileage)
# Fixing Power_Log for rare brands using overall median
overall_median_power = df["Power_Log"].median()
df["Power_Log"] = df["Power_Log"].fillna(overall_median_power)
# Verifying if all missing values are gone
print("Missing Values After Final Fix:\n", df.isnull().sum())
Missing Values After Final Fix:
S.No.                    0
Location                 0
Fuel_Type                0
Transmission             0
Owner_Type               0
Mileage                  2
Seats                    0
Price                 1234
Brand                    0
Car_Age                  0
Kilometers_Driven_Log    0
Engine_Log               0
Power_Log                0
Price_Log             1234
dtype: int64
- Two Mileage values are still missing even after the logical imputation attempt (the Electric-car median was itself NaN), so I'll drop those rows; losing only 2 rows will not affect the dataset significantly.
# Dropping rows where Mileage is still missing
df.dropna(subset=["Mileage"], inplace=True)
# Dropping rows where Price is missing
df.dropna(subset=["Price"], inplace=True)
# Verifying that all missing values are gone
print("Final Missing Value Check:\n", df.isnull().sum())
Final Missing Value Check:
S.No.                    0
Location                 0
Fuel_Type                0
Transmission             0
Owner_Type               0
Mileage                  0
Seats                    0
Price                    0
Brand                    0
Car_Age                  0
Kilometers_Driven_Log    0
Engine_Log               0
Power_Log                0
Price_Log                0
dtype: int64
Important Insights from EDA and Data Preprocessing¶
What are the most important observations and insights from the data, based on the EDA and Data Preprocessing performed?
- Price Drivers: What Affects Car Prices the Most?
Power_Log & Engine_Log strongly correlate with Price_Log
Cars with higher power and bigger engines tend to have higher prices. (Correlation: Power_Log = 0.80, Engine_Log = 0.70 with Price_Log)
Car Age has a significant negative correlation with Price
Older cars are cheaper, depreciation plays a key role. (Correlation: Car_Age = -0.47 with Price_Log)
Kilometers Driven has only a weak negative correlation with Price
Higher mileage slightly reduces price, but not as much as age or engine power. (Correlation: Kilometers_Driven_Log = -0.20 with Price_Log)
Fuel Type & Transmission influence pricing
Diesel & Automatic cars generally have higher prices than Petrol & Manual cars.
- Market Trends: How Do Prices Vary Across Locations?
Bangalore & Coimbatore have the highest used car prices.
Suggests higher demand or luxury vehicle preference in these cities.
Kolkata, Hyderabad, and Jaipur have the lowest median prices.
Local market conditions, affordability, and demand could be factors.
- Data Quality & Fixes
We've handled missing values accordingly:
Used median values grouped by relevant features (Fuel_Type, Brand).
For extreme cases (rare brands, electric cars), we used a logical approach.
Dropped New_Price as it had too many missing values.
Log transformation improved the feature distributions.
Price, Kilometers Driven, Engine, and Power were highly right-skewed, and the log transform brought them much closer to normal.
Final dataset is 100% clean and ready for modeling.
Building Various Models¶
- What we want to predict is the "Price". Although we created a normalized version (Price_Log), we will train on the original Price and keep Price_Log for diagnostics.
- Before we proceed to the model, we'll have to encode categorical features. We will drop categorical features like Name.
- We'll split the data into train and test, to be able to evaluate the model that we build on the train data.
- Build Regression models using train data.
- Evaluate the model performance.
Encoding¶
Since we have categorical features (Location, Fuel_Type, Transmission, Owner_Type, Brand), we need to convert them into numerical values so models can understand them.
We'll use One-Hot Encoding (pd.get_dummies()) with drop_first=True for all of these categorical variables.
# One-Hot Encoding for categorical variables
df_encoded = pd.get_dummies(df, columns=["Location", "Fuel_Type", "Transmission", "Owner_Type", "Brand"], drop_first=True)
# Verify the new dataset
print("Encoded DataFrame Shape:", df_encoded.shape)
print("First Rows After Encoding:\n", df_encoded.head())
Encoded DataFrame Shape: (6017, 55)
First Rows After Encoding:
S.No. Mileage Seats Price Car_Age Kilometers_Driven_Log Engine_Log \
0 0 26.60 5.0 1.75 14 11.184435 6.906755
1 1 19.67 5.0 12.50 9 10.621352 7.367077
2 2 18.20 5.0 4.50 13 10.736418 7.090077
3 3 20.77 7.0 6.00 12 11.373675 7.130099
4 4 15.20 5.0 17.74 11 10.613271 7.585281
Power_Log Price_Log Location_Bangalore Location_Chennai \
0 4.080246 1.011601 False False
1 4.845761 2.602690 False False
2 4.496471 1.704748 False True
3 4.497139 1.945910 False True
4 4.954418 2.930660 False False
Location_Coimbatore Location_Delhi Location_Hyderabad Location_Jaipur \
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 True False False False
Location_Kochi Location_Kolkata Location_Mumbai Location_Pune \
0 False False True False
1 False False False True
2 False False False False
3 False False False False
4 False False False False
Fuel_Type_Diesel Fuel_Type_LPG Fuel_Type_Petrol Transmission_Manual \
0 False False False True
1 True False False True
2 False False True True
3 True False False True
4 True False False False
Owner_Type_Fourth & Above Owner_Type_Second Owner_Type_Third Brand_AUDI \
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False True False True
Brand_BENTLEY Brand_BMW Brand_CHEVROLET Brand_DATSUN Brand_FIAT \
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
Brand_FORCE Brand_FORD Brand_HONDA Brand_HYUNDAI Brand_ISUZU \
0 False False False False False
1 False False False True False
2 False False True False False
3 False False False False False
4 False False False False False
Brand_JAGUAR Brand_JEEP Brand_LAMBORGHINI Brand_LAND ROVER \
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
Brand_MAHINDRA Brand_MARUTI Brand_MERCEDES-BENZ Brand_MINI \
0 False True False False
1 False False False False
2 False False False False
3 False True False False
4 False False False False
Brand_MITSUBISHI Brand_NISSAN Brand_PORSCHE Brand_RENAULT Brand_SKODA \
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
Brand_SMART Brand_TATA Brand_TOYOTA Brand_VOLKSWAGEN Brand_VOLVO
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
The dataset now has 55 columns, which means our categorical features have been effectively converted into numerical form.
One-Hot Encoding worked correctly:
Location, Fuel_Type, Transmission, Owner_Type, and Brand have been transformed into binary indicator columns (True/False).
drop_first=True ensured that we avoided the dummy variable trap (perfectly collinear, redundant indicator columns).
All categorical variables are now numerical.
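A tiny sketch (a synthetic one-column frame, not the notebook's df) of how drop_first=True collapses a 2-category column into a single baseline-relative indicator:

```python
import pandas as pd

toy = pd.DataFrame({"Transmission": ["Manual", "Automatic", "Manual"]})

# With drop_first=True, a 2-category column becomes one indicator column:
# Transmission_Manual. "Automatic" is the implied baseline (all False),
# which avoids the dummy variable trap (perfectly collinear columns).
encoded = pd.get_dummies(toy, columns=["Transmission"], drop_first=True)
print(encoded.columns.tolist())  # ['Transmission_Manual']
```

The same logic applies to Brand and Location: one category per variable is absorbed into the baseline.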
Split the Data¶
Question:
- Why should we drop 'Name', 'Price', 'Price_Log', and 'Kilometers_Driven' from X before splitting?
- Name (Already Dropped Earlier)
This column contained both Brand and Model names.
We've extracted Brand as a separate categorical feature.
Model names were too unique (~2000 values), making them useless for prediction.
Already dropped earlier in Feature Engineering.
- Price (Target Variable)
We're predicting Price, so it must be in y, NOT in X.
Including it in X would leak the target into the model, making predictions meaningless.
- Price_Log (Transformed Target Variable)
We've applied log transformation on Price to normalize it.
But we only use it for model performance analysis, like checking normality, not as an input feature.
We train the model on the original Price (y); if we ever model Price_Log instead, predictions would need to be converted back to the original scale.
- Kilometers_Driven (Replaced by Kilometers_Driven_Log)
We've log-transformed Kilometers_Driven into Kilometers_Driven_Log to reduce skewness.
The original Kilometers_Driven is no longer useful, as the model should use the transformed version.
Already dropped earlier in Feature Engineering.
# Defining features and target
X = df_encoded.drop(columns=["Price", "Price_Log"])
y = df_encoded["Price"] # Target variable
# Train-test split (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Verifying split
print(f"Training Set: {X_train.shape}, Test Set: {X_test.shape}")
Training Set: (4813, 53), Test Set: (1204, 53)
For Regression Problems, some of the algorithms used are :
1) Linear Regression
2) Ridge / Lasso Regression
3) Decision Trees
4) Random Forest
1) Linear Regression
# Initializing the Linear Regression model
lin_reg = LinearRegression()
# Training the model
lin_reg.fit(X_train, y_train)
# Making predictions
y_train_pred = lin_reg.predict(X_train)
y_test_pred = lin_reg.predict(X_test)
# Evaluating model performance
# R2
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
# RMSE
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
# MAE
train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)
# MAPE
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
test_mape = mean_absolute_percentage_error(y_test, y_test_pred)
# Displaying results
print("Linear Regression Model Performance:")
print(f"R² Score (Train): {train_r2:.4f}")
print(f"R² Score (Test) : {test_r2:.4f}")
print(f"RMSE (Train) : {train_rmse:.4f}")
print(f"RMSE (Test) : {test_rmse:.4f}")
print(f"MAE (Train) : {train_mae:.4f}")
print(f"MAE (Test) : {test_mae:.4f}")
print(f"MAPE (Train) : {train_mape:.2f}%")
print(f"MAPE (Test) : {test_mape:.2f}%")
Linear Regression Model Performance:
R² Score (Train): 0.7557
R² Score (Test) : 0.7588
RMSE (Train) : 5.5733
RMSE (Test) : 5.3205
MAE (Train) : 3.0761
MAE (Test) : 3.0943
MAPE (Train) : 60.80%
MAPE (Test) : 65.63%
- R² Score (Train: 0.7557, Test: 0.7588)
This means our model explains ~75.6% of the variance in car prices.
The train and test R² are very close, meaning no overfitting.
- RMSE (Train: 5.5733, Test: 5.3205)
On average, our model predicts used car prices with an error of ~5.3 lakh.
This is quite reasonable, but we'll try reducing the error further using other models.
- MAE (Train: 3.08, Test: 3.09)
On average, our model predicts car prices within about 3.1 lakh of the actual price.
This is reasonable for a baseline model, but we'll aim to reduce it.
- MAPE (Train: 60.80%, Test: 65.63%)
This means that, on average, our model's predictions are ~60-65% off from the actual prices.
This is quite high, indicating that a linear model may not fully capture complex pricing patterns.
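The gap between a modest MAE (~3 lakh) and a large MAPE (~65%) is typical when many cars are cheap, because MAPE divides each error by the true price. A minimal illustration with made-up prices:

```python
import numpy as np

# MAPE divides each error by the true value, so the same absolute error
# on a cheap car contributes far more than on an expensive one.
y_true = np.array([1.0, 20.0])   # prices in lakh: one cheap car, one expensive
y_pred = np.array([2.0, 21.0])   # both predictions off by exactly 1 lakh

ape = np.abs((y_true - y_pred) / y_true) * 100
print(ape)          # [100.   5.] -- same 1-lakh error, wildly different %
print(ape.mean())   # 52.5 -- the cheap car dominates the MAPE
```

So a high MAPE here partly reflects the price distribution, not only model quality.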
Checking Linear Regression Assumptions¶
Checking the mean of the residuals¶
# Predicting on train data to get residuals
y_train_pred = lin_reg.predict(X_train)
residuals = y_train - y_train_pred
# Checking the Mean of Residuals
mean_residuals = np.mean(residuals)
print(f"Mean of Residuals: {mean_residuals:.5f}")
Mean of Residuals: 0.00000
- Mean of Residuals = 0.00000, which confirms that the residuals are centered around zero. This is expected for OLS with an intercept and indicates the model is unbiased on the training data.
Homoscedasticity Check¶
# Homoscedasticity Check - Residuals vs Predicted
plt.figure(figsize=(6, 4))
sns.scatterplot(x=y_train_pred, y=residuals, alpha=0.5)
plt.axhline(y=0, color="r", linestyle="--", linewidth=2)
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Homoscedasticity Check: Residuals vs Predicted")
plt.show()
The residuals do not appear to be randomly scattered around the zero line.
Instead, there is a visible funnel shape, meaning that the variance of residuals increases with higher predicted values.
This indicates heteroscedasticity, which means our model's errors are not constant across all predictions.
The model might not be capturing variance well, especially for higher-priced cars.
This violates the assumption of constant variance and suggests that some transformations or alternative modeling techniques might improve performance.
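The funnel shape can also be checked numerically rather than only visually, e.g. by comparing residual spread across prediction quantiles. A sketch on synthetic heteroscedastic residuals (made-up data, not the notebook's):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic heteroscedastic fit: residual spread grows with the prediction,
# mimicking the funnel shape seen in the plot above.
y_pred = rng.uniform(1, 40, 2000)             # predicted prices (lakh)
residuals = rng.normal(0, 0.15 * y_pred)      # noise scale grows with prediction

# Compare residual std in the cheapest vs priciest quartile of predictions.
lo, hi = np.quantile(y_pred, [0.25, 0.75])
std_low = residuals[y_pred <= lo].std()
std_high = residuals[y_pred >= hi].std()
print(std_low < std_high)  # True -- variance rises with the prediction
```

Applied to the real residuals, a large ratio between the two stds would confirm the visual diagnosis.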
Linearity of Variables¶
# Plotting a Q-Q plot to check whether the residuals align with a normal distribution
plt.figure(figsize=(6, 4))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q Plot for Linearity Check")
plt.show()
The residuals should lie on the red diagonal line if they follow a normal distribution.
However, our residuals deviate significantly at both tails, especially at higher quantiles (right side).
The S-shape pattern indicates that the errors are not normally distributed.
This suggests the presence of non-linearity in the relationship between features and the target variable.
The model may be underestimating or overestimating extreme values.
This violates the assumption of normality of residuals.
Normality of Error Terms¶
# Plotting a histogram to check if the errors follow a normal distribution
plt.figure(figsize=(6, 4))
sns.histplot(residuals, bins=30, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Normality of Residuals")
plt.show()
The histogram shows that most residuals are clustered around zero, but there is a sharp peak in the center, which indicates high kurtosis (a heavier central concentration).
The distribution also has long tails, suggesting outliers and non-normality.
The right tail is significantly stretched, which means some predictions are much higher than expected.
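Sample skewness and excess kurtosis give numbers to back up the histogram's visual diagnosis. A hedged sketch using synthetic heavy-tailed residuals (Student's t) in place of the notebook's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic heavy-tailed residuals (Student's t, df=3) standing in for the
# model residuals. For a normal distribution, skewness and excess kurtosis
# are both ~0; a sharp peak with long tails pushes kurtosis well above 0.
heavy = rng.standard_t(df=3, size=5000)

print("skewness       :", stats.skew(heavy))
print("excess kurtosis:", stats.kurtosis(heavy))  # >> 0 for heavy tails
```

Running the same two statistics on the real residuals would quantify the "sharp peak, long tails" observation above.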
How is the model performing after cross-validation?¶
# Cross-Validation performance check
# Performing 5-Fold Cross-Validation (scoring based on R²)
cv_scores = cross_val_score(lin_reg, X_train, y_train, cv=5, scoring='r2')
# Displaying cross-validation results
print(f"\nCross-Validation Results:")
print(f"Mean R² Score: {cv_scores.mean():.4f}")
print(f"Standard Deviation of R²: {cv_scores.std():.4f}")
print(f"All R² Scores: {cv_scores}")
Cross-Validation Results:
Mean R² Score: 0.7358
Standard Deviation of R²: 0.0281
All R² Scores: [0.74678498 0.76442222 0.70710684 0.76301225 0.69773929]
Mean R² Score: 0.7358
The model explains about 73.58% of the variance in the held-out folds on average.
This is consistent with our initial test R² score (0.7588), confirming that our linear regression model is relatively stable.
Standard Deviation of R²: 0.0281
A lower standard deviation indicates that the model's performance is fairly consistent across different validation folds. However, 0.0281 is not negligible, meaning some variation exists across different train-test splits.
The individual R² scores across the 5 folds range from 0.6977 to 0.7644, showing some fluctuation in predictive power.
The lower score (0.6977) suggests that in some folds, the model struggles with certain subsets of the data.
2) Ridge / Lasso Regression
I'll now train Ridge and Lasso Regression models.
Ridge Regression adds L2 regularization, which helps reduce overfitting by penalizing large coefficients.
Lasso Regression adds L1 regularization, which shrinks some coefficients to zero, effectively performing feature selection.
Both models help with multicollinearity, which matters here since Power_Log and Engine_Log are highly correlated.
# Training Ridge Regression
ridge = Ridge(alpha=1.0) # Alpha is the regularization strength
ridge.fit(X_train, y_train)
y_train_pred_ridge = ridge.predict(X_train)
y_test_pred_ridge = ridge.predict(X_test)
# Training Lasso Regression
lasso = Lasso(alpha=0.01) # Alpha should be small for Lasso to avoid aggressive feature elimination
lasso.fit(X_train, y_train)
y_train_pred_lasso = lasso.predict(X_train)
y_test_pred_lasso = lasso.predict(X_test)
# Evaluating both models
def evaluate_model(model_name, y_train, y_train_pred, y_test, y_test_pred):
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)
train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
test_mape = mean_absolute_percentage_error(y_test, y_test_pred)
print(f"\n{model_name} Model Performance:")
print(f"R² Score (Train): {train_r2:.4f}")
print(f"R² Score (Test) : {test_r2:.4f}")
print(f"RMSE (Train) : {train_rmse:.4f}")
print(f"RMSE (Test) : {test_rmse:.4f}")
print(f"MAE (Train) : {train_mae:.4f}")
print(f"MAE (Test) : {test_mae:.4f}")
print(f"MAPE (Train) : {train_mape:.2f}%")
print(f"MAPE (Test) : {test_mape:.2f}%")
# Printing results
evaluate_model("Ridge Regression", y_train, y_train_pred_ridge, y_test, y_test_pred_ridge)
evaluate_model("Lasso Regression", y_train, y_train_pred_lasso, y_test, y_test_pred_lasso)
Ridge Regression Model Performance:
R² Score (Train): 0.7519
R² Score (Test) : 0.7596
RMSE (Train) : 5.6158
RMSE (Test) : 5.3118
MAE (Train) : 3.1204
MAE (Test) : 3.1185
MAPE (Train) : 61.73%
MAPE (Test) : 66.54%

Lasso Regression Model Performance:
R² Score (Train): 0.7469
R² Score (Test) : 0.7580
RMSE (Train) : 5.6724
RMSE (Test) : 5.3294
MAE (Train) : 3.1705
MAE (Test) : 3.1508
MAPE (Train) : 62.94%
MAPE (Test) : 67.32%
R² Scores (Train & Test) are nearly identical across all models.
Linear Regression: Train: 0.7557, Test: 0.7588
Ridge Regression: Train: 0.7519, Test: 0.7596
Lasso Regression: Train: 0.7469, Test: 0.7580
Conclusion: Regularization (Ridge & Lasso) didn't significantly improve generalization.
RMSE, MAE, and MAPE are also very close.
Lasso performs slightly worse in Train R² (0.7469) because Lasso shrinks some coefficients to zero, meaning it's removing some features.
Ridge performs almost exactly like Linear Regression.
MAPE is still high (~60-67%) in all models.
This suggests that a purely linear model may not fully capture the complexities of used car pricing.
- Ridge and Lasso didn't improve much.
- Since our linear models are hitting a performance ceiling, let's train Decision Trees & Random Forest, which handle non-linear relationships better.
- We should check feature importance in Lasso: it removes features by shrinking their coefficients to zero, so let's print which features were eliminated.
# Getting feature names
feature_names = X_train.columns
# Getting Lasso coefficients
lasso_coeffs = lasso.coef_
# Identifying features with zero coefficients (dropped by Lasso)
dropped_features = feature_names[lasso_coeffs == 0]
# Printing dropped Features
print("Features Dropped by Lasso Regression:")
print(dropped_features)
Features Dropped by Lasso Regression:
Index(['Location_Pune', 'Fuel_Type_Diesel', 'Fuel_Type_LPG',
'Owner_Type_Fourth & Above', 'Brand_BENTLEY', 'Brand_DATSUN',
'Brand_FIAT', 'Brand_FORCE', 'Brand_ISUZU', 'Brand_JEEP',
'Brand_MITSUBISHI', 'Brand_SMART', 'Brand_TOYOTA'],
dtype='object')
Most of the dropped features are categorical indicators (Location, Fuel_Type, Owner_Type, and specific Brands).
Some brands (e.g., Bentley, Jeep, Mitsubishi) were dropped, which means their effect on price is either negligible or already explained by other features.
Location_Pune was dropped, possibly because location isn't a strong predictor of price once features like Power, Engine, and Car Age are present.
Fuel_Type_Diesel and Fuel_Type_LPG were removed, which suggests that fuel type doesn't significantly impact used car prices when other variables are considered.
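Rather than hand-picking alpha=0.01, `LassoCV` selects the regularization strength by cross-validation, which also changes which features get zeroed out. A sketch on synthetic data (the feature counts and noise level are illustrative, not the notebook's):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in: 20 features, only 5 of them informative.
X, y = make_regression(n_samples=400, n_features=20, n_informative=5,
                       noise=5.0, random_state=42)

# LassoCV picks alpha by cross-validation instead of a hand-set 0.01.
lasso_cv = LassoCV(cv=5, random_state=42).fit(X, y)
print(lasso_cv.alpha_)              # data-driven regularization strength
print((lasso_cv.coef_ == 0).sum())  # how many features were zeroed out
```

A CV-chosen alpha makes the "which features did Lasso drop" analysis above less sensitive to an arbitrary regularization setting.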
3) Decision Trees
# Training Decision Tree Model
dt_model = DecisionTreeRegressor(random_state=42, max_depth=10) # Limiting depth to prevent overfitting
dt_model.fit(X_train, y_train)
# Making predictions
y_train_pred_dt = dt_model.predict(X_train)
y_test_pred_dt = dt_model.predict(X_test)
# Evaluating model performance with the evaluate_model helper defined in the Ridge/Lasso section
# Print Decision Tree Results
evaluate_model("Decision Tree", y_train, y_train_pred_dt, y_test, y_test_pred_dt)
Decision Tree Model Performance:
R² Score (Train): 0.9717
R² Score (Test) : 0.7441
RMSE (Train) : 1.8960
RMSE (Test) : 5.4799
MAE (Train) : 1.0258
MAE (Test) : 2.0590
MAPE (Train) : 14.22%
MAPE (Test) : 26.36%
- R² Score (Train: 0.9717, Test: 0.7441)
- The model fits the training data extremely well (~97% of variance explained). But there's a gap between Train & Test R², suggesting some overfitting.
- RMSE (Train: 1.8960, Test: 5.4799)
- The training error is very low, but the test error is higher, indicating potential overfitting.
- MAE & MAPE are significantly better than Linear Models.
MAE dropped from ~3.1 (Linear Regression) to ~2.06 on the test set, a clear improvement over the linear models.
MAPE is much lower (~26.36% vs ~65% for Linear Regression), so the predictions are now much closer to actual prices.
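Besides limiting max_depth (and the min_samples settings tuned below), scikit-learn trees also support cost-complexity pruning via ccp_alpha, which trims branches whose complexity isn't worth their impurity gain. A sketch on synthetic data (the ccp_alpha value is illustrative for this data scale):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the car data.
X, y = make_regression(n_samples=600, n_features=8, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained tree grows one leaf per training sample for continuous
# targets; cost-complexity pruning (ccp_alpha > 0) collapses weak branches.
full = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
pruned = DecisionTreeRegressor(random_state=42, ccp_alpha=50.0).fit(X_tr, y_tr)

print("leaves (full)  :", full.get_n_leaves())
print("leaves (pruned):", pruned.get_n_leaves())
```

Pruning is an alternative (or complement) to the depth/leaf-size grid search that follows.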
Hyperparameter Tuning: Decision Tree¶
# Defining parameter grid
param_grid = {
"max_depth": [5, 7, 10, 15],
"min_samples_split": [5, 10, 20],
"min_samples_leaf": [5, 10, 15]
}
# Initializing decision tree model
dt = DecisionTreeRegressor(random_state=42)
# Running GridSearchCV to find the best parameters
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring="r2", n_jobs=-1)
grid_search.fit(X_train, y_train)
# Best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
# Training Decision Tree with best parameters
best_dt = DecisionTreeRegressor(**best_params, random_state=42)
best_dt.fit(X_train, y_train)
# Making Predictions
y_train_pred_best_dt = best_dt.predict(X_train)
y_test_pred_best_dt = best_dt.predict(X_test)
# Evaluating Tuned Model
evaluate_model("Tuned Decision Tree", y_train, y_train_pred_best_dt, y_test, y_test_pred_best_dt)
Best Parameters: {'max_depth': 15, 'min_samples_leaf': 5, 'min_samples_split': 20}
Tuned Decision Tree Model Performance:
R² Score (Train): 0.9208
R² Score (Test) : 0.8195
RMSE (Train) : 3.1737
RMSE (Test) : 4.6028
MAE (Train) : 1.4059
MAE (Test) : 1.9617
MAPE (Train) : 14.97%
MAPE (Test) : 24.66%
R² Score (Test: 0.8195 vs 0.7441 before), big improvement.
The model is now explaining ~82% of the variance in car prices.
Overfitting is reduced (Train R² went from 0.9717 to 0.9208, meaning the model is no longer memorizing the training data as much).
RMSE (Test: 4.6028 vs 5.4799 before), means Lower Error.
Our model's average price prediction error dropped by almost 90,000 INR.
MAE (Test: 1.9617) and MAPE (Test: 24.66%), meaning Lower Errors.
MAE improved slightly (from 2.06 to 1.96 lakh).
MAPE dropped from 26.36% to 24.66%, so the model's percentage error fell by about 1.7 points.
# Plotting the Decision Tree
# Setting figure size
plt.figure(figsize=(20, 10))
# Plot the decision tree (limiting depth for better visualization)
plot_tree(best_dt, feature_names=X_train.columns, filled=True, rounded=True, max_depth=4) # Adjust max_depth if needed
# Showing the plot
plt.title("Decision Tree Visualization (Pruned)")
plt.show()
Feature Importance
# Extracting feature importances from the tuned decision tree model
feature_importances = best_dt.feature_importances_
# Creating a DataFrame for better visualization
importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})
# Sorting by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)
# Displaying the top 15 features
plt.figure(figsize=(12, 6))
plt.barh(importance_df['Feature'][:15], importance_df['Importance'][:15], color='skyblue')
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Top 15 Feature Importances - Decision Tree")
plt.gca().invert_yaxis()
plt.show()
# Displaying the DataFrame
print(importance_df)
                      Feature  Importance
6                   Power_Log    0.697919
3                     Car_Age    0.162960
5                  Engine_Log    0.035850
4       Kilometers_Driven_Log    0.030605
38           Brand_LAND ROVER    0.015809
1                     Mileage    0.012793
11         Location_Hyderabad    0.009526
20        Transmission_Manual    0.009494
2                       Seats    0.006371
41        Brand_MERCEDES-BENZ    0.004749
42                 Brand_MINI    0.003954
9         Location_Coimbatore    0.002448
24                 Brand_AUDI    0.001512
32                Brand_HONDA    0.001210
0                       S.No.    0.001210
33              Brand_HYUNDAI    0.000649
50               Brand_TOYOTA    0.000544
19           Fuel_Type_Petrol    0.000490
10             Location_Delhi    0.000486
14           Location_Kolkata    0.000339
40               Brand_MARUTI    0.000193
49                 Brand_TATA    0.000186
22          Owner_Type_Second    0.000166
26                  Brand_BMW    0.000144
15            Location_Mumbai    0.000089
39             Brand_MAHINDRA    0.000080
7          Location_Bangalore    0.000058
27            Brand_CHEVROLET    0.000040
31                 Brand_FORD    0.000036
51           Brand_VOLKSWAGEN    0.000034
17           Fuel_Type_Diesel    0.000021
8            Location_Chennai    0.000010
16              Location_Pune    0.000010
12            Location_Jaipur    0.000008
13             Location_Kochi    0.000005
46              Brand_RENAULT    0.000000
45              Brand_PORSCHE    0.000000
44               Brand_NISSAN    0.000000
43           Brand_MITSUBISHI    0.000000
48                Brand_SMART    0.000000
47                Brand_SKODA    0.000000
23           Owner_Type_Third    0.000000
25              Brand_BENTLEY    0.000000
37          Brand_LAMBORGHINI    0.000000
36                 Brand_JEEP    0.000000
35               Brand_JAGUAR    0.000000
34                Brand_ISUZU    0.000000
18              Fuel_Type_LPG    0.000000
30                Brand_FORCE    0.000000
29                 Brand_FIAT    0.000000
28               Brand_DATSUN    0.000000
21  Owner_Type_Fourth & Above    0.000000
52                Brand_VOLVO    0.000000
Power_Log (69.8%) - By far the most influential factor in predicting car prices. This aligns well with our expectations: higher-power engines typically correlate with higher car prices.
Car_Age (16.3%) - The second most important feature. Newer cars tend to retain more value, while older ones depreciate.
Engine_Log (3.6%) and Kilometers_Driven_Log (3.1%) - These features still hold some significance, likely because engine size impacts performance and perceived value, while mileage affects depreciation.
Brand_LAND ROVER (1.6%) - This makes sense, as luxury brands like Land Rover command higher resale values.
Mileage, Location_Hyderabad, Transmission_Manual - These features contribute slightly to price prediction but are significantly less impactful than power and age.
Several categorical variables have little to no impact - Some brands and fuel types have near-zero importance, meaning they don't significantly influence the decision tree's splits.
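One caveat: impurity-based importances from trees can be biased toward continuous and high-cardinality features (note S.No., a plain row index, receiving nonzero weight above). Permutation importance on held-out data is a common cross-check; a sketch on synthetic data standing in for the car features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 features, only 2 actually drive the target.
X, y = make_regression(n_samples=500, n_features=6, n_informative=2,
                       random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Permutation importance shuffles one feature at a time on held-out data,
# so it measures what the model actually relies on to generalize.
result = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=42)
print(result.importances_mean.round(3))
```

Run against the real test split, this would also flag S.No. as pure noise and justify dropping it.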
4) Random Forest
# Initializing and training the random forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Predictions
y_train_pred_rf = rf_model.predict(X_train)
y_test_pred_rf = rf_model.predict(X_test)
# Performance Evaluation
rf_train_r2 = r2_score(y_train, y_train_pred_rf)
rf_test_r2 = r2_score(y_test, y_test_pred_rf)
rf_train_rmse = mean_squared_error(y_train, y_train_pred_rf) ** 0.5
rf_test_rmse = mean_squared_error(y_test, y_test_pred_rf) ** 0.5
rf_train_mae = mean_absolute_error(y_train, y_train_pred_rf)
rf_test_mae = mean_absolute_error(y_test, y_test_pred_rf)
rf_train_mape = np.mean(np.abs((y_train - y_train_pred_rf) / y_train)) * 100
rf_test_mape = np.mean(np.abs((y_test - y_test_pred_rf) / y_test)) * 100
# Displaying results
print(f"Random Forest Model Performance:")
print(f"R² Score (Train): {rf_train_r2:.4f}")
print(f"R² Score (Test) : {rf_test_r2:.4f}")
print(f"RMSE (Train) : {rf_train_rmse:.4f}")
print(f"RMSE (Test) : {rf_test_rmse:.4f}")
print(f"MAE (Train) : {rf_train_mae:.4f}")
print(f"MAE (Test) : {rf_test_mae:.4f}")
print(f"MAPE (Train) : {rf_train_mape:.2f}%")
print(f"MAPE (Test) : {rf_test_mape:.2f}%")
Random Forest Model Performance:
R² Score (Train): 0.9851
R² Score (Test) : 0.8839
RMSE (Train) : 1.3759
RMSE (Test) : 3.6912
MAE (Train) : 0.5495
MAE (Test) : 1.4871
MAPE (Train) : 5.95%
MAPE (Test) : 20.21%
- High R² Score on Test Data (0.8839)
This means 88.39% of the variance in car prices is explained by the model, which is a huge improvement over the previous models.
Compared to Decision Tree (81.95% after tuning), this is a major step up.
- RMSE & MAE Scores Are Lower
Test RMSE = 3.6912, lower than Decision Tree (4.6028), meaning better predictions.
Test MAE = 1.4871, the average absolute error is just 1.49 lakh, which is pretty good for car price prediction.
- MAPE (Mean Absolute Percentage Error)
- Test MAPE = 20.21%, meaning on average, predictions are within +/- 20.21% of the actual price.
- A significant improvement over the Decision Tree (24.66%).
- Train vs Test Performance
R² Score (Train) = 0.9851 is an almost perfect fit, which is slightly concerning.
R² Score (Test) = 0.8839 shows good generalization, but some overfitting is present.
MAPE (Train) = 5.95% vs Test = 20.21% confirms the overfitting, which we'll try to control with hyperparameter tuning.
Hyperparameter Tuning: Random Forest¶
# Defining the parameter grid
rf_param_grid = {
"n_estimators": [100, 200], # Reducing tree count options
"max_depth": [10, 15], # Keeping it within a reasonable range
"min_samples_split": [10, 20], # Higher values prevent overfitting
"min_samples_leaf": [5, 10] # Prevents too small leaves
}
# Initializing random forest regressor
rf_model = RandomForestRegressor(random_state=42)
# Using RandomizedSearchCV for faster tuning
rf_random_search = RandomizedSearchCV(
estimator=rf_model,
param_distributions=rf_param_grid,
n_iter=10, # Runs only 10 combinations instead of all
cv=3, # Reduced cross-validation folds
scoring="r2",
verbose=2,
n_jobs=-1 # Uses all available CPU cores
)
# Fitting the model
rf_random_search.fit(X_train, y_train)
# Getting the best parameters
best_rf_params = rf_random_search.best_params_
print("Best Parameters:", best_rf_params)
# Training Random Forest with the best parameters
best_rf = RandomForestRegressor(**best_rf_params, random_state=42)
best_rf.fit(X_train, y_train)
# Making predictions
y_train_pred_best_rf = best_rf.predict(X_train)
y_test_pred_best_rf = best_rf.predict(X_test)
# Evaluating the tuned Random Forest model
best_rf_train_r2 = r2_score(y_train, y_train_pred_best_rf)
best_rf_test_r2 = r2_score(y_test, y_test_pred_best_rf)
best_rf_train_rmse = mean_squared_error(y_train, y_train_pred_best_rf) ** 0.5
best_rf_test_rmse = mean_squared_error(y_test, y_test_pred_best_rf) ** 0.5
best_rf_train_mae = mean_absolute_error(y_train, y_train_pred_best_rf)
best_rf_test_mae = mean_absolute_error(y_test, y_test_pred_best_rf)
best_rf_train_mape = np.mean(np.abs((y_train - y_train_pred_best_rf) / y_train)) * 100
best_rf_test_mape = np.mean(np.abs((y_test - y_test_pred_best_rf) / y_test)) * 100
# Displaying results
print("\n**Tuned Random Forest Model Performance:**")
print(f"R² Score (Train): {best_rf_train_r2:.4f}")
print(f"R² Score (Test) : {best_rf_test_r2:.4f}")
print(f"RMSE (Train) : {best_rf_train_rmse:.4f}")
print(f"RMSE (Test) : {best_rf_test_rmse:.4f}")
print(f"MAE (Train) : {best_rf_train_mae:.4f}")
print(f"MAE (Test) : {best_rf_test_mae:.4f}")
print(f"MAPE (Train) : {best_rf_train_mape:.2f}%")
print(f"MAPE (Test) : {best_rf_test_mape:.2f}%")
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best Parameters: {'n_estimators': 200, 'min_samples_split': 10, 'min_samples_leaf': 5, 'max_depth': 15}
**Tuned Random Forest Model Performance:**
R² Score (Train): 0.9369
R² Score (Test) : 0.8571
RMSE (Train) : 2.8318
RMSE (Test) : 4.0956
MAE (Train) : 1.1745
MAE (Test) : 1.6715
MAPE (Train) : 12.83%
MAPE (Test) : 22.21%
- Better Generalization:
Test R² Score decreased slightly to 0.8571 (was 0.8839 before tuning).
Train R² Score dropped to 0.9369 (was 0.9851 before tuning), narrowing the train-test gap.
This means less overfitting—the model is more balanced and should perform more consistently on new data.
- Error Metrics (RMSE, MAE, MAPE):
RMSE (Test) rose to 4.0956 (was 3.6912 before tuning).
MAE (Test) is now 1.6715, meaning predictions are off by about 1.67 lakh on average.
MAPE (Test) rose to 22.21% (was 20.21%): still strong, but we sacrificed a little point accuracy to reduce overfitting.
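To make the trade-off easier to see, the test-set figures reported above for the base and tuned models can be laid side by side (numbers copied from the two output cells):

```python
import pandas as pd

# Test-set metrics reported earlier for the base and tuned Random Forest
comparison = pd.DataFrame(
    {
        "Base RF":  [0.8839, 3.6912, 1.4871, 20.21],
        "Tuned RF": [0.8571, 4.0956, 1.6715, 22.21],
    },
    index=["R² (Test)", "RMSE (Test)", "MAE (Test)", "MAPE (Test) %"],
)
print(comparison)
```

Every test error is slightly worse for the tuned model, but the earlier train-side numbers show the train-test gap shrank considerably, which is the point of the tuning.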
Feature Importance¶
# Extracting feature importances from the tuned Random forest model
rf_feature_importances = best_rf.feature_importances_
# Creating a DataFrame for better visualization
rf_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': rf_feature_importances})
# Sorting by importance in descending order
rf_importance_df = rf_importance_df.sort_values(by='Importance', ascending=False)
# Displaying the top 15 features
plt.figure(figsize=(12, 6))
plt.barh(rf_importance_df['Feature'][:15], rf_importance_df['Importance'][:15], color='lightgreen')
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Top 15 Feature Importances - Random Forest")
plt.gca().invert_yaxis() # Invert y-axis for better readability
plt.show()
# Displaying the DataFrame
print(rf_importance_df)
                      Feature    Importance
6                   Power_Log  7.141844e-01
3                     Car_Age  1.634773e-01
4       Kilometers_Driven_Log  3.420207e-02
5                  Engine_Log  2.767288e-02
1                     Mileage  1.625237e-02
0                       S.No.  8.523266e-03
20        Transmission_Manual  6.785359e-03
2                       Seats  4.636363e-03
41        Brand_MERCEDES-BENZ  4.201282e-03
38           Brand_LAND ROVER  3.752296e-03
42                 Brand_MINI  2.693046e-03
24                 Brand_AUDI  1.726941e-03
9         Location_Coimbatore  1.332947e-03
26                  Brand_BMW  1.289749e-03
11         Location_Hyderabad  1.231823e-03
50               Brand_TOYOTA  1.189698e-03
32                Brand_HONDA  1.009778e-03
17           Fuel_Type_Diesel  7.312384e-04
15            Location_Mumbai  6.892281e-04
19           Fuel_Type_Petrol  6.132026e-04
36                 Brand_JEEP  4.755899e-04
22          Owner_Type_Second  4.492650e-04
14           Location_Kolkata  4.293241e-04
7          Location_Bangalore  4.158671e-04
39             Brand_MAHINDRA  3.284861e-04
10             Location_Delhi  3.168373e-04
35               Brand_JAGUAR  2.922101e-04
33              Brand_HYUNDAI  2.051417e-04
40               Brand_MARUTI  1.729231e-04
13             Location_Kochi  1.652453e-04
47                Brand_SKODA  1.293168e-04
49                 Brand_TATA  1.259220e-04
31                 Brand_FORD  7.929768e-05
16              Location_Pune  5.555644e-05
27            Brand_CHEVROLET  4.188421e-05
8            Location_Chennai  3.876059e-05
51           Brand_VOLKSWAGEN  2.992249e-05
46              Brand_RENAULT  2.395319e-05
12            Location_Jaipur  1.434947e-05
23           Owner_Type_Third  8.711058e-06
43           Brand_MITSUBISHI  4.234619e-06
44               Brand_NISSAN  1.936499e-06
18              Fuel_Type_LPG  1.418812e-08
48                Brand_SMART  0.000000e+00
45              Brand_PORSCHE  0.000000e+00
28               Brand_DATSUN  0.000000e+00
29                 Brand_FIAT  0.000000e+00
21  Owner_Type_Fourth & Above  0.000000e+00
37          Brand_LAMBORGHINI  0.000000e+00
34                Brand_ISUZU  0.000000e+00
25              Brand_BENTLEY  0.000000e+00
30                Brand_FORCE  0.000000e+00
52                Brand_VOLVO  0.000000e+00
Power_Log is by far the most critical factor in predicting car prices.
Car_Age follows, which makes total sense, as older cars tend to be less expensive.
Kilometers_Driven_Log is also key, indicating that the more a car has been driven, the lower its price tends to be.
Engine_Log and Mileage contribute but with smaller effects.
Some locations and brands hold value, with premium brands, like Mercedes-Benz, Land Rover, Audi, BMW, appearing in the ranking.
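Impurity-based importances like the ones above can be biased toward high-cardinality features, so a useful sanity check is permutation importance. The sketch below demonstrates the technique on a small synthetic regression problem standing in for the car-price data (the dataset and feature counts are illustrative, not our actual `X_train`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Tiny synthetic problem: 4 features, only 2 of which carry signal
X, y = make_regression(n_samples=200, n_features=4, n_informative=2, random_state=42)
rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Permutation importance: how much the score drops when each feature is shuffled
result = permutation_importance(rf, X, y, n_repeats=5, random_state=42)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f}")
```

Running the same check against `best_rf` and our real `X_test` would confirm (or challenge) the Power_Log/Car_Age ranking reported above.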
Conclusions and Recommendations¶
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?
Linear models (Linear, Ridge, Lasso) performed consistently but weakly. They struggled to capture complex relationships, yielding high errors and low R² scores.
Decision Tree (Base Model) overfitted massively (R² = 0.9717 on Train, 0.7441 on Test). After hyperparameter tuning, the overfitting reduced, and performance improved to R² = 0.8195.
Random Forest outperformed all models, delivering the best test accuracy with R² = 0.8839 and the lowest RMSE (3.6912) & MAPE (20.21%). The tuned Random Forest gave up a little test accuracy (R² dropped from 0.8839 to 0.8571) in exchange for a much smaller train-test gap, i.e. better generalization.
There is scope for further improvements by:
Using more hyperparameter tuning (grid search with more granular settings)
Using ensemble techniques like stacking or boosting (XGBoost, LightGBM)
More feature engineering (possibly polynomial features)
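As a concrete illustration of the boosting route mentioned above, here is a minimal sketch using scikit-learn's GradientBoostingRegressor (a stand-in for XGBoost/LightGBM, which follow the same fit/predict pattern) on a synthetic dataset; the hyperparameters shown are illustrative defaults, not tuned values:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car-price features/target
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Boosting fits trees sequentially, each correcting the previous ensemble's errors
gbr = GradientBoostingRegressor(
    n_estimators=200, max_depth=3, learning_rate=0.1, random_state=42
).fit(X_tr, y_tr)

print(f"Test R²: {r2_score(y_te, gbr.predict(X_te)):.4f}")
```

On our real data, this model would slot directly into the same evaluation code used for the Decision Tree and Random Forest above.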
2. Refined insights:
- What are the most meaningful insights relevant to the problem?
Most Important Features Driving Car Prices:
- Power (Horsepower) is the single most influential factor.
- Car Age plays a significant role in depreciation.
- Kilometers Driven & Engine Size also impact price significantly.
- Brand & Location impact price but are secondary factors.
Trends & Market Insights:
Luxury brands (BMW, Mercedes, Land Rover, Audi) retain value better than economy brands.
Automatic transmission vehicles have a slight price premium over manual ones.
Cars in locations like Bangalore, Delhi, and Mumbai tend to have higher prices.
Depreciation is significant after 5-10 years of usage.
Final Recommendations for a Buyer/Seller in the Used Car Market:
If buying a used car: Focus on Power, Age, and Brand to find value-for-money deals.
If selling a used car: Maintain your car in good condition with low mileage, and consider selling before it reaches 10+ years old for maximum resale value.
For dealerships/platforms: Pricing algorithms should heavily weight power and age while considering location trends.
3. Proposal for the final solution design:
- What model do you propose to be adopted? Why is this the best solution to adopt?
Best Model to Adopt: Tuned Random Forest
Best Generalization: While base Random Forest had the best test R² (0.8839), the tuned version still maintained a strong R² (0.8571) while reducing overfitting.
Reliable Error Metrics: Although its test RMSE and MAPE are slightly higher than the untuned model's, the smaller train-test gap makes its reported errors a more trustworthy estimate of real-world performance.
Handles Non-Linear Relationships: Unlike linear models, Random Forest captures complex interactions between variables.
Feature Importance Explainability: We can rank the most important features, making it a transparent, interpretable model.
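If the tuned Random Forest is adopted, it should be serialized once and reused by the pricing service rather than retrained per request. A minimal sketch with the standard-library pickle module, using a small synthetic stand-in for `best_rf` (the filename `tuned_rf.pkl` is illustrative):

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in model; in the notebook this would be best_rf from the tuning step
X, y = make_regression(n_samples=100, n_features=5, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Serialize to disk so a pricing service can load it without retraining
with open("tuned_rf.pkl", "wb") as f:
    pickle.dump(model, f)

with open("tuned_rf.pkl", "rb") as f:
    restored = pickle.load(f)

# The restored model reproduces the original predictions exactly
print((restored.predict(X[:5]) == model.predict(X[:5])).all())
```

In production, joblib.dump/load is the scikit-learn-recommended alternative for large NumPy-backed models, and the pickle must be loaded with the same scikit-learn version it was saved with.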