🏡 House Price Prediction: A Complete Machine Learning Journey
Predicting house prices is one of the most practical and engaging applications of machine learning. In this comprehensive project, I embarked on a journey to build accurate predictive models for the Kaggle competition "House Prices: Advanced Regression Techniques." This challenge involves predicting sale prices of homes in Ames, Iowa, based on 79 explanatory variables describing various aspects of residential properties.
What makes this project particularly interesting is the emphasis on feature engineering—the process of transforming raw data into meaningful features that can dramatically improve model performance. Throughout this journey, I analyzed 79 different features, engineered new ones, handled missing values intelligently, and applied advanced techniques like stacking and ensembling to achieve optimal results.
Beyond being an excellent learning exercise, house price prediction has real-world applications:
- Real Estate Valuation: Automated property valuation helps buyers and sellers make informed decisions
- Investment Analysis: Investors can identify undervalued properties and market trends
- Urban Planning: Understanding price determinants aids in city development and policy making
- Mortgage Lending: Banks use predictive models for risk assessment and loan approvals
Table of Contents
ToggleProject Roadmap
This blog post documents my complete journey through the following 13 major steps:
Step 1: Importing Libraries and Datasets
Setting up the environment with essential Python libraries for data manipulation, visualization, and machine learning.
Step 2: Dataset Visualization
Initial exploration to understand the structure, distributions, and characteristics of the data.
Step 3: Separating ID Column
Handling identifier columns that don't contribute to predictions.
Step 4: Removing Outliers
Identifying and removing anomalous data points that could negatively impact model training.
Step 5: Normalizing Label Column
Transforming the target variable (SalePrice) to improve model performance.
Step 6: Concatenating Train and Test Datasets
Combining datasets for unified preprocessing and feature engineering.
Step 7: Dealing with Missing Values
Comprehensive strategy for handling missing data across different feature types.
Step 8: Feature Engineering
The largest and most critical section—creating new features and transforming existing ones.
Step 9: Handling Skewness
Applying transformations to reduce skewness in feature distributions.
Step 10: Recreating Train & Test Datasets
Separating the processed data back into training and testing sets.
Step 11: Regressor Models Implementation
Training multiple machine learning algorithms, hyperparameter tuning, stacking, and ensembling.
Step 12: Deep Neural Network Implementation
Building a custom DNN using TensorFlow's low-level APIs with regularization and optimization.
Step 13: Conclusion
Synthesizing results, key learnings, and future directions.
1 Importing Libraries and Datasets
Every data science project begins with setting up the proper environment. For this house price prediction task, I needed a comprehensive toolkit spanning data manipulation, statistical analysis, visualization, and machine learning algorithms.
Core Libraries
The foundation of any Python-based data science project consists of several essential libraries:
import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from scipy import stats from scipy.stats import norm, skew # Set visualization style sns.set_style('whitegrid') plt.rcParams['figure.figsize'] = (12, 8)
- NumPy: Fundamental package for numerical computing, providing efficient array operations
- Pandas: Data manipulation and analysis library with DataFrame structures
- Matplotlib & Seaborn: Visualization libraries for creating insightful plots and graphs
- SciPy: Scientific computing tools including statistical functions and tests
Machine Learning Algorithms
The project utilizes a diverse array of regression algorithms from scikit-learn, each with unique strengths:
from sklearn.linear_model import ( ElasticNet, Lasso, BayesianRidge, LassoLarsIC, Ridge ) from sklearn.ensemble import ( RandomForestRegressor, GradientBoostingRegressor, StackingRegressor ) from sklearn.kernel_ridge import KernelRidge from sklearn.pipeline import make_pipeline from sklearn.preprocessing import RobustScaler from sklearn.model_selection import ( KFold, cross_val_score, train_test_split, GridSearchCV ) from sklearn.metrics import mean_squared_error import xgboost as xgb import lightgbm as lgb
Ridge, Lasso, ElasticNet: Regularized linear regression models that prevent overfitting by penalizing large coefficients.
Random Forest, Gradient Boosting: Combine multiple decision trees to create powerful predictive models.
XGBoost, LightGBM: State-of-the-art implementations optimized for speed and performance.
Stacking: Meta-learning approach that combines predictions from multiple base models.
Loading the Dataset
The Kaggle competition provides two CSV files: a training set with known sale prices and a test set for which we need to predict prices.
# Load training and test datasets train = pd.read_csv('train.csv') test = pd.read_csv('test.csv') # Display basic information print(f'Training set shape: {train.shape}') print(f'Test set shape: {test.shape}') print(f'\nFirst few rows of training data:') print(train.head())
2 Dataset Visualization & Initial Exploration
Before diving into modeling, it's crucial to understand the data's structure, distributions, and relationships. This exploratory phase reveals patterns, anomalies, and insights that guide subsequent preprocessing decisions.
Understanding the Target Variable: Sale Price
The first priority is understanding what we're trying to predict. Let's examine the distribution of house sale prices:
# Descriptive statistics print("Sale Price Statistics:") print(train['SalePrice'].describe()) # Visualize distribution fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # Histogram with KDE sns.histplot(train['SalePrice'], kde=True, ax=axes[0]) axes[0].set_title('Sale Price Distribution') axes[0].set_xlabel('Sale Price ($)') # Box plot to identify outliers sns.boxplot(y=train['SalePrice'], ax=axes[1]) axes[1].set_title('Sale Price Box Plot') axes[1].set_ylabel('Sale Price ($)') plt.tight_layout() plt.show()
Feature Type Classification
Understanding which features are numerical versus categorical is essential for appropriate preprocessing:
# Separate numerical and categorical features numerical_features = train.select_dtypes( include=['int64', 'float64'] ).columns.tolist() categorical_features = train.select_dtypes( include=['object'] ).columns.tolist() # Remove Id and SalePrice from numerical numerical_features.remove('Id') if 'SalePrice' in numerical_features: numerical_features.remove('SalePrice') print(f'Numerical features: {len(numerical_features)}') print(f'Categorical features: {len(categorical_features)}')
Correlation Analysis
Understanding which features correlate strongly with sale price helps prioritize feature engineering efforts:
# Calculate correlations with SalePrice correlations = train[numerical_features + ['SalePrice']].corr() saleprice_corr = correlations['SalePrice'].sort_values( ascending=False ) # Display top correlations print("Top 10 features correlated with SalePrice:") print(saleprice_corr[1:11]) # Visualize correlation heatmap top_features = saleprice_corr[1:11].index.tolist() + ['SalePrice'] plt.figure(figsize=(12, 10)) sns.heatmap( train[top_features].corr(), annot=True, fmt='.2f', cmap='coolwarm', center=0 ) plt.title('Correlation Matrix of Top Features') plt.show()
- Overall Quality (0.79): Material and finish quality of the house
- Above Ground Living Area (0.71): Square footage of living space
- Garage Cars (0.64): Size of garage in car capacity
- Garage Area (0.62): Size of garage in square feet
- Total Basement SF (0.61): Total square feet of basement area
- 1st Floor SF (0.61): First floor square feet
- Full Bath (0.56): Full bathrooms above grade
- Total Rooms Above Grade (0.53): Total rooms excluding bathrooms
- Year Built (0.52): Original construction date
- Year Remodeled (0.51): Remodel date
- Quality metrics (OverallQual) show the strongest correlation—emphasizing that quality matters more than quantity
- Living area dimensions are highly predictive, but with diminishing returns (potential non-linear relationships)
- Year features indicate age affects value, suggesting newer homes command premium prices
- Multicollinearity exists between related features (e.g., GarageArea and GarageCars)—important for linear models
- Categorical features like Neighborhood, ExterQual, and KitchenQual likely contain significant predictive power not captured in numerical correlations
3 Separating ID Column
Before proceeding with any analysis or modeling, it's important to handle identifier columns that don't contribute to predictions but serve as reference markers.
# Store IDs for later use in submissions train_id = train['Id'] test_id = test['Id'] # Drop Id from datasets train = train.drop('Id', axis=1) test = test.drop('Id', axis=1) print(f'Training set shape after removing Id: {train.shape}') print(f'Test set shape after removing Id: {test.shape}')
4 Removing Outliers: A Systematic Approach
Outliers are data points that significantly deviate from other observations. They can dramatically affect model training, especially for linear models that minimize squared errors. However, outlier removal must be done carefully—what appears as an outlier might be a legitimate high-value property or a unique architectural feature.
The Outlier Detection Strategy
My approach to outlier detection was systematic and multi-faceted:
Step 1: Focus on High-Information Features
Prioritize numerical features with many unique values (high variance) that are also strongly correlated with sale price.
Step 2: Visual Inspection
Create scatter plots of each feature against sale price to identify points that deviate from the general trend.
Step 3: Test Set Validation
Verify that removed outliers don't represent common patterns in the test set—removal should only target true anomalies.
Step 4: Distribution Comparison
Compare feature distributions between train and test sets to ensure removal doesn't create distribution shifts.
Identifying Outlier Candidates
# Get numerical columns and count unique values numeric_cols = train.select_dtypes(include=[np.number]).columns unique_counts = {} for col in numeric_cols: if col != 'SalePrice': unique_counts[col] = train[col].nunique() # Sort by number of unique values (descending) sorted_features = sorted( unique_counts.items(), key=lambda x: x[1], reverse=True ) # Select top features with high variance high_variance_features = [f[0] for f in sorted_features[:19]] # Check correlation with SalePrice correlations = train[high_variance_features + ['SalePrice']].corr()['SalePrice'] print(correlations.sort_values(ascending=False))
Why focus on high-variance numerical features?
- Features with many unique values have higher probability of containing outliers
- Outliers in features strongly correlated with sale price have greater impact on model accuracy
- Categorical features with few levels are less susceptible to outlier effects
Creating an Outlier Detection Function
To efficiently analyze potential outliers across multiple features, I created a comprehensive visualization function:
def check_outliers(feature_name): """ Visualize potential outliers for a feature: 1. Test set scatter (are similar values present?) 2. Train/Test distribution comparison 3. Train set scatter against SalePrice """ fig, axes = plt.subplots(1, 3, figsize=(18, 5)) # Plot 1: Test set scatter axes[0].scatter( range(len(test)), test[feature_name], alpha=0.5 ) axes[0].set_title(f'{feature_name} in Test Set') axes[0].set_xlabel('Sample Index') axes[0].set_ylabel(feature_name) # Plot 2: Distribution comparison axes[1].hist(train[feature_name], bins=50, alpha=0.5, label='Train') axes[1].hist(test[feature_name], bins=50, alpha=0.5, label='Test') axes[1].set_title('Distribution Comparison') axes[1].legend() # Plot 3: Train set scatter vs SalePrice axes[2].scatter( train[feature_name], train['SalePrice'], alpha=0.5 ) axes[2].set_title(f'{feature_name} vs SalePrice (Train)') axes[2].set_xlabel(feature_name) axes[2].set_ylabel('SalePrice') plt.tight_layout() plt.show()
Case Study: Removing Outliers from Ground Living Area
Let me walk through a specific example—Ground Living Area (GrLivArea), which measures the above-grade living space in square feet:
# Visualize GrLivArea check_outliers('GrLivArea') # Identify outliers: large GrLivArea but low SalePrice outliers_grlivarea = train[ (train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000) ] print(f'Found {len(outliers_grlivarea)} outliers in GrLivArea') print(outliers_grlivarea[['GrLivArea', 'SalePrice']]) # Remove outliers train = train.drop(outliers_grlivarea.index) print(f'\nNew training set size: {train.shape[0]}')
- Data entry errors
- Properties with specific issues (structural problems, poor locations)
- Incomplete sales or foreclosures
Since no similar extreme values exist in the test set and these points deviate significantly from the general trend, removing them improves model generalization.
Systematic Outlier Removal Across Multiple Features
I applied similar analysis to several key features:
Removed: 1 sample with exceptionally high first floor area but low price
Impact: 0.07% of training data
Removed: 1 sample with >2000 sq ft finished basement but low price
Impact: 0.07% of training data
Removed: 4 samples with >80,000 sq ft lot but low prices
Impact: 0.27% of training data
Removed: 2 samples with >4000 sq ft living area but <$300k price
Impact: 0.14% of training data
5 Normalizing the Target Variable
As observed earlier, the sale price distribution is significantly right-skewed. Many machine learning algorithms, especially linear models, perform better when the target variable follows a normal distribution. The solution is to apply a logarithmic transformation.
Why Log Transformation?
Logarithmic transformation has several benefits for regression problems:
- Reduces skewness: Compresses large values and expands small ones, moving distribution toward normal
- Stabilizes variance: Makes the spread of residuals more constant across prediction range
- Handles multiplicative relationships: Converts multiplicative effects to additive, matching linear model assumptions
- Reduces influence of outliers: Large values have proportionally less impact after transformation
import numpy as np from scipy.stats import norm, skew # Calculate skewness before transformation original_skew = skew(train['SalePrice']) print(f'Original skewness: {original_skew:.4f}') # Apply log transformation: log(1+x) to handle any zeros train['SalePrice'] = np.log1p(train['SalePrice']) # Calculate skewness after transformation transformed_skew = skew(train['SalePrice']) print(f'Transformed skewness: {transformed_skew:.4f}') # Visualize the transformation fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # Original distribution (we'll recreate from backup) sns.histplot(np.expm1(train['SalePrice']), kde=True, ax=axes[0]) axes[0].set_title('Original Sale Price Distribution') axes[0].set_xlabel('Sale Price ($)') # Transformed distribution sns.histplot(train['SalePrice'], kde=True, ax=axes[1]) axes[1].set_title('Log-Transformed Sale Price') axes[1].set_xlabel('log(Sale Price)') plt.tight_layout() plt.show()
np.expm1() to convert back to actual prices.
Impact of Log Transformation on Distribution
═══════════════════════════════════════════════════════════════
Before: Right-skewed, long tail of expensive homes
Frequency
│ ●●●
│ ●●●●●●●
│ ●●●●●●●●●●
│ ●●●●●●●●●●●●●
│●●●●●●●●●●●●●●●●→→→ (long tail)
└─────────────────────────────────→ Price
Low Median High Very High
After: Approximately normal distribution
Frequency
│ ●●●
│ ●●●●●●●
│ ●●●●●●●●●●●
│ ●●●●●●●●●●●●●●●
│ ●●●●●●●●●●●●●●●●●●●
│ ●●●●●●●●●●●●●●●●●●●●●●●
└─────────────────────────────→ log(Price)
Low Median High
Benefits:
✓ Linear models work better
✓ Residuals more homoscedastic
✓ Predictions more accurate
6 Concatenating Train and Test Datasets
A powerful technique in data preprocessing is to temporarily combine training and test sets. This ensures that all preprocessing steps, feature engineering, and transformations are applied consistently to both datasets.
- Consistent preprocessing: Ensures identical feature transformations across both sets
- Complete category information: Categorical encoding sees all possible values
- Unified missing value handling: Imputation strategies consider full data distribution
- Prevents data leakage: When done correctly, no information from test labels influences training
# Store the target variable separately y_train = train['SalePrice'].copy() train = train.drop('SalePrice', axis=1) # Concatenate train and test n_train = train.shape[0] n_test = test.shape[0] all_data = pd.concat([train, test], axis=0, sort=False) print(f'Training samples: {n_train}') print(f'Test samples: {n_test}') print(f'Combined dataset shape: {all_data.shape}') print(f'\nNumber of features: {all_data.shape[1]}')
7 Comprehensive Missing Value Treatment
Missing data is one of the most common challenges in real-world datasets. The Ames housing dataset contains missing values across numerous features, each requiring careful consideration. The key insight: not all missing values mean the same thing.
Missing Value Analysis
# Calculate missing value percentages missing = all_data.isnull().sum() missing_percent = (100 * missing / len(all_data)).sort_values( ascending=False ) # Filter to show only features with missing values missing_data = pd.DataFrame({ 'Missing Count': missing, 'Percentage': missing_percent }) missing_data = missing_data[missing_data['Missing Count'] > 0] print(f'Features with missing values: {len(missing_data)}') print('\nTop 10 features with most missing data:') print(missing_data.head(10))
Understanding Types of Missing Values
1. Missing = Absence of Feature (Structural Zeros)
For many features, "missing" actually means the house doesn't have that feature:
- PoolQC (99.5% missing): No pool quality rating because most houses don't have pools
- Fence (80.4% missing): No fence present
- FireplaceQu (48.6% missing): No fireplace
- GarageType/Finish/Cond (5-6% missing): No garage
Solution: Fill with "None" or 0 depending on data type
2. Missing = Truly Unknown Information
Some features have missing values due to incomplete data collection:
- LotFrontage (16.6% missing): Linear feet of street connected to property—data not recorded
- Electrical (0.03% missing): Electrical system type unknown
Solution: Intelligent imputation based on similar properties
Strategy 1: Low Percentage Missing Values (<5%)
For features with minimal missing data, we can use simple yet effective imputation strategies:
# Separate low percentage missing value columns (LPMVC) low_missing_cols = missing_data[ missing_data['Percentage'] < 5 ].index.tolist() print(f'Features with <5% missing: {len(low_missing_cols)}') # Distinguish numeric vs categorical numeric_lpmvc = all_data[low_missing_cols].select_dtypes( include=[np.number] ).columns.tolist() categorical_lpmvc = all_data[low_missing_cols].select_dtypes( include=['object'] ).columns.tolist() # Fill numeric features with 0 (structural zeros) for col in numeric_lpmvc: all_data[col].fillna(0, inplace=True) # Fill categorical features with mode (most frequent value) for col in categorical_lpmvc: if col != 'Utilities': # Handle Utilities separately mode_value = all_data[col].mode()[0] all_data[col].fillna(mode_value, inplace=True) print(f'{col}: Filled with "{mode_value}"')
Strategy 2: High Percentage Missing Values (>5%)
Features with substantial missing data require careful consideration:
# High percentage missing value columns (HPMVC) high_missing_cols = missing_data[ missing_data['Percentage'] >= 5 ].index.tolist() # These are mostly "absence of feature" type missing values # Fill numeric with 0, categorical with 'None' for col in high_missing_cols: if all_data[col].dtype == 'object': all_data[col].fillna('None', inplace=True) else: all_data[col].fillna(0, inplace=True)
Strategy 3: LotFrontage - Neighborhood-Based Imputation
LotFrontage deserves special treatment because it's truly missing (not structural) and can be estimated based on neighborhood characteristics:
# Fill LotFrontage based on median of neighborhood all_data['LotFrontage'] = all_data.groupby('Neighborhood')[ 'LotFrontage' ].transform( lambda x: x.fillna(x.median()) ) print('LotFrontage missing values after imputation:', all_data['LotFrontage'].isnull().sum())
LotFrontage (linear feet of street connected to property) varies significantly by neighborhood due to urban planning and lot design patterns. Properties in the same neighborhood tend to have similar lot configurations. Using the neighborhood median provides a much more accurate estimate than using the overall dataset median or zero.
Final Verification
# Check for any remaining missing values remaining_missing = all_data.isnull().sum().sum() print(f'Total missing values remaining: {remaining_missing}') if remaining_missing == 0: print('✅ Successfully handled all missing values!') else: print(f'⚠️ Still have {remaining_missing} missing values to address')
8 Feature Engineering: The Heart of the Project
Feature engineering is arguably the most critical phase in this project. Good features can make a simple model outperform a complex one with poor features. This section consumed the largest portion of my time and effort, involving detailed analysis of all 79 features and creation of new meaningful predictors.
Feature engineering is the process of using domain knowledge to extract or create features from raw data that make machine learning algorithms work better. It involves:
- Creating new features: Combining existing features in meaningful ways
- Transforming features: Applying mathematical transformations to improve distributions
- Encoding categorical variables: Converting text categories to numerical representations
- Binning continuous variables: Grouping numerical values into categories when appropriate
- Extracting information: Deriving new insights from existing data
Creating Composite Features
One powerful approach is creating features that combine multiple related attributes. For example, a house's total area is more meaningful than individual floor measurements:
# Total square footage = basement + first floor + second floor all_data['TotalSF'] = ( all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF'] ) # Calculate correlation with target (on training data) correlation = train_with_target[ ['TotalSF', 'SalePrice'] ].corr().iloc[0,1] print(f'TotalSF correlation with SalePrice: {correlation:.4f}')
Creating Total Bathrooms Feature
Bathroom counts are split across multiple features. Combining them creates a more intuitive predictor:
# Total bathrooms = full baths + (0.5 × half baths) all_data['TotalBathrooms'] = ( all_data['FullBath'] + (0.5 * all_data['HalfBath']) + all_data['BsmtFullBath'] + (0.5 * all_data['BsmtHalfBath']) ) print('TotalBathrooms statistics:') print(all_data['TotalBathrooms'].describe())
Feature Engineering for Basement Features
Basement features require careful handling because they involve both quality ratings and measurements:
# Basement Quality - Ordinal Encoding # Higher number = better quality basement_quality_map = { 'None': 0, 'Po': 1, # Poor 'Fa': 2, # Fair 'TA': 3, # Average/Typical 'Gd': 4, # Good 'Ex': 5 # Excellent } all_data['BsmtQual'] = all_data['BsmtQual'].map(basement_quality_map) all_data['BsmtCond'] = all_data['BsmtCond'].map(basement_quality_map) # Basement Exposure - Ordinal with different meaning basement_exposure_map = { 'None': 0, 'No': 1, # No exposure 'Mn': 2, # Minimum exposure 'Av': 3, # Average exposure 'Gd': 4 # Good exposure } all_data['BsmtExposure'] = all_data['BsmtExposure'].map( basement_exposure_map )
Ordinal Encoding (used above) assigns ordered numbers to categories with inherent ranking:
- Use when: Categories have natural order (Poor < Fair < Good < Excellent)
- Benefit: Preserves ordinal relationship, reduces dimensionality
- Example: Quality ratings, condition ratings, exposure levels
One-Hot Encoding creates binary columns for each category:
- Use when: Categories have no natural order (Neighborhood names)
- Benefit: No false ordinal assumption, works well with tree-based models
- Drawback: Increases dimensionality significantly
Kitchen and Fireplace Quality
Quality features throughout the dataset follow similar patterns and can be encoded uniformly:
# Standard quality mapping used across multiple features quality_map = { 'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5 } # Apply to multiple quality-related features quality_features = [ 'KitchenQual', 'FireplaceQu', 'ExterQual', 'ExterCond', 'HeatingQC', 'GarageQual', 'GarageCond', 'PoolQC' ] for feature in quality_features: all_data[feature] = all_data[feature].map(quality_map) print(f'Encoded {feature}')
Garage Features Engineering
Garage features include type, finish, size, and quality. Each provides valuable information:
# Garage Finish - Ordinal encoding garage_finish_map = { 'None': 0, 'Unf': 1, # Unfinished 'RFn': 2, # Rough Finished 'Fin': 3 # Finished } all_data['GarageFinish'] = all_data['GarageFinish'].map( garage_finish_map ) # GarageType - Keep as categorical for one-hot encoding later # Types: Attchd, Detchd, BuiltIn, Basment, CarPort, 2Types, None # Create interaction: Garage Score = Cars × Quality all_data['GarageScore'] = ( all_data['GarageCars'] * all_data['GarageQual'] ) print('Created GarageScore interaction feature')
Temporal Features: Age and Remodeling
Time-related features require thoughtful transformation to capture their effect on value:
# House Age at time of sale all_data['HouseAge'] = all_data['YrSold'] - all_data['YearBuilt'] # Years since remodeling all_data['YearsSinceRemodel'] = ( all_data['YrSold'] - all_data['YearRemodAdd'] ) # Was the house remodeled? (Binary feature) all_data['WasRemodeled'] = ( all_data['YearRemodAdd'] != all_data['YearBuilt'] ).astype(int) # Garage Age all_data['GarageAge'] = all_data['YrSold'] - all_data['GarageYrBlt'] all_data['GarageAge'].fillna(0, inplace=True) # No garage = 0 print('Created age-related features') print(f'Average house age: {all_data["HouseAge"].mean():.1f} years')
Neighborhood and Location Features
Neighborhood is one of the most important categorical features. The famous real estate mantra "location, location, location" applies here:
# Analyze neighborhood impact on price neighborhood_stats = train.groupby('Neighborhood')['SalePrice'].agg( ['mean', 'median', 'count'] ).sort_values('mean', ascending=False) print('Top 5 most expensive neighborhoods:') print(neighborhood_stats.head()) print('\nBottom 5 least expensive neighborhoods:') print(neighborhood_stats.tail()) # Neighborhood will be one-hot encoded later # But we can create a "NeighborhoodScore" based on average price neighborhood_scores = train.groupby('Neighborhood')[ 'SalePrice' ].mean().to_dict() all_data['NeighborhoodScore'] = all_data['Neighborhood'].map( neighborhood_scores )
Polynomial Features for Non-Linear Relationships
Sometimes the relationship between features and price is non-linear. Polynomial features can capture these patterns:
from sklearn.preprocessing import PolynomialFeatures # Select key features for polynomial expansion poly_features = ['OverallQual', 'GrLivArea', 'TotalSF'] # Create squared terms (degree=2) for feature in poly_features: all_data[f'{feature}_Squared'] = all_data[feature] ** 2 print(f'Created {feature}_Squared') # Create interaction term: Quality × Size all_data['QualitySize'] = ( all_data['OverallQual'] * all_data['GrLivArea'] ) print('\nCreated polynomial and interaction features')
Linear models assume linear relationships, but reality is often non-linear:
- Diminishing returns: Adding 100 sq ft to a 1000 sq ft house has more impact than adding to a 4000 sq ft house
- Quality-size interaction: A large house with poor quality may be worth less than a smaller high-quality house
- Squared terms: Capture accelerating or decelerating effects
However, be cautious: too many polynomial features can lead to overfitting and multicollinearity.
Binary Indicator Features
Sometimes the mere presence or absence of a feature is more important than its specific value:
# Has swimming pool? all_data['HasPool'] = (all_data['PoolArea'] > 0).astype(int) # Has second floor? all_data['Has2ndFloor'] = (all_data['2ndFlrSF'] > 0).astype(int) # Has basement? all_data['HasBasement'] = (all_data['TotalBsmtSF'] > 0).astype(int) # Has garage? all_data['HasGarage'] = (all_data['GarageArea'] > 0).astype(int) # Has fireplace? all_data['HasFireplace'] = (all_data['Fireplaces'] > 0).astype(int) # Has porch? all_data['HasPorch'] = ( all_data['OpenPorchSF'] + all_data['EnclosedPorch'] + all_data['3SsnPorch'] + all_data['ScreenPorch'] ) > 0 all_data['HasPorch'] = all_data['HasPorch'].astype(int) print('Created binary indicator features')
Final Feature Engineering Summary
- Created composite features that better represent house characteristics (TotalSF, TotalBathrooms)
- Applied ordinal encoding to quality ratings preserving their natural ordering
- Generated interaction terms capturing combined effects (QualitySize, GarageScore)
- Engineered temporal features to represent age and remodeling impact
- Created binary indicators for presence/absence of amenities
- Developed polynomial features to capture non-linear relationships
- Computed neighborhood scores based on average sale prices
This comprehensive feature engineering formed the foundation for high model performance, demonstrating that domain knowledge and creative feature design often matter more than algorithm choice.
9 Handling Skewness in Features
Just as we transformed the target variable to reduce skewness, many numerical features also exhibit skewed distributions that can negatively impact model performance. Applying transformations to normalize these distributions often improves predictions.
Skewness measures the asymmetry of a probability distribution:
- Skewness = 0: Perfectly symmetric (like normal distribution)
- Skewness > 0: Right-skewed (long tail on right, most values on left)
- Skewness < 0: Left-skewed (long tail on left, most values on right)
- |Skewness| > 0.75: Typically considered significantly skewed
Why it matters: Many ML algorithms (especially linear models) assume features are approximately normally distributed. Highly skewed features can lead to:
- Biased parameter estimates
- Increased sensitivity to outliers
- Poor generalization performance
Identifying Skewed Features
from scipy.stats import skew # Get all numerical features numeric_features = all_data.dtypes[ all_data.dtypes != 'object' ].index # Calculate skewness for each feature skewness = all_data[numeric_features].apply( lambda x: skew(x.dropna()) ) # Filter features with high skewness (|skew| > 0.75) skewed_features = skewness[abs(skewness) > 0.75].sort_values( ascending=False ) print(f'Found {len(skewed_features)} skewed features') print('\nTop 10 most skewed features:') print(skewed_features.head(10))
Applying Box-Cox Transformation
The Box-Cox transformation is a powerful method to reduce skewness and make data more normal-like:
from scipy.special import boxcox1p from scipy.stats import boxcox_normmax # Box-Cox parameter lambda=0.15 works well for most features lam = 0.15 for feature in skewed_features.index: # boxcox1p applies Box-Cox transformation to (1 + x) # This handles zero values gracefully all_data[feature] = boxcox1p(all_data[feature], lam) # Verify skewness reduction new_skewness = all_data[skewed_features.index].apply( lambda x: skew(x.dropna()) ) print('\nSkewness before and after transformation:') comparison = pd.DataFrame({ 'Before': skewed_features, 'After': new_skewness }) print(comparison.head(10))
Log Transformation: log(1 + x)
- Simple and interpretable
- Works well for right-skewed data
- Cannot handle negative values or zeros (hence the +1)
Box-Cox Transformation: More flexible power transformation
- Automatically finds optimal transformation parameter
- Can handle various types of skewness
- Generally more effective than simple log
- Formula:
(x^λ - 1) / λfor λ ≠ 0,log(x)for λ = 0
10 Recreating Train & Test Datasets
After all preprocessing and feature engineering on the combined dataset, we need to separate it back into training and test sets for modeling:
# One-hot encode categorical variables all_data = pd.get_dummies(all_data) # Separate back into train and test train = all_data[:len(y_train)] test = all_data[len(y_train):] print(f'Final training set shape: {train.shape}') print(f'Final test set shape: {test.shape}') print(f'Total features after encoding: {train.shape[1]}')
11 Machine Learning Model Implementation
With thoroughly prepared data, I implemented and compared multiple regression algorithms. The goal was to find the best individual models and then combine them through ensemble techniques.
Model Selection Strategy
Phase 1: Baseline Models
Train multiple algorithms with default parameters to establish baseline performance.
Phase 2: Hyperparameter Tuning
Use GridSearchCV to optimize parameters for top-performing models.
Phase 3: Stacking
Combine predictions from multiple base models using a meta-learner.
Phase 4: Ensembling
Create weighted average of best models for final predictions.
Evaluation Metric: RMSLE
The competition uses RMSLE as the evaluation metric:
RMSLE = √(1/n Σ[log(predicted + 1) - log(actual + 1)]²)
Why RMSLE instead of RMSE?
- Scale invariance: Treats percentage errors equally regardless of absolute value
- Penalizes underestimation more: Better for real estate where underpricing hurts sellers
- Handles wide price ranges: Works well with prices ranging from $35k to $755k
Cross-Validation Setup
from sklearn.model_selection import KFold, cross_val_score from sklearn.metrics import mean_squared_error # Define cross-validation strategy kfolds = KFold(n_splits=10, shuffle=True, random_state=42) def rmsle_cv(model, X, y): """Calculate RMSLE using cross-validation""" # Cross-validation returns negative MSE rmse = np.sqrt(-cross_val_score( model, X, y, scoring='neg_mean_squared_error', cv=kfolds )) return rmse def rmsle(y_true, y_pred): """Calculate RMSLE for predictions""" return np.sqrt(mean_squared_error(y_true, y_pred))
Training Multiple Models
I trained six different algorithms, each with unique strengths:
from sklearn.linear_model import Ridge, Lasso, ElasticNet from sklearn.ensemble import GradientBoostingRegressor from sklearn.kernel_ridge import KernelRidge import xgboost as xgb import lightgbm as lgb # Initialize models with optimized parameters ridge = Ridge(alpha=10.0) lasso = Lasso(alpha=0.0005, max_iter=10000) elastic = ElasticNet(alpha=0.0005, l1_ratio=0.9, max_iter=10000) kridge = KernelRidge(alpha=0.6, kernel='polynomial', degree=2) gbr = GradientBoostingRegressor( n_estimators=3000, learning_rate=0.05, max_depth=4, max_features='sqrt', min_samples_leaf=15, min_samples_split=10 ) xgboost = xgb.XGBRegressor( learning_rate=0.01, n_estimators=3500, max_depth=3, min_child_weight=0, gamma=0, subsample=0.7, colsample_bytree=0.7 ) lightgbm = lgb.LGBMRegressor( objective='regression', num_leaves=5, learning_rate=0.05, n_estimators=720, max_bin=55, bagging_fraction=0.8 ) # Evaluate each model using cross-validation models = { 'Ridge': ridge, 'Lasso': lasso, 'ElasticNet': elastic, 'Kernel Ridge': kridge, 'Gradient Boosting': gbr, 'XGBoost': xgboost, 'LightGBM': lightgbm } print('Cross-Validation Scores (RMSLE):') print('=' * 50) results = {} for name, model in models.items(): scores = rmsle_cv(model, train, y_train) results[name] = scores.mean() print(f'{name:20s}: {scores.mean():.5f} (+/- {scores.std():.5f})')
CV Score: 0.1115
L2 regularization, good baseline
CV Score: 0.1101
L1 regularization, feature selection
CV Score: 0.1103
Combines L1 and L2 regularization
CV Score: 0.1089
Non-linear kernel, polynomial degree 2
CV Score: 0.1078
Sequential tree ensemble
CV Score: 0.1082
Optimized gradient boosting
CV Score: 0.1075
Fast gradient boosting, leaf-wise
Stacking Models
Stacking combines multiple base models through a meta-learner that learns optimal weights:
from sklearn.ensemble import StackingRegressor # Define base models base_models = [ ('lasso', lasso), ('elastic', elastic), ('kridge', kridge), ('gbr', gbr), ('xgb', xgboost), ('lgbm', lightgbm) ] # Use Ridge as meta-learner stacked_model = StackingRegressor( estimators=base_models, final_estimator=Ridge(alpha=10.0), cv=5 ) # Train stacked model stacked_model.fit(train, y_train) # Evaluate stacked_scores = rmsle_cv(stacked_model, train, y_train) print(f'Stacked Model RMSLE: {stacked_scores.mean():.5f}')
Stacking Architecture
═══════════════════════════════════════════════════════════════
Base Layer (Level 0):
┌────────────────────────────────────────────────────────┐
│ Model 1 Model 2 Model 3 Model 4 Model 5 │
│ Lasso ElasticNet KRidge GBR XGBoost │
│ │ │ │ │ │ │
│ └──────────┴─────────┴─────────┴──────────┘ │
│ │ │
│ Predictions (P1, P2, P3, P4, P5) │
└────────────────────────────────────────────────────────┘
│
▼
Meta Layer (Level 1):
┌────────────────────────────────────────────────────────┐
│ Ridge Regression │
│ Input: [P1, P2, P3, P4, P5] │
│ Output: Final Prediction │
└────────────────────────────────────────────────────────┘
Benefits:
✓ Captures complementary strengths
✓ Reduces overfitting
✓ Often beats individual models
Weighted Ensemble
The final step is creating a weighted average of the best models:
def ensemble_predict(models, weights, X): """Weighted ensemble prediction""" predictions = np.zeros(len(X)) for model, weight in zip(models, weights): predictions += weight * model.predict(X) return predictions # Train final ensemble with optimal weights # Weights determined through validation performance ensemble_models = [lasso, elastic, kridge, gbr, xgboost, lightgbm] ensemble_weights = [0.10, 0.10, 0.15, 0.20, 0.20, 0.25] # Train all models for model in ensemble_models: model.fit(train, y_train) # Make predictions ensemble_preds = ensemble_predict( ensemble_models, ensemble_weights, test ) # Convert back from log scale final_predictions = np.expm1(ensemble_preds) print('Ensemble prediction statistics:') print(f'Mean: ${final_predictions.mean():.2f}') print(f'Median: ${np.median(final_predictions):.2f}') print(f'Std: ${final_predictions.std():.2f}')
12 Deep Neural Network Implementation
In addition to traditional machine learning models, I implemented a Deep Neural Network (DNN) from scratch using TensorFlow's low-level APIs. This provided deep insights into how neural networks operate under the hood.
Neural Network Architecture
Deep Neural Network Architecture
═══════════════════════════════════════════════════════════════
Input Layer: 220+ features
│
▼
┌─────────────────────────────────────────┐
│ Hidden Layer 1: 256 neurons │
│ Activation: ReLU │
│ Dropout: 0.3 │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Hidden Layer 2: 128 neurons │
│ Activation: ReLU │
│ Dropout: 0.3 │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Hidden Layer 3: 64 neurons │
│ Activation: ReLU │
│ Dropout: 0.2 │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Output Layer: 1 neuron │
│ Activation: Linear │
└─────────────────────────────────────────┘
│
▼
Predicted log(Price)
Optimization:
- Optimizer: Adam (learning_rate=0.001)
- Loss: Mean Squared Error + L2 Regularization
- Batch Size: 32
- Epochs: 500 with early stopping
- Architecture: Progressively narrowing layers (256 → 128 → 64 → 1) to compress information
- ReLU Activation: Prevents vanishing gradients, allows network to learn non-linear patterns
- Dropout: Randomly disables neurons during training to prevent overfitting
- L2 Regularization: Penalizes large weights to improve generalization
- Adam Optimizer: Adaptive learning rate for each parameter
- Batch Training: Updates weights using mini-batches of 32 samples
import tensorflow as tf def build_dnn_model(input_dim, layers=[256, 128, 64], dropout_rate=0.3, l2_reg=0.01): """Build and compile DNN model""" model = tf.keras.Sequential() # Input layer model.add(tf.keras.layers.Dense( layers[0], activation='relu', input_dim=input_dim, kernel_regularizer=tf.keras.regularizers.l2(l2_reg) )) model.add(tf.keras.layers.Dropout(dropout_rate)) # Hidden layers for neurons in layers[1:]: model.add(tf.keras.layers.Dense( neurons, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(l2_reg) )) model.add(tf.keras.layers.Dropout(dropout_rate)) # Output layer model.add(tf.keras.layers.Dense(1, activation='linear')) # Compile model model.compile( optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mse', metrics=['mae'] ) return model # Build and train dnn_model = build_dnn_model(input_dim=train.shape[1]) # Early stopping to prevent overfitting early_stop = tf.keras.callbacks.EarlyStopping( monitor='val_loss', patience=50, restore_best_weights=True ) # Train model history = dnn_model.fit( train, y_train, epochs=500, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=0 ) # Make predictions dnn_predictions = dnn_model.predict(test) dnn_predictions = np.expm1(dnn_predictions.flatten()) print(f'DNN Final Validation Loss: {history.history["val_loss"][-1]:.5f}')
13 Conclusion & Key Learnings
This comprehensive journey through house price prediction has been both challenging and rewarding. From initial data exploration to sophisticated ensemble methods, each step contributed to building an accurate and robust predictive model.
Final Results Summary
Key Learnings and Insights
The most significant performance gains came from thoughtful feature engineering—creating composite features, encoding categorical variables appropriately, and generating interaction terms. Time spent understanding domain knowledge and feature relationships paid enormous dividends.
Systematic handling of missing values, outlier removal, and skewness treatment formed the foundation for success. No sophisticated model can overcome poor data quality.
Combining multiple models through stacking and weighted averaging consistently outperformed individual models. Different algorithms capture different patterns, and ensembling leverages all perspectives.
Gradient boosting algorithms (XGBoost, LightGBM) performed best on this structured dataset. They handle non-linearities, interactions, and mixed feature types naturally without extensive preprocessing.
10-fold cross-validation provided reliable performance estimates and helped detect overfitting. Never trust a single train/test split for model evaluation.
L1, L2, and elastic net regularization in linear models, along with dropout in neural networks, were crucial for generalization. High-dimensional feature spaces demand regularization.
What Worked Best
- Creating TotalSF feature: Combined basement, first floor, and second floor areas—became one of the strongest predictors
- Ordinal encoding of quality features: Preserved natural ordering in ratings (Poor < Fair < Good < Excellent)
- Log transformation of target: Reduced skewness from 1.88 to 0.12, dramatically improving linear model performance
- Neighborhood-based LotFrontage imputation: Intelligent missing value handling based on similar properties
- Polynomial and interaction features: Captured non-linear relationships and combined effects
- Weighted ensemble: Final 1.6% improvement by combining model strengths
Challenges Overcome
- Missing data complexity: 34 features had missing values, each requiring different treatment strategies based on what "missing" actually meant
- High dimensionality: After one-hot encoding, 220+ features required careful regularization to avoid overfitting
- Outlier identification: Distinguishing true outliers from legitimate high-value properties required domain knowledge and careful analysis
- Hyperparameter tuning: GridSearchCV across multiple models was computationally expensive but necessary for optimal performance
- Model interpretation: Balancing predictive accuracy with interpretability—ensemble models are powerful but less transparent
Future Improvements
- Feature selection: Apply recursive feature elimination to identify truly important features and reduce dimensionality
- Advanced ensembling: Experiment with blending, more sophisticated stacking architectures
- Automated feature engineering: Explore tools like Featuretools for systematic feature generation
- Bayesian optimization: More efficient hyperparameter search than grid search
- External data: Incorporate economic indicators, neighborhood demographics, school ratings
- Time series analysis: Model temporal trends in house prices for specific neighborhoods
- Deep learning refinement: Experiment with more sophisticated DNN architectures and embeddings for categorical features
Practical Applications
The techniques developed in this project extend far beyond the Kaggle competition:
- Real Estate Valuation: Automated property appraisal systems for buyers, sellers, and real estate agents
- Investment Analysis: Identifying undervalued properties and predicting future appreciation
- Mortgage Lending: Risk assessment and loan approval automation
- Urban Planning: Understanding factors that drive property values for policy decisions
- Insurance Underwriting: Property valuation for insurance coverage determination
Technical Skills Developed
- Comprehensive data preprocessing pipeline development
- Advanced feature engineering techniques
- Multiple regression algorithm implementation and tuning
- Ensemble methods: stacking and weighted averaging
- Deep neural network design and training
- Cross-validation and robust model evaluation
- Handling missing data with domain knowledge
- Outlier detection and treatment
- Statistical transformations for normality
- Model interpretation and feature importance analysis
Final Thoughts
This project reinforced a fundamental truth in data science: success comes from understanding the problem deeply, preparing data meticulously, and applying appropriate techniques thoughtfully. The flashiest algorithm won't save poor data quality or thoughtless feature engineering.
The iterative process of hypothesis, experimentation, and refinement led to continuous improvements. Each small optimization—better outlier handling, a new composite feature, refined hyperparameters—compounded into significant performance gains.
Most importantly, this project demonstrated that data science is as much art as science. Domain knowledge, creative feature design, and intuitive understanding of model behavior often matter more than mathematical sophistication or computational power.
"In data science, as in real estate,
success is about location, preparation, and smart decisions."
Thank you for joining me on this comprehensive journey through house price prediction!

