What drives property prices - and can we prove it?

Real estate pricing is driven by multiple factors simultaneously. The challenge isn't just building a model - it's building one you can trust. A model that overfits to 414 data points is useless. So is one that can't tell you which features matter and by how much.

This project focused on statistical rigour: proper diagnostics, regularisation to handle multicollinearity, and resampling methods to quantify uncertainty around every coefficient estimate.

Dataset: 414 real estate transactions from Sindian District, Taipei (UCI repository). Features: house age, distance to nearest MRT station, number of nearby convenience stores, latitude, longitude, transaction date. Target: price per unit area (NT$/Ping).

From EDA to validated predictions

Exploratory Data Analysis

Distribution plots, correlation heatmap, scatter matrix. Identified that distance-to-MRT is severely right-skewed - most properties cluster close to a station, with a long tail of distant ones. House age shows non-linear price effects.
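The correlation step can be sketched with synthetic stand-in data (the real column names in the UCI file differ, and the generating coefficients below are illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 414
# Synthetic stand-ins shaped like the real features
df = pd.DataFrame({
    "house_age": rng.uniform(0, 40, n),
    "dist_to_MRT": rng.lognormal(6, 1, n),   # right-skewed, like the real column
    "n_stores": rng.integers(0, 10, n).astype(float),
})
# Toy price with the qualitative structure described above
df["price"] = (45 - 0.27 * df["house_age"]
               - 3 * np.log(df["dist_to_MRT"])
               + 1.1 * df["n_stores"]
               + rng.normal(0, 5, n))

corr = df.corr()  # this matrix feeds the heatmap, e.g. sns.heatmap(corr, annot=True)
print(corr["price"].sort_values())
```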

Log Transform on MRT Distance

Applying log(distance_to_MRT) stabilised variance and made the relationship with price linear - a classic fix that materially improved regression diagnostics and model fit.
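A minimal before/after check of the transform, run on synthetic lognormal distances (the real distances come from the UCI file):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# MRT-distance-like draw: severely right-skewed
dist = rng.lognormal(mean=6.0, sigma=1.2, size=414)

# Distances are strictly positive, so a plain log is safe (no need for log1p)
log_dist = np.log(dist)

print(f"skew before: {skew(dist):.2f}, after: {skew(log_dist):.2f}")
```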

OLS Regression + Diagnostics

Built multiple OLS models, tested assumptions via Breusch-Pagan (heteroscedasticity), Shapiro-Wilk (normality of residuals), and VIF scores (multicollinearity). Iterated until diagnostics passed.

LASSO Regularisation

Applied LASSO (L1) to simultaneously regularise and perform feature selection. Used LassoCV to find optimal α via cross-validation. Geographic coordinates were correctly shrunk to near-zero.
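A sketch of that selection behaviour on synthetic data: latitude is constructed to be nearly collinear with log MRT distance, so LASSO should concentrate the weight on the distance term (feature names and generating values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 414
age = rng.uniform(0, 40, n)
log_dist = rng.normal(6, 1, n)
stores = rng.integers(0, 10, n).astype(float)
lat = 24.96 + 0.01 * log_dist + rng.normal(0, 0.002, n)  # nearly collinear with log_dist

names = ["age", "log_dist", "stores", "lat"]
X = np.column_stack([age, log_dist, stores, lat])
y = 45 - 0.27 * age - 3 * log_dist + 1.1 * stores + rng.normal(0, 5, n)

X_std = StandardScaler().fit_transform(X)             # L1 penalties need comparable scales
lasso = LassoCV(cv=10, random_state=0).fit(X_std, y)  # alpha chosen by cross-validation
coefs = dict(zip(names, lasso.coef_))
print(f"alpha={lasso.alpha_:.4f}", coefs)
```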

10-Fold Cross-Validation

Evaluated generalisation with 10-fold CV, reporting a CV MSE of ~79 and R² of ~0.58. This is the honest performance estimate, not the in-sample R², which always looks better.
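The evaluation pattern, sketched with scikit-learn on synthetic data (the ~79 MSE and ~0.58 R² figures come from the real dataset, not this toy):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
n = 414
X = rng.normal(size=(n, 3))                      # stand-in features
y = X @ np.array([-3.0, 1.1, -0.27]) + rng.normal(0, 5, n)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
mse = -cross_val_score(LinearRegression(), X, y,
                       scoring="neg_mean_squared_error", cv=cv)
r2 = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=cv)
print(f"CV MSE {mse.mean():.1f}  CV R2 {r2.mean():.2f}")
```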

Bootstrap Confidence Intervals

1,000 bootstrap resamples to estimate 95% confidence intervals for each coefficient. MRT distance coefficient: −0.31, 95% CI [−0.38, −0.24]. The interval excludes zero, so the negative effect holds across resamples.

# Bootstrap confidence intervals for regression coefficients
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # seeded for reproducibility
n_bootstrap = 1000
boot_coefs = []
for _ in range(n_bootstrap):
    # Resample rows with replacement and refit on each pseudo-sample
    idx = rng.choice(len(X), size=len(X), replace=True)
    X_b, y_b = X[idx], y[idx]
    model = LinearRegression().fit(X_b, y_b)
    boot_coefs.append(model.coef_)

boot_coefs = np.array(boot_coefs)
ci_lower = np.percentile(boot_coefs, 2.5, axis=0)
ci_upper = np.percentile(boot_coefs, 97.5, axis=0)
# Result: log(MRT_distance) CI = [-0.38, -0.24]  -  robust negative effect

What drives Taipei house prices

Strongest driver
MRT distance (negative)

Coefficient: −0.31 (log scale). Every doubling of MRT distance reduces price per unit area by ~18%. The effect is non-linear - being near an MRT station matters far more than being slightly further away.

Positive driver
Nearby convenience stores

Each additional nearby store adds approximately 1.1 NT$/Ping. Proxy for urban density and walkability - areas with many stores command a consistent premium.

Negative driver
House age

Older buildings are discounted - approximately −0.27 NT$/Ping per year of age. Effect is more pronounced for properties over 20 years old.

Model performance
CV R² ~0.58

Explains 58% of price variance on held-out data. Solid for a linear model on real estate data - the remaining variance reflects factors not in the dataset (interior quality, floor level, negotiation).

Technologies Used

Tool - Purpose
Statsmodels - OLS regression, statistical inference, diagnostic tests (BP, SW, VIF)
scikit-learn - LASSO (LassoCV), cross-validation, Logistic Regression for classification
Pandas & NumPy - Data manipulation, bootstrap loop, feature engineering
Matplotlib & Seaborn - Distribution plots, residual plots, coefficient CI visualisations
Dataset - UCI Real Estate Valuation (414 Taipei transactions)

Results

  • R² ~0.58 - generalisation estimated via 10-fold cross-validation
  • 1,000 bootstrap samples for coefficient uncertainty
  • MRT distance confirmed as the strongest negative price predictor
  • MRT distance identified as the dominant pricing factor - consistent across OLS, LASSO, and classification models
  • Log transformation of MRT distance was key - materially improved model diagnostics and fit
  • Bootstrap CIs confirmed coefficient stability - findings are not artefacts of a particular data split
  • LASSO correctly zeroed geographic coordinates - they're collinear with MRT distance at this scale

What I took away

  • Cross-validation R² is the honest number - in-sample R² on 414 points can be inflated by a few outliers. Always report held-out performance.
  • Diagnostic testing isn't optional - heteroscedasticity in the residuals invalidates standard errors and confidence intervals, making the model's uncertainty estimates unreliable.
  • Bootstrap CIs are more interpretable than p-values for communicating "how sure are we about this coefficient" to a non-statistical audience.
  • Log transformations on skewed predictors often matter more than model choice - get the functional form right first, then regularise.