What drives property prices - and can we prove it?
Real estate pricing is driven by multiple factors simultaneously. The challenge isn't just building a model - it's building one you can trust. A model that overfits to 414 data points is useless. So is one that can't tell you which features matter and by how much.
This project focused on statistical rigour: proper diagnostics, regularisation to handle multicollinearity, and resampling methods to quantify uncertainty around every coefficient estimate.
Dataset: 414 real estate transactions from Sindian District, Taipei (UCI repository). Features: house age, distance to nearest MRT station, number of nearby convenience stores, latitude, longitude, transaction date. Target: price per unit area (NT$/Ping).
From EDA to validated predictions
Exploratory Data Analysis
Distribution plots, correlation heatmap, scatter matrix. Identified that distance-to-MRT is severely right-skewed - the closest stations dominate. House age shows non-linear price effects.
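The skew check from the EDA step can be sketched as below. The data here is a synthetic stand-in (the real dataset isn't bundled with this write-up), and the column names are assumptions, not the UCI repository's exact headers:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 414-row UCI real-estate DataFrame.
# Column names are hypothetical; lognormal distances mimic the
# right-skew observed in the real MRT-distance feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "dist_to_MRT": rng.lognormal(mean=6, sigma=1.2, size=414),
    "house_age": rng.uniform(0, 40, size=414),
})

# Quantify the skew that the distribution plots made visible
skew = df["dist_to_MRT"].skew()
print(f"dist_to_MRT skewness: {skew:.2f}")  # strongly positive => right-skewed
```

A skewness well above 1 is the numeric confirmation of what the histogram shows: a long right tail of far-from-MRT properties.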
Log Transform on MRT Distance
Applying log(distance_to_MRT) stabilised variance and made the relationship with price linear - a classic fix that materially improved regression diagnostics and model fit.
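A minimal sketch of the transform's effect, again on synthetic lognormal distances standing in for the real feature:

```python
import numpy as np
import pandas as pd

# Hypothetical MRT distances; lognormal noise mimics the real skew
rng = np.random.default_rng(1)
dist = rng.lognormal(mean=6, sigma=1.2, size=414)

raw_skew = pd.Series(dist).skew()
log_skew = pd.Series(np.log(dist)).skew()
print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

The log transform collapses the long right tail, pulling skewness toward zero and giving the regression an approximately linear, homoscedastic predictor.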
OLS Regression + Diagnostics
Built multiple OLS models, tested assumptions via Breusch-Pagan (heteroscedasticity), Shapiro-Wilk (normality of residuals), and VIF scores (multicollinearity). Iterated until diagnostics passed.
LASSO Regularisation
Applied LASSO (L1) to simultaneously regularise and perform feature selection. Used LassoCV to find optimal α via cross-validation. Geographic coordinates were correctly shrunk to near-zero.
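A sketch of the LassoCV step on synthetic data, with a latitude column deliberately made collinear with MRT distance to mimic the situation described above. Variable names and the collinearity structure are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 414
log_dist = np.log(rng.lognormal(6, 1.2, n))
# Latitude constructed to be nearly collinear with MRT distance,
# so it carries almost no independent signal
lat = 24.96 + 0.002 * log_dist + rng.normal(0, 0.001, n)
age = rng.uniform(0, 40, n)
stores = rng.integers(0, 11, n).astype(float)
X = np.column_stack([log_dist, age, stores, lat])
y = 45 - 3 * log_dist - 0.27 * age + 1.1 * stores + rng.normal(0, 3, n)

Xs = StandardScaler().fit_transform(X)   # L1 penalty needs comparable scales
lasso = LassoCV(cv=10, random_state=0).fit(Xs, y)
print("alpha:", round(lasso.alpha_, 4))
print("coefs:", np.round(lasso.coef_, 2))  # latitude shrunk relative to log_dist
```

Standardising before fitting matters: the L1 penalty is applied uniformly across coefficients, so features on larger raw scales would otherwise be penalised inconsistently.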
10-Fold Cross-Validation
Evaluated generalisation with 10-fold CV - reporting CV MSE ~79 and R² ~0.58. This is the honest performance estimate, not the in-sample R² which always looks better.
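The CV evaluation can be sketched as follows; the data is again a synthetic stand-in, so the printed numbers will not match the reported MSE ~79 and R² ~0.58:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the 414-row feature matrix
rng = np.random.default_rng(4)
n = 414
log_dist = np.log(rng.lognormal(6, 1.2, n))
age = rng.uniform(0, 40, n)
stores = rng.integers(0, 11, n).astype(float)
X = np.column_stack([log_dist, age, stores])
y = 45 - 3 * log_dist - 0.27 * age + 1.1 * stores + rng.normal(0, 3, n)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
mse = -cross_val_score(LinearRegression(), X, y, cv=cv,
                       scoring="neg_mean_squared_error")
r2 = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(f"CV MSE: {mse.mean():.1f}  CV R2: {r2.mean():.2f}")
```

Shuffling before splitting matters here: if the transactions are stored in date order, unshuffled folds would conflate fold membership with time.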
Bootstrap Confidence Intervals
1,000 bootstrap samples to estimate 95% confidence intervals for each coefficient. MRT distance coefficient: -0.31 [−0.38, −0.24] - statistically robust across all resamples.
```python
# Bootstrap confidence intervals for regression coefficients
import numpy as np
from sklearn.linear_model import LinearRegression

n_bootstrap = 1000
boot_coefs = []
for _ in range(n_bootstrap):
    # Resample rows with replacement, refit, and record the coefficients
    idx = np.random.choice(len(X), size=len(X), replace=True)
    X_b, y_b = X[idx], y[idx]
    model = LinearRegression().fit(X_b, y_b)
    boot_coefs.append(model.coef_)

boot_coefs = np.array(boot_coefs)
ci_lower = np.percentile(boot_coefs, 2.5, axis=0)
ci_upper = np.percentile(boot_coefs, 97.5, axis=0)
# Result: log(MRT_distance) CI = [-0.38, -0.24] - robust negative effect
```
What drives Taipei house prices
MRT distance (negative)
Coefficient: −0.31 (log scale). Every doubling of MRT distance reduces price per unit area by ~18%. The effect is non-linear - being near an MRT station matters far more than being slightly further away.
Nearby convenience stores
Each additional nearby store adds approximately 1.1 NT$/Ping. Proxy for urban density and walkability - areas with many stores command a consistent premium.
House age
Older buildings are discounted - approximately −0.27 NT$/Ping per year of age. Effect is more pronounced for properties over 20 years old.
CV R² ~0.58
Explains 58% of price variance on held-out data. Solid for a linear model on real estate data - the remaining variance reflects factors not in the dataset (interior quality, floor level, negotiation).
Technologies Used
| Tool | Purpose |
|---|---|
| Statsmodels | OLS regression, statistical inference, diagnostic tests (BP, SW, VIF) |
| scikit-learn | LASSO (LassoCV), cross-validation, Logistic Regression for classification |
| Pandas & NumPy | Data manipulation, bootstrap loop, feature engineering |
| Matplotlib & Seaborn | Distribution plots, residual plots, coefficient CI visualisations |
| Dataset | UCI Real Estate Valuation - 414 Taipei transactions |
Results
- MRT distance identified as the dominant pricing factor - consistent across OLS, LASSO, and classification models
- Log transformation of MRT distance was key - materially improved model diagnostics and fit
- Bootstrap CIs confirmed coefficient stability - findings are not artefacts of a particular data split
- LASSO correctly zeroed geographic coordinates - they're collinear with MRT distance at this scale
What I took away
- Cross-validation R² is the honest number - in-sample R² on 414 points can be inflated by a few outliers. Always report held-out performance.
- Diagnostic testing isn't optional - heteroscedasticity in the residuals invalidates standard errors and confidence intervals, making the model's uncertainty estimates unreliable.
- Bootstrap CIs are more interpretable than p-values for communicating "how sure are we about this coefficient" to a non-statistical audience.
- Log transformations on skewed predictors often matter more than model choice - get the functional form right first, then regularise.